
Article

Enrichment-based proteogenomics identifies microproteins, missing proteins, and novel smORFs in Saccharomyces cerevisiae
Cuitong He, Chenxi Jia, Yao Zhang, and Ping Xu
J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.8b00032 • Publication Date (Web): 13 Jun 2018
Downloaded from http://pubs.acs.org on June 14, 2018


Journal of Proteome Research

Enrichment-based proteogenomics identifies microproteins, missing proteins, and novel smORFs in Saccharomyces cerevisiae

Cuitong He (1,2), Chenxi Jia (2), Yao Zhang (3)*, Ping Xu (1,2,4)*

(1) Anhui Medical University, Hefei 230032, China.

(2) State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Lifeomics, Beijing 102206, China.

(3) State Key Laboratory of Biocontrol and Guangdong Provincial Key Laboratory of Plant Resources, School of Life Sciences, Sun Yat-Sen University, Guangzhou 510275, China.

(4) Key Laboratory of Combinatorial Biosynthesis and Drug Discovery of Ministry of Education, School of Pharmaceutical Sciences, Wuhan University, Wuhan, 430071, P. R. China.

* Corresponding Authors

Ping Xu

Beijing Proteome Research Center, No. 38 Science Park Road, Changping District, Beijing, China.

Tel: 86-10-61777113; Fax: 86-10-61777050; E-mail: xuping@mail.ncpsb.org.

Yao Zhang

Sun Yat-Sen University, No. 135 Xingang West Road, Haizhu District, Guangzhou, China.

Tel and Fax: 86-20-84111727; E-mail: [email protected].


ABSTRACT

Microproteins are peptides of 100 amino acids (AA) or fewer, encoded by small open reading frames (smORFs). Microproteins have been shown to participate in and regulate a wide range of cellular functions. However, their annotation and identification are challenging, in part owing to their low molecular weight, low abundance, and hydrophobicity. These factors have left many smORFs unannotated during genome annotation and have made microproteins difficult to identify at the protein level. Large-scale enrichment of microproteins in a proteogenomic workflow makes it possible to efficiently identify microproteins and discover unannotated smORFs in Saccharomyces cerevisiae. Here, we integrated four microprotein-specific enrichment strategies to enhance coverage. We identified 117 microproteins, verified 31 missing proteins (MPs), and discovered 3 novel smORFs. The 31 MPs were confirmed by spectrum quality checking. The three novel smORFs (YKL104W-A, YHR052C-B, and YHR054C-B) were retained after spectrum quality checking, peptide synthesis, homologue matching, and other checks. This study not only demonstrates that smORF candidates remain to be annotated even in an extensively studied organism, but also presents an efficient strategy for the discovery of small MPs. All MS datasets have been deposited to the ProteomeXchange with identifier PXD008586 (Username: [email protected]; Password: UNEbNk3j).

KEYWORDS: proteogenomics, microproteins, smORFs, missing proteins, enrichment

INTRODUCTION

Microproteins are peptides of ≤100 amino acids (AA) encoded by small open reading frames (smORFs) [1-2]. They play key roles in many biological processes, including tumor progression [3], stress response [4], signal regulation [5], and metabolism [6]. The small proteome remains largely unexplored because current genome annotation methods perform poorly on smORFs, commonly applying a cut-off of 100 AA in all three domains of life [7-9]. Moreover, sample complexity hampers the detection of low-abundance microproteins [10].

Recent advances in computational tools and high-throughput sequencing have allowed the identification of a myriad of smORFs. Many bioinformatics-based prediction strategies and tools have been employed for mining smORFs, including the use of alternative start codons [11], comparison of sequence conservation [12-15], codon usage [15], and functional domain analysis [16]. Many experimental methods have been used to predict or validate smORFs, including genetic research [17], DNA microarray, RNA-seq [18-19], green fluorescent protein [20], and tandem affinity purification (TAP)-based proteomics [21]. Among these methods, identification of smORFs by analyzing mass spectrometry-derived proteomic data can greatly improve smORF confidence and coverage [22]. Proteogenomics has been applied in recent studies for genome reannotation [23] and is widely used across organisms [24-26].

One of the major challenges in the analysis of microproteins is their characteristically low abundance: they are often missed in favor of high-abundance proteins during the sampling and identification of complex samples. Consequently, most annotated microproteins currently lack evidence of expression at the protein level and are referred to as missing proteins (MPs) in the UniProt database [27-28]. Research on MPs was highlighted as one of the important missions of the chromosome-centric Human Proteome Project in 2012 but has not been a focus in other organisms.

To minimize the interference from high-abundance proteins, we used microprotein-specific enrichment strategies to search for missing microproteins and unannotated smORFs in Saccharomyces cerevisiae. Strict spectrum filtering and peptide synthesis were used to increase the reliability of novel peptides. Combining these techniques, we achieved an in-depth analysis of microproteins, which sheds light on proteogenomics-based smORF research in S. cerevisiae.

MATERIALS AND METHODS

Cell culture

The isogenic S. cerevisiae strain JMP024 [29] was cultured at 30°C in YPD medium (1% yeast extract, 2% Bacto-peptone, and 2% dextrose) and harvested at the mid-exponential phase (OD600 = 1.5).

Microprotein enrichment methods

Microproteins of JMP024 were enriched using four complementary enrichment methods:

(1) Urea_Tricine method. A total of 30 OD cells were re-suspended in lysis buffer (8 M urea, 50 mM NH4HCO3, 50 mM iodoacetamide (IAA), and 1 mM phenylmethanesulfonyl fluoride (PMSF)) and lysed by vortex mixing with glass beads (30 s vortexing, 30 s on ice, 10 cycles). Debris was removed by centrifugation (17,000 ×g) at 4°C for 20 min. The protein concentration was determined by a gel-assisted method as described previously [30], and the image was analyzed with Scion Image (4.0.3.2) software (National Institutes of Health, Bethesda, MD, USA). Briefly, the samples were run on SDS-PAGE (0.2 cm) and stained with Coomassie Brilliant Blue G250 to quantify the amount of protein based on the dye absorbance signal. A total of 80 µg of protein was subjected to disulfide reduction with 5 mM dithiothreitol (DTT) (45°C, 30 min) and alkylation with 20 mM IAA (room temperature, 30 min in the dark). The alkylated proteins were
resolved by 12% Tricine-SDS-PAGE (6.5 cm) and stained with Coomassie Brilliant Blue G250. Subsequently, the gel lane below the 30 kDa region was excised into 13 fractions.

(2) Urea_MWCO method. The proteins extracted as described above were passed through a 30 kDa MWCO filter (Millipore, Amicon Ultra, USA), and the flow-through was subjected to disulfide reduction and alkylation as above.

(3) HCl_Tricine method. A total of 60 OD cells were suspended in lysis buffer (50 mM HCl, 0.1% β-mercaptoethanol (β-ME), 0.05% Triton X-100) [10] and lysed by vortex mixing with glass beads. After centrifugation (17,000 ×g, 4°C, 20 min), the supernatant was transferred to a new tube. The debris was extracted three more times with the same buffer to collect the remaining proteins, and the supernatant was collected after centrifugation (17,000 ×g, 4°C, 20 min). Subsequently, the four extracts were pooled, lyophilized, and then re-suspended in 1× SDS gel-loading buffer (50 mM Tris-HCl, pH 6.8, 2% SDS, 10% glycerol, 0.1% Bromophenol Blue). The sample was subjected to disulfide reduction and alkylation as above. Proteins were resolved by 12% Tricine-SDS-PAGE (3 cm) and stained with Coomassie Brilliant Blue G250. The gel lane below the 30 kDa region was then excised into 5 fractions.

(4) HCl_MWCO method. The pooled extracts from the HCl_Tricine method were passed through a 30 kDa MWCO filter, and the flow-through was subjected to disulfide reduction and alkylation as above.

Digestion and sample preparation for LC-MS/MS

Enriched proteins from the Urea_Tricine and HCl_Tricine methods were digested in-gel with trypsin (12.5 ng/µL) at 37°C for 12 h, as described previously [31]. The extracted peptides were dried for MS detection. For the Urea_MWCO and HCl_MWCO samples, filter-aided sample preparation (FASP) with a 3 kDa MWCO filter was performed as described previously [32]. Briefly, for the Urea_MWCO enrichment method, the enriched sample was added to the filter unit and centrifuged at
14,000 ×g for 20 min. Subsequently, the samples were washed three times with 50 mM NH4HCO3 (300 µL). Trypsin was loaded onto the filter at a protease-to-substrate ratio of 1:50, and the mixture was incubated for 12 h at 37°C. Peptides were then desalted using a C18 column as described previously [31] and vacuum dried before LC-MS/MS analysis. The HCl_MWCO sample was added to the filter unit and centrifuged at 14,000 ×g for 20 min. The samples were washed three times with 300 µL of substitution buffer (8 M urea in 0.1 M Tris-HCl, pH 8.5), then three times with 300 µL of 50 mM NH4HCO3, and digested with trypsin as above.

LC-MS/MS analysis

Prior to LC-MS/MS analysis, the digested peptides were re-suspended in loading buffer (1% ACN, 1% formic acid (FA)) and analyzed using an ultraperformance liquid chromatography (UPLC) system (nanoAcquity, Waters, Milford, MA, USA) equipped with a self-packed capillary column (75 µm i.d. × 15 cm, 3 µm C18 reverse-phase fused-silica), with a 60 min nonlinear gradient at a flow rate of 300 nL/min. The elution gradient was as follows: 2-4% B for 6 min, 4-10% B for 2 min, 10-25% B for 27 min, 25-35% B for 20 min, 35-80% B for 5 min (Buffer A, 2% ACN and 0.1% FA in ddH2O; Buffer B, 0.1% FA in ACN). Eluted peptides were ionized under high voltage (1.5 kV) and analyzed using an LTQ-Orbitrap Velos mass spectrometer (Thermo Electron, San Jose, CA, USA). The initial MS spectrum (MS1) was acquired over a mass range of 300-1600 m/z with a resolution of 30,000 at m/z 400. The automatic gain control (AGC) was set to 1 × 10^6, and the maximum injection time (MIT) was 150 ms. The subsequent MS spectra (MS2) were acquired in data-dependent mode, with the 20 most intense ions fragmented in the linear ion trap. For each scan, the threshold for triggering MS2 was set at 2,000 counts, the
isolation width was 2 m/z, the normalized collision energy was 35%, the AGC was 1 × 10^4, and the MIT was 30 ms. Precursor ion charge state screening was enabled, and all unassigned charge states, as well as singly charged species, were rejected. Dynamic exclusion was set at 40 s to avoid repeated detection of the same peaks.

Identification of annotated and unannotated smORFs

To identify annotated smORFs, the raw files were searched using MaxQuant (v1.5.6.0) against the Swiss-Prot reviewed database (2017.11, 6,721 proteins). The search parameters were set as follows: trypsin was set as the protease with two missed cleavages permitted; cysteine carbamidomethylation was set as a fixed modification and methionine oxidation as a variable modification; the precursor mass tolerance was set at 20 ppm and the MS2 fragment tolerance at 0.5 Da; peptide matches were filtered by a minimum length of seven amino acids; and the false discovery rate (FDR), estimated using a target-decoy search strategy, was set at 1% at the peptide and protein levels. Only proteins satisfying the following criteria were listed as confirmed MPs: (1) a unique peptide; (2) peptide length ≥7 AA; (3) a high-quality spectrum with few impurity peaks and three pairs of consecutive matching b/y ions; (4) isobaric sequence filtering, evaluating whether I = L, Q[Deamidated] = E, or GG = N existed.

To search for unannotated smORFs, data files were searched using pFind (v3.1.2) [33-34] against a composite target/decoy S. cerevisiae S288C database [35] containing a six-frame translation database of the S288C genome (https://www.ncbi.nlm.nih.gov/genome/15?genome_assembly_id=22535), an N-terminal peptide database comprising fully tryptic peptides beginning with methionine, and a common contaminants database for proteogenomics. ORFs in the six-frame database were translated from
stop codon to stop codon, requiring at least 6 AA per ORF, using the pAnno software in pFind. A total of 875,609 ORFs were generated. The novel peptides were strictly filtered by spectrum checking and peptide synthesis.

Verification of the identified novel peptides

To evaluate the authenticity of the novel peptides identified in this study, their spectral quality was manually checked by inspecting the intensity of fragment ions and the matching of b/y ions. To select the more credible novel peptides, our criteria included: (1) a unique peptide; (2) peptide length ≥9 AA; (3) a high-quality spectrum with few impurity peaks and three pairs of consecutive matching b/y ions; (4) isobaric sequence filtering, evaluating whether I = L, Q[Deamidated] = E, or GG = N existed. High-quality spectra were selected for peptide synthesis. pFind and pBuild within the pFind software [36] were used for matching spectra between the original peptides and the synthesized peptides. The cosine similarity score for two spectra was calculated as follows: (1) peaks corresponding to precursors were removed from the initial spectrum; (2) the theoretical m/z values of b+, y+, b–H2O+, y–H2O+, b–NH3+, y–NH3+, b++, and y++ ions were calculated, and peaks matching these ions in the experimental spectrum were extracted with their theoretical m/z assigned; (3) the m/z values of the selected peaks were converted to integers, and the intensity at each integer m/z was defined as the sum of the peaks in that window; (4) the dimensions of the synthetic and endogenous spectra were aligned, with zeros filled in for missing values; (5) a cosine similarity score between the two spectra was calculated.

Bioinformatics analysis of identified proteins

The protein property parameters were obtained from the SGD database (https://www.yeastgenome.org/), including the grand average of hydropathicity (GRAVY), instability index, and codon adaptation index, among others. Protein subcellular location information was obtained from the UniProt database (http://www.uniprot.org/). Proteins with positive GRAVY values were considered hydrophobic, and those with negative values hydrophilic [37]. DAVID 6.7 was used for Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis (http://david.abcc.ncifcrf.gov/).
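
The cosine similarity scoring described under "Verification of the identified novel peptides" can be illustrated with a minimal Python sketch. It is not the pFind/pBuild implementation. It covers only steps (3) to (5) and assumes that precursor removal and fragment-ion annotation (steps 1 and 2) have already produced lists of (theoretical m/z, intensity) pairs. Rounding to the nearest integer is an assumption, since the text only says that m/z values were "converted to integers", and the function names and peak values are hypothetical.

```python
import math
from collections import defaultdict


def bin_peaks(peaks):
    """Step 3: sum the intensities of annotated peaks per integer m/z bin.

    `peaks` is an iterable of (theoretical_mz, intensity) pairs, i.e. the
    output of steps 1-2, which are assumed to have been done upstream.
    """
    binned = defaultdict(float)
    for mz, intensity in peaks:
        # Assumption: "converted to integers" means rounding to the nearest integer.
        binned[math.floor(mz + 0.5)] += intensity
    return dict(binned)


def cosine_similarity(endogenous, synthetic):
    """Steps 4-5: align the two binned spectra and return their cosine score."""
    a, b = bin_peaks(endogenous), bin_peaks(synthetic)
    bins = sorted(set(a) | set(b))  # union of bins; a missing bin counts as 0
    va = [a.get(k, 0.0) for k in bins]
    vb = [b.get(k, 0.0) for k in bins]
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(y * y for y in vb))
    return dot / norm if norm else 0.0


# Hypothetical annotated peaks: (theoretical m/z, intensity)
endogenous = [(175.119, 100.0), (288.203, 50.0), (288.400, 10.0), (401.287, 30.0)]
synthetic = [(175.119, 210.0), (288.203, 115.0), (401.287, 70.0)]
score = cosine_similarity(endogenous, synthetic)
```

Because the score depends only on the relative intensity pattern, spectra of the same peptide acquired at different overall intensities would still score close to 1, while spectra sharing no fragment bins score 0.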

RESULTS AND DISCUSSION

Microprotein enrichment-based proteogenomic analysis

It is difficult to predict smORFs in genome annotation and to detect microproteins in a complex sample [10]. Combined with microprotein-specific enrichment, the proteogenomic approach is a powerful tool for validating annotated and unannotated smORFs directly at the protein level. We investigated the microproteins of S. cerevisiae using four different enrichment strategies: Urea_Tricine, HCl_Tricine, Urea_MWCO, and HCl_MWCO. Efficient extraction and separation are two critical steps in microprotein enrichment. Considering the complementary solubilizing properties of chaotropes and detergents, we used two types of extraction buffer. One was the classic urea strategy, based on a neutral chaotropic agent (8 M urea/50 mM NH4HCO3/50 mM IAA) [31]. The other used the detergent Triton X-100 (50 mM HCl/0.1% β-ME/0.05% Triton X-100). This Triton X-100 buffer showed good extraction efficiency in previous microprotein studies [10]. Regarding separation mechanisms, MWCO filtering [38] and Tricine-SDS-PAGE [39] are based on molecular size and charge, respectively. Overall, we combined two extraction and two separation strategies for enriching microproteins,
including Urea_Tricine, HCl_Tricine, Urea_MWCO, and HCl_MWCO. In total, 20 fractions were obtained from the four strategies and digested with trypsin. All raw files were searched against the UniProt protein database of S. cerevisiae S288C to detect annotated microproteins and MPs, and against the six-frame database of S. cerevisiae S288C. The workflow is summarized in Figure S-1.

Four enrichments improve microprotein identification

We identified 1967 proteins, of which 117 were microproteins. These microproteins accounted for about 60% of the total annotated microproteins at the PE1 level (Table S1). The Urea_Tricine method showed the best enrichment efficiency for microproteins, more than twice that of the other three methods (Figure 1A). The cumulative curves suggested that combining the different methods promoted microprotein identification at both the peptide and protein levels (Figure 1B&C). The curve was steeper at the Urea_MWCO step, indicating that the Urea_MWCO method was more complementary to the Urea_Tricine method than the others for microprotein identification. To assess the molecular weight (MW) bias of these methods, we compared their MW distributions. Figure 1D shows that the trend for 81-100 AA microproteins was consistent with that for total microprotein identification; Urea_Tricine identified the largest number of microproteins in all three AA ranges; Urea_MWCO was biased toward 61-80 AA microproteins compared with HCl_Tricine; HCl_Tricine identified almost the same number of 1-60 AA microproteins as Urea_Tricine; and, unlike the other three methods, HCl_MWCO identified the same number of 61-80 AA and 81-100 AA microproteins. The 1-60 AA range contained the fewest identified microproteins for all four strategies and also accounted for the smallest share of the total annotated microproteins. To further understand the characteristics of microproteins, we compared the hydrophobic
properties of microproteins of different lengths, in both the total annotated and the identified microproteins. Hydrophobicity analysis showed that hydrophobic proteins accounted for 43.4% of the total annotated microproteins, whereas 76.1% of the identified microproteins were hydrophilic (Figure 1E). Among the total annotated microproteins, the percentage of hydrophobic microproteins decreased significantly with increasing protein length. Among the identified microproteins, the percentage of hydrophobic microproteins was lowest in the 1-60 AA range. Subcellular localization analysis showed that a larger share of the total annotated microproteins than of the identified microproteins was localized in the membrane, which is consistent with the hydrophobicity analysis above (Figure 1F). The identified microproteins were mainly localized in the membrane, followed by the cytoplasm, the nucleus, and the mitochondria. Because the majority of membrane proteins among the annotated microproteins were not detected, these results imply that high hydrophobicity might be one of the important factors limiting microprotein identification [40].
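
The hydrophobic/hydrophilic split used in this analysis rests on the sign of the GRAVY value. In this study the values were taken from SGD rather than recomputed, so the following Python sketch is only an illustration. It assumes the standard definition of GRAVY as the mean Kyte-Doolittle hydropathy over all residues, and it treats a value of exactly zero as hydrophilic, a case the text does not address. The function names and example sequences are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathicity: mean hydropathy over all residues."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    # A KeyError here signals a non-standard residue code.
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def classify(sequence):
    """Positive GRAVY -> hydrophobic; otherwise hydrophilic (zero is an assumption)."""
    return "hydrophobic" if gravy(sequence) > 0 else "hydrophilic"


def hydrophobic_fraction(sequences):
    """Share of sequences classified as hydrophobic, as a value in [0, 1]."""
    labels = [classify(s) for s in sequences]
    return labels.count("hydrophobic") / len(labels)


# Hypothetical toy sequences, not proteins from this study.
examples = ["MLLIVAFG", "MKDEKRNS", "MAVLKKDE"]
fraction = hydrophobic_fraction(examples)
```

Applied to the annotated and identified microprotein sets, a function like `hydrophobic_fraction` would yield percentages of the kind reported in Figure 1E.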

Microprotein identification might be improved by incorporating membrane-specific extraction into microprotein enrichment.

The codon adaptation index (CAI) measures codon usage bias and serves as an indicator of protein expression level, where a higher value denotes stronger codon usage bias and higher expression [41]. The CAI of annotated microproteins was lower overall than that of total annotated proteins, which is consistent with the low abundance of microproteins. Identified microproteins showed higher CAI values than annotated microproteins, suggesting that the more highly expressed of the annotated microproteins were the ones identified (Figure 1G). In E. coli, codon usage bias is significantly positively correlated with gene length. Eyre-Walker speculated that this positive correlation results from selection to avoid incorporation errors during translation [42].
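
To make the CAI concrete, the sketch below follows the usual formulation: each codon receives a relative adaptiveness weight (its count in a reference set of highly expressed genes divided by the count of the most frequent synonymous codon), and the CAI of a gene is the geometric mean of those weights. The CAI values in this study were obtained from SGD, so this is an illustration only. The reference sequence is a hypothetical toy example, and replacing zero counts with a pseudocount of 0.5 is an assumption made to avoid taking the logarithm of zero.

```python
import math
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def codons(cds):
    """Split a coding sequence into complete codons (trailing bases are dropped)."""
    cds = cds.upper()
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]


def relative_adaptiveness(reference_cds, pseudocount=0.5):
    """Weight of each codon relative to the most used synonymous codon.

    Met, Trp, and stop codons are excluded because they carry no usage bias.
    Assumption: codons unseen in the reference get `pseudocount` instead of 0.
    """
    counts = Counter(c for cds in reference_cds for c in codons(cds))
    weights = {}
    for aa in set(AMINO_ACIDS) - {"*", "M", "W"}:
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        adjusted = {c: counts[c] or pseudocount for c in family}
        best = max(adjusted.values())
        for c in family:
            weights[c] = adjusted[c] / best
    return weights


def cai(cds, weights):
    """Geometric mean of the weights of all scorable codons in `cds`."""
    logs = [math.log(weights[c]) for c in codons(cds) if c in weights]
    if not logs:
        raise ValueError("no scorable codons")
    return math.exp(sum(logs) / len(logs))


# Hypothetical reference set standing in for highly expressed genes.
reference = ["ATGGCTGCTGCTGCCAAGAAGAAATAA"]
weights = relative_adaptiveness(reference)
preferred = cai("ATGGCTAAGTAA", weights)  # uses only the most frequent codons
rare = cai("GCCAAA", weights)             # uses less frequent synonymous codons
```

A gene built entirely from the most frequent synonymous codons reaches the maximum CAI of 1, which is why higher CAI values are read as a sign of higher expression.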

Optimization of microprotein expression conditions may be an efficient strategy for promoting the identification of microproteins. Protein stability was estimated using the instability index calculated by ProtParam (http://www.expasy.ch/tools/protparam-doc.html). An index ≤40 denotes a stable protein [43]. The instability index distribution showed that identified microproteins were slightly more stable than annotated microproteins (Figure 1H). About half of both the identified and the total annotated microproteins were classified as unstable. Instability might therefore be another important factor in microprotein identification. KEGG analysis found that the identified microproteins were mainly clustered in the oxidative phosphorylation, ribosome, spliceosome, and protein export pathways (Figure 1I), which might play important roles in protein translation, post-translational translocation, and energy metabolism.

Microprotein enrichment strategies identified 31 confirmed MPs

The UniProt Consortium lists protein existence (PE) at 5 levels [44], from PE1 (experimental evidence at the protein level) to PE5 (uncertain protein). The MPs identified here were at PE2 (experimental evidence at the transcript level), PE3 (protein inferred from homology), and PE4 (protein predicted). We identified 35 MP candidates with at least one unique peptide. To evaluate the authenticity of the MPs, we manually checked their spectral quality. After strict filtering, 31 candidates (3 PE2, 17 PE3, and 11 PE4) were confirmed as MPs, and 4 candidates (PE3) were filtered out (Figure 2A & Figure S-4). Among the 31 MPs, 12 had two or more unique peptides of 9 or more AA. The majority of the MPs (25) were from the Urea_Tricine method, followed by HCl_Tricine
(10), HCl_MWCO (9), and Urea_MWCO (7). A total of 16 and 4 MPs were uniquely identified in the Urea_Tricine and HCl_MWCO datasets, respectively. Among the 31 MPs, 11 proteins were shared by at least two strategies, supporting their authenticity (Figure 2B). Protein length analysis showed that the largest number of MPs fell in the 0-100 AA group, which might be attributable to the microprotein-specific enrichment strategies used in this study (Figure 2C). The number of MPs with