Probabilistic Generation of Mass Spectrometry Molecular Abundance


Article

Probabilistic generation of mass spectrometry molecular abundance variance for case and control replicates John T Prince, and Rob Smith J. Proteome Res., Just Accepted Manuscript • Publication Date (Web): 30 May 2017 Downloaded from http://pubs.acs.org on May 31, 2017


Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Journal of Proteome Research

Probabilistic generation of mass spectrometry molecular abundance variance for case and control replicates

John T. Prince Independent Scholar Provo, UT 84604

Rob Smith* Department of Computer Science University of Montana Missoula, Montana 59812 Email: [email protected]


I. ABSTRACT

Shotgun differential mass spectrometry, the untargeted discovery of statistically significant differences between two or more samples, is a popular application with potential to advance biomarker detection, disease diagnostics, and other health objectives. Though many methods have been proposed, few have been quantitatively evaluated. The lack of ground truth data for shotgun difference detection limits quantitative evaluation and algorithmic advancement. While public mass spectrometry data sets of single samples abound, data sets with more than one sample are rare, and data sets with the thousands of samples necessary to capture the complexity of real world populations are non-existent due to technological and cost limitations. We present MSabundanceSIM, novel software for simulating any number of molecular samples based on one or a few real world data sets. The software uses a probabilistic model to generate case and control populations, with intuitive user parameters for tuning. We demonstrate variability by comparing to a real world data set over a range of abundances with differing biological and experimental variation coefficients. MSabundanceSIM is implemented in Ruby, is freely available, requires no external dependencies, and is suitable for a range of applications.

II. KEYWORDS

Mass Spectrometry; Proteomics; Mass Spectrometry Data Processing; Shotgun Proteomics; Differential Mass Spectrometry; Shotgun Differential Proteomics; Mass Spectrometry Simulation; Proteomics Simulation


III. INTRODUCTION

Mass spectrometry provides a mechanism to measure the abundances of molecules in liquid, solid, or gas samples using their physicochemical properties. Mass spectrometry data can be used to identify differences between samples, either through direct signal analysis or indirectly through molecular identities and quantities yielded by data processing tools that interpret mass spectrometry data [1]. These differences can be used, among other things, to correlate molecules with certain experimental conditions, making possible a variety of experimental outcomes including biomarker detection, disease diagnostics, and other health objectives.

While many mass spectrometry data processing methods have been proposed for shotgun differential proteomics, few have been quantitatively evaluated [2], due in large part to the lack of ground truth data [3] (all true molecular identities and abundances in a single sample, or the full correspondence between molecules across multiple samples). Without ground truth data, it is not possible to quantitatively measure the accuracy or coverage of current methods. Despite an increasing call for mass spectrometry ground truth data sets [3], only the most rudimentary data sets exist. Chemical ground truth mass spectrometry data is expensive to produce, and the considerably limited coverage and accuracy of data processing methods [4] create a circular dependency: ground truth is required to evaluate and improve data processing algorithms, and improved data processing algorithms are required to produce ground truth.

Recently, several MS data simulators have been proposed [5, 6, 7]. Though not a replacement for ground truth, simulated MS data can be used as a first pass to develop incrementally improved data processing algorithms, providing a proof of concept and surrogate quantitative evaluations for new and existing data processing methods.
Simulated data can also be essential for providing the preliminary results required to obtain or justify the funding needed to generate real shotgun proteomics ground truth, which is a considerable cost. Current shotgun data processing shortfalls are particularly abundant in differential detection, where inaccuracies and low coverage contribute to low reproducibility and dubious value in published biomarkers for certain conditions [8]. Standard statistical approaches, such as power analysis and the t-test, are readily available to measure the significance of the abundance differences in a protein between runs. However, raw mass spectrometry data requires processing in order to generate lists of molecule quantities by type. Without this first step, there are no abundance differences to test statistically. With modern mass spectrometry only capable of identifying 10-15% of the molecules in a set of runs [4, 9], it is evident that increasing coverage of differential testing requires data processing methods that can link corresponding molecule abundances without first identifying them.

Quantitative validation of shotgun proteomics difference detection requires multiple .mzML files for both case and control samples, where the abundances of all peptides are known, all peptides have natural variance, and a subset of all present proteins have significant abundance differences. Such a data set would be prohibitively expensive to generate for preliminary analysis. Simulation provides a realistic alternative. While published MS simulators focus on emulating a mass spectrometer (simulating, for example, digestion, chromatography, and MS1/MS2 raw data output), inputs to these emulations (such as multiple case and control proteomes with abundances) have not yet been simulated. MS output simulation provides surrogate quantitative data for evaluation purposes, as the molecular provenance of every point is known, as well as the identities and quantities of all molecules. MS input simulation would provide a test bed for data processing capabilities on large populations of cases and controls without the time, cost, and logistical constraints of gathering real samples. Moreover, simulation allows the isolation of a single step in the MS data processing pipeline to limit the effect of confounding variables on experimental outcomes [8]. Though MS input simulation is not a replacement for benchwork, it can be used to provide proof of concept to motivate the expenditures necessary for real world demonstration. We present a probabilistic method for generating simulated MS input, to our knowledge the first such method proposed.
Though this approach would be equally suited to metabolites and lipids, this manuscript uses the shotgun proteomics domain for illustration. The proposed method takes as input a list of proteins and abundances (with one or more samples), creating probabilistically generated abundance variance within experimental replicates for both case and control samples (see Figure 1). The method can produce either lists of sample proteins and abundances (for testing difference detection statistical analysis) or .fasta files to be used with MS output simulators (for testing the full difference detection MS data processing pipeline). We explain the generative method, provide rationale for design decisions, explain user parameter options, and demonstrate output characteristics against a real world data set.


TABLE I. SIMULATION PARAMETERS.

Symbol  Name                                 Purpose
β       Biological Variation Coefficient     Controls natural (non-meaningful) abundance variation.
ε       Experimental Variation Coefficient   Tunes experimentally differing case/control molecular abundance variation.
δ       Differential Expression Coefficient  Controls case/control differentially expressed molecule percentage.
φ       Fold Change                          A continuous value determined by β, ε, and δ.
ν       Normalizing Coefficient              Creates higher probability of larger fold change at lower abundances.
δ       Positive Skew Rate                   Modifies the Poisson to have a positive skew.

present in a Poisson distribution, accounting for the additional variability between biological and experimental replicates [14]. In both case and control samples, simulated abundances are overdispersed in two ways. First, the Poisson is modified by applying a sign and a progressive random perturbation factor (inversely proportional to abundance), extending the range of the Poisson to include decimal and negative values while preserving the tendency of mass spectrometry data to have reduced variance at higher abundances. Second, the positive skew of the distribution is modified probabilistically, providing a mechanism to generate a sharper dropoff in probability after zero than a normal Poisson. This degree of positive skew is controlled by the parameter δ. The higher δ is, the more the Poisson will be positively skewed (see Results). The simulator uses two parameters, β and ε, to model different types of variance as λ values for biological variation and experimental variation, respectively (see Table I). Non-differential (natural) variance is sampled using a Poisson distribution parameterized with a biological variation coefficient β (see Equation 1).

$$\phi_\beta \sim \mathrm{Poisson}(\beta) \cdot \nu \tag{1}$$

In order to provide a greater amount of variance to reflect experimental differences between case and control samples, the experimental variation coefficient ε is used to simulate abundances for a user-specified subset of the total molecules in a sample (see Equation 2).

$$\phi_\epsilon \sim \mathrm{Poisson}(\epsilon) \cdot \nu \tag{2}$$


In order to simulate the well-known property that large non-meaningful fold changes are more likely among lower abundance molecules, abundance samples include a normalization factor ν that reduces the probability of high fold changes as abundance increases (see Equation 3).

$$\nu = 1 - \frac{a_i}{a_{\max}} \tag{3}$$
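As a small illustration of Equation 3, the Ruby sketch below computes ν for a handful of abundances and scales a Poisson draw by it, as in Equations 1 and 2. It assumes that a_i is the abundance of molecule i and that a_max is the largest abundance in the sample. The helper names `normalizing_coefficients` and `scaled_fold_change` and the abundance values are hypothetical and are not taken from the MSabundanceSIM codebase.

```ruby
# Hypothetical helper: nu = 1 - a_i / a_max (Equation 3) for every abundance.
# Assumes a_max is the largest abundance in the list.
def normalizing_coefficients(abundances)
  a_max = abundances.max.to_f
  abundances.map { |a| 1.0 - a / a_max }
end

# Equations 1 and 2 scale a Poisson draw by nu. The draw is passed in so that
# this sketch does not depend on any particular Poisson sampler.
def scaled_fold_change(poisson_draw, nu)
  poisson_draw * nu
end

abundances = [10.0, 250.0, 1000.0] # made-up values for illustration only
nus = normalizing_coefficients(abundances)

abundances.zip(nus).each do |abundance, nu|
  # The same Poisson draw (here 2) shrinks as abundance approaches a_max.
  puts "abundance=#{abundance} nu=#{nu} scaled=#{scaled_fold_change(2, nu)}"
end
```

Under these assumptions the most abundant molecule receives ν = 0, so its scaled value is always zero, while the least abundant molecules keep nearly the full Poisson draw.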

As a result of ν, β and ε should be regarded as upper limits on fold change, as the maximum fold change will be very unlikely except at the lowest abundances. The fold change (φ) of each simulated abundance compared to the original abundance is determined by the settings of all other parameters. In other words, it is not directly tunable by the user.

C. Rejection Sampling

The statistical sampling described above would be possible with libraries that can generate Poisson-distributed random numbers. Unfortunately, software that depends on external libraries can cease to function if (or, more often, when) those libraries are no longer updated or supported. To avoid the need for external libraries, MSabundanceSIM implements its own sampling methods that emulate Poisson random number generation using a technique called rejection sampling [15, 16]. Rejection sampling allows the use of a more accessible random number generator, such as the uniform random number generator included in Ruby, to generate other distributions such as the Poisson.

The rejection sampling algorithm begins with a uniformly generated probability y (see Figure 3). A second value k is randomly selected from all reals and used to generate a Poisson probability p_k parameterized on λ. A new k is sampled until y < p_k. Here, β or ε is used as λ depending on whether we are sampling an experimentally similar or an experimentally differential molecule.

We have implemented several modifications to reduce runtime without substantially affecting the simulation results. Since the Poisson distribution has maximum probability around µ = λ, we can constrain y to (0, Poisson(λ, λ)) to prevent wasted samples (no y > Poisson(λ, λ) will be accepted). We also constrain possible k values to (0, 4λ), as the vast majority of probability is included in this range for most values of λ. With these efficiencies, we both limit the size of

Fig. 3. Rejection Sampling

procedure POISSON
    y ← Uniform(0, 1)
    repeat
        k ← Uniform(0, 4λ)
        p_k ← λ^k · e^(−λ) / k!
    until y ≤ p_k
    return k
end procedure
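The idea behind Fig. 3 can be sketched in plain Ruby using only the standard library. This is a minimal illustration and not the MSabundanceSIM implementation. It is the textbook accept-reject scheme with a uniform proposal, and it makes three assumptions that may differ from the paper: candidates k are integers in 0..4λ rather than reals, both k and y are redrawn on every attempt, and y is drawn from (0, p_max), where p_max is the Poisson probability at its mode. The method names `poisson_pmf` and `sample_poisson` are hypothetical.

```ruby
# Poisson probability P(K = k) for integer k >= 0 and lam > 0, computed in
# log space so that lam**k and k! cannot overflow.
def poisson_pmf(k, lam)
  Math.exp(k * Math.log(lam) - lam - Math.lgamma(k + 1).first)
end

# Textbook accept-reject sampler (an assumption of this sketch, see lead-in).
# Proposal: k uniform over the integers 0..ceil(4 * lam).
# Envelope: y uniform over [0, p_max), where p_max is the pmf at its mode.
def sample_poisson(lam, rng = Random.new)
  k_max = (4 * lam).ceil
  p_max = poisson_pmf(lam.floor, lam) # the Poisson pmf peaks at floor(lam)
  loop do
    k = rng.rand(0..k_max)            # candidate value
    y = rng.rand * p_max              # uniform height under the envelope
    return k if y < poisson_pmf(k, lam)
  end
end

rng = Random.new(2017) # fixed seed so repeated runs give the same draws
puts Array.new(10) { sample_poisson(4.0, rng) }.inspect
```

Capping y at p_max means no attempt is wasted on a height that no k could satisfy, and restricting k to 0..4λ keeps the proposal range small at the cost of truncating a small amount of tail probability.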

all k values while guaranteeing that the initial y value will yield an accepted k value within the Poisson distribution.

D. Input

The various simulators for generating MS output take protein lists as input in the form of .fasta files, text files that contain lists of protein entries consisting of each protein's amino acid chain. Some simulators allow for specification of protein abundance in the .fasta file as an addendum to the first line of each entry. The .fasta type input is obtainable from mining public repositories (see the posted code for an example that pulls a human plasma proteome from pax-db). Our simulator accepts .fasta files for those applications where users intend to create MS output files from the simulated sample populations to enable downstream testing, such as the ability of a workflow to detect biomarkers from raw MS data. It also accepts input files that consist only of a label and quantity for each protein, for those who wish to directly evaluate methods for measuring statistical significance. There are a multitude of freely available public abundance data sets such as pax-db [17], uniprot [18], PeptideAtlas [19], and others. The software's ability to accept simple lists of abundances makes it easy to use the abundance lists readily available from public repositories. Those who would like to create .fasta files must a) create lists of .fasta entries and b) append corresponding abundances to each molecule. This combination can be non-trivial and might require navigating large, complex, and possibly deprecated SQL databases, building custom web crawlers to combine information, etc. For example, to pull protein lists and abundances from the Human Plasma August 2013 PeptideAtlas experiments [20], it was necessary to parse a protein list from pax-db, crawl uniprot.org, link the ID to the .fasta entry, and append an abundance for each entry. This script is included in the published codebase to provide an example for those who wish to do the same.

E. Output

The software produces a user-specified number of case and control samples with a user-specified number of differentially expressed molecules. If the input is provided in .fasta format, the output will also be produced as .fasta files that can be used in an MS simulator to generate MS output files. Users can also generate simple .txt files with text labels and abundances for each molecule by providing an input file in the same format.

V. RESULTS

A successful MS input simulator will produce experimental replicates with abundance variance that matches real world observations. For validation purposes, we demonstrate that the natural variance, fold change versus abundance, and experimental variance of MSabundanceSIM all meet expectations from observed real world data.

A. Real World Dataset

Public real world data sets consisting of multiple runs from a single individual and from multiple individuals are limited in availability (many are pooled instead of analyzed individually) and limited in coverage, with state of the art MS/MS identification methods capable of identifying only some of the most abundant peptides within a sample. A recent study provided MS/MS analysis of plasma samples from ten individuals, where 347 protein groups were identified in total, with 285 detected in all individuals [21]. We use these experiments as a benchmark to show that MSabundanceSIM is capable of representing realistic variance.


real data with high positive skew (ε=2, δ=0.99) and ε=2 (see Figure 4b). The natural variance present in the real plasma data is replicated with high fidelity, with high positive skew (δ=0.99) and β=1 (see Figure 4d).

C. Fold change versus abundance

It is well known that fold changes occur at a higher rate among lower abundance molecules than among higher abundance molecules [14, 22]. The software achieves this relationship by using the normalization term ν described above, which makes higher fold changes more likely at lower abundances. Using β and ε to tune the distribution of fold changes allows the simulator to assign sufficiently large fold changes depending on whether the simulation is of technical replicates (Figure 5a) or experimental replicates (Figure 5b), in conjunction with fine tuning with the δ parameter.

VI. CONCLUSION

In the absence of ground truth data, simulated MS data provides a mechanism for quantitative evaluation of modules in the MS data processing pipeline. Although several MS output simulators have been proposed, this work presents the first MS input simulator of which we are aware. Software to generate any number of case and control molecule abundances from one or more real world samples allows for the development and initial testing of novel methods for differential detection of mass spectrometry samples. Our software creates case and control abundances with variances matching well known properties of real world data. This project is available at https://github.com/optimusmoose/MSabundanceSIM.

ACKNOWLEDGMENT

This material is based upon work supported by the National Science Foundation under Grant No. 1552240.

REFERENCES

[1] Rob Smith, Andrew D. Mathis, Dan Ventura, and John T. Prince. Proteomics, Lipidomics, Metabolomics: A Mass Spectrometry Tutorial from a Computer Scientist's Point of View. BMC Bioinformatics, 15(Suppl 7):S9, 2014.


[2] Rob Smith, Dan Ventura, and John T. Prince. Novel algorithms and the benefits of comparative validation. Bioinformatics, 29(12):1583–1585, 2013.
[3] Anne-Laure Boulesteix. On representative and illustrative comparisons with real data in bioinformatics: response to the letter to the editor by Smith et al. Bioinformatics, page btt458, 2013.
[4] Annette Michalski, Juergen Cox, and Matthias Mann. More than 100,000 Detectable Peptide Species Elute in Single Shotgun Proteomics Runs but the Majority is Inaccessible to Data-Dependent LC-MS/MS. Journal of Proteome Research, 10:1785–1793, 2011.
[5] Andrew B. Noyce, Rob Smith, James Dalgliesh, Ryan M. Taylor, K.C. Erb, Nozomu Okuda, and John T. Prince. Mspire-Simulator: LC-MS Shotgun Proteomic Simulator for Creating Realistic Gold Standard Data. Journal of Proteome Research, 12(12):5742–5749, 2013.
[6] Rob Smith and John T. Prince. JAMSS: Proteomics mass spectrometry simulation in Java. Bioinformatics, page btu729, 2014.
[7] Chris Bielow, Stephan Aiche, Sandro Andreotti, and Knut Reinert. MSSimulator: Simulation of mass spectrometry data. Journal of Proteome Research, 10(7):2922–2929, 2011.
[8] Rob Smith, Dan Ventura, and John T. Prince. Controlling for Confounding Variables in MS-omics Protocol: Why Modularity Matters. Briefings in Bioinformatics, 2013.
[9] Fahad Saeed, Trairak Pisitkun, Mark A. Knepper, and Jason D. Hoffert. An efficient algorithm for clustering of large-scale mass spectrometry data. In Bioinformatics and Biomedicine (BIBM), 2012 IEEE International Conference on, pages 1–4. IEEE, 2012.
[10] Richard Berk and John M. MacDonald. Overdispersion and Poisson regression. Journal of Quantitative Criminology, 24(3):269–284, 2008.
[11] Thang V. Pham, Sander R. Piersma, Marc Warmoes, and Connie R. Jimenez. On the beta-binomial model for analysis of spectral count data in label-free tandem mass spectrometry-based proteomics. Bioinformatics, 26(3):363–369, 2010.
[12] James H. Bullard, Elizabeth Purdom, Kasper D. Hansen, and Sandrine Dudoit. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics, 11(1):1, 2010.
[13] Mark D. Robinson, Davis J. McCarthy, and Gordon K. Smyth. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1):139–140, 2010.


[14] Simon Anders and Wolfgang Huber. Differential expression analysis for sequence count data. Genome Biology, 11(10):1, 2010.
[15] Luc Devroye. Sample-based non-uniform random variate generation. In Proceedings of the 18th Conference on Winter Simulation, pages 260–265. ACM, 1986.
[16] George Casella, Christian P. Robert, and Martin T. Wells. Generalized accept-reject sampling schemes. Lecture Notes-Monograph Series, pages 342–347, 2004.
[17] Mingcong Wang, Manuel Weiss, Milan Simonovic, Gabriele Haertinger, Sabine P. Schrimpf, Michael O. Hengartner, and Christian von Mering. PaxDb, a database of protein abundance averages across all three domains of life. Molecular & Cellular Proteomics, 11(8):492–500, 2012.
[18] UniProt Consortium et al. The universal protein resource (UniProt). Nucleic Acids Research, 36(suppl 1):D190–D195, 2008.
[19] Frank Desiere, Eric W. Deutsch, Nichole L. King, Alexey I. Nesvizhskii, Parag Mallick, Jimmy Eng, Sharon Chen, James Eddes, Sandra N. Loevenich, and Ruedi Aebersold. The PeptideAtlas project. Nucleic Acids Research, 34(suppl 1):D655–D658, 2006.
[20] Terry Farrah, Eric W. Deutsch, Gilbert S. Omenn, Zhi Sun, Julian D. Watts, Tadashi Yamamoto, David Shteynberg, Micheleen M. Harris, and Robert L. Moritz. State of the human proteome in 2013 as viewed through PeptideAtlas: comparing the kidney, urine, and plasma proteomes for the biology- and disease-driven human proteome project. Journal of Proteome Research, 13(1):60–75, 2013.
[21] Philipp E. Geyer, Nils A. Kulak, Garwin Pichler, Lesca M. Holdt, Daniel Teupser, and Matthias Mann. Plasma proteome profiling to assess human health and disease. Cell Systems, 2(3):185–195, 2016.
[22] Guoshuai Cai, Hua Li, Yue Lu, Xuelin Huang, Juhee Lee, Peter Müller, Yuan Ji, and Shoudan Liang. Accuracy of RNA-seq and its dependence on sequencing depth. BMC Bioinformatics, 13(13):1, 2012.
