
Article

Identifying high-priority proteins across the human diseasome using semantic similarity Edward Lau, Vidya Venkatraman, Cody T Thomas, Joseph C. Wu, Jennifer E. Van Eyk, and Maggie Lam J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.8b00393 • Publication Date (Web): 26 Sep 2018 Downloaded from http://pubs.acs.org on October 2, 2018


Journal of Proteome Research

Identifying high-priority proteins across the human diseasome using semantic similarity

Edward Lau 1, Vidya Venkatraman 2, Cody T Thomas 3, Joseph C Wu 1, Jennifer E Van Eyk 2,*, Maggie PY Lam 3,*

1 Stanford Cardiovascular Institute, Stanford University, Palo Alto, CA.
2 Advanced Clinical Biosystems Research Institute, Department of Medicine and The Heart Institute, Cedars-Sinai Medical Center, Los Angeles, CA.
3 Department of Medicine, Division of Cardiology, Consortium for Fibrosis Research and Translation, Anschutz Medical Campus, University of Colorado Denver, CO.

* Correspondence
Jennifer E Van Eyk, PhD, Cedars-Sinai Medical Center, Advanced Health Sciences Pavilion, 9th Floor, 127 S. San Vicente Boulevard, Los Angeles, CA 90048, USA. Email: [email protected]
Maggie Pui Yu Lam, PhD, University of Colorado Denver - Anschutz Medical Campus, Mail Stop B139, Research Complex 2, 12700 E. 19th Avenue, Aurora, CO 80045, USA. Email: [email protected]

Abbreviations: NCD: Normalized Co-Publication Distance; WCD: Weighted Co-Publication Distance.


Abstract

Identifying the genes and proteins associated with a biological process or disease is a central goal of the biomedical research enterprise. However, relatively few systematic approaches are available that provide objective evaluation of the genes or proteins known to be important to a research topic, and hence researchers often rely on the subjective evaluation of domain experts and laborious manual literature review. Computational bibliometric analysis, in conjunction with text mining and data curation, attempts to automate this process and return prioritized proteins in any given research topic. We describe here a method to identify and rank protein-topic relationships by calculating the semantic similarity between a protein and a query term in the biomedical literature while adjusting for the impact and immediacy of associated research articles. We term the calculated metric the weighted co-publication distance (WCD) and show that it compares well to related approaches in predicting benchmark protein lists in multiple biological processes. We used WCD to extract prioritized “popular proteins” across multiple cell types, sub-anatomical regions, and standardized vocabularies containing over 20,000 human disease terms. The collection of protein-disease associations across the resulting human “diseasome” supports data analytical workflows to perform reverse protein-to-disease queries and functional annotation of experimental protein lists. We envision that the described improvement to the popular proteins strategy will be useful for annotating protein lists and guiding method development efforts, as well as generating new hypotheses on under-studied disease proteins using bibliometric information.

Keywords: high-priority proteins, semantic similarity, popular proteins, diseasome, normalized co-publication distance, weighted co-publication distance, bibliometric analysis, targeted proteomics


Introduction

The corpus of scientific literature can be represented as a network of structured associations among genes/proteins, diseases, and researchers 1. Identifying the prioritized proteins within a biomedical topic (e.g., a disease, cell type, or organ) can yield insights into the underlying biological processes and can also help guide research direction and resource allocation. For instance, where the wherewithal for producing protein-specific assays or reagents is finite, a judicious strategy would be to direct efforts toward the protein targets with maximal potential return and interest in the relevant research community. One example application that has recently garnered the attention of proteomics researchers is the identification of important disease proteins to guide the development of targeted proteomics assays, in order to promote the adoption of proteomics across broad research fields.

Towards this goal, we and others have applied data science approaches to analyze the scientific literature and identify highly studied “popular proteins” from multiple tissues and research topics 2,3. The rationale for popular proteins as a proxy for biologically important proteins is that over time researchers in a field would be expected to devote more research effort to promising protein targets, such that proteins associated with more publications within a topic are also more likely to have bona fide biological significance. To identify popular proteins across research topics, we showed that the relationship between a protein and a particular topic in the published literature may be estimated by their semantic similarity within the PubMed corpus, a metric which measures the likeness of meaning between a protein term and a topic term within a corpus of documents, as opposed to the similarity in their syntactic representation. Using the PubMed query function and the publicly available Gene2PubMed reference data provided by the National Center for Biotechnology Information (NCBI), we previously determined popular proteins based on their semantic similarity with six organ systems 2. In recent work, the Biology/Disease Human Proteome Project (B/D-HPP) initiatives within the Human Proteome Organization (HUPO) have adopted this approach to discover critical proteins in the heart 4, the liver 5, and other organs 3, the results of which are being leveraged to analyze research trends and expedite bioassay development. Despite progress, whether the popular protein approach can distinguish prioritized protein lists across systematic collections of disease terms remains to be examined, and further refinements to bibliometric methods have the potential to yield more accurate prioritized protein lists.

Here we extend the popular protein strategy to incorporate additional data annotation sources and introduce a weighted co-publication distance (WCD) metric, which takes into account the immediacy and impact of individual publications to adjust the contributions of single publications to popularity scores. We find that WCD outperforms unadjusted semantic similarity scores over identical queries. We demonstrate its utility to identify popular proteins across cell types and across common disease phenotypes (inflammation, fibrosis, metabolic syndrome, protein misfolding, and cell death), and to further identify significant proteins across the human “diseasome”, a term used to refer to the set of known human diseases/disorders and their association networks 6,7. By querying a collection of over 23,000 biomedical terms compiled in standardized vocabularies of human disease processes, including 10,129 diseases, 10,642 phenotypes, and 2,370 pathways, we find that diseasome search terms are associated with specific prioritized protein lists that inform on disease relationships. Finally, we have implemented a reverse protein search strategy over the precompiled terms, which associates an input list of genes/proteins with the diseases and disease phenotypes in which they are intensively investigated; e.g., querying the protein cardiac troponin would return the disease term cardiomyopathy and the phenotype chest pain.

Materials and Methods

Calculation of semantic distance between protein and topics

The popular protein strategy 2 performs large-scale bibliometric analysis of research articles indexed on PubMed. Research topics are used to query PubMed via the NCBI EUtils Application Programming Interface (API) 8 to retrieve associated articles, which are matched against an annotation table that houses known PubMed ID (PMID)-gene associations. To measure the semantic distance between a gene/protein and a topic of interest in the literature, we previously devised a semantic similarity metric, the normalized co-publication distance (NCD), defined as:

$$
\mathrm{NCD}_{P,T} = \frac{\max\left(\log_{10}|T|,\ \log_{10}|P|\right) - \log_{10}|T \cap P|}{\log_{10}|A| - \min\left(\log_{10}|T|,\ \log_{10}|P|\right)} \tag{1}
$$

where $T$ denotes the set of articles that are retrieved by the PubMed query and are associated with any protein in the annotation table; $P$ is the set of articles associated with a particular protein in the annotation table; and $A$ is the set of all articles associated with any protein in any topic in the annotation table, i.e., all PubMed ID (PMID) entries in the annotation table, such that $T \subseteq A$ and $P \subseteq A$. The significance of association between a term and a particular protein is estimated by the Z score of the $\mathrm{NCD}_{P,T}$ over all associated proteins under a normal distribution.

We devised a weighted variant of NCD by adjusting each article's contribution according to immediacy and impact metrics. In the unadjusted NCD, each associated article $a_i$ carries an equal weight of 1. In WCD, each annotated article $a_i$ in the association table instead carries a weight of $w_i$:



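$$
\mathrm{WCD}_{P,T} = \frac{\max\left(\log_{10}\sum_{i \mid a_i \in T} w_i,\ \log_{10}\sum_{i \mid a_i \in P} w_i\right) - \log_{10}\sum_{i \mid a_i \in (T \cap P)} w_i}{\log_{10}\sum_{i \mid a_i \in A} w_i - \min\left(\log_{10}\sum_{i \mid a_i \in T} w_i,\ \log_{10}\sum_{i \mid a_i \in P} w_i\right)} \tag{2}
$$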
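As an illustration of Eqs. 1 and 2, the following Python sketch computes the distance from sets of PubMed IDs. It is not the published R implementation: the function name `wcd`, its arguments, and the handling of the two edge cases (no shared articles, and a zero denominator) are assumptions made here for clarity. When no weights are supplied, every article counts as 1 and the result is the unweighted NCD.

```python
import math

def wcd(topic_pmids, protein_pmids, all_pmids, weights=None):
    """Weighted co-publication distance between one protein and one topic.

    topic_pmids, protein_pmids, all_pmids: sets of PubMed IDs (T, P and A).
    weights: optional dict mapping PMID -> w_i. PMIDs without an entry, or
    weights=None, default to 1, which reduces WCD to the unweighted NCD.
    """
    def log_weight(pmids):
        if weights is None:
            total = float(len(pmids))
        else:
            total = sum(weights.get(p, 1.0) for p in pmids)
        return math.log10(total)

    shared = topic_pmids & protein_pmids
    if not shared:
        # Assumption: no co-publication is treated as an unbounded distance.
        return math.inf

    log_t = log_weight(topic_pmids)
    log_p = log_weight(protein_pmids)
    numerator = max(log_t, log_p) - log_weight(shared)
    denominator = log_weight(all_pmids) - min(log_t, log_p)
    if denominator == 0:
        # Only reachable when T and P both cover all of A (assumed edge case).
        return 0.0
    return numerator / denominator


if __name__ == "__main__":
    # Toy data: 1000 annotated articles, 100 on the topic, 10 on the protein.
    A = set(range(1000))
    T = set(range(100))
    P = set(range(10))
    print("NCD:", wcd(T, P, A))
    print("WCD:", wcd(T, P, A, weights={pmid: 1.5 for pmid in P}))
```

Because only logarithms of summed weights enter the formula, multiplying all weights by a constant leaves the distance unchanged; only the relative weighting of articles matters.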

To model the impact of an article $a_i$, we retrieved citation counts programmatically by querying PubMed IDs through the Europe PubMed Central (PMC) web API 9. We then applied a logistic transformation to $n_i$, the base-10 logarithm of the number of citations of the article plus one, where the scale $a$, shape $b$, and steepness $c$ are 1, 6, and 2, respectively.

$$
m_i(a, b, c) = \frac{a}{1 + b \cdot \exp(-c \cdot n_i)} \tag{3}
$$

To model the immediacy of a paper, we applied a Weibull transformation to $y_i$, the distance in decades from the publication date of the article to the present year, with the shape parameters $\lambda$ and $k$ heuristically set to 1 and 1.25, respectively.

$$
n_i(\lambda, k) = \frac{k}{\lambda}\left(\frac{y_i}{\lambda}\right)^{k-1} e^{-(y_i/\lambda)^{k}} \tag{4}
$$

The weight of each associated publication is its unit publication count plus its impact and immediacy; the final weighted publication count of a protein is the sum of these weights over its associated publications.

$$
w_i = 1 + m_i + n_i \tag{5}
$$
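The three transformations in Eqs. 3 to 5 can be sketched in a few lines of Python. The function names below are hypothetical, the parameter defaults are the values stated in the text, and the guard for articles published in the current year (or with a future date) is an assumption of this sketch rather than a documented behavior of the published code.

```python
import math

def impact(citations, a=1.0, b=6.0, c=2.0):
    """Eq. 3: logistic transform of log10(citations + 1)."""
    log_citations = math.log10(citations + 1)
    return a / (1.0 + b * math.exp(-c * log_citations))

def immediacy(years_since_publication, lam=1.0, k=1.25):
    """Eq. 4: Weibull density of the article age expressed in decades.

    Assumes k > 1, for which the density is 0 at an age of 0.
    """
    y = years_since_publication / 10.0
    if y <= 0:
        return 0.0  # assumed handling of current-year or future-dated articles
    return (k / lam) * (y / lam) ** (k - 1) * math.exp(-((y / lam) ** k))

def article_weight(citations, years_since_publication):
    """Eq. 5: w_i = 1 + m_i + n_i."""
    return 1.0 + impact(citations) + immediacy(years_since_publication)


if __name__ == "__main__":
    # Compare an uncited 30-year-old article with a well-cited recent one.
    print("old, uncited:", article_weight(citations=0, years_since_publication=30))
    print("recent, cited:", article_weight(citations=99, years_since_publication=3))
```

With these parameters an uncited article has an impact of 1/7, and the immediacy term peaks for articles roughly three years old before decaying with age.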

Determination of popular protein lists

With the above method, we retrieved popular protein lists with search terms as described in the Results section. Protein-PMID associations were retrieved on 2018-04-22 from the manually curated NCBI Gene2PubMed 8 and from data downloaded from PubTator 10, the latter of which contains text-mined relationships between biomedical concepts and entities. A union of the two sets of relationships was used in the analysis below.
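The union of curated and text-mined associations can be pictured with the short Python sketch below. It assumes the curated file follows the tab-separated NCBI gene2pubmed layout (tax_id, GeneID, PubMed_ID) and that the text-mined source has already been reduced to (gene ID, PMID) pairs; the function names, the restriction to human taxonomy ID 9606, and the toy records are all illustrative assumptions.

```python
from collections import defaultdict

HUMAN_TAX_ID = "9606"  # assumption: only human gene records are kept

def parse_gene2pubmed(lines, tax_id=HUMAN_TAX_ID):
    """Yield (gene_id, pmid) pairs from tab-separated tax_id, GeneID, PubMed_ID rows."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip the header line and blank lines
        row_tax, gene_id, pmid = line.rstrip("\n").split("\t")[:3]
        if row_tax == tax_id:
            yield gene_id, pmid

def merge_annotations(*sources):
    """Union several iterables of (gene_id, pmid) pairs into gene -> set of PMIDs."""
    table = defaultdict(set)
    for source in sources:
        for gene_id, pmid in source:
            table[gene_id].add(pmid)
    return dict(table)


if __name__ == "__main__":
    curated = [
        "#tax_id\tGeneID\tPubMed_ID\n",
        "9606\t5972\t111\n",    # toy record
        "10090\t19701\t222\n",  # non-human record, filtered out
    ]
    text_mined = [("5972", "111"), ("5972", "333"), ("183", "444")]  # toy pairs
    print(merge_annotations(parse_gene2pubmed(curated), text_mined))
```

Storing the merged table as sets means that an association reported by both sources is counted once.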

We performed PubMed queries on 23,141 defined topics retrieved from publicly available vocabularies, including 10,129 disease definitions from Disease Ontology (DOID) 11 (version 2018-03-02; retrieved 2018-03-15), 10,642 phenotypic descriptions from Human Phenotype Ontology (HPO) 12 (version 2018-03-08; retrieved 2018-03-15), and 2,370 biochemical and signaling pathways from Pathway Ontology (PWO) 13 (version 7-4-2; retrieved 2018-03-15). PWO contains only a collection of standardized terms for pathways and is distinct from Gene Ontology (GO). From the PubMed IDs retrieved in each query, protein-term associations are ranked according to NCD as previously described (popularity index). Moreover, the popularity indices for all terms and proteins that are significantly associated with each individual topic (P < 0.05) have been uploaded to the PubPular web app and made searchable.
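The significance cutoff used here derives from the Z score of each protein's distance among all proteins associated with a query, as described for NCD. A minimal Python sketch of that step follows; the use of the sample standard deviation and of a one-sided lower-tail P value (so that smaller distances give smaller P) are assumptions of this sketch, and the function name is hypothetical.

```python
from statistics import NormalDist, mean, stdev

def association_p_values(distances):
    """Map protein -> distance to protein -> (z, p) under a normal distribution.

    Assumes at least two proteins with non-identical distances, and a
    lower-tail test because a smaller distance means higher similarity.
    """
    values = list(distances.values())
    mu, sigma = mean(values), stdev(values)
    standard_normal = NormalDist()
    result = {}
    for protein, distance in distances.items():
        z = (distance - mu) / sigma
        result[protein] = (z, standard_normal.cdf(z))
    return result


if __name__ == "__main__":
    toy = {"REN": 0.41, "AGT": 0.48, "GENE_X": 0.95, "GENE_Y": 0.97, "GENE_Z": 0.99}
    for protein, (z, p) in association_p_values(toy).items():
        print(f"{protein}\tz={z:.2f}\tP={p:.3g}")
```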

To evaluate the similarity in associated protein lists between two disease terms with 50 or more significantly associated proteins, we consider (i) the proportion of shared proteins in the top 50 ranks, $\theta_{50}$, of the $-\log_{10} P$ of protein-term associations between two terms, and (ii) the Cohen's kappa, $\kappa$, in two-way classification of significantly associated proteins ($P \le 0.05$) and non-significantly associated proteins ($P > 0.05$) among the intersect of proteins with one or more associated publications in each term. Two disease terms are considered to be similar in their protein-term associations if $\theta_{50} \ge 0.8$ and $\kappa \ge 0$.
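The two similarity criteria can be illustrated with the following Python sketch, which takes one dictionary of protein P values per disease term. The function names are hypothetical, ties in P values are broken arbitrarily, and returning 0.0 when kappa is undefined is a convention chosen for this sketch only.

```python
def top_overlap(p_values_a, p_values_b, n=50):
    """Proportion of proteins shared between the n best-ranked proteins of two terms.

    Ranking by smallest P is equivalent to ranking by largest -log10(P).
    """
    top_a = set(sorted(p_values_a, key=p_values_a.get)[:n])
    top_b = set(sorted(p_values_b, key=p_values_b.get)[:n])
    return len(top_a & top_b) / n

def cohen_kappa(p_values_a, p_values_b, alpha=0.05):
    """Cohen's kappa for significant vs. non-significant calls on shared proteins."""
    shared = p_values_a.keys() & p_values_b.keys()
    n = len(shared)
    if n == 0:
        raise ValueError("the two terms share no proteins")
    sig_a = sum(p_values_a[g] <= alpha for g in shared)
    sig_b = sum(p_values_b[g] <= alpha for g in shared)
    both = sum(p_values_a[g] <= alpha and p_values_b[g] <= alpha for g in shared)
    neither = sum(p_values_a[g] > alpha and p_values_b[g] > alpha for g in shared)
    observed = (both + neither) / n
    expected = (sig_a * sig_b + (n - sig_a) * (n - sig_b)) / n**2
    if expected == 1.0:
        return 0.0  # kappa is undefined here; 0.0 is this sketch's convention
    return (observed - expected) / (1.0 - expected)

def similar_terms(p_values_a, p_values_b):
    """Apply both thresholds; intended for terms with 50 or more significant proteins."""
    return (top_overlap(p_values_a, p_values_b) >= 0.8
            and cohen_kappa(p_values_a, p_values_b) >= 0)


if __name__ == "__main__":
    term_a = {"G1": 0.001, "G2": 0.01, "G3": 0.5, "G4": 0.9}
    term_b = {"G1": 0.002, "G2": 0.6, "G3": 0.03, "G4": 0.8}
    print(top_overlap(term_a, term_b, n=2), cohen_kappa(term_a, term_b))
```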

Web application and user interface

We provide a web app, PubPular, at http://pubpular.net that allows users to query the popular protein lists of custom topics. The PubPular web app automatically analyzes the occurrences of each protein referenced in the retrieved papers using the Gene2PubMed 8 and PubTator 10 resources, and calculates the WCD between each protein and the queried topic. We created an extended module of the PubPular web application named FABIAN (Functional Annotation by Bibliometric Analysis), which allows gene/protein lists to be uploaded and compared to results from curated terms. The web module builds on the precompiled results from search terms on PubPularDB and uses parametric gene set enrichment analysis 14 to discover terms for which the list of associated proteins (P ≤ 0.05) is significantly enriched or depleted with reference to the ranks of the uploaded gene/protein list. Code for calculating WCD and an R package are available at https://github.com/ed-lau/calcWCD.

Comparison of prioritized gene lists against curated standards

Curated benchmark protein lists were retrieved as follows. Proteins associated with four Gene Ontology (GO) 15 terms, namely apoptosis (apoptotic process, GO:0006915; 762 proteins), cell adhesion (GO:0007155; 800 proteins), DNA repair (GO:0006281; 463 genes), and mitochondrial inner membrane (GO:0005743; 549 proteins), were retrieved from the European Bioinformatics Institute (EBI) QuickGO interface 16 and filtered to include only human Entrez Gene IDs that exist in the annotation source. Proteins associated with complex disease terms, namely brain infarction (40 proteins), hypertension (180 proteins), insulin resistance (62 proteins), macular degeneration (40 proteins), Parkinson disease (92 proteins), obesity (202 proteins), schizophrenia (170 proteins), and Tetralogy of Fallot (12 proteins), were retrieved from the Comparative Toxicogenomics Database (CTD) 17 (downloaded on 2018-08-13).


Precision, recall, sensitivity, and specificity of positive/negative classification are calculated based on true positives $TP$, false positives $FP$, true negatives $TN$, and false negatives $FN$. Sensitivity and recall are defined as $TP/(TP + FN)$, specificity is defined as $TN/(TN + FP)$, precision is defined as $TP/(TP + FP)$, and $F_\beta$ is defined as $(1 + \beta^2) \cdot (\mathrm{Precision} \cdot \mathrm{Recall})/(\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall})$.
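These definitions translate directly into code. The Python sketch below computes them for one predicted protein list against one benchmark list; the function name, the explicit `universe` argument (all proteins that could have been predicted), and the choice to return 0.0 for undefined ratios are assumptions of this illustration. The default of beta = 2 mirrors the F2 measure used in the Results.

```python
def classification_metrics(predicted, benchmark, universe, beta=2.0):
    """Precision, recall (sensitivity), specificity and F-beta for one protein list.

    predicted and benchmark are assumed to be subsets of universe.
    """
    predicted, benchmark, universe = set(predicted), set(benchmark), set(universe)
    tp = len(predicted & benchmark)
    fp = len(predicted - benchmark)
    fn = len(benchmark - predicted)
    tn = len(universe - predicted - benchmark)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    denominator = beta**2 * precision + recall
    f_beta = (1 + beta**2) * precision * recall / denominator if denominator else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f_beta": f_beta,
    }


if __name__ == "__main__":
    universe = {f"G{i}" for i in range(1, 11)}
    benchmark = {"G1", "G2", "G3", "G4"}
    predicted = {"G1", "G2", "G5"}
    print(classification_metrics(predicted, benchmark, universe))
```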

Results from GLAD4U 18 and PURPOSE 3 were retrieved from their respective web services on 2018-08-13 by entering the exact search terms as shown, and the protein lists were retrieved in their entirety using the download functions. Searches were performed using default settings on the web services, with the exception that the score threshold was set to 0 for GLAD4U such that the list of all predicted proteins and their scores could be retrieved. Protein lists were ranked from best to worst using the standard scores output by each method, namely GLAD4U score for GLAD4U (higher is better), PURPOSE Score for PURPOSE (higher is better), NCD for PubPular NCD (lower is better), and WCD for PubPular WCD (this study) (lower is better). Receiver operating characteristic (ROC) analysis was performed by ranking all gene/protein predictions based on the score of each method. Area-under-ROC (AUROC) was calculated by integration using the trapezoid method over all Sensitivity and 1 − Specificity values in a particular query.
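The ROC construction and trapezoid integration can be sketched in Python as follows. This is an illustrative reimplementation, not the code used in the study: the function names are hypothetical, tied scores are ranked in arbitrary order, and only benchmark proteins that appear among the scored predictions are counted as positives.

```python
def roc_points(scores, positives, higher_is_better=False):
    """Return (1 - specificity, sensitivity) points from ranking all scored proteins.

    scores: dict mapping protein -> score. For NCD and WCD lower is better;
    set higher_is_better=True for methods whose scores increase with relevance.
    """
    ranked = sorted(scores, key=scores.get, reverse=higher_is_better)
    n_pos = sum(1 for gene in ranked if gene in positives)
    n_neg = len(ranked) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative protein")

    points = [(0.0, 0.0)]
    tp = fp = 0
    for gene in ranked:  # lower the threshold one protein at a time
        if gene in positives:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auroc(points):
    """Integrate sensitivity over 1 - specificity with the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    toy_scores = {"A": 0.1, "B": 0.2, "C": 0.3, "D": 0.4}  # toy distances
    print(auroc(roc_points(toy_scores, positives={"A", "C"})))
```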

Results

Evaluation of prioritized protein lists

We previously devised a metric, NCD, for the semantic similarity between a protein and a topic of interest. NCD normalizes the count of query-specific publications by the count of total publications on PubMed that are associated with the protein, so that a query will not be populated only by proteins that are broadly studied in many fields (e.g., p53 or APOE). WCD modifies NCD by modeling the immediacy and impact of an article to weight its contribution to the overall protein-term association (Figure 1; see Methods). Overall, the publication year and citation counts of publications are poorly correlated (Supplementary Figure S1). Their incorporation in WCD allows recent and/or highly cited publications to carry additional evidence for protein-term relationships.

[Figure 1 image. Recoverable labels: a workflow schematic (user input topic, e.g., “hypertension”; NCBI EUtils; PubMed IDs; curated protein-PMIDs from Gene2PubMed; mined protein-PMIDs from PubTator; publication year and citation count from Europe PMC; compute WCD; output prioritized proteins with gene name, protein name, WCD, and P); histograms of publication year and of log10(citations + 1); immediacy vs. publication year; impact vs. citations; and plots of 1 − WCD against log10 number of associated publications, colored by −log10(P), for cystic fibrosis, diabetes mellitus, and hypertrophic cardiomyopathy.]

Figure 1. Modeling the immediacy and impact of protein-associated publications. a The immediacy of a publication is modeled using a Weibull distribution such that recent publications published within the past decade are given greater weights than older publications that are associated with a protein. b The impact of a publication is modeled using a logistic transformation of the log10 citation count of the publication retrieved via the Europe PubMed Central (PMC) API. c Scatterplot of weighted co-publication distance (WCD) vs. publication counts. The top 10 prioritized proteins in three diseases (cystic fibrosis, diabetes mellitus, and hypertrophic cardiomyopathy) as measured by WCD are given as examples (labeled in red).

The resulting metric prioritizes top proteins in PubMed query search terms, including searches for inherited and complex diseases (Figure 1). A lower WCD for a protein within a disease query suggests higher semantic similarity between the protein term and the disease term in the literature, and is overall correlated with a greater number of publications for the protein within that topic.


To evaluate how well WCD retrieves relevant gene lists, we compared its results to benchmark lists for curated terms in public resources. These include four curated terms from Gene Ontology (apoptosis, cell adhesion, DNA repair, and mitochondrial inner membrane) as well as eight complex disease terms (brain infarction, hypertension, insulin resistance, macular degeneration, obesity, Parkinson disease, schizophrenia, and Tetralogy of Fallot). Gene Ontology contains manually curated relationships between terms and human genes as well as annotations automatically transferred from homologs in other species 15, whereas the Comparative Toxicogenomics Database (CTD) 17,19 contains manual annotations and further collates annotations from Gene Ontology, Reactome, PubMed, and other sources. Both databases contain curated lists of good quality and are widely used by researchers, and hence provide a benchmark for automated methods. We compared the performance of WCD to that of NCD from the PubPular web app 2 and to two related methods, GLAD4U 18 and PURPOSE 3. The test terms were chosen to avoid biasing the comparison towards the present approach: six of the 12 terms were used as benchmarks in the GLAD4U publication, and the CTD database was a data source used to benchmark PURPOSE in its publication (Figure 2).

[Figure 2 image. Recoverable labels: “Receiver Operating Characteristic of GO/CTD Benchmark Prediction”; curves for WCD (this study), NCD (Lam et al. 2015), PURPOSE (Yu et al. 2018), and GLAD4U (Jourquin et al. 2012); x-axis 1 − specificity, y-axis sensitivity; one panel per query term: Apoptosis (GO), Brain Infarction (CTD), Cell Adhesion (GO), DNA repair (GO), Hypertension (CTD), Insulin Resistance (CTD), Macular Degeneration (CTD), Mitochondrial Inner Membrane (GO), Obesity (CTD), Parkinson Disease (CTD), Schizophrenia (CTD), and Tetralogy of Fallot (CTD); the area under the curve for each predictor is printed in each panel.]

Figure 2. Receiver operating characteristic (ROC) analysis of protein list prediction. The area-under-ROC (AUROC) metric is used to compare the performance of weighted co-publication distance (WCD) vs. unadjusted normalized co-publication distance (NCD) (Lam et al. 2015) 2 and two published approaches, GLAD4U (Jourquin et al. 2012) 18 and PURPOSE (Yu et al. 2018) 3, on 12 query terms. The results are compared against curated benchmark protein lists retrieved from the Comparative Toxicogenomics Database (CTD) or Gene Ontology (GO).


From comparison of the areas under the receiver operating characteristic curves (AUROC) of predicted gene/protein lists from each method against the manually curated benchmark dataset, we found that WCD consistently outperforms NCD in overall sensitivity and specificity, and also compares well to GLAD4U and PURPOSE. Each method differs in data annotation source and prioritization algorithm (see also Discussion), and we saw some evidence that the approaches are complementary. No method performed better than WCD in 11 out of 12 terms tested. PURPOSE tied with WCD in 2 terms and was the top performer in one term (hypertension). Among the 12 tested terms, brain infarction appeared to be the only term where none of the methods performed well, possibly because this CTD term is not commonly used in the relevant literature. Besides brain infarction, mitochondrial inner membrane was another query term where the compared methods did not appear to reach high sensitivity, suggesting that some of the curated proteins were not in the predicted lists at all regardless of confidence or score cutoff.

We complemented the analysis by comparing the performance of WCD and NCD using the F2 measure, which is a function of the recall and precision of the returned protein list at a particular threshold. Compared to the F1 score, the F2 score places twice the emphasis on recall over precision and is preferred in our comparison because the cost of a false positive is judged to be lower than the cost of a false negative. At two separate significance thresholds (P ≤ 0.01 and P ≤ 0.05), WCD outperforms PubPular NCD 2 in 9 of the 12 terms tested and in 10 of the 12 terms tested, respectively (Supplementary Figure S2). Altogether, these comparisons suggest that WCD offers excellent performance in identifying important genes/proteins across topics.

Catalogs of popular proteins across cell types and diseases

Using the devised prioritization method, we set out to identify prioritized proteins in several individual sub-anatomical regions and cell types. We previously demonstrated that queries of six major organ systems revealed a preferential affinity of each organ with a specific set of proteins. Here we asked whether the sub-anatomical regions and cell types can also be shown to be preferentially associated with different proteins.


For the heart, we queried the anatomical regions “left atrium”, “left ventricle”, “right atrium”, and “right ventricle” as well as the cell types “cardiomyocytes”, “smooth muscle cells”, “endothelial cells”, and “fibroblasts”, using the search terms “cardiac OR heart AND left AND atrium”, etc. For the lung, we queried the anatomical regions “alveolar sac”, “bronchiole”, “capillaries”, and “trachea”, as well as the cell types “pneumocytes”, “smooth muscle”, “epithelial”, and “fibroblasts”. For the brain, we queried proteins associated with the anatomical regions “cerebellum”, “cerebrum”, “brain stem”, and “thalamus”, and the cell types “neurons”, “astrocyte”, “glial cell”, and “oligodendrocyte”.

The analysis led to several general observations. Firstly, we found that queries of different sub-anatomical regions were sufficiently specific; for instance, the top five proteins in each query readily returned associations with region-specific proteins (Table 1). For example, in the heart, connexin-40 (GJA5) is preferentially associated with the atria but not the ventricles, consistent with the known involvement of the protein in the pathogenesis of atrial fibrillation 20. In the brain, ataxins (ATXN1/2), associated with progressive ataxias, are preferentially associated with the cerebellum but not the cerebrum. Cell types from each tissue were also associated with different lists of prioritized proteins. For instance, surfactant proteins are preferentially associated with pneumocytes, which form the alveolar linings, whereas fibroblast growth factors (FGFs) populate the prioritized list for lung fibroblasts. Notably, the fibroblasts and smooth muscle cells in the heart and in the lung are associated with different sets of proteins, e.g., FGF23 and FGF21 for heart fibroblasts vs. FGF10 and FGF7 for lung fibroblasts. This suggests that the prioritized protein lists may help shed light on the gene expression and properties of similar cell types found across multiple organs, such as fibroblasts and endothelial cells, which may be implicated in common disease processes (e.g., fibrosis and endothelial disorders) that accompany diverse human diseases.

Table 1. Prioritized proteins across cell types and anatomical regions.

Organ  Category           Term                Top five proteins
Heart  Cell type          Cardiomyocytes      TTN, NKX2-5, GATA4, RYR2, SCN5A
Heart  Cell type          Smooth muscle cell  MYOCD, TAGLN, SRF, ACTA1, KCNJ8
Heart  Cell type          Endothelial cell    NOS3, PECAM1, KDR, VCAM1, CDH5
Heart  Cell type          Fibroblast          FGF23, FGF21, FGF2, MYOCD, GATA4
Heart  Anatomical region  Left atrium         NPPB, TP53INP2, ADRB1, MYH7, MYBPC3
Heart  Anatomical region  Left ventricle      NPPA, NPPB, GJA5, MYL4, HCN4
Heart  Anatomical region  Right atrium        PKP2, NPPB, TBX1, DSG2, HAND2
Heart  Anatomical region  Right ventricle     NPPA, GJA5, ADRB1, HCN4, GJC1
Lung   Cell type          Pneumocytes         SFTPC, SFTPA1, SFTPB, SFTPD, NKX2-1
Lung   Cell type          Smooth muscle cell  ACTA1A, BMPR2, EDN1, KCNA5, KCNS3
Lung   Cell type          Epithelial cell     CFTR, SFTPC, CDH1, SCGB1A1, CXCL8
Lung   Cell type          Fibroblasts         CFHR1, FGF10, FGF7, TGFB1, FGFR1
Lung   Anatomical region  Alveolar sac        FOXQ1, FOXJ1, FOXF1, ATP1A1, ABCA3
Lung   Anatomical region  Bronchiole          SCGB1A1, CYP2F1, KCNRG, SAA2-SAA4, SFTPB
Lung   Anatomical region  Capillaries         PLSCR2, PLEK2, GPR182, COL15A1, PIANP
Lung   Anatomical region  Trachea             SCGB3A2, MUC5AC, MUC5B, CYP2S1, SCGB1A1
Brain  Cell type          Neuron              CA1, CA3, BDNF, PVALB, RBFOX3
Brain  Cell type          Astrocyte           GFAP, SLC1A2, AQP4, SLC1A2, AIF1
Brain  Cell type          Glia                GFAP, GDNF, OLIG2, AIF1, SLC1A2
Brain  Cell type          Oligodendrocyte     OLIG2, MOG, MBP, CNP, CSPG4
Brain  Anatomical region  Cerebellum          CBLN1, CACNA1A, ATXN1, ATXN2, GRID2
Brain  Anatomical region  Cerebrum            CA1, CA3, PVALB, GRIA1, BDNF
Brain  Anatomical region  Brainstem           TH, OLIG2, PROM1, GFAP, BDNF
Brain  Anatomical region  Thalamus            SLC17A6, SLC6A4, PVALB, SMN1, CALB1

The majority of known human diseases can be grouped into subnetworks within a disease network sometimes referred to as the “diseasome”, in which known diseases can be grouped into clusters based on their shared disease phenotypes 7 . To determine how the prioritized protein lists intersect with common disease processes that occur in complex human diseases including those that are the thematic focuses of HUPO B/D-HPP initiatives, we queried the popular proteins in six specific disease processes, namely fibrosis, cell death, inflammation, metabolic syndrome, oxidative stress, and protein misfolding (Table 2).

Table 2. Top popular proteins for common disease phenotypes. Symbols for top 20 proteins represented by their gene name for each common disease process are shown and ranked by WCD and their P values.

Rank  Fibrosis  Cell Death  Inflammation  Metabolic Syndrome  Oxidative Stress  Protein Misfolding
1     TGFB1     CASP3       IL6           ADIPOQ              NFE2L2            PRNP
2     CTGF      BCL2        TNF           INS                 CAT               HTT
3     ACTA1     BAX         CRP           SHBG                SOD1              SNCA
4     SMAD3     CASP9       IL1B          HSD11B1             HMOX1             TTR
5     SMAD2     CASP8       CXCL8         CRP                 SOD2              IAPP
6     PNPLA3    ANXA5       CCL2          SLC12A3             GSR               TARDBP
7     KIF21A    BCL2L1      IL10          RARRES2             KEAP1             CANX
8     SLC17A5   CYCS        IL17A         LEP                 GPX1              SOD1
9     GPT       PARP1       TLR4          ETV3                TXN               ATXN3
10    FN1       FASLG       IL1A          FABP4               NOX4              P4HB
11    TIMP1     TP53        ICAM1         APOB                GCLC              DNAJB2
12    SAMSN1    TNFSF10     VCAM1         NAMPT               OGG1              HSF1
13    SMAD7     MCL1        NLRP3         CLCNKB              PON1              ATXN1
14    IFNL3     BECN1       IL18          RETN                NQO1              DNAJB1
15    COL1A1    FAS         IL1RN         PPARA               GSTK1             RNF5
16    HPS1      XIAP        HMGB1         PPARG               PARK7             HSPA4
17    TM6SF2    AKT1        IL33          GPT                 CYBB              CFTR
18    LGALS3    TNFRSF10B   SELE          RBP4                SOD3              HSPA5
19    HPS4      PDCD1       RNASE3        SAMSN1              PRDX5             ZFAND2A
20    POSTN     CASP7       CCL5          APOA5               CASP3             HSPA8

We find that the top five proteins in “fibrosis” are transforming growth factor beta 1 (TGFB1), followed by connective tissue growth factor (CTGF), actin, alpha skeletal muscle (ACTA1), mothers against decapentaplegic homolog 3 (SMAD3), and mothers against decapentaplegic homolog 2 (SMAD2). Another molecular phenotype common in multiple diseases is “cell death”. The top five proteins returned by our popular protein search using the keyword “cell death” are caspase-3 (CASP3), apoptosis regulator Bcl-2 (BCL2), apoptosis regulator BAX (BAX), caspase-9 (CASP9), and caspase-8 (CASP8). The query for inflammatory response returned common cytokines including interleukin-6 (IL6), tumor necrosis factor (TNF), C-reactive protein (CRP), interleukin-1 beta (IL1B), and interleukin-8 (CXCL8); the query for metabolic syndrome returned lipid metabolism proteins including adiponectin (ADIPOQ), insulin (INS), and leptin (LEP); and the query for oxidative stress returned nuclear factor erythroid 2-related factor 2 (NRF2/NFE2L2), catalase (CAT), superoxide dismutases (SOD1/2), and kelch-like ECH-associated protein 1 (KEAP1). Finally, protein misfolding returned tauopathy and neurodegenerative proteins as well as amyloidosis proteins including alternative prion protein (PRNP), alpha-synuclein (SNCA), huntingtin (HTT), transthyretin (TTR), and superoxide dismutase (SOD1).

Protein-disease networks across the human diseasome

Although the individual results on common disease processes are not entirely surprising, the prioritized protein lists could be useful for identifying proteins and pathways that are preferentially studied in particular disease processes, such that reagent development efforts could be prioritized towards these topics (e.g., fibrosis in the heart). Building on this effort, we systematically queried over 25,000 search terms in comprehensive vocabularies that describe virtually the entirety of known human diseases. In total, we performed individual PubMed queries and then calculated protein association scores for 23,141 defined topics retrieved from publicly available vocabularies, including proteins for 10,129 disease definitions from Disease Ontology (DO), 10,642 phenotypic descriptions from Human Phenotype Ontology (HPO), and 2,370 biochemical and signaling pathways from Pathway Ontology (PWO). Among the vocabularies, 7,897 search terms in DO were associated with at least one significant (P ≤ 0.05) protein, along with 7,076 terms in HPO and 1,798 terms in PWO.


We explored the network representation of the relationships between 832 DO disease terms that are each significantly associated with 50 or more proteins at P ≤ 0.05. Manual inspection showed that disease terms are clustered together by their prioritized protein lists (Figure 3). We compared the derived protein “diseasome” with a previous disease network generated using Online Mendelian Inheritance in Man (OMIM) data 21 . Despite differences in data source and methodology, we observe comparable properties in the derived human disease networks. First, we observe a network topology organized into hubs, where the majority of disease terms are linked to few neighbors but a few hub diseases are linked to many neighbors. Hubs are occupied by top-level or near-top-level terms in DO categories, e.g., DOID:5295 intestinal diseases is a hub disease and is linked to DOID:0060810 colitis, DOID:0050589 inflammatory bowel disease, DOID:8577 ulcerative colitis, and DOID:0060190 ileocolitis. Second, we see prominent and interconnected clusters of cancer terms in the network, as also noted by Goh et al. 21 . Third, related disease terms are clearly connected via their semantic similarity despite little verbal or syntactic similarity, e.g., DOID:1242 globe disease is the neighbor of DOID:10871 age-related macular degeneration and DOID:3612 retinitis. Although these observations may be expected, they nevertheless show that the protein-association approach is able to distinguish relevant pathogenic processes across human diseases. All 11,428,397 protein-disease associations can be found in Supplementary Data 1.
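To illustrate how disease terms might be linked through their prioritized protein lists, the following Python sketch scores each pair of terms by the overlap of their protein sets and extracts a minimum spanning forest with Kruskal's algorithm. It is only a sketch under stated assumptions and not the procedure used to produce Figure 3: the Jaccard index is a stand-in for the similarity measures used in the study, and the term labels (D1 to D4) and protein identifiers (p1 to p9) are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two protein sets (assumed stand-in for the study's similarity)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def spanning_forest(term_to_proteins, min_similarity=0.0):
    """Kruskal's algorithm on distance = 1 - similarity; returns (term, term, similarity)."""
    edges = []
    for s, t in combinations(sorted(term_to_proteins), 2):
        sim = jaccard(term_to_proteins[s], term_to_proteins[t])
        if sim > min_similarity:  # only connect terms that share proteins
            edges.append((1.0 - sim, s, t))
    edges.sort()

    parent = {t: t for t in term_to_proteins}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for dist, s, t in edges:
        root_s, root_t = find(s), find(t)
        if root_s != root_t:  # adding this edge does not close a cycle
            parent[root_s] = root_t
            tree.append((s, t, round(1.0 - dist, 3)))
    return tree


# Hypothetical toy input: three related terms and one unrelated term.
toy_terms = {
    "D1": {"p1", "p2", "p3"},
    "D2": {"p2", "p3", "p4"},
    "D3": {"p3", "p4", "p5"},
    "D4": {"p9"},
}
toy_tree = spanning_forest(toy_terms)
```

In this toy input, D4 shares no proteins with the other terms and therefore stays unconnected, which is why the result is a forest rather than a single tree.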

Reverse query from proteins to significantly-associated topics

Using the compiled lists of prioritized proteins across multiple human diseases and phenotypes, we explored whether reverse queries could be made from proteins to retrieve information on disease vocabulary terms. In other words, given a protein name, a protein-to-topic reverse query returns all the disease areas in which this protein is intensively studied based on literature records. This is distinguished from the forward query, where the user inputs a disease term and retrieves all the proteins that are intensively studied in the disease. For instance, one of the most highly investigated proteins in the heart is troponin I (TNNI3). A reverse query with TNNI3 against the pre-compiled popular protein lists of DO and HPO terms indicates that, as expected, TNNI3 is also highly associated with a cluster of cardiovascular-related topics, ranging from “myocardial infarction” (DO accession DOID:5844; P: 9.6e-5) to “hypertrophic cardiomyopathy” (DO accession DOID:11984; P: 4.1e-3).

Utilizing the reverse query strategy on the list of popular disease phenotype proteins above, we find that the top fibrosis protein TGFB1 is significantly associated with “mesenchymal cell neoplasm” (DO accession DOID:3350; P: 0.059), “collagen diseases” (DO accession DOID:854; P: 0.0016), as well as a number of fibrotic diseases including “pulmonary fibrosis” (DO accession DOID:3770; P: 0.0045), “renal fibrosis” (DO accession DOID:50855; P: 0.0042), and “liver cirrhosis” (DO accession DOID:5082; P: 0.031), consistent with its involvement in common disease processes. Moreover, we asked which other disease terms are popularly associated with another top fibrosis protein, CTGF, and identified a broad spectrum of disease terms including “connective tissue benign neoplasm”, “connective tissue cancer”, “renal fibrosis”, “liver cirrhosis”, and “scleroderma”. In the HPO dataset, TGFB1 is further associated with phenotypes including “cirrhosis”, “beta-cell dysfunction”, and hepatic, pulmonary, and renal fibrosis. The pathways associated with TGFB1 include the transforming growth factor beta signaling pathway, the cell-extracellular matrix signaling pathway, and peptide and protein metabolic process. Importantly, this strategy is generalizable to other collections of popular protein lists not detailed here. For example, the Brenda tissue ontology (BTO) contains a collection of terms on tissue and cell types, and a reverse query against it shows that TGFB1 is preferentially associated with a number of fibroblast-related publications in the literature, including in myofibroblasts and lung fibroblasts.
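The relationship between forward and reverse queries can be sketched as the inversion of a lookup table. The Python example below is a minimal illustration and not the implementation used in this study: the function name `build_reverse_index` and the dictionary layout are assumptions, and the three toy entries follow the query examples shown in the Figure 4a schematic.

```python
from collections import defaultdict


def build_reverse_index(forward, alpha=0.05):
    """Invert topic -> [(protein, P)] into protein -> [(topic, P)], keeping P <= alpha."""
    reverse = defaultdict(list)
    for topic, hits in forward.items():
        for protein, p_value in hits:
            if p_value <= alpha:
                reverse[protein].append((topic, p_value))
    # List the most significantly associated topics first.
    for protein in reverse:
        reverse[protein].sort(key=lambda item: item[1])
    return dict(reverse)


# Forward (topic-to-protein) lists; entries follow the Figure 4a examples.
forward_lists = {
    "acute myocardial infarction": [("TNNI3", 9.8e-5), ("TNNT2", 2.9e-4)],
    "pulmonary embolism": [("F5", 4.0e-4), ("TNNI3", 1.2e-2)],
    "kidney failure": [("EPO", 1.8e-5), ("TNNI3", 4.6e-2)],
}

reverse_lists = build_reverse_index(forward_lists)
tnni3_topics = [topic for topic, _ in reverse_lists["TNNI3"]]
```

Because the forward lists are precompiled once, a reverse query in this layout is a single dictionary lookup rather than a new literature search.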

One application of the reverse query strategy is that the curated protein lists across human diseases allow popular proteins to be used as an annotation source for functional analysis of gene lists. For instance, given a list of differentially expressed genes or proteins found in a quantitative transcriptomics or proteomics experiment comparing two biological samples, one may examine whether the significantly up- or down-regulated proteins are enriched in proteins that are intensively researched in a particular disease or disease phenotype. We implemented a new module (FABIAN) to perform gene enrichment analysis against precompiled popular protein lists. To evaluate the potential utility of this approach, we retrieved a publicly available transcriptomics dataset on cardiac failure, which encompasses five replicates each of control vs. failing hearts from a rodent model of transverse aortic constriction with apical myocardial infarction (GSE56348) 22 . We performed a hypergeometric test to identify enriched annotation terms among differentially expressed genes (defined as having limma 23 adjusted P ≤ 0.01) against Gene Ontology biological process terms and the precompiled DOID disease-gene associations (Figure 4). The results show that reverse popular protein queries provide complementary annotations to GO Process terms. For example, we find significant enrichment of differentially regulated genes that are intensively researched in the DOID “collagen disease” and “cartilage disease” terms (hypergeometric test adjusted P ≤ 0.05), corresponding to enrichment of the GO “extracellular matrix organization” term, as well as enrichment of the DOID “mitochondrial disease” term, which corresponds to the GO “mitochondrial electron transport, NADH to ubiquinone” term. Moreover, the enrichment analysis against DOID shows a significant involvement of genes highlighted in “atrial fibrillation”, which was not readily apparent among the top enriched GO terms (Figure 4), highlighting the potential utility of combining multiple annotation sources in large-scale data interpretation.
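The hypergeometric enrichment test described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the FABIAN module itself: the function names, the toy gene identifiers (g0 to g19), and the term contents are hypothetical, and the multiple-testing adjustment that would be needed to obtain adjusted P values is omitted for brevity.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n genes are drawn from N, of which K carry the annotation."""
    total = comb(N, n)
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def enrich(de_genes, background, term_to_genes):
    """Return (term, overlap, unadjusted P) sorted by P for terms hit by de_genes."""
    background = set(background)
    de = set(de_genes) & background
    results = []
    for term, genes in term_to_genes.items():
        annotated = set(genes) & background
        overlap = len(de & annotated)
        if overlap == 0:
            continue  # nothing to test for this term
        p_value = hypergeom_sf(overlap, len(background), len(annotated), len(de))
        results.append((term, overlap, p_value))
    return sorted(results, key=lambda row: row[2])


# Hypothetical toy data: 20 background genes, 4 differentially expressed.
background_genes = [f"g{i}" for i in range(20)]
toy_terms = {
    "collagen disease": ["g0", "g1", "g2", "g3", "g4"],
    "mitochondrial disease": ["g12", "g13", "g14", "g15"],
}
toy_hits = enrich(["g0", "g1", "g10", "g11"], background_genes, toy_terms)
```

The same routine can be run against any term-to-gene mapping, so precompiled DO or HPO popular protein lists and GO annotations can be tested side by side on one gene list.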

Lastly, WCD-ranked popular proteins may also be useful for identifying important target proteins that are currently “under-studied” in the literature (Supplementary Figure S3), which may help counter researcher bias and also highlight high-value future research avenues 24 . It has been suggested that biomedical research is overly focused on only a subset of genes and hence may create a “rich gets richer” scenario that leaves important genes/proteins under-studied 24,25 . This has been attributed to various factors including availability of reagents 26 and risk-averse funding mechanisms 25 , but few solutions have been proposed and the advent of omics data alone did not appear to correct gene research biases 25 . One approach we propose is to focus on proteins that interact closely with highly popular proteins but are themselves associated with relatively few publications. As a proof of concept, we mapped WCD values to predicted protein-protein associations from STRING 27 to create a directed graph connecting interacting pairs from low to high popularity scores. We then redistributed protein popularity scores using the PageRank algorithm implemented in the igraph package in R. We used three example search terms (“heart failure”, “obesity”, and “Parkinson disease”) to discern proteins that receive the most gains in popularity ranks, i.e., under-studied proteins that occupy important hub positions around highly-studied proteins. Notable up-ranked heart failure proteins include HEY2, GJA1, and NACA2. Notable up-ranked obesity proteins include PPY and NPY2R. Notable up-ranked Parkinson disease proteins include RING1, SMPD1, and COX6A2 (Supplementary Figure S3). Although this hypothesis generation approach shows promise, we caution that a potential limitation is that disease interactomes are not specific to cell types or models. We suggest that future work may seek to refine the approach outlined here and experimentally validate whether implicated proteins may be critical for regulating pathological phenotypes in various disease models.
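The graph construction and score redistribution described above can be sketched as follows. The study used the PageRank implementation in the igraph package in R; the version below is a plain-Python power iteration written only for illustration. The damping factor, the handling of ties and dangling nodes, and the toy protein labels and popularity scores are all assumptions and are not taken from the study.

```python
def orient_low_to_high(interactions, popularity):
    """Direct each interacting pair from the less to the more popular protein."""
    edges = []
    for a, b in interactions:
        if popularity[a] == popularity[b]:
            continue  # assumption: tied pairs carry no direction and are skipped
        edges.append((a, b) if popularity[a] < popularity[b] else (b, a))
    return edges


def pagerank(nodes, edges, damping=0.85, iterations=100):
    """Power-iteration PageRank on an unweighted directed graph."""
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1.0 - damping + damping * dangling) / n for node in nodes}
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical popularity scores and interactions for four placeholder proteins.
toy_popularity = {"A": 10.0, "B": 5.0, "C": 1.0, "D": 0.5}
toy_interactions = [("A", "B"), ("B", "C"), ("B", "D"), ("A", "C")]

toy_edges = orient_low_to_high(toy_interactions, toy_popularity)
toy_rank = pagerank(list(toy_popularity), toy_edges)
```

Comparing the ordering of proteins by `toy_rank` with their ordering by the original popularity scores gives the change in rank for each protein, which is the quantity used above to flag up-ranked candidates.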

[Figure 3, panels a-e. Panel a: schematic of the three vocabularies scored by WCD(P,T): Disease Ontology (10,129 terms with associated proteins), Human Phenotype Ontology (10,642 terms), and Pathway Ontology (2,370 terms), yielding precompiled protein lists for 16,771 terms with significantly associated proteins at P ≤ 0.05 (7,897 DO; 7,076 HPO; 1,798 PWO). Panel b: distributions of total and significant proteins per term across DO, HPO, and PWO (axes: number of associated proteins; log10 number of terms). Panel c: correlation matrix and minimal spanning tree for 832 DO terms with ≥ 50 significantly associated proteins, with the distribution of node degrees (axes: k; count). Panels d-e: zoomed-in views of labeled DO terms.]

Figure 3. Prioritized proteins across the human diseasome. a Schematics for precompiling popular protein terms from three standard vocabularies related to human diseases and disease processes. b Distribution of total proteins per term in three vocabularies. The distribution of number of total (left) and significantly associated (right) proteins per term in each vocabulary (P ≤ 0.05). c Correlation matrix of protein associations for 832 Disease Ontology (DO) terms with 50 or more proteins associated at P ≤ 0.05. A minimal spanning tree of DO terms based on similarity of associated proteins. A protein network is constructed using all DO terms associated with any proteins as nodes. Edges connect pairs of DO terms with κ > 0 or θ50 ≥ 0.2, from which a minimal spanning tree is constructed. d-e Zoomed-in views of selected disease labels around two network nodes.

[Figure 4, panels a-c. Panel a: example forward (topic-to-protein) queries, e.g., “Acute myocardial infarction” (DOID:9408), “Pulmonary embolism” (DOID:9477), and “Kidney failure” (DOID:1074), and reverse (protein-to-topic) queries for TNNI3, PECAM1, and KRAS; for instance, the reverse query for TNNI3 returns acute myocardial infarction (P: 9.8e-5), cardiomyopathy (P: 1.2e-3), pulmonary embolism (P: 1.2e-2), and kidney failure (P: 4.6e-2). Panel b: term enrichment analysis across vocabularies among differentially expressed transcripts in heart failure, shown separately for DOID disease terms, HPO phenotype terms, and GO biological process terms (axes: −log10(P); fold proteins enriched). Panel c: differentially expressed genes/proteins (count) vs. top enriched annotations, by direction (up-regulated, down-regulated) and by DOID, HPO, and GO term.]

Figure 4. Enriched terms in reverse protein-to-disease query (DO and HPO) vs. Gene Ontology. a Schematics for performing reverse (protein-to-term) queries using precompiled popular protein lists in the human diseasome. b The enriched terms (hypergeometric test P ≤ 0.05) from top DO, middle HPO, and bottom GO Biological Processes were associated with differentially expressed genes (limma adjusted P ≤ 0.01) in a microarray dataset from a rodent model of heart failure. c Relationship between assigned DO, HPO, and GO terms. Top associated terms are shown for each significantly up-regulated (blue) or down-regulated (red) transcript (limma adjusted P ≤ 0.01) in a microarray dataset from a rodent model of heart failure. The alluvial streams link the top enriched term of DO to the corresponding terms in HPO and GO for each transcript. For example, a number of up-regulated transcripts are associated with the “familial atrial fibrillation” term in DO, corresponding in part to the “arrhythmia” term in HPO, and to the “regulation of heart rate by cardiac conduction” term in GO.

Discussion

Gene/protein prioritization is a recurring informatics task in biomedical research 28 which can be generally stated as follows: given a collection of gene names, identify a subset that is preferentially associated with a topic or disease in question. For instance, given a list of genes residing at a locus implicated in a genetic mapping study, one may wish to find the causal disease genes or variants responsible for the observed phenotypes. More recently, there has been interest in protein prioritization efforts to guide the development of research reagents or to promote distributive fairness in biocuration efforts. The Biology/Disease Human Proteome Project (B/D-HPP) initiative within the Human Proteome Organization (HUPO) has a mission to popularize proteomics assays and reagents, an objective that requires prioritization of genes and proteins to nominate the most attractive assays for development. These needs have spawned text-mining and network approaches 28–30 as well as literature-based strategies for in silico gene/protein prioritization.

Literature-based gene/protein prioritization is predicated on the hypothesis that over time, researchers will choose to work and publish preferentially on proteins relevant to a disease or topic, and hence a popular protein will also tend to play bona fide significant roles in a biological phenomenon of interest. We and others have previously shown that publication popularity yields accurate predictions of curated gene lists, and it has the advantage of being applicable to any search term that returns PubMed results. Several related approaches exist to estimate publication popularity within topics, which differ by their sources of annotated gene/protein-term relationships and by their information retrieval algorithms. In previous work, we utilized the NCBI curated Gene2PubMed file without removal of publications, and calculated the unweighted semantic similarity between a search term and a protein to prioritize term-specific proteins over non-specific ones 2 . GLAD4U utilizes the NCBI curated Gene2PubMed file, after removing publications that are associated with 500 or more proteins, and applies a hypergeometric test to identify proteins that appear more often than expected 18 . More recently, Yu et al. 3 read from the text-mined PubTator file 10 and ranked proteins by term-frequency inverse document-frequency (TF-IDF) modified by the citation index of each publication. In the present study, we use a union of the curated Gene2PubMed file and the text-mined PubTator file and calculate the weighted co-publication distance, which introduces weighting factors based on the transformed impact (number of citations) and immediacy (number of years since publication) of linked publications in the calculation of protein-term association. In prior work, we observed that the trend of protein popularity in research can change over time 2 , e.g., the popularity of brain-type natriuretic peptide (BNP) surged following its adoption as a clinical marker for heart failure in 2003 4 . Hence we hypothesize that by assigning more weight to more recent papers we can better capture the direction and interest of a field of research associated with a given topic. In parallel, it has been suggested that widely-cited publications carry more influence on the direction of a field and hence may be given higher significance in literature analysis, including gene prioritization 3 and text-mining 30 approaches. Pubpular WCD compares well with related methods including PURPOSE and GLAD4U in sensitivity and specificity of prediction against benchmarked gene lists.
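To make the idea of weighting publications concrete, the Python sketch below computes a weighted, normalized co-occurrence distance between a term and a protein. It illustrates the general principle only and should not be read as the WCD definition used in this work: the normalized form is borrowed from the normalized Google distance, and `example_weight` is a hypothetical transform of citation count and publication age introduced purely for illustration.

```python
import math


def example_weight(citations, years_since_publication):
    """Hypothetical weight: grows with citations, shrinks with publication age."""
    return (1.0 + math.log1p(citations)) / (1.0 + years_since_publication)


def weighted_distance(term_pmids, protein_pmids, all_pmids, weight=lambda pmid: 1.0):
    """Weighted normalized co-occurrence distance (assumed form); smaller is closer."""
    term_pmids, protein_pmids = set(term_pmids), set(protein_pmids)
    w_term = sum(weight(p) for p in term_pmids)
    w_protein = sum(weight(p) for p in protein_pmids)
    w_both = sum(weight(p) for p in term_pmids & protein_pmids)
    w_all = sum(weight(p) for p in all_pmids)
    if w_both == 0:
        return math.inf  # no shared publications, so no measurable association
    log_term, log_protein = math.log(w_term), math.log(w_protein)
    numerator = max(log_term, log_protein) - math.log(w_both)
    denominator = math.log(w_all) - min(log_term, log_protein)
    return numerator / denominator


# Toy corpus of 100 publication IDs; the term and the protein share PMIDs 3 and 4.
corpus = range(1, 101)
term_papers = {1, 2, 3, 4}
protein_papers = {3, 4, 5}

unweighted = weighted_distance(term_papers, protein_papers, corpus)
# Doubling the weight of the shared papers (e.g., recent or highly cited ones).
boosted = weighted_distance(
    term_papers, protein_papers, corpus,
    weight=lambda pmid: 2.0 if pmid in {3, 4} else 1.0,
)
```

In this toy case the distance shrinks when the shared publications carry more weight, which mirrors the rationale above: a protein linked to a topic through recent or widely cited papers is ranked as more closely associated.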

The present study is also the first to demonstrate the utility of popular proteins in three applications, namely to analyze: (i) cell types and anatomical regions within an organ, (ii) disease processes underlying multiple disorders, and (iii) systematically cataloged disease terms within curated vocabularies. Our results suggest that cell types from each organ are preferentially associated with investigations of different proteins and can lend higher resolution to the identification of proteins associated with critical disease processes. Identifying popular proteins in common disease processes may be useful for guiding prioritization methods for protein assays that are not specific to a particular field and so may have wider appeal, e.g., to develop a panel of multiple reaction monitoring (MRM) assays for fibrosis that can be applicable to ongoing research in the heart, the lung, as well as the liver. Upon extracting popular proteins from over 23,000 disease and disease phenotype definitions, we found that the similarity of associated proteins can be used to cluster disease terms and create a representation of the human “diseasome”, a network medicine concept that supposes human diseases are interrelated via underpinning processes and can be used to identify the wiring diagram of how perturbations in key genes and modules can influence pathogenic processes 21 . The popular protein lists provide a potential alternative route towards generating disease-disease and disease-gene association networks, which have been explored previously using other data sources (e.g., genetic mutation knowledge 21 and text-mining approaches 6 ) to reveal phenotypic homogeneity of related diseases. A comparison of network structure between the popularity-based disease network here and phenotype-based networks may help discern deep-lying commonalities and differences in disease features. In more immediate applications, pre-compiled popular proteins across large vocabularies of disease terms enabled a “reverse query” strategy to identify disease phenotypes that have been associated with a query protein in the literature. Applying this strategy to re-analyze differentially expressed genes in a public dataset on heart failure, we suggest that the enrichment of DO and HPO disease terms among differentially regulated transcripts could provide complementary information over commonly utilized GO analysis.

In summary, we describe here a method to prioritize intensively researched proteins associated with cell types, sub-anatomical regions, and molecular phenotypes common across human diseases. The current study has several limitations. The annotation sources linking genes to PubMed IDs do not distinguish between gene-level and protein-level experimental evidence in the associated studies. We also saw evidence of bias in protein annotation (Supplementary Figure S4): proteins with uncertain existence evidence at the protein level (neXtProt PE2-5) 31 and proteins with no known functions (uPE1) are both associated with lower publication counts. Some proteins are therefore “unpopular” because their expression and function have not been thoroughly investigated, yet they may nevertheless have undiscovered importance in disease processes. Future work may address this by identifying proteins that interact with intensively researched proteins but are themselves under-studied in the literature.
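The future direction mentioned above could be sketched as follows: given a protein interaction edge list and per-protein publication counts, list the proteins with few publications that interact with heavily published ones. This is a minimal sketch under stated assumptions; the function name, the thresholds, and the toy data are all hypothetical, and a real analysis would draw interactions from a resource such as STRING 27.

```python
def understudied_neighbors(interactions, pub_counts, popular_min, understudied_max):
    """Rank under-studied proteins by their number of popular interaction partners.

    interactions: iterable of (protein_a, protein_b) undirected edges.
    pub_counts: dict mapping protein -> number of associated publications.
    popular_min / understudied_max: hypothetical publication-count cutoffs.
    """
    # Build an undirected adjacency map from the edge list.
    neighbors = {}
    for a, b in interactions:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    popular = {p for p, n in pub_counts.items() if n >= popular_min}

    candidates = {}
    for protein, partners in neighbors.items():
        # Proteins missing from pub_counts are treated as having no publications.
        if pub_counts.get(protein, 0) <= understudied_max:
            hits = partners & popular
            if hits:
                candidates[protein] = sorted(hits)

    # Most popular partners first; ties broken alphabetically.
    return sorted(candidates.items(), key=lambda kv: (-len(kv[1]), kv[0]))


# Toy data with placeholder names; not taken from the study.
interactions = [("HUB1", "X1"), ("HUB2", "X1"), ("HUB1", "X2"), ("X2", "X3")]
pub_counts = {"HUB1": 500, "HUB2": 300, "X1": 2, "X2": 4, "X3": 1}

ranked = understudied_neighbors(interactions, pub_counts,
                                popular_min=100, understudied_max=5)
print(ranked)
```

In this toy example X3 is excluded because its only partner is itself under-studied; whether such second-degree neighbors should count is a design choice left open here.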

Associated Content Available

Figure S1 - Correlation between immediacy and impact value
Figure S2 - Comparison of WCD and NCD against benchmark gene/protein list
Figure S3 - Identifying under-studied proteins by popularity overlaid on protein association graphs
Figure S4 - Number of associated publications per protein across protein evidence and functional categories
Supplementary Data 1 - Popular Protein in the Human Diseasome

Acknowledgments

This work was supported in part by U.S. Department of Defense grant 16W81XWH-16-10592 (J.V.E.), The Barbra Streisand Women’s Heart Center (J.V.E.), The Smidt Heart Institute at Cedars-Sinai Medical Center (J.V.E.), The Erika Glazer Endowed Chair in Women’s Heart Health (J.V.E.), National Institutes of Health (NIH) research grants P01 HL112730 (J.V.E.), R01 HL141371 (J.C.W.), R00 HL127302 (M.P.L.), R01 HL141278 (M.P.L.), F32 HL139045 (E.L.), NIH Big Data to Knowledge (BD2K) Program Cloud Credits Model Pilot CCREQ-2017-03-00060 (M.P.L.), and The University of Colorado Consortium for Fibrosis Research and Translation Pilot Grant (M.P.L.).

References

1. Fortunato, S.; Bergstrom, C. T.; Borner, K.; Evans, J. A.; Helbing, D.; Milojević, S.; Petersen, A. M.; Radicchi, F.; Sinatra, R.; Uzzi, B.; Vespignani, A.; Waltman, L.; Wang, D.; Barabasi, A. L. Science of science. Science 2018, 359.
2. Lam, M. P.; Venkatraman, V.; Xing, Y.; Lau, E.; Cao, Q.; Ng, D. C.; Su, A. I.; Ge, J.; Van Eyk, J. E.; Ping, P. Data-Driven Approach To Determine Popular Proteins for Targeted Proteomics Translation of Six Organ Systems. J. Proteome Res. 2016, 15, 4126–4134.
3. Yu, K. H.; Lee, T. M.; Wang, C. S.; Chen, Y. J.; Re, C.; Kou, S. C.; Chiang, J. H.; Kohane, I. S.; Snyder, M. Systematic Protein Prioritization for Targeted Proteomics Studies through Literature Mining. J. Proteome Res. 2018, 17, 1383–1396.
4. Lam, M. P.; Venkatraman, V.; Cao, Q.; Wang, D.; Dincer, T. U.; Lau, E.; Su, A. I.; Xing, Y.; Ge, J.; Ping, P.; Van Eyk, J. E. Prioritizing Proteomics Assay Development for Clinical Translation. J. Am. Coll. Cardiol. 2015, 66, 202–204.
5. Mora, M. I.; Molina, M.; Odriozola, L.; Elortza, F.; Mato, J. M.; Sitek, B.; Zhang, P.; He, F.; Latasa, M. U.; Avila, M. A.; Corrales, F. J. Prioritizing Popular Proteins in Liver Cancer: Remodelling One-Carbon Metabolism. J. Proteome Res. 2017, 16, 4506–4514.
6. Hoehndorf, R.; Schofield, P. N.; Gkoutos, G. V. Analysis of the human diseasome using phenotype similarity between common, genetic, and infectious diseases. Sci Rep 2015, 5, 10888.
7. Zhou, X.; Menche, J.; Barabasi, A. L.; Sharma, A. Human symptoms-disease network. Nat Commun 2014, 5, 4212.
8. Agarwala, R. et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2018, 46, D8–D13.
9. Gou, Y. et al. Europe PMC: a full-text literature database for the life sciences and platform for innovation. Nucleic Acids Res. 2015, 43, D1042–1048.

10. Wei, C. H.; Kao, H. Y.; Lu, Z. PubTator: a web-based text mining tool for assisting biocuration. Nucleic Acids Res. 2013, 41, W518–522.
11. Kibbe, W. A.; Arze, C.; Felix, V.; Mitraka, E.; Bolton, E.; Fu, G.; Mungall, C. J.; Binder, J. X.; Malone, J.; Vasant, D.; Parkinson, H.; Schriml, L. M. Disease Ontology 2015 update: an expanded and updated database of human diseases for linking biomedical knowledge through disease data. Nucleic Acids Res. 2015, 43, D1071–1078.
12. Kohler, S. et al. The Human Phenotype Ontology in 2017. Nucleic Acids Res. 2017, 45, D865–D876.
13. Petri, V.; Jayaraman, P.; Tutaj, M.; Hayman, G. T.; Smith, J. R.; De Pons, J.; Laulederkind, S. J.; Lowry, T. F.; Nigam, R.; Wang, S. J.; Shimoyama, M.; Dwinell, M. R.; Munzenmaier, D. H.; Worthey, E. A.; Jacob, H. J. The pathway ontology - updates and applications. J Biomed Semantics 2014, 5, 7.
14. Kim, S. Y.; Volsky, D. J. PAGE: parametric analysis of gene set enrichment. BMC Bioinformatics 2005, 6, 144.
15. The Gene Ontology Consortium. Expansion of the Gene Ontology knowledgebase and resources. Nucleic Acids Res. 2017, 45, D331–D338.
16. Binns, D.; Dimmer, E.; Huntley, R.; Barrell, D.; O’Donovan, C.; Apweiler, R. QuickGO: a web-based tool for Gene Ontology searching. Bioinformatics 2009, 25, 3045–3046.
17. Grondin, C. J.; Davis, A. P.; Wiegers, T. C.; Wiegers, J. A.; Mattingly, C. J. Accessing an Expanded Exposure Science Module at the Comparative Toxicogenomics Database. Environ. Health Perspect. 2018, 126, 014501.
18. Jourquin, J.; Duncan, D.; Shi, Z.; Zhang, B. GLAD4U: deriving and prioritizing gene lists from PubMed literature. BMC Genomics 2012, 13 Suppl 8, S20.
19. Davis, A. P.; Wiegers, T. C.; Rosenstein, M. C.; Murphy, C. G.; Mattingly, C. J. The curation paradigm and application tool used for manual curation of the scientific literature at the Comparative Toxicogenomics Database. Database (Oxford) 2011, 2011, bar034.

20. van der Velden, H. M.; Jongsma, H. J. Cardiac gap junctions and connexins: their role in atrial fibrillation and potential as therapeutic targets. Cardiovasc. Res. 2002, 54, 270–279.
21. Goh, K. I.; Cusick, M. E.; Valle, D.; Childs, B.; Vidal, M.; Barabasi, A. L. The human disease network. Proc. Natl. Acad. Sci. U.S.A. 2007, 104, 8685–8690.
22. Lai, L.; Leone, T. C.; Keller, M. P.; Martin, O. J.; Broman, A. T.; Nigro, J.; Kapoor, K.; Koves, T. R.; Stevens, R.; Ilkayeva, O. R.; Vega, R. B.; Attie, A. D.; Muoio, D. M.; Kelly, D. P. Energy metabolic reprogramming in the hypertrophied and early stage failing heart: a multisystems approach. Circ Heart Fail 2014, 7, 1022–1031.
23. Ritchie, M. E.; Phipson, B.; Wu, D.; Hu, Y.; Law, C. W.; Shi, W.; Smyth, G. K. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015, 43, e47.
24. Haynes, W. A.; Tomczak, A.; Khatri, P. Gene annotation bias impedes biomedical research. Sci Rep 2018, 8, 1362.
25. Stoeger, T.; Gerlach, M.; Morimoto, R. I.; Nunes Amaral, L. A. Large-scale investigation of the reasons why potentially important genes are ignored. PLoS Biol. 2018, 16, e2006643.
26. Edwards, A. M.; Isserlin, R.; Bader, G. D.; Frye, S. V.; Willson, T. M.; Yu, F. H. Too many roads not taken. Nature 2011, 470, 163–165.
27. Szklarczyk, D.; Franceschini, A.; Wyder, S.; Forslund, K.; Heller, D.; Huerta-Cepas, J.; Simonovic, M.; Roth, A.; Santos, A.; Tsafou, K. P.; Kuhn, M.; Bork, P.; Jensen, L. J.; von Mering, C. STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Res. 2015, 43, D447–452.
28. Guala, D.; Sonnhammer, E. L. L. A large-scale benchmark of gene prioritization methods. Sci Rep 2017, 7, 46598.
29. Yin, T.; Chen, S.; Wu, X.; Tian, W. GenePANDA-a novel network-based gene prioritizing tool for complex diseases. Sci Rep 2017, 7, 43258.

30. Liem, D. A.; Murali, S.; Sigdel, D.; Shi, Y.; Wang, X.; Shen, J.; Choi, H.; Caufield, J. H.; Wang, W.; Ping, P.; Han, J. Phrase Mining of Textual Data to Analyze Extracellular Matrix Protein Patterns Across Cardiovascular Disease. Am. J. Physiol. Heart Circ. Physiol. 2018.
31. Gaudet, P. et al. The neXtProt knowledgebase on human proteins: 2017 update. Nucleic Acids Res. 2017, 45, D177–D182.

[Graphical abstract. Workflow: a topic of interest (e.g., “hypertension”) is used to retrieve PubMed IDs through NCBI EUtils; protein-topic relationships are obtained from curated (Gene2PubMed) and mined (PubTator) protein-PMID associations; publication immediacy and impact are derived from publication year and citation count (Europe PMC); WCD is computed for each protein-topic pair; and the output is a list of WCD-prioritized proteins. Precompiled WCD for 10,000+ disease terms (DO, PWO, HPO) supports protein-to-topic reverse queries, e.g., “TNNI3” returns acute MI, cardiomyopathy, pulmonary embolism, kidney failure, ...]

Example output for “hypertension” (GN, WCD, P, Name):
REN       0.413  2.10E-07  Renin (EC 3.4.23.15) (Angiotensinogenase)
AGT       0.479  2.70E-06  Angiotensinogen (Serpin A8) [Cleaved into: ...
ACE       0.488  3.90E-06  Angiotensin-converting enzyme (ACE) (EC ...
INS       0.713  3.30E-03  Insulin [Cleaved into: Insulin B chain; Insulin ...
CRP       0.752  8.00E-03  C-reactive protein [Cleaved into: C-reactive ...
SELENBP1  0.516  1.00E-05  Selenium-binding protein 1 (56 kDa selenium ...
EDN1      0.621  3.00E-04  Endothelin-1 (Preproendothelin-1) (PPET1) ...
DBP       0.551  3.40E-05  D site-binding protein (Albumin D box-binding ...
AGTR1     0.595  1.40E-04  Type-1 angiotensin II receptor (AT1AR) ...
CD59      0.752  7.90E-03  CD59 glycoprotein (1F5 antigen) (20 kDa ...