Annotation of the Domestic Pig Genome by Quantitative Proteogenomics


Article

Annotation of the domestic pig genome by quantitative proteogenomics
Harald Marx, Hannes Hahne, Susanne E. Ulbrich, Angelika Schnieke, Oswald Rottmann, Dmitrij Frishman, and Bernhard Kuster
J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.7b00184 • Publication Date (Web): 19 Jun 2017
Downloaded from http://pubs.acs.org on June 19, 2017


Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Journal of Proteome Research

Annotation of the domestic pig genome by quantitative proteogenomics Harald Marx 1,2,&,$,*, Hannes Hahne 1,3,$,*, Susanne E. Ulbrich 4,#, Angelika Schnieke 5, Oswald Rottmann 5, Dmitrij Frishman 2, 6, 7 and Bernhard Kuster 1, 8, *

1 Chair of Proteomics and Bioanalytics, Technische Universität München, Freising, Germany
2 Department of Genome-Oriented Bioinformatics, Technische Universität München, Freising, Germany
3 OmicScouts GmbH, Freising, Germany
4 Chair of Physiology, Technische Universität München, Freising, Germany
5 Chair of Livestock Biotechnology, Technische Universität München, Freising, Germany
6 Institute of Bioinformatics and Systems Biology, German Research Center for Environmental Health, Neuherberg, Germany
7 St Petersburg State Polytechnical University, St Petersburg, Russia
8 Center for Integrated Protein Science Munich, Munich, Germany
& Present address: Department of Chemistry, University of Wisconsin-Madison, Madison, Wisconsin, United States of America
# Present address: Animal Physiology, Institute of Agricultural Sciences, Department of Environmental Systems Science, ETH Zurich, Zurich, Switzerland
* Corresponding authors
$ These authors contributed equally to this work.

To whom correspondence should be addressed:
Prof. Dr. Bernhard Kuster, Dr. Harald Marx and Dr. Hannes Hahne
Chair of Proteomics and Bioanalytics
Technische Universität München
Emil Erlenmeyer Forum 5
85354 Freising
Germany
Email: [email protected], [email protected], [email protected]
Phone: +49-8161715696
Fax: +49-8161715931


ABSTRACT

The pig is one of the earliest domesticated animals in the history of human civilization and represents one of the most important livestock animals. The recent sequencing of the Sus scrofa genome was a major step towards the comprehensive understanding of porcine biology, evolution and its utility as a promising large animal model for biomedical and xenotransplantation research. However, the functional and structural annotation of the Sus scrofa genome is far from complete. Here, we present mass spectrometry-based quantitative proteomics data of nine juvenile organs and six embryonic stages between 18 and 39 days after gestation. We found that the data provides evidence for and improves the annotation of 8,176 protein-coding genes, including 588 novel and 321 refined gene models. The analysis of tissue-specific proteins and the temporal expression profiles of embryonic proteins provides an initial functional characterization of expressed protein interaction networks and modules including as yet uncharacterized proteins. Comparative transcript and protein expression analysis to human organs reveals a moderate conservation of protein translation across species. We anticipate that this resource will facilitate basic and applied research on Sus scrofa as well as its porcine relatives.

KEYWORDS Pig, quantitative proteomics, mass spectrometry, proteogenomics, mammal development


INTRODUCTION

The pig is one of the earliest domesticated animals in the history of human civilization and represents one of the most important livestock animals. In addition to being the predominant source of meat in large parts of the world, with an annual consumption of more than 100 million tons of pork in 2011, the domestic pig also plays an increasingly important role in biomedical research. The morphological and physiological similarities between Sus scrofa and Homo sapiens render the domesticated pig a promising large animal model for biomedical and translational research 1. For example, transgenic pigs are explored as organ donors for xenotransplantation, as model systems for hereditary and acquired diseases 2,3 and as systems for the production of biologicals 4. Moreover, minipigs, growth-restricted Sus scrofa strains, are of increasing importance as non-rodent species in preclinical safety evaluation of pharmaceuticals 5. The recent sequencing of the pig genome was a major step for Sus scrofa research 6,7. Alongside the physical map of the genome, these studies allowed for unprecedented insights into porcine evolution, population divergence and domestication, and revealed a rather close phylogenetic relationship between Homo sapiens and Sus scrofa. However, the critical importance of Sus scrofa as a nutrient source and its promise for emerging biomedical research mandate further improving and extending the structural and functional annotation of the genome 8. One of the central themes of every genome annotation project is the structural annotation of the protein-coding gene complement of the sequenced species. Although the current assembly of the porcine genome represents an important advance 9, genome annotation pipelines can benefit significantly when new data emerge that contain information orthogonal to the available evidence 10.
Moreover, although the structural annotation of the protein-coding genome of a species can be obtained at the expense of a reasonable amount of time and costs, understanding how the predicted sequences relate to biological function is less straightforward.


The structural and functional annotation of genomes can be considerably augmented by mass spectrometry (MS)-based proteomics, an approach usually referred to as proteogenomics 11,12. The large-scale interrogation of biological systems by MS-based proteomics provides insights into the existence and abundance of proteins, cell-type and time-dependent expression patterns, post-translational modifications (PTMs) and protein–protein interactions, all of which carry biological information complementary to genomics and transcriptomics 13,14,15. Here, we employed quantitative proteomics to profile nine major juvenile porcine organs as well as six early embryonic stages after gestation to refine and validate the structural annotation of the porcine genome and to provide an initial functional context for a large number of novel protein-coding genes as well as yet uncharacterized proteins based on their organ- and stage-specific expression patterns and annotated orthologous proteins. The analysis of embryonic stages enabled insights into the temporal expression and dynamics of known and unknown development-associated proteins.

EXPERIMENTAL PROCEDURES

Animal welfare and keeping

The juvenile domestic German Landrace pigs (gilts), aged 5.5 to 6 months, were kept in compliance with the EU animal welfare regulations for pigs at a livestock breeding facility in Thalhausen, one of the agricultural experimental stations of the Technische Universitaet Muenchen (Center of Life and Food Sciences Weihenstephan, Germany).

Sample collection of embryos and gilt organs

To initiate synchronous estrus, the gilts were treated according to a standard ovulation synchronization schedule with Altrenogest® (progesterone) for 18 days followed by 750 international units


Intergonan® (PMSG, gonadotropin) 24 h after the last Altrenogest dose and 750 international units Ovogest® (hCG, chorionic gonadotropin) 80 h after Intergonan application. Twenty-four hours later, they were artificially inseminated with semen from the same boar, and insemination was repeated 12 h later. The day after the last insemination was defined as day one. The gilts were slaughtered at days 18, 22, 25, 28, 32 and 39, respectively, at the local abattoir. The uterus is an animal by-product of slaughtering and was collected immediately afterwards. The implanted embryos were excised from the endometrium, washed with PBS and stored on dry ice. A gilt was sedated and bled out (euthanized). The organs were excised shortly after, washed with PBS and stored on dry ice for transport. In total, nine organs were extracted, namely diaphragm, spleen, biliary, kidney, liver, lung, brain, pancreas and heart.

Sample preparation

The juvenile organs and embryos were washed multiple times with cold PBS. 10 g of organ tissue (random location) and complete embryos were lysed using Tris-HCl buffer containing 4% SDS and a Miccra D-9 homogenizer (ART Labortechnik, Germany). The lysate was ultracentrifuged for 1 h at 20 °C and 52,000 × g. In total, 50 µg of the protein extract was reduced with 10 mM dithiothreitol and alkylated with 55 mM iodoacetamide. Proteins were separated via 1D LDS gel electrophoresis (4%–12% NuPAGE gel; Invitrogen, Darmstadt, Germany) and each lane was cut into 12 pieces. Proteins were subsequently digested in gel with trypsin using established procedures 35.

LC-MS/MS analysis

Nanoflow LC-MS/MS was performed by coupling an Eksigent nanoLC-Ultra 1D+ (Eksigent, Dublin, CA) to an Orbitrap Elite mass spectrometer (Thermo Scientific, Bremen, Germany). Peptides were delivered to a trap column (100 µm i.d. × 2 cm, packed with 5 µm C18 resin,


Reprosil PUR AQ, Dr. Maisch, Ammerbuch, Germany) at a flow rate of 5 µL/min in 100% buffer A (0.1% formic acid (FA) in HPLC grade water). After 10 minutes of loading and washing, peptides were transferred to an analytical column (75 µm × 40 cm C18 column, Reprosil PUR AQ, 3 µm, Dr. Maisch, Ammerbuch, Germany) and separated using a 55 minute gradient from 2% to 35% of buffer B (0.1% FA in acetonitrile) at a flow rate of 300 nL/min. The mass spectrometer was operated in data-dependent mode, automatically switching between MS and MS2. Full scan MS spectra were acquired in the Orbitrap at 30,000 resolution. Internal calibration was performed using the ion signal (Si(CH3)2O)6H+ at m/z 445.120025 present in ambient laboratory air. Tandem mass spectra were generated for up to 15 peptide precursors using higher energy collisional dissociation (HCD) and Orbitrap readout at a resolution of 7,500.

LC-MS/MS data processing

MaxQuant version 1.3.0.3 was used to generate peak lists (apl files) from the raw MS files for subsequent database searching. Notable search parameters were: Oxidation (M) and Acetyl (Protein N-term) as variable modifications, a mass tolerance of 5 ppm for MS1 and 20 ppm for MS2, trypsin as enzyme, up to two missed cleavages and the reverse decoy database option enabled. The search results were filtered at a peptide and protein FDR of ≤ 0.01.

Proteogenomic analysis

Databases

To identify known, refined and novel gene and transcript models, we built ten sequence databases from DNA, cDNA, mRNA, EST and protein (PEP) sequences to search against (Table 1). The pre-masked genome sequence 16 was translated in all six reading frames, where each database entry is an open reading frame (ORF) of at least seven amino acids in length. For each known and predicted Ensembl gene, we constructed an exon graph 17. In brief, each node of the exon graph represents an exon and each edge a splice site. We restricted the number of


exons per locus to 24 and required compatible and valid exon phases. During graph construction, we performed an on-the-fly in silico digest and merged the peptides into a non-redundant peptide-centric database (PEPcEX). Transcript sequences (cDNA, mRNA, EST) were translated in all three forward frames.

Search result validation

All peptide spectrum matches (PSMs) from the ten database searches were subject to an additional Andromeda score (SA) threshold (SA ≥ 58.44) to remove low-scoring (low-quality) spectra (Supplementary Fig. S2A). The threshold was taken from a previous study, in which we derived objective false discovery rate (FDR) models as a function of the search engine score based on a synthetic (phospho-)peptide library 18.

PSM grouping

To merge the filtered (global FDR and score threshold) results of multiple database searches, we introduced a naïve PSM grouping approach. We grouped the best scoring PSM from each search originating from the same experimental spectrum and selected a representative based on the highest Andromeda score (Supplementary Fig. S1A). Representative PSMs not matching known (reference) Ensembl gene models were additionally required to pass a MaxQuant posterior error probability threshold of 1% and to have a distinct genome location.

Mapping

To derive peptide coordinates relative to the genome, we searched against i) the six-frame translation of the genome and ii) the peptide-centric exon graph, and mapped the remaining peptides iii) with BLAT (-out=pslx, -t=dnax, -q=prot) and iv) with BLAST (-word_size 2 -matrix PAM30 -seg "no" -evalue 20000 -comp_based_stats 0). Criteria for the best BLAST and BLAT matches were to allow a


single amino acid polymorphism (SAP) in the alignment and, in the case of splice events, i.e. N-terminal and C-terminal subsequences spanning separate alignments, no SAP. Spliced peptides may extend over multiple alignments; we therefore limited the number of alignments to two and required the consecutive alignment to occur within a genomic interval of 9,369 bp (median gene size).

Single linkage clustering

To estimate the correctness of known gene model boundaries and to identify novel gene models, we applied single linkage clustering 19. In brief, each peptide with a genomic coordinate was linked to its nearest neighbors on the same strand if d(x, y) ≤ T, where x and y are peptide coordinates, d is the distance and T is a threshold. The threshold was set to 12,373 bp, corresponding to the 0.95 quantile of intron length.

Peptide classifications

The peptides in the single linkage clusters were classified into intra- and intergenic events 19. We differentiated classifications over the genome and transcriptome. Besides the classical intergenic events, fusion classifications were peptides mapping between two genes in the same single linkage cluster. Novel splice junction classifications were peptides matching exon combinations in PEPcEX, or BLAT and BLAST matches (see Mapping), not present in [PEP] Ensembl – all (known and novel gene predictions). Exon boundary classifications were peptides extending over the exon C-terminus, indicating that the annotated splice site is in a false position. UTR exons had no Ensembl phase information (-1/-1) and are by definition part of the UTR. Frame shift classifications were peptides matching another frame and consequently not matching the assigned phases of the exon.
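As an illustration of the single linkage clustering step, the following minimal Python sketch groups peptides on the same chromosome and strand whenever the gap to the growing cluster does not exceed T = 12,373 bp. It is not the implementation used in this study: the function name, the tuple layout and the definition of d(x, y) as the gap between the end of one peptide and the start of the next are assumptions made for the example, and the coordinates are invented.

```python
from collections import defaultdict

INTRON_THRESHOLD_BP = 12_373  # T from the text (0.95 quantile of intron length)


def single_linkage_clusters(peptides, threshold=INTRON_THRESHOLD_BP):
    """Cluster peptides per (chromosome, strand) by chaining neighbors.

    `peptides` is an iterable of (chromosome, strand, start, end) tuples.
    Assumption: the distance d(x, y) is the gap between the end of the
    cluster built so far and the start of the next peptide.
    """
    by_strand = defaultdict(list)
    for chrom, strand, start, end in peptides:
        by_strand[(chrom, strand)].append((start, end))

    clusters = []
    for (chrom, strand), coords in sorted(by_strand.items()):
        coords.sort()
        current = [coords[0]]
        current_end = coords[0][1]
        for start, end in coords[1:]:
            if start - current_end <= threshold:
                # Close enough to the cluster: link it.
                current.append((start, end))
                current_end = max(current_end, end)
            else:
                # Gap exceeds T: close the cluster and start a new one.
                clusters.append((chrom, strand, current))
                current = [(start, end)]
                current_end = end
        clusters.append((chrom, strand, current))
    return clusters


# Invented coordinates for illustration only.
example_peptides = [
    ("1", "+", 100, 130),
    ("1", "+", 5_000, 5_030),    # gap of 4,870 bp: joins the first cluster
    ("1", "+", 30_000, 30_030),  # gap of 24,970 bp: starts a new cluster
    ("1", "-", 200, 230),        # other strand: always a separate cluster
]
example_clusters = single_linkage_clusters(example_peptides)
```

Because genomic coordinates are one-dimensional, sorting by start position and splitting wherever the gap exceeds T yields the same groups as a general single linkage procedure cut at distance T.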


Genome and protein inference

In general, inference in MS-based proteomics distinguishes unique and shared peptides 20. In protein inference, a peptide is shared or unique in the proteome, whereas in genome inference a peptide is either distinct (unique) to one genomic location or shared between multiple locations. All gene and transcript models as well as proteins were subject to genome inference, requiring at least a single genomically unique peptide. In addition, proteins required at least a single unique peptide on the proteome level.

Model grouping

The gene, transcript model and protein grouping is a naive approach to remove subsets and same-sets of peptide identifications: all peptide identifications are assigned to each model or protein, the sequence with the most identifications is selected and the others are omitted. In case of multiple sequences sharing all peptide identifications, the longest sequence was selected as representative for the group (Supplementary Table S2).

Gene predictions

To provide refined and novel gene models, we predicted genes with Augustus 21. We supplemented Augustus with extrinsic hints, such as EST, cDNA, mRNA and identified protein sequences, including exon branches (PEPcEX). We used the BLAST-like alignment tool (BLAT) (-noHead -minIdentity=92) on the pre-masked genome 16 to map the EST, cDNA and mRNA sequences to the genome, and post-processed these with pslCDnaFilter (-minId=0.9 -localNearBest=0.005 -ignoreNs -bestOverlap) to find the best match. We ran exonerate 22 (--model protein2genome --showtargetgff T) for the BLAT output and identified proteins, merging all exonerate results in a single file (extrinsic information). Augustus was run in parallel for each chromosome

(--protein=on, --introns=on, --start=on, --stop=on, --cds=on, --codingseq=on, --alternatives-from-evidence=true, --alternatives-from-sampling=false, --sample=100, --extrinsicCfgFile=extrinsic.MPE.cfg). The GFF output was parsed to retrieve exon, transcript and protein coordinates and sequences.

Consensus database

To derive a consensus set of protein identifications for follow-up quantitative analysis, we built a protein sequence database containing all Augustus transcript predictions and sequences from databases with reliable transcript models such as cDNA, mRNA and protein sequences. To decrease redundancy on the sequence level, these sources were clustered into same-sets and subsets using the MScDB algorithm, a peptide-centric clustering approach 23. Notable MScDB parameters were no mismatch (T = 0) in the clustering and a length-based in silico digest with min. = 7 and max. = 52 amino acids and two missed cleavages.

Functional annotation and validation

To annotate unknown gene products in the Sscrofa10.2.70 release, we used BLAST with default settings against protein sequences from the Bos taurus (UMD3.1.75), Mus musculus (GRCm38.75) and Homo sapiens (GRCh37.75) releases. We assigned function based on the best homologous hit with a sequence identity > 50% and E-value < 10⁻⁵ in at least one species 24. To assign protein-coding potential based on conservation to identified but unclassified ORFs, we used PhyloCSF (--removeRefGaps --dna --aa --minCodons=7) in Omega mode 25. PhyloCSF requires pre-aligned sequences; hence we used the EMBL-EBI Muscle alignment tool 26 with sequence identity E-value < 10⁻⁵ on the above organisms as well as Ensembl Canis lupus familiaris (CanFAM3.1.75) and Equus caballus (EquCab2.75). To also validate the MS evidence, we synthesized peptides using a 2-µmol standard solid-phase synthesis following the


Fmoc strategy on a parallel peptide synthesizer (Intavis, Cologne). Fmoc-protected amino acids were obtained from Intavis. Ensembl meta information such as id, description, biotype, status, coordinates and cross-references was based on the downloadable MySQL database (Sscrofa10.2.75) and, for other species, on the Biomart web service.

Statistical and network analysis

Protein abundance estimations are based on MaxQuant label-free quantification (LFQ) for the identified MScDB sequences. We performed missing value imputation using Perseus (1.4.1.3), replacing missing values with values drawn from a normal distribution to simulate low abundance values in a typical MS experiment. All analyses, including correlation analysis, hierarchical clustering and principal component analysis, were performed in the open-source statistical language R 27. Correlation matrices are based on Pearson correlation using the corrplot library with hierarchical clustering (complete linkage). Fold change calculations were performed in two versions, i) median of sample and ii) median over samples. We extracted the top 2.5 percentile and performed hierarchical clustering. Significantly changing proteins were derived by a Mann-Whitney test, comparing all organs against all embryonic stages. P-values were adjusted for multiple testing, controlling the false discovery rate (FDR) at 5%. To group proteins with similar expression profiles in the embryonic stages, we used the soft clustering algorithm in the Bioconductor MFuzz package 28. The fuzzification parameter (m) and optimal number of clusters were estimated with built-in functions, resulting in 1.720936 and 16, respectively. MFuzz clusters were manually grouped into stage-specific expression profiles. In order to retrieve known protein-protein interactions and associations, proteins significantly associated with these cluster groups (MFuzz membership > 0.5) were submitted to STRING 29,30. Only high confidence interactions with combined scores > 0.75 were retained. The


resulting stage-specific protein-protein interaction and association networks were imported into Cytoscape 31 and further clustered into highly connected sub-clusters using the GLay community structure analysis algorithm 32.

Gene expression analysis and cross-species analysis

For the comparison against a previous study in human, protein abundance estimations are based on MaxQuant intensity-based absolute quantification (iBAQ) for the identified MScDB sequences and include the following organs: spleen, kidney, biliary, liver, lung, pancreas and heart. The protein intensities of both data sets were normalized by the total sum of all proteins in each sample and log10 transformed. To estimate transcript abundance, we used Cufflinks 33 with default parameters and the Ensembl gene set (GTF) Sscrofa10.2.70 on BAM files of a previous study on Landrace boars. We paired the following organs: heart, kidney, liver, lung and spleen 34. Only FPKM values ≥ 1 were considered. The ID mapping between pig and human was based on the BLAST comparisons described above. The correlation matrix was based on Pearson correlation, with samples in alphabetic order.

RESULTS

Structural annotation of the porcine genome

In this study, we profiled the proteome of nine juvenile organs and six embryonic stages using a conventional GeLC-MS/MS approach 35 in combination with high-resolution label-free quantitative mass spectrometry (Fig. 1). In order to obtain accurate structural genome annotations, we searched the acquired 180 LC-MS/MS experiments comprising 2,706,121


peptide tandem mass spectra with the MaxQuant/Andromeda search engine 36,37 against the Ensembl 9 reference protein sequence database as well as sequences derived from expressed sequence tags (ESTs), complementary DNA (cDNA), messenger RNA (mRNA), a six-frame translation of the porcine genome sequence and peptide-centric exon graphs (PEPcEX). The six-frame translation enabled the identification of exact peptide matches to the porcine genome beyond known and predicted gene models, and the exon graphs allowed the identification of peptide matches to (novel) splice sites (Fig. 1). The search against multiple databases (Table 1) resulted in 810,225 peptide spectrum matches (PSMs) and 93,494 peptide identifications at a 1% global peptide and protein FDR (Supplementary Table S1). Peptide sequences not present in the Ensembl reference protein sequence database but exclusively identified in non-reference databases potentially provide evidence for novel or refined gene models. To derive non-redundant and high-confidence peptide identifications, we filtered them stringently 18 and extracted the best matching peptide sequence for every tandem mass spectrum (Supplementary Fig. S1A and S2A), resulting in 800,308 PSMs and 86,811 peptide identifications. An initial characterization of these peptides revealed that 32,696 of them were exclusively identified in juvenile organs and 14,326 only in embryonic samples (Fig. 2). To further reduce the chance of false positive identifications in a proteogenomic context, additional filter criteria were applied to non-reference matches, requiring a distinct genome location and a posterior error probability (local peptide FDR) < 1%, resulting in 6,845 valid peptides (Supplementary Fig. S1B and S2B). The majority of the non-reference peptide identifications were made in transcript databases, corroborating the importance of transcript sequences for the annotation of protein-coding genes.
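The filtering cascade summarized above (score threshold, one representative PSM per spectrum, stricter criteria for non-reference matches) can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used in this study: the PSM record and its field names are hypothetical, ties between equally scoring PSMs are resolved in favor of the first one seen (the text does not specify tie handling), and the example values are invented.

```python
from dataclasses import dataclass

SCORE_THRESHOLD = 58.44  # Andromeda score cutoff stated in the text
PEP_THRESHOLD = 0.01     # posterior error probability cutoff for non-reference matches


@dataclass(frozen=True)
class PSM:
    """Hypothetical record for one peptide spectrum match."""
    spectrum_id: str
    database: str            # e.g. "PEP", "EST", "6FT"
    peptide: str
    score: float             # Andromeda score
    pep: float               # posterior error probability
    is_reference: bool       # matches a known Ensembl gene model
    n_genome_locations: int  # number of genomic locations the peptide maps to


def group_psms(psms):
    """Return one representative PSM per spectrum after all filters."""
    best = {}
    for psm in psms:
        if psm.score < SCORE_THRESHOLD:
            continue  # drop low-scoring spectra
        current = best.get(psm.spectrum_id)
        if current is None or psm.score > current.score:
            best[psm.spectrum_id] = psm  # keep the highest Andromeda score

    accepted = []
    for psm in best.values():
        if not psm.is_reference:
            # Non-reference matches need PEP < 1% and a distinct genome location.
            if psm.pep >= PEP_THRESHOLD or psm.n_genome_locations != 1:
                continue
        accepted.append(psm)
    return accepted


# Invented example data.
example_psms = [
    PSM("s1", "PEP", "AAAK", 80.0, 0.001, True, 1),
    PSM("s1", "EST", "AAAR", 95.0, 0.002, False, 1),  # best for s1, passes
    PSM("s2", "6FT", "CCCK", 60.0, 0.050, False, 1),  # PEP too high
    PSM("s3", "PEP", "DDDK", 40.0, 0.001, True, 1),   # score too low
    PSM("s4", "6FT", "EEEK", 70.0, 0.001, False, 2),  # ambiguous location
]
example_accepted = group_psms(example_psms)
```

In this sketch a spectrum whose representative fails the non-reference criteria is discarded altogether rather than falling back to a lower scoring reference match; whether such a fallback is desirable is a design choice the text leaves open.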


The systematic analysis of the genome location and concomitant single linkage clustering of neighboring peptides allowed us to review the current state of the porcine genome annotation. The resulting peptide clusters were classified according to their position relative to genes or exons provided by Ensembl. Identified intragenic events confirmed known gene models and provided evidence for refined gene and transcript models, including peptide matches to the 5’ untranslated regions (UTRs), hypothetical splice junctions, exon boundaries, intron regions or differing frames (Fig. 3). Surprisingly, the annotation of the known gene models revealed that 1,988 are considered novel in Ensembl, i.e. they correspond to orthologous sequence matches in non-Ensembl sources. In contrast, 3,606 intergenic events support 588 novel and 321 refined gene models predicted by Augustus 21 based on peptide matches to the N-terminus, C-terminus or in-between region of known gene models, indicating putative exons.

Gene classes under-represented or not well understood in current genome annotations

It is of note that we identified immunoglobulins (IG), pseudogenes and non-coding transcripts according to the Ensembl biotype annotation (Supplementary Table S2). The IG gene family is important for the (pre-)immune response in embryos and adult pigs. In a previous porcine study using transcript data of fetal piglets, three IGLV genes were identified to be critical for the preimmune repertoire 38,39, notably the IGLV-3 and IGLV-8 families 20 days after gestation. We identified 10 IG genes in total, including the IGLV-3, IGLV-7 and IGLV-8 families, and were able to confirm the presence of the IGLV-3 and IGLV-8 genes exclusively at 18 days after gestation (see below). Pseudogenes represent a gene class of much controversy regarding their actual coding potential, and it is worth noting that we can provide protein evidence for 38 pseudogenes (Supplementary Table S2) 40.
Pseudogenes found in the consensus genome sequence might


represent active genes in some but not all individuals or, generally speaking, might constitute genetic variation in offspring. In contrast to pseudogenes, processed transcripts are a class of genes for which protein-coding potential is generally denied. They are transcribed, but do not contain an open reading frame. Interestingly, the majority of the 18 identified non-coding genes are processed transcripts, including genes with multiple high-confidence peptide identifications, such as SIGLEC1 (ENSSSCG00000007146) and TXLNA (ENSSSCG00000003617). While it is likely that these processed transcripts represent erroneous annotations, we cannot rule out the possibility that the identification of translation products for processed transcripts may as well point to some proteins ‘in evolution’ akin to pseudogenes (see above) or even lincRNAs 15. Another group of genes that is highly underrepresented in current genome annotations comprises small ORFs 41 and hard-to-predict single- and multi-exon genes. Among the known limitations of the current gene prediction algorithms are the difficulties in dealing with long genes and/or introns, short exons, overlapping genes, and alternative transcription start sites and translation initiation signals (non-AUG), to name just a few 42. We systematically analyzed peptide identifications from the six-frame translation not matching novel, refined or known gene models, resulting in 75 ORFs (Supplementary Table S3). To increase confidence in the identifications, we applied PhyloCSF 25 to assess the protein-coding potential conserved in up to five species and further generated synthetic peptides to validate the MS/MS evidence. Ultimately, we could assign function to 42 ORFs via BLAST and identify 15 putative / uncharted ORFs that warrant further investigation.

Improved proteome coverage using a consensus protein sequence database

For follow-up quantitative functional proteome analyses, we constructed a non-redundant consensus database using the MScDB algorithm 23 based on Augustus transcript predictions

ACS Paragon Plus Environment

16

Page 17 of 42

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

as well as on identified transcript models from cDNA, mRNA and protein databases. The subsequent MaxQuant/Andromeda database search identified 8,176 protein groups
(Supplementary Table S4), which corresponds to a 24.6% increase in protein and a 17.6% increase in peptide identifications compared to the reference Ensembl database (Supplementary Fig. S3A and B). We searched for homologs in a more recent Sus scrofa Ensembl release as well as for orthologs in Homo sapiens, Mus musculus and Bos taurus using the Basic Local Alignment Search Tool (BLAST) and found 7,774 genes/proteins with orthologs in at least one of the reference species (Supplementary Table S4) 24. Interestingly, 850 (301) of these proteins had low (no) homology (< 50% or 0% identity) in the Sus scrofa Ensembl reference and can thus be considered as being of unknown function; 522 of these genes/proteins have orthologs and 232 could be assigned unambiguously to juvenile organs or embryonic stages (Fig. 4A, Supplementary Fig. S3C). However, probably the most interesting group consists of the 132 of 301 genes/proteins for which we found no orthologs in any reference species. These proteins either represent genes that emerged during the last 63.1 Myr of evolution that separate Sus scrofa from Bos taurus 43, or they represent as yet undiscovered (mammalian) genes.

Annotation of the pig genome with quantitative proteomics

We have generated proteome profiles of six embryonic stages and nine organs to analyze porcine proteomes in functional terms and to derive novel annotations for as yet uncharacterized and novel protein-coding genes. Quantitative information on protein expression was extracted from the data using the label-free quantification algorithm of MaxQuant 44. Protein abundance distributions in embryos and organs were very similar and spanned more than 4.5 orders of magnitude, which is consistent with recent estimates of protein copy numbers in mammalian cell lines 45,46. A comparative analysis of protein expression across all samples revealed strong clusters of embryonic stages and organs (Fig. 4B).
Embryonic stages generally exhibit a low degree of heterogeneity (with correlation coefficients between 0.41 and 0.95) and cluster
according to the gestational age of the embryo. The clustering is consistent with the general morphological development of the embryos between 18d and 39d (see also Fig. 1). However, the proteome profile of the 18d embryo appears to be considerably divergent, while the remaining stages reveal surprisingly similar profiles. In contrast, the overall clustering of the organs reveals a strong heterogeneity of protein expression, and only three rather weakly associated groups can be separated. These are the neighboring organs liver and biliary gland and, more weakly associated, heart and diaphragm as well as spleen, lung and pancreas. Heart and diaphragm are both striated muscle tissues, and the similarity between spleen and lung can be rationalized by the presence of immune cells, as the lung represents a primary entry point for pathogens. It is of note that the late embryonic stages are mainly characterized by the growth of the embryo, which is reflected by a reasonable correlation between the late embryonic stages and spleen, lung, pancreas and biliary gland.

Organ-enriched protein expression

To identify proteins that appear uniformly among the most highly abundant proteins across all organs, we analyzed the top 2.5% of the proteins within each organ (Fig. 4C). Hierarchical clustering of the resulting 749 proteins disclosed 265 proteins commonly expressed at high levels and mainly involved in central metabolic and cellular processes. Interestingly, even the most highly abundant proteins in an organ form clusters of organ-specific protein expression, and the corresponding proteins point to molecular processes associated with the respective biological specialization. For instance, a tight cluster of brain-specific protein expression comprises multiple proteins with well-established functions in neuron projection, neuron development and synaptic transmission.
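
The selection of the most abundant proteins per organ described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function names, the toy abundance values and the use of a plain set intersection (in place of hierarchical clustering) to find commonly abundant proteins are assumptions made for illustration only.

```python
def top_fraction(abundance, fraction=0.025):
    """Return the names of the most abundant proteins in one organ.

    abundance: dict mapping protein name -> abundance value
    (for example an LFQ or iBAQ intensity; hypothetical input format).
    """
    # Number of proteins to keep: rounded to the nearest integer, at least one.
    n_top = max(1, round(len(abundance) * fraction))
    ranked = sorted(abundance, key=abundance.get, reverse=True)
    return set(ranked[:n_top])


def summarize_top_proteins(organs, fraction=0.025):
    """Collect the top fraction per organ, their union and their intersection."""
    tops = {organ: top_fraction(values, fraction) for organ, values in organs.items()}
    union = set().union(*tops.values())         # proteins in the top set of any organ
    common = set.intersection(*tops.values())   # proteins in the top set of every organ
    return tops, union, common


# Toy data (made-up values); a fraction of 0.5 is used only because the example is tiny.
organs = {
    "liver": {"ALB": 9.0, "ACTB": 8.0, "GAPDH": 7.0, "X": 1.0},
    "brain": {"SNAP25": 9.0, "ACTB": 8.0, "GAPDH": 6.0, "ALB": 1.0},
}
tops, union, common = summarize_top_proteins(organs, fraction=0.5)
```

In this toy case `union` plays the role of the 749 pooled proteins and `common` that of the commonly abundant subset; the study itself derived the latter from hierarchical clustering rather than from a strict intersection.
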
This analysis of organ-specific protein expression can be complemented by an analysis of those proteins that are exclusively or preferentially detected in a particular organ. We analyzed the 2.5% most differentially expressed proteins per organ (compared to the average expression across all organs; Supplementary Fig. S4).
Hierarchical clustering of protein expression profiles results in the same organ clusters as observed for the correlation analysis in Figure 4B and the analysis of the most abundant proteins. Gene ontology analysis of the corresponding protein clusters invariably highlighted organ-specific biology. This is illustrated by the kidney-specific cluster comprising 77 proteins, which are functionally related to the apical membrane, ion and sugar transport, mitochondria and endocytosis, and which also encompass angiotensin I converting enzyme 1 and 2 (ACE and ACE2) and other members of the renin-angiotensin system. The identification of organ-specific protein expression may yield novel insights into physiological processes and may aid in the functional annotation of novel proteins and proteins with no ascribed function but exclusive (or high) expression in particular organs.

Functional annotation of novel and yet uncharacterized embryonic proteins

In order to identify novel and as yet uncharacterized proteins with a putative function and exclusive (or high) expression during embryonic stages, we compared protein expression in embryonic stages against juvenile organs. A non-parametric Mann-Whitney test revealed 155 proteins significantly overexpressed in embryonic samples (Supplementary Fig. S5A). The majority of these proteins have highly similar orthologs in Homo sapiens, Mus musculus and Bos taurus (Fig. 4A), but the set also includes some gene predictions not present in the reference organisms (Supplementary Fig. S3C). Among the differentially expressed proteins are six proteins not yet described in Sus scrofa, including the FAT tumor suppressor homolog 4 (FAT4, see also below) and four uncharacterized proteins with a putative, yet unknown function during embryonic development, such as the two isoforms of the C1orf123 gene corresponding to the uncharacterized protein family UPF0587, one with similarity to a 40S ribosomal protein, as well as the protein HIST1H1B.
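
The embryo-versus-organ comparison for a single protein can be sketched as follows. This is a minimal, standard-library illustration of a two-sided Mann-Whitney U test using the normal approximation, without tie or continuity correction; which implementation and corrections were used in the study is not stated here, and the intensity values below are made up.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation).

    Returns (U statistic of sample x, approximate p-value).
    Tied values receive average ranks; no tie or continuity correction is applied.
    """
    # Pool both samples, remembering the group of each value (0 = x, 1 = y).
    pooled = sorted((value, group) for group, sample in enumerate((x, y)) for value in sample)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability of a standard normal
    return u, p


# Hypothetical log10 intensities of one protein: six embryonic and nine organ samples.
embryo = [10.1, 10.4, 10.2, 10.8, 10.5, 10.3]
organ = [8.0, 8.4, 7.9, 8.8, 9.1, 8.2, 8.6, 7.5, 9.0]
u_stat, p_value = mann_whitney_u(embryo, organ)
```

Repeating such a test for every quantified protein would additionally raise the question of multiple-testing correction, which this sketch does not address.
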
Temporal protein expression profiling in porcine embryos
Proteome profiles of embryonic stages deviate considerably from those of juvenile organs (Fig. 4B, Supplementary Fig. S5B). To identify longitudinal patterns from the time profiles of proteins between the 18d and 39d stages, we utilized fuzzy c-means (FCM) clustering. In contrast to hard partitioning algorithms such as k-means or hierarchical clustering, FCM clustering is characterized by a gradual membership for a set of clusters (represented by colors in Fig. 5), offers robust clustering and is well suited for the analysis of time-course data 28,47. Based on 16 initial FCM clusters, we grouped 11 clusters constituting strong cluster memberships (i.e. cluster membership > 0.5; Fig. 5, Supplementary Fig. S6 and S7; Supplementary Table S5) into six groups of distinct temporal expression profiles. Overlaying these temporal expression profiles with protein-protein interactions and associations derived from STRING 48 revealed a time-resolved map of expression patterns of functionally related protein networks and modules, including many proteins linked to embryonic development. This is illustrated by network module 1 (Fig. 5), which comprises 113 proteins that are highly expressed during early embryonic stages and subsequently down-regulated. The cluster contains PGH2 synthase and PGE synthase, responsible for the synthesis of prostaglandin E2 (PGE2), suggesting an important role of the embryo in the hormonal regulation of placenta and embryo development by PGE2. This is consistent with the co-regulation of matrix metalloproteinases and other proteins involved in tissue remodeling, which assist cytotrophoblast invasion of the decidua 49, a critical step of embryo implantation, which eventually leads to the blastocyst and the maternal endometrium forming the placenta.
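
To make the notion of gradual cluster membership concrete, the following is a minimal fuzzy c-means sketch in plain Python. It is not the implementation used in the study: the fuzzifier `m = 2`, the deterministic initialization from evenly spaced profiles, the function names and the toy time profiles are all assumptions chosen for illustration.

```python
import math


def _memberships(profiles, centers, m):
    """Membership of every profile in every cluster; each row sums to 1."""
    power = 2.0 / (m - 1.0)
    rows = []
    for x in profiles:
        dists = [math.dist(x, c) for c in centers]
        if 0.0 in dists:
            # The profile coincides with one or more centers: share full membership among them.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            rows.append([h / sum(hits) for h in hits])
        else:
            inv = [d ** -power for d in dists]
            rows.append([v / sum(inv) for v in inv])
    return rows


def fuzzy_c_means(profiles, n_clusters, m=2.0, n_iter=100, tol=1e-6):
    """Return (cluster centers, membership matrix) for a list of numeric profiles."""
    # Deterministic start (an assumption of this sketch): evenly spaced profiles as centers.
    step = max(1, len(profiles) // n_clusters)
    centers = [list(profiles[i * step]) for i in range(n_clusters)]
    dim = len(profiles[0])
    for _ in range(n_iter):
        u = _memberships(profiles, centers, m)
        new_centers = []
        for i in range(n_clusters):
            weights = [row[i] ** m for row in u]  # memberships raised to the fuzzifier
            total = sum(weights)
            new_centers.append(
                [sum(w * x[d] for w, x in zip(weights, profiles)) / total for d in range(dim)]
            )
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, _memberships(profiles, centers, m)


# Toy standardized time profiles over six stages: two decreasing, two increasing.
profiles = [
    [1.0, 0.8, 0.2, -0.4, -0.8, -1.0],
    [0.9, 0.7, 0.3, -0.3, -0.7, -0.9],
    [-1.0, -0.8, -0.2, 0.4, 0.8, 1.0],
    [-0.9, -0.7, -0.3, 0.3, 0.7, 0.9],
]
centers, memberships = fuzzy_c_means(profiles, n_clusters=2)
# Keep only profiles with a strong membership, mirroring the > 0.5 criterion in the text.
strong = [k for k, row in enumerate(memberships) if max(row) > 0.5]
```

Unlike k-means, every profile here carries a membership value for each cluster, so proteins with ambiguous time courses can be excluded by thresholding instead of being forced into a single group.
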
Similarly, the expression profiles of a large functional cluster related to laminin and integrins and many associated proteins indicate that the processes responsible for the attachment, migration, and organization of cells into tissues and blood vessels in the early embryo and placenta are completed between the 18d and 25d stages. Likewise, network module 2, up-regulated during 28d and 32d (Figure 3), comprises multiple members of the MAP kinase pathway (such as PDGFR, PKC, BRAF, and MAP kinases), other
neuron-specific kinases (such as PTK2, EPHB3, EPHB4, and CAMK4) and microtubule proteins, altogether indicating neuron and brain development as one of the predominant developments at this stage. Interestingly, these proteins show the same temporal expression profiles as the FAT4 protein, of which we have identified two distinct isoforms with distinct temporal expression profiles. The tumor suppressor FAT4 is one of the key receptors for the Hippo signaling pathway, which regulates organ size in animals through the regulation of cell proliferation and apoptosis 50. Recent evidence suggests that the cadherin receptor-ligand pair DCHS1 and FAT4 is essential for cerebral cortical development 51,52 and the regulation of neuronal migration 53. Consistent with this, the temporal peak expression of the FAT4 and DCHS1 pair coincides with the expression of proteins with strong ties to neuronal functions, such as the MAP kinase signaling pathway and other kinases in network module 2, and underscores the important regulatory role of DCHS1/FAT4 in neuronal stem cells and porcine brain development. In contrast, proteins upregulated at stage 39d (Supplementary Fig. S8; Supplementary Table S6) are very diverse in their functional roles, indicating that multiple unrelated processes take place in the late embryo, such as neuron differentiation, which is indicated by proteins such as neurexin II and metabotropic glutamate receptors 4 and 5, or expression of CYP450 monooxygenases, indicating liver development or growth.

Comparison of porcine and human protein expression in seven tissues

Porcine cells and organs are investigated as a potential source for xenotransplantation. While the immune biology of xenotransplantation has been studied in considerable detail 54, with the exception of discrepancies in coagulation, which plays a major role in xenograft failure, less emphasis has been given to investigations of the physiology of pig organ systems and their compatibility with those of humans 55.
With genome-wide protein expression data at hand for seven pig and human organs 15, covering five organs discussed as potential xenograft sources 55, we set out to investigate the extent of physiological differences on the protein expression
level using the most differentially expressed proteins and correspondingly over-/underrepresented biological functions. The protein expression profiles of human and porcine organs are, by and large, similar, with Pearson correlation coefficients between 0.58 and 0.69 (Fig. 6A, Supplementary Fig. S9; Supplementary Table S7). However, some organs show significant differences in terms of physiological functions. One noteworthy example is the human heart proteome, which appears to comprise a significantly higher abundance of mitochondrial proteins (and hence mitochondria) and contractile fiber proteins than the porcine heart. This is reflected by 27 mitochondrial proteins, such as respiratory chain proteins and the ATP synthase complex, 8 contractile fiber proteins, such as myosin and tropomyosin, and 7 extracellular matrix proteins, which are among the 65 most differential proteins (top 5%) with higher expression in human. In light of the anatomical and physiological similarity between porcine and human hearts, including mean arterial pressure, heart rate and myocardial blood flow 55, it is astounding to observe such striking differences at the protein and metabolic level. Clearly, more detailed and focused studies, such as proteomics of single muscle fibers 56, are required.

Conserved control of protein abundance in mammals

The messenger RNA (mRNA) abundance of a transcript alone is generally a weak predictor of the abundance of the corresponding protein 15, and it comes as no surprise that this is also the case for the porcine samples analysed here (Supplementary Fig. S10). However, we and others could recently show that the translation rate constant of an mRNA is the second major determinant of protein abundance at steady state and that the translation rate appears to be a fundamental, encoded (constant) characteristic of a transcript 46,15.
We recently showed this for 12 human tissues by comparing the ratio between mRNA and protein abundance as a proxy for the translation rate constant. Here, we extended this analysis to porcine organs and found that the protein-to-mRNA ratio is fairly conserved not only across five porcine tissues 34, but also between human and porcine orthologs (Fig. 6B). It is of note that despite the variety of data
sources used for this analysis, the correlation of the protein-to-mRNA ratio of 0.62 is comparable to the correlation coefficients of porcine vs. human transcript (around 0.68) and protein abundances (around 0.67) (Supplementary Fig. S11). It is worthwhile noting that highly abundant proteins, such as those involved in cellular maintenance (e.g., energy metabolism and protein synthesis), evolve more slowly than proteins of lower abundance, which include most regulatory proteins 45,57,58. Highly expressed proteins (i.e. those with a high translation rate constant) are apparently under strong selective pressure, likely because of energy constraints 59 and/or because of a requirement for translational robustness (i.e., minimizing the risk for protein aggregation and toxicity) 60, whereas low-abundance proteins are subject to lower selective pressure. Taken together, this highlights the importance of translational control mechanisms for the evolution of protein abundance.
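
As a rough illustration of the cross-species comparison described above, the protein-to-mRNA ratio can be computed per ortholog in each species and the resulting values correlated. The sketch below assumes iBAQ values for protein abundance and FPKM values for transcript abundance; the gene labels, the numbers and the log10 transform are hypothetical choices for this example and are not taken from the study's data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def log_protein_to_mrna(protein, mrna):
    """log10 of the protein-to-mRNA ratio, used as a proxy for the translation rate constant."""
    return math.log10(protein / mrna)


# Hypothetical (iBAQ, FPKM) pairs for three made-up ortholog pairs in one tissue.
human = {"GENE_A": (1.0e6, 10.0), "GENE_B": (1.0e5, 100.0), "GENE_C": (1.0e7, 50.0)}
pig = {"GENE_A": (3.0e6, 20.0), "GENE_B": (8.0e4, 90.0), "GENE_C": (2.0e7, 40.0)}

# Only orthologs quantified in both species can be compared.
shared = sorted(set(human) & set(pig))
human_ratios = [log_protein_to_mrna(*human[g]) for g in shared]
pig_ratios = [log_protein_to_mrna(*pig[g]) for g in shared]
r = pearson(human_ratios, pig_ratios)
```

With real data, `r` would correspond to the kind of protein-to-mRNA ratio correlation reported above (0.62); the toy values here are deliberately well-behaved and yield a much higher coefficient.
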

DISCUSSION

MS-based proteomics and proteogenomics can provide definite evidence for protein-coding genes, help to refine gene models and provide a functional context for well-studied as well as yet uncharacterized proteins 12,15. While proteogenomics still holds surprises even for very well-studied species such as Homo sapiens 15,61, it can be enormously useful for organisms with less well annotated genomes such as plants 19 or animals. This study represents the first large-scale analysis of the porcine proteome and one of the few time-resolved proteomic studies of mammalian embryonic and fetal development 62. While the current efforts in porcine genome annotation are in most cases accurate (67,772 out of 86,811 peptides matching to previously known gene models), genome annotation is incomplete and could be considerably improved by integrating proteomics evidence using proteogenomics approaches. Most notably, we could expand the proteome coverage of the
porcine proteome by 24.6% and peptide identifications by 17.6%. A significant contribution came from the proteome profiles of embryonic and fetal stages, biological entities not exploited for building the previous version of the porcine proteome. This underscores the general importance of using not only orthogonal levels of omics data, but also complementary biology for an enhanced genome annotation. The time-resolved map of fetal protein expression was likewise important for the functional annotation of the porcine genome and led to insights into functionally related protein networks and modules such as brain development and the FAT4/DCHS1 signaling axis. The case of FAT4 nicely illustrates how structural and functional genome annotation can go hand in hand. Despite orthologous evidence, FAT4 was not included in the recent version of the porcine genome, but could be identified as a novel gene of Sus scrofa in this study and, in turn, was identified as one of the key regulators of porcine brain development. The comparative analysis of porcine and human organs revealed surprising differences between the heart proteomes, which have not been described on a macroscopic level before. This underscores the notion that molecular profiling techniques such as functional and expression proteomics can enhance the understanding of biochemical and physiological differences between human and porcine tissues or cells, including incompatible and absent functions, and ultimately enable a better understanding of these incompatibilities in life-supporting pig-into-human organ grafting. Furthermore, comparative expression analysis of porcine and human organs on the transcript and protein level illustrates a strong positive correlation and hence evolutionary conservation across species. This will likely have major implications for the field of xenotransplantation.
Clearly, the provided initial proteome map of nine juvenile tissues and six embryonic stages is not as comprehensive as those available for comparable mammalian species such as mouse
63,64 or human 61,15. And it is also obvious that substantial improvements in technology will lead to a better ability to detect protein variants, such as differential splice products, PTMs, mutations or isoforms in a systematic fashion. Nonetheless, this study and the associated data set represent a valuable resource for further studies on porcine and mammalian developmental biology.

SUPPORTING INFORMATION

1. Supplementary_Figures.pdf. Supplementary Figures.
   Supplementary Figure 1. Data processing.
   Supplementary Figure 2. Evaluation criteria for peptide identifications.
   Supplementary Figure 3. MaxQuant peptides and protein groups.
   Supplementary Figure 4. Protein expression in organs.
   Supplementary Figure 5. Differential protein expression.
   Supplementary Figure 6. Fuzzy clustering of embryonic stages.
   Supplementary Figure 7. LFQ intensities in fuzzy cluster.
   Supplementary Figure 8. Network overview.
   Supplementary Figure 9. Sus scrofa and Homo sapiens protein expression.
   Supplementary Figure 10. Transcript to protein correlation.
   Supplementary Figure 11. Sus scrofa and Homo sapiens transcript and protein expression.
2. Marx_Supplementary_Table_1.xlsx. Peptide identifications with search result meta and genome (Sscrofa10.2.70) coordinates.
3. Marx_Supplementary_Table_2.xlsx. Ensembl, Augustus gene, transcript, and exon meta information.
4. Marx_Supplementary_Table_3.xlsx. Small ORF candidates based on conservation.
5. Marx_Supplementary_Table_4.xlsx. MScDB cluster composition.
6. Marx_Supplementary_Table_5.xlsx. Fuzzy cluster.
7. Marx_Supplementary_Table_6.xlsx. Network analysis.
8. Marx_Supplementary_Table_7.xlsx. Transcript (FPKM) and protein (iBAQ) expression.

ACKNOWLEDGEMENTS

We would like to thank Mathias Wilhelm for insightful discussions, Michael Kroetz-Fahning, Andrea Hubauer, Andreas Klaus and Steffen Loebnitz for assistance in the sample preparation and Fiona Pachl for measuring the samples. This research was in part funded by the DFG International Research Training Group "Regulation and Evolution of Cellular Systems" (GRK 1563). HM acknowledges the support of the TUM Graduate School at the Technische Universität München, Germany.

CONFLICT OF INTEREST STATEMENT

HH and BK are cofounders and shareholders of OmicScouts GmbH, a company which provides MS-based proteomics services. HH is the managing director of OmicScouts GmbH.


ABBREVIATIONS

FDR: False discovery rate
IG: Immunoglobulin
MS: Mass spectrometry
ORF: Open reading frame
PEPcEX: Peptide centric exon graph
PSM: Peptide spectrum match
SAP: Single amino acid polymorphism
PEP: Protein sequence
EST: Expressed sequence tag
UTR: Untranslated region
LFQ: Label free quantification
iBAQ: Intensity-based absolute quantification
FPKM: Fragments per kilobase of exon per million fragments mapped
BLAST: Basic local alignment search tool
PTM: Post translational modification
LC: Liquid chromatography
cDNA: complementary DNA
mRNA: messenger RNA
MScDB: Mass spectrometry centric protein sequence database
FCM: Fuzzy c-means
MS/MS: Tandem mass spectrometry


AUTHOR CONTRIBUTIONS

HM, AS, OR, DF and BK designed the study. HM performed experiments. HM and HH analyzed data. HM, HH, SU and BK wrote the manuscript.

DATA ACCESS

The raw, apl and txt files were deposited in the PRIDE repository (PXD003204).

REFERENCES

1. Vodicka, P. et al., The miniature pig as an animal model in biomedical research. Ann N Y Acad Sci 1049, 161-171 (2005).
2. Li, S. et al., Viable pigs with a conditionally-activated oncogenic KRAS mutation. Transgenic Res 24 (3), 509-517 (2015).
3. Wolf, E., Braun-Reichhart, C., Streckel, E. & Renner, S., Genetically engineered pig models for diabetes research. Transgenic Res 23 (1), 27-38 (2014).
4. Klymiuk, N., Aigner, B., Brem, G. & Wolf, E., Genetic modification of pigs as organ donors for xenotransplantation. Mol Reprod Dev 77 (3), 209-221 (2010).
5. Bode, G. et al., The utility of the minipig as an animal model in regulatory toxicology. J Pharmacol Toxicol Methods 62 (3), 196-220 (2010).
6. Groenen, M. A. M. et al., Analyses of pig genomes provide insight into porcine demography and evolution. Nature 491 (7424), 393-398 (2012).
7. Humphray, S. J. et al., A high utility integrated map of the pig genome. Genome Biol 8 (7), R139 (2007).
8. Prather, R. S., Pig genomics for biomedicine. Nat Biotechnol 31 (2), 122-124 (2013).
9. Curwen, V. et al., The Ensembl automatic gene annotation system. Genome Res 14 (5), 942-950 (2004).
10. Brent, M. R., Steady progress and recent breakthroughs in the accuracy of automated genome annotation. Nat Rev Genet 9 (1), 62-73 (2008).
11. Jaffe, J. D., Berg, H. C. & Church, G. M., Proteogenomic mapping as a complementary method to perform genome annotation. Proteomics 4 (1), 59-77 (2004).
12. Nesvizhskii, A. I., Proteogenomics: concepts, applications and computational strategies. Nat Methods 11 (11), 1114-1125 (2014).
13. Aebersold, R. & Mann, M., Mass spectrometry-based proteomics. Nature 422 (6928), 198-207 (2003).
14. Mallick, P. & Kuster, B., Proteomics: a pragmatic perspective. Nat Biotechnol 28 (7), 695-709 (2010).
15. Wilhelm, M. et al., Mass-spectrometry-based draft of the human proteome. Nature 509 (7502), 582-587 (2014).
16. Smit, A., Hubley, R. & Green, P., RepeatMasker Open-4.0. http://www.repeatmasker.org (2013-2015).
17. Tanner, S. et al., Improving gene annotation using peptide mass spectrometry. Genome Res 17 (2), 231-239 (2007).
18. Marx, H. et al., A large synthetic peptide and phosphopeptide reference library for mass spectrometry-based proteomics. Nat Biotechnol 31 (6), 557-564 (2013).
19. Castellana, N. E. et al., Discovery and revision of Arabidopsis genes by proteogenomics. Proc Natl Acad Sci U S A 105 (52), 21034-21038 (2008).
20. Nesvizhskii, A. I. & Aebersold, R., Interpretation of shotgun proteomic data: the protein inference problem. Mol Cell Proteomics 4 (10), 1419-1440 (2005).
21. Stanke, M., Schöffmann, O., Morgenstern, B. & Waack, S., Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics 7, 62 (2006).
22. Slater, G. S. C. & Birney, E., Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6, 31 (2005).
23. Marx, H., Lemeer, S., Klaeger, S., Rattei, T. & Kuster, B., MScDB: a mass spectrometry-centric protein sequence database for proteomics. J Proteome Res 12 (6), 2386-2398 (2013).


24. Tian, W. & Skolnick, J., How well is enzyme function conserved as a function of pairwise sequence identity? J Mol Biol 333 (4), 863-882 (2003).
25. Lin, M. F., Jungreis, I. & Kellis, M., PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions. Bioinformatics 27 (13), i275-i282 (2011).
26. Edgar, R. C., MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32 (5), 1792-1797 (2004).
27. Ihaka, R. & Gentleman, R., R: A Language for Data Analysis and Graphics. J Comput Graph Stat 5 (3), 299-314 (1996).
28. Futschik, M. E. & Carlisle, B., Noise-robust soft clustering of gene expression time-course data. J Bioinform Comput Biol 3 (4), 965-988 (2005).
29. Snel, B., Lehmann, G., Bork, P. & Huynen, M. A., STRING: a web-server to retrieve and display the repeatedly occurring neighbourhood of a gene. Nucleic Acids Res 28 (18), 3442-3444 (2000).
30. Szklarczyk, D. et al., STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Res 43 (Database issue), D447-D452 (2015).
31. Shannon, P. et al., Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13 (11), 2498-2504 (2003).
32. Su, G., Kuchinsky, A., Morris, J. H., States, D. J. & Meng, F., GLay: community structure analysis of biological networks. Bioinformatics 26 (24), 3135-3137 (2010).
33. Trapnell, C. et al., Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 28 (5), 511-515 (2010).
34. Farajzadeh, L. et al., Pairwise comparisons of ten porcine tissues identify differential transcriptional regulation at the gene, isoform, promoter and transcription start site level. Biochem Biophys Res Commun 438 (2), 346-352 (2013).
35. Schirle, M., Heurtier, M.-A. & Kuster, B., Profiling core proteomes of human cell lines by one-dimensional PAGE and liquid chromatography-tandem mass spectrometry. Mol Cell Proteomics 2 (12), 1297-1305 (2003).
36. Cox, J. & Mann, M., MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat Biotechnol 26 (12), 1367-1372 (2008).
37. Cox, J. et al., Andromeda: a peptide search engine integrated into the MaxQuant environment. J Proteome Res 10 (4), 1794-1805 (2011).
38. Butler, J. E., Wertz, N. & Sun, X., Antibody repertoire development in fetal and neonatal piglets. XIV. Highly restricted IGKV gene usage parallels the pattern seen with IGLV and IGHV. Mol Immunol 55 (3-4), 329-336 (2013).
39. Wertz, N., Vazquez, J., Wells, K., Sun, J. & Butler, J. E., Antibody repertoire development in fetal and neonatal piglets. XII. Three IGLV genes comprise 70% of the pre-immune repertoire and there is little junctional diversity. Mol Immunol 55 (3-4), 319-328 (2013).
40. Brosch, M. et al., Shotgun proteomics aids discovery of novel protein-coding genes, alternative splicing, and "resurrected" pseudogenes in the mouse genome. Genome Res 21 (5), 756-767 (2011).
41. Anderson, D. M. et al., A micropeptide encoded by a putative long noncoding RNA regulates muscle performance. Cell 160 (4), 595-606 (2015).
42. Mathé, C., Sagot, M.-F., Schiex, T. & Rouzé, P., Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Res 30 (19), 4103-4117 (2002).
43. Hedges, S. B., Dudley, J. & Kumar, S., TimeTree: a public knowledge-base of divergence times among organisms. Bioinformatics 22 (23), 2971-2972 (2006).
44. Luber, C. A. et al., Quantitative proteomics reveals subset-specific viral recognition in dendritic cells. Immunity 32 (2), 279-289 (2010).
45. Beck, M. et al., The quantitative proteome of a human cell line. Mol Syst Biol 7, 549 (2011).
46. Schwanhäusser, B. et al., Global quantification of mammalian gene expression control. Nature 473 (7347), 337-342 (2011).
47. Olsen, J. V. et al., Global, in vivo, and site-specific phosphorylation dynamics in signaling networks. Cell 127 (3), 635-648 (2006).
48. Franceschini, A. et al., STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res 41 (Database issue), D808-D815 (2013).
49. Corcoran, M. L., Kibbey, M. C., Kleinman, H. K. & Wahl, L. M., Laminin SIKVAV peptide induction of monocyte/macrophage prostaglandin E2 and matrix metalloproteinases. J Biol Chem 270 (18), 10365-10368 (1995).
50. Mao, Y. et al., Characterization of a Dchs1 mutant mouse reveals requirements for Dchs1-Fat4 signaling during mammalian development. Development 138 (5), 947-957 (2011).


51. Cappello, S. et al., Mutations in genes encoding the cadherin receptor-ligand pair DCHS1 and FAT4 disrupt cerebral cortical development. Nat Genet 45 (11), 1300-1308 (2013).
52. Ishiuchi, T., Misaki, K., Yonemura, S., Takeichi, M. & Tanoue, T., Mammalian Fat and Dachsous cadherins regulate apical membrane organization in the embryonic cerebral cortex. J Cell Biol 185 (6), 959-967 (2009).
53. Zakaria, S. et al., Regulation of neuronal migration by Dchs1-Fat4 planar cell polarity. Curr Biol 24 (14), 1620-1627 (2014).
54. Ekser, B. & Cooper, D. K. C., Overcoming the barriers to xenotransplantation: prospects for the future. Expert Rev Clin Immunol 6 (2), 219-230 (2010).
55. Ibrahim, Z. et al., Selected physiologic compatibilities and incompatibilities between human and porcine organ systems. Xenotransplantation 13 (6), 488-499 (2006).
56. Murgia, M. et al., Single muscle fiber proteomics reveals unexpected mitochondrial specialization. EMBO Rep 16 (3), 387-395 (2015).
57. Pál, C., Papp, B. & Hurst, L. D., Highly expressed genes in yeast evolve slowly. Genetics 158 (2), 927-931 (2001).
58. Subramanian, S. & Kumar, S., Gene expression intensity shapes evolutionary rates of the proteins encoded by the vertebrate genome. Genetics 168 (1), 373-381 (2004).
59. Lane, N. & Martin, W., The energetics of genome complexity. Nature 467 (7318), 929-934 (2010).
60. Drummond, D. A., Bloom, J. D., Adami, C., Wilke, C. O. & Arnold, F. H., Why highly expressed proteins evolve slowly. Proc Natl Acad Sci U S A 102 (40), 14338-14343 (2005).
61. Kim, M.-S. et al., A draft map of the human proteome. Nature 509 (7502), 575-581 (2014).
62. Edwards, A. V. G., Schwämmle, V. & Larsen, M. R., Neuronal process structure and growth proteins are targets of heavy PTM regulation during brain development. J Proteomics 101, 77-87 (2014).
63. Huttlin, E. L. et al., A tissue-specific atlas of mouse protein phosphorylation and expression. Cell 143 (7), 1174-1189 (2010).
64. Geiger, T. et al., Initial quantitative proteomic map of 28 mouse tissues using the SILAC mouse. Mol Cell Proteomics 12 (6), 1709-1722 (2013).


TABLES

Table 1. Search database information

Name                  Molecule Type   Size (Entries)   Version
Ensembl - all         PEP             25,883           10.2.70
Ensembl - ab initio   PEP             52,372           10.2.70
RefSeq                PEP             59,991           01.03.2013
UniProtKB             PEP             33,205           28.02.2013
Ensembl - all         cDNA            82,635           10.2.70
Ensembl - ab initio   cDNA            157,116          10.2.70
RefSeq                mRNA            195,291          28.03.2013
GenBank               EST             10,034,796       28.03.2013
PEPcEX                DNA             2,936,004        10.2.70
Ensembl               DNA             85,738,772       10.2.70


FIGURE LEGENDS

Figure 1. Experimental strategy of MS-based quantitative proteomics genome annotation. The biological samples comprise six distinct porcine embryonic stages and nine juvenile organs (top panel). Samples were lysed and subjected to GeLC-MS/MS analysis. The resulting peptide fragment spectra were interpreted with genome, transcript and protein databases, including a six-frame translation and peptide-centric exon graphs (bottom panel).

Figure 2. Peptide identifications in genome, transcriptome and proteome databases. Initial characterization of the porcine samples by the number of peptide identifications obtained from each database.

Figure 3. Proteogenomic peptide classification. Schematic gene models with classifications of non-reference peptides (blue) into intragenic and intergenic events. The proteogenomic peptide classification helped to refine existing gene models (black) or provide evidence for novel exons or gene models (green).

Figure 4. Protein expression in porcine organs and embryonic stages. (A) Heatmap showing sequence identity of Bos taurus, Mus musculus and Homo sapiens (10.2.75 release) proteins homologous to MScDB identifications, supplemented by tissue specificity. Sequence identity was used as an approximation of functional relatedness. (B) Expression correlation and hierarchical clustering based on measured relative protein abundance over embryos and organs. (C) Hierarchical clustering of the top 2.5% of proteins in organs (749). Gray boxes and text indicate sample annotation for selected clusters and significantly (p < 0.05) enriched gene ontology terms.

Figure 5. Temporal protein expression profiles of embryonic stages. The upper panel illustrates fuzzy clusters of protein expression profiles. The color coding indicates the membership value of each protein to a cluster (red is highest). Clusters were grouped by global expression trends. The lower panel represents an overlay of the groups with STRING (ref. 48) protein-protein interactions. Three sub-networks exemplify time-resolved biological processes crucial to embryonic development.

Figure 6. Orthologous transcript and protein expression analysis. (A) Principal component analysis of porcine and human organs (liver, heart, spleen, lung, kidney, biliary, pancreas; human data from ref. 15). (B) Comparison of porcine and human mRNA-to-protein ratio as a proxy for the translation rate. Protein abundance is expressed as normalized iBAQ values and transcript abundance as FPKMs.
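The six-frame translation named in the Figure 1 legend can be illustrated with a minimal sketch. It assumes the standard genetic code and plain-string DNA input. The function names (`six_frame_translation`, `translate`, `reverse_complement`) are hypothetical and are not taken from the authors' pipeline, which is not described in this section.

```python
from itertools import product

# Standard genetic code, amino acids listed in TCAG codon order (NCBI table 1).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; ambiguous codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(dna):
    """Map frame labels (+1..+3, -1..-3) to translated sequences ('*' = stop)."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    # Toy sequence chosen for illustration only.
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```

In a search database, each frame would typically be split at stop codons before in-silico digestion; that step is omitted here to keep the sketch short.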

FOR TOC ONLY


Figure 1 [image not reproduced]. Top panel: embryonic stages (n = 6; 18 d, 25 d, 28 d, 32 d, 35 d, 39 d; scale bar 5 mm) and juvenile organs (n = 9; 180 d; diaphragm, spleen, biliary, kidney, liver, lung, brain, pancreas, heart). Bottom panel: proteogenomic database search against genome, transcriptome and proteome databases (Ensembl, RefSeq, UniProtKB, GenBank), with a peptide-centric exon graph and a six-frame translation (frames +1, +2, +3, -1, -2, -3).

Figure 2 [image not reproduced]. Axis: Peptides (n). Samples: diaphragm, spleen, biliary, kidney, liver, lung, brain, pancreas, heart; embryonic stages 18 d, 25 d, 28 d, 32 d, 35 d, 39 d. Databases: [DNA] Ensembl, [DNA] PEPcEX, [mRNA] RefSeq, [EST] GenBank, [PEP] UniProtKB, [PEP] RefSeq, [PEP] Ensembl - all, [PEP] Ensembl - ab initio, [cDNA] Ensembl - all, [cDNA] Ensembl - ab initio.

Figure 3 [image not reproduced]. Event classes (grouped into intragenic and intergenic events) with counts: Fusion (364), N-Term extension (430), C-Term extension (244), Novel splice junction (319), UTR translated (628), Exon boundary (1,088), Frameshift (96), Novel exon (1,475), Novel (2,756).

Figure 4 [image not reproduced]. (A) Identity source: Bos taurus, Mus musculus, Homo sapiens; identity scale 0 to 100; specificity: embryo, organ, both. (B) Samples: biliary, liver, 18d, 25d, 32d, 35d, 28d, 39d, heart, diaphragm, brain, kidney, pancreas, spleen, lung. (C) Organs: biliary, liver, kidney, spleen, lung, brain, pancreas, diaphragm, heart; color key: log10(LFQ). Annotated clusters: oxidoreductases, carboxylic acid and lipid metabolism; neuron projection, neuron development, synaptic transmission; sarcomere, sarcoplasmic reticulum, myosin complex; core metabolism, cytoskeleton, transport, heat shock proteins, 14-3-3 proteins.

Figure 5 [image not reproduced]. Upper panel: cluster profiles, x-axis Time [d] (18, 25, 28, 32, 35, 39), y-axis Expression changes (-2 to 2), color key Membership (0 to 1). Lower panel: protein-protein interaction network with labeled groups: muscle proteins, placenta, prostaglandin regulation, extracellular matrix-receptor interaction, extensive extracellular matrix remodeling, secretory granule, regulation of apoptosis, mitogen-activated protein kinase pathway, brain, ubiquitin-dependent protein degradation, cell cycle regulation, protein biosynthesis, microtubule, fatty acid biosynthesis, G protein-coupled receptor signaling.

Figure 6 [image not reproduced]. (A) Axes: PC1 (33.9 %) and PC2 (19.5 %); points labeled by organ (liver, heart, spleen, lung, kidney, biliary, pancreas) and species (Sus scrofa, Homo sapiens). (B) Axes: log10(norm. iBAQ / FPKM) for Homo sapiens (x) and for Sus scrofa (y), each from -3 to 3; r = 0.62.
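The quantity plotted in Figure 6B, log10(norm. iBAQ / FPKM) per ortholog in each species, can be computed along the following lines. This is a minimal sketch: the input layout (gene name mapped to an `(iBAQ, FPKM)` pair), the function names, and the toy numbers are all hypothetical. Treating the reported r as a Pearson coefficient is also an assumption, since the legend does not name the statistic.

```python
import math


def log10_ratio(ibaq, fpkm):
    """log10 of protein abundance per transcript; None if either value is not positive."""
    if ibaq <= 0 or fpkm <= 0:
        return None
    return math.log10(ibaq / fpkm)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def compare_species(pig, human):
    """Correlate log10(iBAQ / FPKM) over orthologs quantified in both species."""
    xs, ys = [], []
    for gene in sorted(set(pig) & set(human)):
        x = log10_ratio(*human[gene])
        y = log10_ratio(*pig[gene])
        if x is not None and y is not None:
            xs.append(x)
            ys.append(y)
    return pearson(xs, ys), len(xs)


if __name__ == "__main__":
    # Made-up values for illustration; not data from the study.
    pig = {"GENE_A": (1e6, 10.0), "GENE_B": (5e4, 50.0), "GENE_C": (2e5, 4.0)}
    human = {"GENE_A": (8e5, 12.0), "GENE_B": (9e4, 40.0), "GENE_C": (1e5, 5.0)}
    r, n = compare_species(pig, human)
    print(f"r = {r:.2f} over {n} orthologs")
```

Genes with a zero iBAQ or FPKM value are skipped rather than given a pseudocount; either choice affects the correlation and would need to match the original analysis.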

[Final graphic, image not reproduced. Embryonic stages (n = 6): 18 d, 25 d, 28 d, 32 d, 35 d, 39 d; scale bar 5 mm. Juvenile organs (n = 9), 180 d: diaphragm, spleen, biliary, kidney, liver, lung, brain, pancreas, heart.]