Evolutionary Characteristics of Missing Proteins ... - ACS Publications

Nov 2, 2015 - J. Proteome Res., 2015, 14 (12), pp 4985–4994

Article

The evolutionary characteristics of missing proteins --- Insights into the evolution of human chromosomes related to missing-protein-encoding genes Aishi Xu, Guang Li, Dong Yang, Songfeng Wu, Hongsheng Ouyang, Ping Xu, and Fuchu He J. Proteome Res., Just Accepted Manuscript • Publication Date (Web): 02 Nov 2015 Downloaded from http://pubs.acs.org on November 2, 2015


Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Journal of Proteome Research

The evolutionary characteristics of missing proteins: Insights into the evolution of human chromosomes related to missing-protein-encoding genes

Aishi Xu†,‡,#, Guang Li†,#, Dong Yang†,#,*, Songfeng Wu†, Hongsheng Ouyang‡, Ping Xu†,*, Fuchu He†,*

†State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences Beijing, Beijing Institute of Radiation Medicine. Beijing, 102206, P. R. China.

‡Animal Sciences College of Jilin University. Changchun, 130062, P. R. China.

#Contributing equally to this work.

*Corresponding author

E-mail: [email protected], Tel & Fax: 8610-68177417
E-mail: [email protected], Tel: 8610-80727777-1306
E-mail: [email protected], Tel: 8610-80705166

Keywords: Missing protein, Spatial-temporal specific genes, Chromosome evolution, Paralogous gene group.

ACS Paragon Plus Environment


Abstract

Although “missing protein” is a temporary concept in the C-HPP, the biological reasons why these proteins are “missing” could provide important clues for evolutionary studies. Here, we classified missing-protein-encoding genes into two groups: genes encoding PE2 proteins (with transcript evidence) and genes encoding PE3/4 proteins (with no transcript evidence). These missing-protein-encoding genes are distributed unevenly among chromosomes, chromosomal regions, and gene clusters. In terms of evolutionary features, PE3/4 genes tend to be young, to lie in non-homologous chromosomal regions, and to evolve at higher rates. Interestingly, the proportion of singletons among PE3/4 genes is higher than that among all genes (background) and among OTCSGs (organ, tissue, or cell type-specific genes). More importantly, most PE3/4 genes are newly duplicated members of paralogous gene groups, which mainly contribute to special biological functions such as “smell perception”. These functions are largely restricted to specific cell types, tissues, or developmental stages, and act as the new functional requirements that facilitated the emergence of missing-protein-encoding genes during evolution. In addition, criteria for proteins with extremely special physical-chemical properties were set up for the first time based on the properties of PE2 proteins, and the evolutionary characteristics of those proteins were explored. Overall, the evolutionary analyses of missing-protein-encoding genes are expected to be highly instructive for future proteomics and functional studies.


Introduction

Proteins that so far lack any mass spectrometry or antibody detection evidence are termed “missing proteins” by the chromosome-centric human proteome project (C-HPP) [1]. “Missing protein” is a dynamic and temporary concept. With advances in protein identification techniques, a wider range of samples, and more complete genome annotation, there will ultimately be no missing proteins. However, missing proteins can serve as a clue for defining special types of genes, which in turn can be used in research on gene and chromosome evolution.

Based on the neXtProt database [2-4], missing proteins can be classified into two categories: those without and those with transcript evidence. Missing proteins without any transcript evidence (PE3/4) may be encoded by genes expressed only in extremely uncommon organs, tissues, or cell types (OTCs), or under special conditions [5]. In contrast, missing proteins with transcript evidence (PE2) are not expected to be “missing” at the protein level; they may be difficult to identify because of their special physical and chemical properties [6]. A comprehensive analysis of the evolutionary characteristics of the genes encoding these two types of missing proteins will give us new insights into the features of gene and chromosome evolution.

In this study, we focused on these important questions: how and why did these special types of genes emerge during evolution? What are the functional requirements and biological effects of these evolutionary events? To this end, we first focused on the evolutionary characteristics of PE3/4 genes. The origin time, chromosomal homology features, duplication events, and evolutionary rates of PE3/4 genes were calculated and analyzed. Functional annotation and representation analyses of the genes within the paralogous gene groups (PGGs) containing PE3/4 genes revealed the functional features of PE3/4 genes. In addition, PE2 genes were used to define a range of special physical-chemical properties of proteins, and the evolutionary characteristics of the corresponding proteins were explored. Overall, this study will provide the community with a more comprehensive understanding of gene and chromosome evolution.

Materials and methods

Definition and classification of missing proteins

Missing proteins are proteins that have not yet been detected by antibody-based or mass spectrometry-based proteomic analysis [1]. In this study, missing proteins are classified into two categories according to the evidence annotated in the neXtProt database [2, 3], release 2014-09-19. One category is missing proteins with transcript-level evidence, i.e., the PE2 class in neXtProt; the other is proteins without transcript-level evidence, i.e., the PE3 and PE4 classes in neXtProt.

Spatial-temporal specific genes

PE3/4 genes may be expressed only in extremely uncommon organs, tissues, or cell types (OTCs), or under special conditions, so they were considered extremely spatial-temporal specific genes. Ordinary spatial-temporal specific genes were the organ, tissue, or cell type-specific genes (OTCSGs). To define the spatial-temporal specificity of gene expression, a comprehensive transcriptome dataset of 32 different human organs and tissues was used to count the number of organs or tissues in which a gene of interest was expressed [7]. The transcriptome dataset was based on next-generation sequencing. A cutoff of 1 FPKM (Fragments Per Kilobase of exon per Million fragments mapped) was used as the detection limit. Based on this dataset, genes that were not detected in any sample or were detected in only one sample were regarded as OTCSGs.

Evolutionary characteristics of human genes

The evolutionary rates of human proteins were calculated from the ortholog pairs between H. sapiens and P. troglodytes, which were retrieved from Ensembl [8] release 75 (http://www.ensembl.org/) using BioMart. The number of synonymous substitutions per synonymous site (dS) and the number of nonsynonymous substitutions per nonsynonymous site (dN) between the ortholog pairs were estimated by the maximum likelihood method using the PAML package [9]. The ratio of dN to dS (dN/dS) was used to represent the evolutionary rate of a protein.

Gene age is defined by the most recent common ancestor (MRCA) of the species containing the gene, based on the orthology relationships extracted from the Ensembl Compara database [10]. To simplify the classification, all human genes are divided into five age groups: Opisthokonta, Bilateria, Chordata, Euteleostomi, and Mammalia. For the comparison of singleton gene ages between PE3/4 genes and OTCSGs, the genes in the Mammalia group are further divided into two groups: Mammalia and Primates. In general, genes in the Opisthokonta group are regarded as old genes, while genes in the Mammalia and Primates groups are regarded as young genes [11, 12].

Chromosome evolutionary characteristics

The multi-species homologous synteny blocks (msHSBs) were defined based on the results of two previous studies [13, 14]. The sequenced genomes of ten representative species were used to define the HSBs, with the human genome as the reference: eight eutherian mammals from four orders (Homo sapiens, Pan troglodytes, and Macaca mulatta from Primates; Mus musculus and Rattus norvegicus from Rodentia; Canis familiaris from Carnivora; Bos taurus and Sus scrofa from Cetartiodactyla), one metatherian (Monodelphis domestica), and one member of the class Aves (Gallus gallus). The coordinates on the human genome assembly were converted to GRCh37 with the remapping tool (http://www.ncbi.nlm.nih.gov/genome/tools/remap). The chromosomal locations of human genes were downloaded from Ensembl (release 75) using BioMart. We defined a gene as ‘In Block’, ‘Partial In Block’, ‘Outside of Block’, or ‘Span Block’ according to the relative locations of the gene and the blocks (Figure S1 in the SI).

Identification of paralogous gene groups

The paralogy relationships among all human protein-coding genes were retrieved from Ensembl (release 75) using BioMart. Pairwise paralogy relationships were used to identify the paralogous gene groups (PGGs): if a gene has a paralogy relationship with any gene in a PGG, it is included in that PGG.

Functional enrichment analysis

The analysis was carried out using the DAVID tool (http://david.abcc.ncifcrf.gov/home.jsp) [15]. The UniProt accession numbers from the datasets were recognized by DAVID. The functional enrichment analysis of the genes in PGGs containing PE3/4 genes, and of the genes encoding proteins with extremely special physical-chemical properties, was carried out based on the Fisher exact test, with all protein-coding genes (PCGs) as the background. Default parameters were used for the enrichment calculation.

Statistical analysis approaches for the criterion of extremely special physical-chemical properties

To determine the criterion for extremely special physical-chemical properties, the Kolmogorov–Smirnov test (K-S test) was used to compare the datasets of PE1 proteins (proteins with protein evidence) and PE2 proteins (missing proteins with transcript evidence). Starting from the point at which the probability density values of the PE1 and PE2 datasets are equal, K-S tests were repeated to find the critical point at which the P value just falls below 0.05. The corresponding values of the physical-chemical properties are regarded as the cutoff values of the criterion for extremely special physical-chemical properties.
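The two-sample K-S test used for the cutoff search can be illustrated with a short Python sketch. It is not the authors' implementation. `ks_two_sample` computes the usual statistic D (the largest gap between the two empirical distribution functions) with an asymptotic p-value, which is only approximate for small samples. `first_significant_cutoff` is one possible reading of the repeated-test procedure, which the text does not fully specify: it assumes that at each candidate cutoff the PE1 and PE2 values at or above the cutoff are compared, and that the candidates are supplied in the order to be scanned. The function names, the `min_size` guard, and the direction of the tail are all assumptions; for a property where low values are the extreme ones, such as molecular weight, the comparison would have to be flipped.

```python
import math
from bisect import bisect_right


def ks_two_sample(a, b):
    """Return (D, p) for the two-sample Kolmogorov-Smirnov test.

    D is the largest gap between the two empirical CDFs. The p-value uses
    the asymptotic Kolmogorov series, so it is approximate for small samples.
    """
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    # Evaluate both ECDFs at every observed value and keep the largest gap.
    d = max(abs(bisect_right(a, x) / n - bisect_right(b, x) / m) for x in a + b)
    if d == 0.0:
        return 0.0, 1.0
    ne = n * m / (n + m)  # effective sample size
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * d
    if lam < 0.3:
        # The series converges slowly here and the p-value is essentially 1.
        return d, 1.0
    # Q(lam) = 2 * sum_{j>=1} (-1)^(j-1) * exp(-2 * j^2 * lam^2)
    p = 2.0 * sum((-1) ** (j - 1) * math.exp(-2.0 * (j * lam) ** 2)
                  for j in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


def first_significant_cutoff(pe1, pe2, candidates, alpha=0.05, min_size=5):
    """Return the first candidate cutoff whose tails differ at level alpha.

    Assumption (not stated in the paper): the test at cutoff c compares the
    PE1 and PE2 values that are >= c. Returns None if no candidate qualifies.
    """
    for c in candidates:
        tail1 = [x for x in pe1 if x >= c]
        tail2 = [x for x in pe2 if x >= c]
        if len(tail1) < min_size or len(tail2) < min_size:
            continue  # too few values for a meaningful test
        _, p = ks_two_sample(tail1, tail2)
        if p < alpha:
            return c
    return None
```

In practice `pe1` and `pe2` would hold one property (for example the hydrophobicity value) for every PE1 and PE2 protein, and `candidates` would start at the point where the two density curves cross.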

Results and Discussion

Chromosomal distribution of the genes encoding missing proteins

The chromosomal distribution of genes results from chromosome evolution at the level of whole chromosomes or chromosome segments. We therefore investigated the chromosomal distribution of missing-protein-encoding genes to obtain clues about their evolutionary history. Interestingly, the chromosomal distribution of missing-protein-encoding genes is not random (Figure 1A), which is apparent at three levels: chromosome, region, and gene-gene distance.

First, missing-protein-encoding genes are not distributed equally among chromosomes (Figure 1B). In terms of the proportion of missing-protein-encoding genes on each chromosome, the percentages of PE2 genes among all PCGs on chromosomes Y, 11, 21, 19, 9, and X are higher than 14.95% (the 75th percentile across all chromosomes). Meanwhile, the percentages of PE3/4 genes on chromosomes Y, 4, 8, 13, 15, and 16 are higher than the 75th percentile (1.94%).
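The chromosome-level comparison amounts to computing, for each chromosome, the share of PCGs that are PE2 (or PE3/4) genes, and then flagging the chromosomes whose share exceeds the 75th percentile of all shares. A minimal Python sketch follows. The function name and the input dictionaries are hypothetical, and the "inclusive" quantile method is an assumption, since the text does not say which percentile definition was used.

```python
from statistics import quantiles


def flag_enriched(missing_counts, pcg_counts):
    """Return (q3, chromosomes whose missing-gene share exceeds q3).

    missing_counts : chromosome -> number of PE2 (or PE3/4) genes
    pcg_counts     : chromosome -> number of protein-coding genes
    """
    share = {c: 100.0 * missing_counts.get(c, 0) / pcg_counts[c]
             for c in pcg_counts}
    # quantiles(..., n=4) returns the three quartile cut points; the last is Q3.
    q3 = quantiles(list(share.values()), n=4, method="inclusive")[2]
    flagged = sorted(c for c, s in share.items() if s > q3)
    return q3, flagged
```

Running the function once with PE2 counts and once with PE3/4 counts would give the two thresholds and the two chromosome lists reported above, up to the choice of percentile definition.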


Second, missing-protein-encoding genes are not distributed evenly among regions within a chromosome. The chromosomal regions conserved among species are named msHSBs. The regions outside msHSBs experienced breakages and rearrangements during evolution. Compared with all PCGs, both PE3/4 genes and PE2 genes tend to lie in the regions outside msHSBs on each chromosome (Figure 1C). This feature indicates that missing-protein-encoding genes were likely to emerge along with the breakage and rearrangement of chromosomes during evolution. Moreover, missing-protein-encoding genes tend to emerge near telomeres and centromeres compared with all PCGs on each chromosome (see the supplementary description and Figures S2 and S3 in the SI for details). For example, in the short-arm terminal region spanning 5% of the total length of chromosome 11, there are 76 PE2 genes and 6 PE3/4 genes, accounting for 44.6% of the PCGs in this region (Figure S3A in the SI); these genes are mainly involved in the biological process “sensory perception of smell”. The tendency of missing-protein-encoding genes to concentrate near telomeres and centromeres suggests that special evolutionary events shaped these genes.

Third, when we investigated gene adjacency, we found that missing-protein-encoding genes tend to be adjacent to other protein-coding genes rather than to other missing-protein-encoding genes (Table 1). Interestingly, the genes in clusters containing missing-protein-encoding genes tend to have similar functional descriptions (Table S1 in the SI). There are 20 such clusters with at least 5 genes, at least one of which is a missing-protein-encoding gene. For example, on chromosome 1 (from 152483320 to 152816459), there is a cluster of 21 genes with the similar functional description “late cornified envelope protein”; these proteins are precursors of the cornified envelope of the stratum corneum. Among them, 14 genes are missing-protein-encoding genes.

In short, missing-protein-encoding genes are not distributed evenly among chromosomes, chromosomal regions, and gene clusters. These results imply intrinsic relationships among special chromosomal distribution, special biological function, and special physical-chemical properties or extreme spatial-temporal specificity. These phenomena may result from the long evolutionary history of genes and chromosomes, so it is important to explore their relationships from an evolutionary perspective. How, then, did the missing-protein-encoding genes originate? Was there any relationship between segmental duplication of chromosomes and the emergence of missing-protein-encoding genes?

Evolutionary characteristics of PE3/4 genes

PE3/4 genes may be expressed only in extremely specific spatial-temporal states. How and why this special type of gene emerged during evolution is an interesting and fundamental question. To explore the evolutionary characteristics of these genes, we considered gene age, chromosomal homology features, duplication events, and evolutionary rates. Because OTCSGs (see Materials and methods for a detailed explanation) were thought to have ordinary spatial-temporal specificity, they were used as a reference gene set when the evolutionary characteristics of PE3/4 genes were investigated.

Compared with all PCGs and OTCSGs, PE3/4 genes have a higher tendency to be young genes, which originated in the last common ancestor of Mammalia or its descendants (Figure 2A). Previous studies reported that spatial-temporal-specific genes tend to be young genes [16] or genes encoding young protein domains [17]. We further confirmed that PE3/4 genes, which may have extremely spatial-temporal-specific expression patterns, are more likely to be young genes than ordinary spatial-temporal-specific genes.

In terms of chromosomal homology features, we found that PE3/4 genes tend to lie outside msHSBs (Figure 2B). These regions are fragile during chromosome evolution; that is, they experienced breakage and rearrangement [14]. These observations indicate that the chromosomal fragments containing PE3/4 genes tend not to be conserved among the examined species (see Materials and methods for details of the 10 species).

To investigate the duplication events of PE3/4 genes, we counted paralogs based on gene duplication information. Tissue-specific genes are known to tend to be paralogs [18], which was confirmed by our analysis of OTCSGs. Surprisingly, however, PE3/4 genes have a much higher proportion of singletons than all PCGs and OTCSGs (Figure 2C). Further analysis of the gene ages of singletons revealed the cause of this unexpected result. More PE3/4 singletons originated after the last common ancestor of primates (Figure S4 in the SI), so the higher proportion of singletons among PE3/4 genes may be due to the young age of these singletons. This result also suggests that although paralogous genes tend to be expressed in specific samples, those samples may be common and easy to detect, so these genes can be detected in at least one common sample. In contrast, although most singletons tend to be widely expressed, some primate-specific singletons are expressed only in uncommon samples or under special conditions that are rarely covered by proteomic studies, so they are not easy to detect.

The evolutionary rate of a gene during the most recent evolutionary period can represent the strength of purifying selection on the gene [19, 20]. We therefore calculated the evolutionary rates of human PCGs by measuring the ratio of dN to dS between the ortholog pairs of H. sapiens and P. troglodytes. PE3/4 genes have markedly higher evolutionary rates than all PCGs (Wilcoxon rank sum test: p~0) and OTCSGs (p~0, Figure 2D). The dN/dS ratios of 58 PE3/4 genes (33.72%) were higher than 1 (dN > dS). These results indicate that most PE3/4 genes are under weaker purifying selection, and that some of them (33.72%) are even under positive selection.

In sum, we conclude that PE3/4 genes have a higher tendency to be young genes and have higher evolutionary rates than other genes, and that the proportion of singletons among PE3/4 genes is relatively high.

Because PE3/4 genes comprise two parts (the genes encoding PE3 or PE4 proteins, here simply called “PE3 genes” and “PE4 genes”), it is worth testing whether these two parts differ markedly in their evolutionary characteristics. We found that both PE3 and PE4 genes tend to be younger than all PCGs and OTCSGs, but the proportion of the oldest genes (Opisthokonta group) is lower in PE4 than in PE3 (Figure S5A in the SI). Compared with PE4 genes, PE3 genes tend to lie outside msHSBs (Figure S5B in the SI) and to evolve more quickly (Figure S5D in the SI). PE3 genes therefore contribute more than PE4 genes to these two evolutionary features of PE3/4 genes. Interestingly, PE4 genes have a much higher proportion of singletons than PE3 genes, so PE4 genes contribute more to the high proportion of singletons among PE3/4 genes. Altogether, both PE3 and PE4 genes originated late in evolution, but PE4 genes experienced fewer duplication events and evolved under stronger purifying selection than PE3 genes.

The features of the gene duplication events of PE3/4 genes

Gene duplication is a primary route of gene expansion, and there is a clear relationship between gene duplication and gene expression. For example, duplication events of mammalian genes tend to lead to a tissue-specific expression pattern of the duplicated genes [21].
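The paralogous gene groups (PGGs) used in this section follow the rule given in Materials and methods: a gene joins a PGG if it is paralogous to any member. That rule is equivalent to taking the connected components of the pairwise paralogy graph. The sketch below shows one way to do this in Python with a union-find structure. The function name and the input format (an iterable of gene-ID pairs, such as a BioMart export reduced to two columns) are assumptions, not the authors' code.

```python
from collections import defaultdict


def paralogous_gene_groups(paralog_pairs):
    """Group genes into PGGs, i.e. connected components of the paralogy graph.

    paralog_pairs : iterable of (gene_a, gene_b) pairwise paralogy relations.
    Genes that appear in no pair (singletons) belong to no PGG.
    """
    parent = {}

    def find(g):
        # Locate the representative of g's component, shortening the path.
        parent.setdefault(g, g)
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b in paralog_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb  # merge the two components

    groups = defaultdict(set)
    for g in list(parent):
        groups[find(g)].add(g)
    # Sorted output keeps the result deterministic.
    return sorted((sorted(members) for members in groups.values()),
                  key=lambda members: members[0])
```

Because membership is transitive under this rule, two genes can end up in the same PGG without being annotated as direct paralogs of each other.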


To clarify the gene duplication pattern of PE3/4 genes, the paralogous genes were divided into paralogous gene groups (PGGs). All human paralogous genes can be assigned to 3320 PGGs, 82 of which contain PE3/4 genes (Table S2 in the SI). The sizes of the PGGs containing PE3/4 genes range from 2 to 39. Notably, these PGGs contain only a few PE3/4 genes, and about 74.4% of them contain only one PE3/4 gene each (Figure 3A). This result implies that PE3/4 genes make up only a small part of the PGGs containing them; that is, only a few genes in these PGGs are expressed in extremely uncommon OTCs or conditions.

All members of a PGG originated from the same oldest ancestor. Why are there obvious differences in expression pattern among genes originating from the same ancestor? What is the most marked difference among these members? During evolution, some members of a PGG experienced many rounds of duplication, whereas others duplicated only once. We therefore focused on the duplication times of the genes in PGGs. The time of the last duplication of a gene is an important feature for understanding its functional novelty. The comparison of last duplication times between PE3/4 genes and other genes (or OTCSGs) within a PGG indicates that PE3/4 genes tend to be newly duplicated genes (Figure 3B, Figure S6 and Table S3 in the SI). This suggests that some newly duplicated genes tend to be expressed only in very uncommon OTCs or conditions.

For example, one PGG has 25 members, all of which encode ubiquitin carboxyl-terminal hydrolases. Among the 25 paralogous genes, 12 are PE3/4 genes. Interestingly, the 12 PE3/4 genes are all the youngest duplicates, which were duplicated within H. sapiens (Figure 3C). Ubiquitin carboxyl-terminal hydrolases can thiol-dependently hydrolyze ester, thioester, amide, peptide, and isopeptide bonds formed by the C-terminal Gly of ubiquitin [22]. Among them, USP17L24, a member that was most recently duplicated in H. sapiens, was reported to be a deubiquitinating enzyme that removes conjugated ubiquitin from specific proteins to regulate different cellular processes, such as cell proliferation, cell cycle, apoptosis, cell migration, and the cellular response to viral infection [23]. It can be deduced that the 12 PE3/4 genes have similar functions because they have the same domain composition as USP17L24 (Figure 3C). The expansion of this family in H. sapiens may have led to functional division among the new members, so that some of them, e.g., the 12 PE3/4 genes, are expressed only in extremely uncommon OTCs or conditions.

Another example is the PGG containing one PE3/4 gene, LIMS3L, which was also the most recently duplicated gene in its PGG (Figure 3D). The genes in this PGG, e.g., LIMS3, mainly take part in developmental processes, and the deletion of LIMS3 in the 2q13 region might have led to rare phenotypic disorders involving nuclear-cytoplasmic trafficking, signaling for tissue patterning, and differentiation [24]. LIMS3L may have similar functions because it is most closely related to LIMS3, which may explain why it has not been detected so far, even at the transcript level.

In summary, we found that paralogous PE3/4 genes tend to be the newly duplicated genes in their PGGs. It is generally accepted that there is obvious divergence in expression pattern and functional features among the paralogous genes in the same PGG [25, 26]. However, no previous study further investigated the relationship between duplication time and expression pattern, and it was not known whether PE3/4 genes, as extremely spatial-temporal-specific genes, have more specific characteristics. Our results address these questions and imply that, during evolution, some newly duplicated paralogous genes tend to be expressed in very uncommon OTCs or conditions. We recommend using various samples from special developmental stages (e.g., the early embryonic stage) or special conditions (e.g., starvation) to detect this kind of missing protein.
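The PGG-level summaries used in this section (group size, number of PE3/4 members per group, the share of groups with exactly one PE3/4 gene, and whether the PE3/4 members are the most recent duplicates) can be sketched in Python as follows. The function name, the input structures, and especially the integer `last_dup_rank` encoding of the last duplication time (larger meaning more recent) are hypothetical; the paper does not describe how duplication times were encoded or compared.

```python
def summarize_pggs(groups, pe34_genes, last_dup_rank):
    """Summarize the PGGs that contain at least one PE3/4 gene.

    groups        : iterable of gene-ID collections, one per PGG
    pe34_genes    : set of PE3/4 gene IDs
    last_dup_rank : gene ID -> int, larger = more recent last duplication
                    (a hypothetical encoding chosen for this sketch)
    Returns (rows, fraction of these PGGs with exactly one PE3/4 gene).
    """
    rows = []
    for members in groups:
        pe = [g for g in members if g in pe34_genes]
        if not pe:
            continue  # PGG without PE3/4 genes is not summarized
        newest = max(last_dup_rank[g] for g in members)
        rows.append({
            "size": len(members),
            "n_pe34": len(pe),
            # True when every PE3/4 member has the most recent duplication rank.
            "pe34_all_newest": all(last_dup_rank[g] == newest for g in pe),
        })
    single = sum(1 for r in rows if r["n_pe34"] == 1)
    frac_single = single / len(rows) if rows else 0.0
    return rows, frac_single
```

With the real data, `frac_single` would correspond to the share of PGGs holding a single PE3/4 gene, and `pe34_all_newest` would mark groups like the ubiquitin carboxyl-terminal hydrolase family described above.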
In addition, we found that the special characteristics of chromosomal distribution, function, and evolution are intrinsically related to each other. For example, although PE3/4 genes tend to be young genes (Figure 2A), 18 PE3/4 genes still originated in the last common ancestor of Opisthokonta, the oldest group in our analysis. In fact, 15 of them were most recently duplicated in H. sapiens, although their PGGs originated in Opisthokonta (Table S4 in the SI). This result indicates that the interplay between de novo gene origination and recent gene duplication contributed to the formation of new genes with divergent functions and special expression.

Functional constraints of the evolution of PE3/4 genes

As described above, PE3/4 genes tend to be newly duplicated genes with distinctive evolutionary characteristics. Paralogs generally have similar amino acid sequences and biological functions. We therefore supposed that newly duplicated genes may have functions similar to those of other members of the same PGG, with minor modifications. To further examine the functional requirements and effects of these duplication events, we investigated the functional characteristics of the PGGs containing PE3/4 genes. The 82 PGGs containing PE3/4 genes include 742 genes, which were subjected to functional over/under-representation analysis (see Materials and methods). The over-represented GO terms are mainly related to sensory perception (biological process, BP), olfactory receptor activity (molecular function, MF), and plasma membrane (cellular component, CC) (Figure 4A). The KEGG pathway analysis also indicated that the olfactory transduction pathway is over-represented in the PGGs containing PE3/4 genes (Figure 4B). As an example, 18 PGGs (170 genes) are involved in olfactory receptor activity, and the representative genes of each PGG are shown in Figure 4.
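The over-representation test behind such GO and KEGG results can be illustrated with a one-sided Fisher exact test, computed here as a hypergeometric tail probability in plain Python. This is a sketch of the general technique, not the DAVID implementation. DAVID reports a modified, more conservative variant (the EASE score), so the values will not match DAVID output exactly. The function and argument names are assumptions.

```python
from math import comb


def enrichment_p(k, n, K, N):
    """One-sided Fisher exact (hypergeometric) p-value for over-representation.

    N : number of genes in the background, K of which carry the annotation
    n : number of genes in the test set, k of which carry the annotation
    Returns P(X >= k) when n genes are drawn from the background
    without replacement.
    """
    upper = min(n, K)
    # Count the draws with at least k annotated genes, then normalize.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)
```

For the analysis described above, `n` would be the 742 genes in the PGGs containing PE3/4 genes and `N` the number of background PCGs; `k` and `K` would be the numbers of genes annotated with a given term in the test set and in the background, respectively.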
These results suggested that there were obvious functional requirements for the novel duplication of the genes with some special functions, such as “olfactory receptor” 14

ACS Paragon Plus Environment

Page 15 of 35

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

(See supplementary description and Tables S5 and S6 in the SI for detailed information). The new duplicates may have biological functions similar to those of their paralogous genes in the same PGG, but some divergence is expected because of sub- and/or neo-functionalization27. The new functions may be required only under very uncommon conditions/OTCs. These genes therefore became PE3/4 genes, which cannot be detected easily.

The definition and the evolutionary characteristics of the proteins with extremely special physical-chemical properties

It is well known that extremely special physical-chemical properties of proteins affect their detectability by proteomic technologies6, 28. For instance, proteins with high hydrophobicity, a high isoelectric point, or low molecular weight are difficult to separate and identify. However, there are no criteria to define extremely special physical-chemical properties. In this study, for the first time, using the missing-protein information as an important clue, we set up such criteria based on statistical analyses. To do this, we first compared the physical and chemical properties of the proteins with protein-level evidence (PE1) and the missing proteins with transcript-level evidence (PE2). There are obvious differences between PE1 and PE2 proteins in the distributions of hydrophobicity (Figure 5A), molecular weight (Figure 5B), and isoelectric point (Figure 5C). Compared with PE1 proteins, PE2 proteins tend to be more hydrophobic, smaller in molecular weight, and higher in isoelectric point, which is consistent with previous studies6, 28. To obtain statistical criteria for highly hydrophobic, low-molecular-weight, and high-isoelectric-point proteins, we applied a repeated KS test (Kolmogorov–Smirnov test) between PE1 proteins and PE2 proteins to find the cutoff values (p < 0.05; see the Method section for detailed information). As a result, the highly hydrophobic proteins (hydrophobicity value > 0.615), low-molecular-weight proteins (MW < …), and high-isoelectric-point proteins (pI > 8.93) were defined.

The Venn diagram (Figure 5D) of the intersections between the three types of proteins with special physical-chemical properties and PE2 proteins indicates that there is little overlap among the three types, which suggests that these properties contribute to detectability independently. More than half of the PE2 proteins belong to these three special classes. Viewed another way, the highly hydrophobic class contains a much higher proportion of PE2 proteins (60.8%) than the other two classes (about 22%), which suggests that high hydrophobicity affects detectability more often. The PE2 genes (with transcript evidence) have not been detected at the protein level so far, mainly because of these extremely special physical-chemical properties. We recommend various technologies, including immunological methods, to extend the detection of this kind of missing protein. In this revision, we added this discussion to the manuscript. Meanwhile, these three classes of proteins have different over-represented GO terms for biological processes or cellular components (see the supplementary description and Table S7 in the SI for detailed information).

In terms of evolutionary characteristics, compared with all PCGs, genes encoding proteins with special physical and chemical properties tend to be young genes that originated from the latest common ancestor of mammals or its descendants (Figure 5E). Notably, the highly hydrophobic proteins tend to lie outside msHSBs, have a much higher proportion of paralogs, and evolve more quickly (Figure S7 in the SI).

Altogether, these evolutionary characteristics of PE3/4 genes provide new insight into the picture of gene and chromosome evolution. Under the pressure of requirements for genes with specific functions, some novel genes emerged through a series of duplication or shuffling events of chromosomal segments during evolution. However, these new genes tend to be expressed only in such extremely special states that we have not been able to detect them so far. Our work provides a successful example of how important discoveries can be obtained by combining biological theories with information from omics data29. Moreover, this study will be highly instructive for future proteomics and functional studies. For example, the evolutionary features of missing proteins can be used to predict their potential functions or expression characteristics12. Based on such predictions, researchers can select appropriate samples or conditions to identify the missing proteins. As another example, the combination of missing-protein information and evolutionary features can be used as detailed classification parameters, from which a more detailed relationship can be obtained between intrinsic protein properties and their specific abundance in given spatio-temporal states.
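The three property classes defined above act as simple threshold filters, and their overlaps correspond to the regions of a Venn diagram. The following is a minimal sketch under stated assumptions: the hydrophobicity (> 0.615) and isoelectric point (> 8.93) cutoffs are the values reported in the text, whereas the molecular-weight cutoff is left as a caller-supplied parameter because its value is not given here. The `Protein` record, the function names, and the toy values are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass
from itertools import combinations

HYDROPHOBICITY_CUTOFF = 0.615  # reported in the text
PI_CUTOFF = 8.93               # reported in the text


@dataclass(frozen=True)
class Protein:
    """Hypothetical record; the field names are illustrative only."""
    accession: str
    hydrophobicity: float
    mol_weight: float          # same unit as the mw_cutoff argument
    isoelectric_point: float


def special_classes(protein, mw_cutoff):
    """Return the names of the special property classes a protein falls into."""
    classes = set()
    if protein.hydrophobicity > HYDROPHOBICITY_CUTOFF:
        classes.add("high_hydrophobicity")
    if protein.mol_weight < mw_cutoff:  # cutoff value is an assumption of the caller
        classes.add("low_molecular_weight")
    if protein.isoelectric_point > PI_CUTOFF:
        classes.add("high_isoelectric_point")
    return classes


def class_membership(proteins, mw_cutoff):
    """Map each class name to the set of accessions that belong to it."""
    members = {
        "high_hydrophobicity": set(),
        "low_molecular_weight": set(),
        "high_isoelectric_point": set(),
    }
    for protein in proteins:
        for name in special_classes(protein, mw_cutoff):
            members[name].add(protein.accession)
    return members


def pairwise_overlaps(members):
    """Count the proteins shared by each pair of classes (Venn-style overlaps)."""
    return {
        (a, b): len(members[a] & members[b])
        for a, b in combinations(sorted(members), 2)
    }


# Toy data with made-up values; mw_cutoff=10.0 is a placeholder, not the paper's value.
toy_proteins = [
    Protein("P_toy1", hydrophobicity=0.70, mol_weight=50.0, isoelectric_point=6.0),
    Protein("P_toy2", hydrophobicity=0.10, mol_weight=8.0, isoelectric_point=9.5),
    Protein("P_toy3", hydrophobicity=0.20, mol_weight=60.0, isoelectric_point=7.0),
]
toy_members = class_membership(toy_proteins, mw_cutoff=10.0)
toy_overlaps = pairwise_overlaps(toy_members)
```

Small pairwise overlaps relative to the class sizes would be in line with the observation that the three properties contribute to detectability largely independently.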

Conclusion

In this study, we found that the missing-protein-encoding genes are not distributed evenly among chromosomes, chromosomal regions, and gene clusters. The PE3/4 genes tend to be young genes, located in chromosomal regions without homology among species and evolving at a higher rate. Interestingly, compared with OTCSGs, the PE3/4 genes have a much higher proportion of singletons, which may be due to the young age of these singletons. More importantly, PE3/4 genes tend to be the newly duplicated members of their PGGs (paralogous gene groups), and the genes included in these PGGs mainly participate in specialized biological functions, such as smell perception.


In addition, we set up criteria for proteins with extremely special physical-chemical properties based on a comparison of the distributions of physical-chemical property values between PE2 proteins and the proteins with protein-level evidence. Their evolutionary characteristics were also explored. Taking all the findings together, we can infer a model of the evolutionary features underlying the formation of missing-protein-encoding genes. In brief, the requirements of certain special biological functions resulted in specific OTCs/conditions for the expression of certain genes during evolution. These requirements acted as functional constraints that facilitated the emergence of missing-protein-encoding genes, through new origination or new duplication, with specific physicochemical properties.
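The distribution comparison behind these criteria is a repeated KS test between PE1 and PE2 proteins. The exact procedure is described in the Method section; the scan below is only one plausible reading of it and should not be taken as the authors' implementation. It assumes that candidate cutoffs are tried from high to low and that the cutoff is the largest value at which the two groups, restricted to proteins at or below it, no longer differ significantly (p >= 0.05). The function names, the candidate list, and the toy data are hypothetical. The p-value uses the standard asymptotic Kolmogorov series; a library routine such as `scipy.stats.ks_2samp` could be substituted.

```python
import math


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:  # step past every value equal to x
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_pvalue(d, n, m):
    """Asymptotic two-sided p-value (Kolmogorov series, small-sample correction)."""
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.3:  # the series converges slowly here and the p-value is ~1
        return 1.0
    total = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2) for k in range(1, 101)
    )
    return min(1.0, max(0.0, 2.0 * total))


def scan_cutoff(pe1, pe2, candidates, alpha=0.05):
    """Assumed procedure: return the largest candidate cutoff at which the
    values <= cutoff in both groups are not significantly different."""
    for cutoff in sorted(candidates, reverse=True):
        a = [v for v in pe1 if v <= cutoff]
        b = [v for v in pe2 if v <= cutoff]
        if not a or not b:
            continue
        if ks_pvalue(ks_statistic(a, b), len(a), len(b)) >= alpha:
            return cutoff
    return None


# Toy data: the second group equals the first plus an extra upper tail.
pe1_demo = [i / 100 for i in range(100)]
pe2_demo = pe1_demo + [2.0 + i / 100 for i in range(60)]
demo_cutoff = scan_cutoff(pe1_demo, pe2_demo, candidates=[3.0, 1.0])
print(demo_cutoff)
```

This version scans an upper tail, as would suit hydrophobicity and isoelectric point; for molecular weight, where the PE2 proteins are smaller, the same scan could be run on negated values.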


Supporting Information

This work contains a supplementary description of part of the results, 7 supplementary tables, and 7 supplementary figures. Table S1: Clusters containing more than 5 genes and at least 1 missing-protein-coding gene. Table S2: Detailed information on all the genes in paralogous gene groups (PGGs) (in the file “Supporting Information - Table S2, Table S6 and Table S7.xlsx”). Table S3: Orders of the genes’ duplication times. Table S4: List of PE3/4 genes originating from the latest common ancestor of Opisthokonta. Table S5: The number of OR genes in the neXtProt database. Table S6: The number of OR genes identified in the transcriptome, translatome, and proteome datasets of previous studies (in the file “Supporting Information - Table S2, Table S6 and Table S7.xlsx”). Table S7: Gene Ontology over/under-representation analysis of the proteins with extremely special physical-chemical properties (in the file “Supporting Information - Table S2, Table S6 and Table S7.xlsx”). Figure S1: Schematic diagram of the relative locations of gene and block. Figure S2: The percentages of the (missing-protein-encoding) genes close to the telomeres in short arms, the center of the centromere, and the telomeres in long arms among all (missing-protein-encoding) genes in each chromosome. Figure S3: The percentages of the two types of missing-protein-encoding genes close to the telomeres in short arms, the center of the centromere, and the telomeres in long arms among all genes in each region. Figure S4: Ages of singletons. Figure S5: Evolutionary characteristics of genes encoding PE3 and PE4 proteins. Figure S6: The cumulative probability curve of the difference values of the average order of PE3/4 genes and OTCSGs. Figure S7: Evolutionary characteristics of the proteins with extremely special physical-chemical properties. This material is available free of charge via the Internet at http://pubs.acs.org.


Glossary

Most recent common ancestor (MRCA): the most recent ancestral organism from which all genes (or other characters) of interest are derived. For example, the MRCA of human and mouse is an ancestral animal (the common ancestor of all Euarchontoglires) that lived about 65-90 million years ago.

Gene age: the time of origin of a gene. Most studies have simply used the MRCA of the species containing genes with similar sequences to represent the age of these genes.

Gene duplication: any duplication of a region of DNA that contains a gene. It is a major mechanism through which new genetic material is generated during molecular evolution. Common sources of gene duplication include ectopic homologous recombination, retrotransposition events, aneuploidy, polyploidy, and replication slippage.

Gene evolutionary rate: a measure of the dynamics of change in a lineage across many generations. For protein-coding genes, the ratio of non-synonymous substitutions (dN) to synonymous substitutions (dS) provides a glimpse of the selective forces driving the evolution of a protein-coding sequence. A higher dN/dS value represents a faster evolutionary rate and weaker purifying selection.

Orthology: the relation of homologous DNA sequences (or other biological characters) created by a speciation event at their MRCA. Sequences with this relation are called orthologs and are said to be orthologous.

Paralogy: the relation of homologous DNA sequences (or other biological characters) created by a duplication event at their MRCA. Sequences with this relation are called paralogs and are said to be paralogous.

Singleton: a gene without any paralogous genes.

Chromosome evolution: evolution of the shape, size, composition, number, and redundancy of chromosomes. During evolution, whole chromosomes or parts of them underwent complex processes of duplication, breakage, and rearrangement. Meanwhile, some chromosomal regions remained stable over long periods of evolution.

Multi-species homologous synteny blocks (msHSBs): the chromosomal regions that are stable over long periods of evolution and conserved across species.
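As a small illustration of the gene evolutionary rate entry above, the sketch below computes dN/dS for a few genes and ranks them from fastest to slowest evolving. It assumes that per-gene dN and dS estimates are already available (for example from a phylogenetic package such as PAML, reference 9); the function names and the toy values are hypothetical.

```python
def dn_ds_ratio(dn, ds):
    """Return dN/dS, or None when dS is zero and the ratio is undefined."""
    if dn < 0 or ds < 0:
        raise ValueError("substitution rates must be non-negative")
    if ds == 0:
        return None
    return dn / ds


def rank_by_evolutionary_rate(rates):
    """Sort gene IDs from fastest to slowest evolving by dN/dS.

    `rates` maps a gene ID to a (dN, dS) pair; genes with dS == 0 are skipped.
    """
    ratios = {}
    for gene, (dn, ds) in rates.items():
        omega = dn_ds_ratio(dn, ds)
        if omega is not None:
            ratios[gene] = omega
    return sorted(ratios, key=ratios.get, reverse=True)


# Made-up values for illustration only.
toy_rates = {
    "geneA": (0.02, 0.40),  # dN/dS = 0.05, strong purifying selection
    "geneB": (0.15, 0.30),  # dN/dS = 0.50, weaker purifying selection
    "geneC": (0.01, 0.0),   # dS = 0, ratio undefined, skipped
}
toy_ranking = rank_by_evolutionary_rate(toy_rates)
print(toy_ranking)
```

In this toy ranking, geneB comes first because its higher dN/dS indicates a faster evolutionary rate under weaker purifying selection.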


Acknowledgements

This work was partially supported by the International Science & Technology Cooperation Program of China (2014DFB30020, 2014DFB30010), the Chinese State Key Projects for Basic Research (2015CB910700, 2014CBA02001), and the Chinese State Key Laboratory of Proteomics (SKLP-O201404). We sincerely thank Shihua Zhang from the Academy of Mathematics and Systems Science (AMSS) of the Chinese Academy of Sciences (CAS); Gong Zhang from the College of Life Science and Technology, Jinan University; and Chengpu Zhang, Cheng Chang, Na Su, Pengbo Cao, and Feifei Guo from the Beijing Proteome Research Center (BPRC) for their valuable advice on data analysis.

References

1. Paik, Y. K.; Jeong, S. K.; Omenn, G. S.; Uhlen, M.; Hanash, S.; Cho, S. Y.; Lee, H. J.; Na, K.; Choi, E. Y.; Yan, F.; Zhang, F.; Zhang, Y.; Snyder, M.; Cheng, Y.; Chen, R.; Marko-Varga, G.; Deutsch, E. W.; Kim, H.; Kwon, J. Y.; Aebersold, R.; Bairoch, A.; Taylor, A. D.; Kim, K. Y.; Lee, E. Y.; Hochstrasser, D.; Legrain, P.; Hancock, W. S., The Chromosome-Centric Human Proteome Project for cataloging proteins encoded in the genome. Nat Biotechnol 2012, 30, (3), 221-3.

2. Lane, L.; Bairoch, A.; Beavis, R. C.; Deutsch, E. W.; Gaudet, P.; Lundberg, E.; Omenn, G. S., Metrics for the Human Proteome Project 2013–2014 and Strategies for Finding Missing Proteins. J Proteome Res 2014, 13, (1), 15-20.


3. Lane, L.; Argoud-Puy, G.; Britan, A.; Cusin, I.; Duek, P. D.; Evalet, O.; Gateau, A.; Gaudet, P.; Gleizes, A.; Masselot, A.; Zwahlen, C.; Bairoch, A., neXtProt: a knowledge platform for human proteins. Nucleic Acids Res 2012, 40, (Database issue), D76-83.

4. Gaudet, P.; Argoud-Puy, G.; Cusin, I.; Duek, P.; Evalet, O.; Gateau, A.; Gleizes, A.; Pereira, M.; Zahn-Zabal, M.; Zwahlen, C.; Bairoch, A.; Lane, L., neXtProt: organizing protein knowledge in the context of human proteome projects. J Proteome Res 2013, 12, (1), 293-8.

5. Guruceaga, E.; Sanchez del Pino, M. M.; Corrales, F. J.; Segura, V., Prediction of a missing protein expression map in the context of the human proteome project. J Proteome Res 2015, 14, (3), 1350-60.

6. Zhang, C.; Li, N.; Zhai, L.; Xu, S.; Liu, X.; Cui, Y.; Ma, J.; Han, M.; Jiang, J.; Yang, C.; Fan, F.; Li, L.; Qin, P.; Yu, Q.; Chang, C.; Su, N.; Zheng, J.; Zhang, T.; Wen, B.; Zhou, R.; Lin, L.; Lin, Z.; Zhou, B.; Zhang, Y.; Yan, G.; Liu, Y.; Yang, P.; Guo, K.; Gu, W.; Chen, Y.; Zhang, G.; He, Q. Y.; Wu, S.; Wang, T.; Shen, H.; Wang, Q.; Zhu, Y.; He, F.; Xu, P., Systematic analysis of missing proteins provides clues to help define all of the protein-coding genes on human chromosome 1. J Proteome Res 2014, 13, (1), 114-25.

7. Uhlén, M.; Fagerberg, L.; Hallström, B. M.; Lindskog, C.; Oksvold, P.; Mardinoglu, A.; Sivertsson, Å.; Kampf, C.; Sjöstedt, E.; Asplund, A.; Olsson, I.; Edlund, K.; Lundberg, E.; Navani, S.; Szigyarto, C. A.-K.; Odeberg, J.; Djureinovic, D.; Takanen, J. O.; Hober, S.; Alm, T.; Edqvist, P.-H.; Berling, H.; Tegel, H.; Mulder, J.; Rockberg, J.; Nilsson, P.; Schwenk, J. M.; Hamsten, M.; von Feilitzen, K.; Forsberg, M.; Persson, L.; Johansson, F.; Zwahlen, M.; von Heijne, G.; Nielsen, J.; Pontén, F., Tissue-based map of the human proteome. Science 2015, 347, (6220).


8. Cunningham, F.; Amode, M. R.; Barrell, D.; Beal, K.; Billis, K.; Brent, S.; Carvalho-Silva, D.; Clapham, P.; Coates, G.; Fitzgerald, S.; Gil, L.; Giron, C. G.; Gordon, L.; Hourlier, T.; Hunt, S. E.; Janacek, S. H.; Johnson, N.; Juettemann, T.; Kahari, A. K.; Keenan, S.; Martin, F. J.; Maurel, T.; McLaren, W.; Murphy, D. N.; Nag, R.; Overduin, B.; Parker, A.; Patricio, M.; Perry, E.; Pignatelli, M.; Riat, H. S.; Sheppard, D.; Taylor, K.; Thormann, A.; Vullo, A.; Wilder, S. P.; Zadissa, A.; Aken, B. L.; Birney, E.; Harrow, J.; Kinsella, R.; Muffato, M.; Ruffier, M.; Searle, S. M.; Spudich, G.; Trevanion, S. J.; Yates, A.; Zerbino, D. R.; Flicek, P., Ensembl 2015. Nucleic Acids Res 2015, 43, (Database issue), D662-9.

9. Yang, Z., PAML: a program package for phylogenetic analysis by maximum likelihood. Computer applications in the biosciences: CABIOS 1997, 13, (5), 555-556.

10. Vilella, A. J.; Severin, J.; Ureta-Vidal, A.; Heng, L.; Durbin, R.; Birney, E., EnsemblCompara GeneTrees: Complete, duplication-aware phylogenetic trees in vertebrates. Genome Res 2009, 19, (2), 327-35.

11. Long, M.; Betran, E.; Thornton, K.; Wang, W., The origin of new genes: glimpses from the young and old. Nat Rev Genet 2003, 4, (11), 865-75.

12. Capra, J. A.; Stolzer, M.; Durand, D.; Pollard, K. S., How old is my gene? Trends Genet 2013, 29, (11), 659-68.

13. Murphy, W. J.; Larkin, D. M.; Everts-van der Wind, A.; Bourque, G.; Tesler, G.; Auvil, L.; Beever, J. E.; Chowdhary, B. P.; Galibert, F.; Gatzke, L.; Hitte, C.; Meyers, S. N.; Milan, D.; Ostrander, E. A.; Pape, G.; Parker, H. G.; Raudsepp, T.; Rogatcheva, M. B.; Schook, L. B.; Skow, L. C.; Welge, M.; Womack, J. E.; O'Brien, S. J.; Pevzner, P. A.; Lewin, H. A., Dynamics of mammalian chromosome evolution inferred from multispecies comparative maps. Science 2005, 309, (5734), 613-7.


14. Larkin, D. M.; Pape, G.; Donthu, R.; Auvil, L.; Welge, M.; Lewin, H. A., Breakpoint regions and homologous synteny blocks in chromosomes have different evolutionary histories. Genome Res 2009, 19, (5), 770-7.

15. Huang, D. W.; Sherman, B. T.; Lempicki, R. A., Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 2009, 4, (1), 44-57.

16. Milinkovitch, M. C.; Helaers, R.; Tzika, A. C., Historical constraints on vertebrate genome evolution. Genome Biol Evol 2010, 2, 13-8.

17. Yang, D.; Zhong, F.; Li, D.; Liu, Z.; Wei, H.; Jiang, Y.; He, F., General Trends in the Utilization of Structural Factors Contributing to Biological Complexity. Molecular Biology and Evolution 2012, 29, (8), 1957-1968.

18. Freilich, S.; Massingham, T.; Blanc, E.; Goldovsky, L.; Thornton, J. M., Relating tissue specialization to the differentiation of expression of singleton and duplicate mouse proteins. Genome Biol 2006, 7, (10), R89.

19. McDonald, J. H.; Kreitman, M., Adaptive protein evolution at the Adh locus in Drosophila. Nature 1991, 351, (6328), 652-654.

20. Bustamante, C. D.; Fledel-Alon, A.; Williamson, S.; Nielsen, R.; Todd Hubisz, M.; Glanowski, S.; Tanenbaum, D. M.; White, T. J.; Sninsky, J. J.; Hernandez, R. D.; Civello, D.; Adams, M. D.; Cargill, M.; Clark, A. G., Natural selection on protein-coding genes in the human genome. Nature 2005, 437, (7062), 1153-1157.

21. Huminiecki, L.; Wolfe, K. H., Divergence of spatial gene expression profiles following species-specific gene duplications in human and mouse. Genome Res 2004, 14, (10a), 1870-1879.


22. Quesada, V.; Díaz-Perales, A.; Gutiérrez-Fernández, A.; Garabaya, C.; Cal, S.; López-Otín, C., Cloning and enzymatic analysis of 22 novel human ubiquitin-specific proteases. Biochemical and Biophysical Research Communications 2004, 314, (1), 54-62.

23. Saitoh, Y.; Miyamoto, N.; Okada, T.; Gondo, Y.; Showguchi-Miyata, J.; Hadano, S.; Ikeda, J.-E., The RS447 Human Megasatellite Tandem Repetitive Sequence Encodes a Novel Deubiquitinating Enzyme with a Functional Promoter. Genomics 2000, 67, (3), 291-300.

24. Jaiswal, S. K.; Kumar, A.; Ali, A.; Rai, A. K., Co-occurrence of mosaic supernumerary isochromosome 18p and intermittent 2q13 deletions in a child with multiple congenital anomalies. Gene 2015, 559, (1), 94-8.

25. Helgeland, H.; Sandve, S. R.; Torgersen, J. S.; Halle, M. K.; Sundvold, H.; Omholt, S.; Vage, D. I., The evolution and functional divergence of the beta-carotene oxygenase gene family in teleost fish--exemplified by Atlantic salmon. Gene 2014, 543, (2), 268-74.

26. Rytkonen, K. T.; Prokkola, J. M.; Salonen, V.; Nikinmaa, M., Transcriptional divergence of the duplicated hypoxia-inducible factor alpha genes in zebrafish. Gene 2014, 541, (1), 60-6.

27. Kassahn, K. S.; Dang, V. T.; Wilkins, S. J.; Perkins, A. C.; Ragan, M. A., Evolution of gene function and regulatory control after whole-genome duplication: Comparative analyses in vertebrates. Genome Res 2009, 19, (8), 1404-1418.

28. Chandramouli, K.; Qian, P.-Y., Proteomics: Challenges, Techniques and Possibilities to Overcome Biological Sample Complexity. Human Genomics and Proteomics 2009, 1, (1).

29. He, F., Lifeomics leads the age of grand discoveries. Science China Life Sciences 2013, 56, (3), 201-212.


Tables

Table 1. The tendency of gene adjacency on chromosomes

Adjacent gene pair category   Gene set                              Gene number   Adjacent gene number   Percentage (%)   P-value (O/U)
ALL&ALL                       All genes in neXtProt excluding PE5   19270         3082                   15.99            -
PE2&ALL                       PE2 genes                             2615          456                    17.44            0.0043 (O)
PE2&PE2                       PE2 genes                             2615          185                    7.07             3.31E-48 (U)
PE3/4&ALL                     PE3/4 genes                           260           67                     25.77            3.27E-5 (O)
PE3/4&PE3/4                   PE3/4 genes                           260           37                     14.23            0.2464 (U)

ALL: All genes in the neXtProt database except the genes of PE5.
PE2: The genes encoding PE2 proteins.
PE3/4: The genes encoding PE3 and PE4 proteins.
O: This type of adjacent gene pair is over-represented compared with all genes.
U: This type of adjacent gene pair is under-represented compared with all genes.
The cutoff value for the distance of contiguous gene pairs was 15953 bp.
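The percentages in Table 1 can be checked with simple arithmetic. Judging from the values themselves, each percentage appears to be the adjacent gene number divided by the gene number of the corresponding gene set; this reading is inferred from the numbers rather than stated in the table. The sketch below recomputes the column under that assumption. The variable and function names are hypothetical, and the P-values are not recomputed because the statistical test is not specified in the table.

```python
# Values transcribed from Table 1.
GENE_NUMBERS = {"ALL": 19270, "PE2": 2615, "PE3/4": 260}

# (pair category, gene set assumed to be the denominator, adjacent gene number)
ADJACENT_COUNTS = [
    ("ALL&ALL", "ALL", 3082),
    ("PE2&ALL", "PE2", 456),
    ("PE2&PE2", "PE2", 185),
    ("PE3/4&ALL", "PE3/4", 67),
    ("PE3/4&PE3/4", "PE3/4", 37),
]


def adjacency_percentages(gene_numbers, adjacent_counts):
    """Return {pair category: percentage of the gene set that is adjacent}."""
    return {
        category: round(100.0 * count / gene_numbers[gene_set], 2)
        for category, gene_set, count in adjacent_counts
    }


percentages = adjacency_percentages(GENE_NUMBERS, ADJACENT_COUNTS)
for category, value in percentages.items():
    print(f"{category:12s} {value:6.2f}")
```

Under this assumption the recomputed values reproduce the Percentage (%) column: 15.99, 17.44, 7.07, 25.77, and 14.23.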


Figure Legends

Figure 1. Chromosomal distribution of the genes encoding missing proteins. (A) Multi-species homologous synteny blocks (msHSBs) are shown as blue boxes. The centromere regions of the chromosomes are represented by black boxes. PE2 denotes the genes encoding PE2 proteins, and PE3/4 denotes the genes encoding PE3 and PE4 proteins. PE2 genes and PE3/4 genes are marked with green and red lines, respectively. The number of all protein-coding genes (PCGs) is shown at the end of each chromosome. The percentages of PE2 genes and PE3/4 genes in each chromosome (B) and the percentages of genes outside msHSBs among all PCGs, PE2 genes, and PE3/4 genes (C) are shown as horizontal bar plots.

Figure 2. Evolutionary characteristics of PE3/4 genes. (A) Ages of PE3/4 genes. All the protein-coding genes of H. sapiens are divided into 5 groups according to their time of evolutionary origin: genes originating from the common ancestor of Opisthokonta (old), Bilateria, Chordata, Euteleostomi, and Mammalia (young). (B) Chromosomal evolutionary characteristics of PE3/4 genes. The percentages of genes with different distributions among msHSBs are shown in different colors. (C) The percentage of paralogs in OTCSGs and PE3/4 genes. (D) Evolutionary rate of PE3/4 genes. In each box, the central mark is the median, the edges of the box are the 25th and 75th percentiles (q1 and q3), the whiskers extend to the most extreme data points not considered outliers, and outliers are plotted individually (red points).

Figure 3. The gene duplication events of PE3/4 genes. (A) Percentage of PE3/4 genes in each PGG (paralogous gene group). Each red point represents a PGG containing PE3/4 genes. (B) The cumulative probability curve of the difference values of the average order of PE3/4 genes and other genes in PGGs. If the difference value > 0, the PE3/4 genes duplicated later than the others in the same PGG. Otherwise, if the difference value