Article pubs.acs.org/jpr

RNA- and Antibody-Based Profiling of the Human Proteome with Focus on Chromosome 19

Charlotte Stadler,† Linn Fagerberg,† Åsa Sivertsson,† Per Oksvold,† Martin Zwahlen,† Björn M. Hallström,† Emma Lundberg,† and Mathias Uhlén*,†,‡

† Science for Life Laboratory, KTH - Royal Institute of Technology, SE-171 21 Stockholm, Sweden
‡ Department of Proteomics, School of Biotechnology, AlbaNova University Center, Royal Institute of Technology, Stockholm, Sweden



Supporting Information

ABSTRACT: An important part of the Human Proteome Project is to characterize the protein complement of the genome with antibody-based profiling. Within the framework of this effort, a new version 12 of the Human Protein Atlas (www.proteinatlas.org) has been launched, including transcriptomics data for 27 tissues and 44 cell lines to complement the protein expression data from antibody-based profiling. Besides the extensive addition of transcriptomics data, the Human Protein Atlas now contains antibody-based protein profiles for 82% of the 20 329 putative protein-coding genes. The comprehensive data resulting from RNA-seq analysis and antibody-based profiling performed within the Human Protein Atlas as well as information from UniProt were used to generate evidence summary scores for each of the 20 329 genes, of which 94% now have experimental evidence at least at transcript level. The evidence scores for all individual genes are displayed with regard to both RNA- and antibody-based protein profiles, including chromosome-centric visualizations. An analysis of the human chromosome 19 shows that ∼43% of the genes are expressed at the transcript level in all 27 tissues analyzed, suggesting a “house-keeping” function, while 12% of the genes show a more tissue-specific pattern with enriched expression in one of the analyzed tissues only.

KEYWORDS: antibodies, Human Proteome Project, Human Protein Atlas, immunohistochemistry, immunofluorescence, RNA deep sequencing, tissue profiling, transcriptomics, chromosome 19



INTRODUCTION

Recently, the Human Proteome Project (HPP) was launched with the overall aim of characterizing the protein complement of the human genome.1,2 A working plan has been laid out with a foundation of three pillars: mass spectrometry (MS), antibodies (Abs), and the Knowledgebase (KB).3 One of the greatest challenges in the overall task of collecting knowledge about human gene products is not only the vast number of protein isoforms and post-translational modifications but also the complexity in space and time as a result of stable and transient protein interactions, developmental stages, and cell cycle progression. With this in mind, it is unlikely that a complete proteome can be covered under a single set of conditions and with a single technology, and this is one of the reasons for combining antibodies and mass spectrometry in the analysis of the human proteome. The objective of the Knowledgebase pillar is to gather information on human proteins, such as HPP-generated data, disease-related information, chromosomal location, the number of annotated protein variants, splice isoforms and PTMs, and potential 3D structures.4,5 In addition, the HPP has defined two orthogonal groups on top of the pillars with a Biology/Disease Human Proteome Project (B/D-HPP) and a Chromosome-based Human Proteome Project (C-HPP).1,6 The aim of the C-HPP initiative is to organize a chromosome-centric project, in which every chromosome is curated by country-defined teams with the goal to identify at least one representative protein for every protein-coding human gene.1,2

One of the projects in the antibody pillar is the Human Protein Atlas program (HPA), where the human proteome is systematically explored using an antibody-based approach.7 Since the project started 10 years ago, the HPA project has generated over 50 000 polyclonal antibodies targeting almost 95% (n = 19 100) of the human protein-coding genes, and many of these antibodies have been used to generate validated protein expression profiles now covering 82% of these genes. Recently, we reported on the introduction of transcriptomics data derived from 11 functionally different human cell lines and how this contributed to increasing the number of detected genes.8 Here version 12 of the HPA portal is described, in which transcriptomics data have been generated for a broader panel of 44 cell lines in total and also 27 different tissues.9 For each gene, the RNA expression level detected in each of the cell lines and tissues analyzed is shown as well as the antibody-based protein expression in human organs, tissues, and cells. Furthermore, the overall structure of the Protein Atlas has now been rebuilt to divide all data into four separate subatlases: a normal tissue atlas, a cancer atlas, a cell line atlas, and a subcellular atlas.

Received: November 22, 2013
Published: March 2, 2014
© 2014 American Chemical Society

Journal of Proteome Research

dx.doi.org/10.1021/pr401156g | J. Proteome Res. 2014, 13, 2019−2027



expressed preferentially at the C-terminus of the target protein, as previously shown to give more accurate protein localizations.15 Cells were grown in DMEM with 10% FBS and 1% Geneticin (product number 10131019, Life Technologies) to maintain expression of the GFP-tagged target protein. Cells were seeded and fixed using the standard protocol based on cross-linking with 4% PFA and permeabilized with 0.1% Triton X-100.16 The entire staining and imaging procedure was done according to the standard protocol, with the exception that the marker for the endoplasmic reticulum was replaced by an anti-GFP chicken polyclonal antibody (ab13970, Abcam), used at a concentration of 5 ng/μL to enhance GFP signals. For each GFP-expressing cell line, detection of the corresponding endogenous target protein was done with a rabbit polyclonal antibody generated within the HPA project (HPA029143 for LTN1, HPA044833 for U2AF1, and HPA012058 for C7ORF58) at a standard concentration of 2 ng/μL. Secondary antibodies donkey anti-chicken Alexa488, goat anti-mouse Alexa647, and goat anti-rabbit Alexa555 (all from Invitrogen) were used at a concentration of 2.5 ng/μL for detection of GFP, tubulin, and the endogenous target protein, respectively.

EXPERIMENTAL SECTION

Transcript Profiling (RNA-seq)

RNA sequencing of the 44 cell lines was performed as previously described, using Illumina HiSeq 2000 and the standard Illumina RNA-seq protocol.8,10 The analysis of RNA-seq data from a total of 95 samples from 27 tissues has been previously described.9 In brief, raw reads were mapped to the human genome, and FPKM (fragments per kilobase of exon model per million mapped reads)11 values were calculated as quantification scores for all genes. The average FPKM value of all individual samples of a tissue or cell line was used to estimate the gene expression level, and a cutoff value of 1 FPKM was used as a limit for detection.
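The quantification step can be illustrated with a minimal sketch. It assumes the standard FPKM definition (fragments, scaled per kilobase of exon model and per million mapped fragments); the function names are hypothetical, and treating a mean of exactly 1 FPKM as detected is an assumption, since the text gives only the cutoff value.

```python
from statistics import mean

DETECTION_CUTOFF = 1.0  # FPKM limit for detection, as stated in the text


def fpkm(fragments, exon_length_bp, total_mapped_fragments):
    """Fragments per kilobase of exon model per million mapped fragments."""
    return fragments * 1e9 / (exon_length_bp * total_mapped_fragments)


def expression_level(sample_fpkms):
    """Average FPKM over all individual samples of one tissue or cell line."""
    return mean(sample_fpkms)


def is_detected(sample_fpkms, cutoff=DETECTION_CUTOFF):
    """Assumption: a mean FPKM equal to the cutoff counts as detected."""
    return expression_level(sample_fpkms) >= cutoff


# Hypothetical gene: 500 fragments on a 2 kb exon model, 20 million mapped fragments
example_value = fpkm(500, 2000, 20_000_000)
```

For the hypothetical numbers above, the value works out to 12.5 FPKM, which is well above the detection limit.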

UniProt Evidence Scores

The UniProt protein existence data were assigned to classes as follows: evidence at protein level (class 1), evidence at transcript level (class 2), inferred from homology (class 3), predicted (class 4), and uncertain (class 5). The UniProt protein IDs were mapped to gene identifiers based on Ensembl build 73.17

RNA Specificity Classification

Each of the 20 314 genes with RNA data was classified into one of eight categories based on the FPKM levels in 27 tissues: (1) “not detected”: FPKM < 1 in all 27 tissues; (2) “highly tissue-enriched”: a 50-fold higher FPKM level in one tissue compared with all other tissues; (3) “moderately tissue-enriched”: a five-fold higher FPKM level in one tissue compared with all other tissues; (4) “group-enriched”: a five-fold higher average FPKM level in a group of two to seven tissues compared with all other tissues; (5) “mixed low”: detected in 1−26 tissues with FPKM < 10 in at least one of the detected tissues; (6) “mixed high”: detected in 1−26 tissues with FPKM > 10 in all detected tissues; (7) “expressed in all low”: detected in all 27 tissues with FPKM < 10 in at least one tissue; and (8) “expressed in all high”: detected in all 27 tissues with FPKM > 10 in all tissues.
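The eight categories can be sketched as a decision procedure. This is an illustration under stated assumptions, not the HPA pipeline: the function name is hypothetical; the categories are tested in the listed order; fold differences are measured against the maximum of the remaining tissues; the "group" is taken as the k highest-expressing tissues for k = 2 to 7; and values exactly at a cutoff (1 or 10 FPKM) are handled as noted in the comments, because the text does not specify them.

```python
def classify_rna_specificity(fpkm_by_tissue, detect=1.0, high=10.0):
    """Assign one of the eight RNA specificity categories to a gene.

    fpkm_by_tissue maps tissue name -> mean FPKM (27 tissues in the paper).
    """
    values = sorted(fpkm_by_tissue.values(), reverse=True)
    n = len(values)
    # Assumption: FPKM equal to the detection cutoff counts as detected.
    detected = [v for v in values if v >= detect]
    if not detected:
        return "not detected"

    top = values[0]
    second = values[1] if n > 1 else 0.0
    # Fold difference relative to the maximum of all other tissues.
    if top >= 50 * second:
        return "highly tissue-enriched"
    if top >= 5 * second:
        return "moderately tissue-enriched"

    # Assumption: a group is the k highest tissues, k = 2..7.
    for k in range(2, 8):
        group, rest = values[:k], values[k:]
        if rest and sum(group) / k >= 5 * rest[0]:
            return "group-enriched"

    # Assumption: FPKM of exactly 10 is treated as "low".
    all_high = all(v > high for v in detected)
    if len(detected) < n:
        return "mixed high" if all_high else "mixed low"
    return "expressed in all high" if all_high else "expressed in all low"
```

Comparing `top` with a multiple of `second` rather than dividing avoids a division by zero when a gene is expressed in a single tissue only.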

RNA Evidence Scores

For each gene, the mean FPKM value for all replicate samples of a cell line or tissue was used to estimate and classify the abundance into four categories. For the 27 tissues, the abundance categories were based on the following criteria: “high”: 50 < FPKM, “medium”: 10 < FPKM < 50, “low”: 1 < FPKM < 10, and “not detected”: FPKM < 1. For the 44 cell lines, slightly different cutoffs were used for the “medium” and “low” categories: 20 < FPKM < 50 and 1 < FPKM < 20, respectively. The total RNA evidence score for a gene was calculated as the maximum abundance category across all tissues and cell lines.
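A short sketch of this scoring is given below. The names are hypothetical, and assigning a value that falls exactly on a cutoff to the higher category is an assumption, since the published criteria use strict inequalities on both sides.

```python
LEVELS = ["not detected", "low", "medium", "high"]  # ordered from lowest to highest


def abundance_category(fpkm, sample_type):
    """Classify a mean FPKM value; sample_type is "tissue" or "cell line"."""
    # Tissues use 10 FPKM and cell lines 20 FPKM as the low/medium cutoff.
    medium_cutoff = 10.0 if sample_type == "tissue" else 20.0
    if fpkm < 1.0:
        return "not detected"
    if fpkm < medium_cutoff:
        return "low"
    if fpkm < 50.0:
        return "medium"
    return "high"


def rna_evidence_score(tissue_fpkms, cell_line_fpkms):
    """Maximum abundance category across all tissues and cell lines."""
    categories = [abundance_category(v, "tissue") for v in tissue_fpkms.values()]
    categories += [abundance_category(v, "cell line") for v in cell_line_fpkms.values()]
    return max(categories, key=LEVELS.index, default="not detected")
```

With these cutoffs, a gene at 15 FPKM is "medium" in a tissue but "low" in a cell line, so the same transcript level can contribute differently to the total score depending on the sample type.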

Tissue Profiling

Tissue microarrays (TMAs) containing triplicate 1 mm cores of 46 different types of normal tissue and duplicate 1 mm cores of 216 different cancer tissues representing the 20 most common forms of human cancer were generated as previously described.12 All of the tissue samples used are from the archives at the Department of Pathology of Uppsala University Hospital in agreement with approval from the Research Ethics Committee at Uppsala University (Uppsala, Sweden) (Ups 02−577). Automated IHC was performed as previously described using an Autostainer 480 instrument (Lab Vision).13 After incubation with primary antibodies for 10 min, slides were counterstained with hematoxylin. As negative controls, slides were incubated with PBS only. Images of the entire slide were acquired using an Aperio ScanScope XT Slide Scanner (Aperio Technologies, Vista, CA) equipped with a 20× objective, and the images of all normal and cancer tissues were manually evaluated and scored by certified pathologists.

HPA Evidence Scores

The HPA evidence scores are calculated from manual curation of Western blot results and the tissue and subcellular staining congruence with literature and categorized as high, medium, low, or very low. The HPA evidence scores have previously been described in detail by Fagerberg et al.8

Evidence Summary Scores

The overall evidence summary score for each gene is categorized as high, medium, low, RNA only, or none, as previously described for the HPA version 11.8 In brief, it is based on (1) summarized evidence scores from UniProt, (2) RNA evidence according to the results obtained from the transcriptomics data generated within the HPA, and (3) the HPA evidence, based on antibody profiles. Evidence summary “high” and “medium” refer to gene products with protein evidence in UniProt or HPA evidence categorized as “high”. Evidence summary “low” refers to genes with HPA evidence “medium” and evidence at transcript level or lower in UniProt. Evidence summary “RNA only” refers to genes with RNA evidence “high” or UniProt evidence class 2, whereas genes with evidence summary “none” have RNA evidence “medium” or lower but lack evidence at the transcript and protein levels in UniProt. To correlate the gene evidence summary levels to the definitions of protein evidence set up by the HPP consortium, the evidence summary scores “high” and “medium” correlate to the score “supportive”, while evidence summary scores “low” and “RNA only” correlate to “uncertain”.

Subcellular Profiling and GFP Colocalization Experiments

All GFP-expressing cell lines used for IF colocalization experiments were generated in the lab of Dr. Anthony Hyman, Max Planck Institute, Dresden, Germany.14 All transfected cells were HeLa Kyoto cells, and the GFP tag was

Figure 1. Schematic overview of the new structure of the Human Protein Atlas 12 with the division of data into four subatlases: the tissue, cancer, cell line, and subcell atlas. The atlas also contains RNA expression profiles from 27 tissues and 44 cell lines.

Figure 2. Status of antibody-based profiling for each chromosome showing the fraction of genes with (A) tissue profiles based on IHC and (B) subcellular profiles based on IF. The reliability categories “Supportive” and “Uncertain” are based on the antibody staining congruence with literature as well as the degree of staining similarity when several antibodies have been used to profile the same gene.

RESULTS

New Structure of the Human Protein Atlas

All antibodies generated within the Human Protein Atlas are tested in a variety of applications both to evaluate the specificity and to generate the expression profiles of the target protein in tissues and cells. This includes protein arrays to test for antibody recognition of the antigen, Western blot, immunohistochemistry (IHC) for expression profiling in normal and cancer tissue samples,12,18 and immunofluorescence (IF) for subcellular protein profiling.19 The new version 12 of the Human Protein Atlas has been separated into four subatlases as follows: the Normal Tissue Atlas, containing images and protein profiles based on normal tissue samples and RNA expression levels in 27 of the 46 tissues; the Cancer Tissue Atlas, with protein profiles from 20 different tumor types; the Cell Line Atlas, containing all information about IHC stainings in 46 human cancer cell lines and the RNA expression levels for each protein in 44 of the 46 cell lines used; and the Subcellular Atlas, covering IF-based subcellular protein profiles in selected cell lines based on RNA expression levels. In addition, subcellular protein profiling of 994 mouse genes (corresponding to 1043 antibodies) has been performed in the mouse cell line NIH 3T3 and published in the new Subcellular Atlas. An overview of the new Atlas structure is shown in Figure 1. In total, the new Human Protein Atlas contains expression profiles at the tissue or subcellular level based on 21 984 antibodies toward 16 621 genes, which is an increase of more than 1500 genes and 3000 antibodies compared with the previous version of the atlas. Among the 16 621 genes with antibody profiles, 10 967 are defined as “supportive”, taking into account only antibodies with HPA evidence of high and medium reliability, according to the latest HPP metrics guidelines.20

Table 1. IHC and IF Reliability Scores

IHC
chromosome  supportive  uncertain  no_data
1           455         1196       424
2           242         811        225
3           222         689        162
4           149         481        137
5           155         582        152
6           273         606        173
7           200         562        155
8           141         393        167
9           161         482        162
10          149         499        122
11          259         739        319
12          204         624        236
13          65          211        52
14          104         416        128
15          114         383        118
16          172         554        151
17          287         669        249
18          70          168        50
19          255         876        349
20          106         342        109
21          60          119        62
22          93          280        80
X           167         545        121
Y           15          19         20
total       4118        12246      3923
total number of genes: 20 287

IF
chromosome  supportive  uncertain  no_data
1           358         373        1344
2           120         129        521
3           200         200        917
4           199         199        666
5           66          59         203
6           116         125        407
7           99          129        387
8           154         166        557
9           227         197        781
10          47          63         178
11          197         264        1019
12          193         265        820
13          110         78         369
14          35          42         164
15          79          82         292
16          199         219        655
17          118         148        501
18          156         158        575
19          179         197        676
20          161         186        570
21          98          123        480
22          131         156        518
X           0           1          53
Y           105         144        584
total       3347        3703       13237
total number of genes: 20 287

Chromosome Centric View of Antibody-Based Protein Profiles

The contribution to the C-HPP project from the Protein Atlas program is to make protein expression profiles based on antibodies available for the research community. In Figure 2, the number of genes with antibody-based expression profiles and the reliability scores for the annotated protein expressions in tissues using IHC (A) and at the subcellular level using IF (B) are shown for each chromosome. The total number of genes with IHC profiles is now 16 621 based on 21 984 antibodies, while 7054 genes have subcellular profiles, based on 9120 antibodies in total. The number of genes with supportive reliability score is 4118 for IHC and 3347 for IF. In Table 1, the number of genes within the different reliability categories is shown for each chromosome. To get a supportive score, the observed staining pattern must be supported by extensive literature or by the staining pattern of an additional antibody targeting the same protein, whereas limited literature or differences in staining patterns for profiles based on several antibodies will result in the score “uncertain”. Compared with the previous version of the Protein Atlas, the number of genes with IF profiles has been reduced (7054 as compared with 12 040). The lower number of published genes is a result of stricter publication criteria, in which IF results from many cell lines with low gene expression levels as determined by the transcriptomics data have been excluded to be reanalyzed in cell lines showing higher gene expression. Although the number of published genes with IF profiles has decreased, the fraction of published genes with high reliability score has increased, as 47% are now scored as supportive, compared with 24% in the previous version of the Protein Atlas.

Evidence Summary and the Contribution of RNA Profiling

Recently we introduced the chromosome-centric visualization tool displaying all genes with an evidence summary score based on manual annotations from the UniProt Consortium (UniProt evidence),21,22 antibody-based data generated within the HPA program (HPA evidence), and transcriptomics data from selected cell lines analyzed within the HPA project (RNA evidence).8 The data behind the evidence summary scores have been updated and expanded and now include transcriptomics data of 27 tissues and 44 cell lines in total.9 The expansion of the RNA expression panel, updates in UniProt, and the continued work in generating antibody-based protein profiles have contributed to providing gene expression evidence for 94% of the 20 287 human protein-coding genes mapped to any of the 24 chromosomes and evidence at the protein level (scored as high, medium, or low) for 77% of the genes (Figure 3A,B and Supplementary Table S1 in the Supporting Information).

On the basis of transcript data, it can be concluded that the number of detected genes is similar in most cell lines, ranging from 10 151 genes in the lymphoid cell line U-698 to 12 587 genes in the osteosarcoma cell line U-2 OS (Supplementary Table S2 in the Supporting Information). A majority of these genes are found in most cell lines, while only a few are cell-type-specific. This is in concordance with our previous studies where protein expression was evaluated in three functionally different cell lines.10,23 This means that a relatively small increase in detected genes is achieved by a quite extensive increase in the number of analyzed cell types. The new panel of 44 cell lines allows for detection of 86% of all genes at the transcript level as compared with 76% in the previously published panel of 11 cell lines, using the same cutoff of FPKM > 1. However, as the RNA abundance for many proteins varies between different cell lines, a larger panel of cell lines increases the chances of detecting a certain protein. This has been evident for the subcellular protein profiling, where the three cell lines showing the highest RNA expression level for the target protein are now selected for IF analysis.24

Figure 3. Evidence summary and contribution of RNA data. (A) Chromosome-centric evidence summary status based on HPA evidence, RNA evidence, and UniProt evidence. The x axis refers to the number of genes. (B) Pie chart of the distribution of evidence summary scores for all protein coding genes. (C) Venn diagram displaying the overlap of the number of detected transcripts in the 27 tissues and 44 cell lines.

Slightly higher numbers of detected genes have been found in the panel of 27 tissues, where 18 438 genes were detected in at least one of the represented tissues. Among the genes detected in the tissue panel, 1367 genes are exclusively detected in the tissues while not detected in any of the 44 cell lines (Figure 3C). Similarly, 420 genes detected in the cell line panel could not be found in any of the tissues. A Gene Ontology-based analysis of the genes detected uniquely in tissues shows three major groups of enriched genes: (1) genes involved in epidermal cell differentiation, (2) genes encoding extracellular components, and (3) testis-related genes involved in sexual reproduction, spermatogenesis, and male gamete generation (data not shown). The third group reflects the high number of testis-specific genes and implies large differences in gene expression between testis and the testis-derived cell line NTERA-2. Besides the testis-related genes, the detected genes found uniquely in tissues or cell lines could not be mapped to any particular tissue or cell line. For genes detected uniquely in cell lines, enrichment categories include anatomical structure development, system development, and developmental processes (data not shown). Most of these genes show low RNA expression levels (FPKM < 10), which can explain why they were not detected in any of the tissues. Another explanation could be the in vitro adaptation of cell lines or the fact that most cell lines are malignantly transformed cells.
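The counts reported for the tissue and cell line panels determine the remaining regions of the Venn diagram, which can be worked out with simple set arithmetic. The sketch below uses only the three counts stated in the text plus the 20 314 genes with RNA data mentioned in the Experimental Section; the variable names are hypothetical, and the choice of 20 314 as the denominator for the percentage is an assumption.

```python
# Counts stated in the text
tissue_detected = 18_438   # detected in at least one of the 27 tissues
tissue_only = 1_367        # detected in tissues but in none of the 44 cell lines
cell_line_only = 420       # detected in cell lines but in none of the tissues

# Derived regions of the two-set Venn diagram
shared = tissue_detected - tissue_only            # detected in both panels
cell_line_detected = shared + cell_line_only      # detected in at least one cell line
detected_in_either = tissue_detected + cell_line_only

# Assumption: percentage taken relative to the 20 314 genes with RNA data
genes_with_rna_data = 20_314
cell_line_percent = round(100 * cell_line_detected / genes_with_rna_data)

print(shared, cell_line_detected, detected_in_either, cell_line_percent)
```

Under these assumptions the derived cell line total of 17 491 genes corresponds to 86% of the genes with RNA data, which agrees with the fraction reported for the panel of 44 cell lines.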


Figure 4. Transcript profiling of chromosome 19. (A) Distribution of RNA evidence scores based on transcript abundance in 44 cell lines and 27 tissues for the genes on chromosome 19. (B) RNA classification of the genes detected in the 27 tissues encoded by chromosome 19. (C) Illustrative examples of genes with tissue-specific expression in testis, liver, placenta, and bone marrow, as detected by IHC and transcript abundance.

Progress of Protein Evidence on Chromosome 19

Within the framework of the C-HPP project, chromosome 19 has been assigned to Sweden; therefore, a more in-depth analysis of the status of proteins encoded by this chromosome has been done. The RNA evidence for the genes encoded by chromosome 19 (n = 1480) shows that 94% of the genes have been detected in at least one of the tissues or cell lines analyzed (Figure 4A) and that approximately three-quarters of these genes are expressed at medium to high levels in at least one tissue or cell line. This gives a good estimation of the number of gene products that are likely to be found also at the protein level; currently, 78% (n = 1154) of the proteins have antibody-based expression profiles. On the basis of the RNA profiling of all 27 tissues, an RNA classification score has been assigned to each gene, reflecting the abundance of the gene transcript in the panel of analyzed tissues. Depending on the transcript expression level as defined by FPKM values and the number of tissues in which the transcript was detected, the RNA classification serves to categorize the genes as house-keeping (found in all analyzed tissues) or as tissue-specific to different degrees. Figure 4B shows the distribution of each RNA classification score for all detected genes encoded by chromosome 19. Approximately 43% of all detected genes are found to be house-keeping (expressed in all analyzed tissues at high or low levels), 32% are expressed in a multitude of tissues at various expression levels, 5% are group-enriched (expressed at higher level in a subset of tissues), and 12% show a tissue-enriched expression profile with five-fold (8%) to fifty-fold (4%) higher FPKM values in one of the analyzed tissues as compared with the maximum value in all other tissues. A few examples of highly tissue-enriched proteins encoded by chromosome 19 are shown in Figure 4C. These proteins were all suggested to be highly tissue-enriched according to the transcriptomics data of the 27 different tissues analyzed, and a corresponding tissue-specific expression pattern was also seen with IHC, showing expression in the testis for DKKL1, placenta for CGB, liver for CYP2A, and bone marrow for ELANE. These examples demonstrate the usefulness of combining and integrating RNA and protein expression profiles to improve the reliability of the antibody-based profiling.

Strategies for Increasing Evidence Coverage

Transcriptomics data have been shown to greatly contribute to proteomics studies by giving estimations of protein abundance in different tissues and cell types. However, transcriptomics data, as well as MS-based experiments, do not give any information about the spatial distribution of the protein at the subcellular level. For this, imaging techniques for visualizing proteins at the single-cell level using IHC and IF are indispensable techniques that cannot be replaced. This type of information complements the knowledge generated in HPP, which most often comprises quantitative measurements of protein levels using MS. The spatial information generated by HPA can also be used to bridge with B/D-HPP projects focused on defects in particular cellular structures or pathways. In turn, the gene expression levels can be useful in the selection of suitable sample types for successful protein detection.24−26 It has been shown that detection of proteins with higher gene expression values is often more reliable, as target protein concentration increases compared with the concentration of other potential off-target proteins. Although it is still possible to accurately detect the protein product of low abundant transcripts (FPKM < 1) with IF, higher demands are put on validation of these stainings. This is especially important for gene products lacking previous evidence at the protein level, which still comprise ∼30% of the human protein-coding genes in the UniProt database.21

One powerful approach to evaluate the protein distribution at the subcellular level and the specificity of the antibody binding to the target protein is the use of siRNA in combination with IF. This approach has previously been tested within the Human Protein Atlas to validate the subcellular localization of 54 proteins and the specific binding of the corresponding antibodies.27 Examples of this approach where antibody specificity and protein localization could be confirmed are shown in Figure 5A for four different proteins. While CARS and KHSRP are well-known proteins whose localization to the cytoplasm and nucleus, respectively, could be confirmed, C1orf174 and C10orf58 were previously uncharacterized proteins, here localized to the nucleus and nucleoli. The siRNA strategy for validation of antibody target specificity and subcellular localization is currently being implemented in the Protein Atlas workflow, and siRNA validation data will be included in future releases.

Another strategy useful in validating subcellular protein distribution is to compare the results with a complementary technique making use of another protein detection system, such as tagging of the target with green fluorescent protein (GFP). By simultaneous detection of the endogenous protein by the antibody and visualization of the same protein tagged with GFP in the very same cell, a direct comparison of protein distribution can be achieved. The high correlation between these two approaches was demonstrated by the comparison of ∼500 proteins, showing the same protein localizations for 80% of the analyzed proteins.15 This effort is now being scaled up to encompass more proteins, and some examples of protein targets analyzed are shown in Figure 5B. The results demonstrate how antibodies unable to detect low abundant proteins in standard IF experiments successfully recognized the tagged variant of the target protein once expressed. This makes the approach a promising tool for determining the subcellular localization of an increased number of proteins expressed only at low levels in the panel of cell lines routinely used.

Figure 5. New strategies for increased reliability in subcellular profiling. (A) Validation of antibody specificity by siRNA knock-down of the target protein exemplified by KHSRP, CARS, C1ORF174, and C10ORF58. RFI = relative fluorescence intensity (compared with nonsilenced control). (B) Co-localization experiments with IF (green) and a GFP-fusion of the corresponding target protein (purple) expressed in HeLa cells for LTN1, U2AF1, and C7ORF58. Original IF in non-GFP expressing cells and the transcript abundance for the target protein is shown in the left panel. Scale bar: 10 μm.

DISCUSSION

Here we report on updated experimental evidence at RNA and protein level for all human genes as released in the new version 12 of the Human Protein Atlas (www.proteinatlas.org). The number of genes with experimental evidence at least at the transcript level has now increased to 94% of the putative human protein-coding genes, leaving only 6% of the putative human protein-coding genes as, so far, undetected. The transcriptomics data from 27 tissues and 44 cell lines included in the new Protein Atlas have not only contributed to increasing the number of protein-coding genes with experimental evidence but also become a central tool for complementing the antibody-based protein expression profiles. As reported, the number of published genes with subcellular profiles has decreased in this version of the Protein Atlas compared with the previous, as proteins expressed at low RNA levels will be reanalyzed in other cell lines showing higher expression levels to more reliably judge the staining patterns obtained. Because the antibody specificity is dependent on the antibody affinity to both on- and off-targets as well as their concentration in the sample, a protein with very low transcript abundance can still be detected with accurate result if localized at a high concentration to a specific part of the cells, such as different domains in the nucleus or at the centrosomes. Similarly, highly abundant transcripts might fail detection at the protein level if diffusely expressed in the cytosol as a whole. Using an FPKM value of >1 as evidence of an expressed gene, the number of detected genes in each tissue and cell line ranges from approximately 11 600 to 15 500 in tissues and from 10 100 to 12 600 in the cell lines analyzed. With these numbers, we estimate that protein expression profiles using antibodies could be used to analyze close to 90% of the proteins using the existing panel of tissues and cell lines.

As the goal of analyzing representative gene products from all protein-coding genes is getting closer, deeper analysis of splice variants and the corresponding protein isoforms is becoming a natural next step in exploring the human proteome. Although the RNA-seq technology already distinguishes between the splice variants through detection of specific transcripts, the link to the specific isoform and its function remains to be established.

The work described here is part of an ongoing effort to create a high-quality Protein Atlas with highly validated expression profiles. A central part of this effort is focused on new strategies for antibody validation, involving the siRNA platform to evaluate antibody specificity and the use of GFP-tagged proteins to confirm the protein localization obtained from IF. Supportive results from the siRNA or GFP approach will greatly enhance the credibility of the antibody and protein localization, respectively, and help to increase the knowledge of yet uncharacterized proteins.

In conclusion, we show the power of combining transcriptomics and antibody-based profiling data to score the experimental evidence of expression for each of the human putative protein-coding genes. Strategies to analyze the genes with no experimental evidence at the protein level should be emphasized, using also complementary technologies and more specialized tissues and cells to allow for integration of the data presented here with complementary efforts within the framework of the Human Proteome Project.

ASSOCIATED CONTENT

Supporting Information

Evidence summary scores. Cell lines and tissues used for RNA profiling in HPA 12. This material is available free of charge via the Internet at http://pubs.acs.org.



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]. Tel: +46 8 5537 8325. Fax: +46 8 5537 8482.

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS

We acknowledge the entire staff of the Human Protein Atlas project. This work was supported by grants from the Knut and Alice Wallenberg Foundation and the EU 7th framework programs Affinomics and PROSPECTS.



REFERENCES

(1) Paik, Y. K.; et al. The Chromosome-Centric Human Proteome Project for cataloging proteins encoded in the genome. Nat. Biotechnol. 2012, 30, 221−223.
(2) Marko-Varga, G.; Omenn, G. S.; Paik, Y. K.; Hancock, W. S. A first step toward completion of a genome-wide characterization of the human proteome. J. Proteome Res. 2013, 12, 1−5.
(3) Legrain, P.; et al. The human proteome project: Current state and future direction. Mol. Cell. Proteomics 2011, 10, M111.009993.
(4) Gaudet, P.; et al. neXtProt: organizing protein knowledge in the context of human proteome projects. J. Proteome Res. 2013, 12, 293−298.
(5) Lane, L.; et al. neXtProt: a knowledge platform for human proteins. Nucleic Acids Res. 2012, 40, D76−D83.
(6) Aebersold, R.; et al. The biology/disease-driven human proteome project (B/D-HPP): enabling protein research for the life sciences community. J. Proteome Res. 2013, 12, 23−27.
(7) Uhlen, M.; et al. A human protein atlas for normal and cancer tissues based on antibody proteomics. Mol. Cell. Proteomics 2005, 4, 1920−1932.
(8) Fagerberg, L.; et al. Contribution of antibody-based protein profiling to the human Chromosome-centric Proteome Project (CHPP). J. Proteome Res. 2013, 12, 2439−2448.
(9) Fagerberg, L.; et al. Analysis of the human tissue-specific expression by genome-wide integration of transcriptomics and antibody-based proteomics. Mol. Cell. Proteomics 2013, 13, 397−406.
(10) Lundberg, E.; et al. Defining the transcriptome and proteome in three functionally different human cell lines. Mol. Syst. Biol. 2010, 6, 450.
(11) Mortazavi, A.; Williams, B. A.; McCue, K.; Schaeffer, L.; Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods 2008, 5, 621−628.
(12) Ponten, F.; Jirstrom, K.; Uhlen, M. The Human Protein Atlas–a tool for pathology. J. Pathol. 2008, 216, 387−393.
(13) Paavilainen, L.; et al. The impact of tissue fixatives on morphology and antibody-based protein profiling in tissues and cells. J. Histochem. Cytochem. 2010, 58, 237−246.
(14) Poser, I.; et al. BAC TransgeneOmics: a high-throughput method for exploration of protein function in mammals. Nat. Methods 2008, 5, 409−415.
(15) Stadler, C.; et al. Immunofluorescence and fluorescent-protein tagging show high correlation for protein localization in mammalian cells. Nat. Methods 2013, 10, 315−323.
(16) Stadler, C.; Skogs, M.; Brismar, H.; Uhlen, M.; Lundberg, E. A single fixation protocol for proteome-wide immunofluorescence localization studies. J. Proteomics 2010, 73, 1067−1078.
(17) Flicek, P.; et al. Ensembl 2013. Nucleic Acids Res. 2013, 41, D48−D55.
(18) Uhlen, M.; et al. Towards a knowledge-based Human Protein Atlas. Nat. Biotechnol. 2010, 28, 1248−1250.
(19) Barbe, L.; et al. Toward a confocal subcellular atlas of the human proteome. Mol. Cell. Proteomics 2008, 7, 499−508.
(20) Lane, L.; et al. Metrics for the human proteome project 2013−2014 and strategies for finding missing proteins. J. Proteome Res. 2014, 13, 15−20.
(21) Apweiler, R.; et al. Update on activities at the Universal Protein Resource (UniProt) in 2013. Nucleic Acids Res. 2013, 41, D43−D47.
(22) Magrane, M.; UniProt Consortium. UniProt Knowledgebase: a hub of integrated protein data. Database 2011, DOI: 10.1093/database/bar009.
(23) Stadler, C.; et al. Mapping the subcellular protein distribution in three human cell lines. J. Proteome Res. 2011, 10, 3766−3777.
(24) Danielsson, F.; et al. RNA deep sequencing as a tool for selection of cell lines for systematic subcellular localization of all human proteins. J. Proteome Res. 2013, 12, 299−307.
(25) Hebenstreit, D.; et al. RNA sequencing reveals two major classes of gene expression levels in metazoan cells. Mol. Syst. Biol. 2011, 7, 497.
(26) Nagaraj, N.; et al. Deep proteome and transcriptome mapping of a human cancer cell line. Mol. Syst. Biol. 2011, 7, 548.
(27) Stadler, C.; et al. Systematic validation of antibody binding and protein subcellular localization using siRNA and confocal microscopy. J. Proteomics 2012, 75, 2236−2251.


dx.doi.org/10.1021/pr401156g | J. Proteome Res. 2014, 13, 2019−2027