Technical Note pubs.acs.org/jpr
RNA Deep Sequencing as a Tool for Selection of Cell Lines for Systematic Subcellular Localization of All Human Proteins
Frida Danielsson,†,‡ Mikaela Wiking,†,‡ Diana Mahdessian,† Marie Skogs,† Hammou Ait Blal,† Martin Hjelmare,† Charlotte Stadler,† Mathias Uhlén,†,§ and Emma Lundberg*,†
† Science for Life Laboratory, KTH - Royal Institute of Technology, SE-171 21 Stockholm, Sweden
§ Department of Proteomics, KTH - Royal Institute of Technology, SE-106 91 Stockholm, Sweden
ABSTRACT: One of the major challenges of a chromosome-centric proteome project is to explore in a systematic manner the potential proteins identified from the chromosomal genome sequence, but not yet characterized on a protein level. Here, we describe the use of RNA deep sequencing to screen human cell lines for RNA profiles and to use this information to select cell lines suitable for characterization of the corresponding gene product. In this manner, the subcellular localization of proteins can be analyzed systematically using antibody-based confocal microscopy. We demonstrate the usefulness of selecting cell lines with high expression levels of RNA transcripts to increase the likelihood of high quality immunofluorescence staining and subsequent successful subcellular localization of the corresponding protein. The results show a path to combine transcriptomics with affinity proteomics to characterize the proteins in a gene- or chromosome-centric manner.
KEYWORDS: antibody, Human Protein Atlas, Human Proteome Project, RNA sequencing, subcellular localization
■ INTRODUCTION
One of the major challenges of the Human Proteome Project (HPP), and in particular of the chromosome-centric approach (C-HPP), is to find the so-called “missing proteins” that have no previous proof of existence. In contrast to a biology- or disease-based approach, the aim here is to prove the existence of and subsequently characterize one representative protein from each gene in a systematic manner. Since there is no biological context to guide the selection of samples, the C-HPP faces a great problem in how to select the right sample for each gene product. Ideally, a streamlined approach should be employed in which the protein target is present at detectable levels in the selected biosample to allow in-depth analysis and subsequently build up a basic knowledge of each protein, including subcellular localization.
A highly selective and tightly regulated expression of genes, in both time and space, is a prerequisite for the development of different cell types with specialized function. Within the human body there are a large number of different cell types; in fact, a recent estimate suggests that the body contains at least 411 cell types, including 145 different neurons.1 Slightly simplified, it is the proteome content of a cell that determines the specific function of that cell. Thus, the proteome will be unique for each of these 411 cell types and there will never be a single sample that allows the characterization of all human proteins. Many genes will be expressed in most if not all cell types, in particular those encoding proteins involved in the basic cellular machinery (often referred to as house-keeping proteins) such as ribosomal subunits or the transcription machinery, while others will be rarely found, such as proteins expressed during developmental stages of the embryo.
© 2012 American Chemical Society
Recent advances in RNA deep sequencing have resulted in many studies of the transcriptomes of numerous human cell lines and tissues. Despite ongoing discussions on how to interpret the RNA-seq (Whole Transcriptome Shotgun Sequencing) data and where to put the cutoff for considering a gene to be expressed, most studies coherently report on the expression of approximately 11,000−15,000 protein coding genes per cell type.2−4 For instance, in a study of 24 cells and tissues, roughly 60−70% of all protein coding genes were expressed and out of these around 8,000 were detected across all samples.3 Others have shown that there is a good correlation between transcriptome and protein levels.2,5 With the emerging RNA-sequencing information, it is tempting to explore the use of RNA-seq data as a basis for sample selection in genome-centric proteome studies. In this technical note we describe how RNA sequencing can be used as a basis for cell line selection for subcellular localization of the human proteome. This is used in the context of the Human Protein Atlas (HPA) program, which has been set up to allow for a systematic exploration of the whole human proteome using antibody-based proteomics.6,7 In particular, we have used the combination of transcriptomics and affinity proteomics to optimize the effort to generate a knowledge resource in which the subcellular localization of all human proteins is systematically determined, thus providing partial evidence for protein function as well as the characterization of the subcellular proteome. The results suggest that knowledge about the presence of a protein is of immense importance when evaluating the results of an antibody-based study, since all antibodies most likely have an off-target binding of much lower affinity than the target binding, and the risk of false results increases when antibodies are used to stain samples where the target protein is not present.
Special Issue: Chromosome-centric Human Proteome Project
Received: October 4, 2012
Published: December 10, 2012
dx.doi.org/10.1021/pr3009308 | J. Proteome Res. 2013, 12, 299−307
■ MATERIAL AND METHODS
Immunofluorescence Staining
Immunostaining of the cells was performed in 96-well glass bottom plates (Whatman Inc.) coated with 40 μg of 12.5 μg/mL human fibronectin (VWR). Approximately 15,000 cells were seeded in each well and incubated at 37 °C for 6−8 h. After washing with PBS, cells were fixed with 40 μL of ice-cold 4% PFA (Sigma Aldrich) dissolved in growth medium supplemented with 10% serum for 15 min and permeabilized with 40 μL of 0.1% Triton X-100 (Sigma Aldrich) in PBS for 3 × 5 min. Rabbit monospecific HPA antibodies were diluted to 2 μg/mL in blocking buffer (PBS + 4% fetal bovine serum) containing 1 μg/mL mouse antitubulin (Abcam, ab7291, Cambridge, UK) and 1 μg/mL chicken anticalreticulin (Abcam, ab14234). After washing with PBS, diluted primary antibodies were added (40 μL) and the plates were incubated overnight at 4 °C. The following day, all wells were washed with PBS for 4 × 10 min. Secondary antibodies, goat antirabbit Alexa488, rabbit antimouse Alexa555, and goat antichicken Alexa647 (Molecular Probes, Invitrogen), diluted to 1 μg/mL in blocking buffer, were added and the plates were incubated for 1 h at room temperature. The cells were counterstained with 50 μL of the nuclear probe DAPI (Invitrogen) diluted to 300 nM for 10 min. After washing with PBS, all wells were mounted with PBS containing 78% glycerol.
Cell Cultivation
The cell lines were cultivated in Petri dishes at 37 °C in a 5% CO2 humidified environment in different culture media: A-431 (ref 8) in Roswell Park Memorial Institute medium (RPMI); U-2 OS (ref 9) and RT-4 (ref 10) in McCoy’s medium; U-251 MG (ref 11) and Hep-G2 (ref 12) in Eagle’s Minimal Essential Medium (EMEM); MCF-7 (ref 13), HEK 293 (ref 14), CACO-2 (ref 15) and HeLa (ref 16) in EMEM supplemented with 1% nonessential amino acids (NEAA); A-549 (ref 8) and PC-3 (ref 17) in Dulbecco’s Modified Eagle Medium (DMEM). All growth media were supplemented with 10% fetal bovine serum (FBS; VWR, Radnor, PA). All cells were harvested at 60−70% confluency by trypsinization.
RNA Sequencing
RNA was extracted with the RNeasy extraction kit according to the manufacturer's instructions (Qiagen, Hilden, Germany), after which cDNA libraries were prepared according to the Illumina TruSeq standard procedures. Sequencing was performed on duplicate samples for each cell line on an Illumina HiSeq 2000, as 2 × 100 bp paired-end reads (Illumina, San Diego, CA). Sequences were mapped to the hg19 build of the human reference genome using TopHat v. 1.0.14 (ref 18). Putative PCR duplicates were removed from the aligned sequences using the MarkDuplicates subcommand of PicardTools (v. 1.29). As quantitative measurements of gene expression, FPKM (Fragments Per Kilobase of exon model per Million mapped reads) values were calculated to normalize for both gene length and total number of reads in the measurement. FPKM values were calculated with respect to genes from Ensembl release version 63.37 (www.ensembl.org) using Cufflinks (v. 1.0.3). The raw sequence data files were uploaded to the NCBI short read archive with accession number SRA062599.
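As a rough illustration of the FPKM normalization described above, the sketch below divides a fragment count by the exon model length in kilobases and by the number of mapped fragments in millions. This is only the defining arithmetic, not the Cufflinks procedure (which estimates abundances with a statistical model); the function name `fpkm` and the example numbers are hypothetical.

```python
def fpkm(fragments: int, exon_length_bp: int, total_mapped_fragments: int) -> float:
    """Fragments Per Kilobase of exon model per Million mapped fragments."""
    if exon_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("exon length and total mapped fragments must be positive")
    kilobases = exon_length_bp / 1_000          # normalizes for gene length
    millions = total_mapped_fragments / 1_000_000  # normalizes for sequencing depth
    return fragments / kilobases / millions


# Hypothetical gene: 500 fragments on a 2,000 bp exon model,
# in a library with 20 million mapped fragments.
example = fpkm(500, 2_000, 20_000_000)
print(round(example, 2))
```

Because both normalizations are simple divisions, values from cell lines sequenced to different depths become comparable on one scale, which is what the later cutoff of FPKM > 1 relies on.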
Fluorescence Image Acquisition
Image acquisition was performed with a Leica SP5 confocal microscope equipped with a 63× 1.4 numerical aperture oil immersion objective with the support of LAS AF matrix software. Two images of each HPA antibody were acquired at room temperature in three sequential steps (HPA antibody, DAPI and ER, microtubules) at a single z-axis level with the following scanning settings: pinhole 1 Airy unit, 16-bit acquisition and a pixel size of 80 × 80 nm. For each sample, the z-axis level and the detector gain were manually adjusted in order to optimize the visualization of each protein and maximize the dynamic range (i.e., to only allow for a few saturated pixels), respectively. The images were acquired manually and the operator ensured that the images were representative for the entire sample.
Image Annotation
The acquired images were manually examined and the observed subcellular patterns were annotated. The stainings were classified into four categories (negative, weak, moderate and strong) based on the detector gain value. According to the staining, the subcellular location of the protein was assigned to the following subcellular compartments: cytoplasm, nucleus, nuclear membrane, nucleolus, mitochondrion, endoplasmic reticulum, Golgi apparatus, microtubules, actin filaments, intermediate filaments, plasma membrane, focal adhesions, cell junctions, centrosome, aggresome, vesicles, microtubule organizing center and cytokinetic bridge. The stainings were further characterized by the following descriptions: smooth, granular, speckled, spotty, fibrous or clustered. Images that were vaguely stained or had a diffuse pattern were annotated as unspecific. If more than one localization was observed, they were defined as main or additional, depending on their appearance in one or several of the cell lines and on the relative intensity between the observed organelles. All images and annotations are available at the Human Protein Atlas Web portal (www.proteinatlas.org). Finally, a validation score of the observed localization was assigned for each protein/cell line and was classified as either Supportive, Uncertain or Nonsupportive based on the concordance with available experimental protein characterization data in UniProtKB or by staining from additional independent antibodies targeting the same protein. There are in total 13 different sublevels of the validation score classes; for details and further descriptions, see http://www.proteinatlas.org/about/quality+scoring#ifv.
Data Analysis
The RNA sequencing and subcellular annotation data were analyzed using the R statistical programming environment with the addition of the ggplot2 package.19 DAVID20 was used to perform a Gene Ontology based enrichment analysis of cellular components and biological processes among the proteins that are not expressed in the eleven cell lines; of the 4,723 proteins in that list, 3,826 were recognized by DAVID and used as the input list. As a background, the set of all human protein coding genes was used and the database search was restricted to only include Gene Ontology Biological Process and Gene Ontology Cellular Component annotations.
Antibodies
The antibodies used for immunofluorescence were rabbit polyclonal antibodies generated and validated within the Human Protein Atlas project.7 The antigens used were recombinant protein epitope signature tags, typically between 50 and 100 amino acids,21 and the resulting antibodies were affinity purified using the antigen as affinity ligand.22
■ RESULTS AND DISCUSSION
RNA Sequencing of Cell Lines
RNA sequencing was performed on duplicate samples to generate quantitative measurements of the presence and prevalence of gene transcripts in the cell lines. Hebenstreit et al.23 recently suggested that in mammalian cells many low abundant transcripts are not translated into stable proteins, in particular genes with an expression value (FPKM) of less than 1. Based on this recommended cutoff of FPKM > 1, we determined the number of genes per cell line expressed at the RNA level. This measure serves as a proxy for the maximum number of proteins that can be localized in this cell line. The numbers of expressed genes were fairly similar for all cell lines, ranging from 65 to 76% of all protein coding genes (n = 20,159, see Table 1 for details), in line with previous results from RNA sequencing of these cell lines.2,5,24,25
Table 1. Cell Lines Used for Subcellular Localization of All Human Proteins(a)
Cell line | Tissue origin | Description | Expressed genes (FPKM > 1) | Gene expression (%)
A-431 | Skin | Epidermoid carcinoma cell line | 13,637 | 68
U-251 MG | Brain | Glioblastoma cell line | 13,128 | 65
U-2 OS | Bone | Osteosarcoma cell line | 15,478 | 76
A-549 | Lung | Lung carcinoma cell line | 13,849 | 69
CACO-2 | Colon | Colon adenocarcinoma cell line | 13,796 | 68
HEK 293 | Embryonal kidney | Embryonal kidney cell line, transformed by adenovirus type 5 | 14,413 | 71
HeLa | Cervix | Cervical epithelial adenocarcinoma cell line | 14,061 | 70
Hep-G2 | Liver | Hepatocellular carcinoma cell line | 13,724 | 68
MCF-7 | Pleural effusion | Metastatic breast adenocarcinoma cell line | 13,686 | 68
PC-3 | Bone marrow | Metastatic poorly differentiated prostate adenocarcinoma cell line | 13,889 | 69
RT-4 | Urinary bladder | Urinary bladder transitional cell carcinoma cell line | 14,210 | 70
(a) Descriptions of the 11 cell lines included in the study are shown, together with the total number of detected genes and the corresponding coverage of the human proteome for each cell line.
Systematic Subcellular Localization of Proteins in Three Human Cell Lines
In the beginning of the HPA project, three cell lines were selected to be used for subcellular localization of proteins.6 The cell lines were carefully chosen to be dissimilar and hence theoretically maximize the number of proteins expressed in at least one of them. The cell lines used were U-251 MG, a glioblastoma cell line; A-431, an epidermoid carcinoma cell line; and U-2 OS, an osteosarcoma cell line. By using the antibodies generated in the HPA project, 9,659 proteins have been systematically studied by immunofluorescence (IF) microscopy in these three cell lines and 78% could be assigned a subcellular localization. The staining intensity of each protein was further classified as negative, weak, moderate or strong based on the employed detector gain settings. Each localization was assigned a validation score based on the concordance with information in UniProt. Twenty-three percent of the proteins had a location that was supported by UniProt, 44% had an uncertain score, 11% showed a location in conflict with UniProt, whereas 22% were not detected at all. The distribution of validation scores was similar in the three cell lines, with a majority of the stainings being validated as uncertain. A more detailed examination of this group showed an over-representation of antibodies lacking UniProt literature, which might be due to the fact that many of the proteins studied early on in the HPA project did not have information in UniProt at that time and have not been revalidated since.
Correlation between RNA Levels and Protein Immunofluorescent Staining
When combining the RNA sequencing data for the three original cell lines, in total 81% (n = 16,248) of all human protein coding genes were expressed. Although the numbers of detected genes (81%) and proteins (78%) in these cell lines are very similar, the overlap of detected genes is not perfect. In the U-2 OS cell line 4,466 genes are detected by both methods, 1,701 only by IF, and 914 only by RNA-seq (Figure 1A). Figure 1B shows examples of proteins where the staining observed correlates with the level of RNA expression. Notably, there is a high value in characterizing both an RNA-seq “positive” and “negative” cell line, especially when characterizing so-called “missing proteins”. Figure 1C shows examples of proteins where no correlation is observed: the mitochondrial ribosomal protein L37 is barely detected using IF despite a high RNA transcript expression (FPKM 93.1) and a Western blot showing the correct band, whereas the transcription factor CP2-like 1 protein is strongly stained and shows the correct band in Western blot even though the RNA transcript is not detected at all. There are several explanations for this discrepancy of both false positive and false negative data. First, certain antibodies may not work in the IF application. Second, most antibodies have a high affinity for the intended target protein but also a much lower affinity to other proteins. In the absence of the target protein the antibody may cross-react and bind to other proteins and hence give a false IF staining in the subcellular analyses. Such cases are expected to be a large source of false localizations assigned by systematic antibody-based stainings. Third, the cutoff used to determine when a functional transcript is to be considered as expressed is highly debated, and a cutoff of FPKM > 1 may be too high and hence give rise to false negative RNA data. Finally, protein stability and turnover rate may vary and complicate the use of a generic cutoff for when a transcript can be considered to give rise to functional proteins. In addition to this, some differences may arise from protein isoforms, which are disregarded in this gene-centric comparison (as most HPA antibodies target multiple protein isoforms).
Figure 1. Subcellular localization of proteins with correlating/noncorrelating RNA levels. (A) Venn diagram showing the overlap of detected proteins by IF and RNA-seq in the U-2 OS cell line. (B−C) Confocal images of immunofluorescently stained cells where the protein of interest is stained with a specific antibody (shown in green) and the additional markers DAPI (blue) as well as microtubules (red). The scale bar represents 10 μm. (B) Images show the following proteins in the cell lines A-431, U-2 OS and U-251 MG: (I) TES localized to the plasma membrane and cytoplasm (HPA015269, FPKM values respectively: 75.9, 0.193 and 0.327). (II) HSD17B11 localized to vesicles (HPA021608, FPKM values respectively: 0.93, 33.6 and 8.22). (III) SH2D3A localized to centrosomes (HPA035722, FPKM values respectively: 21.2, 0.16 and 0.77). (IV) MAP1B localized to the cytoplasm (HPA022275, FPKM values respectively: 0.75, 40.8 and 82.2). (C) (I) Mitochondrial ribosomal protein L37 is weakly stained (HPA025951, green) despite a high RNA level (FPKM value 93.1) and a supportive Western blot, where just a single band was detected corresponding to the predicted size (47.6, 48.1 kDa). (II) Transcription factor CP2-like 1 protein is strongly stained (HPA029708, green) and successfully detected by Western blot, where a single band was detected corresponding to the predicted size (54.6 kDa), despite an FPKM value of 0.
The distribution of FPKM values within each IF intensity class (negative, weak, moderate, strong) clearly shows that there is a general correlation between the level of RNA expression and the intensity of the IF staining of the corresponding protein, as the mean FPKM values for the respective groups are 3.36, 9.52, 21.6, and 70.4. Nevertheless, the spread of the FPKM values within each class is similar and there are proteins with very low, as well as very high, FPKM values in each intensity class. Interestingly, there are proteins detected in the strong and moderate groups with FPKM levels below the cutoff, but notably the number of lowly expressed genes in these groups is much lower than for the other two groups. Several recent studies show a high correlation between RNA-seq levels and protein abundance as determined by mass spectrometry in human cell lines.5,24 For instance, we previously reported on a high correlation in the very same cell lines.2 Hence, we conclude that the lower correlation between IF staining intensity and RNA-seq observed here indeed reflects the lack of correlation between protein abundance and IF staining intensity. This is not surprising since there are many parameters affecting the protein detection and IF staining intensity, such as the number of target epitopes in the polyclonal antibody pool, the affinity of the antibody to the target protein and the accessibility of the target protein upon sample fixation. Furthermore, the IF images are two-dimensional and do not represent the full cell volume, and the number of cells measured is limited (10−20 per protein). Finally, in antibody-based imaging studies the “local” concentration of a protein is more important than the overall level. A few proteins in close vicinity to each other, for instance in a centrosome, can give rise to a strong detectable signal, whereas a higher number of proteins dispersed throughout the cytoplasm will not be detected.
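The comparison of FPKM values across IF intensity classes can be sketched as a simple group-wise summary. The snippet below is a minimal illustration, assuming one record per protein with an intensity class and an FPKM value; the records, gene names and the helper `summarize` are hypothetical and are not data or code from the study, which used R.

```python
from collections import defaultdict
from statistics import mean

CUTOFF = 1.0  # FPKM > 1, the expression cutoff used in the text

# Hypothetical (gene, IF intensity class, FPKM) records; not data from the study.
records = [
    ("GENE_A", "negative", 0.4),
    ("GENE_B", "negative", 6.0),
    ("GENE_C", "weak", 0.9),
    ("GENE_D", "weak", 12.0),
    ("GENE_E", "moderate", 20.0),
    ("GENE_F", "strong", 0.5),
    ("GENE_G", "strong", 90.0),
]


def summarize(rows):
    """Return {class: (mean FPKM, fraction of genes at or below the cutoff)}."""
    by_class = defaultdict(list)
    for _gene, intensity, value in rows:
        by_class[intensity].append(value)
    return {
        cls: (mean(vals), sum(v <= CUTOFF for v in vals) / len(vals))
        for cls, vals in by_class.items()
    }


summary = summarize(records)
for cls in ("negative", "weak", "moderate", "strong"):
    avg, frac_low = summary[cls]
    print(f"{cls}: mean FPKM {avg:.2f}, fraction at or below cutoff {frac_low:.2f}")
```

Reporting the fraction of genes at or below the cutoff next to the mean makes visible the point made in the text: a class can have a high mean FPKM and still contain proteins whose transcripts fall below FPKM > 1.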
The distribution of FPKM values within each IF validation group (Supportive, Uncertain, Nonsupportive and No staining) clearly indicates that high FPKM values dominate among the supportive stainings while the other three groups are very similar (Figure 2B). The mean FPKM values in the groups were 46.5 for supportive stainings, 17.6 for uncertain stainings, 16.6 for proteins with no staining and 21.2 for nonsupportive stainings. This result indicates that if a certain protein of interest could be studied in the cell line where it is more highly expressed, the staining would more likely reflect the correct subcellular distribution of that protein. Also here there are proteins detected in the supportive group with FPKM levels below the cutoff, but the number of lowly expressed genes in this group is much lower than for the other groups. Altogether, these results show that in general a high transcript expression is more likely to generate a strong immunofluorescent staining, representing the true subcellular distribution of the protein, but that RNA-seq levels cannot be used as an indicator/predictor of successful subcellular localization for individual proteins. Furthermore, the results emphasize the complexity of using a generic cutoff (such as the often used FPKM > 1) on RNA-seq data to determine when a functional protein is expressed. Figure 2C shows the FPKM distribution within different organelle groups in U-2 OS (main localization as determined by IF). It is clear that the distribution is similar for most organelles. However, proteins localized to the endoplasmic reticulum and mitochondria show slightly higher FPKM values. Mitochondrial proteins and also ribosomal subunits (often present in the endoplasmic reticulum) have previously been shown to be expressed at a level above average.5,24 We have also previously shown that the current fixation and permeabilization protocol used for IF is not optimal for membranous structures such as the endoplasmic reticulum and plasma membrane26 (and unpublished work), and one can speculate that a higher expression may in general be needed for successful protein detection in these compartments.
Expansion of the Cell Line Panel
The goal of the HPA project is to provide a complete draft of the human proteome by 2015. In order to meet this goal, efforts are being made to ensure that all proteins are expressed in the model systems utilized. To allow the localization of a higher number of the human proteins, the cell line panel needs to be expanded beyond the three cell lines historically used in the HPA project. In order to complement the old cell lines and again maximize the number of proteins expressed, the additional cell lines needed to represent different organs in the human body. The cell lines should also be commonly used, well characterized and easy to cultivate. In addition to this, the scheme for subcellular localization in HPA is based on the use of adherent cells to allow cultivation in microscope-compatible multiwell plates. As a first step toward an expanded and comprehensive cell line panel, new cell lines were chosen from the subset already used for immunohistochemistry within the HPA project.27 From this subset, eight additional cell lines meeting these criteria were carefully selected and integrated into the HPA subcellular localization workflow: A-549, a lung carcinoma cell line; CACO-2, a colon adenocarcinoma cell line; HEK 293, an embryonal kidney cell line; HeLa, a cervical adenocarcinoma cell line; Hep-G2, a hepatocellular carcinoma cell line; MCF-7, an adenocarcinoma cell line; PC-3, a metastatic and poorly differentiated adenocarcinoma cell line; and RT-4, a urinary bladder carcinoma cell line (Table 1). RNA sequencing was performed also on these cell lines to generate quantitative measurements of the presence and prevalence of gene transcripts. Altogether, the eleven cell lines in the HPA cell panel express 87% (n = 17,515) of the human protein coding genes, indicating that approximately 1,250 additional proteins should be possible to detect and characterize.
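The gain from expanding the panel is a set union: a gene counts as covered if it exceeds the cutoff in at least one cell line. With the figures reported here, 17,515 − 16,248 = 1,267 genes are newly covered, consistent with the "approximately 1,250" stated above. The sketch below illustrates the calculation on a toy table; the gene names, FPKM values and helper functions are hypothetical.

```python
CUTOFF = 1.0  # FPKM > 1

# Hypothetical FPKM table: {cell line: {gene: FPKM}}; values are illustrative only.
fpkm_table = {
    "U-2 OS": {"GENE_A": 12.0, "GENE_B": 0.2, "GENE_C": 0.0, "GENE_D": 0.3},
    "A-431": {"GENE_A": 0.5, "GENE_B": 8.1, "GENE_C": 0.1, "GENE_D": 0.0},
    "CACO-2": {"GENE_A": 3.0, "GENE_B": 0.0, "GENE_C": 25.0, "GENE_D": 0.9},
}


def expressed_genes(profile, cutoff=CUTOFF):
    """Genes whose FPKM exceeds the cutoff in a single cell line."""
    return {gene for gene, value in profile.items() if value > cutoff}


def panel_coverage(table, cell_lines, cutoff=CUTOFF):
    """Genes expressed above the cutoff in at least one of the listed cell lines."""
    covered = set()
    for name in cell_lines:
        covered |= expressed_genes(table[name], cutoff)
    return covered


original = panel_coverage(fpkm_table, ["U-2 OS", "A-431"])
expanded = panel_coverage(fpkm_table, ["U-2 OS", "A-431", "CACO-2"])
gained = expanded - original
all_genes = set().union(*(profile.keys() for profile in fpkm_table.values()))
coverage_pct = 100 * len(expanded) / len(all_genes)
print(sorted(gained), f"{coverage_pct:.0f}% covered")
```

Raising `cutoff` to 10 or 50 in the same functions gives the counts of moderately and strongly expressed genes for a given panel.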
Figure 3 visualizes the gene expression in the eleven cell lines for all expressed genes (FPKM > 1), all moderately expressed genes (FPKM > 10) and all strongly expressed genes (FPKM > 50). Interestingly, by having a diverse set of cell lines, the number of moderately or highly expressed genes dramatically increases to 17,119 and the number of genes highly expressed in at least one cell line is as high as 13,288. Notably, there are only very few (n = 43) genes expressed at a high level in eight or more of the cell lines. Besides a higher number of expressed genes, the use of a larger cell line panel and selection of suitable cell lines based on RNA-seq should thus, in line with the results shown in Figure 2, give rise to a higher number of successful stainings and hence successfully localized proteins.
Figure 2. Distribution of FPKM values for different IF classes. Density plots showing the estimated densities of log2 FPKM values in different classes of IF staining. (A) Distribution of FPKM values in the four IF staining intensity groups Strong (n = 1,745), Moderate (n = 2,557), Weak (n = 4,792) and No staining (n = 4,157). The cutoff for detection of FPKM > 1 is highlighted with a vertical dashed line. (B) Distribution of FPKM values in the four IF validation score groups Supportive (n = 6,681), Uncertain (n = 12,934), No staining (n = 6,570) and Nonsupportive (n = 3,203). The cutoff for detection of FPKM > 1 is highlighted with a vertical dashed line. (C) Distribution of FPKM values for proteins localized to different subcellular compartments (mean FPKM in brackets): Cell membrane (19.1), Cytoplasm (31.1), Cytoskeleton (16.4), Endoplasmic reticulum (75), Golgi apparatus (17.8), Mitochondria (41.3), Nuclear membrane (25.3), Nucleus and nucleoli (26.2) and Vesicles (13.5). The cutoff for detection of FPKM > 1 is highlighted with a vertical dashed line.
Figure 3. Distribution of number of detected genes by RNA-seq across the different number of cell lines for three different cutoffs for detection. (a) Column chart showing the distribution of number of detected genes across different number of cell lines (from 1 to 11) for 17,515 genes with FPKM > 1, 17,119 genes with FPKM > 10 and 13,288 genes with FPKM > 50.
Systematic Subcellular Localization of Proteins in Selected Cell Lines Based on RNA-seq Data
The scheme for subcellular localization in HPA was redesigned so that the subcellular localization of each protein is examined in the two cell lines (out of the expanded panel of eleven cell lines) with the highest RNA expression of the corresponding gene. In addition, all proteins are localized in U-2 OS regardless of the RNA level so that a complete proteome of one human cell line will still be characterized. Four hundred twelve proteins (randomly selected with no prior knowledge of subcellular localization) were systematically localized according to the above-mentioned scheme to evaluate whether the use of RNA sequencing could improve the outcome. To compare the quality of the IF data before and after the introduction of the new panel, an equal number of proteins from the old data set were randomly selected. Figure 4 shows how the distribution of subcellular validation scores has changed with the new scheme. The group of supportive stainings has increased from 23 to 33%, while the uncertain and unstained groups have decreased from 44 to 29% and from 22 to 20%, respectively. However, the group of nonsupportive stainings has also increased from 11 to 20%. For this increase, the more comprehensive UniProt literature available today is likely an important factor, since antibodies that were previously classified as uncertain are now more easily classified as either supportive or nonsupportive.
Figure 4. Distribution of IF validation scores. Barplot showing the distribution of the IF validation scores Supportive, Uncertain, No staining and Nonsupportive, before and after introduction of RNA sequencing data as a basis for cell line selection.
Ideally, the protein should always be detected and localized when using cell lines selected based on RNA-seq. However, 20% of the proteins do not give rise to any IF staining. The reasons for this, as already discussed, may be that the antibody is not working in the immunofluorescent application or that the protein is not present at detectable levels despite the observed RNA level, again reminding us of the complexity of using a generic cutoff for RNA-seq data. Also, it is well-known that the cell fixation performed upon IF staining may not be suitable for all types of proteins.26 A suggested next step for localization of the unstained proteins detected at the RNA level would be, first, to employ a different fixation protocol, for instance based on dehydration instead of paraformaldehyde cross-linkage, and second, to use a different antibody. If a protein is still not detected, other complementary approaches should be considered, such as expression as a fusion to a fluorescent protein28 or organelle fractionation followed by identification by mass spectrometry.29 Figure 5 shows examples of proteins that were successfully localized in cells from the extended cell line panel but not in the original panel. These changes confirm that RNA sequencing is a necessary step toward an efficient and complete characterization of the subcellular distribution of the entire human proteome.
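The redesigned selection scheme (the two cell lines with the highest RNA expression, plus U-2 OS in every case) can be sketched as a ranking per gene. This is a minimal illustration under stated assumptions: the function `select_cell_lines`, the FPKM values and the handling of the case where U-2 OS is already among the top two are hypothetical, since the text does not specify that detail.

```python
ALWAYS_INCLUDED = "U-2 OS"  # stained for every protein regardless of RNA level


def select_cell_lines(gene_fpkm, n_top=2, always=ALWAYS_INCLUDED):
    """Pick the n_top cell lines with the highest FPKM, plus the fixed cell line.

    Assumption: if the fixed cell line is already among the top n_top,
    no further cell line is added.
    """
    ranked = sorted(gene_fpkm, key=lambda name: gene_fpkm[name], reverse=True)
    chosen = ranked[:n_top]
    if always not in chosen:
        chosen.append(always)
    return chosen


# Hypothetical FPKM values for one gene across part of the panel.
example_gene = {"A-431": 0.9, "U-2 OS": 0.2, "CACO-2": 17.7, "Hep-G2": 42.0, "HeLa": 3.1}
print(select_cell_lines(example_gene))
```

Keeping one fixed cell line in every experiment means each protein is also tested in a cell line that may not express it, which the text notes is valuable when characterizing both RNA-seq "positive" and "negative" cell lines.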
What Is Not Detected in the Current Cell Line Panel?
Despite the almost 4-fold expansion of the HPA cell line panel, the RNA sequencing data implies that yet more cells or tissues are needed to reach a satisfactory or complete coverage of the human proteome. Thus, it is now necessary to look into what proteins that are not expressed in either of the eleven cell lines. A functional clustering analysis, using the online DAVID software, was performed for all nonexpressed genes (FPKM ≤ 1, n = 4.723), using the whole human genome as background. The results (Table 2 for biological process and Table 3 for cellular components) show that there is a high enrichment of 304
dx.doi.org/10.1021/pr3009308 | J. Proteome Res. 2013, 12, 299−307
Journal of Proteome Research
Technical Note
Table 3. DAVID Functional Enrichment Analysis of Undetected Proteins^a

enriched term (cellular component)    count    Benjamini value
Intrinsic to membrane                 1,422    8.4 × 10^−59
Integral to membrane                  1,382    3.7 × 10^−58
Plasma membrane                       1,066    3.0 × 10^−57
Extracellular region                    641    5.9 × 10^−48
Intermediate filament                    93    1.5 × 10^−19
Intermediate filament cytoskeleton       93    8.6 × 10^−19
Integral to plasma membrane             343    1.2 × 10^−15
Keratin filament                         53    6.4 × 10^−15
Intrinsic to plasma membrane            347    1.2 × 10^−14
Extracellular space                     213    1.5 × 10^−12

^a Top 10 results of cellular compartments enriched among 3,826 genes that were not detected in any of the 8 cell lines are shown. For each annotation, the corresponding number of genes and an adjusted p-value are shown.
membrane and intermediate filaments. The approach to compensate for the loss of these proteins will now be to sequence the RNA of cell lines that are expected to express these types of proteins, such as cell lines of neurological or hematopoietic origin as well as supportive cells such as fibroblasts. Other cell types, for example primary cells and stem cells, will also be considered as candidates. As a large group of plasma membrane proteins is enriched among the nonexpressed proteins, it is of high importance to find new alternative sources where they are expressed. This result supports the previously suggested “inside−outside” rule of gene expression,2,3 postulating that plasma membrane-associated proteins are more often differentially expressed between cell lines. Furthermore, this indicates a disadvantage of working only with adherent cells with similar morphology. As both adherent and suspension cells will adapt to the conditions of in vitro cultivation, thus altering their surface protein expression, a combination of the two cell types is preferable in order to achieve the widest range of cell surface proteins.
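To make the enrichment statistics in Tables 2 and 3 more concrete, the following Python sketch shows one generic way to test whether an annotation term is over-represented in a gene list and to adjust the resulting p-values with the Benjamini-Hochberg procedure. It is an illustration under stated assumptions, not a reimplementation of DAVID: DAVID reports a modified Fisher exact score, whereas this sketch uses a plain hypergeometric tail probability, and the function names and example numbers are hypothetical.

```python
from math import comb
from typing import List


def hypergeom_sf(k: int, K: int, n: int, N: int) -> float:
    """P(X >= k): probability of seeing at least k annotated genes in a list of
    n genes drawn from a background of N genes, K of which carry the annotation."""
    total = comb(N, n)
    upper = min(K, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical toy example: background of 20 genes, 5 annotated with a term,
# and a list of 6 nonexpressed genes of which 4 carry the term.
p_term = hypergeom_sf(k=4, K=5, n=6, N=20)
# Adjust together with p-values of other (hypothetical) terms tested.
adjusted = benjamini_hochberg([p_term, 0.04, 0.03, 0.5])
print(p_term, adjusted)
```

In a real analysis the background would be the whole genome and one p-value would be computed per annotation term, so the multiple-testing adjustment matters far more than in this toy case.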
Figure 5. Examples of proteins successfully localized by expansion of the cell line panel. Confocal images of immunofluorescently stained cells where the protein of interest is stained with a specific antibody (shown in green) together with additional markers for the nucleus (blue) or microtubules (red). Both the subcellular location reported in UniProt and the RNA levels are in agreement with the observed results. (A) Cytoplasmic staining of the protein adenosylhomocysteinase in CACO-2 cells and no staining in U-2 OS cells (HPA044675, FPKM values 507.8 and 126.0, respectively). (B) Moderate nuclear staining of hepatocyte nuclear factor 1-alpha in CACO-2 cells and weak staining in U-2 OS cells (HPA035231, FPKM values 17.7 and 0.813, respectively). (C) Weak staining of cyclin D2 in CACO-2 cells shows a cell-cycle-dependent localization to the nucleus, with no staining in U-2 OS cells (HPA049138, FPKM values 50.1 and 34.7, respectively). The scale bar represents 10 μm.
■
CONCLUSIONS We have here shown that high expression levels of RNA transcripts are more likely to generate good IF stainings, making RNA sequencing an attractive tool for pointing to the right sources of protein expression, with the aim of characterizing the subcellular localization of all human proteins. Nevertheless, we show that RNA-seq levels cannot be used as an indicator or predictor of successful subcellular localization for individual proteins, as the immunofluorescent staining depends on many parameters other than the absolute level of the protein of interest, such as the affinity of the antibody and the local concentration of the protein studied. Furthermore, the results emphasize the complexity of using a generic cutoff (such as the often used FPKM < 1) on RNA-seq data to determine when a functional protein is expressed. By expanding the cell line panel used for subcellular localization of proteins in the HPA project, we show that more human proteins can be successfully localized. In fact, the main contribution of this expanded panel of rather similar adherent cell lines is the possibility to study the protein in a cell line where the gene is expressed at a moderate or higher level, greatly increasing the likelihood of a successful localization. The RNA sequencing data, however, imply that yet more cells or tissues are needed to reach a satisfactory or complete coverage

Table 2. DAVID Functional Enrichment Analysis of Undetected Proteins^a

enriched term (biological process)                    count    Benjamini value
G-protein coupled receptor protein signaling pathway    617    1.3 × 10^−200
Sensory perception of chemical stimulus                 361    3.9 × 10^−178
Sensory perception of smell                             338    1.8 × 10^−175
Sensory perception                                      464    6.4 × 10^−156
Cognition                                               482    6.7 × 10^−144
Neurological system process                             555    1.8 × 10^−131
Cell surface receptor linked signal transduction        721    4.9 × 10^−129
Defense response                                        215    3.2 × 10^−25
Defense response to bacterium                            69    1.1 × 10^−22
Immune response                                         203    8.5 × 10^−14

^a Top 10 results of biological processes enriched among 3,826 genes that were not detected in any of the 8 cell lines are shown. For each annotation, the corresponding number of genes and an adjusted p-value are shown.
genes involved in G-protein coupled receptor protein signaling, sensory perception, neurological processes and the immune response. Furthermore, the cellular components highly enriched are secreted proteins as well as proteins in the plasma
(7) Uhlen, M.; Bjorling, E.; Agaton, C.; Szigyarto, C. A.; Amini, B.; Andersen, E.; Andersson, A. C.; Angelidou, P.; Asplund, A.; Asplund, C.; Berglund, L.; Bergstrom, K.; Brumer, H.; Cerjan, D.; Ekstrom, M.; Elobeid, A.; Eriksson, C.; Fagerberg, L.; Falk, R.; Fall, J.; Forsberg, M.; Bjorklund, M. G.; Gumbel, K.; Halimi, A.; Hallin, I.; Hamsten, C.; Hansson, M.; Hedhammar, M.; Hercules, G.; Kampf, C.; Larsson, K.; Lindskog, M.; Lodewyckx, W.; Lund, J.; Lundeberg, J.; Magnusson, K.; Malm, E.; Nilsson, P.; Odling, J.; Oksvold, P.; Olsson, I.; Oster, E.; Ottosson, J.; Paavilainen, L.; Persson, A.; Rimini, R.; Rockberg, J.; Runeson, M.; Sivertsson, A.; Skollermo, A.; Steen, J.; Stenvall, M.; Sterky, F.; Stromberg, S.; Sundberg, M.; Tegel, H.; Tourle, S.; Wahlund, E.; Walden, A.; Wan, J.; Wernerus, H.; Westberg, J.; Wester, K.; Wrethagen, U.; Xu, L. L.; Hober, S.; Ponten, F. A human protein atlas for normal and cancer tissues based on antibody proteomics. Mol. Cell. Proteomics 2005, 4 (12), 1920−32. Uhlen, M.; Oksvold, P.; Fagerberg, L.; Lundberg, E.; Jonasson, K.; Forsberg, M.; Zwahlen, M.; Kampf, C.; Wester, K.; Hober, S.; Wernerus, H.; Bjorling, L.; Ponten, F. Towards a knowledge-based Human Protein Atlas. Nat. Biotechnol. 2010, 28 (12), 1248−50.
(8) Giard, D. J.; Aaronson, S. A.; Todaro, G. J.; Arnstein, P.; Kersey, J. H.; Dosik, H.; Parks, W. P. In vitro cultivation of human tumors: establishment of cell lines derived from a series of solid tumors. J. Natl. Cancer Inst. 1973, 51 (5), 1417−23.
(9) Ponten, J.; Saksela, E. Two established in vitro cell lines from human mesenchymal tumours. Int. J. Cancer 1967, 2 (5), 434−47.
(10) Rigby, C. C.; Franks, L. M. A human tissue culture cell line from a transitional cell tumour of the urinary bladder: growth, chromosone pattern and ultrastructure. Br. J. Cancer 1970, 24 (4), 746−54.
(11) Westermark, B. The deficient density-dependent growth control of human malignant glioma cells and virus-transformed glia-like cells in culture. Int. J. Cancer 1973, 12 (2), 438−51.
(12) Aden, D. P.; Fogel, A.; Plotkin, S.; Damjanov, I.; Knowles, B. B. Controlled synthesis of HBsAg in a differentiated human liver carcinoma-derived cell line. Nature 1979, 282 (5739), 615−6.
(13) Soule, H. D.; Vazguez, J.; Long, A.; Albert, S.; Brennan, M. A human cell line from a pleural effusion derived from a breast carcinoma. J. Natl. Cancer Inst. 1973, 51 (5), 1409−16.
(14) Graham, F. L.; Smiley, J.; Russell, W. C.; Nairn, R. Characteristics of a human cell line transformed by DNA from human adenovirus type 5. J. Gen. Virol. 1977, 36 (1), 59−74.
(15) Fogh, J.; Fogh, J. M.; Orfeo, T. One hundred and twenty-seven cultured human tumor cell lines producing tumors in nude mice. J. Natl. Cancer Inst. 1977, 59 (1), 221−6.
(16) Scherer, W. F.; Syverton, J. T.; Gey, G. O. Studies on the propagation in vitro of poliomyelitis viruses. IV. Viral multiplication in a stable strain of human malignant epithelial cells (strain HeLa) derived from an epidermoid carcinoma of the cervix. J. Exp. Med. 1953, 97 (5), 695−710.
(17) Kaighn, M. E.; Narayan, K. S.; Ohnuki, Y.; Lechner, J. F.; Jones, L. W. Establishment and characterization of a human prostatic carcinoma cell line (PC-3). Invest. Urol. 1979, 17 (1), 16−23.
(18) Trapnell, C.; Pachter, L.; Salzberg, S. L. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 2009, 25 (9), 1105−11.
(19) R Development Core Team, R: A language and environment for statistical computing; Vienna, Austria, 2008.
(20) Huang da, W.; Sherman, B. T.; Lempicki, R. A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 2009, 4 (1), 44−57.
(21) Berglund, L.; Bjorling, E.; Jonasson, K.; Rockberg, J.; Fagerberg, L.; Al-Khalili Szigyarto, C.; Sivertsson, A.; Uhlen, M. A whole-genome bioinformatics approach to selection of antigens for systematic antibody generation. Proteomics 2008, 8 (14), 2832−9.
(22) Nilsson, P.; Paavilainen, L.; Larsson, K.; Odling, J.; Sundberg, M.; Andersson, A. C.; Kampf, C.; Persson, A.; Al-Khalili Szigyarto, C.; Ottosson, J.; Bjorling, E.; Hober, S.; Wernerus, H.; Wester, K.; Ponten, F.; Uhlen, M. Towards a human proteome atlas: high-throughput generation of mono-specific antibodies for tissue profiling. Proteomics 2005, 5 (17), 4327−37.
of the human proteome. To characterize the whole human proteome and enable identification of more rarely expressed genes, one can envision that very specialized cells will be needed. The results show a path to combine transcriptomics with affinity proteomics as a necessary step toward an efficient and complete characterization of the entire human proteome in a gene- or chromosome-centric manner. From now on, the strategy within the HPA project will be to select suitable cell samples, based on RNA-seq data, for subcellular localization of each and every protein. This strategy could also be used collectively by the participants of the C-HPP consortium. We suggest that a comprehensive database with RNA-seq data of most human cell lines and tissues would prove a valuable starting point for an efficient and harmonized characterization of the human proteome by the respective C-HPP teams.
■
AUTHOR INFORMATION
Corresponding Author
*E-mail: [email protected].
Author Contributions
‡These authors contributed equally to this manuscript.
Notes
The authors declare no competing financial interest.
■
ACKNOWLEDGMENTS We acknowledge Science for Life Laboratory Stockholm and SNISS for help with massively parallel sequencing and bioinformatics analysis and the entire staff of the Human Protein Atlas project. This work was supported by grants from the Knut and Alice Wallenberg Foundation and the strategic grant Science for Life Laboratory.
■
REFERENCES
(1) Vickaryous, M. K.; Hall, B. K. Human cell type diversity, evolution, development, and classification with special reference to cells derived from the neural crest. Biol. Rev. Camb. Philos. Soc. 2006, 81 (3), 425−55.
(2) Lundberg, E.; Fagerberg, L.; Klevebring, D.; Matic, I.; Geiger, T.; Cox, J.; Algenas, C.; Lundeberg, J.; Mann, M.; Uhlen, M. Defining the transcriptome and proteome in three functionally different human cell lines. Mol. Syst. Biol. 2010, 6, 450.
(3) Ramskold, D.; Wang, E. T.; Burge, C. B.; Sandberg, R. An abundance of ubiquitously expressed genes revealed by tissue transcriptome sequence data. PLoS Comput. Biol. 2009, 5 (12), e1000598.
(4) Sultan, M.; Schulz, M. H.; Richard, H.; Magen, A.; Klingenhoff, A.; Scherf, M.; Seifert, M.; Borodina, T.; Soldatov, A.; Parkhomchuk, D.; Schmidt, D.; O’Keeffe, S.; Haas, S.; Vingron, M.; Lehrach, H.; Yaspo, M. L. A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science 2008, 321 (5891), 956−60.
(5) Nagaraj, N.; Wisniewski, J. R.; Geiger, T.; Cox, J.; Kircher, M.; Kelso, J.; Paabo, S.; Mann, M. Deep proteome and transcriptome mapping of a human cancer cell line. Mol. Syst. Biol. 2011, 7, 548.
(6) Barbe, L.; Lundberg, E.; Oksvold, P.; Stenius, A.; Lewin, E.; Bjorling, E.; Asplund, A.; Ponten, F.; Brismar, H.; Uhlen, M.; Andersson-Svahn, H. Toward a confocal subcellular atlas of the human proteome. Mol. Cell. Proteomics 2008, 7 (3), 499−508. Fagerberg, L.; Stadler, C.; Skogs, M.; Hjelmare, M.; Jonasson, K.; Wiking, M.; Abergh, A.; Uhlen, M.; Lundberg, E. Mapping the subcellular protein distribution in three human cell lines. J. Proteome Res. 2011, 10 (8), 3766−77.
(23) Hebenstreit, D.; Fang, M.; Gu, M.; Charoensawan, V.; van Oudenaarden, A.; Teichmann, S. A. RNA sequencing reveals two major classes of gene expression levels in metazoan cells. Mol. Syst. Biol. 2011, 7, 497.
(24) Geiger, T.; Wehner, A.; Schaab, C.; Cox, J.; Mann, M. Comparative proteomic analysis of eleven common cell lines reveals ubiquitous but varying expression of most proteins. Mol. Cell. Proteomics 2012, 11 (3), M111 014050.
(25) Klevebring, D.; Fagerberg, L.; Lundberg, E.; Emanuelsson, O.; Uhlen, M.; Lundeberg, J. Analysis of transcript and protein overlap in a human osteosarcoma cell line. BMC Genomics 2010, 11, 684. Solnestam, B. W.; Stranneheim, H.; Hallman, J.; Kaller, M.; Lundberg, E.; Lundeberg, J.; Akan, P. Comparison of total and cytoplasmic mRNA reveals global regulation by nuclear retention and miRNAs. BMC Genomics 2012, 13, 574.
(26) Stadler, C.; Skogs, M.; Brismar, H.; Uhlen, M.; Lundberg, E. A single fixation protocol for proteome-wide immunofluorescence localization studies. J. Proteomics 2010, 73 (6), 1067−78.
(27) Fagerberg, L.; Stromberg, S.; El-Obeid, A.; Gry, M.; Nilsson, K.; Uhlen, M.; Ponten, F.; Asplund, A. Large-scale protein profiling in human cell lines using antibody-based proteomics. J. Proteome Res. 2011, 10 (9), 4066−75.
(28) Simpson, J. C.; Wellenreuther, R.; Poustka, A.; Pepperkok, R.; Wiemann, S. Systematic subcellular localization of novel proteins identified by large-scale cDNA sequencing. EMBO Rep. 2000, 1 (3), 287−92.
(29) Yates, J. R., 3rd; Gilchrist, A.; Howell, K. E.; Bergeron, J. J. Proteomics of organelles and large cellular structures. Nat. Rev. Mol. Cell Biol. 2005, 6 (9), 702−14.