Multidimensional Proteome Analysis of Human Mammary Epithelial

Multidimensional Proteome Analysis of Human Mammary Epithelial Cells Jon M. Jacobs,† Heather M. Mottaz,† Li-Rong Yu,‡ David J. Anderson,† Ronald J. Moore,† Wan-Nan U. Chen,§ Kenneth J. Auberry,† Eric F. Strittmatter,† Matthew E. Monroe,† Brian D. Thrall,§ David G. Camp, II,† and Richard D. Smith* Biological Sciences Division & Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Richland, Washington, USA Received August 6, 2003

Recent multidimensional liquid chromatography MS/MS studies have contributed to the identification of large numbers of expressed proteins for numerous species. The present study couples size exclusion chromatography of intact proteins with the separation of tryptically digested peptides using a combination of strong cation exchange and high resolution, reversed phase capillary chromatography to identify proteins extracted from human mammary epithelial cells (HMECs). In addition to conventional conservative criteria for protein identifications, the confidence levels were additionally increased through the use of peptide normalized elution times (NET) for the liquid chromatographic separation step. The combined approach resulted in a total of 5838 unique peptides identified covering 1574 different proteins with an estimated 4% gene coverage of the human genome, as annotated by the National Center for Biotechnology Information (NCBI). This database provides a baseline for comparison against variations in other genetically and environmentally perturbed systems. Proteins identified were categorized based upon intracellular location and biological process with the identification of numerous receptors, regulatory proteins, and extracellular proteins, demonstrating the usefulness of this application in the global analysis of human cells for future comparative studies. Keywords: human • HMEC • multidimensional • liquid chromatography • proteome • global • size exclusion

Journal of Proteome Research 2004, 3, 68-75
Published on Web 11/05/2003
10.1021/pr034062a CCC: $27.50 © 2004 American Chemical Society

* To whom correspondence should be addressed. Richard D. Smith, Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, P.O. Box 999, Richland, WA 99352, USA.
† Environmental Molecular Sciences Laboratory.
‡ Present address: SAIC-Frederick, National Cancer Institute at Frederick, Frederick, Maryland, USA 21702.
§ Biological Sciences Division.

Introduction The landscape of biological systems analysis has changed dramatically with the continued completion of genome sequences for entire organisms.1-5 The next formidable challenge is to look globally at the dynamics of cellular systems and pathways. The identification and detection of structural and functional proteins plays a pivotal role in this analysis and helps link perturbations at the protein level to responses at the cellular level. Proteomic research has undergone changes to reflect the needs of a post-genomic era6,7 by advancing the technology of multidimensional liquid separations coupled with mass spectroscopy8-15 that now make it possible to confidently identify large numbers of proteins in cell or tissue samples. To date, the global proteomes of numerous prokaryotic and eukaryotic species have been analyzed,16-21 resulting in extensive protein databases of these organisms, and now with the sequence of the human genome in hand,22 large scale proteomic identification methods can be more effectively directed toward human proteins, cells, and tissues. The use of human cell lines has been an essential tool for studying cellular responses under controlled conditions, and has been instrumental in elucidating regulatory pathways involved in human disease. For example, normal human mammary epithelial cells (HMEC) have been used extensively to investigate growth factor regulatory cascades, and as a means for comparative investigations involving mammary cancer cells.23-25 Proteomic analysis of the nontumorigenic cell line would facilitate identification of proteins normally expressed, possible novel proteins involved in various cellular pathways, and facilitate discovery of key protein markers for phenotypic characterization. The development of a comprehensive, human, baseline, protein database would enable future comparative proteomic studies, and allow the development of peptide “tags” that can dramatically increase throughput for proteome analyses based upon the use of very accurate mass and liquid chromatography (LC) elution time measurements,14,15 and provide a better application to quantitative measurement approaches. Demonstrated here is the application of a combination of powerful multidimensional separation techniques coupled with mass spectrometry for the global detection of 1574 proteins from HMEC cultures. To achieve an added level of protein separation, size exclusion chromatography was first performed at the intact protein level followed by the more common

peptide separation techniques of strong cation exchange chromatography coupled with high resolution, reversed phase capillary liquid chromatography MS/MS (LC-MS/MS). The analysis was repeated in the presence and absence of cysteine alkylation for added coverage and for investigation of the effect of cysteine alkylation on multidimensional peptide fractionation. The approach taken for protein identification uses both conventional tandem mass spectrometry approaches, further augmented by the use of a recently developed approach using LC elution time information.26 The set of HMEC proteins confidently identified is to our knowledge the most comprehensive yet reported for a human cell line, and provides a baseline for comparison against variations in other genetically and environmentally perturbed human systems. A wide range of proteins were identified including numerous receptors, regulatory proteins, extracellular proteins, and signal transduction proteins, demonstrating the usefulness of this application in the global analysis of human cells and for future comparative studies in elucidating potential cellular markers.

Materials and Methods Human Mammary Epithelial Cells. The nontumorigenic human mammary epithelial cell line HMEC 184 AIL5,24 provided by Dr. Lee Opresko, Pacific Northwest National Laboratory, was used in these studies. Cells were routinely cultured in DFCI-1 medium as previously described.25 Fresh medium was supplied every other day and samples for proteome analysis were taken from culture in late logarithmic growth and of >98% viability as determined by trypan blue exclusion. Protein Preparation. Cell pellets were washed three times in 1 mL ice-cold phosphate buffered saline (PBS), pH 7.2, followed by centrifugation at 10 000 × g. Lysis buffer (10 mM sodium phosphate, pH 7, 0.5% sodium dodecyl sulfate) was added to the cell pellets and the cells were lysed using sonication on ice for 5 min. The lysate was centrifuged for 15 min at 4 °C, 14 000 × g to pellet any cell debris. The lysate sample was denatured thermally (100 °C for 5 min) and reduced with 10 mM fresh DL-dithiothreitol (DTT, Boehringer Mannheim, Indianapolis, IN) for 1 h at room temperature (RT), followed by separation and alkylation of one aliquot with 32 mM iodoacetamide for 1 h at RT. Excess alkylation material was quenched by the addition of fresh 10 mM DTT to the samples (with incubation for 1 h at RT). Protein Digestion and Separation. Size-exclusion chromatography (SEC) was performed at the intact protein level for both the alkylated and nonalkylated aliquots as follows. Sample was injected onto a Phenomenex (Torrance, CA) BioSep-SEC-S 2000, 600 × 21.2 mm column preceded by a 75 × 21.2 mm guard column with SEC separation utilizing a Shimadzu LC10A (Columbia, MD) system with an isocratic gradient at a rate of 5 mL/min consisting of 100 mM NH4HCO3, pH 6.8, for the mobile phase. The ultraviolet (UV) spectra were observed at 210 and 280 nm to determine fraction collection time points. 
Fraction collection was performed manually, and each fraction was terminated when a significant decrease in absorbance was observed in the UV spectra. The fractions had volumes ranging from 7.1 to 84 mL and were lyophilized to reduce sample volume before protein digestion. Sequencing grade, modified porcine trypsin (Promega, Madison, WI) was added at a trypsin:protein ratio of 1:50 and incubated at 37 °C for 16 h, after which the samples were lyophilized to dryness and stored frozen at -80 °C.

The nonalkylated peptide sample fractions were reconstituted with approximately 1.0 mL of 10 mM ammonium formate, 25% acetonitrile (ACN), pH 3.0, and injected for strong cation exchange chromatography (SCX) onto a PolyLC (Columbia, MD) Polysulfoethyl A 200 × 9.4 mm column preceded by a 10 × 10 mm guard column with a flow rate of 4 mL/min. The separations were performed with a Shimadzu LC-10A system utilizing a Unicam 4225 (Thermo Electron, Waltham, MA) UV/vis detector with mobile phases consisting of solvent A: 10 mM ammonium formate, 25% ACN, pH 3.0, and solvent B: 500 mM ammonium formate, 25% ACN, pH 6.8. Once loaded, the run was isocratic for 5 min at 100% solvent A, followed by a gradient from 100% solvent A to 100% solvent B over 35 min. The gradient was then held at 100% solvent B for 60 min, followed by a 20 min gradient back down to 100% solvent A, and the column was reequilibrated with 100% solvent A for 20 min before the start of another run. Two mL fractions were collected using a Shimadzu FRC-10A fraction collector, after which each fraction was lyophilized to dryness and stored at -80 °C until analyzed.

The alkylated peptide sample fractions were developed via strong cation exchange chromatography, similar to the nonalkylated sample, except for the changes described below. The alkylated sample was separated using a strong cation exchange analytical column provided by Applied Biosystems (Foster City, CA) with a flow rate of 0.4 mL/min, utilizing the following mobile phases: Loading Buffer (Buffer C): 10 mM KH2PO4, 25% ACN, pH 3.0; Elution Buffer (Buffer D): 350 mM KCl, 10 mM KH2PO4, 25% ACN, pH 3.0; and Cleaning Buffer (Buffer E): 1 M KCl, 10 mM KH2PO4, 25% ACN, pH 3.0. Gradient flows were run similar to the nonalkylated sample with the exception of 100% Buffer E being held for 12 min before the final equilibration step with Buffer C. Fractions were collected and stored as described above for the nonalkylated sample.
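The SCX gradient program given above for the nonalkylated sample can be written as a piecewise-linear function of time. The following is a minimal sketch only: the names `GRADIENT`, `percent_b`, and `ammonium_formate_mM` are illustrative assumptions, time zero is assumed to be the start of the isocratic hold, and ideal linear mixing of solvents A and B is assumed (the accompanying pH change from 3.0 to 6.8 is not modeled).

```python
# Breakpoints of the nonalkylated SCX program as (time in min, % solvent B).
# Names and layout are hypothetical; only the durations come from the text.
GRADIENT = [
    (0.0, 0.0),      # start of isocratic hold at 100% A
    (5.0, 0.0),      # end of 5 min hold
    (40.0, 100.0),   # 35 min ramp from 100% A to 100% B
    (100.0, 100.0),  # 60 min hold at 100% B
    (120.0, 0.0),    # 20 min ramp back to 100% A
    (140.0, 0.0),    # 20 min re-equilibration at 100% A
]


def percent_b(t_min: float) -> float:
    """Return % solvent B at time t_min by linear interpolation."""
    if t_min < GRADIENT[0][0] or t_min > GRADIENT[-1][0]:
        raise ValueError("time is outside the programmed run")
    for (t0, b0), (t1, b1) in zip(GRADIENT, GRADIENT[1:]):
        if t0 <= t_min <= t1:
            return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)
    raise AssertionError("unreachable: breakpoints cover the whole run")


def ammonium_formate_mM(t_min: float) -> float:
    """Nominal ammonium formate concentration (10 mM in A, 500 mM in B)."""
    fraction_b = percent_b(t_min) / 100.0
    return 10.0 + (500.0 - 10.0) * fraction_b
```

Under these assumptions one full cycle lasts 140 min, and at 4 mL/min each 2 mL fraction corresponds to 0.5 min of run time.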
Different columns were utilized between the alkylated and nonalkylated samples primarily due to availability. Column differences could correlate to a decrease in the overlap between the two samples, but the total number of observed results (unique peptides and proteins identified) between the two samples is similar overall. Reversed Phase Separation and MS/MS Analysis. This method has been previously reported,30 but briefly, the reversed phase capillary liquid chromatography system was made in-house using a 150 µm i.d. × 360 µm o.d. × 65 cm capillary (Polymicro Technologies Inc., Phoenix, AZ) fitted with a 2-µm retaining mesh and packed with 5 µm Jupiter C18 stationary phase (Phenomenex, Torrance, CA). Mobile phase F consisted of 0.05% trifluoroacetic acid (TFA), 0.2% acetic acid in water, and mobile phase G consisted of 0.1% TFA, 90% ACN in water. The exponential gradient mixing of mobile phase F with mobile phase G (flow of 1.8 µL/min) began while maintaining constant pressure (5000 psi) 20 min following a 10 µL injection of the sample (1.0 µg/µL). The capillary column was interfaced with a Finnigan LCQ ion trap mass spectrometer (ThermoFinnigan, San Jose, CA) using an electrospray ionization source manufactured in-house. The initial MS scan utilized an m/z range of 400-2000, after which the three most abundant ions were selected for MS/MS analysis using a collision energy setting of 45%. Dynamic exclusion was used to prevent repeated analysis of the same highly abundant ion. Data Analysis. The SEQUEST analysis software31 was used to match the MS/MS fragmentation spectra with peptides from a human protein database. Criteria used for filtering strictly followed previously published methods.8 Briefly, peptide identifications were retained if their ∆Cn value was >0.1 and if any of the following criteria applied: Xcorr > 1.9 with charge state 1+ and fully tryptic cleavage, Xcorr > 2.2 with charge state 2+ and fully or partially tryptic cleavage, Xcorr > 3 with charge state 2+ for all peptides, or Xcorr > 3.75 with charge state 3+ and fully or partially tryptic cleavage. When analyzing the alkylated samples with SEQUEST, a dynamic modification was used to identify both normal and iodoacetamide labeled cysteine containing peptides. The human protein database was generated from the NCI Frederick ABCC nonredundant database and was created by selectively filtering entries to retain only “human” descriptions, resulting in the accumulation of 76 402 FASTA entries. After the SEQUEST results were filtered and the unique peptides identified, the data were examined manually for redundancies in “unique” protein identifications (initially 2074 IDs). Numerous protein identifications were found to have multiple entries within the database, i.e., the peptide had been correctly assigned, but the exact peptide appeared in more than one reference. These database redundancies were addressed by searching manually in order to assign only a single protein per peptide identification. In most instances, the ID assignment could be reduced to one protein identification and when possible, a SwissProt entry was used for the given assignment. In some instances, a single peptide would correspond to different proteins, at which point this peptide was not used for protein identification. This analysis reduced the 2074 filtered ID assignments down to 1700 unique protein identifications.
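The SEQUEST filtering rules stated above can be expressed as a small predicate. This is a sketch under stated assumptions, not the authors' software: the `PeptideHit` record, its field names, and the encoding of cleavage state as a count of tryptic termini (2 = fully tryptic, 1 = partially tryptic, 0 = non-tryptic) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeptideHit:
    """Hypothetical record for one SEQUEST peptide assignment."""
    xcorr: float       # cross-correlation score
    delta_cn: float    # delta-correlation score
    charge: int        # precursor charge state
    tryptic_ends: int  # 2 = fully, 1 = partially, 0 = non-tryptic


def passes_filter(hit: PeptideHit) -> bool:
    """Apply the thresholds quoted in the text to a single hit."""
    if hit.delta_cn <= 0.1:
        return False  # every retained hit needs delta Cn > 0.1
    fully = hit.tryptic_ends == 2
    fully_or_partially = hit.tryptic_ends >= 1
    if hit.charge == 1:
        return fully and hit.xcorr > 1.9
    if hit.charge == 2:
        # Xcorr > 2.2 needs at least one tryptic end; Xcorr > 3 does not.
        return (fully_or_partially and hit.xcorr > 2.2) or hit.xcorr > 3.0
    if hit.charge == 3:
        return fully_or_partially and hit.xcorr > 3.75
    return False  # other charge states are not covered by the criteria
```

For example, a 2+ non-tryptic hit with Xcorr 2.5 would be rejected, while the same hit with Xcorr 3.1 would be retained under the "all peptides" rule.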
An additional criterion, the parameter of the peptide LC normalized elution time (NET), was implemented, screening the peptides by the difference in their observed NET value and the calculated NET value obtained by the use of an artificial neural network (ANN).26 Peptides that had an observed NET that agreed within ±10% of the predicted NET were retained and used as identifying peptides. This analysis removed 1672 total peptides, lowering the unique protein identifications from 1700 down to the 1574 proteins confidently reported. Protein Classification. Proteins were classified using Gene Ontology (GO) identifications. To customize the classification for the data reported here, a limited number of changes were incorporated to better represent the data. These changes included a separate category for ribosomal proteins as a cellular component, the combination of the Energy Pathways and Metabolism categories into a single category in Biological Process, and the creation of a Cellular Processes category to better organize proteins involved in overall cellular function.
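The NET screen described above amounts to comparing an observed and a predicted elution time and discarding peptides that disagree by more than the tolerance. The sketch below assumes that NET values lie on a 0 to 1 scale and that "10%" means an absolute difference of 0.10 on that scale; the source does not spell this out, and the function names and the tuple layout of `hits` are hypothetical. The ANN predictor itself is not reproduced here.

```python
def within_net_tolerance(observed_net, predicted_net, tolerance=0.10):
    """True when observed and predicted NET agree within the tolerance."""
    return abs(observed_net - predicted_net) <= tolerance


def proteins_after_net_screen(hits, tolerance=0.10):
    """Return the proteins that keep at least one identifying peptide.

    `hits` is an iterable of (protein_id, peptide, observed_net,
    predicted_net) tuples; this layout is an assumption for illustration.
    """
    return {
        protein
        for protein, _peptide, observed, predicted in hits
        if within_net_tolerance(observed, predicted, tolerance)
    }
```

A protein drops out of the list only when every one of its peptides fails the screen, which is consistent with the text reporting far more peptides removed (1672) than proteins (1700 down to 1574).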

Results Separation and Detection Experiments. The flowchart in Figure 1 shows the experimental progression from the cell lysate and proceeding through to peptide identification. Briefly, following cell lysis, protein samples were divided into two aliquots with each aliquot being denatured and reduced. One protein aliquot was alkylated using iodoacetamide, whereas the other was left untreated. The proteins from each aliquot were then separated using size exclusion chromatography (SEC) with 5-6 fractions collected for each sample. These protein fractions were separately digested using trypsin, and the resulting tryptic peptides further separated using strong cation exchange (SCX) chromatography, yielding a total of 114 peptide fractions for each original protein aliquot. The number of peptide fractions retained during each SCX separation ranged between 15 and

Journal of Proteome Research • Vol. 3, No. 1, 2004


Figure 1. Diagram of sample separation and analysis. Shown is the flow of experimental information starting from cell lysate through data analysis. Two HMEC global protein samples were prepared, either alkylated or nonalkylated, and were subjected to size exclusion chromatography, separate tryptic digestion of each SEC fraction, and strong cation exchange chromatography. The subsequent 114 peptide fractions were then analyzed via reversed-phase liquid chromatography coupled with MS/MS, resulting in spectra that were analyzed using SEQUEST. Results were then screened using charge state, Xcorr, ∆Cn, and tryptic state criteria resulting in >5800 unique peptides, correlating to 1574 unique protein identifications.

30, due to variations in the complexity of each SEC fraction. Each SCX peptide fraction was analyzed via high resolution, reversed phase capillary liquid chromatography (RP-LC) coupled with electrospray ionization tandem mass spectrometry. A large number of MS/MS spectra were generated (∼700 000), which correlated into ∼16 800 peptide identifications representing ∼5800 different peptides. An example of the experimental flow for separation and identification of peptides is given in Figure 2. Figure 2A shows the SEC protein separation of the nonalkylated HMEC sample. The arrow denotes the peak collected as fraction 2, which was determined to be the most complex of the SEC fractions. The SCX peptide separation of SEC protein fraction 2 is shown in Figure 2B, along with the corresponding number of peptide identifications per SCX fraction. The arrow in Figure 2B denotes the SCX peptide fraction 16, a representative fraction which eventually was determined to contain ∼400 peptide identifications. The RP-LC base peak chromatogram of SCX peptide fraction 16 is shown in Figure 2C and one time point (scan# 1107, 35.6 min) represents a single MS scan (Figure 2D). The MS/MS analysis of a single precursor ion (m/z 572) from scan #1107 is shown in Figure 2E with the identifying b and y fragmentation ions labeled. Following MS/MS analysis, the spectra were analyzed using the SEQUEST software which generated general cross correlation scores (Xcorr) from each spectrum as well as deltacorrelation scores (∆Cn), representing the difference in Xcorr values between the highest and second highest scoring peptide identifications. Each peptide score was then screened using previously published criteria,8 based upon the charge state of the peptide, Xcorr, cleavage state, and a ∆Cn value of >0.1. In


Figure 2. Separation Chromatograms and LC-MS/MS analysis. (A) Size exclusion chromatography (SEC) separation of nonalkylated HMEC cell lysate. Isocratic elution of the proteins was observed using A280 with brackets representing the collected fractions. A total of 6 fractions were collected. Fraction 1 was discarded due to lack of protein detected. Protein separation of the alkylated sample was similar, but with a total of 7 fractions collected. (B) Strong cation exchange (SCX) chromatography of tryptically digested SEC Fraction 2. The total number of peptide identifications for each peptide fraction is shown as a bar superimposed over the SCX absorbance. (C) Reversed phase LC-MS/MS analysis of SCX peptide fraction 16. The base peak chromatogram represents gradient elution of 10% to 60% acetonitrile over time. (D) MS scan taken at time point 35.62 min during RP LC-MS/MS analysis. Three peaks are selected for each MS scan for further identification via collision induced dissociation. (E) MS/MS scan of parent ion m/z 572.2. All major peaks have been labeled as either b- or y- ions indicating that the parent ion is the fully tryptic peptide R.FLIVAHDDGR.W, originating from protein FSC1_HUMAN.

addition to screening SEQUEST parameters, peptides were further screened to eliminate peptide identifications that had a predicted normalized elution time (NET)26 more than 10% different from the measured value, significantly increasing the overall confidence for the dataset (see Materials and Methods). Comparison of Alkylated and Nonalkylated Results. To determine the effects of cysteine alkylation in regard to peptide identifications in a multidimensional study, a comparison of nonalkylated and alkylated samples is shown in Table 1. Most notable is the difference in detection of cysteine containing peptides, with a significant 8-fold increase in detected cysteine containing peptides in the alkylated sample. Although an increase in the detection of cysteine containing peptides with the alkylated sample was expected, the data demonstrate the poor identification of nonalkylated cysteine containing peptides in analyses that do not account for their potential modified forms. Cysteine alkylation went to near completion (98%) for the detected cysteine containing peptides in the alkylated sample.

Table 1. Comparison of Results between Alkylated and Non-Alkylated Samples

| alkylated | non-alkylated | overlap | total
unique peptides | 3622 | 3390 | 1174 | 5838
cysteine containing peptides | 536 | 58 | 20 | 575
alkylated cysteine peptides | 527 | N/A(a) | N/A | 527
unique proteins | 1220 | 1084 | 730 | 1574

(a) Not Applicable.
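The totals in Table 1 can be checked against the two sample columns with inclusion-exclusion, and the cysteine percentages quoted in the text follow from the same numbers. The short sketch below uses only values from Table 1; the helper name `union_size` is an illustrative assumption.

```python
def union_size(alkylated: int, nonalkylated: int, overlap: int) -> int:
    """Size of the combined set given two set sizes and their overlap."""
    return alkylated + nonalkylated - overlap


total_peptides = union_size(3622, 3390, 1174)
total_proteins = union_size(1220, 1084, 730)

# Share of unique peptides that contain cysteine in each sample.
cys_percent_alkylated = 100.0 * 536 / 3622
cys_percent_nonalkylated = 100.0 * 58 / 3390

# Share of detected cysteine peptides that carried the alkylation.
alkylation_percent = 100.0 * 527 / 536
```

The peptide and protein rows reproduce the reported totals of 5838 and 1574. Applying the same formula to the cysteine row gives 574 rather than the reported 575, a difference of one that the source does not explain.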

Effectiveness of Molecular Weight Separation. To investigate whether the size exclusion separation at the protein level correlated with the predicted molecular weights (MW) of identified proteins, the identified proteins of the alkylated sample were mapped back to their original size exclusion fraction and their average MW was graphed while superimposed upon the SEC separation plot (see Figure 3). The expected correlation between average molecular weight of the proteins and increasing SEC fractions is generally observed. However, fraction 5 appears to go against this trend, which is

Figure 3. Molecular weight distribution in size exclusion fractions of identified proteins. The plot demonstrates the separation observed with regard to molecular weight in the alkylated SEC separation. The average MW was calculated using only the final unique proteins identified, which was correlated to the fraction that contained the majority of identifying peptides. The average MW data is superimposed upon the absorbance of the SEC separation to better visualize the protein separation that corresponds with the average MW detection. Error bars represent the standard error of mean (SEM).

Figure 4. Chromosome Distribution of Identified Proteins. Shown is the division of identified proteins based upon human chromosome gene location. An overall average of 4.1% coverage is observed based upon a total of 35 488 genes predicted by the National Center for Biotechnology Information (NCBI). A total of 94% of the identified proteins were linked back to a chromosome location with 99 proteins not mapped.

Figure 5. Cellular Categorization of Identified Proteins. Categorization was achieved by correlating GO identification numbers corresponding to either cellular component or biological process with the identified proteins. Results shown represent approximately 87% of all proteins detected. Values in parentheses represent the percentage of total proteins found in that respective category for all human SwissProt GO identifications. This gives an approximate comparison of the representation of detected proteins in the global sample. (A) Identified proteins categorized based upon their cellular location. Proteins in various regions of the cell are both over and underrepresented, but an overall representation of protein localities in the cell is observed. (B) Identified proteins categorized based upon their predicted biological roles.

potentially attributed to the late elution of certain proteins involved in nonspecific interactions with the SEC column that are independent of size separation. The MW breakdown of the nonalkylated detected proteins produced similar results (data not shown). Categorization of Detected Proteins. The breakdown distribution of identified proteins based upon gene location is displayed in Figure 4. The gene location was generated using the National Center for Biotechnology Information (NCBI) Locus Link (http://www.ncbi.nlm.nih.gov/LocusLink/) and is based upon a total of ∼35 000 mapped genes. The average percent chromosome coverage was determined to be 4.1% in addition to 99 proteins that could not be mapped. A 26% standard deviation is seen across the genome, with some

specific chromosome levels varying by more than 3- fold (1.9% for chromosome 13 versus 5.8% for chromosome 17). Categorization of the function and location of the identified proteins was performed with the assistance of Gene Ontology identification numbers (GO ID #s) downloaded from EBI at http://www.ebi.ac.uk. Cellular location (cellular component) and process (biological process) categorization of the detected proteins were based upon GO ID #s, but some limited changes were made in categorization in an attempt to better represent the data (see Materials and Methods). Cellular location of the detected proteins is graphically illustrated in Figure 5A. Also shown for comparison in parentheses is an approximate % distribution of proteins based upon all SwissProt human entries with a GO categorization (total of 7,288 entries). As can be seen, the majority of the detected proteins (25.7%) fall in the cytoplasm with an approximate 18% determined to be of unknown location. Both of these categories are overrepresented in comparison with the total protein percentage, including ribosomal and cytoskeleton proteins. This overrepresentation probably stems from both the increased solubility of these


categorized proteins (and hence, the greater ability to digest and detect their peptides) as well as their overall higher abundance in the cell (ribosome and cytoskeleton components). The discrepancy in unknown proteins is probably due to a bias in comparing just SwissProt entries to the large database that was used in searching for identified peptides which contained many more unknown references. A reported 20% of detected proteins were classified as unknown in the proteomic survey of human heart mitochondria27 similar to our determined 18%. Under-represented categories include membrane associated proteins, extracellular proteins, and nuclear proteins. During the global sample preparation, no effort was made to target any specific subset of proteins, so it is believed that this under-representation is also affected by solubility conditions (membrane proteins) and lower abundance (extracellular and nuclear proteins) of these proteins. Not all of the detected proteins were able to be categorized using GO ID #s; only 64% (biological process) and 62% (cellular component) of the detected proteins were initially assigned a GO ID # with the downloaded annotation files. Additional GO ID # categorizations of detected proteins were assigned by employing homology to known proteins and literature searches, which provided a 22-27% increase in categorization. In total, the results shown in Figure 5 represent 87% of the total number of detected proteins. The diversity in biological function of the detected proteins is observed in Figure 5B demonstrating the wide range of proteins identified. The largest categorization was Unknown (233 proteins), as previously discussed, but other categories were also well represented. The Protein Biosynthesis category (177 proteins) contained 80 ribosomal proteins and isoforms (46 large subunit [60S] and 28 small subunit [40S]) with 5 designated to the mitochondria. 
Also observed were sixteen different amino acid tRNA synthetase enzymes, 23 different subunits of eukaryotic translation initiation factors 2 through 6, five elongation factor subunits, and prefoldin subunits 1-4, 6. Although this category appears to be over-represented in comparison to the classification of total proteins, as suggested above, the coverage of highly abundant ribosomal proteins most likely contributed to this value. The Signal Transduction and Regulation category (135 proteins) included 26 kinase and kinase inhibitory proteins, 8 phosphatases, 8 different forms of annexin, six Ras-related proteins, and seven apoptosis-related proteins. Transcription and RNA Modification (125 proteins) category encompassed 10 different splicing factor subunits, and nine snRNPs (representing proteins A, B, and D-G), six transcription factors, RNA polymerase I, II, and III polypeptides, and six poly(A) and poly(rC) binding proteins. Transport (121 proteins) contained five membrane associated ion channels, four nuclear pore complex proteins, three mitochondrial import inner membrane translocase subunits, and eight various transporter proteins. Cellular Process (159 proteins) was a diverse categorization that included detected proteins such as cell adhesion proteins (cadherins, integrins, PAM-1, ICAM-3) and numerous other proteins of interest including epithelial cell marker protein 1, EGF receptor, EGF receptor substrate 15, mitogen-activated protein kinase 3, ferritin, density-regulated protein (DRP1), calgranulins, and catenins. All four of these categories appear to be underrepresented in comparison with the total protein categorization. Energy Pathway and Metabolism (117 proteins) detected proteins included 31 enzymes and proteins related to glycolysis

Table 2. Peptide Coverage of Specific Proteins Categorized by Molecular Weight

name | symbol | no. of unique peptides | MW | % coverage
MW < 25 kDa
S100 calcium-binding protein A2 | S102 | 22 | 10 986 | 69
peptidyl-prolyl cis-trans isomerase A | CYPH | 16 | 17 881 | 85
calmodulin | CALM | 18 | 16 706 | 75
glutathione S-transferase P | GTP | 16 | 23 224 | 60
60S acidic ribosomal protein P2 | RLA2 | 14 | 11 665 | 70
average | | 17 | 16 092 | 72
MW 25-50 kDa
annexin II | ANX2 | 48 | 38 472 | 75
epithelial cell marker protein 1 | 143S | 21 | 27 774 | 60
fructose-bisphosphate aldolase A | ALFA | 25 | 39 289 | 76
glyceraldehyde 3-P dehydrogenase | G3P2 | 32 | 35 922 | 70
triosephosphate isomerase | TPIS | 17 | 26 538 | 72
average | | 29 | 33 599 | 71
MW 50-90 kDa
78 kDa glucose-regulated protein | GR78 | 29 | 72 333 | 50
heat shock cognate 71 kDa protein | HS7C | 35 | 70 898 | 56
keratin, type II cytoskeletal 7 | K2C7 | 28 | 51 286 | 57
heat shock protein HSP 90-alpha | HS9A | 37 | 84 542 | 41
60 kDa heat shock protein | CH60 | 32 | 61 054 | 66
average | | 32 | 68 023 | 54
MW > 90 kDa
elongation factor 2 | EF2 | 35 | 95 207 | 43
myosin heavy chain, type A | MYH9 | 48 | 226 531 | 31
protein AHNAK | AHNK | 73 | 312 487 | 37
alpha-actinin 1 | AAC1 | 34 | 102 974 | 53
endoplasmin | ENPL | 23 | 92 468 | 38
average | | 43 | 165 933 | 40
total average | | 30 | 70 911 | 59

and carbohydrate metabolism, 17 different enzymes and isoforms involved in the TCA cycle (including the pyruvate dehydrogenase complex), 17 different enzyme subunits in oxidative phosphorylation, 14 different subunits and isoforms specific for ATP synthase, and four enzymes of the pentose phosphate pathway. Cytoskeleton Organization and Biogenesis (117 proteins) was well represented and included a number of common structural and motility proteins along with numerous growth associated proteins. Proteins detected include actin, dynactin, cofilin, dynein, keratin, kinesin, laminin, myosin, tropomyosin, and tubulin. Protein catabolism (68 proteins) was dominated mainly by 32 proteasome isoforms and subunits, and 13 ubiquitin related proteins. The remaining categories (>120 proteins), though less in overall percentage, compare well with the anticipated percentage levels of total protein categorization. Peptide Coverage of Specific Proteins. The data presented here not only demonstrates the quantity of proteins detected, but the qualitative aspect of peptide coverage per single protein. Over 120 detected proteins had 10 or more unique peptide identifications, while more than 300 detected proteins had at least 5 or more unique peptide identifications (data not shown). Table 2 is a list of high coverage proteins taken from the subset of proteins with a minimum of 10 unique peptide identifications. This list is not inclusive, but does represent higher coverage proteins that have been divided into four molecular weight categories, with an average calculated for each division. Results show an average of greater than 70% coverage for < 50 kDa MW proteins with coverage decreasing for the larger MW categories to 54% and 40%. An overall average of 59% is observed across all listed proteins demonstrating the range of coverage from small to large molecular weights. 
Most of the proteins listed above are believed to be in “high abundance” per cell, which would likely lead to a higher number of detected peptides and consequently higher peptide coverage. For example, calmodulin has been estimated to constitute 0.1-1.0% of the total cellular protein.28 A supplemental table has been furnished that contains all of the unique peptides identified (5838), categorized by the identifying protein (see the Supporting Information). Xcorr values, ∆Cn values, and the number of multiple hits per peptide have been included, allowing these data to be searched by those interested in the specific coverage of any peptide or protein.
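The per-protein coverage percentages discussed above can be illustrated with a short calculation. The following is a minimal sketch, assuming coverage is defined as the fraction of residues in a protein sequence that fall within at least one identified peptide; the function name, the toy sequence, and the toy peptides are hypothetical and are not taken from the study.

```python
def sequence_coverage(protein: str, peptides: list[str]) -> float:
    """Return the percentage of residues covered by at least one peptide."""
    if not protein:
        return 0.0
    covered = [False] * len(protein)
    for pep in peptides:
        if not pep:
            continue
        # Mark every occurrence of the peptide within the protein sequence.
        start = protein.find(pep)
        while start != -1:
            for i in range(start, start + len(pep)):
                covered[i] = True
            start = protein.find(pep, start + 1)
    return 100.0 * sum(covered) / len(protein)


# Toy example (hypothetical sequence and peptides, for illustration only).
toy_protein = "MKTAYIAKQRQISFVK"
toy_peptides = ["TAYIAK", "QISFVK"]
toy_coverage = sequence_coverage(toy_protein, toy_peptides)
print(f"coverage: {toy_coverage:.1f}%")
```

Overlapping peptides are counted only once per residue, so adding redundant identifications does not inflate the coverage value.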

Table 3. Breakdown of Protein Identifications Removed Using Peptide LC Normalized Elution Time (NET) Criterion

protein IDs                                  initial results            removed via       % of protein
                                             (Sequest criteria only)    NET correction    IDs retained
≥2 identifying peptides                      884                        6                 99
1 identifying peptide, observed ≥2 times     166                        13                92
1 identifying peptide, observed only once    650                        107               83
total                                        1700                       126               93
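The arithmetic behind Table 3 can be checked directly from its counts. The sketch below assumes that the "% of protein IDs retained" column is simply (initial - removed) / initial for each row; the dictionary keys and variable names are illustrative. Under that assumption the computed values agree with the printed percentages to within one percentage point, and the totals agree with the 1700 initial and 126 removed identifications.

```python
# (initial protein IDs after SEQUEST criteria, protein IDs removed by NET), from Table 3
table3 = {
    ">=2 identifying peptides": (884, 6),
    "1 identifying peptide, observed >=2 times": (166, 13),
    "1 identifying peptide, observed only once": (650, 107),
}


def retained_percent(initial: int, removed: int) -> float:
    """Percentage of protein identifications kept after the NET filter."""
    return 100.0 * (initial - removed) / initial


row_retained = {name: retained_percent(i, r) for name, (i, r) in table3.items()}
total_initial = sum(i for i, _ in table3.values())
total_removed = sum(r for _, r in table3.values())
total_retained = retained_percent(total_initial, total_removed)

for name, pct in row_retained.items():
    print(f"{name}: {pct:.1f}% retained")
print(f"total: {total_initial - total_removed} of {total_initial} retained ({total_retained:.1f}%)")
```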

Discussion

To date, a number of multidimensional peptide identification studies have been carried out on various organisms.16,18-20 The work described here demonstrates the effectiveness of this approach for the global proteome analysis of human cells. Similar work, recently done on the characterization of 615 proteins isolated from purified human heart mitochondria, represents an extensive proteomic identification of a single human organelle.27 In the present study, of the 90 proteins that were categorized as located within the mitochondria, ∼85% were also detected in the specific analysis of human heart mitochondria. Such subcellular fractionation greatly enhances the coverage of organelle-specific proteins, so it was expected that our global results would represent only a limited mitochondrial subproteome. Nevertheless, the globally analyzed proteins detected here overlap well with the previously reported results.

We were also able to analyze the iodoacetamide alkylation of cysteine residues in the context of a large-scale multidimensional experiment. On the basis of the results shown here, there was a greater than 8-fold increase in the detection of cysteine-containing peptides with the use of iodoacetamide alkylation, raising the percentage of detected cysteine-containing peptides from 1.7% to 14.8%. In comparison, an in silico tryptic digest of the SwissProt human database entries (8584 proteins) yielded 313 977 fully tryptic peptides, with 70 703 (22.5%) containing cysteine. This theoretical 22.5% is reasonably comparable to the 14.8% of cysteine-containing peptides observed with iodoacetamide alkylation. Upon further analysis of the identified protein results, it was observed that 345 unique proteins were confidently identified from the 527 iodoacetamide-labeled cysteine peptides.
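The in silico digest used for this comparison can be sketched as follows. This is a minimal illustration, not the procedure used in the study: it assumes the common trypsin rule (cleave C-terminal to K or R, but not when the next residue is P), allows no missed cleavages, and applies no peptide length limits. The function names and the toy sequence are hypothetical. The last lines only re-derive the reported 22.5% from the two peptide counts quoted in the text.

```python
def tryptic_digest(sequence: str) -> list[str]:
    """Cleave after K or R unless followed by P (assumed rule, no missed cleavages)."""
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Trailing peptide that does not end in K or R (protein C-terminus).
        peptides.append(sequence[start:])
    return peptides


def cysteine_fraction(proteins: list[str]) -> float:
    """Percentage of fully tryptic peptides that contain at least one cysteine."""
    peptides = [pep for seq in proteins for pep in tryptic_digest(seq)]
    if not peptides:
        return 0.0
    with_cys = sum(1 for pep in peptides if "C" in pep)
    return 100.0 * with_cys / len(peptides)


# Toy sequence (hypothetical, for illustration only).
toy_peptides = tryptic_digest("MKTCYRPAKCCR")
print(toy_peptides)

# Consistency check on the counts quoted in the text.
reported_percent = 100.0 * 70_703 / 313_977
print(f"{reported_percent:.1f}% of in silico peptides contain cysteine")
```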
A large number of these cysteine-containing proteins (271) had previously been identified using other peptides in both the alkylated and nonalkylated samples, leaving 74 cysteine-containing proteins (∼4.7% of the total protein identifications) identified only through alkylation. The percentage of proteins overlapping in both the alkylated and the nonalkylated samples was observed to be 46%; so overall, more identifications could be attributed to the duplication of the analysis than to sample alkylation. On the basis of these results, we recommend the use of cysteine alkylation when performing large-scale proteomic analysis; however, repeated “shotgun” analysis will likely generate more unique proteins than a single experiment employing cysteine alkylation alone.

The protein identifications reported here were based solely upon unmodified peptide sequences and did not include searches for any post-translational modifications (e.g., phosphorylation or glycosylation). We assume that a significant number of peptides were not identified due to their modified state. This reservoir of modified peptides potentially holds a large amount of information (additional protein identifications and active site identifications), and their future identification could result in a significant number of additional proteins being identified. This assumption was further strengthened by a limited manual search of the MS/MS data, which showed numerous spectra that correlated well with the parent MS peak but were either not given a score or were scored too low by SEQUEST for further analysis. We anticipate utilizing this large library of MS/MS data in attempts to further search and characterize modification states present in the detected proteins.

The strict use of the SEQUEST screening criterion of ∆Cn > 0.1 eliminated numerous peptides (∼900) that would have increased the number of detected proteins by an estimated 500 (data not shown), but we maintained the ∆Cn cutoff to minimize the number of misidentifications possibly introduced by the large size of the human protein database. Use of the large ABCC nonredundant database proved challenging, both in overcoming its inherent protein reference overlaps and in categorizing and making use of the data after analysis (see Materials and Methods). We utilized the large human database, ∼76 000 entries, in an attempt to be as inclusive as possible of all sequences for peptide identification, but additional effort was required in data analysis to remove redundant identifications.

The utilization of peptide LC NET values as an added criterion for screening MS/MS-identified peptides adds significantly to the confidence of the reported dataset. Table 3 demonstrates the added value of further filtering peptides after SEQUEST parameter screening by employing the NET parameter. We observed a higher percentage of loss among proteins identified using a single peptide (roughly 8% for single peptides observed at least twice and 17% for those observed only once; Table 3) compared to less than 1% loss of multiple-peptide identified proteins. A total of 1672 peptides were removed through the process of applying the NET filter, which corresponded to a decrease of 126 protein identifications.
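The two-stage screening described above, a SEQUEST score cutoff followed by an independent NET check, can be sketched as a pair of filters. This is only an illustration under stated assumptions: the ∆Cn cutoff of 0.1 is quoted in the text, but the NET tolerance of 0.05, the record layout, and all names and toy values below are hypothetical and are not the parameters or data used in the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeptideID:
    protein: str
    sequence: str
    delta_cn: float       # SEQUEST delta-Cn score
    observed_net: float   # normalized elution time measured in the LC run
    predicted_net: float  # NET predicted for the peptide sequence


DELTA_CN_MIN = 0.1    # cutoff quoted in the text
NET_TOLERANCE = 0.05  # hypothetical tolerance, chosen only for illustration


def passes_sequest(p: PeptideID) -> bool:
    return p.delta_cn > DELTA_CN_MIN


def passes_net(p: PeptideID, tolerance: float = NET_TOLERANCE) -> bool:
    return abs(p.observed_net - p.predicted_net) <= tolerance


def proteins(ids: list[PeptideID]) -> set[str]:
    return {p.protein for p in ids}


# Toy identifications (hypothetical values).
ids = [
    PeptideID("P1", "TAYIAK", 0.30, 0.42, 0.40),
    PeptideID("P1", "QISFVK", 0.25, 0.80, 0.55),  # fails NET, but P1 has another peptide
    PeptideID("P2", "LLEDGR", 0.20, 0.70, 0.30),  # single peptide, fails NET
    PeptideID("P3", "AVNSEK", 0.15, 0.61, 0.63),  # single peptide, passes both
    PeptideID("P4", "GGHTLR", 0.05, 0.50, 0.50),  # fails the delta-Cn cutoff
]

after_sequest = [p for p in ids if passes_sequest(p)]
after_net = [p for p in after_sequest if passes_net(p)]
removed_proteins = proteins(after_sequest) - proteins(after_net)
print(sorted(removed_proteins))
```

In this toy set the protein supported by two peptides survives the loss of one of them, while the single-peptide identification that fails the NET check is dropped, which mirrors the pattern of losses reported in Table 3.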
The utilization of the NET criterion independently of the database MS/MS identification software greatly enhances confidence in proteins reported as identified.

Implementation of size exclusion chromatography allowed the cellular lysate to be separated at the intact protein level prior to tryptic digestion, creating a subset of samples based upon protein size. No direct comparison is given in the present study between separations using SEC coupled with SCX versus SCX alone, but we assert that whole-protein separation prior to tryptic digestion is an effective method for the detection of proteins, specifically in those biological systems in which separation of higher abundance proteins would be beneficial. A strong correlation is shown in Figure 3 between separation of proteins by molecular weight and downstream MS/MS identifications. These results demonstrate the utility of such a technique for early protein separation coupled with downstream identification methods.

The accurate mass and time (AMT) tag approach15,17 is being utilized to create a peptide “tag” database of the HMEC data reported here. We also anticipate that information obtained from Fourier transform ion cyclotron resonance (FTICR) mass spectrometry analysis will assist in the quantitation of detected proteins. We believe that the HMEC AMT tag database will be an essential tool in future proteomic studies involving detection of known proteins from this cell line and variants thereof. The further growth and development of such a database will offer a baseline for protein analysis and will assist in the identification and characterization of proteins of interest.

We also observed an under-representation of membrane-associated and extracellular proteins in our set of detected proteins. It is important to better address this area using subcellular fractionated samples in future comparative work. To directly capture outer-surface exposed and membrane proteins, a novel method involving in situ cell surface labeling is being utilized to detect outer-membrane associated proteins in HMEC cultures.29 With the use of various growth conditions and/or treatment with chemical messengers, coupled with the analysis described here, we expect that specific membrane and extracellular proteins involved in differentiating cellular pathways and functions will be elucidated. Future studies will augment the existing HMEC proteome database and could assist in identifying novel cellular markers involved in numerous biological processes, as well as broaden the proteomic baseline for comparative studies involving malignant mammary epithelial cells.

Acknowledgment. This research was supported by the U.S. Department of Energy (DOE), Office of Biological and Environmental Research as well as the National Institutes of Health, through NCI (CA86340). The research was performed at the W.R. Wiley Environmental Molecular Sciences Laboratory (EMSL), a DOE national scientific user facility at the Pacific Northwest National Laboratory (PNNL) which is operated by Battelle Memorial Institute for the U.S. DOE under contract DE-AC06-76RLO 1830.

Supporting Information Available: A table containing all of the unique peptides identified (5838) categorized by the identifying protein. Xcorr values, ∆Cn values, and the number of multiple hits per peptide have been included. This material is available free of charge via the Internet at http://pubs.acs.org.

References

(1) Stover, C. K., et al. Nature 2000, 406, 959-964.
(2) Heidelberg, J. F., et al. Nat. Biotechnol. 2002, 20, 1118-1123.
(3) Goff, S. A., et al. Science 2002, 296, 92-100.
(4) Holt, R. A., et al. Science 2002, 298, 129-149.
(5) White, O., et al. Science 1999, 286, 1571-1577.
(6) Aebersold, R.; Mann, M. Nature 2003, 422, 198-207.
(7) Hubbard, M. J. Proteomics 2002, 2, 1069-1078.
(8) Washburn, M. P.; Wolters, D.; Yates, J. R., 3rd. Nat. Biotechnol. 2001, 19, 242-247.
(9) Wolters, D. A.; Washburn, M. P.; Yates, J. R., 3rd. Anal. Chem. 2001, 73, 5683-5690.
(10) Link, A. J.; Eng, J.; Schieltz, D. M.; Carmack, E.; Mize, G. J.; Morris, D. R.; Garvik, B. M.; Yates, J. R., 3rd. Nat. Biotechnol. 1999, 17, 676-682.
(11) Zhou, H.; Watts, J. D.; Aebersold, R. Nat. Biotechnol. 2001, 19, 375-378.
(12) Gygi, S. P.; Corthals, G. L.; Zhang, Y.; Rochon, Y.; Aebersold, R. Proc. Natl. Acad. Sci. U.S.A. 2000, 97, 9390-9395.
(13) Gygi, S. P.; Rist, B.; Gerber, S. A.; Turecek, F.; Gelb, M. H.; Aebersold, R. Nat. Biotechnol. 1999, 17, 994-999.
(14) Smith, R. D.; Pasa-Tolic, L.; Lipton, M. S.; Jensen, P. K.; Anderson, G. A.; Shen, Y.; Conrads, T. P.; Udseth, H. R.; Harkewicz, R.; Belov, M. E.; Masselon, C.; Veenstra, T. D. Electrophoresis 2001, 22, 1652-1668.
(15) Smith, R. D.; Anderson, G. A.; Lipton, M. S.; Pasa-Tolic, L.; Shen, Y.; Conrads, T. P.; Veenstra, T. D.; Udseth, H. R. Proteomics 2002, 2, 513-523.
(16) Peng, J.; Elias, J. E.; Thoreen, C. C.; Licklider, L. J.; Gygi, S. P. J. Proteome Res. 2003, 2, 43-50.
(17) Lipton, M. S.; Pasa-Tolic, L.; Anderson, G. A.; Anderson, D. J.; Auberry, D. L.; Battista, J. R.; Daly, M. J.; Fredrickson, J.; Hixson, K. K.; Kostandarithes, H.; Masselon, C.; Markillie, L. M.; Moore, R. J.; Romine, M. F.; Shen, Y.; Strittmatter, E.; Tolic, N.; Udseth, H. R.; Venkateswaran, A.; Wong, K. K.; Zhao, R.; Smith, R. D. Proc. Natl. Acad. Sci. U.S.A. 2002, 99, 11049-11054.
(18) Koller, A.; Washburn, M. P.; Lange, B. M.; Andon, N. L.; Deciu, C.; Haynes, P. A.; Hays, L.; Schieltz, D.; Ulaszek, R.; Wei, J.; Wolters, D.; Yates, J. R., 3rd. Proc. Natl. Acad. Sci. U.S.A. 2002, 99, 11969-11974.
(19) Florens, L.; Washburn, M. P.; Raine, J. D.; Anthony, R. M.; Grainger, M.; Haynes, J. D.; Moch, J. K.; Muster, N.; Sacci, J. B.; Tabb, D. L.; Witney, A. A.; Wolters, D.; Wu, Y.; Gardner, M. J.; Holder, A. A.; Sinden, R. E.; Yates, J. R.; Carucci, D. J. Nature 2002, 419, 520-526.
(20) Mawuenyega, K. G.; Kaji, H.; Yamauchi, Y.; Shinkawa, T.; Saito, H.; Taoka, M.; Takahashi, N.; Isobe, T. J. Proteome Res. 2003, 2, 23-35.
(21) Kislinger, T.; Rahman, K.; Radulovic, D.; Cox, B.; Rossant, J.; Emili, A. Mol. Cell. Proteomics 2003, 2, 96-106.
(22) Venter, J. C., et al. Science 2001, 291, 1301-1351.
(23) Yaswen, P.; Stampfer, M. R. Int. J. Biochem. Cell Biol. 2002, 34, 1382-1394.
(24) Stampfer, M. R.; Bartley, J. C. Proc. Natl. Acad. Sci. U.S.A. 1985, 82, 2394-2398.
(25) Band, V.; Sager, R. Proc. Natl. Acad. Sci. U.S.A. 1989, 86, 1249-1253.
(26) Petritis, K.; Kangas, L. J.; Ferguson, P. L.; Anderson, G. A.; Pasa-Tolic, L.; Lipton, M. S.; Auberry, K. J.; Strittmatter, E. F.; Shen, Y.; Zhao, R.; Smith, R. D. Anal. Chem. 2003, 75, 1039-1048.
(27) Taylor, S. W.; Fahy, E.; Zhang, B.; Glenn, G. M.; Warnock, D. E.; Wiley, S.; Murphy, A. N.; Gaucher, S. P.; Capaldi, R. A.; Gibson, B. W.; Ghosh, S. S. Nat. Biotechnol. 2003, 21, 281-286.
(28) Yin, D.; Kuczera, K.; Squier, T. C. Chem. Res. Toxicol. 2000, 13, 103-110.
(29) Chen, W. U.; Li-Rong, Y.; Strittmatter, E. F.; Thrall, B. D.; Camp, D. G., II.; Smith, R. D. Proteomics 2003, 3, 1647-1651.
(30) Shen, Y.; Zhao, R.; Belov, M. E.; Conrads, T. P.; Anderson, G. A.; Tang, K.; Pasa-Tolic, L.; Veenstra, T. D.; Lipton, M. S.; Udseth, H. R.; Smith, R. D. Anal. Chem. 2001, 73, 1766-1775.
(31) Eng, J. K.; McCormack, A. L.; Yates, J. R., 3rd. J. Am. Soc. Mass Spectrom. 1994, 5, 976-989.

PR034062A

Journal of Proteome Research • Vol. 3, No. 1, 2004 75