Characterization of Human Skeletal Muscle Biopsy ... - ACS Publications

Apr 21, 2009 - We characterized the human muscle proteome by studying muscle biopsy specimens through four different workflows, using 1 or 2D peptide ...
Characterization of Human Skeletal Muscle Biopsy Samples Using Shotgun Proteomics† Kenneth C. Parker,*,‡,§,|,⊥ Ronan J. Walsh,§,| Mohammad Salajegheh,§,| Anthony A. Amato,| Bryan Krastins,‡,# David A. Sarracino,‡,# and Steven A. Greenberg§,| Harvard Partners Center for Genetics and Genomics, Cambridge, MA, Department of Neurology, Brigham and Women’s Hospital, Harvard Medical School, and Informatics Program, Children’s Hospital Boston, Harvard Medical School, Boston, MA Received October 17, 2008

We characterized the human muscle proteome by studying muscle biopsy specimens through four different workflows, using 1 or 2D peptide separation, SDS gels, or differential solubilization. By performing MS/MS analyses of 178 4-h LC separations derived from 31 patients, we identified more than 2000 proteins, and determined how 370 very abundant proteins behave upon differential solubilization. The resulting semiquantitative database should serve as a resource for muscle biochemistry. Keywords: muscle • biopsy • shotgun • abundance • protein correlation profiling • sarcomere • workflow comparison

* To whom correspondence should be addressed. E-mail: kenneth.aprker@virgininstruments.com.
† Originally submitted and accepted as part of the “Tissue Proteomics and Metabolomics” special section, published in the April 2009 issue of J. Proteome Res. (Vol. 8, No. 4).
‡ Harvard Partners Center for Genetics and Genomics.
§ Brigham and Women’s Hospital, Harvard Medical School.
| Children’s Hospital Boston, Harvard Medical School.
⊥ Current affiliation: Virgin Instruments, Sudbury, MA.
# Current affiliation: Thermo-Fisher, Cambridge, MA.
10.1021/pr800873q CCC: $40.75 © 2009 American Chemical Society
Journal of Proteome Research 2009, 8, 3265–3277 (research articles). Published on Web 04/21/2009

Introduction

Shotgun proteomics technology has been developed as a means of determining relative abundances of all of the proteins among a set of samples.1 Unlike in 2-dimensional electrophoresis (2D gels), nearly all of the proteins in a sample are amenable to analysis regardless of their size or solubility. For the study of human muscle biopsy samples, this is an important consideration because the different forms of some sarcomeric proteins like myosin are so abundant compared to many interesting proteins that they significantly distort 2D gels, and other sarcomeric proteins like titin are so large that they cannot be readily separated using the same electrophoretic conditions that are appropriate for most other proteins.2 Moreover, some isoforms of proteins like collagen remain insoluble in the presence of chaotropes or denaturing detergents.3

In the simplest form of shotgun proteomics, the sample is digested using specific proteases (usually trypsin), and the peptides are then separated and analyzed by mass spectrometry. The major disadvantage of this approach is that nearly all protein isoform-specific information is lost. In some cases, peptides are identified that can equally well be assigned to proteins that are produced by multiple, distinct genes (see ref 4 for a discussion of the implications of this problem).

Recently, it has been shown that shotgun proteomics is not only useful for protein identification, but can also be used to estimate relative protein abundance.5 This is of particular interest in muscle proteomics, because much of muscle tissue consists of repeating units of an exceedingly large protein complex, the sarcomere (see refs 6 and 7 for recent reviews). One of the common features of MS/MS mass spectrometry using electrospray is the problem of random sampling. In any one separation, due to the high complexity of biological samples, the mass spectrometer has time to select for fragmentation only a small subset of the peptides that have sufficient abundance for detection.8,9 By performing repeated analyses of muscle biopsies derived from 31 different individuals with several different muscle diseases, most abundant precursors have been sampled. By this means, we found that in one disease, inclusion body myositis (IBM), there was a loss of muscle protein isoforms predominant in fast-twitch muscle fibers with a corresponding gain of slow-twitch muscle fiber proteins that had not previously been recognized.10

In this paper, we compare the data obtained from 1-dimensional peptide separations to the much greater amount of data obtained using 3 more extensive workflows on 2-5 samples. The first workflow, which was used to differentiate IBM from other inflammatory muscle diseases, consists of direct tryptic digests of whole muscle biopsy tissue lysates followed by LC-MS/MS. The second workflow started with a similar tryptic digest preparation, but peptides were then separated by cation exchange chromatography prior to LC-MS/MS. In the third workflow, 18 gel slices were analyzed from the soluble compartment only. This workflow is rather similar to that used recently to study the muscle proteome by the use of 1D SDS gels only.11 In the final workflow, 3 gel slices were analyzed from both the nonionic detergent soluble compartment and the corresponding insoluble compartment.

We have used shotgun proteomics technology to determine which proteins are the most abundant in muscle biopsies based on all of these techniques. Additionally, the use of shotgun proteomics together with limited fractionation on four different samples enables us to use protein correlation profiling12 to propose a list of candidates for the major structural components of the sarcomere.

Methods and Materials

Patients. Muscle biopsies were performed for clinical diagnostic purposes on patients with a range of muscle diseases according to standard clinical criteria. All patients with inflammatory myopathies additionally met research criteria as previously described.13 Patients provided informed consent and institutional review board approval was obtained for research studies. The following experiments were carried out over a period of ∼9 months. Table S1 lists the patients and their diagnoses (see Supporting Information).

Workflow A: Direct Sample Lysis. Nine 20 µm slices of frozen muscle tissue from muscle biopsy samples were collected using a cryostat in a microcentrifuge tube at -20 °C, resulting in a yield of 300 µg of biopsy wet weight. Four hundred microliters of buffer containing 1% SDS, 10 mM DTT and 50 mM triethanolamine bicarbonate pH 8.0 (TEAB) was added to the muscle tissue, and the mixture was boiled at 100 °C for 20 min. Protein content was assayed by removing 1 µL for analysis with the Pierce bicinchoninic acid (BCA) protein assay kit. Muscle lysate was reduced with an additional 10 mM DTT, and then alkylated with 50 mM iodoacetamide. Following acetone precipitation, the pellet was redissolved in 20 mM TEAB. Porcine trypsin (Promega) was added (1:50 by weight) and the digest was incubated overnight at 37 °C. This procedure was followed over a period of several months on biopsies from 27 different individuals.

Workflow B: 2-Dimensional Peptide Separation. Muscle biopsies from three individuals were homogenized using a scalpel and macerated using glass beads in a microcentrifuge pestle. Samples (500 µg) were then reduced, alkylated and digested with trypsin as for Workflow A. Samples were separated into 12 fractions with an AKTA HPLC system with a cation exchange column equilibrated in 5% isopropanol. Peptides were fractionated by a salt gradient from 0 to 50 mM NaCl in 0.1% TFA. The samples were briefly dissolved in 0.1% TFA containing 5% urea to promote solubilization. The OD 280 of each fraction was used to estimate the appropriate peptide concentration for subsequent electrospray analysis.

Workflows C and D: Differential Solubilization and SDS-PAGE. About 1 mg of muscle tissue from 6 individuals was collected as for Workflow B, except the tissue was resuspended in 20 mM Tris pH 8.0 containing 0.1% Triton X-100. The samples were centrifuged for 10 min at 14 000 rpm in a microcentrifuge and the supernatant was removed to a separate tube. SDS (1%) in 20 mM Tris pH 8.0 was then added to both the supernatant and the pellet fractions, followed by reduction with DTT and alkylation with iodoacetic acid.

SDS Gel Electrophoresis. For Workflow C, about 500 µg of protein was loaded onto 9 lanes of a 10% Laemmli SDS gel (Invitrogen). Alternatively, for Workflow D, about 50 µg of supernatant and also 50 µg of the corresponding pellet were loaded onto an SDS gel. The gels were stained with simply Coomassie brilliant blue (Invitrogen). In Workflow C each sample was split into 18 slices that spanned the entire separating gel, and the corresponding pellet was not analyzed. In Workflow D, the myosin and actin migration positions were used to direct the slicing of the gel. From both the supernatant and pellet fractions, three slices were analyzed: the region above myosin, between actin and myosin, and below actin. Standard gel extraction and porcine trypsin (Promega) digestion were performed on each slice.14

Sample Cleanup. For Workflows A and B, sample cleanup was performed with VYDAC Solid phase Extraction (SPE) C18 13 µm columns. Samples were acidified after digestion with 0.5% trifluoroacetic acid (TFA). Cartridges were conditioned with 2 mL of 100% acetonitrile (ACN), followed by 3 mL of 0.25% TFA (v/v). Samples were loaded at 2 mL/min followed by rinsing with 5 mL of 0.25% TFA. Samples were eluted in 200 µL of 75% ACN, 0.1% formic acid, and dried.

Mass Spectrometry. Samples (∼1-20 µg) were resuspended in 150 µL of 5% ACN, 0.25% formic acid, and 10 µL was injected using a Famos Autosampler onto a 75 µm × 25 cm fused silica capillary column packed with C18 media (Michrom, Magic C18 beads). Samples were separated at ∼200 µL/min using a 100:1 split with a 190 min gradient from 5% ACN/0.35% formic acid to 30% ACN/0.35% formic acid, and data were collected using a Finnigan LTQ-FT. Details regarding the mass spectrometry acquisition parameters varied across the projects. Pertinent settings are listed in Table S2 (Supporting Information). Typically, 10 000 spectra of sufficient quality for database searching were collected.

Database Searching. A Visual C++ program (Jethro, written by Thomas Patterson) was used to generate Mascot mgf files containing header information about each MS/MS spectrum, and the peak list data. Using VBA scripts, the header information was loaded into the Spectrum table in Access, linked to the scan number and mgf file name. Thus the spectrum table has an entry for each MS/MS spectrum whether or not it was identified. Originally the spectra were searched with Sequest and Mascot, and some results from these searches were used to filter out known false identifications, as described below.
To estimate the false positive rate, the peak lists for each raw file were separately searched using the Tandem search engine15 with a 20 ppm parent mass tolerance (50 ppm for project D1),and a 1.1 amu fragment mass tolerance. Cysteine alkylation was a fixed modification, and modifications of oxidized methionine, n-terminal pyroglutamic acid or cyclocarbamidomethyl cysteine, and n-terminal protein acetylation were considered. Semitryptic peptides were also considered during the refinement stage of this search process, which matches additional peptides to the proteins that had been identified with confidence. The primary database was a human IPI database (release v3.18) with added common contaminant proteins like trypsin. This database also contained reversed sequences for each entry so that the number of false positives could be assessed.16 A consolidated peptide table was generated that contained the best sequence identified from each spectrum from all 178 HPLC runs. The table was filtered so as to contain only sequences that were supported by a spectrum that had a maximal Tandem expectation value of -1, a minimal MatSc17 of 5000 and a minimal peptide mass of 800 amu. For this purpose, doubly and triply charged fragment ions were sought for peptide fragments containing at least 10 or 15 aa, or if there was an internal lysine or arginine residue. The ion series considered were the following: y, b, a, y-17 and b-17, although -18 neutral losses were found to be common. At this point there were many false-positive identifications, evident because they were associated with reversed sequence accession numbers. All isoleucine residues were converted to leucine residues to disallow meaningless duplications. The list of peptides was


Characterization of Human Skeletal Muscle Biopsy Samples then remapped (using string matching in Visual Basic) against an all leucine form of the IPI database linked to Entrez gene ID numbers and gene symbols to obtain a table that contained each peptide linked to as many distinct database entries as possible. This list was grouped by the combination of sequence and Entrez gene ID, and then each sequence was assigned to the gene to which the largest number of distinct peptides could be mapped, in a fashion similar to Protein Prophet.4 This was accomplished by means of a series of SQL queries, yielding 3620 proteins, including 447 reversed database hits. Finally, the gene list was filtered to include only genes supported by more than 1 peptide. This final constraint had the effect of reducing the protein list to 2134 proteins, including 14 reversed database hits. The orginal database searching had been carried out on selected files using Sequest or Mascot on different input FASTA files, for example, human SwissProt, and various NCBI databases, using a wide variety of different criteria, including wider parent tolerance. A special database consisting of collagen gene sequences was also searched for hydroxyproline-containing peptides. The Spectrum table has a field that designates which peptide sequence is the best match. To resolve discrepant identifications made by these multiple rounds of searches, the fit of each MS/MS spectrum to the proposed sequence was assessed by calculating MatSc for each proposed sequence.17 In most cases, MatSc selected what appeared to be the correct identification when there were several proposed sequences from different searches, based on similarity to higher quality unambiguous identifications that had similar precursor masses and retention times. For the tabulations in this manuscript, no additional peptide or protein identifications were allowed. 
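The peptide-to-gene grouping described above can be illustrated with a short sketch. This is not the authors' Access/SQL implementation: the function names, the in-memory input format, and the toy peptides are all hypothetical, and the tie-breaking rule (lowest gene ID) is an assumption made only to keep the example deterministic. It converts isoleucine to leucine, assigns each peptide to the candidate gene supported by the largest number of distinct peptides, and keeps only genes supported by more than one peptide.

```python
from collections import defaultdict

def normalize(seq):
    """Treat isoleucine and leucine as identical, as in the text."""
    return seq.replace("I", "L")

def assign_peptides(peptide_to_genes, min_peptides=2):
    """Assign each peptide to the candidate gene with the most distinct peptides.

    peptide_to_genes: dict mapping peptide sequence -> iterable of gene IDs
    (hypothetical input; the paper obtained it by string matching against IPI).
    Returns dict gene -> sorted peptides, keeping genes with >= min_peptides.
    """
    # Merge peptides that differ only by I/L.
    candidates = defaultdict(set)
    for seq, genes in peptide_to_genes.items():
        candidates[normalize(seq)].update(genes)

    # Count the distinct peptides that could map to each gene.
    support = defaultdict(set)
    for seq, genes in candidates.items():
        for gene in genes:
            support[gene].add(seq)

    # Each peptide goes to its best-supported gene (ties: lowest gene ID).
    assigned = defaultdict(list)
    for seq, genes in candidates.items():
        best = min(genes, key=lambda g: (-len(support[g]), g))
        assigned[best].append(seq)

    return {g: sorted(p) for g, p in assigned.items() if len(p) >= min_peptides}

# Invented toy data: "AIK" and "ALK" collapse to one peptide shared by G1 and G2.
toy = {
    "AIK": ["G1", "G2"],
    "ALK": ["G1"],
    "CDEK": ["G1"],
    "QRSK": ["G1"],
    "FGHK": ["G2"],
    "MNPK": ["G3"],
}
result = assign_peptides(toy)
```

In this toy case the shared peptide is given to G1, which has the most supporting peptides, and G2 and G3 are dropped because each is left with a single peptide.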
The results from these additional searches were used solely to eliminate proteins that would have otherwise passed the 2 peptide requirement. By this process, some matches to either forward or reversed database sequences were reassigned to peptides from compelling proteins. No effort was made to eliminate reversed database matches by this means, although examination of reversed database hits indicated that they often could be readily assigned to modified or unmodified peptides from abundant proteins. Prior to collecting the statistics on the number of identifications, peptides, and proteins, all spectra that mapped to certain proteins were purged from the database. These proteins included porcine trypsin, all human keratins and hornerin, which probably derived from human skin-related contamination especially during SDS gel slice processing, and occasional nonhuman contaminant proteins. This filtered protein list consisted of 2095 proteins, including 8 reversed database hits. The spectrum database presented here consists only of spectra that map to proteins that pass the twopeptide rule. Thus, all alternatively spliced forms (and polymorphic variants when they could be identified as such) together constitute one protein. The original ipi accession numbers were retained for each sequence, but were not used for protein counting purposes. In some instances (for example, tropomyosins TPM1-TPM4), there is ample evidence for multiple splice variants. Thus two distinct stages of peptide-to-protein mapping were performed: peptide sequence to protein sequence, and protein sequence to Entrez gene ID/gene symbol. When peptides could be mapped to multiple genes, the number of genes to which they could be mapped is listed by the parameter “#genes” in the peptide table. Throughout the discussions that follow, because peptides and proteins were detected, we refer

Figure 1. Graphical description of Workflows A, B, C and D. Workflow A results in a single HPLC run per sample. In contrast, Workflow B results in 12 runs, Workflow C results in 18 runs, and Workflow D results in 6 runs. In Workflow D, the top, middle, and bottom SDS gel slices were selected based on the position of myosin and actin. The myosin and actin bands themselves were not analyzed from the pellet.

to the counting statistics in terms of proteins, but all protein isoforms have been grouped and counted based on the minimal number of genes that encode them.

K Means Clustering. There were 370 proteins identified from project D2 based on at least 10 spectra overall. The number of spectra per protein identified from each individual in either the soluble or insoluble fraction from both project D1 and D2 was then tabulated. The percentage of spectra identified from each of the 8 compartments (soluble and insoluble for biopsies from four individuals) was calculated. The percentage data were analyzed using K Means clustering in Spotfire allowing for 6 clusters, labeled K1-K6, thereby grouping together proteins that had the same distribution across patients and with regard to solubility.
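The profile clustering just described can be sketched as follows. The paper used Spotfire; the plain Lloyd's algorithm below, with the first k points as seeds, is a stand-in chosen for reproducibility and is not the Spotfire implementation. The protein names and spectral counts are invented, and the example uses 2 clusters rather than 6 only to keep it short.

```python
def to_percent(counts):
    """Convert raw spectral counts into percentages of the protein's total.

    Assumes the total is nonzero (the paper required at least 10 spectra).
    """
    total = sum(counts)
    return [100.0 * c / total for c in counts]

def sq_dist(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=50):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every profile.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        # Update step: mean of the members (keep old centroid if empty).
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Invented counts: 4 individuals x (soluble, insoluble) = 8 compartments.
counts = {
    "soluble_like_1":   [40, 2, 38, 1, 45, 3, 41, 2],
    "insoluble_like_1": [1, 30, 2, 28, 1, 33, 2, 31],
    "soluble_like_2":   [20, 1, 22, 0, 19, 1, 21, 1],
    "insoluble_like_2": [0, 12, 1, 11, 0, 13, 1, 12],
}
names = list(counts)
profiles = [to_percent(counts[n]) for n in names]
labels = dict(zip(names, kmeans(profiles, k=2)))
```

Working on percentages rather than raw counts means that an abundant and a scarce protein with the same soluble/insoluble distribution end up in the same cluster.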

Results Overall Data Organization. Four different workflows (A-D) were tested to investigate the human muscle proteome (Figure 1). We studied 31 muscle biopsy samples from patients with a range of diseases and from patients that did not have a muscle disorder by clinical and pathological criteria, resulting in 178 4-h HPLC runs. These four workflows were used in 6 different sets of experiments (projects A1, A2, B, C, D1 and D2), generating 1 861 361 spectra of suitable quality for database searching, resulting in 321 669 identifications, 26 857 distinct peptides and 2095 distinct proteins, 8 of which derived from the reversed sequence database. Therefore, we estimate that about 2070 of these correspond to true positive identifications. Most of the specific data that was gathered from these experiments is documented in the Supplementary Tables. These data are much more useful as Excel files that can be filtered and sorted, and these tables are available upon request. Not all of the information in these tables is mentioned in the discussion, but the description for each field is listed in the table legends. The data for the positively identified spectra are housed in Table S3 (peptides), Table S4 (proteins) and Table S5 (HPLC runs) (see Supporting Information). In addition, a MS-Access database named Parker-Muscle_Proteomics.mdb houses these tables, as well as the spectra table. These data are available at https://proteomecommons.org/tranche/ by searching for muscle proteomics. Journal of Proteome Research • Vol. 8, No. 7, 2009 3267
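The reversed-database counts reported in the text can be turned into a rough error estimate. The sketch below assumes a common convention that the paper does not state explicitly: a false match is equally likely to hit a forward or a reversed entry, so the number of reversed hits estimates the number of false forward hits. The function name is hypothetical. For the final list this convention gives roughly 2079 true positives, of the same order as the "about 2070" quoted above; the exact rule the authors used to reach their figure is not given.

```python
def decoy_summary(total_hits, reversed_hits):
    """Estimate true positives from a concatenated forward+reversed search.

    Assumption (not from the paper): false matches hit forward and reversed
    entries equally often, so reversed_hits estimates the false forward hits.
    """
    forward = total_hits - reversed_hits
    est_false_forward = reversed_hits
    est_true = forward - est_false_forward
    fdr = est_false_forward / forward
    return forward, est_true, fdr

# Protein counts quoted in the text at three filtering stages.
stages = [
    ("before two-peptide filter", 3620, 447),
    ("after two-peptide filter", 2134, 14),
    ("after contaminant purge", 2095, 8),
]
for label, total, rev in stages:
    forward, est_true, fdr = decoy_summary(total, rev)
    print(f"{label}: forward={forward}, est. true={est_true}, FDR={fdr:.2%}")
```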


Figure 2. Venn diagram of proteins identified as a function of Workflow: 2095 proteins (8 reversed), and 26 857 (16 reversed) peptides were identified across all experiments. Six-hundred seventy-nine proteins and 2867 peptides were identified in each workflow. WfA indicates Workflow A. AD3 indicates that 3 proteins were identified in both Workflow A and Workflow D, but not in B or C. The second number indicates the number of reversed database hits. If there is no second number, then no reversed hits were found.

Project A1. Workflow A consists of the simplest possible approach: a 1D peptide separation starting from a tryptic digest of a muscle biopsy homogenate. We studied the reproducibility of the workflow by preparing samples from 4 adjacent sets of slices from a single biopsy. Each set of slices was digested separately, then each preparation was analyzed by MS/MS at 3 different concentrations, corresponding to 5, 10, or 20 µg of protein. From all 12 runs, 485 distinct proteins were identified based on 4584 distinct peptides (Figure 2). Table S6 (Supporting Information) contains columns that list the average number of identifications and the standard deviation and standard error in the spectral counts for these proteins. On average, 276 proteins were detected in each of 12 runs. The 4-fold concentration difference had only a minor impact on protein identification, but the higher sample loading did adversely impact chromatographic separation (data not shown). For the top 45 proteins, 10 or more spectra were identified, and the average standard error (100 × standard deviation/average) for these 45 was 19.9%. Four or more peptides were identified for the next 46 most abundant proteins, and the standard error increased to 34.5%. Thus we conclude that for the most abundant proteins reproducibility is acceptable.

Project A2. As the 1-dimensional approach appeared to be robust, we tested a panel of biopsies from 27 individuals. Differences were found in many protein levels, some of which correlated with disease. Certain proteins were identified based on more than 3 peptides from one or a few individuals, but were undetectable in most individuals. The details and implications of these data are presented elsewhere.10 In this paper, the primary emphasis is on how the peptide results compare to the other workflows. Some of the samples in these experiments were injected 3-5 times, increasing the number of proteins identified using Workflow A, resulting in 47 HPLC runs altogether.
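The reproducibility measure used for Project A1, 100 × standard deviation/average of the spectral counts, is what is more often called the relative standard deviation or coefficient of variation. A minimal sketch follows; the function name and the run counts are invented, and the use of the sample (n - 1) standard deviation is an assumption, since the text does not say which form was used.

```python
from statistics import mean, stdev

def percent_rsd(counts):
    """100 x sample standard deviation / mean of spectral counts across runs."""
    return 100.0 * stdev(counts) / mean(counts)

# Invented spectral counts for one protein across 12 replicate runs.
runs = [52, 47, 60, 55, 49, 58, 51, 46, 57, 53, 50, 56]
rsd = percent_rsd(runs)
print(f"relative standard deviation: {rsd:.1f}%")
```

Averaging this value over a group of proteins, as done in the text for the top 45 proteins, gives a single figure for how reproducible spectral counting is at that abundance level.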
Project B. This project was a standard 2D peptide separation experiment (cation exchange and reversed phase), which is similar in concept to MudPit.18 Thus the peptides could be derived from any of the proteins in the biopsy, but more proteins (on average 1226) are identified than by the 1D peptide separation approach (on average 276). Samples deriving from 3 different individuals, 2 of whom had inflammatory myopathies, were measured by this technique, accounting for 36 LC-MS runs. Table S7 (Supporting Information) lists the number of spectra, identifications, distinct peptides, and proteins for each IEX fraction. Also shown are the results from combining these results by individual and across all three individuals. From all 3 individuals together, 363 838 spectra were searched with 43 377 of suitable quality for identification of 11 655 peptides and 1378 proteins, of which 455 were defined by a single peptide within Project B. All protein identifications are supported by one additional peptide from some other Workflow. In IEX fractions 2-9, which contained the bulk of the peptides, between 621 and 1474 peptides were identified per fraction (Table S7, Supporting Information). Typically, there was about 53-63% agreement at the peptide level, defined as the percentage of peptides identified in one fraction compared to the union of identifications from all 3 individuals for the same fraction. There was a consistently higher level of agreement (58-73%) at the protein level for each fraction. Moreover, there was slightly more peptide agreement (57-67%) and substantially more protein agreement (69-84%) when the identifications from all 12 fractions were combined, because some peptides or proteins that were identified in one IEX fraction were found in a different (usually adjacent) IEX fraction from a different individual. Table S7 (Supporting Information) also reports what percentage of the peptides identified in each fraction were found in all three individuals (the intersection).

Project C. This project was designed to maximize protein identification from the soluble compartment, using SDS gel electrophoresis to separate proteins prior to digestion (Workflow C). Muscle slices deriving from 2 individuals were solubilized in nonionic detergent-containing solution and centrifuged, leaving behind an unstudied pellet fraction containing many of the sarcomeric proteins.
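The agreement and intersection percentages quoted for Project B can be expressed with set operations. The sketch below is illustrative only: the function names and peptide labels are invented, and taking the union as the denominator for the intersection percentage is an assumption, since the text defines only the agreement measure explicitly.

```python
def agreement_percent(per_individual):
    """Agreement per individual: 100 x |own identifications| / |union|."""
    sets = {name: set(ids) for name, ids in per_individual.items()}
    union = set().union(*sets.values())
    return {name: 100.0 * len(ids) / len(union) for name, ids in sets.items()}

def intersection_percent(per_individual):
    """Share of the union identified in every individual (assumed denominator)."""
    sets = [set(ids) for ids in per_individual.values()]
    union = set().union(*sets)
    common = set.intersection(*sets)
    return 100.0 * len(common) / len(union)

# Invented peptide identifications for one IEX fraction from 3 individuals.
fraction = {
    "P1": {"a", "b", "c"},
    "P2": {"b", "c", "d"},
    "P3": {"c", "d", "e", "f"},
}
per_person = agreement_percent(fraction)
shared = intersection_percent(fraction)
```

Running the same functions on identifications pooled over all fractions corresponds to the combined comparison described in the text, where agreement rises because a peptide may elute in adjacent fractions in different individuals.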
Table S8 (Supporting Information) lists the protein and peptide composition obtained from each slice, from each individual, as well as from both slices of the same molecular weight range, numbered starting from the top of the gel. Between 98 and 434 proteins were identified per slice per individual. The slice with the smallest number of identifications had large amounts of hemoglobin, which is the most abundant protein in the soluble compartment in these muscle biopsy samples. As in Workflow B, some of the shared peptides and proteins were not identified from the same slice from both individuals, probably because the slices were not cut in exactly the same place. Thus, protein agreement between corresponding slices averaged 77% (Table S8, Supporting Information), while peptide agreement averaged 68%. As in Project B, the overall protein and peptide composition was more similar, with ∼93% agreement at the protein level, and ∼77% agreement at the peptide level. On average, 56% of the proteins identified in one slice (intersection %) were identified from both individuals. Table S9 (Supporting Information) lists how many spectra from each protein were identified as a function of slice, sorted with the longest protein at the top. It is apparent that most proteins were identified most strongly at an appropriate SDS gel position. Very abundant proteins like hemoglobin (HBA1, HBB) were most prominent at the bottom of the gel where they belonged, but were also readily detectable in every slice.

Project D. The final Workflow we tested was intended to determine whether a smaller number of SDS gel slices coupled with differential solubilization would result in more protein identifications at lower expense. A specific goal was to identify proteins other than actin and myosin; therefore the SDS gel slices containing these proteins were omitted, and the extracts from the three regions demarcated by these proteins were pooled. The experiment was performed in duplicate and for the second analysis a larger quantity of the digest was injected onto the HPLC. The larger sample load resulted in an increase in protein identifications and because of this Project D2 was examined in more detail as described below. The results from both Project D1 and Project D2 were analyzed so that the issue of reproducibility of identification could be addressed, resulting in 820 and 1024 protein identifications, respectively.

Table 1. Identification Statistics for Project D2

region        # samples  type       average # proteins  average # peptides  average # ids  total # spectra
high          4          insoluble  54                  1409                1953           16512
high          4          soluble    182                 1088                1233           10541
intermediate  4          insoluble  170                 614                 808            9608
intermediate  4          soluble    256                 793                 1065           11084
low           4          insoluble  126                 769                 1024           7120
low           4          soluble    261                 1120                1427           7971
sum           4          all 6      1049                5793                7510           62836
union         4          all 6      677                 4473

Table 2. Reproducibility of Protein and Peptide IDs Across Projects

Protein
project  a1    a2    b     c     d1    d2
a1       485
a2       441   902
b        468   804   1378
c        467   842   1288  1982
d1       418   618   738   786   820
d2       437   694   861   970   738   1024
union of all 6: 2095    intersection: 375    % in union: 17.9

Peptide
project  a1    a2    b      c      d1    d2
a1       4584
a2       3498  7036
b        3330  4278  11655
c        3225  4594  5574   17547
d1       2551  3189  3501   4356   5901
d2       2892  4032  4333   5677   4358  8137
union of all 6: 26857    intersection: 1513    % in union: 5.6

In general, there were fewer proteins identified in the insoluble compartment, especially in the region above myosin, where there were on average 54 insoluble proteins identified vs 182 soluble ones (Table 1). Overall, there were on average 677 proteins identified from each of the four individuals, whereas 1049 proteins would have been expected by summing the number of identifications from the 6 compartments, indicating that many proteins were detected in multiple compartments. Another way to analyze the data is to pool the peptide and protein identifications from these gel slices before scrutinizing the data. 370 proteins from which at least 10 identifications had been made are listed in Table S10 (Supporting Information). The table has been sorted by name so that related proteins could more easily be compared. Unlike all of the other tables in this paper (including Table 1), the data in Table S10 (Supporting Information) derive from the original Sequest and Mascot searches, and therefore the requirements for identification are slightly different than in the other tables, and there are minor differences in mapping between peptides and proteins. It was apparent from studying Table S10 (Supporting Information) that some proteins were distributed in different ways between the eight channels derived from the four individuals and the soluble vs insoluble compartments. In order to determine which proteins had similar distribution profiles, the percentage of the total identifications in each of the eight compartments was calculated, and K means clustering was performed to group the proteins into 6 nonoverlapping categories. The results from this classification process are discussed below.

Discussion Using the 4 Workflows described, the most abundant proteins from muscle biopsies have been identified. Table 2, and Table S11 (Supporting Information) with additional details, list how many proteins and how many peptides were identified as function of project (in yellow), and also how many of each were shared. In these tables, the identifications from all of the runs within the project were pooled together. The smallest number of proteins was identified from Project A1. Theoretically, there might have been evidence of differential protein expression between adjacent tissue slices due to changes in cellular composition, but no differences of this sort were apparent. It is possible that such differences would become evident if opposite sides of the same biopsy were examined (separated by mm rather than by tens of micrometers), or different biopsies from the same individual. In contrast, when the samples derived from different individuals with different levels of disease, differences were readily apparent, and some of them correlated with disease status.10 The number of proteins identified as a function of HPLC run is plotted in Figure 3A, starting from the workflow in which the fewest Journal of Proteome Research • Vol. 8, No. 7, 2009 3269


Figure 3. (A) Number of proteins identified as a function of cumulative HPLC runs is shown. The runs were ordered starting from the run in which the most proteins were identified to the fewest proteins within each project. The project order was C, B, D2, D1, A2, A1. (B) Same, except runs were ordered starting from the fewest proteins to the most proteins. The project order was A1, A2, D1, D2, B, C.

number of proteins were identified (Workflow A1) to the workflow in which the greatest number were identified (Workflow C). In Figure 3B, the number of proteins identified as a function of HPLC run is plotted in the opposite order to Figure 3A. The trend line suggests that the number of additional proteins detected in Project A1 is just beginning to increase less rapidly after 12 runs. All of the proteins identified in Project A1 were also detected in at least one of the other experiments, which involved more extensive fractionation. Moreover, at least 86% of the proteins identified in Project A1 were identified in each of the other five projects (see Figure 2A). Using the same 1D peptide separation approach based on samples derived from multiple patients, accounting for 47 LC-MS/MS runs, 902 distinct proteins were identified. This corresponds to 43% of the proteins across all experiments. As with Project A1, most of these proteins were also identified in Project B (89%) or in Project C (93%). However, about 1/3 (476 of 1378) of the proteins identified by 2D peptide separations, and about 1/2 (1084 of 2029) of the proteins identified by SDS gel separation of the supernatant appear to be beyond the 3270

Journal of Proteome Research • Vol. 8, No. 7, 2009

Parker et al. range of sampling in one-dimensional analyses, even after 50 injections. One of the reasons that as many as 902 proteins were identified from Project A2 alone is that the samples derive from individuals with different inflammatory myopathies. Because each muscle biopsy is heterogeneous with regard to tissue composition, and especially so when there is inflammation, a larger number of proteins have probably been identified than would have resulted from 47 technical replicates derived from the same healthy muscle biopsy preparation. In both Project B and C, more of the same proteins were identified from each individual tested when all of the slices or fractions were combined than when individual fractions or slices were compared. Moreover, the conservation was greater at the protein level than at the peptide level. We found it difficult to quantify using ion current methods, because many the peaks for many precursors that were identified easily by MS/MS were lost in the noise, as has been found by many others.5 Substantially more proteins were identified by the SDS gel approach (Workflow C) from two individuals (1982 proteins) than by the 2D peptide separation approach (Workflow B) from 3 individuals (1378 proteins). One reason for this is that there is a higher degree of fractionation (18 vs 12). Second, many abundant, large sarcomeric proteins were depleted by removal of the nonionic detergent insoluble pellet. Third, each soluble protein localizes predominantly to a few bands and therefore should not interfere with protein identifications in distant bands. In spite of this expectation, some abundant proteins like hemoglobin were easily identifiable in every slice, though hemoglobin was much more abundant at the bottom of the gel where it should migrate under these conditions. 
As in Workflow B, there were tantalizing differences between the two individuals regarding protein expression levels, but we cannot tell whether these differences would be reproducible in disease, or are simply preparation-specific differences that may be based on the cellular composition of the biopsy. While we were performing these experiments, a publication reported advances in the number of identifications that have been obtained from human muscle biopsies.11 The authors report identification of 954 proteins from three healthy patients starting from 20-24 slices of an SDS gel using an HPLC fractionation and MS/MS detection strategy very similar to what we describe here. Although they did not count proteins according to the number of genes that encode them as we have done, we calculate from the peptide sequences they reported that counting by gene would have yielded about 920 proteins. All but 44 of these correspond to proteins that we identified, and we found 5655 of the 6925 peptides they reported. Some of the proteins they identified that we did not find are from gene families related to other proteins that both studies identified. See Table S4 (Supporting Information) for comparisons of distinct peptide sequences from their paper corresponding to proteins we identified. In conclusion, the proteins we identified are consistent with their findings.
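The proportions quoted in this part of the Discussion can be recomputed directly from the counts given in the text. The following is a minimal sketch for illustration only; the function name `percent` and the labels are our own and are not part of the original analysis.

```python
# Counts quoted in the Discussion: (part, whole, description).
quoted = [
    (476, 1378, "2D-separation proteins beyond the range of 1D analyses"),
    (1084, 2029, "SDS-gel proteins beyond the range of 1D analyses"),
    (5655, 6925, "peptides reported in ref 11 that were also found here"),
]


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


for part, whole, label in quoted:
    print(f"{label}: {part}/{whole} = {percent(part, whole)}%")
```

The first two values (34.5% and 53.4%) correspond to the "about 1/3" and "about 1/2" quoted in the text.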

Workflow Efficiency

The direct digest 1D peptide separation approach (Workflow A) in most cases led to the identification of about 275-300 proteins, vs 1817 for Workflow C (Table 3), but at 1/18th the mass spectrometer analysis time, and with fewer chances for unintended irregularities of differential solubilization. Project B resulted in about 1000 protein identifications. The attempt to maximize identifications while reducing analysis time in Project D was only partially successful; in Project D2, 677 proteins on average were identified, at six times the expense (6 runs per sample rather than 1). The average number of proteins divided by the number of HPLC separation runs was remarkably constant at 82-112 proteins for projects B-D2. These results indicate that the limiting factor in protein identification is in fact the mass spectrometer analysis time, and that a front-end protein separation strategy that gets beyond this roadblock is difficult to achieve. Thus, there seem to be diminishing returns in efficiency with an increased amount of separation to plumb deeper into the proteome, as others have found.19

Table 3. Statistics of Identifications per Run per Project

| project | # patients | runs per sample | runs per project | proteins per sample | proteins per run | peptides per sample | peptides per run | IDs per sample | IDs per run |
|---|---|---|---|---|---|---|---|---|---|
| A1 | 1 | 1 | 12 | 276 | 276.0 | 1780 | 1780.0 | 4682 | 4682.0 |
| A2 | 23 | 1 | 47 | 308 | 308.0 | 1582 | 1582.0 | 3804 | 3894.0 |
| B | 3 | 12 | 36 | 1022 | 85.2 | 7280 | 606.7 | 14459 | 1204.9 |
| C | 2 | 18 | 36 | 1817 | 100.9 | 13509 | 750.5 | 55855 | 3103.1 |
| D1 | 4 | 6 | 23 | 496 | 82.7 | 2854 | 475.7 | 4537 | 756.2 |
| D2 | 4 | 6 | 24 | 677 | 112.8 | 4473 | 745.5 | 7510 | 1251.7 |
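The per-run figures behind the efficiency argument follow from dividing the per-sample protein counts in Table 3 by the number of runs per sample. A minimal sketch of that arithmetic is given below; the dictionary layout and function name are illustrative assumptions, and only the numbers come from Table 3.

```python
# Table 3 values: project -> (runs per sample, proteins per sample).
table3 = {
    "A1": (1, 276),
    "A2": (1, 308),
    "B": (12, 1022),
    "C": (18, 1817),
    "D1": (6, 496),
    "D2": (6, 677),
}


def proteins_per_run(runs, proteins):
    """Average number of proteins identified per HPLC run."""
    return round(proteins / runs, 1)


per_run = {project: proteins_per_run(*values) for project, values in table3.items()}

# Fractionated workflows (B through D2) cluster in a narrow per-run range.
fractionated = [per_run[p] for p in ("B", "C", "D1", "D2")]
print(per_run)
print("per-run range for B-D2:", min(fractionated), "to", max(fractionated))

# Workflow C vs Project A1: gain in proteins for 18 times the analysis time.
gain = round(table3["C"][1] / table3["A1"][1], 1)
print("protein gain of C over A1:", gain)
```

Under these numbers, an 18-fold increase in analysis time (Workflow C vs Project A1) yields roughly a 6.6-fold increase in identified proteins, which is one way to express the diminishing returns described above.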

Clustering Proteins by Solubility and Between Individuals

Although Workflow D was relatively unsuccessful in delving deeper into the proteome, it was apparent that many of the most abundant proteins were reproducibly detected solely or primarily in the soluble compartment. Other proteins were detected primarily in the insoluble compartment, and still others were distributed nearly evenly between both compartments. Table S10 (Supporting Information) lists those proteins for which there were at least 10 identifications. Muscle samples used in this workflow were derived from two individuals who had inclusion body myositis (IBM), and from two individuals without an inflammatory myopathy. The observation that individuals with IBM have depleted amounts of fast-twitch muscle proteins like MYH2 (ref 17) is evident in the samples from Workflow A2, Workflow B (data not shown), and Workflow D (Table S10, Supporting Information).

We sought to determine what other patterns of protein distribution might be present in these data. This situation is analogous to determining which subcellular fraction a protein belongs to, which has been termed protein correlation profiling.12 To accomplish this, K means clustering allowing for 6 clusters was performed. This approach takes advantage of the inevitable differences in tissue composition in the starting muscle biopsies, as well as the protein fractionation step. It does not require that anything be known about the disease status of the individuals in question. Each set of proteins with a similar profile gets assigned to the same cluster, as shown in Figure 4. The color pattern for these clusters was aligned to the rainbow, so that the most soluble proteins (clusters K1 and K2) are at the red end of the spectrum, while the most insoluble proteins (clusters K5 and K6) are green and blue. The patterns for clusters K2, K4 and K5 show nearly even distribution between individuals. Many of the proteins in cluster K2 were abundant proteins that were to some degree detected in all 8 channels, but were enriched in the soluble compartment. In contrast, the cluster K1 pattern is highest in the supernatant of the diseased individuals (individuals 3 and 4). At the other end of the spectrum, cluster K6 is highest in the pellet of the diseased individuals. Finally, cluster K3 is high only in the soluble compartment of individual 4, while cluster K4 is nearly evenly distributed across all 8 compartments. These generalizations are crude, because not all proteins in each cluster have exactly the same pattern. In fact, there is evidence that many more distinct patterns are present, with some proteins having patterns that do not fit with any other proteins.

To determine what these data signify, the abundant proteins were classified into 11 nonoverlapping categories, based on a combination of the traditional gene ontology (GO) classifications of pathway, structure, and subcellular location (Table S10, Supporting Information). All of the proteins that map to clusters K5 and K6 are also listed in Table 4. Subcellular location is the most important of these, because protein solubility was used to separate the proteins in Workflow D. Another constraint was that each category had to consist of at least 3 proteins, because singleton and doubleton categories would be uninformative. Figure 5 consists of 12 subpanels that portray the relationship between category and cluster, arranged roughly in order of increasing insolubility. Subpanel Figure 5A shows the distribution of all abundant proteins for comparison. Details as to which protein is in which category and in which cluster are listed in Table S10 (Supporting Information). A category of sarcomeric proteins was defined according to whether they were described as major components of the

Figure 4. K cluster patterns for differential solubility. K means clustering was performed to search for 6 clusters using the data centroid based search method starting from the percentage of spectra that were detected from each of 391 proteins detected in the eight compartments for project D2. Each protein is represented by a trace. The clusters were arranged so that the top left traces (Cluster K1) correspond to proteins enriched in the soluble compartment, and particularly in the third and fourth channels, which correspond to individuals with inflammatory myopathies. At the other extreme is cluster K6, which is enriched in the insoluble compartment. The color pattern is violet, K1; red, K2; orange, K3; yellow, K4; green, K5; blue, K6.
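To make the clustering step concrete, the sketch below applies a plain K-means procedure to the percentage-of-spectra profiles of six proteins whose counts are taken from Table 4. It is a toy illustration under stated assumptions, not the software or settings used in the study: only 2 clusters are requested instead of 6, the initial centroids are simply the first and last profiles, and all function names are our own. Channel order is insoluble norm-1, norm-2, ibm-3, ibm-4 followed by soluble norm-1, norm-2, ibm-3, ibm-4.

```python
def to_percent(counts):
    """Convert spectral counts to the percentage of a protein's spectra per channel."""
    total = sum(counts)
    return [100.0 * c / total for c in counts]


def sq_dist(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centroids, max_iter=100):
    """Plain Lloyd-style K-means; returns one cluster label per point."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        # Assign each profile to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Spectral counts from Table 4 (8 channels per protein).
spectra = {
    "ACTA1": [411, 378, 294, 272, 46, 74, 37, 43],
    "DES": [78, 65, 113, 96, 1, 1, 7, 10],
    "TTN": [1083, 1144, 936, 1070, 167, 35, 190, 381],
    "COL6A3": [26, 6, 115, 83, 3, 0, 0, 0],
    "FGA": [11, 0, 74, 18, 1, 1, 4, 2],
    "VIM": [12, 7, 66, 39, 1, 0, 18, 16],
}

names = list(spectra)
profiles = [to_percent(spectra[name]) for name in names]

# Assumption for this toy example: seed with the first and last profiles.
labels = kmeans(profiles, [profiles[0], profiles[-1]])
clusters = dict(zip(names, labels))
print(clusters)
```

In this toy run the three proteins with fairly even insoluble profiles (ACTA1, DES, TTN; cluster K5 in Table 4) separate from the three whose spectra are concentrated in the insoluble compartment of the IBM individuals (COL6A3, FGA, VIM; cluster K6 in Table 4). K-means results generally depend on initialization, so a real analysis would use several random starts.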

Table 4. Proteins that are Preferentially Insoluble in Project D2^a

| K | symb | name | ins norm-1 | ins norm-2 | ins ibm-3 | ins ibm-4 | ins | sol norm-1 | sol norm-2 | sol ibm-3 | sol ibm-4 | sol | all |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | ACTA1 | actin, alpha 1, skeletal muscle | 411 | 378 | 294 | 272 | 1355 | 46 | 74 | 37 | 43 | 200 | 1555 |
| 5 | ACTN1 | actinin, alpha 1 | 5 | 4 | 2 | 2 | 13 | 0 | 0 | 0 | 0 | 0 | 13 |
| 5 | ACTN2 | actinin, alpha 2 | 217 | 174 | 177 | 146 | 714 | 13 | 6 | 11 | 17 | 47 | 761 |
| 5 | ACTN3 | actinin, alpha 3 | 36 | 102 | 32 | 27 | 197 | 0 | 2 | 0 | 0 | 2 | 199 |
| 5 | AMPD1 | adenosine monophosphate deaminase 1 (isoform M) | 9 | 5 | 7 | 0 | 21 | 5 | 0 | 0 | 0 | 5 | 26 |
| 5 | APCS | amyloid P component, serum | 1 | 2 | 3 | 4 | 10 | 2 | 0 | 2 | 1 | 5 | 15 |
| 5 | APOBEC2 | apolipoprotein B mRNA editing enzyme, catalytic polypeptide 2 | 11 | 4 | 16 | 7 | 38 | 7 | 2 | 2 | 6 | 17 | 55 |
| 5 | CAPZA2 | capping protein (actin filament) muscle Z-line, alpha 2 | 8 | 11 | 7 | 5 | 31 | 0 | 0 | 0 | 0 | 0 | 31 |
| 5 | CAPZB | capping protein (actin filament) muscle Z-line, beta | 5 | 13 | 6 | 6 | 30 | 0 | 0 | 1 | 0 | 1 | 31 |
| 5 | CHCHD3 | coiled-coil-helix-coiled-coil-helix domain containing 3 | 3 | 5 | 2 | 3 | 13 | 1 | 3 | 0 | 2 | 6 | 19 |
| 5 | DCN | decorin | 6 | 0 | 9 | 3 | 18 | 5 | 0 | 3 | 1 | 9 | 27 |
| 5 | DPT | dermatopontin | 5 | 1 | 4 | 3 | 13 | 0 | 0 | 0 | 1 | 1 | 14 |
| 5 | DES | desmin | 78 | 65 | 113 | 96 | 352 | 1 | 1 | 7 | 10 | 19 | 371 |
| 5 | DMN | desmuslin | 12 | 8 | 10 | 14 | 44 | 0 | 0 | 0 | 1 | 1 | 45 |
| 5 | GYS1 | glycogen synthase 1 (muscle) | 7 | 8 | 5 | 6 | 26 | 4 | 5 | 1 | 5 | 15 | 41 |
| 5 | H1F0 | H1 histone family, member 0 | 7 | 0 | 4 | 3 | 14 | 1 | 0 | 0 | 0 | 1 | 15 |
| 5 | HSPA2 | heat shock 70 kDa protein 2 | 11 | 11 | 3 | 1 | 26 | 6 | 4 | 2 | 5 | 17 | 43 |
| 5 | HIST1H2B\?\ | histone cluster 1, H2bh | 10 | 7 | 9 | 11 | 37 | 0 | 0 | 0 | 0 | 0 | 37 |
| 5 | HIST2H2A\?\ | histone cluster 2, H2aa3 | 4 | 4 | 6 | 6 | 20 | 1 | 0 | 0 | 0 | 1 | 21 |
| 5 | H4FM | histone H4 | 18 | 12 | 20 | 22 | 72 | 0 | 1 | 1 | 0 | 2 | 74 |
| 5 | KBTBD10 | kelch repeat and BTB (POZ) domain containing 10 | 21 | 24 | 19 | 24 | 88 | 4 | 4 | 3 | 3 | 14 | 102 |
| 5 | KBTBD5 | kelch repeat and BTB (POZ) domain containing 5 | 7 | 3 | 7 | 5 | 22 | 1 | 0 | 0 | 2 | 3 | 25 |
| 5 | LMNA | lamin A/C | 26 | 16 | 33 | 33 | 108 | 0 | 0 | 1 | 1 | 2 | 110 |
| 5 | LAMA2 | laminin, alpha 2 | 8 | 1 | 3 | 4 | 16 | 2 | 0 | 0 | 0 | 2 | 18 |
| 5 | LAMC1 | laminin, gamma 1 (formerly LAMB2) | 2 | 3 | 6 | 2 | 13 | 1 | 1 | 0 | 0 | 2 | 15 |
| 5 | LGALS1 | lectin, galactoside-binding, soluble, 1 | 8 | 7 | 9 | 8 | 32 | 2 | 7 | 2 | 6 | 17 | 49 |
| 5 | LDB3 | LIM domain binding 3 | 83 | 56 | 60 | 47 | 246 | 29 | 26 | 11 | 26 | 92 | 338 |
| 5 | MYOM2 | myomesin (M-protein) 2, 165 kDa | 54 | 88 | 55 | 38 | 235 | 42 | 31 | 20 | 45 | 138 | 373 |
| 5 | MYOM1 | myomesin 1, 185 kDa | 54 | 50 | 55 | 43 | 202 | 25 | 12 | 7 | 25 | 69 | 271 |
| 5 | MYOM3 | myomesin family, member 3 | 29 | 9 | 23 | 24 | 85 | 5 | 0 | 3 | 6 | 14 | 99 |
| 5 | MYBPC2 | myosin binding protein C, fast type | 24 | 53 | 9 | 1 | 87 | 8 | 24 | 0 | 0 | 32 | 119 |
| 5 | MYLPF | myosin light chain 2 fast skeletal | 85 | 51 | 27 | 18 | 181 | 19 | 41 | 6 | 12 | 78 | 259 |

Table 4. Continued

| K | symb | name | ins norm-1 | ins norm-2 | ins ibm-3 | ins ibm-4 | ins | sol norm-1 | sol norm-2 | sol ibm-3 | sol ibm-4 | sol | all |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | MYH1 | myosin, heavy chain 1, skeletal muscle, adult | 90 | 313 | 102 | 38 | 543 | 10 | 25 | 3 | 2 | 40 | 583 |
| 5 | MYH2 | myosin, heavy chain 2, skeletal muscle, adult | 1316 | 1042 | 552 | 349 | 3259 | 185 | 121 | 98 | 99 | 503 | 3762 |
| 5 | MYH7 | myosin, heavy chain 7, cardiac muscle, beta | 679 | 583 | 890 | 756 | 2908 | 66 | 60 | 102 | 115 | 343 | 3251 |
| 5 | MYL1 | myosin, light chain 1, alkali; skeletal, fast | 84 | 74 | 57 | 48 | 263 | 18 | 21 | 7 | 8 | 54 | 317 |
| 5 | MYL2 | myosin, light chain 2, regulatory, cardiac, slow | 67 | 34 | 83 | 81 | 265 | 30 | 24 | 22 | 36 | 112 | 377 |
| 5 | MYL3 | myosin, light chain 3, alkali; ventricular, skeletal, slow | 43 | 42 | 54 | 70 | 209 | 4 | 6 | 6 | 8 | 24 | 233 |
| 5 | MYOT | myotilin | 32 | 21 | 38 | 20 | 111 | 2 | 0 | 0 | 0 | 2 | 113 |
| 5 | MYOZ1 | myozenin 1 | 40 | 33 | 18 | 20 | 111 | 0 | 2 | 0 | 0 | 2 | 113 |
| 5 | MYOZ2 | myozenin 2 | 10 | 8 | 11 | 6 | 35 | 0 | 0 | 0 | 0 | 0 | 35 |
| 5 | MYOZ3 | myozenin 3 | 5 | 7 | 4 | 4 | 20 | 0 | 0 | 0 | 0 | 0 | 20 |
| 5 | NDUFA5 | NADH dehydrogenase (ubiquinone) 1 alpha subcomplex, 5 | 5 | 2 | 2 | 1 | 10 | 3 | 1 | 0 | 2 | 6 | 16 |
| 5 | NEB | nebulin | 207 | 246 | 181 | 149 | 783 | 0 | 0 | 0 | 3 | 3 | 786 |
| 5 | LOC10013 | nebulin similar to | 24 | 36 | 21 | 23 | 104 | 0 | 0 | 0 | 0 | 0 | 104 |
| 5 | PDLIM5 | PDZ and LIM domain 5 | 45 | 32 | 20 | 25 | 122 | 3 | 11 | 0 | 1 | 15 | 137 |
| 5 | RPL12 | ribosomal protein L12 | 4 | 1 | 3 | 3 | 11 | 0 | 0 | 1 | 2 | 3 | 14 |
| 5 | RPS3 | ribosomal protein S3 | 6 | 6 | 7 | 5 | 24 | 2 | 1 | 1 | 2 | 6 | 30 |
| 5 | RPS8 | ribosomal protein S8 | 4 | 2 | 3 | 2 | 11 | 0 | 0 | 0 | 0 | 0 | 11 |
| 5 | RPLP0 | ribosomal protein, large, P0 | 2 | 4 | 5 | 0 | 11 | 0 | 0 | 0 | 0 | 0 | 11 |
| 5 | SMYD1 | SET and MYND domain containing 1 | 2 | 3 | 5 | 1 | 11 | 2 | 0 | 0 | 2 | 4 | 15 |
| 5 | SH3BGR | SH3 domain binding glutamic acid-rich protein | 4 | 2 | 1 | 1 | 8 | 2 | 3 | 0 | 0 | 5 | 13 |
| 5 | SYNPO2 | synaptopodin 2 | 21 | 13 | 9 | 5 | 48 | 0 | 0 | 0 | 0 | 0 | 48 |
| 5 | SYNPO2L | synaptopodin 2-like | 10 | 4 | 6 | 7 | 27 | 0 | 0 | 0 | 0 | 0 | 27 |
| 5 | TTN | titin | 1083 | 1144 | 936 | 1070 | 4233 | 167 | 35 | 190 | 381 | 773 | 5006 |
| 5 | TMOD4 | tropomodulin 4 (muscle) | 6 | 13 | 4 | 7 | 30 | 0 | 0 | 0 | 0 | 0 | 30 |
| 5 | TPM1 | tropomyosin 1 (alpha) | 125 | 163 | 91 | 58 | 437 | 14 | 21 | 8 | 3 | 46 | 483 |
| 5 | TPM2 | tropomyosin 2 (beta) | 82 | 95 | 72 | 53 | 302 | 9 | 9 | 7 | 13 | 38 | 340 |
| 5 | TPM3 | tropomyosin 3 | 54 | 60 | 59 | 69 | 242 | 1 | 4 | 6 | 5 | 16 | 258 |
| 5 | TPM4 | tropomyosin 4 | 2 | 2 | 2 | 2 | 8 | 0 | 0 | 2 | 0 | 2 | 10 |
| 5 | TNNC1 | troponin C type 1 (slow) | 30 | 21 | 41 | 48 | 140 | 2 | 5 | 4 | 5 | 16 | 156 |
| 5 | TNNC2 | troponin C type 2 (fast) | 52 | 29 | 18 | 9 | 108 | 17 | 21 | 5 | 4 | 47 | 155 |
| 5 | TNNI1 | troponin I type 1 (skeletal, slow) | 28 | 35 | 43 | 56 | 162 | 3 | 3 | 3 | 4 | 13 | 175 |
| 5 | TNNI2 | troponin I type 2 (skeletal, fast) | 18 | 34 | 12 | 7 | 71 | 0 | 5 | 0 | 0 | 5 | 76 |
| 5 | TNNT1 | troponin T type 1 (skeletal, slow) | 32 | 25 | 32 | 27 | 116 | 2 | 1 | 0 | 1 | 4 | 120 |
| 5 | TNNT3 | troponin T type 3 (skeletal, fast) | 87 | 93 | 46 | 20 | 246 | 6 | 7 | 1 | 3 | 17 | 263 |

Table 4. Continued

| K | symb | name | ins norm-1 | ins norm-2 | ins ibm-3 | ins ibm-4 | ins | sol norm-1 | sol norm-2 | sol ibm-3 | sol ibm-4 | sol | all |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | ACTC1 | actin, alpha, cardiac muscle 1 | 1 | 3 | 8 | 6 | 18 | 0 | 0 | 0 | 0 | 0 | 18 |
| 6 | BGN | biglycan | 0 | 0 | 6 | 5 | 11 | 0 | 0 | 0 | 1 | 1 | 12 |
| 6 | CMA1 | chymase 1, mast cell | 0 | 0 | 8 | 8 | 16 | 0 | 0 | 0 | 0 | 0 | 16 |
| 6 | COL1A2 | collagen, type I, alpha 2 | 1 | 0 | 8 | 11 | 20 | 0 | 0 | 0 | 0 | 0 | 20 |
| 6 | COL4A1 | collagen, type IV, alpha 1 | 3 | 1 | 9 | 3 | 16 | 0 | 0 | 0 | 0 | 0 | 16 |
| 6 | COL6A1 | collagen, type VI, alpha 1 | 12 | 1 | 38 | 21 | 72 | 0 | 0 | 0 | 0 | 0 | 72 |
| 6 | COL6A2 | collagen, type VI, alpha 2 | 1 | 1 | 15 | 9 | 26 | 0 | 0 | 0 | 1 | 1 | 27 |
| 6 | COL6A3 | collagen, type VI, alpha 3 | 26 | 6 | 115 | 83 | 230 | 3 | 0 | 0 | 0 | 3 | 233 |
| 6 | FBN1 | fibrillin 1 | 4 | 0 | 13 | 3 | 20 | 0 | 0 | 0 | 0 | 0 | 20 |
| 6 | FGA | fibrinogen alpha chain | 11 | 0 | 74 | 18 | 103 | 1 | 1 | 4 | 2 | 8 | 111 |
| 6 | FGB | fibrinogen beta chain | 3 | 0 | 49 | 12 | 64 | 0 | 1 | 5 | 0 | 6 | 70 |
| 6 | FGG | fibrinogen gamma chain | 4 | 0 | 37 | 8 | 49 | 0 | 1 | 5 | 2 | 8 | 57 |
| 6 | GSN | gelsolin (amyloidosis, Finnish type) | 0 | 3 | 9 | 3 | 15 | 2 | 0 | 3 | 3 | 8 | 23 |
| 6 | HIST1H1D | histone cluster 1, H1d | 5 | 0 | 6 | 3 | 14 | 0 | 0 | 0 | 0 | 0 | 14 |
| 6 | LAMB2 | laminin, beta 2 (laminin S) | 2 | 3 | 15 | 8 | 28 | 1 | 0 | 0 | 0 | 1 | 29 |
| 6 | LCN1 | lipocalin 1 (tear prealbumin) | 0 | 0 | 12 | 6 | 18 | 0 | 1 | 4 | 1 | 6 | 24 |
| 6 | MYBPH | myosin binding protein H | 0 | 0 | 6 | 11 | 17 | 0 | 0 | 2 | 5 | 7 | 24 |
| 6 | MYH3 | myosin, heavy chain 3, skeletal muscle, embryonic | 2 | 2 | 20 | 3 | 27 | 0 | 1 | 0 | 0 | 1 | 28 |
| 6 | MYH8 | myosin, heavy chain 8, skeletal muscle, perinatal | 1 | 1 | 8 | 19 | 29 | 0 | 0 | 1 | 6 | 7 | 36 |
| 6 | MYL4 | myosin, light chain 4, alkali; atrial, embryonic | \ill\ | 1 | 4 | 4 | 10 | 0 | 0 | 0 | 0 | 0 | 10 |
| 6 | MYL6B | myosin, light chain 6B, alkali, smooth muscle and nonmuscle | 10 | 8 | 24 | 28 | 70 | 0 | 0 | 1 | 2 | 3 | 73 |
| 6 | OGN | osteoglycin | 8 | 0 | 16 | 13 | 37 | 0 | 0 | 0 | 0 | 0 | 37 |
| 6 | PRELP | proline/arginine-rich end leucine-rich repeat protein | 2 | 0 | 6 | 8 | 16 | 0 | 0 | 2 | 1 | 3 | 19 |
| 6 | VIM | vimentin | 12 | 7 | 66 | 39 | 124 | 1 | 0 | 18 | 16 | 35 | 159 |

a The 90 proteins from Project D2 that were identified by at least 10 spectra, and that mapped to K cluster 5 or 6, are listed. The first column lists the K cluster, the second column lists the gene symbol, and the third column lists the protein name. The table was sorted first by K, and then by name. The remaining columns indicate the number of spectra mapped to each protein in each of the 8 compartments. The total number of identifications in the insoluble compartment, the soluble compartment, and overall are also listed. These same data are also represented in Table S6 (Supporting Information), along with the proteins from clusters K1-K4. The proteins in bold are proposed as novel candidates for major components of the sarcomere.
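The totals in Table 4 are simple sums over the four insoluble and four soluble channels, so the degree of insolubility of a protein can be expressed as the share of its spectra found in the insoluble compartment. The sketch below illustrates this for four rows of Table 4; the function name and the choice of rows are ours, and the percentage is an illustrative summary rather than a statistic reported in the study.

```python
# Rows from Table 4: symbol -> (insoluble counts, soluble counts),
# each ordered norm-1, norm-2, ibm-3, ibm-4.
rows = {
    "ACTA1": ([411, 378, 294, 272], [46, 74, 37, 43]),
    "TTN": ([1083, 1144, 936, 1070], [167, 35, 190, 381]),
    "COL6A1": ([12, 1, 38, 21], [0, 0, 0, 0]),
    "GYS1": ([7, 8, 5, 6], [4, 5, 1, 5]),
}


def summarize(ins, sol):
    """Return (ins total, sol total, overall total, percent insoluble)."""
    ins_total = sum(ins)
    sol_total = sum(sol)
    all_total = ins_total + sol_total
    pct_insoluble = round(100.0 * ins_total / all_total, 1)
    return ins_total, sol_total, all_total, pct_insoluble


for symbol, (ins, sol) in rows.items():
    print(symbol, summarize(ins, sol))
```

For ACTA1 this reproduces the table totals (1355 insoluble, 200 soluble, 1555 overall), about 87% insoluble, whereas GYS1 is only about 63% insoluble despite mapping to the same cluster.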

myofibrillar compartment of the sarcomere in ref 6 (Figure 5J). Another category was defined as cytoskeletal proteins that are not obviously components of the sarcomere (Figure 5I). Intermediate filament proteins were placed into a separate category (Figure 5G) so that they could better be compared among themselves. Extracellular matrix proteins including collagen made up the final category of primarily structural, insoluble proteins (Figure 5L). A broad category of soluble proteins (Figure 5E) was defined whose most abundant members are glycolytic enzymes, glycogen phosphorylase, creatine kinase, heat shock proteins, and redox proteins. This category also houses ribosomal proteins, and other cytosolic enzymes that are less abundant. Mitochondrial proteins were split into two categories: electron transport proteins (Figure 5H) and other mitochondrial proteins (Figure 5D). A category of plasma proteins (Figure 5B) was defined based on a recent survey of plasma proteins.20 Other categories included nuclear proteins (Figure 5K) and membrane proteins (Figure 5C). All remaining proteins were defined as unclassified proteins (Figure 5F). Using this classification scheme, more than half of the proteins in all but one category (cytoskeletal) fit into a single cluster, which is not surprising, as proteins with similar function are expected to partition in similar ways. However, some of the exceptions are noteworthy. To determine how well our data corresponded to what is known for many of these proteins, we relied on the databases maintained by Expasy (http://www.expasy.ch/), NCBI (www.ncbi.nlm.nih.gov), and GeneCards (www.genecards.org) to extract relevant information, unless specifically cited below.

Figure 5. Each pie chart indicates the proportion of proteins that reside in each cluster. The color scheme for the clusters is the same as in Figure 4. Starting from the top center, the color pattern is violet, K1; red, K2; orange, K3; yellow, K4; green, K5; blue, K6. The number in parentheses indicates the number of proteins in each category. Table S10 (Supporting Information) lists the 370 proteins with their category and cluster.

Plasma Proteins and Hemoglobins. Our initial expectation was that these proteins would be highly enriched in the soluble compartment, as they mostly derive from extracellular fluid or microvasculature. In fact, these proteins are predominantly in cluster K1, which is enriched in the diseased soluble compartment. However, Figure 5B shows that some plasma proteins are primarily insoluble in muscle biopsies; namely, the 3 fibrinogens (category 12), which presumably become insoluble upon clotting. Another exceptional plasma protein is gelsolin (GSN), which has two alternatively spliced forms. One splice variant of gelsolin causes it to be secreted, which explains why it is commonly observed in plasma proteomic studies.20 A second splice variant is not secreted, and binds to both actin and calponin.21 This second form is presumably the more abundant splice variant in muscle biopsies. Serum amyloid component P (APCS) also was found in cluster K5, perhaps due to association with extracellular matrix components or fibrinogen.22 The more abundant a protein, the more likely it is to be at least somewhat distributed in both compartments, because it is so far above the threshold for detection. This presumably explains why hemoglobin and albumin are readily detected in the insoluble compartment. However, it is notable that all six immunoglobulins were readily detected in the insoluble compartment of the second individual, but not any peptides from TF, C3, ApoB, or A2M (Table S10, Supporting Information).
This may indicate that immunoglobulins are bound to antigens in these autoimmune inflammatory myopathy samples.

Extracellular Matrix Proteins. Extracellular matrix proteins and collagens are prominent in connective tissue, and were largely insoluble as expected (Figure 5L). An exception was COL14A1, also known as undulin, which was exclusively soluble, and also specific to the individuals with IBM (Cluster K1). The remaining collagens are assigned to cluster K6, which is enriched in the insoluble compartment, especially in the second individual, perhaps as a consequence of disease-associated fibrosis. Laminin, fibrillin and osteoglycin were relatively insoluble, while lumican, fibronectin, and perlecan were present in both compartments.

Intermediate Filament Proteins. Although there were only 5 intermediate filament proteins annotated as such in the list of 370 proteins, they split themselves into 4 separate clusters (Figure 5G), indicating that these proteins are acting very differently from one another. The most commonly encountered intermediate filament proteins in many proteomics experiments are cytokeratins, which often derive from contamination upon handling from flecks of skin. In these experiments, many different cytokeratins were identified, but they all displayed a typical contamination profile: they were most abundant in Workflows B and D, which involve more extensive manipulation, and they also were commonly identified together in the same gel slice, regardless of the location of the slice, and did not correlate with individuals. In Workflow D, this wide distribution caused the cytokeratins to cluster to K4. Although they are expected to be contaminants, they have been left in Figure 5 and Table S10 (Supporting Information) (but are not included in the count of 370 proteins), to document the extent of this contamination. Two of the 5 remaining intermediate filament proteins were enriched in cluster K5, namely desmin and desmuslin. Desmin is known to interact with Z-disk proteins and may synchronize contraction between adjacent myofibrils.6 Desmuslin appears to behave similarly. Vimentin, though also largely insoluble, is highly variable between individuals (cluster K6). This is not surprising, as it is often associated with actively dividing cells and repair processes.
In contrast, nestin (NES) and restin (RSN) are largely soluble, but appear to be individual-specific and not correlated with one another (cluster K3). NES was observed in certain individuals using Workflow A, but RSN was never observed using Workflow A, though it was very prominent in Workflow C. Another intermediate filament protein, lamin A/C (LMNA), was classified as a nuclear protein (Figure 5K) and will be discussed below.

Enzymes. Most enzymes not known to be associated with organelles were classified into cluster K2, which is relatively evenly distributed in the soluble compartment of the four individuals (Figure 5E). An exception was adenosine monophosphate deaminase (AMPD1), which is known to bind to myosin heavy chain.23 Similarly, glycogen synthase was largely insoluble, which is consistent with previous studies.24 Apolipoprotein B mRNA editing enzyme (APOBEC2) was also, for unclear reasons, substantially insoluble, and ended up in cluster K5.

Heat Shock Proteins. There appear to be two distinct categories of heat shock proteins (hsp) (Table S10, Supporting Information). The hsp90 family representatives, PPIA, and VCP, are mostly soluble (cluster K2), whereas the hsp27 and hsp70 family members appear to be evenly distributed between the soluble and insoluble compartment (cluster K4). Of these, HSPA2 appears to be least soluble, as well as more differentially expressed between individuals, so that it was mapped to cluster K5.

Mitochondrial, Membrane, Nuclear, and Ribosomal Proteins. Most mitochondrial (Figure 5D) and membrane proteins were well solubilized with Triton X-100 (cluster K2), whereas the most prominent nuclear proteins (histones and lamin) remained insoluble (cluster K5). This may have been due to association with undegraded DNA. Note that AHNAK nucleoprotein (AHNAK) is found in the soluble cluster K1, which is not surprising because in spite of its name it is commonly cytoplasmic, and interacts with annexin A2,25 also in K1. We left this protein classified as undefined, because there is little published evidence that AHNAK is expressed primarily in the nucleus, and some evidence that it interacts with cytoskeleton. When less abundant nuclear proteins that did not make it into Table S10 (Supporting Information) were examined (data not shown), the other histones and lamin B2 (LMNB2) were found to be also mostly insoluble, but two karyopherins (especially KPBB1) and some hnrps (like HNRPD) were mostly soluble, suggesting that nuclei did not remain intact in the solubilization process. Surprisingly, the few ribosomal proteins that passed our criteria for high abundance were also in cluster K5.

Sarcomeric Proteins. Most traditional sarcomeric proteins, like myosin, actin, troponin and tropomyosin, as well as Z disk associated proteins like actinin, CapZ, myozenin, myotilin and titin map to cluster K5 (Table 4 and Figure 5J). However, other Z disk associated proteins like filamin C (FLNC) were nearly half-soluble (cluster K4). Beta actin (but not skeletal alpha 1 actin), cofilin 2, CSRP3, MYBPC1 (but not MYBPC2), obscurin, PDLIM3, and plectin behaved similarly to filamin C. Other proteins known to be connected in some fashion with the sarcomere, like ankyrins, dystrophin, the ryanodine receptor, and spectrins were mostly soluble, presumably due to a looser association with the sarcomere. Interestingly, some myosin isoforms map to the disease-associated cluster K6 (MYH3 and MYH8), while other myosin isoforms were readily solubilized (MYH9, MYH11, and MYH14) (Table S10, Supporting Information), and were therefore placed in the cytoskeletal category.
As these latter three proteins and filamin A are mainly expressed in smooth muscle or nonmuscle cells, this suggests that much of the material that derives from the cytoskeleton of other tissues in the biopsy besides muscle cells can readily be distinguished from the true sarcomeric proteins. Following this logic, any abundant protein that is tightly associated with the sarcomere should most likely be in cluster K5, or if more disease-specific, in cluster K6.

Unclassified Proteins. Figure 5 shows that overall the unclassified proteins have a similar distribution to the soluble proteins (Figure 5E vs 5F). However, chymase 1 and lipocalin 1 were in cluster K6, suggesting a tight association with the disease-specific extracellular matrix proteins. The remaining proteins in cluster K5 (see Table 4), namely CHCHD3, KBTBD5, PDLIM5, SMYD1, SH3BGR, and SYNPO2L, are good candidates for additional structural components of the sarcomere, but are not described in Clark et al.,6 although for several of them there is evidence that they are highly expressed in skeletal muscle (SwissProt, GeneCards). Additional directed studies would need to be performed to confirm whether these proteins indeed form tighter interactions in skeletal muscle than the other known sarcomeric proteins that are more soluble in our experiments.

In conclusion, muscle biopsies are heterogeneous. Overall, there is an excellent correspondence between the proteins detected in this study and those detected by Hojlund et al.;11 however, the present study offers numerous additional protein identifications, and documents the reproducibility of identification using four different workflows. Not surprisingly, the more separations that are performed, the larger the list of identified proteins. The large amount of supplemental data in this paper documents that by far the majority of proteins identified in less extensive workflows are readily detectable in more exhaustive workflows. The most abundant proteins derive largely from both the sarcomeres and cytoplasm of multinucleated muscle cells, as well as nonmuscle tissue like red blood cells, inflammatory cells, and connective tissue. Regardless of solubility considerations, every abundant sarcomeric protein is likely to be present in Table S10 (Supporting Information) in one cluster or another (or else it is not readily detected by trypsin-based mass spectrometry). Therefore, any model for the structural proteins of sarcomeres (as opposed to signaling pathways, which usually contain many less abundant proteins) should be consistent with these identifications.
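The statement that proteins from less extensive workflows are recovered in more exhaustive ones amounts to a set-overlap calculation on the lists of identified proteins. The sketch below shows the calculation on two small hypothetical identification lists; the gene symbols are borrowed from the text purely as placeholders, the sets are not the study's actual identification lists, and the function name is our own.

```python
def fraction_recovered(smaller, larger):
    """Fraction of the identifications in `smaller` that also occur in `larger`."""
    smaller, larger = set(smaller), set(larger)
    if not smaller:
        raise ValueError("the smaller identification list is empty")
    return len(smaller & larger) / len(smaller)


# Hypothetical toy lists (placeholders, not the real per-workflow results).
less_extensive = {"ACTA1", "MYH2", "TTN", "DES"}
more_exhaustive = {"ACTA1", "MYH2", "TTN", "DES", "RSN", "COL14A1"}

print(fraction_recovered(less_extensive, more_exhaustive))
print(fraction_recovered(more_exhaustive, less_extensive))
```

The measure is asymmetric: a small list can be fully contained in a large one while covering only part of it, which is the pattern described for Project A1 relative to the more fractionated projects.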

Acknowledgment. We thank Stephen Hattan for making useful comments on the manuscript.

Supporting Information Available: Supplementary tables. The supplementary files can be downloaded from https://proteomecommons.org/tranche/. This material is available free of charge via the Internet at http://pubs.acs.org.

References
(1) McDonald, W. H.; Yates, J. R., 3rd. Shotgun proteomics: integrating technologies to answer biological questions. Curr. Opin. Mol. Ther. 2003, 5 (3), 302–9.
(2) Zhao, C.; Denison, C.; Huibregtse, J. M.; Gygi, S.; Krug, R. M. Human ISG15 conjugation targets both IFN-induced and constitutively expressed proteins functioning in diverse cellular pathways. Proc. Natl. Acad. Sci. U.S.A. 2005, 102 (29), 10200–10205.
(3) Garcia, B. A.; Platt, M. D.; Born, T. L.; Shabanowitz, J.; Marcus, N. A.; Hunt, D. F. Protein profile of osteoarthritic human articular cartilage using tandem mass spectrometry. Rapid Commun. Mass Spectrom. 2006, 20 (20), 2999–3006.
(4) Nesvizhskii, A. I.; Aebersold, R. Interpretation of shotgun proteomic data: the protein inference problem. Mol. Cell. Proteomics 2005, 4 (10), 1419–40.
(5) Washburn, M. P.; Ulaszek, R. R.; Yates, J. R., 3rd. Reproducibility of quantitative proteomic analyses of complex biological mixtures by multidimensional protein identification technology. Anal. Chem. 2003, 75 (19), 5054–61.
(6) Clark, K. A.; McElhinny, A. S.; Beckerle, M. C.; Gregorio, C. C. Striated muscle cytoarchitecture: an intricate web of form and function. Annu. Rev. Cell. Dev. Biol. 2002, 18, 637–706.
(7) Fraterman, S.; Zeiger, U.; Khurana, T. S.; Wilm, M.; Rubinstein, N. A. Quantitative proteomics profiling of sarcomere associated proteins in limb and extraocular muscle allotypes. Mol. Cell. Proteomics 2007, 6 (4), 728–37.
(8) Ahn, N. G.; Shabb, J. B.; Old, W. M.; Resing, K. A. Achieving in-depth proteomics profiling by mass spectrometry. ACS Chem. Biol. 2007, 2 (1), 39–52.
(9) Desai, S. D.; Haas, A. L.; Wood, L. M.; Tsai, Y.-C.; Pestka, S.; Rubin, E. H.; Saleem, A.; Nur-E-Kamal, A.; Liu, L. F. Elevated expression of ISG15 in tumor cells interferes with the ubiquitin/26S proteasome pathway. Cancer Res. 2006, 66 (2), 921–928.
(10) Parker, K. C.; Kong, S. W.; Walsh, R. J.; Salajegheh, M.; Moghadaszadeh, B.; Amato, A. A.; Nazareno, R.; Lin, Y. Y.; Krastins, B.; Sarracino, D. A.; Beggs, A. H.; Greenberg, S. A. Fast-twitch sarcomeric and glycolytic enzyme protein loss in inclusion body myositis. Muscle Nerve 2009, Mar 16 epub.
(11) Hojlund, K.; Yi, Z.; Hwang, H.; Bowen, B.; Lefort, N.; Flynn, C. R.; Langlais, P.; Weintraub, S. T.; Mandarino, L. J. Characterization of the human skeletal muscle proteome by one-dimensional gel electrophoresis and HPLC-ESI-MS/MS. Mol. Cell. Proteomics 2008, 7 (2), 257–67.
(12) Andersen, J. S.; Wilkinson, C. J.; Mayor, T.; Mortensen, P.; Nigg, E. A.; Mann, M. Proteomic characterization of the human centrosome by protein correlation profiling. Nature 2003, 426 (6966), 570–4.
(13) Greenberg, S. A.; Sanoudou, D.; Haslett, J. N.; Kohane, I. S.; Kunkel, L. M.; Beggs, A. H.; Amato, A. A. Molecular profiles of inflammatory myopathies. Neurology 2002, 59 (8), 1170–82.
(14) LaRocque, R. C.; Krastins, B.; Harris, J. B.; Lebrun, L. M.; Parker, K. C.; Chase, M.; Ryan, E. T.; Qadri, F.; Sarracino, D.; Calderwood, S. B. Proteomic analysis of Vibrio cholerae in human stool. Infect. Immun. 2008, 76 (9), 4145–51.
(15) Craig, R.; Beavis, R. C. TANDEM: matching proteins with mass spectra. Bioinformatics 2004, 20, 1466–7.
(16) Peng, J.; Elias, J. E.; Thoreen, C. C.; Licklider, L. J.; Gygi, S. P. Evaluation of Multidimensional Chromatography Coupled with Tandem Mass Spectrometry (LC/LC-MS/MS) for Large-Scale Protein Analysis: The Yeast Proteome. J. Proteome Res. 2003, 2, 43–50.
(17) Parker, K. C.; Patterson, D.; Williamson, B.; Marchese, J.; Graber, A.; He, F.; Jacobson, A.; Juhasz, P.; Martin, S. Depth of proteome issues: a yeast isotope-coded affinity tag reagent study. Mol. Cell. Proteomics 2004, 3, 625–59.
(18) Washburn, M. P.; Wolters, D.; Yates, J. R., 3rd. Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nat. Biotechnol. 2001, 19 (3), 242–7.
(19) Hattan, S. J.; Marchese, J.; Khainovski, N.; Martin, S.; Juhasz, P. Comparative study of [Three] LC-MALDI workflows for the analysis of complex proteomic samples. J. Proteome Res. 2005, 4 (6), 1931–41.
(20) Hattan, S. J.; Parker, K. C. Methodology utilizing MS signal intensity and LC retention time for quantitative analysis and precursor ion selection in proteomic LC-MALDI analyses. Anal. Chem. 2006, 78 (23), 7986–96.
(21) Ferjani, I.; Fattoum, A.; Maciver, S. K.; Manai, M.; Benyamin, Y.; Roustan, C. Two distinct sites of interaction form the calponin: gelsolin complex and two calcium switches control its activity. Biochim. Biophys. Acta 2007, 1774 (7), 952–8.
(22) Pepys, M. B. Pathogenesis, diagnosis and treatment of systemic amyloidosis. Philos. Trans. R. Soc. Lond. B: Biol. Sci. 2001, 356 (1406), 203–10, discussion 210–1.
(23) Hisatome, I.; Morisaki, T.; Kamma, H.; Sugama, T.; Morisaki, H.; Ohtahara, A.; Holmes, E. W. Control of AMP deaminase 1 binding to myosin heavy chain. Am. J. Physiol. 1998, 275 (3 Pt 1), C870–81.
(24) Taylor, A. J.; Ye, J. M.; Schmitz-Peiffer, C. Inhibition of glycogen synthesis by increased lipid availability is associated with subcellular redistribution of glycogen synthase. J. Endocrinol. 2006, 188 (1), 11–23.
(25) Benaud, C.; Gentil, B. J.; Assard, N.; Court, M.; Garin, J.; Delphin, C.; Baudier, J. AHNAK interaction with the annexin 2/S100A10 complex regulates cell membrane cytoarchitecture. J. Cell Biol. 2004, 164 (1), 133–44.

PR800873Q

Journal of Proteome Research • Vol. 8, No. 7, 2009 3277