Place of Pattern in Proteomic Biomarker Discovery† Michael A. Gillette,*,‡,§ D. R. Mani,*,‡ and Steven A. Carr*,‡ The Broad Institute of MIT and Harvard, 320 Charles Street, Cambridge, Massachusetts 02141, and Massachusetts General Hospital, 55 Fruit Street, Boston, Massachusetts 02114 Received April 8, 2005
The role of pattern in biomarker discovery and clinical diagnosis is examined in its historical context. The use of MS-derived pattern is treated as a logical extension of prior applications of non-MS-derived pattern. Criticisms pertaining to specific technology platforms and analytic methodologies are considered separately from the larger issues of pattern utility and deployment in biomarker discovery. We present a hybrid strategy that marries the desirable attributes of high-information-content MS pattern with the capability to obtain identity, and explore the key steps in establishing a data analysis pipeline for pattern-based biomarker discovery.
Keywords: biomarker discovery • proteomics • diagnostic • mass spectrometry • serum/plasma • pattern • machine learning • pattern recognition • feature selection • classification
† Part of the Biomarkers special issue. * To whom correspondence should be addressed. E-mails: gillette@broad.mit.edu, [email protected], [email protected]. ‡ The Broad Institute of MIT and Harvard. § Massachusetts General Hospital.
10.1021/pr0500962 CCC: $30.25 © 2005 American Chemical Society. Journal of Proteome Research 2005, 4, 1143-1154. Published on Web 07/20/2005.
Introduction: Biomarkers, Patterns, and Diagnosis
Biomarkers are indicators of relevant biological conditions. In medical and pharmacogenomic applications, they are intended to provide answers, or to substantially increase the probability of particular answers, to a variety of essential questions: Does an individual have a given disease? Is the disease improving or worsening? Is therapy having the desired effect? Biomarkers have a role in risk assessment, disease prediction, early detection, diagnosis, prognosis, disease monitoring, and evaluation of therapeutic response. As such, improved biomarkers are urgently needed to facilitate both clinical care and biomedical research. Significant efforts are being made to find novel biomarkers in a plethora of fields including cardiovascular disease, inflammatory disease, neurodegenerative disease, and cancer.
Biomarker discovery has historically been dominated by targeted approaches, in which candidates derived from biological knowledge are evaluated for their correlations with biological conditions. The relative paucity of markers that have made it into clinical practice from these approaches has led many to the conclusion that unbiased discovery strategies, unencumbered by the constraints of current biological knowledge, are better suited to the development of novel biomarkers. Such markers may derive from many biological sources. In cancer, for instance, important diagnostic possibilities have emerged from the domains of cytogenetics, DNA methylation, mRNA expression, and protein analysis. Whatever the “unbiased” source of biomarker candidates, the fundamental observation of a reliable state-specific difference is virtually always at the level of pattern recognition; interpretation, if it comes at all, is a subsequent event (Figure 1). Cytogenetic abnormalities, for example, are often determined to be diagnostic of disease long before the associated genetic mechanisms are understood. Gene expression analyses, popular and promising sources for candidate biomarker signatures in recent years, proffer patterns in high-dimensionality expression space for disease classification and prognosis. Though the signatures come with gene labels for the constituent components, the existence of those labels does not improve their performance characteristics as biomarkers per se. Indeed, labels may be wrong (not uncommon in early versions of commercial arrays), or essentially arbitrary (as with expressed sequence tag designations), without prejudice to the test. Were there no labels, what would principally be sacrificed is not biomarker qualification but the possibility of their interpretation, and hence the ability to translate them to different analytical platforms, to use them to gain biological insight, or to derive from them candidates for therapeutic intervention. As anyone faced with the prospect of determining the significance of a diagnostic gene list from a microarray experiment will confirm, such interpretation is in any case often highly speculative, or utterly elusive.
Figure 1. Diagnostic patterns in clinical medicine. The use of pattern is widespread in clinical medicine, and often precedes molecular understanding of disease pathogenesis. (a) Plain chest radiograph of sarcoidosis; (b) Trisomy 21 karyotype of Down’s syndrome; (c) mRNA expression heat map in lung adenocarcinoma vs control; (d) Serum-derived mass spectra in disease and control.
Far from being unique to modern biomarker discovery, pattern recognition has always been a central part of medical diagnostics. Hippocrates coined the Greek term karkinos (cancer in Latin) to refer to malignant growth, reputedly because the pattern of growth (or of surrounding vessels) reminded him of the legs of a crab. Galen recognized the pattern of rubor, dolor, calor, and tumor (redness, pain, heat, and swelling) that to this day form the cornerstones of the diagnosis of local inflammation. Virchow’s artful use of microscopic pattern advanced cellular pathology as the standard for medical classification and diagnosis. Despite the generally more sophisticated understanding of disease pathogenesis, pattern recognition retains its importance in contemporary diagnostics: chest pressure with arm radiation, nausea, and diaphoresis are suspicious for myocardial ischemia; bilateral hilar lymph node enlargement on an otherwise unremarkable chest X-ray suggests sarcoidosis; a particular histopathologic appearance with standard stains conveys the diagnosis of a particular lung cancer subtype. Technological advances in magnetic resonance imaging (MRI), positron emission tomography (PET), and computer-aided tomography (CAT or CT) have provided a suite of noninvasive tools for generating image-based patterns of disease presence and stage in both research and clinical settings (see Chandra et al., this Special Issue).
Throughout the evolution of medical diagnostics, recognition of and reliance upon pattern has typically preceded, and supplied the impetus for, a deeper understanding of the underpinnings of that pattern, but those diagnostics retain their utility even when that understanding is not forthcoming.
The venerable, comfortable interplay between pattern and diagnostics that has extended naturally to most domains of biomarker discovery has quite recently evolved into an uneasy tension with respect to protein biomarkers. As proteins are key structural elements, catalysts, and communication lines in biological systems, and disturbances in proteins fundamental to disease, they have long been thought to provide a particularly rich source of biomarkers. Since at least the 1950s there has been support for the idea that plasma protein pattern might provide important insight into the presence and activity of disease. Assays to measure >100 different proteins in blood have been developed and are in routine use in clinical chemistry labs today [1]. Differential two-dimensional polyacrylamide gel electrophoresis (2D-PAGE) has been a mainstay of proteomic biomarker discovery [2]. Though fruitful, the method has faced challenges of limited sensitivity, reproducibility, and throughput. With advances in biological mass spectrometry methods and instrumentation, mass spectrometry (MS) of complex mixtures of proteins or peptides derived from blood or other readily accessible body fluids has now emerged as the preferred strategy for proteomic biomarker discovery.
The landscape of analytical and computational methods being employed for biomarker discovery in serum, plasma or other biofluids is quite diverse. Currently available platforms for biomarker discovery fall roughly into two categories: (1) pattern-based methods that focus on production of MS-derived protein pattern via SELDI [3,4], MALDI [5], or electrospray [6] and (2) methods that rely on proteolytic digestion of plasma proteins to peptides with analysis by LC-MS/MS [7-11]. This latter approach is “identity-based” in that the output consists of lists of peptide sequences which are associated with the proteins from which they are derived. There have been no solid data, reproducibility studies or, most importantly, follow-on validation studies to indicate with certainty which approaches are the most robust, and no clear consensus has emerged in the literature. Nevertheless, despite the long history of successful use of pattern in diagnosis, the suggestion that MS-derived pattern can lead to biomarkers has become a focus of considerable contention in the field [12,13]. In this paper, we give careful consideration to the role of pattern in protein biomarker discovery, espousing the view that the apt question is not whether MS pattern is useful but how best to use it. In particular, we believe that hybrid strategies that combine the best attributes of pattern- and identity-based methods may provide an optimal approach to the discovery of proteomic biomarkers by mass spectrometry.
Relative Quantitation from Pattern in Mass Spectrometry
The fundamental output of all mass spectrometers, regardless of type, ionization mode or performance characteristics, is a spectrum plotting mass-to-charge ratio on the x-axis versus detected ion flux on the y-axis. Subsequent interpretation of spectra, using characteristics such as isotope distribution, accurate mass, and sequence information (in tandem MS experiments) may allow portions of the spectrum to be labeled with a protein or peptide identity. Whether it carries an identity label or not, what gives a peptide or protein the status of a biomarker is its consistent variation in some fundamental characteristic, such as abundance, between two states, such as presence or absence of disease. Such abundance variation can occur via many mechanisms, including differential expression, sequestration, secretion, leakage, cleavage, etc. Though static elements in the sampled proteome, or those whose changes are random with respect to the distinction of interest, may be of interest or import for some other reason, they do not advance the cause of biomarker discovery.
Estimation of differential abundance requires relative quantitation, a difficult thing to do precisely by MS. A variety of methods have been developed for quantitation of proteins in complex mixtures that involve stable isotopic labeling [14,15]. An important aspect of these methods is that quantitation requires digestion to peptides of a size amenable to sequencing (typically 6-20 residues). Because MS-based peptide sequencing typically provides much less than 50% sequence coverage of any given protein, one seldom knows how much of the sequence of a given protein is actually present, or if a mixture of varying length proteins related to the parent protein is present.
Alternatively, the intensity or peak area (in ion counts) of parent ions (as well as the fragment ions derived from specific parent ions by MS/MS) can be used as a semiquantitative estimate of the change in abundance of the peptides or proteins they represent (for a peptide example, see MacCoss et al. [16]). While less precise than the isotopic labeling methods described above, ion current-based abundance measurements can be made on intact proteins without digestion to peptides, while isotopic labeling methods cannot. Regardless of whether it is derived from peptides or proteins, m/z and abundance pattern information is exploited in virtually all MS-based biomarker discovery efforts to obtain relative quantitative differences between the measured states (e.g., diseased and control). What is principally at issue between advocates and detractors of pattern-based methods is thus less the use of pattern than how the patterns are generated, their quality (resolution, precision, accuracy, and reproducibility), and whether and how identity is interposed between pattern and diagnostic. This can be more clearly appreciated by considering two broad applications of MS pattern: pattern as biomarker, and pattern to guide or enrich biomarker discovery.
Pattern As Biomarker. In the most direct application of mass spectrometric pattern to biomarker discovery, a class-specific differential pattern extracted from primary mass spectra can itself serve as a complex biomarker. Proponents of this approach generally build their case on a number of observations or assumptions central to the larger biomarker discovery effort. First, they contend that an MS profile with suitable performance characteristics (sensitivity and specificity) with respect to a disease would represent a functional diagnostic suitable for clinical validation and deployment. Second, they note the broad consensus view that the heterogeneity and complexity of host, environment, and disease imply that individual markers will rarely be sufficient for establishing diagnosis or prognosis, but that instead, panels of markers will typically be required. Because biomarker profiles are extracted from complex MS data with powerful machine learning algorithms, and because those profiles likely reflect contributions from a number of different molecular entities, this approach embraces the idea of multiple complementary markers at a very fundamental level. Third, they note that the same heterogeneities that suggest the need for multiple markers imply that large numbers of samples will need to be analyzed for effective marker discovery.
Since generally speaking the move from primary mass spectrum to protein identification is time-consuming, largely because it almost always is done in combination with lengthy LC fractionation, there is a sample throughput cost to identity that may be both prohibitive and unnecessary.
While this strategy is not dependent upon or constrained to any particular technology platform, it has been pioneered and particularly championed by Petricoin and Liotta at the National Institutes of Health, Grizzle at the University of Alabama, Birmingham, Semmes at Eastern Virginia Medical School, and Chan at Johns Hopkins, together with scientists from Ciphergen Biosystems (Fremont, California), all using some variation of Ciphergen’s surface-enhanced laser desorption/ionization mass spectrometry (SELDI-MS) approach. SELDI is a variant of MALDI in which the metal target surface is coated with a chromatographic stationary phase such as reverse phase, ion exchange, or immobilized metal affinity. In the Ciphergen instrument, these targets (or “chips” in their nomenclature) have 8 or 16 discrete target spots. These surfaces are intended to selectively retain a subset of proteins from complex biomaterials such as serum or cell lysate. Typically, samples are analyzed using a relatively low performance time-of-flight mass spectrometer, though Liotta and Petricoin have more recently moved to a higher data quality hybrid quadrupole time-of-flight tandem mass spectrometer. After initial data processing, a wide array of machine learning approaches has been used to extract informative features and build accurate classifiers (see below).
Although not the first account of SELDI-derived biomarker patterns, the landmark publication in Lancet by Liotta and Petricoin described a biomarker panel consisting of five m/z peaks capable of discriminating serum from women with and without ovarian cancer with 100% sensitivity, 95% specificity, and a nominal positive predictive value of 94% [4]. Scores of other publications have described discriminant SELDI-derived patterns for breast cancer [17,18], lung cancer [19,20], prostate cancer [21,22], pancreatic cancer [23,24], bladder cancer [25], gastric cancer [26], colon cancer [27,28], kidney cancer [29,30], brain cancers [31], and hepatocellular carcinoma [32], as well as such nonmalignant diseases as severe acute respiratory syndrome (SARS) [33,34], urolithiasis [35], and liver cirrhosis [36,37].
The lofty promise and high profile of the Lancet paper led it to come under particularly close scrutiny and repeated reanalysis, and we and others have noted a significant number of serious concerns. For instance, information in the noise region of the spectra allows effective classification; small numbers of randomly selected mass/charge values markedly outperform chance and may perform as well as the features that were used in the paper for classification; sample subsets have markedly different global spectral characteristics; and nominally discriminant features from the published experiment are uninformative in a subsequently analyzed dataset generated by the same authors with the same technology and made available online (ref 38; Gillette and Mani, unpublished analysis). These analyses, facilitated by the commendable decision of the authors to provide public access to all primary data (http://home.ccr.cancer.gov/ncifdaproteomics/ppatterns.asp), raise concerns regarding intrinsic data bias (likely artifactual, though of uncertain provenance), inadequate statistical rigor, and lack of generalizability of results.
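One of the reanalysis checks described above, comparing a published feature set against small random sets of m/z values, can be sketched as follows. This is a minimal illustration and not the analysis performed in ref 38 or by the authors: the nearest-centroid classifier, the leave-one-out scheme, the function names, and the toy intensities are all assumptions chosen for brevity. The toy data deliberately carry a global offset between classes to mimic the kind of bias discussed above.

```python
import random

def nearest_centroid_loo_accuracy(spectra, labels, features):
    """Leave-one-out accuracy of a nearest-centroid classifier
    restricted to the given feature (m/z bin) indices."""
    correct = 0
    for held_out in range(len(spectra)):
        # Build one centroid per class from all remaining samples.
        sums, counts = {}, {}
        for i, (spectrum, label) in enumerate(zip(spectra, labels)):
            if i == held_out:
                continue
            acc = sums.setdefault(label, [0.0] * len(features))
            for k, f in enumerate(features):
                acc[k] += spectrum[f]
            counts[label] = counts.get(label, 0) + 1
        centroids = {c: [v / counts[c] for v in sums[c]] for c in sums}
        # Assign the held-out sample to the closest centroid.
        x = [spectra[held_out][f] for f in features]
        predicted = min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
        )
        correct += predicted == labels[held_out]
    return correct / len(spectra)

def random_feature_baseline(spectra, labels, n_features, n_trials, seed=0):
    """Accuracy obtained from randomly chosen feature subsets."""
    rng = random.Random(seed)
    n_bins = len(spectra[0])
    return [
        nearest_centroid_loo_accuracy(
            spectra, labels, rng.sample(range(n_bins), n_features))
        for _ in range(n_trials)
    ]

# Hypothetical toy data: every bin of the "disease" spectra is shifted upward,
# as might happen with a processing or collection artifact.
control = [[1.0 + 0.1 * i + 0.01 * j for j in range(8)] for i in range(3)]
disease = [[11.0 + 0.1 * i + 0.01 * j for j in range(8)] for i in range(3)]
spectra = control + disease
labels = ["control"] * 3 + ["disease"] * 3

accuracies = random_feature_baseline(spectra, labels, n_features=2, n_trials=20)
print(min(accuracies), max(accuracies))
```

When random subsets classify about as well as the selected features, as they do in this deliberately biased toy set, the discrimination more plausibly reflects a global difference between sample groups than a specific set of markers.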
The claim of 94% positive predictive value has also been discredited, as it is based on the 43% “prevalence” of diseased samples in the analysis, rather than the 0.04% prevalence of the disease in a low risk clinical setting [39]. The resulting cautionary tale is a reminder that chance and bias must be avoided if possible through careful study design, that data should be systematically scrutinized for evidence of bias, that analyses should be conducted with statistical rigor, and that discovered patterns should ultimately be validated in totally independent sample sets [40,41]. Some of these points are addressed at greater length in the section on data analysis (below). Interested readers are also referred to the study by Zhang and colleagues for an example of a study in which chance and bias were properly considered [42].
Additional, more general criticisms have been directed at the SELDI technology itself. The low resolution, precision, and accuracy of the mass spectrometer may diminish the yield of detectable peptide and protein ions and complicate cross-spectral comparisons.
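Returning to the positive predictive value discussed above, its dependence on prevalence follows from Bayes' rule: PPV = (sensitivity × prevalence) / (sensitivity × prevalence + (1 − specificity) × (1 − prevalence)). The short sketch below plugs in the sensitivity, specificity, and the two prevalence figures quoted in the text; the function and variable names are illustrative only.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability that a positive test result is a true positive."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Prevalence of diseased samples within the study set (43%).
study_ppv = positive_predictive_value(1.00, 0.95, 0.43)

# Prevalence in a low risk clinical setting (0.04%).
screening_ppv = positive_predictive_value(1.00, 0.95, 0.0004)

print(f"study set: {study_ppv:.1%}, low risk setting: {screening_ppv:.2%}")
```

With the study-set prevalence the result is close to the reported 94%, whereas at 0.04% prevalence the same sensitivity and specificity give a positive predictive value below 1%, so nearly all positive results would be false positives.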
Reproducibility studies show higher spectral correlations from different spots on the same chip than across chips, suggesting possible room for improvement in chip manufacturing tolerances [43]. A major concern is that minimal sample fractionation coupled with low binding capacity of the chip chromatographic surfaces makes it inevitable that abundant proteins will (perhaps overwhelmingly) predominate in the spectra, an expectation that appears to be borne out in the very small number of SELDI studies in which identification was pursued and achieved [44-46]. This is important because some changes in abundant proteins are known or may prove to be nonspecific markers of illness or inflammation, and so may be much less informative once outside the artificial constraints of a biomarker discovery experiment, in which a comparison is often staged between specific disease samples and healthy controls, rather than between different types of disease. On the other hand, protein biomarkers identified by mass spectrometry are virtually never fully characterized, and there have been suggestions that disease specificity may be conferred by different isoforms or differential cleavage of abundant proteins [47-49]. In the latter case, the abundant protein may act as a biological amplifier of a subtler signal of differential enzyme abundance or activity.
Unfortunately, concerns about data quality and analysis, highly publicized skepticism about the specific findings of the landmark Lancet paper, dramatic representations of the results in lay publications, and the dearth of examples of clinically validated diagnostic MS patterns have conspired with general reservations about the performance characteristics of the SELDI platform (particularly among expert biological mass spectrometrists) to lead to a serious and sometimes unreflective skepticism regarding the potential of mass spectrometric pattern as biomarker. However, the merits of individual biomarker discovery efforts and of SELDI or other specific platforms should be evaluated separately from the general concept and strategy of pattern-based diagnostics, and the shortcomings in any one should not be held as indictments of the others.
Experimental approaches to pattern-based biomarker discovery can be designed to substantially address the aforementioned concerns. Depth of coverage can be improved with abundant protein depletion and separation using multiple dimensions of orthogonal chromatography with bead-based or other high binding capacity materials (though at some cost in sample throughput). Over the course of multiple biomarker discovery efforts, it should become clearer whether and where there is a “sweet spot” between throughput and depth, and between simplicity of processing and information content of spectra. High information content pattern data can be generated using mass spectrometers capable of high mass accuracy and high resolution to facilitate alignment across spectra (see below).
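Why mass accuracy facilitates alignment can be seen in a minimal sketch of tolerance-based peak matching. The matching rule, function names, and example m/z values below are assumptions for illustration; they do not represent the alignment procedure of any platform discussed here.

```python
def ppm_difference(mz_a, mz_b):
    """Difference between two m/z values in parts per million of mz_b."""
    return abs(mz_a - mz_b) / mz_b * 1e6

def candidates_within(reference, sample, tolerance_ppm):
    """For each reference peak, list the sample peaks within the tolerance."""
    return {
        ref: [mz for mz in sample if ppm_difference(mz, ref) <= tolerance_ppm]
        for ref in reference
    }

# Hypothetical peak lists: two closely spaced species observed in two spectra.
reference_peaks = [1000.000, 1000.050]
sample_peaks = [1000.004, 1000.053]

# A tight tolerance, plausible only on an accurate instrument.
tight = candidates_within(reference_peaks, sample_peaks, tolerance_ppm=10)

# A loose tolerance, as would be needed if mass assignments were less accurate.
loose = candidates_within(reference_peaks, sample_peaks, tolerance_ppm=100)

for ref in reference_peaks:
    print(ref, "tight:", tight[ref], "loose:", loose[ref])
```

With the tight tolerance each reference peak has a single candidate, so the correspondence across spectra is unambiguous. With the loose tolerance both sample peaks are candidates for both reference peaks, and some further rule would be needed to decide which peak is which.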
The ionization method and analyzer employed can also be selected to provide high dynamic range within each mass spectrum (whether MALDI, SELDI or LC-ESMS) to facilitate detection of minor components in the presence of major ones. Transparent, well-validated machine learning algorithms can be employed, and resulting diagnostic profiles can be evaluated with statistical rigor. Some positive steps in this direction were recently taken by Paul Tempst and colleagues [5]. They used bead-based rather than flat surface affinity chromatography, increasing their binding capacity by more than 2 orders of magnitude, and coupled this with relatively high performance MALDI-TOF MS. This yielded a feature space of over 400 well-defined peaks, richer than with comparable SELDI strategies, with instrument precision that facilitated alignment across spectra. Electrospray-based methods in which accurate retention time information supplements mass/charge and intensity data have also been recently described [7,8]. Electrospray has some specific advantages for the analysis of high-complexity biological samples which will be described below.
Although methods for robustly detecting high quality discriminant MS patterns in biofluids may negate the absolute requirement for identification of the peptides and proteins that constitute the pattern, identity retains advantages. Knowing the identity of the constituent peptides and proteins may increase confidence in the robustness of the assay, provide biological insight into disease pathogenesis, suggest therapeutic targets, and create the opportunity for transfer of the assay to an alternative technology platform (e.g., enzyme-linked immunosorbent assays (ELISAs)). The latter is important as the ability to utilize an MS platform in a clinical chemistry setting for MS-pattern based diagnostics is essentially untested. Though identification does not enhance biomarker performance per se, it may be difficult to protect intellectual property around an otherwise uncharacterized MS profile, decreasing the likelihood of resource commitment to the validation and clinical development of a pattern-based diagnostic (Leigh Anderson, personal communication).
Pattern to Guide or Enrich Biomarker Discovery: Impact of Dynamic Range, Resolution and Mass Accuracy. Generally speaking, what is of interest in biomarker discovery is only that set of molecular species that varies in a detectable and consistent way between relevant states, such as disease and control. Those who advocate the use of identified biomarkers are thus likely to care about the identities of only a small subset of the constituents of a complex sample. Time and resources directed to the identification of other components are in this sense wasted. Herein lies the motivation for the use of mass spectrometric pattern to guide or to enrich biomarker discovery. The distinction between these turns on whether any MS/MS data are obtained concurrently with primary MS data collection. In the “pattern to guide” approach, full scan mass spectra are subjected to feature selection and machine learning methods, and only those peaks that appear to be informative are selected for identification. In the “pattern to enrich” approach, tandem mass spectra acquired in a data-dependent manner during primary data acquisition are supplemented in a secondary, guided analysis after the MS data have been mined with pattern recognition methods (see further discussion below). Both approaches are facilitated by use of high quality, high information content data. The dynamic range, mass resolution and mass accuracy of the mass spectrometer employed have a significant impact on the information content of the pattern obtained.
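The “pattern to guide” idea described above can be sketched as a ranking step that shortlists peaks for follow-up identification. The sketch below is a minimal illustration under stated assumptions: a Welch t statistic stands in for the unspecified feature selection and machine learning methods, and the function names, m/z values, and intensities are hypothetical. A real analysis would also need multiple-testing control and validation in independent samples.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent groups of intensities."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

def select_peaks_for_identification(peak_table, labels, top_k):
    """Rank peaks by |t| between disease and control and return the top_k m/z values.

    peak_table maps an m/z value to a list of intensities, one per sample,
    in the same order as labels.
    """
    scores = {}
    for mz, intensities in peak_table.items():
        disease = [x for x, y in zip(intensities, labels) if y == "disease"]
        control = [x for x, y in zip(intensities, labels) if y == "control"]
        scores[mz] = abs(welch_t(disease, control))
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Hypothetical aligned peak intensities for four disease and four control samples.
labels = ["disease"] * 4 + ["control"] * 4
peak_table = {
    1046.5: [10.1, 9.8, 10.3, 10.0, 10.2, 9.9, 10.1, 10.0],   # essentially unchanged
    1465.7: [25.0, 27.1, 26.2, 24.8, 12.1, 11.5, 13.0, 12.4], # elevated in disease
    2021.1: [5.2, 4.1, 6.0, 4.8, 5.5, 4.4, 5.1, 5.9],         # noisy, no clear shift
}

shortlist = select_peaks_for_identification(peak_table, labels, top_k=1)
print("peaks to target for MS/MS identification:", shortlist)
```

Only the shortlisted m/z values would then be carried forward to targeted MS/MS, which is the sense in which pattern guides identification rather than replacing it.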
High instrumental dynamic range is necessary to deal with the enormous dynamic range of proteins present in plasma [2]. Many highly abundant proteins and their metabolic breakdown products may be present at 3 to 5 orders of magnitude higher concentration than proteins of likelier direct biological relevance to disease. There are two categories of dynamic range important to biomarker discovery: dynamic range within a spectrum or scan, and dynamic range obtained across an entire experiment consisting of many scans. In the latter case, dynamic range is improved beyond single spectrum values through chromatographic separation that effectively separates in time the more abundant from the less abundant components. Dynamic range within a spectrum is strongly affected by the ionization mode employed, with MALDI/SELDI TOF resulting in the lowest dynamic range, typically on the order of low parts per hundred. By contrast, the dynamic range of electrospray in full scan mode is typically on the order of parts per thousand, while for targeted analyses such as multiple reaction monitoring MS for detection of peptides in plasma, electrospray can achieve up to four orders of linear dynamic range [50].
The ability to align, detect abundance changes in, and confidently track these patterns across multiple patient samples are the central analytical requirements for proteomic pattern or “feature” recognition. The higher the instrumental mass resolution and mass accuracy, the easier and more reliable these tasks become. Assignment of accurate masses to molecular species observed on low performance instruments is generally acknowledged to be impossible. However, there is a standard misconception about the capabilities of intermediate resolution (10-20 K) and mass accuracy (5-50 ppm) instruments, particularly orthogonal quadrupole/quadrupole time-of-flight MS systems such as the QStar (Applied Biosystems) and Q-Tof (Waters). These instruments have the capability of allowing “exact” mass measurements of pure or nearly pure chemical species with uncertainties of less than 10 ppm [51-55]. However, it must be borne in mind that such performance, which requires measurement accuracy and reproducibility to within 1% of the peak widths obtained, is only achievable in the absence of other mixture components of similar mass. In other words, these instruments will often be inaccurate in complex mixtures such as minimally fractionated plasma, since adjacent peaks will not be resolved.
As with the approach of using pattern as biomarker, the general strategy of guiding biomarker discovery by pattern does not constrain the technologies used for the discovery effort. Indeed, this is the strategy that has been employed in those cases in which biomarker candidates selected by SELDI were subsequently identified. It is important to point out that a wide disparity exists between our present relative inability to define sequence information sufficient to identify proteins (>10 kDa) by direct MS analysis and the relative ease with which this can be done for peptides (