Detection and Validation of Non-synonymous Coding SNPs from Orthogonal Analysis of Shotgun Proteomics Data
Maureen K. Bunger,†,‡ Benjamin J. Cargile,†,‡ Joel R. Sevinsky,† Ekaterina Deyanova,§ Nathan A. Yates,§ Ronald C. Hendrickson,§ and James L. Stephenson, Jr.*,†
Mass Spectrometry Research Program, Proteomics Research Center, Research Triangle Institute, Research Triangle Park, North Carolina 27709, and Molecular Profiling Proteomics, Merck Research Laboratories, Merck & Company, Rahway, New Jersey 08854
Received February 16, 2007
Orthogonal analysis of amino acid substitutions as a result of SNPs in existing proteomic datasets provides a critical foundation for the emerging field of population-based proteomics. Large-scale proteomics datasets, derived from shotgun tandem MS analysis of complex cellular protein mixtures, contain many unassigned spectra that may correspond to alternate alleles coded by SNPs. The purpose of this work was to identify tandem MS spectra in LC-MS/MS shotgun proteomics datasets that may represent coding nonsynonymous SNPs (nsSNP). To this end, we generated a tryptic peptide database created from allelic information found in NCBI’s dbSNP. We searched this database with tandem MS spectra of tryptic peptides from DU4475 breast tumor cells that had been fractionated by pI in the first dimension and reverse-phase LC in the second dimension. In all, we identified 629 nsSNPs, of which 36 were of alternate SNP alleles not found in the reference NCBI or IPI protein databases. Searches for SNP-peptides carry a high risk of false positives due both to mass shifts caused by modifications and to multiple representations of the same peptide within the genome. In this work, false positives were filtered using a novel peptide pI prediction algorithm and characterized using a decoy database developed by random substitution of similarly sized reference peptides. Secondary validation by sequencing of corresponding genomic DNA confirmed the presence of the predicted SNP in 8 of 10 SNP-peptides. This work highlights that the usefulness of interpreting unassigned spectra as polymorphisms is highly reliant on the ability to detect and filter false positives.
Keywords: LC-MS/MS • single nucleotide polymorphism • false-positives • isoelectric focusing • pI filtering • population proteomics
* To whom correspondence should be addressed: James L. Stephenson, Jr., Ph.D., Senior Program Director for Mass Spectrometry Research, Research Triangle Institute, 3040 Cornwallis Road, Research Triangle Park, North Carolina 27709. E-mail: stephensonjl@rti.org. † Research Triangle Institute. ‡ Authors contributed equally. § Merck & Company.
Journal of Proteome Research 2007, 6, 2331-2340. Published on Web 05/09/2007. 10.1021/pr0700908 CCC: $37.00 2007 American Chemical Society
Introduction Mass spectrometry-based proteomic platforms have the capacity to identify large numbers of proteins in parallel from a single experiment.1-3 Unfortunately, because shotgun proteomics relies on a searchable database of known proteins, many high-quality spectra go unassigned due in part to absence of proteins from databases, presence of amino acid substitutions (that may result from a SNP), post-translational modifications, and alternative splicing of mRNA transcripts. One approach to discovering the information hidden in unassigned spectra is to develop searchable databases that incorporate sequences representing modified peptides and proteins.4-7 In this regard, by extending genome information into proteomic database development and search strategies, it becomes possible to obtain genomic-based data from existing MS/MS datasets and increase the depth of information gained from proteomics experiments.
Single nucleotide polymorphisms (SNPs) are the most abundant form of genomic sequence variation among populations of individuals.8-10 Genotyping SNPs is generally used to monitor linkage disequilibrium and haplotype structure in positional localization of disease-related genes. NCBI has compiled a dataset of over 10 million SNPs throughout the entire human genome resulting from publicly and privately funded genome sequencing projects (dbSNP).10-13 Among these are approximately 65 000 coding region SNPs that code for an amino-acid polymorphism (nonsynonymous SNP: nsSNP). The importance of finding nsSNPs is not only to discover variation in amino acid sequences that have functional consequences but also to provide information regarding the genetic, and possibly phenotypic, variability within the population of samples.12,14-19 Multiple cost-effective DNA targeted methods, including MS-based technologies, have been developed to discover and monitor SNPs.18,20,21 Although DNA genotyping scans perhaps have the greatest utility in defining haplotype structure on a genome-wide scale, because proteins are a major functional component of most disease states, information gained from being able to reliably monitor SNPs in proteomic data allows more functional inference to be assigned to particular expressed alleles. In this regard, the utility of monitoring expressed SNPs in proteomics will be in integrating protein expression analysis with genome information. Such analysis can reveal differential allelic expression that can be correlated to phenotypic variation between individuals. Recent interest in differential allelic expression has been driven by the discoveries that 45-56% of heterozygous alleles in humans are differentially expressed by a factor of 2 or more.22,23 In these analyses, oligonucleotide chips representing targeted SNPs were probed with RNA from white blood cells or embryonic kidney and liver. These methods are also being applied in Arabidopsis and yeast.24,25 Determining allelic expression differences by protein analysis can directly complement these RNA-based strategies and provide additional information on the scope of translation of differential allele expression to the protein level.
Mass difference (∆M) approaches have been previously applied in detecting both modifications and polymorphisms in MS/MS data. Savitski and colleagues developed a tool, ModifiComb, to search MS/MS data for families of closely similar spectra that differ by a pre-defined ∆M.
Peptides corresponding to PTMs were reported, and the capacity of the technique to detect polymorphisms was discussed.26 Roth and colleagues reported utilizing a top-down approach using accurate mass (FTMS) and selected ion targets to detect several amino acid differences by including known SNPs in the protein database search.27 One of the first descriptions of automated polymorphism identifications in MS/MS data involved developing a database of translated products from exhaustive in silico mutagenesis of the hemoglobin genes.28 A more recent approach used computational methods to identify high-quality spectra and then performed iterative searching of unassigned spectra against several types of databases.5 Although a SNPspecific database was not included, searches against an EST derived peptide database revealed several potential polymorphisms. Despite these efforts, a comprehensive analysis of proteome-wide SNP-peptide identifications has not yet been described, nor have previous reports fully addressed false positives by either computational means or by validation of the SNP at the DNA level. We present a refined approach to SNP annotation in an individual shotgun LC-MS/MS proteomic experiment using searches against reference protein databases and a separate SNP database created from peptides from the NCBI dbSNP database. Use of peptide pI filtering and extensive cross referencing between searches resulted in identification of 629 nsSNPs, including 36 that represented alternative alleles from the reference sequence with a low false positive rate. Importantly, a subset of SNP-peptides was confirmed to contain variation at the DNA level.
Materials and Methods Peptide Fractionation and Identification. Whole cell protein extracts were obtained from DU4475 human breast tumor cells by sonication of whole cells in a buffer of 8 M urea, 25 mM Tris, pH 7.6, and 100 mM NaCl. Urea concentration was then adjusted to 1 M using 25 mM Tris. Total protein was quantified using a Bradford protein assay (Pierce), and a total of 1 mg of protein was digested with trypsin at a ratio of 50:1 overnight at 37 °C. Digests were desalted using a C18-“light” Sep-pak
Journal of Proteome Research • Vol. 6, No. 6, 2007
Bunger et al.
(Waters). Peptides were separated using IPG-IEF (24 cm, pH 3.5-4.5) on an Ettan IPGPhor II manifold (Amersham) and fractionated into 60 gel pieces of 4 mm each.29,30 Peptides were extracted from IPG strips with 200 µL each of 0.1% trifluoroacetic acid (TFA), 0.1% TFA in 50% acetonitrile, and 0.1% TFA in 100% acetonitrile successively. Combined eluates were dried and resuspended in 0.1% TFA and were cleaned and desalted using Waters Oasis HLB Extraction Plates. Fractions were resuspended in 0.1% TFA and analyzed by reverse phase nanoLC-MS/MS using an LTQ-FT hybrid or an LCQ Deca XP Plus mass spectrometer (Thermo Finnigan). For analysis using the LTQ-FT hybrid, an HP1100 capillary pump (Agilent) was used. Solvent A was 0.1 M acetic acid in H2O, and solvent B was 0.1 M acetic acid in 90% acetonitrile/10% H2O. The gradient used was 0-3.0 min, 0% B; 3.0-39.1 min, 0-30% B; 39.1-59.0 min, 30-90% B; 59.0-59.1 min, 90-0% B; and 59.1-75.5 min, 0% B. The flow rate was 1 µL/min. A Famos 15 auto-injector (Dionex Corp.) was used to load the samples onto a trap column (100 µm ID, 2.5 cm, New Objective) packed with ProteoPepII C18 media, and the injection volume was 1 µL per sample. The peptides were eluted through a spraying column (100 µm ID, packed in house with POROS R2 media). A hybrid linear ion trap-FTMS (Thermo Electron) was used to acquire MS data. One full FTMS scan followed by three data-dependent ion trap MS/MS and an ion trap full scan were continuously acquired. The key settings for FTMS were: AGC = 1E+6, maximum injection time = 1.0 s, and resolution = 50 000. For analysis using the LCQ-Deca, the nanoLC system used was an LC Packings Ultimate Pump, Switchos column switching device and Famos Autosampler (Dionex Corporation, Sunnyvale, CA) and was coupled to the mass spectrometer using a nanospray interface.
The column consisted of a 100 µm ID piece of fused silica (10 cm length) packed with a monodisperse 5 µm polymeric packing material (Source 5RPC, gift from Amersham BioSciences, Piscataway, NJ). Approximately 1% of each fraction (which was 0.250 µL diluted into 4.75 µL water, 0.1% TFA) was loaded onto a capillary trap (same material as the column) and washed briefly with 0.1% aqueous formic acid (5 min) before switching inline with the analytical column. The nanoLC gradient was 80 min in length and progressed from 15 to 50% B (A: water with 0.1% formic acid, B: 70% ACN with 0.1% formic acid). The flow rate of the gradient was 250 nL/min. The ion trap mass spectrometer was programmed to take 1 full scan mass spectrum over the mass range of 400-1500 m/z followed by three tandem mass spectra of the three most intense ions. The NCBI (human build 36) and the IPI (human, v3.19) protein databases were reversed and indexed for tryptic peptides and searched with MS/MS spectra from each instrument using TurboSEQUEST (BioWorks, Thermo Finnigan). Data were then subjected to reverse database and pI filtering using in-house developed software (IDSieve).31 The algorithm used in pI prediction is described in Cargile et al. (submitted 2007). Actual SEQUEST cross-correlation score (Xcorr) cutoffs were determined for each fraction based on the Xcorr of the first reverse database hit.
Database Creation for nsSNPs. The database of known SNPs (K-SNPdb) was generated by filtering the NCBI dbSNP (build 126) for nsSNPs.13 Individual ASN-1 flat files for each human chromosome in dbSNP were downloaded by ftp. For each flat file, each rs number was queried for function class “nonsynonymous”. For each nsSNP entry, the protein accession number, location of nsSNP in the protein sequence, and the
Proteome-Based nsSNP Detection
identity of the amino acid change were then used to cross reference the NCBI protein database (human build 36), and a tryptic peptide was derived containing the reference and alternative alleles of each nsSNP and written to a new database in fasta format. Substitutions involving nonsense mutations (any amino acid and a “stop”) and single nucleotide deletions or insertions resulting in a frameshift were not included. Total database size was 125 622 tryptic peptides representing each allele of 62 811 nsSNPs. It should be noted that a polymorphism refers to a location of genetic variability and that each person will have one allele per chromosome at each SNP location. Typically a protein database only incorporates one allele for each nsSNP, which is designated the “reference” allele. This distinction is not currently associated with whether the allele is major or minor, which can only be determined by a population-based analysis. Overall genotype and haplotype of an individual is determined by incorporating all SNP identities, regardless of whether the allele is reference or alternative. Because we utilized two different reference databases, final determination of reference status in our SNPdb was empirical following query of each reference database as outlined below. To generate the FalseSNPdb, an equal number of peptides that were mass-equivalent to individual peptides in the K-SNPdb were selected from the IPI protein database. One amino acid in each IPI-derived peptide was then randomly substituted with another amino acid, creating a database with 125 622 decoy SNP peptides.
Search Parameters. Databases were searched with MS/MS data using TurboSEQUEST. LTQ-FT data were searched with a parent mass tolerance of 20 ppm (chosen due to mass drift noted over the course of the experiment), mass range 600-4500 Da, whereas LCQ-Deca generated MS/MS spectra were searched using a 2 Da parent mass tolerance, mass range 600-3500 Da.
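The peptide-level steps described under Database Creation for nsSNPs can be illustrated with a minimal Python sketch. It is not the authors' software: the function names, the toy protein sequences, and the cleavage rule (cut after K or R unless followed by P, no missed cleavages) are assumptions made for illustration. The decoy function covers only the random single-residue substitution; selecting mass-equivalent IPI peptides is not shown.

```python
import random
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def tryptic_spans(protein):
    """(start, end) spans of fully tryptic peptides with no missed cleavages.
    Assumed rule: cleave C-terminal to K or R unless the next residue is P."""
    cuts = [m.end() for m in re.finditer(r"[KR](?!P)", protein)]
    bounds = [0] + [c for c in cuts if c < len(protein)] + [len(protein)]
    return list(zip(bounds[:-1], bounds[1:]))

def peptide_covering(protein, index):
    """Tryptic peptide that contains the residue at 0-based `index`."""
    for start, end in tryptic_spans(protein):
        if start <= index < end:
            return protein[start:end]
    raise ValueError("index outside protein")

def allele_peptides(protein, position, ref, alt):
    """Reference and alternative peptides for an nsSNP at 1-based `position`.
    The variant protein is re-digested, so substitutions that create or
    remove a K/R cleavage site change the peptide boundaries."""
    i = position - 1
    if protein[i] != ref:
        raise ValueError("reference residue does not match protein sequence")
    variant = protein[:i] + alt + protein[i + 1:]
    return peptide_covering(protein, i), peptide_covering(variant, i)

def decoy_peptide(peptide, rng):
    """Replace one randomly chosen residue with a different amino acid."""
    i = rng.randrange(len(peptide))
    choices = [a for a in AMINO_ACIDS if a != peptide[i]]
    return peptide[:i] + rng.choice(choices) + peptide[i + 1:]

# Hypothetical toy proteins; only the embedded peptides appear in the paper.
print(allele_peptides("MSKEALDVLDAVLKTTR", 10, "D", "G"))
print(allele_peptides("AAKNPLLDRAAYDQEGRTT", 9, "R", "L"))
print(decoy_peptide("EALDVLDAVLK", random.Random(0)))
```

The second call shows why the variant sequence is digested again: replacing R with L removes a cleavage site, so the alternative peptide is longer than the reference peptide.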
All databases were fully tryptic with one missed cleavage allowed. Reference and alternative SNPs were determined by comparing Xcorr values derived from searching the K-SNPdb and the reference database, which was derived from either NCBI (build 36) or IPI (v3.19). If Xcorr values for matches of the same “.dta” file were the same between the respective reference database (either NCBI or IPI) and the K-SNPdb search, the SNP was determined to be a “reference allele” SNP. If the Xcorr for a match against the K-SNPdb was more than 15% higher than the corresponding reference hit, that peptide was assigned as an “alternative allele” SNP. The selection of 15% appeared to give the best balance between false positives and false negatives when the data were examined manually. False positive identifications were analyzed by comparing results to searches performed against the FalseSNP database. All resulting datasets were filtered for pI by calculating the maximum pI range of each fraction from reference database searches empirically after removal of outliers. Outlier removal was performed by quartile filtering using the standard 1.5 times the interquartile range (IQR) as cutoffs for elimination of outliers. The final accepted range was then set to the remaining high and low pI values. The pI ranges determined for each fraction following searches of the IPI protein database were used as filtering criteria for the smaller, less complex K-SNPdb searches.
PCR-Sequencing. Genomic DNA from DU4475 cells was isolated using Trizol reagent (Invitrogen, Carlsbad, CA) according to manufacturer’s instructions. Primer sequences (available upon request) were designed from genomic DNA sequence flanking putative SNPs using Primer3.32 Amplification was
Figure 1. Workflow. (A) MS/MS data are collected from tryptic peptides that were fractionated by IPG-IEF in the first dimension and reverse phase LC (RP-LC) in the second dimension. Data were searched against tryptic peptide databases derived from the IPI or NCBI protein databases and against a peptide database derived from dbSNP, resulting in 2 sets of “.out” files representing peptide identifications from each database. (B) Each set of results is filtered for pI by fraction as determined from the reference database searches. Reference database results were filtered by reverse database filtering and K-SNPdb results were filtered by a random substitution decoy database. (C) Each spectrum corresponding to a nsSNP was cross-referenced by Xcorr to the reference database results for the same spectrum, leading to lists of reference and alternative nsSNPs.
carried out by PCR and resulting single product bands were excised from agarose gels, purified using Qiaex spin columns (Qiagen), and sequenced in both the forward and reverse directions over the SNP by dye-terminator based sequencing (MWG, Winston-Salem, NC).
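The reference versus alternative allele assignment described under Search Parameters can be sketched as a small Python function. This is an illustrative reading of the rule, not the authors' code: the function name, the numerical tolerance used to decide that two Xcorr values are "the same", and the "unassigned" label for spectra that satisfy neither condition are assumptions.

```python
def classify_snp_hit(xcorr_snp, xcorr_ref, margin=0.15, tol=1e-3):
    """Label one spectrum that was searched against both databases.

    xcorr_snp: Xcorr of the best K-SNPdb match for the spectrum.
    xcorr_ref: Xcorr of the best reference-database match for the same spectrum.
    margin:    fractional improvement required (15% in the text).
    tol:       assumed tolerance for treating two Xcorr values as equal.
    """
    if abs(xcorr_snp - xcorr_ref) <= tol:
        return "reference allele"
    if xcorr_snp > xcorr_ref * (1.0 + margin):
        return "alternative allele"
    # Neither condition met; the text does not name this case.
    return "unassigned"

for snp, ref in [(3.5, 3.5), (3.6, 3.0), (3.2, 3.0)]:
    print(snp, ref, classify_snp_hit(snp, ref))
```

Keeping the margin as a parameter makes it easy to repeat the manual inspection the text describes at other thresholds.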
Results Overall Scheme. The overall scheme of the database development and searching is outlined in Figure 1. DU4475 cell extracts were digested with trypsin and peptides were fractionated using immobilized pH gradient isoelectric focusing (IPG-IEF). Resulting fractions were analyzed by LC-MS/MS using two different ion trap instruments, an LTQ-FT hybrid and an LCQ Deca XP Plus (LCQ). Data from the LCQ and the LTQ-FT instruments were searched separately due to differences in mass measurement accuracy. Spectra were searched against
Table 1. Results of Independent Database Searches
database | LTQ-FTa peptides | LTQ-FTa proteins | LCQ-deca peptides | LCQ-deca proteins | both peptides
IPI (v3.19) | 3344 | 1591 | 6241 | 2423 | 1891
NCBI (build 36.1) | 3397 | 1576 | 7027 | 2449 | 2112
a Only 40 fractions of 60 were analyzed.
both NCBI and IPI protein databases and a comprehensive database of known nsSNPs (K-SNPdb). Analysis involved two general search schemes and cross comparisons between searches. Comparisons of identifications between each of the reference databases (Table 1) and the K-SNPdb led to determination of “reference” vs “alternative” peptides (Tables 2 and 3). Although this work only utilized MS/MS analysis from a single sample, performing the searches in this manner specifically identifies allelic peptides that vary from the reference sequence (alternative allele peptides) and therefore represent spectra that would not have been assigned had only the reference database been searched. These spectra must be more carefully scrutinized as false positives are more likely to arise from the presence of post-translation or artifact modifications to the reference peptide that may result in a mass identical to the alternative allele. Establishment of Reference Peptide Datasets. MS/MS spectra from both instruments were used to search both the IPI (v3.19) and NCBI (build 36.1) protein databases. Each fraction was treated independently to determine Xcorr cutoffs and pI ranges. The Xcorr cutoffs for each peptide charge state were determined relative to the first hit from a reverse database version of each reference database.33 The pI range was determined by calculating the pI for each peptide identified, identifying outliers by quartile filtering, and setting the range to the remaining high and low pIs. These search parameters included the pI filtering and resulted in an estimated false positive identification rate less than 1%. The numbers of peptides and proteins identified are presented in Table 1. Also calculated and shown in Table 1 is the number of peptides that were identified in both runs. SNP Peptide Identifications. Each dataset was then searched against the K-SNPdb consisting only of SNP-derived peptides. 
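The two per-fraction filters described above for the reference datasets (an Xcorr cutoff taken from the first reverse-database hit, and a pI range set after quartile-based outlier removal) could be sketched as follows. This is a sketch under stated assumptions, not IDSieve: the function names and input layout are hypothetical, "first reverse hit" is read as the highest-scoring reverse hit, and the quartile convention is Python's default for `statistics.quantiles`, which the paper does not specify.

```python
import statistics

def reverse_hit_cutoff(hits):
    """Xcorr of the best-scoring reverse-database hit, or None if absent.
    `hits` is a list of (xcorr, is_reverse) pairs for one fraction and
    one charge state."""
    reverse_scores = [xcorr for xcorr, is_reverse in hits if is_reverse]
    return max(reverse_scores) if reverse_scores else None

def accept_forward_hits(hits):
    """Keep forward hits that score above the best reverse hit."""
    cutoff = reverse_hit_cutoff(hits)
    return [x for x, is_rev in hits if not is_rev and (cutoff is None or x > cutoff)]

def pi_range(pis, k=1.5):
    """Lowest and highest pI remaining after dropping values that lie more
    than k * IQR below the first quartile or above the third quartile."""
    q1, _median, q3 = statistics.quantiles(pis, n=4)
    spread = k * (q3 - q1)
    kept = [p for p in pis if q1 - spread <= p <= q3 + spread]
    return min(kept), max(kept)

# Made-up values for one fraction, for illustration only.
hits = [(4.1, False), (3.8, False), (3.0, True), (2.9, False), (2.5, True)]
print(accept_forward_hits(hits))
print(pi_range([4.00, 4.02, 4.03, 4.05, 4.06, 4.08, 4.10, 6.50]))
```

The range returned by `pi_range` for the reference search of a fraction would then serve as the acceptance window for K-SNPdb hits from the same fraction.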
To set Xcorr cutoff thresholds, the average pI was calculated for each fraction based on the reference database search (outlined above). For each putative SNP peptide, the difference from the average was calculated. This number was then plotted by Xcorr as shown in Figure 2. For the LCQ and LTQ-FT data, respective Xcorr cutoffs were set based on the percentage of peptides outside fraction-set pI ranges. Setting the LCQ cutoff to 3.2 and the LTQ-FT cutoff to 2.5 resulted in 12 and 7%, respectively, of peptide identifications falling outside the pI range calculated for their respective fractions, counting redundant identifications from multiple spectra only once. Xcorr-filtered data were then filtered by pI within each individual fraction. Following Xcorr and pI filtering, the search against the K-SNPdb identified 516 peptides within the LCQ data and 311 peptides within the LTQ-FT data. Of all of these peptides, 198 were overlapping between datasets (Table 2 and Supplemental Table 1, Supporting Information). The average peptide length was noted to be much longer in the LCQ-deca results than in the LTQ-FT results (Table 2, 17.7 vs 14.5, respectively). This is likely due to the narrow parent mass tolerance in LTQ-FT data that does not account for the 13C isotope peaks that become the most prevalent isotope in higher mass peptides. Therefore, longer
peptides have lower Xcorrs by LTQ-FT analysis than when using a 2 Da parent mass tolerance, as is used when searching LCQDeca data. Searches against each database produced similar numbers of reference and alternative SNP peptides. Total nonredundant identifications were calculated from both instruments and both reference databases (Table 2). In the published analysis of NCBI’s build 35 of the human genome, it was estimated that the protein coding region amounted to 34 Mb, or 1.2% of the genome.34 The current number of nsSNPs in NCBIs dbSNP database is approximately 65 000, indicating that there is a nsSNP every 523 bases of protein coding DNA. Comparing the number of SNP containing peptides to the number of total identified peptides within each dataset, we detected a SNP-peptide ranging from 1 in 12.8 to 1 in 11.7 peptides. The average length of peptides detected in our analysis was 15.9 amino acids. This translates to a nucleotide SNP frequency ranging from 1 in 610 to 1 in 559 (Table 2). When only considering peptides that overlapped between datasets, the frequency was slightly higher (1 in 482 and 1 in 534 for IPI and NCBI, respectively). This indicates our method results in detection of SNP-peptides at a level comparable to what is expected from the total number of peptides identified and the currently known nsSNP frequency among protein coding regions of the genome. Put another way, based on our SNP density, it could be expected that a similar number of the currently known SNPs would be detected if one were to sequence an equivalent amount of coding DNA. Comparing K-SNPdb peptide identifications to those from either the NCBI or IPI reference search resulted in 33 and 36 alternative allele SNP-peptides and a total of 54 nonredundant peptides between the data sets. 
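The density comparison above is simple arithmetic and can be reproduced with a short Python sketch. The function and variable names are ours; the inputs are the values quoted in the text and in Tables 1 and 2, and the factor of 3 assumes three nucleotides per amino acid. Results agree with the published figures only to within rounding.

```python
CODING_BASES = 34e6      # protein-coding sequence, from the text (34 Mb)
KNOWN_NSSNPS = 65_000    # approximate nsSNP count in dbSNP, from the text

# Expected spacing of known nsSNPs in coding DNA: about 523 bases.
genome_spacing = CODING_BASES / KNOWN_NSSNPS

def snp_density(total_peptides, snp_peptides, mean_length=15.9):
    """Return (peptides per SNP-peptide, nucleotides per SNP).

    total_peptides: peptides identified in the reference search (Table 1).
    snp_peptides:   reference-allele SNP-peptides (Table 2).
    mean_length:    average peptide length in residues (Table 2).
    """
    peptides_per_snp = total_peptides / snp_peptides
    return peptides_per_snp, peptides_per_snp * mean_length * 3

# LTQ-FT data searched against IPI: 3344 peptides, 285 reference SNP-peptides.
frequency, spacing = snp_density(3344, 285)
print(round(genome_spacing), round(frequency, 1), round(spacing))
```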
All spectra corresponding to these 54 identifications were visually examined to determine quality as well as whether the mass shift corresponding to the polymorphism occurred in the spectrum at the proper position in the y or b ion series. From the accurate mass dataset (LTQ-FT), 10 of 33 did not meet criteria. Seven of these 10 failed spectra were V to L shifts (+14 m/z units) that appeared to be inaccurately assigned. The 14 Da mass shift appeared to occur on either an E or a D or at the C-terminal K rather than as a true V to L substitution, indicating that probable methylation of these residues can occur and match a theoretical spectrum for a SNP peptide (Supplementary Table 1 and Supplementary Figure 1, Supporting Information). From the LCQ dataset, 2 of 36 spectra failed to meet criteria. These two peptides were large peptides (>32 amino acids) that received high Xcorrs using the SEQUEST algorithm despite fewer than 25% of peaks matching theoretical spectra. Fifteen identifications overlapped between the two datasets, resulting in a total of 36 highly confident, nonredundant alternative allele identifications (Table 2). Upon beginning these experiments, we noticed several instances in which each reference database differed in its reference sequence. Of the 42 alternative allele peptides that passed criteria outlined above, 27 were not found in either database and 15 peptides were found in only one database. If considered in terms of the 629 total SNP peptides identified from the K-SNPdb search alone, this indicates that approximately 2.4% of SNPs in the IPI and NCBI reference databases are different with respect to determination of the reference allele. Of six that were uniquely identified as alternative alleles with respect to the NCBI database, three were found as N-terminal methionine processed forms in the IPI database, indicating these are reference SNPs with respect to both databases, but NCBI did not include the processed form. The
Table 2. Results from K-SNPdb Searches
 | LTQ-FTa | LCQ-deca | total NRd | both | total database overlap
K-SNPdb (peptides) | 311 | 516 | 629 | 198 |
Average peptide length | 14.5 | 17.7 | 15.9 | |
IPI (v3.19): Reference alleles | 285 | 487 | 585 | 187 |
IPI (v3.19): Alternative alleles | 26 | 29 | 44 | 11 |
IPI (v3.19): SAP frequency (in peptides)b | 1 in 11.7 | 1 in 12.8 | | 1 in 10.1 |
IPI (v3.19): Est. SNP density (in nucleotides)c | 1 in 559 | 1 in 610 | | 1 in 482 |
NCBI (Build 36.1): Reference alleles | 284 | 483 | 580 | 187 |
NCBI (Build 36.1): Alternative alleles | 27 | 33 | 49 | 11 |
NCBI (Build 36.1): SAP frequency (in peptides)b | 1 in 11.9 | 1 in 12.9 | | 1 in 11.2 |
NCBI (Build 36.1): Est. SNP density (in nucleotides)c | 1 in 567 | 1 in 615 | | 1 in 534 |
Combined NRd: Alternative alleles | 33 | 36 | 54 | 15 | 27
a Only 40 fractions of 60 were analyzed. b Derived from number of peptides identified from reference database divided by number of reference peptides with a SAP. c Estimated using the average peptide length of 15.9 derived from combined nonredundant reference database search results. d NR: Non-redundant.
Table 3. Identified SNP-Peptides Not Present in Reference Database Sequences
peptidea | rs# | gene symbol | AA pos. | AA changec | mass difference (potential modification)b | minor allele (freq)e
Both allele IDs
EALDVLGAVLKd | rs17375461 | MRPS27 | 284 | D/G | -58 | D (32%)
SEALPTDLPTPSAPDLTEPK | rs11557488 | PRKCSH | 291 | A/T | +30 | uk
LISEVDSDGDGEISFQEFLTAAK | rs10904516 | CALML5 | 74 | K/R | +28 (formylation, dimethylation) | uk
LISEVDGDGDGEISFQEFLTAAR | rs11546426 | CALML5 | 58 | S/G | -30 | uk
MDTESELDLISR | rs10137921 | MTHFD1 | 761 | T/M | +46 | M (4%)
Alternative allele only IDs
PGPTAESASGPSEDPSVNFLKd | rs11548633 | SQSTM1 | 238 | E/K | N/A | uk
VGEEFEEQTVDGR | rs11549892 | CRABP2 | 77 | L/V | -14 | uk
DDANNDPQWCEEQLIAAKd | rs11549969 | PAICS | 167 | S/C | +16 (hydroxylation, oxidation) | uk
ALEDVFDALEGK | rs1799822 | CPT2 | 647 | M/V | -32 | V (10%)
LIDIFYPGDQQSVTFGIKd | rs2275272 | ALDH18A1 | 299 | T/I | +12 | I (5%)
NPLLDLAAYDQEGRd | rs2280084 | NUP210 | 790 | R/L | Miscleave & -43 | R (38%)
PADVYLIDEPSAYLDSEQRd | rs3816497 | ABCE1 | 489 | C/S | -16 | uk
SGDAAIVDMVPGKd | rs4000311 | XP_943717.1 | 202 | A/V | +28 (formylation, dimethylation) | uk
EGILGQHQFLEGPEGIENTR | rs10471371 | ITGA2 | 534 | K/E | N/A | K (8%)
EQLQQEQALLEEIER | rs11136334 | PLEC1 | 1276 | R/Q | Miscleave & -28 | uk
DLFANTVLSGGTTMYPGIADR | rs11546937 | ACTB | 294 | Y/F | -16 | uk
YPIEHAIVTNWDDMEK | rs11549185 | ACTG1 | 74 | G/A | +28 (formylation, dimethylation) | uk
NMITGTFQADCAVLIVAAGVGEFEAGISK | rs11550818 | Q2F837_HUMAN | 107 | S/F | +60 | uk
EMYTLGITNFPIPGEPGFPLNAIYAK | rs11553920 | ARPC3 | 101 | A/T | +30 | uk
RTATESFASDPILYR | rs11558357 | PKM2 | 92 | P/R | N/A | uk
AGALMMPLVDQLENR | rs2275687 | HEATR_HUMAN | 2017 | E/G | -72 | G (46%)
NFGAENPDPFVPVLSTAVK | rs2275689 | HEATR_HUMAN | 1694 | N/S | -27 | N (33%)
sfPSNIGQAQPIIDQLAEEAR | rs7459439 | SIAHBP1 | 218 | A/V | -28 | uk
LGEPAGESVENQEVQSK | rs848209 | SPEN | 1091 | L/P | -16 | L (43%)
Multiple loci ID (rs11549175, rs11546901, rs12132484, rs16914872, rs11554076, rs6533, rs9808251, rs11538705)
TTVIVMDSGDGVTHTVPIYEGYALPHAILR | multiple (listed above) | Multiple actin isoforms | - | G/V, H/R | - | uk
GFGFVTYATVEEVDAAMNARd | multiple (listed above) | Multiple | - | V/G | -42 | -
a Amino acid sequences of peptides with SNP-residue in bold. b Mass difference from reference sequence with potential false positive modifications in parentheses. N/A is not applicable due to creation of new cleavage site. c Annotated as reference/alternative allele amino acid. d Peptide found in both datasets. e Minor allele determined by average frequency of allele in all populations tested as annotated in NCBI dbSNP. uk, unknown population frequency.
rest (those remaining unique to each IPI or NCBI databases) were found to be simply different with respect to what is considered the reference allele in each database. These 27 alternative SNP-peptides identified that were not present in either reference database are shown in Table 3. Among these, three were found in both their reference and alternative allelic
forms suggesting heterozygosity. For most of those identifications, population-based allele frequency characterization is unknown (Table 3, last column). Determination of False Positives. A major consideration in the fidelity of polymorphism identification from MS/MS data is the role that mass shifts corresponding to various peptide
Figure 2. Using predicted peptide pIs in filtering SNP-peptides. An average pI was calculated for each individual fraction. The ∆pI was calculated for each putative SNP-peptide by subtracting the average pI for its fraction from the predicted pI for the peptide and then plotted by Xcorr. Horizontal lines indicate the average pI range for all fractions using reference peptides. (A) LCQ-deca using Xcorr cutoff of 3.2, average pI range = ±0.23 pI units. (B) LTQ-FT using Xcorr cutoff of 2.5, average pI range = ±0.08 pI units. Plots include all peptide identifications from K-SNPdb, including ones identified by multiple spectra; thus, the number of data points is much greater than the numbers reported in Table 2.
modifications may contribute to false positives. Shown in Table 3 and Supplementary Table 2 (see Supporting Information) is the mass shift resulting from each polymorphism with respect to the reference peptide and potential covalent peptide modifications that could also result in such a shift. The most common added mass was 14 Da, which correlates to several amino acid changes, including V to L/I, S to T, G to A, and D to E that could result from a single base substitution. In our dataset, only V to L/I was found as a +14 Da shift and in 7 of 8 instances of this substitution both versions of the peptide were identified (Supplementary Table 2, Supporting Information). Manual interpretation of the spectra clearly indicates the mass shift occurred at a D or E residue, or at the C-terminus rather than at the V in the ion series (Supplementary Figure 1, Supporting Information) so these can be easily identified and filtered out. Other common mass shifts observed that could
be modifications were +16 (oxidation, hydroxylation), +28 (formylation, di-methylation), and +42 (acetylation). Of those noted in Table 3, it was not possible to unambiguously determine if a mass shift corresponded to another modified residue by examining the spectrum. To determine a false positive identification rate from this type of search, we developed a FalseSNPdb by filtering IPI tryptic peptides by size and then substituting a random single amino acid in each. Data from the LTQ-FT and LCQ analyses were used to search the FalseSNPdb and then compared to identifications made by the IPI reference database search of the same MS/MS spectra. The same Xcorr and pI filtering criteria were applied resulting in identification of only 2 peptides from each dataset (data not shown). Four peptides from this decoy SNP peptide search remained following Xcorr and pI filtering compared to 54 from the comparable search and filtering against the K-SNPdb (Table 2), resulting in a theoretical false-positive identification rate of approximately 6%. The spectra for these peptides were examined manually. One matched the same spectra as a known SNP (rs4000311, XP_943717) but was derived from another gene in the gene family (XP_938616). Two are most likely modified peptides. One mapping to the highly abundant tubulin gene was a V to D substitution (+16 Da mass shift) that was next to a phenylalanine, which can be oxidized. The other was a V to L substitution producing a 14 Da mass shift that may represent a methylated normal peptide. As noted above, 7 V to L substitutions identified in the LTQ-FT dataset were also determined as likely false positives by manual interpretation of spectra. Thus, false positive identifications are most likely to arise from either modifications that mimic a substitution or ambiguous identifications due to similar protein sequences within gene families. Validation by PCR Sequencing. 
We selected 10 peptides from those listed in Table 3 for validation of the genomic DNA sequence. Of these, 3 peptides appeared to be heterozygous based on identification of both SNP-peptide alleles (one peptide appeared to be doubly heterozygous for two SNPs, rs10904516 and rs11546426), and the others appeared only on the alternative allele list, indicating possible homozygosity. An example of a SNP-peptide confirmed to be heterozygous is shown in Figure 3. LTQ-FT MS/MS spectra for both alleles of rs17375461 (EALDVL(G/D)AVLK) revealed a mass shift that occurs at the y5 ion. DNA sequencing showed the presence of both an A and a G nucleotide at the SNP position in the sequence chromatogram, confirming heterozygosity in this cell line (Figure 3B, inset). Of the other alleles that were sequenced, four were homozygous for the alternative allele as predicted and three were heterozygous. Two were homozygous for the reference allele, revealing these as false positives (Table 4). For most of those that we chose for sequencing, allele frequencies were not known (see Table 3). Among those with known allele frequencies, NPLLDLAAYDQEGR (rs2280084) was found to be homozygous for the L allele, which was the major allele with a frequency of 62%, whereas the peptide ALEDVFDALEGK (rs1799822), for which the V allele is the minor allele with a frequency of only 10%, was found to be heterozygous at the sequence level (data not shown).
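The y5 assignment for rs17375461 can be checked with a short calculation. The following is a minimal sketch, not the search software used in this work: it assumes standard monoisotopic residue masses, and the helper names (`precursor_mz`, `y_ions`, `first_shifted_y`) are hypothetical.

```python
# Monoisotopic residue masses (Da) for the residues that occur in the two peptides.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "V": 99.06841, "L": 113.08406,
    "D": 115.02694, "E": 129.04259, "K": 128.09496,
}
WATER = 18.01056   # H2O added to the summed residue masses
PROTON = 1.00728   # mass of one added proton


def precursor_mz(peptide, charge):
    """m/z of the [M + charge*H] ion of an unmodified peptide."""
    neutral = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER
    return (neutral + charge * PROTON) / charge


def y_ions(peptide):
    """Singly charged y-ion m/z values, ordered y1, y2, ..."""
    ions, running = [], WATER + PROTON
    for aa in reversed(peptide):
        running += RESIDUE_MASS[aa]
        ions.append(running)
    return ions


def first_shifted_y(ref, alt, tol=0.01):
    """Return (n, shift) for the first y_n ion that differs between two peptides."""
    for n, (a, b) in enumerate(zip(y_ions(ref), y_ions(alt)), start=1):
        if abs(b - a) > tol:
            return n, b - a
    return None


ref, alt = "EALDVLGAVLK", "EALDVLDAVLK"
print(round(precursor_mz(ref, 2), 2), round(precursor_mz(alt, 2), 2))
print(first_shifted_y(ref, alt))
```

Under these assumptions the doubly charged precursors come out at m/z 564.34 and 593.34, and the first y ion that differs is y5, with a shift of about 58 Da (glycine to aspartate).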
Table 4. Summary of PCR Sequencing Results
RS #        peptide (a)                  genotype (sequence) (c)  alleles detected in proteome
confirmed heterozygous SNPs
rs17375461  EALDVLGAVLK                  Het                      Both
rs10904516  LISEVDSDGDGEISFQEFLTAAK (b)  Het                      Both
rs11546426  LISEVDGDGDGEISFQEFLTAAR (b)  Het                      Both
rs1799822   ALEDVFDALEGK                 Het                      Alt only
confirmed homozygous SNPs
rs2280084   NPLLDLAAYDQEGR               Hm (Alt)                 Alt only
rs3816497   PADVYLIDEPSAYLDSEQR          Hm (Alt)                 Alt only
rs2275687   AGALMMPLVDQLENR              Hm (Alt)                 Alt only
rs11548633  PGPTAESASGPSEDPSVNFLK        Hm (Alt)                 Alt only
false positives
rs10137921  MDTESELDLISR                 Hm (Ref)                 Both
rs11549969  DDANNDPQWCEEQLIAAK (d)       Hm (Ref)                 Alt only
(a) Amino acid sequences with SNP-residue in bold. (b) Peptide was double heterozygous for each allele shown. (c) Het = heterozygous, Hm = homozygous, Alt = alternative allele, Ref = reference allele.
Figure 3. Sequence validation of heterozygous nsSNP. MS/MS spectra of peptides corresponding to rs17375461 with absolute ion intensity on the y-axis and mass to charge (m/z) ratio on the x-axis. (A) Tandem MS spectrum of m/z 564.34 with assignment of y and b ions corresponding to peptide EALDVLGAVLK. (B) Tandem MS spectrum of m/z 593.34 with assignment of y and b ions corresponding to peptide EALDVLDAVLK. Arrows in both A and B point to the y5 ion, where a 58 Da mass shift occurs corresponding to the substitution of glycine with aspartate. (B inset) Dye-terminator sequencing of PCR product from the region surrounding the SNP with the corresponding predicted amino acid sequence.
Discussion
Nonsynonymous SNPs not only contribute to the complexity of the proteome but also provide significant insight into genetic variability when comparing individuals in a population of proteomes. Even if a particular amino acid substitution does not directly influence protein structure or function, its presence can be used to infer local haplotype structure when combined with information gathered through the ongoing International HapMap project.8,9 Moreover, if proteomic data are derived from actual diseased tissue, expressed SNPs may hold better clues to the genetics underlying that disease. It is estimated that in any given tissue an average of 8000 genes are expressed.35,36 Considering the latest estimate of 24 000 genes in the genome37 and assuming the 65 000 coding SNPs are evenly distributed among those genes, this suggests that 21 645 nsSNPs are expressed in any given tissue and/or cell type. We detected 629 total nsSNPs over two replicates, corresponding to 2.9%, or 1 in every 34.5, of the expressed SNPs. SNP-peptide based
approaches that target only expressed proteins effectively limit the search space in which to find disease-relevant alleles; thus, certain nsSNPs identified in tissue-specific proteomic datasets can serve as “expressed” tag SNPs in genetic analysis. The challenge of identifying SNP-peptides in proteomics data is due in part to their relatively low abundance and to high general false-positive identification rates. The high false-positive rate mostly arises from post-translational modifications (PTMs) or peptide modifications during sample processing that produce mass shifts similar to those of true amino acid substitutions. Moreover, mass changes smaller than 2 m/z units are difficult to assign confidently even under accurate-mass conditions because 13C isotope peaks could be read as a 1 or 2 Da peptide mass shift. A peptide corresponding to an alternative SNP allele shows a mass shift relative to its reference allele that can be easily mimicked by peptide modifications, both those arising ex vivo (such as oxidation) and post-translational modifications that occur in the cell (such as methylation and deamidation).
Peptide pI filtering is highly useful for reducing false-positive identifications due to modifications.30,31,38 We used a narrow-range (pH 3.5-4.5) immobilized pH gradient gel as a first-dimension separation prior to LC-MS/MS. We have also developed a highly accurate pI prediction algorithm (Cargile et al., submitted) that can predict the pI of a peptide to within 0.03 pH units. Dividing the gel into 60 fractions produces fractions with narrow interquartile pI ranges, from 0.03 to 0.12 pI units for the LTQ-FT fractions and from 0.02 to 0.12 pI units for the LCQ-analyzed fractions (not including fractions near the ends of the strips, which contain peptides with pIs well outside the range of the strip). Using this pI prediction, we determined a maximum range for each fraction based on results from the reference database searches.
Using these ranges as cutoff values, we can eliminate peptide identifications that do not correspond to the range of the fraction. Modifications such as methylation or deamidation may give a peptide an actual pI that does not match the predicted pI of the true amino acid substitution; such a peptide migrates differently in the IEF separation and is therefore more likely to be filtered out in data processing. Our data show that Xcorr cutoffs alone do not suffice to eliminate false positives in SNP analysis, in that several high-Xcorr peptides were significantly outside the pI range of their fraction (Figure 2). The effect was more pronounced for the LCQ data (compare variability between Figure 2A and 2B), clearly indicating the value of pI filtering in SNP-peptide identifications, especially when analyzing lower-mass-accuracy MS/MS data.
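The combined Xcorr and pI filter described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: predicted pI values are taken as given (the prediction algorithm itself is not reproduced), the "maximum range" of a fraction is assumed to be the minimum and maximum predicted pI of its reference-database hits, a single Xcorr cutoff is used, and all field names and numbers are hypothetical.

```python
from collections import defaultdict


def fraction_pi_ranges(reference_hits):
    """Map each IEF fraction to the (min, max) predicted pI of its reference hits."""
    by_fraction = defaultdict(list)
    for hit in reference_hits:
        by_fraction[hit["fraction"]].append(hit["predicted_pi"])
    return {frac: (min(vals), max(vals)) for frac, vals in by_fraction.items()}


def filter_snp_hits(snp_hits, pi_ranges, min_xcorr):
    """Keep SNP-peptide hits that pass the Xcorr cutoff and fall in their fraction's pI range."""
    kept = []
    for hit in snp_hits:
        window = pi_ranges.get(hit["fraction"])
        if window is None:
            continue  # no reference hits available to calibrate this fraction
        low, high = window
        if hit["xcorr"] >= min_xcorr and low <= hit["predicted_pi"] <= high:
            kept.append(hit)
    return kept


# Hypothetical toy values, for illustration only.
reference_hits = [
    {"fraction": 12, "predicted_pi": 4.01},
    {"fraction": 12, "predicted_pi": 4.03},
    {"fraction": 12, "predicted_pi": 4.06},
    {"fraction": 13, "predicted_pi": 4.07},
    {"fraction": 13, "predicted_pi": 4.10},
]
snp_hits = [
    {"id": "candidate_1", "fraction": 12, "predicted_pi": 4.04, "xcorr": 3.1},
    {"id": "candidate_2", "fraction": 12, "predicted_pi": 4.40, "xcorr": 4.2},
    {"id": "candidate_3", "fraction": 13, "predicted_pi": 4.08, "xcorr": 1.2},
]

ranges = fraction_pi_ranges(reference_hits)
for hit in filter_snp_hits(snp_hits, ranges, min_xcorr=2.0):
    print(hit["id"])
```

In this toy input, `candidate_2` has the highest Xcorr but is rejected because its predicted pI lies outside the range of its fraction, which is the situation Figure 2 illustrates for real data.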
Another traditional method for identifying false positives, reverse database filtering, is not appropriate for these searches because it is unlikely to account for false positives due to mass shifts.33 Furthermore, because the K-SNPdb is a simple peptide-only database, the diversity of reverse database peptides would likely not be sufficient to be useful in filtering false positives. Searches against the K-SNPdb discussed above were compared to the reference search results to determine whether identifications associated with the same spectra had the same or a higher Xcorr. Spectrum matches with the same Xcorr in the K-SNPdb and the reference database (e.g., reference alleles) have the same false-positive identification rate as the reference database search alone (less than 1%). However, because alternative allele SNPs were identified only in the K-SNPdb search and not in the reference searches, filtering of the alternative allele K-SNPdb output relied entirely on Xcorr cutoffs and peptide pI filtering.
A FalseSNP database was set up to determine the likelihood of a random spectrum match by filtering the indexed IPI protein database for peptides that matched SNPdb peptides by size and then randomly substituting one amino acid in each. Because each peptide differs by only one amino acid from a peptide in the reference database, this decoy is more relevant to SNP searches than other methods such as amino acid scrambling or reversing. Results from searches of the LCQ and LTQ-FT data against the FalseSNPdb indicated that there were two main types of false positives. First, modifications producing +14 and +16 Da mass shifts were most prevalent, as indicated both by the FalseSNPdb search and by our manual interpretation of spectra. Of note, very few modification-based false positives were identified in the LCQ data compared to the LTQ-FT data. The most prominent modification was a +14 Da mass shift, suggesting methylation of D, E, or the C-terminus.
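A single-substitution decoy of the kind described above, together with a check for substitutions whose mass shift coincides with a common modification, might look like the sketch below. It is an assumption-laden illustration rather than the FalseSNPdb code: the function names, the 0.05 Da tolerance, and the choice to exclude mass-neutral substitutions (such as L to I) are all hypothetical, and size matching against SNPdb peptides is omitted.

```python
import random

# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

# Monoisotopic shifts (Da) of the modifications named in the text.
MODIFICATIONS = {
    "methylation": 14.01565,
    "oxidation/hydroxylation": 15.99491,
    "formylation": 27.99491,
    "dimethylation": 28.03130,
    "acetylation": 42.01057,
}


def make_decoy(peptide, rng):
    """Substitute one randomly chosen residue; return (decoy, mass shift in Da)."""
    pos = rng.randrange(len(peptide))
    old_mass = RESIDUE_MASS[peptide[pos]]
    # Assumption: skip replacements with identical mass (same residue, or L/I).
    choices = [aa for aa, m in RESIDUE_MASS.items() if m != old_mass]
    new = rng.choice(choices)
    return peptide[:pos] + new + peptide[pos + 1:], RESIDUE_MASS[new] - old_mass


def mimicking_modifications(delta, tol=0.05):
    """Names of modifications whose mass shift lies within tol Da of delta."""
    return [name for name, shift in MODIFICATIONS.items() if abs(delta - shift) <= tol]


rng = random.Random(0)  # fixed seed so the decoy set is reproducible
decoy, delta = make_decoy("ALEDVFDALEGK", rng)
print(decoy, round(delta, 4), mimicking_modifications(delta))

# The two substitutions discussed in the text:
print(mimicking_modifications(RESIDUE_MASS["L"] - RESIDUE_MASS["V"]))  # V to L
print(mimicking_modifications(RESIDUE_MASS["D"] - RESIDUE_MASS["V"]))  # V to D
```

With this tolerance, V to L is indistinguishable from methylation and V to D falls within 0.05 Da of oxidation, which is why such identifications need spectrum-level or orthogonal evidence.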
Methylation of glutamic acid has been reported previously in MS/MS data using ModifiComb.26 However, in our data, the frequency of this mass shift in the LTQ-FT analysis was not replicated when the same samples were analyzed on another ion trap instrument. Therefore, although it is possible that this modification is biologically relevant, it may represent an artifact of processing and analysis.
The second type of false positive in the FalseSNPdb searches arose from co-identification of related peptides from similar gene families that differ by a single amino acid. It is important to note that in closely related families of proteins it is not uncommon for a polymorphism in one gene to match identically a peptide derived from a related gene. Although none of this type were confirmed in the K-SNPdb search, one peptide of note, rs12132484 (GFGFVTYATVEEVDAAMNAR), serves as a reference allele for 6 unique SNPs among 5 loci related to heterogeneous nuclear ribonucleoprotein A1 (Table 3). Therefore, it is unlikely that genomic location can be unambiguously assigned from these types of identifications based solely on proteomic information.
Confirmation of 10 of the SNP identifications at the DNA level was performed by PCR amplification of the genomic DNA surrounding the SNP. Previous reports of SNP identifications from tandem mass spectrometry of peptides were not confirmed at the DNA level.5,26-28 Because of the potential for false identification due to PTMs and other modifications, we feel that orthogonal confirmation is an essential component of proteomics-based SNP identification. Indeed, of the 10 selected for sequence analysis, two were confirmed as false positives. One, rs10137921 (MDTESELDLISR), was found in both allelic forms at the peptide level in both datasets, and therefore it was surprising that the DNA sequence revealed only the reference allele. It is also important to note that, of the four that were heterozygous, only three were found in both allelic forms at the peptide level (Table 4). Thus, it cannot be assumed that failure to identify both peptide alleles of a particular SNP reference indicates homozygosity. Such a limitation will need to be acknowledged in any population-based proteomics experiment that uses this approach to SNP identification.
These observations lead to a number of recommendations for validating nsSNPs in MS/MS datasets. (1) Additional filtering criteria that reduce false positives, such as IPG-IEF and pI filtering, should be incorporated to increase overall confidence in identifications. Plotting the pI vs the Xcorr, as we show in Figure 2, also aids in identifying the appropriate Xcorr cutoff value that will reduce false positives. (2) Researchers should be consistent in their choice of a reference database, especially if monitoring SNPs quantitatively within a population of samples. (3) The frequency of a particular substitution should be compared to the frequency of that substitution in the dbSNP database. Over-representation of a substitution, for example V to L in our dataset, may indicate a predominant modification rather than a true SNP. As this represented half of the false positives, elimination of these by manual validation can increase confidence in the overall results. (4) Validation of SNP presence should be performed at the DNA level on at least a subset of identified polymorphisms.
Although proteomics-based SNP identifications cannot reach the genome-wide density of SNP analysis that several DNA-based strategies can, there are certain advantages to analyzing SNPs at the expressed peptide level. First, LC-MS/MS data can be queried even years later without repeating the data collection.
Thus, as SNP databases grow and LC-MS/MS search software improves, additional SNPs can be recorded and added to previous datasets without repeating sample preparation and analysis. Second, if diseased tissue is available for analysis, focusing efforts on alleles expressed in that tissue effectively limits the search space for finding alleles that exhibit linkage disequilibrium (LD) with respect to phenotypic outcome. Alleles exhibiting LD in a genome-wide analysis of a multi-allelic disease can be contributors to either disease risk or disease progression. For example, alleles associated with predisposition to nicotine addiction may or may not be related to alleles affecting the risk of lung diseases resulting from smoking. Genome-wide analysis would likely result in intermingling of these identifications and a confusing LD profile. Expressed allele haplotyping in lung tissue may be more likely to identify only alleles associated with the phenotype of interest, in this example lung disease. Expressed haplotypes therefore point to the actual phenotypic contribution of expressed alleles in a given tissue rather than to chromosomal distance to potential contributing alleles. Although expressed haplotyping can be gleaned from massive resequencing of EST libraries to a degree similar to a shotgun proteomics experiment, a shotgun proteomic approach is less expensive and time-consuming.
The critical limitation of MS-based approaches is that there is no guarantee that one would detect the same modified peptides from sample to sample. Therefore, false negatives across samples could be high using a typical shotgun approach. An alternative approach would be to establish commonly found SNPs and include only these in a more targeted analysis. Testing such an idea would require performing the analysis on populations.
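Recommendation (3) in the Discussion, comparing the frequency of a substitution among SNP-peptide identifications with its frequency in dbSNP, could be checked with a one-sided binomial test. The sketch below is one possible way to do this and is not a procedure used in this work; the function names are hypothetical and the counts and background fraction are invented for illustration only.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def overrepresentation(observed, total, background_fraction):
    """Return (fold enrichment, one-sided p-value) for one substitution type.

    observed: identifications carrying the substitution of interest
    total: all SNP-peptide identifications considered
    background_fraction: fraction of database nsSNPs with that substitution
    """
    enrichment = (observed / total) / background_fraction
    p_value = binomial_upper_tail(observed, total, background_fraction)
    return enrichment, p_value


# Illustrative numbers only (not taken from Table 3 or from dbSNP).
enrichment, p_value = overrepresentation(observed=8, total=50, background_fraction=0.04)
print(f"fold enrichment = {enrichment:.1f}, one-sided p = {p_value:.2g}")
```

A substitution that is strongly enriched relative to the database background would then be a candidate for manual spectrum inspection before being reported as a SNP.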
Conclusions
Systems biology combines proteomic and genomic experimental approaches to understand biological processes and disease. The true value of systems biology approaches lies in the integration of coherent datasets. Multiple data types are complementary and most powerful when used in combination. However, the cost of, and access to, systems biology approaches is prohibitive for many researchers. Thus, there is a great need for orthogonal data analysis methods that can be used to achieve systems-biology-level data from individual experimental platforms and existing datasets. This manuscript demonstrates a simple and novel way to identify nsSNPs from shotgun proteomics datasets and further establishes a reliable method for identifying and validating unassigned spectra in a typical shotgun experiment that represent alternative alleles of nsSNPs. Although detection of allelic diversity and heterozygosity within a sample is limited by the dynamic range of the typical LC-MS/MS proteomics experiment, allelic diversity among a population of samples can be inferred by comparing frequencies of allele identifications, leading to expressed haplotypes.
Acknowledgment. We thank Jonathan L. Bundy, Richard Schaat, and Fanyu Meng for critical review of the manuscript, Eric E. Schadt for critical discussion on the genetics, and David Kroll for supplying the DU4475 cells. Funding for this work was provided through the RTI Internal Research and Development program and a grant from Merck & Co.
Supporting Information Available: Figure 1, example of manual annotation of a false positive due to methylation; Table 1, Excel spreadsheet of all 629 SNP IDs; Table 2, +14 Da shift false positives. This material is available free of charge via the Internet at http://pubs.acs.org.
References
(1) Domon, B.; Aebersold, R. Mass Spectrometry and Protein Analysis. Science 2006, 312(5771), 212-217.
(2) Ferguson, P. L.; Smith, R. D. Proteome Analysis by Mass Spectrometry. Annu. Rev. Biophys. Biomol. Struct. 2003, 32(1), 399-424.
(3) Aebersold, R.; Mann, M. Mass spectrometry-based proteomics. Nature 2003, 422(6928), 198.
(4) Mann, M.; Jensen, O. N. Proteomic analysis of post-translational modifications. Nat. Biotech. 2003, 21(3), 255.
(5) Nesvizhskii, A. I.; Roos, F. F.; Grossmann, J.; Vogelzang, M.; Eddes, J. S.; Gruissem, W.; Baginsky, S.; Aebersold, R. Dynamic spectrum quality assessment and iterative computational analysis of shotgun proteomic data: toward more efficient identification of posttranslational modifications, sequence polymorphisms, and novel peptides. Mol. Cell Proteomics 2006, 5(4), 652-70.
(6) Creasy, D. M.; Cottrell, J. S. Error tolerant searching of uninterpreted tandem mass spectrometry data. Proteomics 2002, 2(10), 1426-1434.
(7) MacCoss, M. J.; Wu, C. C.; Yates, J. R., 3rd. Probability-based validation of protein identifications using a modified SEQUEST algorithm. Anal. Chem. 2002, 74(21), 5593-5599.
(8) Altshuler, D.; Brooks, L. D.; Chakravarti, A.; Collins, F. S.; Daly, M. J.; Donnelly, P. A haplotype map of the human genome. Nature 2005, 437(7063), 1299-1320.
(9) Gibbs, R. A.; Belmont, J. W.; Hardenbol, P.; Willis, T. D.; Yu, F. L.; Yang, H. M.; Ch’ang, L. Y.; Huang, W.; Liu, B.; Shen, Y.; Tam, P. K. H.; Tsui, L. C.; Waye, M. M. Y.; Wong, J. T. F.; Zeng, C. Q.; Zhang, Q. R.; Chee, M. S.; Galver, L. M.; Kruglyak, S.; Murray, S. S.; Oliphant, A. R.; Montpetit, A.; Hudson, T. J.; Chagnon, F.; Ferretti, V.; Leboeuf, M.; Phillips, M. S.; Verner, A.; Kwok, P. Y.; Duan, S. H.; Lind, D. L.; Miller, R. D.; Rice, J. P.; Saccone, N. L.; Taillon-Miller, P.; Xiao, M.; Nakamura, Y.; Sekine, A.; Sorimachi, K.; Tanaka, T.; Tanaka, Y.; Tsunoda, T.; Yoshino, E.; Bentley, D. R.; Deloukas, P.; Hunt, S.; Powell, D.; Altshuler, D.; Gabriel, S. B.; Qiu, R. Z.; Ken, A.; Dunston, G. M.; Kato, K.; Niikawa, N.; Knoppers, B. M.; Foster, M. W.; Clayton, E. W.; Wang, V. O.; Watkin, J.; Gibbs, R. A.; Belmont, J. W.; Sodergren, E.; Weinstock, G. M.; Wilson, R. K.; Fulton, L. L.; Rogers, J.; Birren, B. W.; Han, H.; Wang, H. G.; Godbout, M.; Wallenburg, J. C.; L’Archeveque, P.; Bellemare, G.; Todani, K.; Fujita, T.; Tanaka, S.; Holden, A. L.; Lai, E. H.; Collins, F. S.; Brooks, L. D.; McEwen, J. E.; Guyer, M. S.; Jordan, E.; Peterson, J. L.; Spiegel, J.; Sung, L. M.; Zacharia, L. F.; Kennedy, K.; Dunn, M. G.; Seabrook, R.; Shillito, M.; Skene, B.; Stewart, J. G.; Valle, D. L.; Clayton, E. W.; Jorde, L. B.; Belmont, J. W.; Chakravarti, A.; Cho, M. K.; Duster, T.; Foster, M. W.; Jasperse, M.; Knoppers, B. M.; Kwok, P. Y.; Licinio, J.; Long, J. C.; Marshall, P. A.; Ossorio, P. N.; Wang, V. O.; Rotimi, C. N.; Royal, C. D. M.; Spallone, P.; Terry, S. F.; Lander, E. S.; Lai, E. H.; Nickerson, D. A.; Abecasis, G. R.; Altshuler, D.; Bentley, D. R.; Boehnke, M.; Cardon, L. R.; Daly, M. J.; Deloukas, P.; Douglas, J. A.; Gabriel, S. B.; Hudson, R. R.; Hudson, T. J.; Kruglyak, L.; Kwok, P. Y.; Nakamura, Y.; Nussbaum, R. L.; Royal, C. D. M.; Schaffner, S. F.; Sherry, S. T.; Stein, L. D.; Tanaka, T. The International HapMap Project. Nature 2003, 426(6968), 789-796.
(10) Sachidanandam, R.; Weissman, D.; Schmidt, S. C.; Kakol, J. M.; Stein, L. D.; Marth, G.; Sherry, S.; Mullikin, J. C.; Mortimore, B. J.; Willey, D. L.; Hunt, S. E.; Cole, C. G.; Coggill, P. C.; Rice, C. M.; Ning, Z.; Rogers, J.; Bentley, D. R.; Kwok, P. Y.; Mardis, E. R.; Yeh, R. T.; Schultz, B.; Cook, L.; Davenport, R.; Dante, M.; Fulton, L.; Hillier, L.; Waterston, R. H.; McPherson, J. D.; Gilman, B.; Schaffner, S.; Van Etten, W. J.; Reich, D.; Higgins, J.; Daly, M. J.; Blumenstiel, B.; Baldwin, J.; Stange-Thomann, N.; Zody, M. C.; Linton, L.; Lander, E. S.; Altshuler, D. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 2001, 409(6822), 928-933.
(11) Wheeler, D. L.; Barrett, T.; Benson, D. A.; Bryant, S. H.; Canese, K.; Chetvernin, V.; Church, D. M.; DiCuccio, M.; Edgar, R.; Federhen, S.; Geer, L. Y.; Helmberg, W.; Kapustin, Y.; Kenton, D. L.; Khovayko, O.; Lipman, D. J.; Madden, T. L.; Maglott, D. R.; Ostell, J.; Pruitt, K. D.; Schuler, G. D.; Schriml, L. M.; Sequeira, E.; Sherry, S. T.; Sirotkin, K.; Souvorov, A.; Starchenko, G.; Suzek, T. O.; Tatusov, R.; Tatusova, T. A.; Wagner, L.; Yaschenko, E. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2006, 34(Database issue), D173-D180.
(12) Livingston, R. J.; von Niederhausern, A.; Jegga, A. G.; Crawford, D. C.; Carlson, C. S.; Rieder, M. J.; Gowrisankar, S.; Aronow, B. J.; Weiss, R. B.; Nickerson, D. A. Pattern of sequence variation across 213 environmental response genes. Genome Res. 2004, 14(10A), 1821-1831.
(13) Sherry, S. T.; Ward, M. H.; Kholodov, M.; Baker, J.; Phan, L.; Smigielski, E. M.; Sirotkin, K. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001, 29(1), 308-311.
(14) Rebbeck, T. R.; Spitz, M.; Wu, X. Assessing the function of genetic variants in candidate gene association studies. Nat. Rev. Genet. 2004, 5(8), 589-597.
(15) Salisbury, B. A.; Pungliya, M.; Choi, J. Y.; Jiang, R.; Sun, X. J.; Stephens, J. C. SNP and haplotype variation in the human genome. Mutat. Res. 2003, 526(1-2), 53-61.
(16) Cargill, M.; Altshuler, D.; Ireland, J.; Sklar, P.; Ardlie, K.; Patil, N.; Shaw, N.; Lane, C. R.; Lim, E. P.; Kalyanaraman, N.; Nemesh, J.; Ziaugra, L.; Friedland, L.; Rolfe, A.; Warrington, J.; Lipshutz, R.; Daley, G. Q.; Lander, E. S. Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nat. Genet. 1999, 22(3), 231-238.
(17) Hinds, D. A.; Stuve, L. L.; Nilsen, G. B.; Halperin, E.; Eskin, E.; Ballinger, D. G.; Frazer, K. A.; Cox, D. R. Whole-genome patterns of common DNA variation in three human populations. Science 2005, 307(5712), 1072-1079.
(18) Suh, Y.; Vijg, J. SNP discovery in associating genetic variation with human disease phenotypes. Mutat. Res. 2005, 573(1-2), 41-53.
(19) Ng, P. C.; Henikoff, S. Predicting the Effects of Amino Acid Substitutions on Protein Function. Annu. Rev. Genomics Hum. Genet. 2006, 7, 61-80.
(20) Jurinke, C.; Denissenko, M. F.; Oeth, P.; Ehrich, M.; van den Boom, D.; Cantor, C. R. A single nucleotide polymorphism based approach for the identification and characterization of gene expression modulation using MassARRAY. Mut. Res./Fundam. Mol. Mech. Mutagen. 2005, 573(1-2), 83.
(21) Chan, E. Y. Advances in sequencing technology. Mut. Res./Fundam. Mol. Mech. Mutagen. 2005, 573(1-2), 13.
(22) Lo, H. S.; Wang, Z.; Hu, Y.; Yang, H. H.; Gere, S.; Buetow, K. H.; Lee, M. P. Allelic Variation in Gene Expression Is Common in the Human Genome. Genome Res. 2003, 13(8), 1855-1862.
(23) Pant, P. V. K.; Tao, H.; Beilharz, E. J.; Ballinger, D. G.; Cox, D. R.; Frazer, K. A. Analysis of allelic differential expression in human white blood cells. Genome Res. 2006, 16(3), 331-339.
(24) Ronald, J.; Akey, J. M.; Whittle, J.; Smith, E. N.; Yvert, G.; Kruglyak, L. Simultaneous genotyping, gene-expression measurement, and detection of allele-specific expression with oligonucleotide arrays. Genome Res. 2005, 15(2), 284-291.
(25) West, M. A. L.; van Leeuwen, H.; Kozik, A.; Kliebenstein, D. J.; Doerge, R. W.; St. Clair, D. A.; Michelmore, R. W. High-density haplotyping with microarray-based expression and single feature polymorphism markers in Arabidopsis. Genome Res. 2006, 16(6), 787-795.
(26) Savitski, M. M.; Nielsen, M. L.; Zubarev, R. A. ModifiComb, a new proteomic tool for mapping substoichiometric post-translational modifications, finding novel types of modifications, and fingerprinting complex protein mixtures. Mol. Cell Proteomics 2006, 5(5), 935-948.
(27) Roth, M. J.; Forbes, A. J.; Boyne, M. T., 2nd.; Kim, Y. B.; Robinson, D. E.; Kelleher, N. L. Precise and parallel characterization of coding polymorphisms, alternative splicing, and modifications in human proteins by mass spectrometry. Mol. Cell Proteomics 2005, 4(7), 1002-1008.
(28) Gatlin, C. L.; Eng, J. K.; Cross, S. T.; Detter, J. C.; Yates, J. R., 3rd. Automated identification of amino acid sequence variations in proteins by HPLC/microspray tandem mass spectrometry. Anal. Chem. 2000, 72(4), 757-763.
(29) Cargile, B. J.; Sevinsky, J. R.; Essader, A. S.; Stephenson, J. L., Jr.; Bundy, J. L. Immobilized pH gradient isoelectric focusing as a first-dimension separation in shotgun proteomics. J. Biomol. Tech. 2005, 16(3), 181-189.
(30) Cargile, B. J.; Talley, D. L.; Stephenson, J. L., Jr. Immobilized pH gradients as a first dimension in shotgun proteomics and analysis of the accuracy of pI predictability of peptides. Electrophoresis 2004, 25(6), 936-945.
(31) Essader, A. S.; Cargile, B. J.; Bundy, J. L.; Stephenson, J. L., Jr. A comparison of immobilized pH gradient isoelectric focusing and strong-cation-exchange chromatography as a first dimension in shotgun proteomics. Proteomics 2005, 5(1), 24-34.
(32) Rozen, S.; Skaletsky, H. Primer3 on the WWW for general users and for biologist programmers. Methods Mol. Biol. 2000, 132, 365-386.
(33) Peng, J.; Elias, J. E.; Thoreen, C. C.; Licklider, L. J.; Gygi, S. P. Evaluation of multidimensional chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein analysis: the yeast proteome. J. Proteome Res. 2003, 2(1), 43-50.
(34) Lander, E. S.; Collins, F. S.; Rogers, J.; Waterston, R. H. Finishing the euchromatic sequence of the human genome. Nature 2004, 431(7011), 931-945.
(35) Jongeneel, C. V.; Delorenzi, M.; Iseli, C.; Zhou, D.; Haudenschild, C. D.; Khrebtukova, I.; Kuznetsov, D.; Stevenson, B. J.; Strausberg, R. L.; Simpson, A. J. G.; Vasicek, T. J. An atlas of human gene expression from massively parallel signature sequencing (MPSS). Genome Res. 2005, 15(7), 1007-1014.
(36) Su, A. I.; Wiltshire, T.; Batalov, S.; Lapp, H.; Ching, K. A.; Block, D.; Zhang, J.; Soden, R.; Hayakawa, M.; Kreiman, G.; Cooke, M. P.; Walker, J. R.; Hogenesch, J. B. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc. Natl. Acad. Sci. U.S.A. 2004, 101(16), 6062-6067.
(37) Collins, F. S.; Lander, E. S.; Rogers, J.; Waterston, R. H. Finishing the euchromatic sequence of the human genome. Nature 2004, 431(7011), 931-945.
(38) Cargile, B. J.; Bundy, J. L.; Freeman, T. W.; Stephenson, J. L., Jr. Gel based isoelectric focusing of peptides and the utility of isoelectric point in protein identification. J. Proteome Res. 2004, 3(1), 112-119.