Anal. Chem. 2010, 82, 6347–6349
Letters to Analytical Chemistry

Accurate Determination of Protein Secondary Structure Content from Raman and Raman Optical Activity Spectra

Myra N. Kinalwa, Ewan W. Blanch, and Andrew J. Doig*

Manchester Interdisciplinary Biocentre, The University of Manchester, 131 Princess Street, Manchester M1 7DN, United Kingdom

Raman spectroscopy measures molecular vibrations triggered by the inelastic scattering of light, while Raman optical activity (ROA) measures a small difference in the Raman scattering from chiral molecules using circularly polarized light. We used Raman or ROA spectra to determine the secondary structure contents (helix, sheet, or other) of proteins. Forty-four ROA and 24 Raman protein spectra were converted into mean intensities within 10 cm⁻¹ width bins. The partial least squares algorithm with 5-fold cross-validation was used to construct models to give secondary structure contents from spectral data. The optimized algorithm gives highly accurate secondary structure contents, with R² and rmsd values of 0.99 and 0.6–1.7%, respectively, for second derivative Raman data when comparing predicted to experimental data. Using ROA data from 620 to 1850 cm⁻¹ is almost as accurate. Analysis of amide I, II, and III and backbone spectral regions reveals the importance of each of these regions for secondary structure assignment. Raman and ROA may be the methods of choice for rapid measurement of protein secondary structure contents, since they have unprecedented accuracy.

It is often valuable to quickly determine a protein’s secondary structure through methods such as circular dichroism (CD) or infrared spectroscopy. CD, in particular, has proven to be a standard method for rapid measurement of helix and sheet contents, with thousands of proteins studied in this way.
Spectra are usually deconvoluted to three states of helix, sheet, and coil (other), using reference basis spectra,¹,² with the weightings of each basis spectrum giving the secondary structure contents.

Raman spectroscopy measures molecular vibrations triggered by the inelastic scattering of light and is being increasingly used for protein analysis. Raman optical activity (ROA) measures a small difference in the Raman scattering from chiral molecules using right- and left-circularly polarized light and provides a unique perspective on protein structure.³ Here, we show that Raman and ROA data can give highly accurate measurements of secondary structure contents. Ab initio calculations of Raman and ROA spectra of a small number of polypeptides have recently been reported,⁴,⁵ but due to the computational expense of modeling large molecules and hydration effects, these are not yet sufficient for accurate determinations of protein secondary structure.

Data sets of 44 ROA and 24 Raman protein spectra, with typical concentrations of 10–100 mg/mL and data collection times of 4–24 h, were compiled from published studies as previously described⁶,⁷ (Supporting Information Tables 1 and 2). Both ICP and SCP ROA spectra are included, but in the far-from-resonance limit, which is the case for all samples here, ICP and SCP spectra are identical.⁸ When several spectra were available for the same protein, the spectrum obtained under conditions closest to physiological was retained and the others were removed. Residues were assigned to three states, H (α-helix), E (β-sheet), and O (other/disordered/coil), using the Define Secondary Structure of Proteins (DSSP) algorithm⁹ and crystal structures. Protein samples that lack high-resolution structures were excluded.

Raw spectral data are in the form of pairs of wavelengths and intensities. These were processed by averaging intensities in bins of 10 cm⁻¹ width from 620 to 1850 cm⁻¹, giving 123 bins in total. Each bin typically included three or four data points. Data were scaled using the formula (X − X_min)/(X_max − X_min), where X is the raw bin value, X_max is the maximum bin value for that spectrum, and X_min is the minimum bin value for that spectrum, so that the bin with the highest mean intensity in each spectrum has a value of 1 and the bin with the lowest mean intensity has a value of 0. ROA data were then rescaled to the range −1 to +1, since they are bisignate. We also investigated 20 cm⁻¹ and 100 cm⁻¹ bin widths, but the performance was poorer than for 10 cm⁻¹ (data not shown). Decreasing the bin width much below 10 cm⁻¹ was not possible, as data would not always be available for each bin.

* To whom correspondence should be addressed. E-mail: andrew.doig@manchester.ac.uk.
(1) Greenfield, N. J. Nat. Protoc. 2006, 1, 2876–2890.
(2) Whitmore, L.; Wallace, B. A. Biopolymers 2008, 89, 392–400.
(3) Barron, L. D.; Hecht, L.; Blanch, E. W. Mol. Phys. 2004, 102, 731–744.
(4) Jacob, C. R.; Luber, S.; Reiher, M. J. Phys. Chem. B 2009, 113, 6558–6573.
(5) Luber, S.; Reiher, M. J. Phys. Chem. B 2010, 114, 1057–1063.
(6) McColl, I. H.; Blanch, E. W.; Gill, A. C.; Rhie, A. G. O.; Ritchie, M. A.; Hecht, L.; Nielsen, K.; Barron, L. D. J. Am. Chem. Soc. 2003, 125, 10019–10026.
(7) Zhu, F.; Isaacs, N. W.; Hecht, L.; Barron, L. D. Structure 2005, 13, 1409–1419.
(8) Nafie, L. A. Annu. Rev. Phys. Chem. 1997, 48, 357–386.
(9) Kabsch, W.; Sander, C. Biopolymers 1983, 22, 2577–2637.
10.1021/ac101334h © 2010 American Chemical Society. Published on Web 07/13/2010.
Analytical Chemistry, Vol. 82, No. 15, August 1, 2010

Table 1. ROA Regression Performanceᵃ

                             H                             E                             O
Region               R²      δ     SD      ζ       R²      δ     SD      ζ       R²      δ     SD      ζ
I                  0.72   12.3   23.3   1.90     0.81    7.7   17.8   2.32     0.44   10.8   14.6   1.35
II                 0.29   19.4   23.3   1.20     0.32   14.5   17.8   1.23     0.33   11.8   14.6   1.24
III                0.84    9.3   23.3   2.52     0.81    7.8   17.8   2.29     0.75    7.2   14.6   2.02
I + II             0.84    1.9   23.3   12.1     0.84    7.0   17.8   2.56     0.67    8.2   14.6   1.77
II + III           0.87    8.2   23.3   2.83     0.87    6.4   17.8   2.80     0.78    6.9   14.6   2.11
I + III            0.89    7.7   23.3   3.03     0.91    5.3   17.8   3.40     0.80    6.5   14.6   2.25
I + II + III       0.90    7.4   23.3   3.14     0.94    4.3   17.8   4.11     0.83    6.0   14.6   2.45
backbone           0.67   13.3   23.3   1.75     0.81    7.7   17.8   2.32     0.01   10.8   14.6   1.35
whole              0.98    2.9   23.3   8.09     0.98    2.5   17.8   7.22     0.96    2.7   14.6   5.34

ᵃ H = α-helix; E = β-sheet; O = other (disordered). R² = correlation coefficient; RMS (δ) = root mean squared deviation (%); SD = standard deviation (%) of secondary structure content for crystal structures of proteins in data set; ζ = ratio of SD/RMS.

Secondary structure peak assignments have previously been made for amide I, II, and III and backbone spectral regions.¹⁰,¹¹ We therefore looked at these regions both in isolation and in combination to see whether the information they contain is sufficient to determine secondary structure. Analyses were performed on the following: amide I (1600–1700 cm⁻¹), amide II (1510–1600 cm⁻¹), and amide III (1200–1340 cm⁻¹) modes, combinations of these, backbone stretch modes (850–1100 cm⁻¹), and second derivative Raman spectra.

Multiple regression methods are used to find correlations between measured variables and response variables to explain the behavior of the response variables. In our case, we used the 10 cm⁻¹ width bin intensities as the measured variables, with the secondary structure contents as the responses. Partial least-squares (PLS) regression¹² is a method for constructing predictive models when the factors are many and highly collinear. PLS extracts latent factors that account for the most variation in the response. MATLAB PLS regression software was used with 5-fold cross-validation to minimize the estimated mean squared error for the PLS models. The PLS performance is reported using Pearson's correlation coefficient (R²), the root mean squared deviation (rmsd, δ) between predicted and observed secondary structure contents, and the determination enhancement parameter ζ, which is the ratio of the standard deviation in the observed data to δ.¹³ ζ considers the variation seen in the variable that we are seeking to predict.
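In symbols, the two error measures can be sketched as follows. The notation is introduced here for illustration only and is not taken from the source: N is the number of proteins, y_i the observed content of one structural class for protein i, ŷ_i the corresponding cross-validated prediction, and σ_y the standard deviation of the observed contents (the SD column of Tables 1 and 2). The ratio is written as SD/rmsd, following the table footnotes; whether the standard deviation uses N or N − 1 is not stated.

```latex
\delta = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^{2}},
\qquad
\zeta = \frac{\sigma_y}{\delta}
```

Under these definitions, a model that simply predicted the mean content for every protein would give ζ of roughly 1, and larger values indicate a larger gain over that baseline.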
For example, if a response variable only occurs in the range 0–10%, it will be much easier to achieve a low rmsd between predicted and observed values than for a response variable that ranges from 0 to 80%. ζ takes this inherent variation into account by dividing the standard deviation of the response variable data by the rmsd value, so that larger values of ζ indicate better performance.

(10) Barron, L. D.; Hecht, L.; Blanch, E. W.; Bell, A. F. Prog. Biophys. Mol. Biol. 2000, 73, 1–49.
(11) Blanch, E. W.; Hecht, L.; Barron, L. D. Methods 2003, 29, 196–209.
(12) Geladi, P.; Kowalski, B. Anal. Chim. Acta 1986, 185, 1–17.
(13) Lees, J. G.; Miles, A. J.; Janes, R. W.; Wallace, B. A. BMC Bioinf. 2006, 7, 507–518.

Supporting Information Table 2 gives the predicted secondary structure contents using second derivative Raman data. Table 1 shows the method accuracies for each spectral subdivision for ROA data. Using whole spectra, we find R² correlation coefficients of 0.96–0.98 and rmsd values from 2.5% to 2.9% for the three structural classes. Using either the amide I or III region in isolation is more accurate than using amide II. Combining the amide region data generally improves performance. The backbone region data perform poorly, particularly for the O structural class. Whole spectra outperform all subdivisions, showing that there is information on secondary structure content outside the amide or backbone regions, despite a current lack of specific band assignments.

Table 2 shows the method accuracies for each spectral subdivision using Raman data. As with ROA, the amide I and III data are more useful than the amide II data, and combining them is generally beneficial. In contrast to ROA, however, the backbone data give highly accurate secondary structure contents, almost as good as the whole spectra. Overall, second derivative Raman whole spectrum data are the most useful, with rmsd values of only 0.6–1.7% and R² values all better than 0.99. α-Helix results were usually more accurate than those for β-sheet or coil, perhaps because there is more variability in the latter two classes, in backbone dihedral angles (φ, ψ), for example.
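The regression step behind Tables 1 and 2 can be illustrated in a few lines. The following is a minimal pure-Python sketch and not the MATLAB software used in this work: it fits a single-response PLS model (PLS1, NIPALS algorithm) to one structural class at a time, produces 5-fold cross-validated predictions, and computes R², δ, and ζ. The function names (`pls1_fit`, `pls1_predict`, `cross_validated_predictions`, `r_squared`, `rmsd`, `zeta`), the random fold assignment, and the use of the population standard deviation are all assumptions made for illustration.

```python
import math
import random


def pls1_fit(X, y, n_components):
    """Fit a single-response PLS model (PLS1) with the NIPALS algorithm."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xr = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centred copy
    yr = [v - y_mean for v in y]
    factors = []
    for _ in range(n_components):
        # Weight vector: direction in X with maximal covariance with y.
        w = [sum(Xr[i][j] * yr[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:  # nothing left in y that X can explain
            break
        w = [v / norm for v in w]
        t = [sum(Xr[i][j] * w[j] for j in range(p)) for i in range(n)]  # scores
        tt = sum(v * v for v in t)
        loading = [sum(Xr[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(yr[i] * t[i] for i in range(n)) / tt
        # Deflate X and y before extracting the next latent factor.
        for i in range(n):
            for j in range(p):
                Xr[i][j] -= t[i] * loading[j]
            yr[i] -= q * t[i]
        factors.append((w, loading, q))
    return x_mean, y_mean, factors


def pls1_predict(model, x):
    """Predict the response for one sample by replaying the deflation steps."""
    x_mean, y_mean, factors = model
    xr = [x[j] - x_mean[j] for j in range(len(x))]
    y_hat = y_mean
    for w, loading, q in factors:
        t = sum(a * b for a, b in zip(xr, w))
        y_hat += q * t
        xr = [a - t * b for a, b in zip(xr, loading)]
    return y_hat


def cross_validated_predictions(X, y, n_components, k=5, seed=0):
    """Predict every sample from a model trained on the other k - 1 folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    preds = [0.0] * len(X)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = pls1_fit([X[i] for i in train], [y[i] for i in train], n_components)
        for i in held_out:
            preds[i] = pls1_predict(model, X[i])
    return preds


def r_squared(pred, obs):
    """Squared Pearson correlation between predicted and observed values."""
    mp, mo = sum(pred) / len(pred), sum(obs) / len(obs)
    cov = sum((p - mp) * (o - mo) for p, o in zip(pred, obs))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_o = sum((o - mo) ** 2 for o in obs)
    return cov * cov / (var_p * var_o)


def rmsd(pred, obs):
    """Root mean squared deviation between predicted and observed values."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))


def zeta(pred, obs):
    """Population standard deviation of the observed values divided by rmsd."""
    mean = sum(obs) / len(obs)
    sd = math.sqrt(sum((o - mean) ** 2 for o in obs) / len(obs))
    return sd / rmsd(pred, obs)
```

In this sketch each row of `X` would hold the scaled bin intensities of one spectrum and `y` the helix, sheet, or other content from the crystal structure. The number of latent factors would be chosen by repeating the cross-validation for several values of `n_components` and keeping the one with the lowest mean squared error.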
Protein secondary structure contents have previously been determined from IR data, with reported rmsd values for α-helix of 2.5–10% and poorer performance for β-sheet.¹⁴ Reported correlation coefficients for secondary structure content determination using CD are typically in the range of R² = 0.85–0.94 for α-helix and R² = 0.64–0.81 for β-sheet.¹³,¹⁵ Using low wavelength CD data improves performance to an rmsd of 6% using a 3-state classification.¹³ Combining different types of data is also beneficial.¹⁶ For example, using CD, IR, and secondary structure prediction algorithms gives R² of 0.80–0.95 and rmsd of 0.05–0.06.¹⁷ Our work, notably prediction based on second derivative Raman whole spectra, performs far better than any of these methods, however. Previous work has generally found that it is easiest to determine α-helix content.

(14) Byler, D. M.; Susi, H. Biopolymers 1986, 25, 469–487.
(15) Sreerama, N.; Woody, R. W. Methods Enzymol. 2004, 383, 318–351.
(16) Oberg, K. A.; Ruysschaert, J.-M.; Goormaghtigh, E. Eur. J. Biochem. 2004, 271, 2937–2948.
(17) Lees, J. G.; Janes, R. W. BMC Bioinf. 2008, 9, 24–30.
Table 2. Raman Regression Performanceᵃ

                             H                             E                             O
Region               R²      δ     SD      ζ       R²      δ     SD      ζ       R²      δ     SD      ζ
I                  0.88    9.4   27.6   2.93     0.68   10.2   18.2   1.78     0.76    8.8   18.2   2.06
II                 0.52   18.8   27.6   1.47     0.30   14.9   18.2   1.22     0.70    9.7   18.2   1.87
III                0.78    2.2   27.6   2.17     0.76    8.7   18.2   2.09     0.75    8.9   18.2   2.05
I + II             0.93    7.3   27.6   3.79     0.86    6.6   18.2   2.76     0.82    7.6   18.2   2.41
II + III           0.90    8.4   27.6   3.30     0.86    6.6   18.2   2.76     0.94    4.3   18.2   4.22
I + III            0.89    9.0   27.6   3.06     0.90    5.7   18.2   3.19     0.87    6.3   18.2   2.88
I + II + III       0.94    6.5   27.6   4.26     0.92    4.9   18.2   3.71     0.91    5.2   18.2   3.49
backbone           0.97    4.9   27.6   5.62     0.91    5.4   18.2   3.38     0.93    4.6   18.2   3.98
whole              0.97    4.4   27.6   6.30     0.88    4.9   18.2   2.92     0.95    4.1   18.2   4.46
2nd derivative    0.996    1.7   27.6   16.4    0.999    0.6   18.2   28.7    0.994    1.4   18.2   13.4

ᵃ H = α-helix; E = β-sheet; O = other (disordered). R² = correlation coefficient; RMS (δ) = root mean squared deviation (%); SD = standard deviation (%) of secondary structure content for crystal structures of proteins in data set; ζ = ratio of SD/RMS.
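Each row of Tables 1 and 2 corresponds to a subset of the 10 cm⁻¹ bins. The preprocessing that produces those bins can be sketched as below, under several assumptions that are not stated in the text: spectral positions are taken to be already expressed as Raman shifts in cm⁻¹, bins are half-open intervals, and the ROA rescaling onto −1 to +1 is assumed to be the linear map 2s − 1 applied to the min-max scaled values. The names `bin_spectrum`, `min_max_scale`, `rescale_bisignate`, `REGIONS`, and `region_bins` are hypothetical.

```python
def bin_spectrum(shifts, intensities, start=620.0, stop=1850.0, width=10.0):
    """Average intensities within consecutive bins of `width` cm^-1."""
    n_bins = round((stop - start) / width)  # 123 bins for the default limits
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for shift, value in zip(shifts, intensities):
        if start <= shift < stop:
            k = int((shift - start) // width)
            sums[k] += value
            counts[k] += 1
    if 0 in counts:
        # Mirrors the remark that narrower bins may contain no data points.
        raise ValueError("at least one bin has no data points")
    return [s / c for s, c in zip(sums, counts)]


def min_max_scale(bins):
    """Apply (X - X_min) / (X_max - X_min) within one spectrum."""
    lo, hi = min(bins), max(bins)
    return [(b - lo) / (hi - lo) for b in bins]


def rescale_bisignate(scaled):
    """Assumed ROA step: map the 0..1 range linearly onto -1..+1."""
    return [2.0 * s - 1.0 for s in scaled]


REGIONS = {  # limits in cm^-1, as listed in the text
    "amide I": (1600, 1700),
    "amide II": (1510, 1600),
    "amide III": (1200, 1340),
    "backbone": (850, 1100),
}


def region_bins(bins, region, start=620.0, width=10.0):
    """Return the bins that fall inside one named spectral region."""
    lo, hi = REGIONS[region]
    first = round((lo - start) / width)
    last = round((hi - start) / width)
    return bins[first:last]
```

Combined regions such as I + III would then be formed by concatenating the bins returned for each region before regression.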
Our Raman results broadly agree with the earlier finding that α-helix content is the easiest to determine (Table 2), whereas the ROA results are as accurate for sheet as for helix (Table 1).

In conclusion, protein secondary structure contents can be determined to an unprecedented degree of accuracy from ROA and Raman spectral data. In particular, using second derivative Raman data from 620 to 1850 cm⁻¹ with the PLS algorithm appears to be considerably more accurate than any other method yet reported. Predictions of secondary structure content cannot be expected to have perfect accuracy, since there is some uncertainty in recognizing helix and sheet boundaries in proteins, protein structure is dynamic in solution, and structures in solution may differ from those in crystals. The differences we have achieved between predicted and observed contents of only around 1% (Supporting Information Tables 1 and 2) may partly result from these effects, rather than from flaws in the method. Our methods can also show whether secondary structure information is present within spectral regions without specific band assignments. This may be an advantage for protein structural analysis, as it reduces the need for expert knowledge of band assignments.

ACKNOWLEDGMENT
We are grateful to Professor L. D. Barron and Dr. L. Hecht at the Department of Chemistry of the University of Glasgow for provision of Raman and ROA spectra.

SUPPORTING INFORMATION AVAILABLE
Proteins used for Raman and ROA analyses, with secondary structure contents, PDB codes, and individual predicted secondary structures from second derivative Raman spectra. This material is available free of charge via the Internet at http://pubs.acs.org.

Received for review May 21, 2010. Accepted July 8, 2010.
AC101334H