Revisiting the Scale-Invariant, Two-Dimensional Linear Regression Method

A. Beate C. Patzer,*,† Hans Bauer,‡ Christian Chang,† Jan Bolte,§ and Detlev Sülzle∥

† Zentrum für Astronomie und Astrophysik, Technische Universität Berlin, D-10623 Berlin, Germany
‡ Maximilian-Kaller-Straße 24, D-12279 Berlin, Germany
§ Fachbereich Mathematik, Universität Hamburg, D-20146 Hamburg, Germany
∥ Otternweg 15, D-13465 Berlin, Germany

Supporting Information

ABSTRACT: The scale-invariant way to analyze two-dimensional experimental and theoretical data with statistical errors in both the independent and dependent variables is revisited by using what we call the triangular linear regression method. This is compared to the standard least-squares fit approach by applying it to typical simple sets of example data from the actual chemical literature. A new add-in for Microsoft Excel, LinEstXY, and ready-to-use formulas for the scale-invariant method are provided.

KEYWORDS: Upper-Division Undergraduate, Graduate Education/Research, Physical Chemistry, Computer-Based Learning, Computational Chemistry, Calibration, Problem Solving/Decision Making



Linear regression analysis is often employed when examining experimental and theoretical data in chemistry, astronomy, physics, engineering, and many other branches of science. In two dimensions this is usually done by the well-known standard "classical" least-squares fitting approach, which assumes the independent variable to be free of statistical errors. However, there are other, less commonly employed linear regression methods. In this study, we will look at the particular approach that is called here "triangular" regression. It is seldom used in actual applications and is known by various names in the scientific literature, such as major axis regression, line of organic correlation, geometric mean regression, and Strömberg's impartial line.1−9 In particular, this regression method is scale-invariant and assumes only that there are normally distributed errors in both the independent and dependent variables (joint bivariate normal distribution).10 To apply the triangular regression method, a set of data pairs is fitted to a regression line by minimizing the sum of the areas of triangles. Here we want to revisit this approach by summarizing its main features and applying it to a set of typical example data. Ready-to-use formulas and a new Microsoft Excel 2010 spreadsheet function LinEstXY (see the Supporting Information) are provided for this particular method. However, to introduce the subject conveniently, the classical standard linear regression is revisited first.

CLASSICAL LINEAR REGRESSION METHOD REVISITED

When analyzing experimental data in two variables, say x and y, which are supposed to be linearly related, the well-known standard or ordinary linear regression method, also known as standard (ordinary) least-squares fitting, is commonly employed. A description of this method can be found in any standard textbook on statistical analysis (see, e.g., ref 11). For convenience and in order to introduce the notation used here, we briefly summarize the essential well-known formulas. However, the expressions for the absolute errors of the slope and intercept of the regression line are not often provided in the literature (not even in the help text of Microsoft Excel). The form of a standard general regression line arising from a set of pairs of data {xi, yi}, i = 1, ..., n, where x is the independent variable and y is the dependent variable, is given by

y = mx + b    (1)

Minimization of the sum of the squared residuals Δyi = yi − mxi − b with respect to the regression line leads to expressions for the intercept b and the slope m of the line, as listed in Table 1.


Table 1. Classical Linear Regression^a

Parameter | Line with Intercept (y = mx + b) | Line through the Origin (y = mx)
Slope | m = gxy/gxx | m = Sxy/Sxx
Intercept | b = (SySxx − SxSxy)/gxx | b = 0
Standard error of the slope | σm = sqrt[n(Syy − mSxy − bSy)/((n − 2)gxx)] | σm = sqrt[(Syy/Sxx − m²)/(n − 1)]
Standard error of the intercept | σb = σm sqrt(Sxx/n) | n/a
Correlation coefficient | r = gxy/sqrt(gxx gyy) | r = Sxy/sqrt(Sxx Syy)
r² | r² = gxy²/(gxx gyy) | r² = Sxy²/(Sxx Syy)
SSreg | Σi=1..n (ŷi − ȳ)² | Σi=1..n (ŷi − ȳ)²
SSresid | Σi=1..n (yi − ŷi)² | Σi=1..n (yi − ŷi)²
Degrees of freedom | df = n − 2 | df = n − 1
Standard error of y | σy = sqrt(SSresid/df) | σy = sqrt(SSresid/df)
F value | F = SSreg/σy² | F = SSreg/σy²

^a Sx = Σi=1..n xi; Sy = Σi=1..n yi; Sxx = Σi=1..n xi²; Syy = Σi=1..n yi²; Sxy = Σi=1..n xiyi; gxx = nSxx − Sx²; gyy = nSyy − Sy²; gxy = nSxy − SxSy; n is the number of (xi, yi) data pairs; ŷi = mxi + b; ȳ = Sy/n.

After some algebraic manipulations using the law of propagation of uncertainty, assuming that the variable x is not prone to statistical errors and that the errors associated with the variable y are normally distributed, one obtains the absolute errors of b and m in terms of the absolute error σy of the dependent variable y. The absolute error of the dependent variable y depends on the number of degrees of freedom (n − 2) because we have two parameters defining the regression line. The equations obtained after inserting the expression for σy are given in Table 1 with the respective abbreviations used. The correlation coefficient r is usually used as a measure of the quality of the fit. In addition to the general least-squares fitting approach, we also consider the special case of a line that is forced to pass through the coordinate origin and thus has b = 0:

y = mx    (2)

The respective formulas simplify noticeably (cf. Table 1), as we have only one parameter defining the regression line, so that the number of degrees of freedom is n − 1.
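To make the Table 1 notation concrete, here is a minimal Python sketch (ours, not part of the original article) that evaluates the intercept-case formulas; the small data set is invented purely for illustration, and the output can be compared against a spreadsheet's LINEST results.

```python
import math

def classical_fit(x, y):
    """Ordinary least-squares line y = m*x + b using the Table 1 sums."""
    n = len(x)
    Sx, Sy = sum(x), sum(y)
    Sxx = sum(v * v for v in x)
    Syy = sum(v * v for v in y)
    Sxy = sum(a * b for a, b in zip(x, y))
    gxx = n * Sxx - Sx**2
    gyy = n * Syy - Sy**2
    gxy = n * Sxy - Sx * Sy

    m = gxy / gxx                            # slope
    b = (Sy * Sxx - Sx * Sxy) / gxx          # intercept
    sigma_m = math.sqrt(n * (Syy - m * Sxy - b * Sy) / ((n - 2) * gxx))
    sigma_b = sigma_m * math.sqrt(Sxx / n)   # standard errors of m and b
    r2 = gxy**2 / (gxx * gyy)                # squared correlation coefficient
    return m, b, sigma_m, sigma_b, r2

# Invented example data (for illustration only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
print(classical_fit(x, y))
```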

TRIANGULAR LINEAR REGRESSION: A SCALE-INVARIANT 2D LINEAR MODEL

The standard linear regression model is based on the assumption that statistical errors are associated only with the dependent variable y, whereas the independent variable x is supposed to be exact, that is, not prone to errors. However, this might not always be the case, particularly if the data have been determined by experiment. One possibility for taking statistical errors in both variables into account is orthogonal regression (a special case of the total least-squares (TLS) method),12−19 where the sum of squares of perpendicular distances to the regression line is minimized assuming that both errors are normally distributed with identical variance. TLS in general comprises many variants of linear and nonlinear approaches as well as allowing for different ways to account for errors in the variables by weights. A common feature of this approach is that it is not scale-invariant, in the sense that it cannot be applied when the variables x and y are on different, incommensurable scales or have different physical dimensions (e.g., time (seconds) and length (meters)). To avoid this problem of incommensurateness, dimensionless variables are sometimes introduced by standardization or normalization. However, there are several different ways of doing this, and the resulting line fits are not equivalent to each other. Here we want to revisit the triangular regression approach, which in contrast is scale-invariant and does take normally distributed errors in both the dependent and independent variables into account. The triangular regression method fits a set of data pairs {xi, yi}, i = 1, ..., n, to a regression line (eq 1) by minimizing the sum of the areas of triangles (see Figure 1):

A = min over m,b of (1/2) Σi=1..n (Δyi Δxi) = min over m,b of (1/2) Σi=1..n (yi − mxi − b)(yi/m − b/m − xi)    (3)

Figure 1. Illustration of the definitions used to describe the triangular area 0.5 Δyi Δxi of the data point (xi, yi). The regression line is obtained by minimizing the sum of these areas for all data points i = 1, ..., n (cf. eq 3).
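As a quick plausibility check of eq 3 (our addition, not the authors'), one can minimize the triangle-area sum numerically on invented data: for each trial slope m the optimal intercept follows from dA/db = 0 as b = (Sy − mSx)/n, and the grid minimum should land on the closed-form slope m = ±sqrt(gyy/gxx) listed in Table 2.

```python
import math

def area_sum(m, b, x, y):
    """Objective of eq 3: half the sum of the triangle areas dy_i * dx_i."""
    return 0.5 * sum((yi - m * xi - b) * (yi / m - b / m - xi)
                     for xi, yi in zip(x, y))

# Invented, noisy data in both variables (illustration only)
x = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0]
y = [2.3, 4.2, 5.9, 8.1, 9.8, 12.2]
n = len(x)

gxx = n * sum(v * v for v in x) - sum(x) ** 2
gyy = n * sum(v * v for v in y) - sum(y) ** 2
m_closed = math.sqrt(gyy / gxx)  # positive root, since the data correlate positively

# Scan m around the closed-form value; for each m the optimal intercept
# follows from dA/db = 0, i.e. b = (Sy - m*Sx)/n.
best_m, best_A = None, float("inf")
for i in range(500, 1501):
    m = m_closed * i / 1000.0
    b = (sum(y) - m * sum(x)) / n
    A = area_sum(m, b, x, y)
    if A < best_A:
        best_m, best_A = m, A

print(f"grid minimum m = {best_m:.4f}  vs  closed form m = {m_closed:.4f}")
```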


Table 2. Triangular Linear Regression^a

Parameter | Line with Intercept (y = mx + b) | Line through the Origin (y = mx)
Slope^b | m = ±sqrt(gyy/gxx) | m = ±sqrt(Syy/Sxx)
Intercept | b = (1/n)(Sy − mSx) | b = 0
Standard error of the slope | σm = sqrt[(4m/(n − 2))(m − gxy/gxx)] | σm = sqrt[(4m/(n − 1))(m − Sxy/Sxx)]
Standard error of the intercept | σb = σm sqrt(Sxx/n) | n/a
Correlation coefficient | r = gxy/sqrt(gxx gyy) | r = Sxy/sqrt(Sxx Syy)
r² | r² = gxy²/(gxx gyy) | r² = Sxy²/(Sxx Syy)
SSreg | Σi=1..n (ŷi − ȳ)(x̄ − x̂i) | Σi=1..n (ŷi − ȳ)(x̄ − x̂i)
SSresid | Σi=1..n (yi − ŷi)(x̂i − xi) | Σi=1..n (yi − ŷi)(x̂i − xi)
Degrees of freedom | df = n − 2 | df = n − 1
Standard error of √A^c | σ√A = sqrt(SSresid/df) | σ√A = sqrt(SSresid/df)
F value | F = SSreg/σ√A² | F = SSreg/σ√A²

^a x̄ = Sx/n; x̂i = yi/m − b/m; other abbreviations are given in Table 1. ^b The sign to choose for the slope is identical to the sign of gxy or Sxy. ^c The standard error σ√A has been defined to have the same dimensional quality as σy of the classical regression in Table 1.
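The Table 2 expressions for the intercept case translate directly into code. The sketch below is ours (it is not the article's VBA add-in); the slope sign is taken from gxy as stated in footnote b, and the data are invented for illustration.

```python
import math

def triangular_fit(x, y):
    """Triangular (scale-invariant) regression y = m*x + b per Table 2."""
    n = len(x)
    Sx, Sy = sum(x), sum(y)
    Sxx = sum(v * v for v in x)
    Syy = sum(v * v for v in y)
    Sxy = sum(a * b for a, b in zip(x, y))
    gxx = n * Sxx - Sx**2
    gyy = n * Syy - Sy**2
    gxy = n * Sxy - Sx * Sy

    m = math.copysign(math.sqrt(gyy / gxx), gxy)  # slope, signed like gxy
    b = (Sy - m * Sx) / n                         # intercept
    sigma_m = math.sqrt(4 * m / (n - 2) * (m - gxy / gxx))
    sigma_b = sigma_m * math.sqrt(Sxx / n)
    r2 = gxy**2 / (gxx * gyy)
    return m, b, sigma_m, sigma_b, r2

x = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0]   # invented data, errors in both variables
y = [2.3, 4.2, 5.9, 8.1, 9.8, 12.2]
print(triangular_fit(x, y))
```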

Frisch1 was one of the first to describe this method, but he called it somewhat misleadingly "diagonal" regression. This approach was later rediscovered, for example, by Woolley2 and Samuelson.3 However, it seems to be not commonly known and therefore is not often applied to actual situations, although the resulting formulas are quite simple and easily programmable. It should be noted that this method should not be confused with the so-called rectangular regression suggested by the authors of ref 20. Following Frisch,1 there are a few properties that any method of determining regression equations should desirably possess:

A. For perfectly correlated variables, the fitted line should reduce to the unique solution. All of the hitherto-mentioned methods have this property.

B. The fitted equation should be invariant under an exchange of the variables x and y. An interchange of variables represents a very special orthogonal transformation, viz., a geometric rotation of both axes through an angle of π/2 followed by a reversal.

B′. The fitted equation should be invariant under any general orthogonal transformation of the coordinate system. This is an even stronger alternative of the transformation condition.

C. The fitted equation should be invariant under a simple dimensional or scale change in any of the variables.

D. The regression slope must depend only upon the correlation coefficient and ratios of standard deviations, i.e., it must be some function of the elementary regression parameters alone.

TLS methods satisfy A, B, B′, and D but not C; that is, this kind of approach is not scale-invariant! (A numerical illustration of this point is given after Table 3.) Triangular regression, on the other hand, satisfies properties A, B, C, and D but not B′. It should be noted that triangular regression not only satisfies all four of these criteria but is also the only two-dimensional (2D) linear regression method that does so (see ref 3 for a proof).

The mathematical expressions for the slope m and intercept b required to apply the triangular regression method to a 2D data set are listed in Table 2 (see the Supporting Information for further details on the derivation of the expressions). Again, with some algebraic manipulations using the law of propagation of uncertainty and assuming statistically distributed errors associated with both variables x and y (joint bivariate normal distribution), one obtains the absolute errors σb and σm of b and m, respectively. The correlation coefficient r once again is used as a measure of the quality of the fit. The finally obtained expressions for the triangular linear regression method are summarized in Table 2 for both types of intercept cases (see eqs 1 and 2).

COMPUTATIONAL IMPLEMENTATION: MICROSOFT EXCEL SPREADSHEET FUNCTION

The equations given in Table 2 were converted into a Visual Basic for Applications (VBA) function accessible via a Microsoft Excel add-in. The name of this new function, LinEstXY, was constructed from the name of the internal Excel function LINEST by adding "XY". The new function uses the same arguments as LINEST, and the output matrix has the same structure as that returned by LINEST, as depicted in Table 3.

Table 3. Comparative Implementation of the Two Regression Techniques in Excel Showing the Positions in the Output Matrix

Parameter | Symbol | Excel Array Function (Conventional) | Excel Array Function (Triangular) | Output Matrix [Row, Column]
Slope | m | LINEST | LinEstXY | [1,1]
Intercept | b | LINEST | LinEstXY | [1,2]
Standard error of the slope | σm | LINEST | LinEstXY | [2,1]
Standard error of the intercept | σb | LINEST | LinEstXY | [2,2]
Correlation coefficient squared | r² | LINEST | LinEstXY | [3,1]
Sum of squared regression | SSreg | LINEST | LinEstXY | [5,1]
Sum of squared residuals | SSresid | LINEST | LinEstXY | [5,2]
Degrees of freedom | df | LINEST | LinEstXY | [4,2]
Standard error of y^a, √A^b | σy, σ√A | LINEST | LinEstXY | [3,2]
F value | F | LINEST | LinEstXY | [4,1]

^a Conventional regression. ^b Triangular regression.
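Frisch's property C (scale invariance), discussed above, can be verified numerically: rescaling x by a factor c should simply rescale the fitted slope to m/c. The sketch below (ours, with invented data) contrasts the triangular slope with the orthogonal (TLS) slope computed from the standard closed-form solution, which fails this test.

```python
import math

def triangular_slope(x, y):
    """Triangular slope m = +/- sqrt(gyy/gxx), signed like gxy (Table 2)."""
    n = len(x)
    Sx, Sy = sum(x), sum(y)
    gxx = n * sum(v * v for v in x) - Sx**2
    gyy = n * sum(v * v for v in y) - Sy**2
    gxy = n * sum(a * b for a, b in zip(x, y)) - Sx * Sy
    return math.copysign(math.sqrt(gyy / gxx), gxy)

def orthogonal_slope(x, y):
    """Orthogonal (TLS) slope from the standard closed-form solution."""
    n = len(x)
    Sx, Sy = sum(x), sum(y)
    gxx = n * sum(v * v for v in x) - Sx**2
    gyy = n * sum(v * v for v in y) - Sy**2
    gxy = n * sum(a * b for a, b in zip(x, y)) - Sx * Sy
    return (gyy - gxx + math.sqrt((gyy - gxx) ** 2 + 4 * gxy**2)) / (2 * gxy)

# Invented data; rescale x as if switching units (e.g., seconds -> milliseconds)
x = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0]
y = [2.3, 4.2, 5.9, 8.1, 9.8, 12.2]
c = 1000.0
xs = [c * v for v in x]

# Property C: a scale change x -> c*x should rescale the slope to m/c
print(triangular_slope(x, y), c * triangular_slope(xs, y))  # identical (up to rounding)
print(orthogonal_slope(x, y), c * orthogonal_slope(xs, y))  # differ: TLS is not scale-invariant
```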

APPLICATION

A standard application of 2D statistical data analysis is the comparison of theoretical results with measured data, both affected by statistical errors. By nature, this kind of application has to result in a line through the origin. Figure 2 illustrates such an example with data taken from the actual chemical literature.21 Figure 2 shows the results of the application of the triangular linear regression method to the correlation of the free energy perturbation (FEP)-predicted and corresponding experimental binding free energies of eight different protein−ligand systems.21 Their accurate prediction is the holy grail of computer-aided drug design.

Figure 2. Application of the triangular regression approach to the correlation of FEP-predicted and experimental binding free energies of eight different protein−ligand systems as provided in ref 21. Data were taken from the Supporting Information of ref 21 (ja512751q_si_003.xls).

The previously described function outputs of LinEstXY applied to these data are summarized in Table 4, where the corresponding results of the internal Microsoft Excel function LINEST are also listed for comparison. According to these results, the differences between the 2D linear regression approaches become quite obvious, as all of the classical results deviate substantially from those obtained with the triangular method. So far, the experimental data have been considered to be the independent variable and therefore to be statistically error-free in the standard regression approach.

Table 4. Results of Triangular Linear Regression: Values from Applying LinEstXY (LINEST) to Selected Protein−Ligand Systems^a,b

Parameter | BACE | CDK2 | JNK1 | MCL1 | P38 | Thrombin | Tyk2 | PTP1B | All
Slope | 1.6985 (1.3268) | 0.5604 (0.2673) | 2.0983 (1.7747) | 1.5317 (1.1825) | 1.2632 (0.8243) | 1.4633 (1.0335) | 0.8938 (0.7964) | 0.9784 (0.7854) | 1.2032 (0.9756)
Intercept | 6.5159 (3.0463) | −4.0027 (−6.6709) | 9.9983 (7.0529) | 4.2944 (1.4738) | 2.8063 (−1.8740) | 3.8670 (0.2841) | −1.0326 (−1.9816) | −0.1905 (−1.9322) | 1.8681 (−0.2242)
Standard error of the slope | 0.2725 (0.1818) | 0.2166 (0.1316) | 0.3781 (0.2569) | 0.2313 (0.1539) | 0.2632 (0.1692) | 0.5287 (0.3453) | 0.1577 (0.1084) | 0.1897 (0.1273) | 0.0746 (0.0502)
Standard error of the intercept | 2.5527 (1.7034) | 1.9886 (1.2084) | 3.4557 (2.3474) | 1.8839 (1.2539) | 2.8195 (1.8122) | 4.4161 (2.8843) | 1.5496 (1.0655) | 1.7285 (1.1604) | 0.6926 (0.4660)
Correlation coefficient^c | 0.7812 | 0.4771 | 0.8458 | 0.7720 | 0.6525 | 0.7063 | 0.8911 | 0.8027 | 0.8109
r²^c | 0.6103 | 0.2276 | 0.7153 | 0.5960 | 0.4258 | 0.4989 | 0.7940 | 0.6443 | 0.6575
SSreg | 30.7737 (37.1660) | 7.5346 (1.5826) | 27.8154 (46.5963) | 58.6109 (65.1169) | 30.5887 (22.7912) | 3.3339 (3.1685) | 21.1080 (16.0435) | 32.6930 (23.4727) | 350.7325 (328.3720)
SSresid | 15.6903 (23.7341) | 12.9807 (5.3721) | 9.5768 (18.5456) | 32.5203 (44.1345) | 29.4406 (30.7291) | 2.5494 (3.1827) | 4.9258 (4.1628) | 14.6967 (12.9611) | 156.9990 (171.0334)
Degrees of freedom | 34 | 14 | 19 | 40 | 32 | 9 | 14 | 21 | 197
Standard error of √A (y) | 0.6695 (0.8355) | 0.9303 (0.6195) | 0.6920 (0.9880) | 0.8906 (1.0504) | 0.9445 (0.9799) | 0.5049 (0.5947) | 0.5730 (0.5453) | 0.8173 (0.7856) | 0.8905 (0.9318)
F value | 68.6461 (53.2418) | 8.7067 (4.1244) | 58.0892 (47.7380) | 73.8939 (59.0168) | 34.2869 (23.7338) | 13.0768 (8.9597) | 64.2779 (53.9560) | 48.9394 (38.0313) | 442.3280 (378.2260)

^a The corresponding results from the classical linear regression (LINEST) are given in parentheses. ^b See ref 21. ^c It should be noted that r and r² are identical for the two methods because these values depend only on the set of data pairs {xi, yi}.

To explore the differences between the two methods even further, we exchanged the dependent and independent variables of the regression problem and applied both methods again to the complete data set. Because both variables are affected by statistical and methodical errors, respectively, exchanging the independent variable (ΔGexpt) with the dependent variable (ΔGFEP) illustrates the real advantage of using the triangular regression method. Applying the classical regression function to the data sets (ΔGexpt, ΔGFEP) and (ΔGFEP, ΔGexpt) for all of the protein−ligand data points results in two completely different regression lines, whereas the application of the triangular regression method gives one unique solution (see Figure 3)!

Figure 3. Application of the triangular and classical linear regression methods to the complete overall data set of the protein−ligand systems before and after exchanging the independent variable (ΔGexpt) with the dependent variable (ΔGFEP). Data points are the same as in Figure 2. For the purposes of illustration, the inverse linear function of the classical regression is depicted in the case of the switched ΔG variables. φ marks the angle between the two classical regression lines.

The differences between the two "classical" solutions can be quantified by the respective angle φ (see Figure 3), which is determined by (see also ref 22)

tan φ = gxy(1 − r²)/(r²(gxx + gyy))    (4)

The classical regression lines of the overall data sets (ΔGexpt, ΔGFEP) and (ΔGFEP, ΔGexpt) differ according to this relation by the noticeable angle φ = 11.7°. It should be noted that the triangular regression line does not bisect the angle φ. Instead, its slope corresponds to the geometric mean of the slopes of the two classical solutions (cf. ref 8). However, all three regression lines intersect at one common point (x̄, ȳ). The triangular regression method is not inherently superior to the standard least-squares fit. The standard linear regression should be applied if one of the two quantities has negligible errors in comparison with the errors of the second quantity, which then has to be considered as the dependent variable. The triangular regression, on the other hand, should be used to define the line of best fit when both variables of a data set are affected by errors. It becomes obvious from eq 4 that the classical method is not a limiting special case of the triangular method. Only if all of the data points match the fitted line perfectly (i.e., r² = 1, φ = 0°) do the results of the two approaches coincide (see Tables 1 and 2). Moreover, the triangular linear regression approach is scale-invariant, in contrast to the often-used 2D total least-squares methods.3 Consequently, the triangular regression is also applicable to linear regressions of data with different physical units, for example.
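Equation 4 is straightforward to evaluate. The following sketch (ours, with invented data) computes φ from eq 4 and cross-checks it against the angle obtained directly from the two ordinary least-squares slopes.

```python
import math

def angle_between_classical_lines(x, y):
    """phi from eq 4, plus a direct check via the two OLS slopes."""
    n = len(x)
    Sx, Sy = sum(x), sum(y)
    gxx = n * sum(v * v for v in x) - Sx**2
    gyy = n * sum(v * v for v in y) - Sy**2
    gxy = n * sum(a * b for a, b in zip(x, y)) - Sx * Sy
    r2 = gxy**2 / (gxx * gyy)

    # eq 4
    phi = math.atan(gxy * (1 - r2) / (r2 * (gxx + gyy)))

    # direct check: slopes of y-on-x and of x-on-y (replotted in the x-y plane)
    m1 = gxy / gxx
    m2 = gyy / gxy
    phi_check = math.atan(abs((m2 - m1) / (1 + m1 * m2)))
    return math.degrees(phi), math.degrees(phi_check)

x = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0]   # invented data (illustration only)
y = [2.3, 4.2, 5.9, 8.1, 9.8, 12.2]
print(angle_between_classical_lines(x, y))  # the two angles agree
```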



CONCLUSIONS

The standard least-squares regression is the most common method to define a line of best fit for a set of bivariate data. Instead of this often-employed standard least-squares regression method, the scale-invariant triangular linear regression approach should be used to analyze data with statistical errors in both the dependent and independent variables. Little attention is given to an important difference between these two methods, namely, that the triangular regression line is a unique solution whereas the standard linear regression is asymmetric and gives two different resulting lines depending on which variable is considered as dependent or independent. The key criterion in selecting one of these methods for a line of best fit is whether the investigator considers the relationship between x and y to be symmetric (the errors of x and y are of comparable order of magnitude) or asymmetric (the errors of one variable are substantially smaller than those of the other variable).



An asymmetric relationship should be analyzed by the classical standard least-squares approach and a symmetric relationship by the triangular regression method. Therefore, we provide for this approach an extensively tested, ready-to-use Microsoft Excel add-in, LinEstXY, which is useful and easily applicable not only for students but also for research purposes in general.


ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available on the ACS Publications website at DOI: 10.1021/acs.jchemed.6b00204.

- Derivation of the mathematical expressions for both parameters (slope m and intercept b) in the case of y = mx + b as well as of the slope m and its corresponding error σm for a line through the origin for the triangular linear regression (PDF, DOCX)
- Directions for installing and using the MS Excel add-in (PDF, DOCX)
- The MS Excel add-in LinEstXY.xlam containing the function LinEstXY, which performs the triangular linear regression of 2D data pairs with statistical errors in both variables using the equations listed in Table 2 (ZIP)

AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]

ORCID

A. Beate C. Patzer: 0000-0003-3230-7763


Notes

The authors declare no competing financial interest.


REFERENCES

(1) Frisch, R. Statistical Confluence Analysis by Means of Complete Regression Systems. Nord. Stat. J. 1934, 5, 1−192.
(2) Woolley, E. B. A Method of Minimized Areas as a Basis for Correlation Analysis. Econometrica 1941, 9, 38−62.
(3) Samuelson, P. A. A Note on Alternative Regressions. Econometrica 1942, 10, 80−83.
(4) Warton, D. I.; Wright, I. J.; Falster, D. S.; Westoby, M. Bivariate Line-Fitting Methods for Allometry. Biol. Rev. 2006, 81, 259−291.
(5) Kermack, K. A.; Haldane, J. B. S. Organic Correlation and Allometry. Biometrika 1950, 37, 30−41.
(6) Ricker, W. E. Linear Regressions in Fishery Research. J. Fish. Res. Board Can. 1973, 30, 409−434.
(7) Feigelson, E. D.; Babu, G. J. Linear Regression in Astronomy II. Astrophys. J. 1992, 397, 55−67.
(8) Tofallis, C. Model Fitting for Multiple Variables by Minimizing the Geometric Mean Deviation. In Total Least Squares and Errors-in-Variables Modeling: Algorithms, Analysis and Applications; Van Huffel, S., Lemmerling, P., Eds.; Kluwer Academic: Dordrecht, The Netherlands, 2002.
(9) Kruskal, W. H. On the Uniqueness of the Line of Organic Correlation. Biometrics 1953, 9, 47−58.
(10) Draper, N. R. Straight Line Regression When Both Variables Are Subject to Error. Presented at the Annual Conference on Applied Statistics in Agriculture, 1991; http://newprairiepress.org/agstatconference/1991/proceedings/2 (accessed April 2018).
(11) Montgomery, D. C.; Peck, E. A.; Vining, G. G. Introduction to Linear Regression Analysis; Wiley: New York, 2012.
(12) Van Huffel, S.; Vandewalle, J. The Total Least Squares Problem: Computational Aspects and Analysis; SIAM: Philadelphia, 1991.
(13) Markovsky, I.; Van Huffel, S. Overview of Total Least Squares Methods. Signal Processing 2007, 87, 2283−2302 and references therein.
(14) Wentworth, W. E. Rigorous Least Squares Adjustment I. J. Chem. Educ. 1965, 42, 96−103.
(15) Wentworth, W. E. Rigorous Least Squares Adjustment II. J. Chem. Educ. 1965, 42, 162−167.
(16) Jefferys, W. H. On the Method of Least Squares I. Astron. J. 1980, 85, 177−181.
(17) Jefferys, W. H. On the Method of Least Squares II. Astron. J. 1981, 86, 149−155.
(18) Lybanon, M. A Better Least Squares Method When Both Variables Have Uncertainties. Am. J. Phys. 1984, 52, 22−26.
(19) York, D. Least Squares Fitting of a Straight Line with Correlated Errors. Earth Planet. Sci. Lett. 1968, 5, 320−324.
(20) Ryu, H. K. Rectangular Regression for an Error-in-Variables Model. Econ. Lett. 2004, 83, 129−135.
(21) Wang, L.; Wu, Y.; Deng, Y.; Kim, B.; Pierce, L.; Krilov, G.; Lupyan, D.; Robinson, S.; Dahlgren, M. K.; Greenwood, J.; Romero, D. L.; Masse, C.; Knight, J. L.; Steinbrecher, T.; Beuming, T.; Damm, W.; Harder, E.; Sherman, W.; Brewer, M.; Wester, R.; Murcko, M.; Frye, L.; Farid, R.; Lin, T.; Mobley, D. L.; Jorgensen, W. L.; Berne, B. J.; Friesner, R. A.; Abel, R. Accurate and Reliable Prediction of Relative Ligand Binding Potency in Prospective Drug Discovery by Way of a Modern Free-Energy Calculation Protocol and Force Field. J. Am. Chem. Soc. 2015, 137, 2695−2703.
(22) Legendre, P.; Legendre, L. Numerical Ecology; Elsevier Science B.V.: Amsterdam, 1998.
