Proteochemometrics-Based Prediction of Peptide Binding to HLA-DP

Jul 18, 2017 - ABSTRACT: Human leukocyte antigens (HLA) class II proteins are involved in the ... They form complexes with antigen peptide fragments. ...
0 downloads 0 Views 3MB Size
Article pubs.acs.org/jcim

Proteochemometrics-Based Prediction of Peptide Binding to HLA-DP Proteins Ventsislav Yordanov,§ Ivan Dimitrov,§ and Irini Doytchinova* Faculty of Pharmacy, Medical University of Sofia, 2 Dunav Street, Sofia 1000, Bulgaria S Supporting Information *

ABSTRACT: Human leukocyte antigens (HLA) class II proteins are involved in the antigen processing in the antigen presenting cells. They form complexes with antigen peptide fragments. The peptide−HLA protein complexes are presented on the cell surface where they are recognized by helper T cells (Th cells). HLA-DP is one of the three HLA class II loci. The HLA-DP proteins are associated with a significant number of autoimmune diseases, as well as with a susceptibility or resistance to a number of infectious agents. In the present study, we apply proteochemometricsa method for bioactivity modeling of multiple ligands binding to multiple target proteinsto derive and validate a robust model for peptide binding prediction to the 7 most frequent HLA-DP proteins. The model is able to identify 86% of the binders in the top 10% of the best predicted nonamers generated from one protein.



INTRODUCTION

Foreign proteins entering a host are processed by antigen presenting cells (APCs) such as macrophages, B cells, and dendritic cells. In the endosome, the antigen is cleaved into oligopeptide fragments. Some of the peptides are loaded into the major histocompatibility complex (MHC) class II molecules. The complex peptide−MHC protein is presented on the cell surface, where it is recognized by helper T cells (Th cells). In humans, the MHC proteins are called human leukocyte antigens (HLAs). HLAs class II contain three loci: DR, DQ, and DP.1 Each HLA class II protein consists of two subunits: chain α and chain β. HLAs are extremely polymorphic and polygenic. The IMGT/HLA database (September 2016) lists 2 DRA, 1,569 DRB, 33 DQA1, 647 DQB1, 21 DPA1, and 552 DPB1 proteins.2 The peptide binding site is located between the two chains and can bind nine amino acid residues. As it is open at both ends, longer peptides are able to bind.3 Quite rationally, the polymorphism is concentrated in the binding site, allowing the HLA proteins to capture an extremely wide range of antigen peptides. Our study is focused on HLA-DP proteins. HLA-DPs are the least polymorphic genes of class II. The crystallographic structure of the complex peptide−HLA-DP2 protein (pdb code: 3lqz)4 is shown in Figure 1. The HLA-DP proteins are associated with a significant number of autoimmune diseases,5−14 as well as with a susceptibility or resistance to a number of infectious agents. Susceptibility to a disease can be explained by the inability of a given MHC protein to express peptide fragments generated by the pathogen, due to low affinity between the two molecules. In contrast, the peptide fragments forming stable complexes with MHC proteins are recognized by Th cells and the infected cells © XXXX American Chemical Society

Figure 1. Structure of the complex peptide−HLA-DP2 protein (pdb: 3lqz). The HLA-DP2 protein consists of two transmembrane chains (in blue and beige). The bound peptide is given in red.

are destroyed. This explains the innate resistance to certain diseases.15 The most common HLA-DP proteins and their Special Issue: EuroQSAR 2016 Received: January 14, 2017 Published: July 18, 2017 A

DOI: 10.1021/acs.jcim.7b00026 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

Table 1. Frequency of HLA-DP Alleles among the Human Population and Their Associations with Various Autoimmune Diseasesa HLA-DP allele DP1 (DPA*02:01/DPB1*01:01) DP2 (DPA*01:03/DPB1*02:01) DP3 (DPA*01:03/DPB1*03:01) DP41 (DPA*01:03/DPB1*04:01) DP42a (DPA*01:03/DPB1*04:02) DP42b (DPA*03:01/DPB1*04:02) DP5 (DPA*02:01/DPB1*05:01) a

Frequency (%) 2−17 2−53 2−75 8−91 2−89

(avg. (avg. (avg. (avg. (avg.

5%) 18%) 13%) 33%) 32%)

1−97 (avg. 25%)

Autoimmune diseases Graves’ disease5 chronic beryllium disease,6,7 juvenile chronic arthritis,8 sarcoidosis,9 atopic myelitis10 juvenile chronic arthritis11 protection against celiac disease12 protection against chronic beryllium disease13 Graves’ disease14

Allele Frequencies in Worldwide Populations: http://www.allelefrequencies.net.

In the present study, we apply the PCM method to derive and validate a robust model for peptide binding prediction to the 7 most frequent HLA-DP proteins. Foreign peptides able to bind to MHC class II proteins are immunogens; that is, they are able to trigger an immune response. The ultimate aim of this response is to destruct the infected cells and the pathogens resident there. The immunogenic peptides and/or their parent proteins are valuable inputs for rational design and development of vaccines and immunotherapeutics.

association with various autoimmune diseases are given in Table 1. Taken together, these proteins account for over 90% of the human population.16 The proteins listed in Table 1 are the subject of the present study. Proteochemometrics (PCM) is a quantitative structure− activity relationship (QSAR) method developed by Lapinsh and colleagues for bioactivity modeling of multiple ligands binding to multiple target proteins.17,18 PCM can be regarded as an extension of QSAR, which combines the information from the ligands and target molecules in a single matrix, allowing the extrapolation of the results and prediction of biological activity of new compounds to novel target molecules.19 PCM operates with a plurality of target molecules, which is essential in the development of multitarget drug molecules.20 A disadvantage of PCM is its dependence on the variability of the ligands and target molecules; if an amino acid plays an important role in an interaction, but is conserved in the entire data set, PCM will not be able to assess its importance.21 PCM is widely used in drug design, drug repositioning, and toxicity predictions.19,20 The chemical structure of the ligands and proteins and the interactions between them is described by three blocks of descriptors in PCM: ligand (L), protein (P), and the ligand− protein (LP) (Figure 2).



EXPERIMENTAL SECTION Data Sets. HLA-DP Target Proteins. The seven most frequent HLA-DP proteins (Table 1) were used in the present study. The protein sequences were derived from IMGT/HLA (http://www.ebi.ac.uk/imgt/hla) (Figure 3). The peptide binding site on DP proteins is formed by the first 80 residues of the α-chain and of the first 90 residues of the β-chain.22,23 Among them there are four polymorphic residues in the αchain and 15 in the polymorphic β-chain. These are Ala/ Met11α, Met/Gln31α, Gln/Arg50α, Leu/Ser66α, Val/Leu8β, Tyr/ Phe9β, Gly/Leu11β, Tyr/Phe/Leu35β, Ala/Val36β, Ala/Asp/ Glu55β, Ala/Glu56β, Glu/Asp57β, Ile/Leu65β, Lys/Glu69β, Val/ Met76β, Asp/Gly84β, Glu/Gly85β, Ala/Pro86β, and Val/Met87β. The peptide binding sites on all HLA proteins, including HLADP, could be occupied by 9 amino acids only. As the binding site in HLA class II proteins is open-ended, peptides longer than 9 residues could bind also. The binding core consists of 9 residues; the rest are floating. Peptide Training and Test Sets. A data set of 4,304 15-mer peptides binding to the 7 HLA-DP proteins was collected from IEDB24 (http://www.immuneepitope.org) of the National Institute of Allergy and Infectious Diseases at the National Institutes of Health of the United States in February 2016 in accordance with the following keywords: MHC binding search; MHC binding assay; MHC allele name: HLA-DPA1*02:01/ DPB1*01:01; DPA*01:03/DPB1*02:01; DPA*01:03/ DPB1*03:01; DPA*01:03/DPB1*04:01; DPA*01:03/

Figure 2. Block of descriptors in proteochemometrics.

Figure 3. Amino acid sequences of α- and β-chains, forming the peptide binding site on HLA-DP proteins. B

DOI: 10.1021/acs.jcim.7b00026 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling Table 2. Peptide Datasets Used in the Study Number of 15-mers in HLA-DP allele

Total number of 15-mers

IC50 range nM

pIC50 range

Training set

Test set

DP1 DP2 DP3 DP41 DP42a DP42b DP5 total

762 615 163 762 38 762 762 3,864

46,700−0.215 46,700−0.446 46,700−0.039 46,700−0.34 46,700−108 46,700−0.096 46,700−2.17 46,700−0.039

4.331−9.668 4.331−9.351 4.331−10.409 4.331−9.469 4.331−6.967 4.331−10.018 4.331−8.664 4.331−10.409

610 492 131 610 31 610 610 3,094

152 123 32 152 7 152 152 770

Table 3. Three z-Scales for the 20 Naturally Occurring Amino Acids26a Amino acid

z1 hydrophobicity

z2 molecular size

z3 electronic properties

Amino acid

z1 hydrophobicity

z2 molecular size

z3 electronic properties

A Ala C Cys D Asp E Glu F Phe G Gly H His I Ile K Lys L Leu

0.07 0.71 3.64 3.08 −4.92 2.23 2.41 −4.44 2.84 −4.19

−1.73 −0.97 1.13 0.39 1.30 −5.36 1.74 −1.68 1.41 −1.03

0.09 4.13 2.36 −0.07 0.45 0.30 1.11 −1.03 −3.14 −0.98

M Met N Asn P Pro Q Gln R Arg S Ser T Thr W Trp V Val Y Tyr

−2.49 3.22 −1.22 2.18 2.88 1.96 0.92 −4.75 −2.69 −1.39

−0.27 1.45 0.88 0.53 2.52 −1.63 −2.09 3.65 −2.53 2.32

−0.41 0.84 2.23 −1.14 −3.44 0.57 −1.40 0.85 −1.29 0.01

a

Descriptor z1 accounts for hydrophobicity, z2 − for molecular size and z3 − for electronic properties.

DPB1*04:02; DPA*03:01/DPB1*04:02; DPA*02:01/ DPB1*05:01; assay: purified MHC competitive binding measuring half maximal inhibitory concentration (IC50) [nM] using radioactivity. In the present study, the affinities of the tested peptides were expressed as pIC50 = log(1/IC50). Duplicates were removed, and the final set contained 3,864 15-mer peptides (Table 2). Among them 771 peptides (20%) were promiscuous; that is, they were binding to more than one HLA-DP protein. The peptides were grouped according to the alleles that bind and were arranged descending on pIC50. Every fifth peptide (20%) of each group was defined as a member of the test set and used for external validation. The other peptides formed the training set. Thus, the training set included 3,094 15-mer peptide and was used to derive the proteochemometric model. The external test set consisted of 770 15-mers and was used to validate the model. The parent proteins of the binding 15-mers from the test set also were collected and used for model validation. The training and test sets and the set of parent proteins used in the present study are given as Supporting Information. z-Scales. The protein and peptide sequences used in the present study were described by the amino acid three z-scales (z1, z2, and z3) (Table 3).25,26 The z-scales have been extensively tested in the context of PCM.27 The descriptor z1 accounts for the hydrophobicity of amino acids, the descriptor z2 for their size, and z3 for the electronic properties. The resulting set of 27 (9 × 3) descriptors formed block L. There are 19 polymorphic residues in DPA and DPB. They were also encoded by z-scales. The set of 57 (19 × 3) descriptors formed the block P. The contacts between the bound peptides and DP proteins are given in Table 4. These interactions were reported by 57 (19 × 3) cross terms and formed the block LP. Iterative Self-consistent Algorithm. The proteochemometric model in this study was derived by an iterative self-

Table 4. Peptide−HLA-DP Protein Contacts Peptide position p1 p2 p3 p4 p5 p6 p7 p8 p9

(L1) (L2) (L3) (L4) (L5) (L6) (L7) (L8) (L9)

Protein position 31α, 76β, 84β 76β 76β 69β, 76β 11α, 66α, 11β 66α, 55β, 65β, 69β 55β 9β, 35β, 36β, 55β

consistent algorithm (ISC)28,29 based on partial least-squares (PLS) method. The algorithm consists of the following steps (Figure 4): 1. As the peptide binding site on HLA could be occupied by 9 amino acids only (binding core), each 15-mer peptide was represented as a set of seven overlapping nonamers taking the pIC50 value of the parent peptide. Thus, the initial nonamer set consisted of 21,658 (3,094 × 7) nonamers. It was used to derive the first model. The optimal number of principal components (PC) was obtained by cross-validation in 7 groups at maximum q2. 2. The first model was used to predict the affinities of the initial nonamer set. One nonamer only was selected for every parent peptidethis one with pIC50 value closest to the experimentally measured. The selected nonamers formed a new training set. 3. The new training set was used to derive the second model, which again predicts the affinities of the initial nonamer set. Again, only nonamers with pIC50 value closest to the experimentally measured were selected. The selected nonamers formed a new training set used to derive a new model. C

DOI: 10.1021/acs.jcim.7b00026 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Journal of Chemical Information and Modeling

Article



RESULTS AND DISCUSSION Model Development. The training set of 3,094 15-mer peptides binding to 7 HLA-DP proteins was used to derive the model by the PLS-based ISC algorithm as described in the Experimental Section. The X matrix consisted of 27 descriptors from block L, 57 descriptors from block P, and 57 cross terms from block LP, 141 descriptors in total. Each 15-mer peptide was presented as a set of overlapping nonamers acquiring the pIC50 value of the parent 15-mer as described in the Experimental Section. Self-consistency of 100% identity in two consecutive peptide sets was achieved at the 144-th iteration (Figure 5). The final model had r2 = 0.922 and q2 = 0.913 at PC = 8. Model Validation. The model was validated by an external test set of 770 15-mer peptides. Each 15-mer was presented as a set of overlapping nonamers, and the pIC50 values of the nonamers were predicted by the model. The highest predicted pIC50 for a 15-mer and the average of the predicted pIC50s for a 15-mer were compared to the experimental pIC50s. A better correlation was observed with the average values (rpred = 0.556) than with the highest ones (rpred = 0.384). As the PCM model had moderate predictive ability, we tried to improve it by preprocessing of the data set. The nonbinding nonamers derived from the nonbinding 15-mers (pIC50 < 5.3) were removed from the binding sequences. Many identical nonamers had different pIC50s. To avoid this ambiguity, the pIC50 value corresponding to one nonamer was presented in four modes: all pIC50s were considered, the maximum or the minimum pIC50 value was considered only, and the average of all corresponding pIC50s was considered. All four modes of presentation led to similar models and predictions. None of them was better than the initial one. The PCM model was tested also as a binder classifier. The test 15-mer peptide was classified as a binder if it contains a nonamer sequence predicted by the model to be a binder. Peptides with IC50 > 5000 nM (pIC50 < 5.3) are considered as nonbinders, while peptides with 5000 < IC50 < 500 nM (6.3 > pIC50 > 5.3) are considered as weak binders.24 Tests were performed at two thresholds: pIC50 = 5.3 (IC50 = 5,000 nM) and pIC50 = 6.3 (IC50 = 500 nM). The results are given in Table 5. The AUCs are illustrated in Figure 6. Increasing the

Figure 4. Iterative self-consistent algorithm. The initial set consists of 3,094 × 7 nonamers (Set 0) and is used to derive the initial model (M0). This model predicts the affinities of Set 0. For every parent 15mer peptide, only one nonamer is selectedthis with pIC50 value closest to the experimentally measured. The selected nonamers form a new training set (Set 1). Set 1 is used to derive the second model (M1), which again predicts the affinities of the initial nonamer set (Set 0). Again, only nonamers with pIC 50 value closest to the experimentally measured are selected into Set 2. Set 2 is used to derive model M2 and so on. At each iteration, the identity between two consecutive sets is recorded. The procedure is repeated until 100% (or close to it) identity is achieved. Then, the final model is derived.

4. The procedure was repeated until 100% (or close to it) of the peptides obtained in two consecutive training sets remain the same. Then, the final model was derived. Model Assessment. The model was assessed by r2 (goodness of fit), q2 (cross validation in 7 groups), and r2pred (external validation by a test set). The SIMCA 13.0 package30 was used to undertake PLS calculations. The number of principal components (PC) was derived for each iteration based on the maximum value of q2. The ability of the model to classify a peptide as a binder or nonbinder was tested on the external test set. A test peptide was classified as a binder if it contains a nonamer sequence predicted by the model to be a binder. The predictive ability was assessed by sensitivity (true predicted binders/all binders in the test set), specif icity (true predicted nonbinders/all nonbinders in the test set), accuracy (true predicted binders and nonbinders/all peptides in the test set), and area under the curve sensitivity vs 1-specif icity (AUC) at two thresholds: pIC50 = 5.3 and pIC50 = 6.3.

Table 5. Classification Performance of the Derived Model at Two Thresholds: pIC50 = 5.3 and pIC50 = 6.3 Threshold pIC50 5.3 6.3

Sensitivity % Specificity % 92.9 87.6

92.1 97.6

Accuracy %

AUC

MCC

92.5 92.6

0.965 0.980

0.849 0.859

Figure 5. PCM model development by the PLS-ISC algorithm. D

DOI: 10.1021/acs.jcim.7b00026 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

given in Figure 8. Among them are descriptors from blocks L and LP.

Figure 6. AUCs for the external test set of 770 peptides at two thresholds: pIC50 = 5.3 and pIC50 = 6.3.

Figure 8. Descriptors with coefficients (scaled and centered) above ±0.1.

The coefficients in the PCM model are indicative for the preferred residues in the peptide. Positive coefficients mean that amino acids with positive values of the corresponding z descriptor will increase pIC50 and will be preferred in the binding site. Negative coefficients point to the fact that residues with negative z descriptors will increase the affinity. According to the model, small sized amino acids are preferred at peptide positions 2 (L2) and 3 (L3) (Table 6).

threshold from 5.3 to 6.3 decreases the sensitivity from 93% to 88% but increases the specif icity from 92% to 98%. The overall accuracy does not change significantlyfrom 92.5% to 92.6%. The AUC and MCC also increase from 0.965 to 0.980 and from 0.849 to 0.859, respectively. The model was tested to identify the binders among the nonamers generated from one protein. They were 366 binders (pIC50 > 5.3) in the test set originating from 59 parent proteins. Each protein contains binders to more than one DP allele. The parent proteins were presented as sets of overlapping nonamers. The pIC50 values of the nonamers were predicted by the PCM model. The top 5%, 10%, 15%, and 20% of the predicted nonamers were compared to the test 15-mer binders. If a nonamer sequence is present in the 15-mer binder, it is considered as a binder. The model identified 86% of the test binders among the top 10% of the best predicted nonamers (Figure 7). Implementation of the HLA-DP Model in the Server EpiTOP. The novel proteochemometric model derived in the present study was implemented in our server for HLA class II binding prediction EpiTOP (http://www.ddg-pharmfac.net/ EpiTOP3). EpiTOP is a web-based application written in Python and HTML, and integrating the MySQL database environment. EpiTOP identified peptides binding to the 12 HLA-DRB1, 5 HLA-DQ, and 7 HLA-DP most frequent HLA class II proteins with options to vary HLA allele and cutoff. Model Interpretation. The proteochemometric model derived in the present study consists of three descriptor blocks: L, P, and LP. The descriptors with coefficients above ±0.1 are

Table 6. Coefficients of the z Descriptors from Block L and Preferred Peptide Residuesa Peptide position

a

z1

z2

p2 (L2)

−0.216

p3 (L3)

−0.107

p5 (L5)

0.237

p6 (L6)

−0.130

p8 (L8) p9 (L9)

−0.297

z3

preferred residues

0.108

0.214 −0.118

Ala, Cys, Gly, Ile, Leu, Met, Ser, Thr, Val Ala, Cys, Gly, Ile, Leu, Met, Ser, Thr, Val Asp, Asn, Glu, Gln, His, Arg, Lys Phe, Ile, Leu, Met, Pro, Trp, Val, Tyr Phe, Trp, Tyr, Pro Glu, Ile, Lys, Leu, Met, Gln, Arg, Thr, Val

The known preferred residues4 are given in bold.

At position 5 (L5) are well tolerated hydrophilic bulky residues. Hydrophobic residues are preferred at position 6 (L6), which corresponds to the large and deep pocket 6 in DP proteins.4 Position 8 (L8) requires bulky hydrophobic residues,

Figure 7. Known binders from the test set identified by the PCM model. E

DOI: 10.1021/acs.jcim.7b00026 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling Table 7. Coefficients of the z Descriptors from Block LP and Preferred Peptide Residuesa DP2

DP3

DP41

DP42a

DP42b

DP5

+0.135*LP_1_a31_z11 α31 Gln α31 z1 2.18 p1 z1 Positive Preferred aa hydrophilic

Cross term

DP1

Met −2.49 Negative hydrophobic

Met −2.49 Negative hydrophobic

Met −2.49 Negative hydrophobic

Met −2.49 Negative hydrophobic

Met −2.49 Negative hydrophobic

Gln 2.18 Positive hydrophilic

+0.217*LP_2_b76_z11 β76 Val β 76 z1 −2.69 p2 z1 Negative Preferred aa hydrophobic

Met −2.49 Negative hydrophobic

Val −2.69 Negative hydrophobic

Met −2.49 Negative hydrophobic

Met −2.49 Negative hydrophobic

Met −2.49 Negative hydrophobic

Met −2.49 Negative hydrophobic

+0.104*LP_3_b76_z22 β 76 Val β 76 z2 −2.53 p3 z2 Negative Preferred aa small sized

Met −0.27 Negative small sized

Val −2.53 Negative small sized

Met −0.27 Negative small sized

Met −0.27 Negative small sized

Met −0.27 Negative small sized

Met −0.27 Negative small sized

+0.103*LP_6_a66_z11 α66 Leu α66 z1 −4.19 p6 z1 Negative Preferred aa hydrophobic

Leu −4.19 Negative hydrophobic

Leu −4.19 Negative hydrophobic

Leu −4.19 Negative hydrophobic

Leu −4.19 Negative hydrophobic

Ser 1.96 Positive hydrophilic

Leu −4.19 Negative hydrophobic

−0.152*LP_6_b11_z11 β11 Gly β11 z1 2.23 p6 z1 Negative Preferred aa hydrophobic

Gly 2.23 Negative hydrophobic

Leu −4.19 Positive hydrophilic

Gly 2.23 Negative hydrophobic

Gly 2.23 Negative hydrophobic

Gly 2.23 Positive hydrophilic

Gly 2.23 Negative hydrophobic

−0.271*LP_8_b55_z11 β55 Ala β55 z1 0.07 p8 z1 Negative Preferred aa hydrophobic

Asp 3.64 Negative hydrophobic

Asp 3.64 Negative hydrophobic

Ala 0.07 Negative hydrophobic

Asp 3.64 Negative hydrophobic

Asp 3.64 Negative hydrophobic

Glu 3.08 Negative hydrophobic

−0.159*LP_8_b55_z22 β55 Ala β55 z2 −1.73 p8 z2 Positive Preferred aa bulky

Asp 1.13 Negative small sized

Asp 1.13 Negative small sized

Ala −1.73 Positive bulky

Asp 1.13 Negative small sized

Asp 1.13 Negative small sized

Glu 0.39 Negative small sized

Phe 0.45 Negative

Phe 0.45 Negative

Phe 0.45 Negative

Phe 0.45 Negative

−0.106*LP_9_b09_z33 β09 Tyr β09 z3 0.01 p9 z3 Negative Preferred aa Ile, Leu, Met, Val, a

Phe Tyr 0.45 0.01 Negative Negative Glu, Gln, Thr, Lys, Arg

The known preferred residues [4] are given in bold.

could exist at β76: Val and Met; both are hydrophobic. The positive coefficient of the cross term LP_2_b76_z11 means that hydrophobic residues are preferred here. The same is true for peptide position 3. Pocket 6 in DP2 also is deep and large and can accommodate aromatic amino acids.4 The presence of Leuα66 and Glyβ11 makes pocket 6 in DP1, DP41, DP42a, and DP5 also acceptable to hydrophobic residues. Serα66 in DP42b and Leuβ11 in DP3 point to less hydrophobic residues at this peptide position. Position 8 is solvent-exposed and makes no contact with β55, but there are two cross terms considering this coexistence. The hydrophilic residues at β55 coexist well with

while position 9 (L9) accepts amino acids with negative z3 values, which corresponds to the shape and size of pocket 9.4 The most important cross terms in the PCM model are summarized in Table 7. Pocket 1 in DP2 is deep and hydrophobic and can accommodate all hydrophobic residues, including large aromatic amino acids such as Phe, Trp and Tyr.4 This is true also for DP3, DP41, DP42a, and DP42b, where Met exists at protein position α31. However, in DP1 and DP5 the pocket is lined by Glnα31 which makes it less hydrophobic. The peptide position 2 protrudes up the binding site and has no direct contact with protein position β76. Two amino acids F

DOI: 10.1021/acs.jcim.7b00026 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

(4) Dai, A.; Murphy, G. A.; Crawford, F.; Mack, D. G.; Falta, M. T.; Marrack, P.; Kappler, J. W.; Fontenot, A. P. Crystal Structure of HLADP2 and Implications for Chronic Beryllium Disease. Proc. Natl. Acad. Sci. U. S. A. 2010, 107, 7425−7430. (5) Ratanachaiyavong, S.; McGregor, A. M. HLA-DPB1 Polymorphisms on the MHC-extended Haplotypes of Families of Patients with Graves’ Disease: Two Distinct HLA-DR17 Haplotypes. Eur. J. Clin. Invest. 1994, 24, 309−315. (6) Fontenot, A. P.; Kotzin, B. L. Chronic Beryllium Disease: Immune-mediated Destruction with Implications for Organ-specific Autoimmunity. Tissue Antigens 2003, 62, 449−458. (7) Petukh, M.; Wu, B.; Stefl, S.; Smith, N.; Hyde-Volpe, D.; Wang, L.; Alexov, E. Chronic Beryllium Disease: Revealing the Role of Beryllium Ion and Small Peptides Binding to HLA-DP2. PLoS One 2014, 9, e111604. (8) Begovich, A. B.; Bugawan, T. L.; Nepom, B. S.; Klitz, W.; Nepom, G. T.; Erlich, H. A. A Specific HLA-DRβ Allele is Associated with Pauciarticular Juvenile Rheumatoid Arthritis but not Adult Rheumatoid Arthritis. Proc. Natl. Acad. Sci. U. S. A. 1989, 86, 9489−9493. (9) Lympany, P. A.; Petrek, M.; Southcott, A. M.; Taylor, A. J. N.; Welsh, K. I.; du Bois, R. M. HLA-DPB Polymorphism: Glu69 Association with Sarcoidosis. Int. J. Immunogenet. 1996, 23, 353−359. (10) Sato, S.; Isobe, N.; Yoshimura, S.; Kanamori, Y.; Masaki, K.; Matsushira, T.; Kira, J. HLA-DPB1*0201 is Associated with Susceptibility to Atopic Myelitis in Japanese. J. Neuroimmunol. 2012, 251, 110−113. (11) Donn, R. Etiology and Pathogenesis of Juvenile Idiopathic Arthritis. In Hochberg, M. C., Silman, A. J., Smolen, J. S., Weinblatt, M. E., Weisman, M. H., Eds.; Rheumatology, 6th ed.; Elsevier: Philaderphia, 2015. (12) Hadley, D.; Hagopian, W.; Liu, E.; She, J. X.; Simell, O.; Akolkar, B.; Ziegler, A. G.; Rewers, M.; Krischer, J. P.; Chen, W. M.; et al. TEDDY Study Group. HLA-DPB1*04:01 Protects Genetically Susceptible Children from Celiac Disease Autoimmunity in the TEDDY Study. Am. J. Gastroenterol. 2015, 110, 915−920. (13) Arroyo, J.; Alvarez, A. M.; Nombela, C.; Sanchez-Perez, M. The Role of HLA-DP Beta Residue 69 in the Definition of Antibodybinding Epitopes. Hum. Immunol. 1995, 43, 219−226. (14) Takahashi, M.; Kimura, A. HLA and CTLA4 Polymorphisms May Confer a Synergistic Risk in the Susceptibility to Graves’ Disease. J. Hum. Genet. 2010, 55, 323−326. (15) Scherer, A.; Frater, J.; Oxenius, A.; Agudelo, J.; Price, D. A.; Günthard, H. F.; Barnardo, M.; Perrin, L.; Hirschel, B.; Phillips, R. E.; et al. Swiss HIV Cohort Study. Quantifiable Cytotoxic T Lymphocyte Responses and HLA-related Risk of Progression to AIDS. Proc. Natl. Acad. Sci. U. S. A. 2004, 101, 12266−12270. (16) Sidney, J.; Steen, A.; Moore, C.; Ngo, S.; Chung, J.; Peters, B.; Sette, A. Five HLA-DP Molecules Frequently Expressed in the Worldwide Human Population Share a Common HLA Supertypic Binding Specificity. J. Immunol. 2010, 184, 2492−2503. (17) Lapinsh, M.; Prusis, P.; Gutcaits, A.; Lundstedt, T.; Wikberg, J. E. Development of Proteo-chemometrics: a Novel Technology for the Analysis of Drug-Receptor Interactions. Biochim. Biophys. Acta, Gen. Subj. 2001, 1525, 180−190. (18) Prusis, P.; Uhlén, S.; Petrovska, R.; Lapinsh, M.; Wikberg, J. E. Prediction of Indirect Interactions in Proteins. BMC Bioinf. 2006, 7, 167. (19) Cortés-Ciriano, I.; Ain, Q. U.; Subramanian, V.; Lenselink, E. B.; Méndez-Lucio, O.; IJzerman, A. P.; Wohlfahrt, G.; Prusis, P.; Malliavin, T. E.; van Westen, G. J. P.; et al. Polypharmacology Modelling Using Proteochemometrics (PCM): Recent Methodological Developments, Applications to Target Families, and Future Prospects. MedChemComm 2015, 6, 24−50. (20) van Westen, G. J. P.; Wegner, J. K.; IJzerman, A. P.; van Vlijmen, H. W. T.; Bender, A. Proteochemometric Modeling as a Tool to Design Selective Compounds and for Extrapolating to Novel Targets. MedChemComm 2011, 2, 16−30. (21) Lapinsh, M.; Prusis, P.; Lundstedt, T.; Wikberg, J. E. Proteochemometrics Modeling of the Interaction of Amine G-protein

hydrophobic amino acids at peptide position 8. The small sized Alaβ55 coexists with bulky residues at p8, while the longer Aspβ55 and Gluβ55 coexist with small sized residues. Finally, the dimorphism Tyr/Phe at β9 prefers amino acids with negative z3 values at peptide position 9. This includes a wide range of amino acids such as the aliphatic Ile, Leu, Met, and Val, the polar Gln and Thr, and the charged residues Glu, Lys, and Arg. This is a good agreement with the X-ray structure of DP2.4 Our findings are in a good agreement with the binding motifs defined by Andreatta and Nielsen for five common DP allelic variants (DP1, DP2, DP41, DP42b, and DP5).31 Briefly, all variants share a common pattern characterized by anchors at p1 and p6, with strong preferences for Phe and other aromatic or hydrophobic amino acids. They also have observed preferences for hydrophobic amino acids at p9. In conclusion, the proteochemometric model derived in the present study has good explanatory and moderate predictive ability. Most of the preferred amino acids at the bound peptide are confirmed by the X-ray structure of the complex peptide− HLA-DP2 protein. The model was implemented in the server EpiTOP, and it is freely accessible at http://www.ddgpharmfac.net/epitop.



ASSOCIATED CONTENT

S Supporting Information *

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.7b00026. Training set (Table S1) (XLSX) Test set (Table S2) (XLSX) Set of parent proteins of the binding 15-mers from the test set (Table S3) (TXT)



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]. Phone +359 2 9236506. ORCID

Irini Doytchinova: 0000-0002-1469-1768 Author Contributions §

V.Y. and I.D. contributed equally.

Notes

The authors declare no competing financial interest.



ABBREVIATIONS USED AUC, area under the curve sensitivity vs 1-specificity; HLA, human leukocyte antigen; ISC, iterative self-consistent algorithm; L, ligand (peptide); LOTO, leave-one-target-out cross validation; LP, ligand−protein cross term; MCC, Mathews correlation coefficient; P, protein; PCM, proteochemometrics; PLS, partial least-squares



REFERENCES

(1) Janeway, Jr, C. A.; Travers, P.; Walport, M.; Shlomchik, M. J. The Major Histocompatibility Complex and its Functions. in Immunobiology: The Immune System in Health and Disease; Garland Science: New York, 2001. (2) Robinson, J.; Halliwell, J. A.; Hayhurst, J. H.; Flicek, P.; Parham, P.; Marsh, S. G. E. The IPD and IMGT/HLA Database: Allele Variant Databases. Nucleic Acids Res. 2015, 43, D423−431. (3) Doytchinova, I. A.; Flower, D. R. QSAR and the Prediction of TCell Epitopes. Curr. Proteomics 2008, 5, 73−95. G

DOI: 10.1021/acs.jcim.7b00026 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling Coupled Receptors with a Diverse Set of Ligands. Mol. Pharmacol. 2002, 61, 1465−1475. (22) Patronov, A.; Dimitrov, I.; Flower, D. R.; Doytchinova, I. Peptide Binding Prediction for the Human Class II MHC Allele HLADP2: a Molecular Docking Approach. BMC Struct. Biol. 2011, 11, 32. (23) Patronov, A.; Dimitrov, I.; Flower, D. R.; Doytchinova, I. Peptide Binding to HLA-DP Proteins at pH 5.0 and pH 7.0: a Quantitative Molecular Docking Study. BMC Struct. Biol. 2012, 12, 20. (24) Vita, R.; Overton, J. A.; Greenbaum, J. A.; Ponomarenko, J.; Clark, J. D.; Cantrell, J. R.; Wheeler, D. K.; Gabbard, J. L.; Hix, D.; Sette, A.; et al. The immune epitope database (IEDB) 3.0. Nucleic Acids Res. 2015, 43, D405−D412. (25) Hellberg, S.; Sjostrom, M.; Skagerberg, B.; Wold, S. Peptide Quantitative Structure - Activity Relationships, a Multivariate Approach. J. Med. Chem. 1987, 30, 1126−1135. (26) Eriksson, L.; Jonsson, J.; Sjostrom, M.; Wold, S. Multivariate Parametrization of Coded and Non-coded Amino Acids by Thin Layer Chromatography. Prog. Clin. Biol. Res. 1989, 291, 131−134. (27) van Westen, G. J. P.; Swier, R. F.; Cortes-Ciriano, I.; Wegner, J. K.; Overington, J. P.; IJzerman, A. P.; van Vlijmen, H. W. T.; Bender, A. Benchmarking of protein descriptor sets in proteochemometric modeling (part 2): modeling performance of 13 amino acid descriptor sets. J. Cheminf. 2013, 5, 42. (28) Doytchinova, I. A.; Flower, D. R. Towards the In Silico Identification of Class II Restricted T-cell Epitopes: a Partial Least Squares Iterative Self-consistent Algorithm for Affinity Prediction. Bioinformatics 2003, 19, 2263−2270. (29) Mallios, R. R. An Iterative Approach to Class II Predictions. In Immunoinformatics: Predicting Immunogenicity In Silico. Methods in Molecular Biology; Flower, D. R., Ed.; Humana Press, 2007; p 409. (30) SIMCA 13.0. MKS Data Analytics Solutions. https:// mksdataanalytics.com. (31) Andreatta, M.; Nielsen, M. Characterizing the Binding Motifs of 11 Common Human HLA-DP and HLA-DQ Molecules using NNAlign. Immunology 2012, 136, 306−311.

H

DOI: 10.1021/acs.jcim.7b00026 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX