Where Informatics Lags Chemistry Leads - Biochemistry (ACS

Publication Date (Web): December 21, 2017. Copyright © 2017 American Chemical Society. *Department of Chemistry, Indian Institute of Technology Delhi...
Communication Cite This: Biochemistry XXXX, XXX, XXX−XXX

pubs.acs.org/biochemistry

Where Informatics Lags Chemistry Leads

Rahul Kaushik,†,‡ Ankita Singh,‡,§ and B. Jayaram*,†,‡,∥

† Kusuma School of Biological Sciences, Indian Institute of Technology, Delhi, India
‡ Supercomputing Facility for Bioinformatics & Computational Biology, Indian Institute of Technology, Delhi, India
§ Department of Bioinformatics, Banasthali Vidyapith, Banasthali, India
∥ Department of Chemistry, Indian Institute of Technology, Delhi, India

Supporting Information

sequence comparison algorithms and growth of the PDB over the past two decades have contributed immensely to the popularity of homology-based protein structure predictions. However, homologue detection becomes difficult with dwindling apparent sequence similarities and can lead to inconclusive and/or incorrect structures.16 A previous study explained the dissimilarities in amino acid sequences via three sets of properties, resulting from a principal component analysis of 29 physicochemical features.17 Recently, Scheraga and co-workers reported a homologue detection method based on hundreds of physicochemical properties and classifications of amino acids as an alternative to evolutionary approaches.18,19 The results showed some improvements in structure prediction over conventional methods. Thus, the core of the problem is in identifying similarities in amino acid sequences. Visualizing amino acids as a set of chemical templates made of different structural chemical side chain properties that characterize the chemical logic of protein sequences, we propose here a new amino acid classification that can be stated as a set of rules. (1) Each amino acid can be specified with a minimum of one and a maximum of three of four possible structural chemical properties. (2) Exactly 10 amino acids exhibit each property. (3) Only four amino acids can simultaneously share any two properties. (4) Only one amino acid can exhibit any three properties. (5) Only one amino acid exhibits a single property exclusively. None of the textbook classifications conform to all five rules simultaneously. The new structural chemical properties proposed here comprise (i) the presence of sp3-hybridized γ-carbons (represented as “g”), (ii) the presence of hydrogen bond donor groups (represented as “d”), (iii) the absence of δ-carbons (represented as “s”), and (iv) linearity/nonoccurrence of bidentate forks (represented as “l”) in the side chains.
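The counting behind the five rules can be checked with a short sketch. It assumes, following the footnote to Table 1, that the property multiplicities of each amino acid add to 3, so that each amino acid corresponds to one way of distributing three "slots" over the four properties g, d, s, and l. It also reads rules 3 to 5 as holding per pair, per triple, and per property, respectively. The names `templates` and `holders` are hypothetical, and the sketch only counts combinations; it does not reproduce the actual assignment of amino acids to properties in Table 1.

```python
from collections import Counter
from itertools import combinations, combinations_with_replacement

PROPERTIES = "gdsl"  # property labels as named in the text

# Assumption: one template per multiset of three property slots,
# e.g. ('g', 'g', 'd') stands for multiplicities g2 d1.
templates = [Counter(c) for c in combinations_with_replacement(PROPERTIES, 3)]


def holders(*props):
    """Return the templates that exhibit every property in props."""
    return [t for t in templates if all(t[p] >= 1 for p in props)]


total = len(templates)

# Rule 1: every template uses between one and three distinct properties.
rule1 = all(1 <= len(t) <= 3 for t in templates)

# Rule 2: number of templates exhibiting each property.
per_property = {p: len(holders(p)) for p in PROPERTIES}

# Rule 3: number of templates sharing each pair of properties.
per_pair = {a + b: len(holders(a, b)) for a, b in combinations(PROPERTIES, 2)}

# Rule 4: number of templates exhibiting each triple of properties.
per_triple = {"".join(c): len(holders(*c)) for c in combinations(PROPERTIES, 3)}

# Rule 5: number of templates exhibiting one property exclusively.
exclusive = {p: sum(1 for t in templates if set(t) == {p}) for p in PROPERTIES}

print(total, rule1)
print(per_property)
print(per_pair)
print(per_triple)
print(exclusive)
```

Under these assumptions the enumeration yields 20 templates, with 10 per property, 4 per pair, 1 per triple, and 1 exclusive template per property.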
This incidentally explains the existence of only 20 naturally and commonly occurring amino acids (Table 1). Note that all five rules are satisfied with the property space classification of amino acids proposed above. Assignments to tryptophan and cysteine may be treated as provisional

ABSTRACT: The fact that amino acid sequences dictate the tertiary structures of proteins has been known for more than five decades. While the molecular pathways to tertiary structure are still being worked out, with the axiom that similar sequences adopt similar structures, computational methods are being developed continually in parallel, utilizing the Protein Data Bank structural repository and homologue detection strategies to predict structures of sequences of interest. The success of this approach is limited by the ability to unravel the hidden similarities among amino acid sequences. We consider here the 20 amino acids as a complete set of chemical templates in the physicochemical space of proteins and propose a new structural and chemical classification of amino acids. An integration of this perspective into the conventional evolutionary methods of similarity detection leads to an unprecedented increase in the accuracy in homologue detection, resulting in improved protein structure prediction. The performance is validated on a large data set of 11716 unique proteins, and the results are benchmarked against conventional methods. The availability of good quality protein structures helps in structure-based drug design endeavors and in establishing protein structure−function correlations.

Protein folding is a natural physicochemical process resulting in a unique arrangement of atoms in Euclidean space in accordance with the environmental conditions and dynamics.1,2 The accumulating knowledge of protein folding gained through molecular and evolutionary approaches has been helping the protein engineering and protein structure prediction fields considerably.3,4 Conventional methods for protein structure prediction known as homology modeling rely on the relationship among function, structure, and sequence.5−7 Thus, the focus of these methods is in identifying apparent and hidden similarities between the sequence of interest and sequences in the structural repositories.8 An amino acid substitution matrix has been proposed previously9−11 based on evolutionary conservation, which facilitates the detection of homologous sequences, after which a tertiary structure is built using this homologue in the Protein Data Bank (PDB) as a template.12 The model structure can be further refined to remove atomic clashes and made energetically stable via molecular dynamics.13−15 Improvements in

Received: October 23, 2017
Revised: December 7, 2017
Published: December 21, 2017

DOI: 10.1021/acs.biochem.7b01073


Table 1. Unique Classification of Amino Acids Based on Structural Chemical Propertiesᵃ

[Table body not recoverable from the extracted text. Rows: amino acids A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y. Columns: g1, g2, g3, d1, d2, d3, s1, s2, s3, l1, l2, l3. Asterisks mark the property multiplicities of each amino acid.]

ᵃThe first column represents the single-letter code of 20 amino acids. For each amino acid, an asterisk signifies the occurrence of the corresponding property and the subscripts in the first row denote the property multiplicity that adds to 3 for each amino acid.

Figure 1. Representation of a structural chemical property-based classification of amino acids and their relationship-based substitution scoring framework. The single-letter codes (uppercase) are used to designate individual amino acids, and lowercase letters are used to denote structural chemical properties. Further details are provided in the Supporting Information.

(preliminary findings are available at http://precedings.nature.com/documents/2135/version/1, and additional details are provided in the Supporting Information). Now, to compare any two sequences, amino acid substitution scores are required. A popular matrix that presents these substitution scores, derived from evolutionary conservation probabilities, is the BLOSUM matrix.9,20,21 The elements of this 20 × 20 matrix are log-odds scores derived from amino acid substitution probabilities and occurrence frequencies, as shown in eq 1.

\[ \mathrm{score}_{ij} = \frac{1}{\lambda} \log \frac{P_{ij}}{F_i F_j} \tag{1} \]
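As a minimal numerical sketch of eq 1, the function below computes a single log-odds score. It assumes that P_ij is the observed substitution probability for the pair (i, j), that F_i and F_j are the occurrence frequencies of the two amino acids, and that λ is a positive scaling factor. The function name `log_odds_score`, the use of the natural logarithm, and the input values are illustrative assumptions, not values taken from the paper or from any published BLOSUM matrix.

```python
import math


def log_odds_score(p_ij, f_i, f_j, lam=1.0):
    """Log-odds substitution score in the form of eq 1.

    p_ij : substitution probability of the pair (i, j)
    f_i, f_j : occurrence frequencies of amino acids i and j
    lam : scaling factor (assumed positive; natural log is an assumption)
    """
    if min(p_ij, f_i, f_j) <= 0 or lam <= 0:
        raise ValueError("probabilities, frequencies, and lam must be positive")
    return math.log(p_ij / (f_i * f_j)) / lam


# Hypothetical inputs: a pair seen twice as often as chance predicts.
print(log_odds_score(0.02, 0.1, 0.1))

# A pair seen exactly as often as chance predicts scores zero.
print(log_odds_score(0.01, 0.1, 0.1))
```

A positive score indicates a pair observed more often than expected from the individual frequencies, and a negative score indicates a pair observed less often.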

Figure 2. Chemistry-fortified evolutionary substitution matrix derived for comparing amino acid sequences to detect potential homologues.

an efficient alignment. The best alignment identified between the sequence of interest and its template/reference in the PDB is used to build the tertiary structure for the query sequence. Modeler23 is a useful utility in this context. An overall workflow from query sequence to its structure is depicted in Figure 3. The proposed integrated model is validated on a large data set of experimentally determined proteins.12 Homologue detection is considered easier when the protein sequences are more than 40% identical.16,24 Thus, we excluded all those sequences that are >40% identical in the PDB and considered only those that are