
Article

Molecular dynamics information improves cis-peptide based function annotation of proteins
Sreetama Das, Pratiti Bhadra, Suryanarayanarao Ramakumar, and Debnath Pal
J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.7b00217 • Publication Date (Web): 21 Jun 2017
Downloaded from http://pubs.acs.org on June 22, 2017


Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Journal of Proteome Research

Molecular dynamics information improves cis-peptide based function annotation of proteins

Sreetama Das1, Pratiti Bhadra2, Suryanarayanarao Ramakumar1 and Debnath Pal2*

1 Department of Physics
2 Department of Computational and Data Sciences
Indian Institute of Science, Bangalore 560012, India

Contact:
* Debnath Pal, Email: [email protected], Telephone: +91-80-2293-2901, FAX: +91-80-2360-2648
S. Ramakumar, Email: [email protected], Telephone: +91-80-2293-2312, FAX: +91-80-2360-2602


ACS Paragon Plus Environment


ABSTRACT

Cis-peptide bonds, whose occurrence in proteins is rare but evolutionarily conserved, are thought to play an important role in protein function. This has led to their previous use in a homology-independent, fragment match-based protein function annotation method. However, proteins are not static molecules; dynamics is integral to their activity. This is epitomized by the geometric isomerization of the cis-peptide to the trans form for molecular activity. Hence we have incorporated both static (cis-peptide) and dynamics information to improve the prediction of protein molecular function. Our results show that cis-peptide information alone cannot detect functional matches in cases where cis-trans isomerization exists but 3D coordinates have been obtained for only the trans isomer, or when the cis-peptide bond is incorrectly assigned as trans. On the other hand, use of dynamics information alone includes false-positive matches for cases where fragments with similar secondary structure show similar dynamics, but the proteins do not share a common function. Combining the two methods reduces errors while detecting the true matches, thereby enhancing the utility of our method in function annotation. A combined approach, therefore, opens up new avenues of improving existing automated function annotation methodologies.

KEYWORDS: cis-peptide fragment, fragment-based method, Gene Ontology, sequence-structure patterns, geometric clustering, molecular dynamics simulation, coarse grained forcefield, autocorrelation vector, function annotation, validation.


INTRODUCTION

The importance of proteins as molecular 'workhorses' makes it imperative to understand how they function. However, a vast majority of the proteins catalogued in public sequence and structure databases do not have experimentally verified functional annotation. The inadequacy of experimental approaches in manually curating these proteins in bulk necessitates the use of computational function prediction tools. The simplest prediction methods involve the assessment of similarity in sequence and three-dimensional structure with homologous proteins of known function. The presence of high overall similarity, however, does not predict function unambiguously, since certain protein folds are associated with multiple functions while proteins with different folds may share functional traits (1-4). Often proteins with different global structure are found to have structural similarity at the local level, in segments of residues that are responsible for the similarity in function (5). This has given rise to fragment-based (FB) function annotation methods. FB methods may involve locating functionally relevant surface patches or cavities formed by sequentially distant residues (5-7), or the presence of structurally conserved, contiguous residue fragments with proven relevance to function (8, 9). A useful tool in this context is cis-peptide containing fragments, since these peptides, especially those involving non-Proline amino acids, have been implicated in protein structure and function (10). Cis-peptides have a low frequency of occurrence in proteins (11, 12) but are usually found to be conserved by evolution (13). They are often associated with ligand binding sites (14, 15) and dimerization interfaces (16). Cis-trans isomerization is expected to play a regulatory role in many cellular processes (17). Non-conservation of these peptides is implicated in the evolution of different functions among similar protein folds (18).
Several studies have focused on detecting cis-peptides from residue sequences and predicting cis-trans isomerisation in proteins (19-21). Given the functional relevance of cis-peptides, a previous study (22) demonstrated the determination of functionally relevant cis-peptide containing fragments from proteins of known function and their use in providing annotation to uncurated proteins of known structure.


Proteins are not static entities but exhibit dynamic behaviour which plays a role in binding (23, 24), allosteric regulation (25-28), catalysis (29-31) and ultimately results in the emergence of new functions (32, 33). Local motions like changes in residue conformations or loop movements occur on the timescale of nanoseconds and are expected to optimize binding interactions, while larger rearrangements of substructures (e.g. domain movements, allosteric effects) occur on the timescale of microseconds to milliseconds and are believed to be important in protein-protein interactions, signal transduction, etc. (34). Residues involved in substrate recognition are expected to exhibit enhanced mobility compared to the other residues in the absence of the binding partner (35). The cis-trans isomerization of the omega angle involves a local conformation change that is often compensated by variation of backbone angles of the residues flanking the cis-peptide, thereby avoiding global change. However, these changes are sometimes biologically crucial (36). Therefore, the backbone dynamics of atoms near the cis-peptide are influenced by the cis-peptide dynamics. Hence, the protein fragments directly linked to molecular function ought to be detected by their dynamics signature. The pattern of dynamics may also be used to locate similarly dynamic fragments in other proteins which perform similar functions. Such dynamics information can be obtained from experiments (X-ray crystallography, NMR) and computational studies (normal mode analysis, molecular dynamics (MD)). The experimental techniques provide ensemble-averaged information about dynamics on a long timescale and not the details at a single molecule level. On the other hand, atomistic MD simulations are exhaustive but computationally expensive for longer runtimes (37). 
A useful tool in this respect is normal mode analysis (38, 39); however, it calculates all possible modes for the given 3D structure and does not identify the functionally relevant modes, which have to be inferred from additional data (40). Coarse-grained (CG) simulations, which sufficiently sample the protein conformational space at a reasonable computational expense, are a viable tool to study functionally relevant protein dynamics. CG potentials are mainly used to represent backbone dynamics. In most of the existing CG potentials (41, 42), the parameters of the cis-peptide bond are different from those of the trans-peptide bond.


A recent study (43, 44) demonstrated the use of a novel coarse-grained molecular mechanics (CGMM) forcefield, which closely reproduced experimental dynamics information, in the extraction of dynamics patterns for function inference. It may be noted that use of MD in function annotation is not new; however, its application has been limited to obtaining static frames from the dynamics trajectory on which functional site analysis was performed (45). Besides, the aforementioned study was limited to the identification of Ca2+ ion binding sites in proteins of known function, and not extended to un-annotated proteins. In the present study, we incorporate the dynamics of functionally important cis-peptide segments to show how it improves the prediction of protein molecular function. The detected fragments have been validated for their utility in function prediction on a dataset of known proteins, and then utilized in the annotation of un-curated proteins with available three-dimensional structure. Combining static and dynamic information provides a holistic framework to improve function annotation pipelines. Our work is a useful addition to the toolkit of function annotation approaches and will facilitate protein engineering studies around the cis-peptide neighbourhood of proteins.

EXPERIMENTAL SECTION

Detection of functionally important cis-fragments (F0 dataset)

The protocol to prepare the library of functionally enriched cis-peptide fragments is depicted in Fig. 1. A non-redundant set of Protein Data Bank (PDB) entries at 25% sequence identity and 2.5 Å resolution was obtained from the PISCES server (46). After rejecting residues which do not have well-defined backbone torsion angles (e.g. at termini and chain breaks) or whose main-chain atoms have B-factor >60 Å², the remaining backbone was divided into successive, overlapping fragments of lengths 6 – 10. Backbone torsion angles (φ,ψ,ω) of the proteins were calculated using the SECSTR module of PROCHECK (47). Fragments containing at least one cis-peptide (a peptide bond was considered to be cis if |ω| ≤ 90° (48)) were separated from those having only trans-peptides. The cis-conformation was verified using a real space correlation coefficient (rscc) value >0.8 (49), wherever data were available. Cis-trans ambiguities do not arise in NMR derived structures due to the presence of explicit peaks in the 2D spectrum indicating occurrence of the cis or trans form; these structures were therefore not subjected to checks. The cis-fragments were clustered together using an in-house clustering algorithm (9, 22) such that the (φ,ψ) values of each residue in a fragment did not deviate from the (φ,ψ) values of the cluster centre by more than 60° each (9). The distance (DIST) between cluster centre Fi and the fragment Fj was obtained as:

$$\mathrm{DIST}[F_i, F_j] = \sum_{x=m,\,y=n}^{m+FL-1,\,n+FL-1} \left[ (\phi_{ix} - \phi_{jy})^2 + (\psi_{ix} - \psi_{jy})^2 \right]$$

where x and y run over the FL residues of Fi and Fj, starting at positions m and n, respectively.
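As a minimal sketch of the geometric criteria described above (not the in-house clustering algorithm itself), the following Python fragment evaluates DIST as the sum of squared (φ,ψ) differences over aligned residues and checks the 60° per-angle cut-off and the |ω| ≤ 90° cis criterion. The function names, the periodic wrapping of angle differences, and the example torsion values are illustrative assumptions; in particular, whether the original DIST includes a square root is not clear from the text.

```python
def angle_diff(a, b):
    """Smallest absolute difference between two angles in degrees (0..180).

    Wrapping at 360 degrees is an assumption made for this sketch.
    """
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def is_cis(omega):
    """A peptide bond is treated as cis if |omega| <= 90 degrees."""
    return abs(omega) <= 90.0


def dist(centre, fragment):
    """Sum of squared (phi, psi) differences over aligned residues.

    Both arguments are equal-length lists of (phi, psi) tuples in degrees.
    """
    return sum(
        angle_diff(phi_c, phi_f) ** 2 + angle_diff(psi_c, psi_f) ** 2
        for (phi_c, psi_c), (phi_f, psi_f) in zip(centre, fragment)
    )


def within_cutoff(centre, fragment, cutoff=60.0):
    """True if every phi and psi lies within `cutoff` degrees of the centre."""
    return all(
        angle_diff(phi_c, phi_f) <= cutoff and angle_diff(psi_c, psi_f) <= cutoff
        for (phi_c, psi_c), (phi_f, psi_f) in zip(centre, fragment)
    )


# Hypothetical FL6 cluster centre and candidate fragment (degrees).
centre = [(-60.0, -45.0)] * 6
candidate = [(-70.0, -40.0)] * 6
belongs = within_cutoff(centre, candidate)
score = dist(centre, candidate)
```

A fragment would join a cluster only when `within_cutoff` holds; `dist` can then rank candidate cluster centres.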

Each clustered fragment was assigned the Gene Ontology Molecular Function (GO MF) term of its PDB entry (http://www.geneontology.org). For a given GO MF, the parent function at level L from the root node (GO: 0003674) was obtained using the GO directed acyclic graph (Supplementary Fig. S1). The propensity of each fragment GO MF term at level L in a cluster was calculated using the formula: propensityL = (nXL/nTL) / (NXL/NTL), where nXL and NXL are the number of occurrences of GO MF term ‘X’ in a cluster and in all clusters, respectively; nTL and NTL are the numbers of all GO MF terms in the particular cluster and in all the clusters, respectively. The p-value of a GO term ‘X’ occurring k times in a cluster was calculated as:

$$p\text{-value} = 1 - \sum_{t=0}^{k-1} H_L(t), \quad \text{where} \quad H_L(n_{XL}; N_{TL}; n_{TL}; N_{XL}) = \frac{\binom{n_{TL}}{n_{XL}} \binom{N_{TL} - n_{TL}}{N_{XL} - n_{XL}}}{\binom{N_{TL}}{N_{XL}}}$$

Fragments with propensity ≥20 were considered functionally important, while p-value ≤0.05 was used to confirm their statistical significance (F0 dataset), similar to our previous studies (9, 22).
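The enrichment test can be sketched in a few lines of Python. This is an illustrative reading of the propensity and hypergeometric p-value definitions given above, using only the standard library; the function names and the example counts are assumptions, not values from the study.

```python
from math import comb


def propensity(n_x, n_t, N_x, N_t):
    """(n_x / n_t) / (N_x / N_t), rearranged to a single division."""
    return (n_x * N_t) / (n_t * N_x)


def hypergeom_pmf(t, N_t, n_t, N_x):
    """Probability that a cluster of n_t terms contains exactly t copies of
    term X, when X occurs N_x times among N_t terms in all clusters."""
    return comb(n_t, t) * comb(N_t - n_t, N_x - t) / comb(N_t, N_x)


def p_value(k, N_t, n_t, N_x):
    """Probability of observing k or more occurrences of X in the cluster."""
    return 1.0 - sum(hypergeom_pmf(t, N_t, n_t, N_x) for t in range(k))


def is_enriched(n_x, n_t, N_x, N_t, min_propensity=20.0, alpha=0.05):
    """Apply the two acceptance thresholds quoted in the text."""
    return (propensity(n_x, n_t, N_x, N_t) >= min_propensity
            and p_value(n_x, N_t, n_t, N_x) <= alpha)


# Hypothetical counts: term X occurs 6 times in a cluster of 10 terms,
# and 50 times among 2000 terms over all clusters.
example = is_enriched(6, 10, 50, 2000)
```

The p-value is the upper tail of the hypergeometric distribution, so small clusters with a single chance occurrence of a rare term do not pass on propensity alone.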

Information content (IC) calculation

Entropy S at a given position in a fragment was calculated using Shannon’s entropy formula (50) as S = - ∑ pi log (pi), where pi is the fractional occurrence of amino acid i at that position and the sum is over all amino acids. The average IC of a cluster was calculated by averaging the entropies at all positions of the fragments. To emphasize the significance of these IC values, distributions of IC were also obtained for randomly generated pseudo-clusters. Pseudo-clusters were obtained by randomly selecting 5000 fragments (~25% of the original dataset) from the set of all fragments of a given FL. Results were averaged over 30 pseudo-clusters to remove any influence of fluctuations.
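The IC calculation can be illustrated with a short Python sketch. The base of the logarithm is not stated in the text, so the natural logarithm is used here as an assumption (a different base can be passed in); the function names and example sequences are hypothetical.

```python
from collections import Counter
from math import log


def position_entropy(residues, base=None):
    """Shannon entropy S = -sum(p_i * log p_i) for one fragment position.

    `residues` holds the amino acid seen at that position in each fragment
    of the cluster. Natural log is used unless `base` is given (assumption).
    """
    total = len(residues)
    entropy = 0.0
    for count in Counter(residues).values():
        p = count / total
        entropy -= p * (log(p) if base is None else log(p, base))
    return entropy


def average_ic(fragments, base=None):
    """Average of the per-position entropies over a cluster of
    equal-length fragment sequences."""
    columns = list(zip(*fragments))
    return sum(position_entropy(col, base) for col in columns) / len(columns)


# Hypothetical FL6 cluster: identical fragments give IC = 0,
# variability at any position raises the average.
conserved = average_ic(["GSPYDA", "GSPYDA", "GSPYDA"])
variable = average_ic(["GSPYDA", "ASPWDA", "GTPYEA"])
```

For the pseudo-cluster control, the same `average_ic` function could be applied to fragments drawn with `random.sample` from the full fragment set.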

Using cis-fragment library to detect function matches in other proteins

The test dataset to assess the utility of our cis-fragment library in function annotation consisted of a non-redundant set of PDB entries from the PISCES server at 90% sequence identity and 2.5 Å resolution, devoid of any PDB entry that belonged to the F0 dataset or proteins of unknown function. The PDB entries with assigned GO MF terms, irrespective of the presence of any cis-bond, were searched for sequence matches with one or more of the functionally important fragments, allowing at most one residue mismatch. In case of sequence matches, the GO MF terms of the F0 fragment and the PISCES derived PDB entries were also compared. The performance of our FB annotation method was assessed by calculating the true positive rates (TPR) and false positive rates (FPR) as TPR = TP / (TP + FN) and FPR = FP / (FP + TN), where TP represents true positives (sequence matches whose cis-peptides and GO MF terms match with those of the F0 fragments), FP represents false positives (sequence matches whose cis-peptides match but GO MFs do not match with those of the F0 fragments), TN represents true negatives (sequence matches which do not have matching cis-peptides and GO MFs with those of the F0 fragments) and FN represents false negatives (sequence matches which do not have matching cis-peptides but whose GO MFs match with those of the F0 fragments). Since the number of cases with no matching cis-peptides outnumbers the cases with a cis-match, equal numbers of cases were randomly selected from the two sets 30 times and the results averaged to reduce bias in the calculation of positives and negatives.
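The TPR/FPR bookkeeping and the balanced resampling described above can be sketched as follows. Each sequence match is represented by a hypothetical pair of booleans (cis-peptide match, GO MF match); the function names, the fixed random seed, and the example counts are assumptions made for illustration.

```python
import random


def rates(cases):
    """Return (TPR, FPR) for a list of (cis_match, go_match) booleans."""
    tp = sum(1 for cis, go in cases if cis and go)
    fp = sum(1 for cis, go in cases if cis and not go)
    tn = sum(1 for cis, go in cases if not cis and not go)
    fn = sum(1 for cis, go in cases if not cis and go)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


def balanced_rates(cases, repeats=30, seed=0):
    """Average TPR and FPR over `repeats` random draws containing equal
    numbers of cis-match and no-cis-match cases."""
    rng = random.Random(seed)
    with_cis = [c for c in cases if c[0]]
    without_cis = [c for c in cases if not c[0]]
    n = min(len(with_cis), len(without_cis))
    tprs, fprs = [], []
    for _ in range(repeats):
        sample = rng.sample(with_cis, n) + rng.sample(without_cis, n)
        tpr, fpr = rates(sample)
        tprs.append(tpr)
        fprs.append(fpr)
    return sum(tprs) / repeats, sum(fprs) / repeats


# Hypothetical outcome list: 8 TP, 2 FP, 1 FN, 9 TN.
cases = ([(True, True)] * 8 + [(True, False)] * 2
         + [(False, True)] * 1 + [(False, False)] * 9)
tpr, fpr = rates(cases)
mean_tpr, mean_fpr = balanced_rates(cases)
```

Drawing equal numbers from the two sets keeps the much larger no-cis-match set from dominating the denominators.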

Coarse-grained (CG) molecular dynamics (MD) simulations

1 µs MD simulations were carried out on a smaller subset of protein pairs, where one protein is from the F0 dataset and the other from the test dataset. The data comprised equal numbers of cases in the positive set (protein pairs with matching fragment sequence and cis-peptide) and the negative set (matching fragments but with cis-peptide in one protein and trans-peptide in the other). At most one residue mismatch in fragment sequence was allowed between the selected pairs of proteins. A novel CGMM forcefield developed by Bhadra et al. (43, 44) was used to simulate the proteins in vacuo. The total energy Utotal was expressed as a sum of the bonded and non-bonded terms in the forcefield: Utotal = Ubond + Uangle + Utorsion + Unonbond. The cis-peptide virtual bond potential (in kcal mol⁻¹) was modeled by a harmonic functional term of the form Ucis(rij) = K(|rij| − |r0|)², where rij is the virtual bond distance (Å) between two consecutive Cα atom positions i and j, r0 (= 2.96 Å) is the virtual bond distance where the energy is minimum and K (= 2090 kcal mol⁻¹ Å⁻²) is the force constant. The other terms in the potential function were as described in the previous work (43). The in-house CGMM forcefield was incorporated into the GROMACS 4.5.5 simulation package (51). Modified amino acids were replaced by the corresponding unmodified ones. Missing regions were modelled using the SWISS-MODEL server (52). Each residue was represented by a CG pseudoatom at the Cα atom position with mass equal to that of the amino acid. The proteins were energy minimised using the method of steepest descent. The temperature was set to 298 K using Berendsen temperature coupling in the NVT ensemble. The equilibration step included simulated annealing for 70 ps. The 1 µs production run was performed with a 2 fs integration time step using the leap-frog integrator, and coordinates were saved every 100 ps. The long timescale is expected to sample sufficiently the conformational space of the protein.
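To make the harmonic cis-peptide term concrete, the following Python sketch evaluates Ucis for a pair of consecutive Cα positions using the K and r0 values quoted above. It only illustrates the functional form; it is not the CGMM implementation inside GROMACS, and the function names and example coordinates are assumptions.

```python
import math

K_CIS = 2090.0   # force constant, kcal mol^-1 A^-2 (value quoted in the text)
R0_CIS = 2.96    # equilibrium virtual bond length, A (value quoted in the text)


def virtual_bond_length(ca_i, ca_j):
    """Distance between two consecutive C-alpha positions (x, y, z) in A."""
    return math.dist(ca_i, ca_j)


def u_cis(r, k=K_CIS, r0=R0_CIS):
    """Harmonic cis virtual bond energy U = K * (r - r0)^2 in kcal/mol."""
    return k * (r - r0) ** 2


# Hypothetical C-alpha coordinates 3.06 A apart, i.e. 0.1 A from r0.
r = virtual_bond_length((0.0, 0.0, 0.0), (3.06, 0.0, 0.0))
energy = u_cis(r)
```

With a force constant of this size, a deviation of only 0.1 Å from r0 already costs about 20.9 kcal mol⁻¹, so the virtual bond stays close to the cis Cα-Cα separation throughout the run.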

Detecting function matches based on dynamics

The dynamics of the paired segments were compared using the un-weighted 3D autocorrelation method (43). Eleven snapshots, including the first and the last, were extracted at 100 ns intervals from the MD trajectory of each protein. The dynamics of the relevant fragment (important for function) in each snapshot can be described by a 3D autocorrelation vector (ACV) of dimension n = dmax / dx, where dmax is the maximum distance between a residue pair in the fragment in that snapshot and dx (= 2 Å) is the step-size. The ith component of the 3D-ACV is given by:

$$\mathrm{3D\text{-}ACV}(i) = \sum_{j,k} P_j P_k$$

where Pj and Pk are the weights associated with atoms j and k, and the sum is over all atom pairs separated by a distance between (i)dx and (i + 1)dx. In the un-weighted ACV, all Ps are assigned the value 1. The correlation coefficient (CC) was then calculated between each pair of ACVs of the fragments (one each from the F0 set and the test set) associated with the same function. This yielded a matrix of 11 x 11 CC values, since there were 11 extracted snapshots and hence 11 ACVs for each fragment. We evaluated the percentages of CC values >0.95 and >0.9 (referred to as x and y, respectively). Two fragments were said to have a match in dynamics if x ≥ 25% and y ≥ 50%. Alternatively, we also allowed matches if ED ≤15, where the Euclidean distance $\mathrm{ED} = \sqrt{(x - 25)^2 + (y - 50)^2}$. To assess the prediction of function using only dynamics information, TPR and FPR were calculated. In this case, TP represents fragment matches whose dynamics and GO MF terms match with those of the F0 fragments, FP represents fragment matches whose dynamics match but GO MFs do not match with those of the F0 fragments, TN represents fragment matches which do not have dynamics and GO MFs matching with those of the F0 fragments and FN represents fragment matches which do not have matching dynamics but whose GO MFs match with those of the F0 fragments. For multiple cis-fragment matches between a pair of PDB entries, the dynamics has to match for at least one pair of fragments.
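The dynamics comparison can be sketched in Python as below. This is a simplified reading of the un-weighted 3D-ACV procedure, not the authors' code: the binning convention (floor(d/dx), giving floor(dmax/dx) + 1 bins), the zero-padding of ACVs of unequal length before correlation, the use of the Pearson coefficient as the CC, and all function names are assumptions.

```python
import math
from itertools import combinations


def acv(coords, dx=2.0):
    """Un-weighted 3D autocorrelation vector of one snapshot: component i
    counts the pseudoatom pairs whose distance lies in [i*dx, (i+1)*dx)."""
    dists = [math.dist(a, b) for a, b in combinations(coords, 2)]
    vec = [0] * (int(max(dists) // dx) + 1)
    for d in dists:
        vec[int(d // dx)] += 1
    return vec


def pearson(u, v):
    """Pearson correlation of two ACVs, zero-padded to equal length."""
    n = max(len(u), len(v))
    u = list(u) + [0] * (n - len(u))
    v = list(v) + [0] * (n - len(v))
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su and sv else 0.0


def dynamics_match(acvs_a, acvs_b, ed_cutoff=15.0):
    """Compare two sets of snapshot ACVs (11 each in the text).

    Returns (match, x, y, ed), where x and y are the percentages of CC
    values > 0.95 and > 0.9, and ed is the distance from (25, 50).
    """
    ccs = [pearson(u, v) for u in acvs_a for v in acvs_b]
    x = 100.0 * sum(c > 0.95 for c in ccs) / len(ccs)
    y = 100.0 * sum(c > 0.90 for c in ccs) / len(ccs)
    ed = math.hypot(x - 25.0, y - 50.0)
    match = (x >= 25.0 and y >= 50.0) or ed <= ed_cutoff
    return match, x, y, ed


# Hypothetical FL6 fragment: six C-alpha pseudoatoms 3.8 A apart on a line.
snapshot = [(3.8 * i, 0.0, 0.0) for i in range(6)]
vector = acv(snapshot)
result = dynamics_match([vector] * 11, [vector] * 11)
```

Because an ACV depends only on inter-residue distances, no structural superposition of the two fragments is needed before their dynamics are compared.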

Annotating proteins of unknown function (UF)

Proteins of known structure but unknown function (UF set) were obtained from the PDB database by searching with the keywords 'hypothetical', 'putative' or 'unknown function' and eliminating entries assigned with any GO MF annotation, which yielded 1894 protein chains, 485 of which had one or more cis-peptides. The F0 fragments were used to search for sequence and cis-peptide matches in the UF set. In case of matches, the dynamics (from 1 µs simulation with CGMM) of the paired fragments were compared. UF fragments matching in both cis-peptide and dynamics (criteria as defined in the previous section) with an F0 fragment were assigned the function associated with the F0 fragment.
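The annotation rule (sequence match with at most one mismatch, a cis-peptide at the corresponding position, and matching dynamics) can be sketched as follows. The data layout, the function names, the example sequences and the placeholder GO identifier are all hypothetical, and the dynamics check is passed in as a callable standing in for the ACV comparison.

```python
def hamming(a, b):
    """Number of mismatching positions between equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def find_matches(fragment_seq, chain_seq, max_mismatch=1):
    """Start indices in chain_seq where the fragment matches with at most
    `max_mismatch` residue substitutions."""
    fl = len(fragment_seq)
    return [
        start for start in range(len(chain_seq) - fl + 1)
        if hamming(fragment_seq, chain_seq[start:start + fl]) <= max_mismatch
    ]


def annotate(f0, chain_seq, chain_cis, dynamics_ok):
    """Return the F0 fragment's GO MF term if some hit also has a cis-peptide
    at the same fragment position and passes the dynamics check, else None.

    f0          : dict with 'seq', 'cis_offset' (index within the fragment)
                  and 'go' (hypothetical layout)
    chain_cis   : set of chain indices that carry a cis-peptide
    dynamics_ok : callable taking the hit start index, returning bool
    """
    for start in find_matches(f0["seq"], chain_seq):
        if start + f0["cis_offset"] in chain_cis and dynamics_ok(start):
            return f0["go"]
    return None


# Hypothetical F0 fragment and UF chain; the GO identifier is a placeholder.
f0_fragment = {"seq": "GSPYEA", "cis_offset": 2, "go": "GO:0000000"}
uf_chain = "MKTGSPYDALLV"
assigned = annotate(f0_fragment, uf_chain, {5}, lambda start: True)
```

Keeping the three checks separate makes it easy to count which criterion rejects a candidate, which is how the static-only and dynamics-only results can be compared with the combined one.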

RESULTS

Choosing the fragment length (FL) for analyses

The cis-peptide dataset used in the present study is about 4.5 times larger than the dataset used in our previous study (22). Our dataset consisted of 3113 protein chains (Supplementary Table S1) containing at least one cis-peptide, starting from a set of 9844 non-redundant protein chains at 25% sequence identity. The backbone was divided into fragments of length 6 – 10 and only the fragments containing a cis-peptide were clustered (Fig. 1). Table 1 describes the statistics for the relevant fragments and clusters. The optimal FL for our annotation studies should neither be very short (to avoid random sequence matches) nor very long (to avoid a large number of small-sized clusters and un-clustered fragments, that is, singletons). Moreover, the sequence library should be as diverse as possible, although the relevance of cis-peptides to function leads to higher sequence conservation than in trans-peptides. The diversity of the sequence library at different FL was assessed from the information content (IC) of the clusters. Lower IC indicates lower sequence diversity and hence higher sequence conservation at a particular fragment position in the cluster (IC = 0 for a fully conserved position, and increasingly positive in case of amino acid variability). The distributions of IC of FL 6 – 8 for both Pro and non-Pro cis-peptides (Fig. 2A and 2B) were similar, except for a peak around zero for non-Pro fragments. This denotes a significant number of highly conserved fragments, which is expected for Xaa-Xnp cis-bonds as they are known to be important for the structure or function of the corresponding proteins (14-16). The distribution of the average IC of FL6 clusters has a distinct peak in the range 0.4-0.6, which increases to the range 0.6-0.8 for FL8 fragments and is reversed for FL10. The peak values are significantly lower than those for fragments containing only trans-peptides (~2.0 (9)). These distributions are different from those of randomly-shuffled pseudo-clusters, pointing to their significance. Based on these values, FL6 and 8 were chosen for further analysis.

Preparing the functionally important fragment library (F0 dataset)

Clustered fragments exhibit an enrichment of GO MF terms which can be utilized to detect fragments relevant to the molecular function of the corresponding protein (9, 22). Hence, the GO MF terms of the PDB entries were assigned to the clustered FL6 and FL8 fragments and subsequently mapped to the parent GO MF at levels 3 – 5 using the GO graph (Supplementary Fig. S1). Some of the GO MFs can be at multiple levels depending on the path used to trace them to the root of the graph and can, therefore, have multiple parent terms at a given level. Any cluster, in general, contained fragments associated with different GO MFs at a specific level. Statistical propensity (Experimental Section) was calculated to assess the enriched GO terms. The largest fraction of fragments with high propensity values (peak of the distribution is ~2) was found to occur at level 3 and gradually decreased at levels 4 and 5 (Fig. 2C), for both FL6 and 8, indicating maximum coverage at level 3. Hence, results are reported here for GO mappings at level 3. The propensity values at FL6 were greater than those at FL8 for all three levels. Highly enriched GO terms in a cluster were detected from propensity values >>1. To confirm the statistical significance of the propensity values, especially when the number of fragments in a cluster was not large, p-values for the GO occurrence in the clusters were estimated using the hyper-geometric distribution. All GO terms with p-value ≤0.05 and propensities ≥20 were accepted as statistically significantly enriched, similar to our previous studies (9, 22). The fragments associated with these enriched GO terms were inferred to have functional relevance to their corresponding proteins and inducted into a library of 'functionally important fragments' (F0 dataset).
Our analyses yielded 2095 Xaa-cisPro and 369 Xaa-cisXnp (Xaa: any amino acid, Xnp: non-Pro amino acid) functionally important FL6 fragments composed of 848 unique Pro and 124 unique non-Pro cis-peptides. For FL8, we obtained 3128 Xaa-cisPro and 477 Xaa-cisXnp fragments composed of 856 Pro and 125 non-Pro cis-peptides.

Utility of the cis-fragment library alone in function annotation

The cis-fragments in the library were searched in a test set (comprising proteins of known function) for matches in sequence, cis-peptide and GO MF terms, and the true positive rate (TPR) and false positive rate (FPR) were calculated. High TPR and low FPR with a TPR-to-FPR ratio >1 indicate better predictive power in annotating the test proteins with our fragment library. Our analyses showed that with exact sequence matches, FL6 gives lower FPR and higher TPR than FL8 (Table 2). When one residue mismatch was allowed, the TPR and FPR increased for both FL, with the ratio of TPR-to-FPR >1 in all cases. However, fewer hits were obtained for FL8 than for FL6. Overall, FL6 with one residue mismatch seems better suited for detecting function matches and hence, annotation. Illustrative examples of match/mismatch in cis-peptide and GO MF for FL6 are in Supplementary Table S2. In most of the examples, the cis-fragments are found to be part of the binding site (ligand or another protein, e.g., PDB ID 2bo9B, 4klxA) and in some cases, are directly implicated in activity through cis-trans isomerization (e.g. PDB ID: 1bx7A, 2octA).

Suitability of the CGMM forcefield in reproducing cis-peptide dynamics

The CGMM forcefield (average CC = 0.74 ± 0.24 with NMR) has been previously found to outperform the MARTINI 2.2 forcefield (average CC = 0.54 ± 0.18 with NMR) in reproducing experimental protein dynamics from NMR, evaluated using RMSF profiles (43). Comparison of CGMM with CABSFlex (53, 54), which also produces an RMSF graph to compare to NMR-ensemble derived RMSF profiles, shows comparable results (44). To assess whether the CGMM forcefield can reproduce the experimental dynamics of the cis-peptides in the proteins under consideration, the distributions of root-mean-square fluctuation (RMSF) values obtained from the simulation were compared using correlation coefficient (CC) values with those from NMR structures (a non-redundant set of 28 single domain monomeric proteins) with 10 or more ensemble models. Supplementary Table S3 documents the comparison of RMSF profiles produced by the CGMM to those from NMR for the proteins with cis-peptides. ACV plots for a fragment of length six around the cis-peptide were also compared from the NMR and CGMM MD data (Supplementary Table S3). Overall, the CGMM is found to reproduce cis-peptide dynamics reliably (average CC = 0.86 ± 0.10 with NMR, compared with 0.57 ± 0.12 for MARTINI 2.2, with at least 50% CC values >0.9 and at least 25% CC values >0.95 for most of the fragments, satisfying the condition for dynamics match).

Use of fragment dynamics to improve function prediction power

Previous analyses have demonstrated the utility of cis-fragments alone in determining the function of homologous or unrelated proteins. To ascertain if incorporating dynamics information improves prediction, the ACV-based dynamics profiles were compared in a dataset consisting of 102 pairs each of protein chains with match or no match in cis-peptide (Experimental Section). We observed that for some pairs of fragments, the percentages of CC values greater than 0.9 and 0.95 narrowly missed the specified cut-offs (examples in Supplementary Table S2). Imposing the ED ≤15 criterion led to an overall improvement in the TPR and FPR (Table 3). Supplementary Table S2 documents some examples, together with the CC and ED values which indicate the extent of match in dynamics. Fig. 3 depicts two representative cases where the match/mismatch in dynamics is accompanied by a corresponding match/mismatch in GO MF. In most cases (94 out of 102 in the positive data set), fragments with matching cis-peptide segment and dynamics are found to be annotated with the same GO MF term (Supplementary Table S4; selected examples in Table S2 and Figs 3A-C). However, we found 5 cases where the GO MF terms did not match in spite of matching cis-fragments and dynamics. One of these (PDB IDs: 1kmvA and 3jtwA, Table S2) appears to be a possible mis-annotation. PDB ID: 3jtwA (from a structural genomics project) in the test set is a putative dihydrofolate reductase (similar to 1kmvA) but has been assigned a different GO MF term (5-amino-6-(5-phosphoribosylamino)uracil reductase activity) based on electronically inferred annotation from InterPro (https://www.ebi.ac.uk/interpro/). Such annotations are obtained from automated processes (e.g. hits in sequence similarity searches), are not curator-reviewed and are hence less reliable. The DHFR domain is present in both dihydrofolate reductase and bifunctional deaminase-reductase domain protein. Hence the function of the protein 3jtwA requires experimental verification. In the remaining cases, the fragments were found to have similar secondary structure (including loops) and may be contributing similarly to the overall function (e.g. the fragment is part of the binding interface in the PDB ID pair 2jdcA-2omzB, Table S2). 33 out of 102 fragments from the negative dataset (cis-peptide mismatch, i.e., trans-peptide) had a mismatch in dynamics and GO MF, and were fragment pairs from functionally unrelated proteins (Supplementary Table S4 and Table S2; e.g. PDB IDs 2ciwA-1uuxX). Unexpectedly, there were 6 cases with a GO match, although the sequences did not have a match in cis-peptide or dynamics. Four of these cases are homologs having ED>15 but