Machine Learning Applied to Predicting Microorganism Growth

May 22, 2019 - The number of microorganisms that can be cultured in the laboratory ... First, we build a machine learning model to accurately predict ...
0 downloads 0 Views 2MB Size
Research Article Cite This: ACS Synth. Biol. 2019, 8, 1411−1420

pubs.acs.org/synthbio

Machine Learning Applied to Predicting Microorganism Growth Temperatures and Enzyme Catalytic Optima Gang Li,† Kersten S. Rabe,‡ Jens Nielsen,†,§ and Martin K. M. Engqvist*,† †

Department of Biology and Biological Engineering, Chalmers University of Technology, SE-412 96 Gothenburg, Sweden Institute for Biological Interfaces 1 (IBG 1), Karlsruhe Institute of Technology (KIT), Group for Molecular Evolution, 76131 Karlsruhe, Germany § Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, DK-2800 Kgs. Lyngby, Denmark Downloaded via UNIV OF SOUTHERN INDIANA on July 17, 2019 at 09:22:48 (UTC). See https://pubs.acs.org/sharingguidelines for options on how to legitimately share published articles.



S Supporting Information *

ABSTRACT: Enzymes that catalyze chemical reactions at high temperatures are used for industrial biocatalysis, applications in molecular biology, and as highly evolvable starting points for protein engineering. The optimal growth temperature (OGT) of organisms is commonly used to estimate the stability of enzymes encoded in their genomes, but the number of experimentally determined OGT values are limited, particularly for thermophilic organisms. Here, we report on the development of a machine learning model that can accurately predict OGT for bacteria, archaea, and microbial eukaryotes directly from their proteome-wide 2-mer amino acid composition. The trained model is made freely available for reuse. In a subsequent step we use OGT data in combination with amino acid composition of individual enzymes to develop a second machine learning modelfor prediction of enzyme catalytic temperature optima (Topt). The resulting model generates enzyme Topt estimates that are far superior to using OGT alone. Finally, we predict Topt for 6.5 million enzymes, covering 4447 enzyme classes, and make the resulting data set available to researchers. This work enables simple and rapid identification of enzymes that are potentially functional at extreme temperatures. KEYWORDS: machine learning, optimal growth temperature, enzyme temperature optima, thermostable enzymes, sequence-based prediction

E

atures significantly higher or lower than the OGT.22,25 The Pearson correlation between the Topt of individual enzymes and OGT is only 0.48.22 In practice this means that enzymes from thermophilic organisms may be optimally active at significantly lower temperatures than expected. Due to these challenges a simple way to computationally estimate (1) the OGT of microbes and (2) the Topt of enzymes is in demand. For such computational estimates to be feasible there must be general trends for how quantifiable biological properties change with growth temperature, there must be a signal that can be modeled. The OGT of microorganisms is an important physiological parameter that has been widely used to understand the strategies organisms use to adapt their genomes and proteomes to different environmental conditions.26−28 Many genomic and proteomic features that are strongly correlated with OGT have been revealed.29 Examples include the existence of thermophile-specific enzymes,30 the presence or absence of certain dinucleotides,31 the GC content of structural RNAs,32 as well as amino acid composition of the proteome.26,33 Examples such as these indicate that estimating OGT directly from genomes or proteomes is indeed feasible.

nzymes that remain active at high temperatures, sometimes referred to as thermozymes, are used to catalyze chemical reactions in industrial processes,1−8 for applications in molecular biology,9−14 and for providing highly evolvable starting points for protein engineering.15−19 When testing new enzymes for these applications the optimal growth temperature (OGT) of microorganisms is commonly used to estimate protein stabilityenzymes derived from thermophilic organisms are expected to be both stable and active at high temperatures. Although frequently successful, using OGT as an estimate faces two challenges. First, for many microorganisms with experimentally determined OGT this information is not readily accessible. This challenge has been partially addressed through the creation of public databases and data sets.20−23 However, the OGT for the vast majority of microbial organisms is currently unknown since its experimental determination is a laborious process that requires cultivation in temperaturecontrolled conditions. The number of microorganisms that can be cultured in the laboratory is only a small fraction of the total diversity in nature.24 Consequently, many enzyme catalysts suitable for demanding biotechnological applications likely remain untested and undiscovered. Second, using OGT to estimate enzyme catalytic optima (Topt) constitutes a rough approximation, with many enzymes displaying Topt at temper© 2019 American Chemical Society

Received: March 5, 2019 Published: May 22, 2019 1411

DOI: 10.1021/acssynbio.9b00099 ACS Synth. Biol. 2019, 8, 1411−1420

Research Article

ACS Synthetic Biology

Figure 1. Variability in amino acid frequencies decrease log-linearly with proteome size. (a) Schematic overview of process to build a machine learning model to predict OGT. Protein records from Ensembl genomes (bacteria and fungi) and RefSeq (bacteria, archaea and fungi) were downloaded. Sequences were annotated with the growth temperature of the organism from which they originate. Sequences from organisms that could not be annotated, i.e., for which there is no available information about the OGT for the organism, were retained in a separate unannotated data set. Amino acid frequencies of the annotated sequences were used to train a statistical model. This model was in turn used to predict growth temperatures for the unannotated data set. OGT: optimal growth temperature. (b) The frequency of each amino acid was plotted against the number of amino acids used to calculate the frequency. The final third part was fitted to a linear model to get the absolute slope value (|ai|), as well as its frequency variance (σ2i ) and varying range (Ri). The maximal |ai|, σ2i , and Ri of 20 amino acids of each proteome (rabs,max, σ2max, and Rmax) give measures of whether frequencies were stable. The calculated (c) σ2max, (d) rabs,max, and (e) Rmax of all species in the data set were plotted against the number of amino acids in the proteome. The dashed line indicates the cutoff for the selection of proteomes based on size. Effect of protein order on (f) σ2max, (g) rabs,max, and (h) Rmax. Seventeen proteomes with different size were randomly selected. Proteins in each proteome were shuffled 100 times and the three metrics for each shuffled proteome were calculated. (i) Distribution of proteome sizes in the annotated data set after filtering. (j) Distribution of growth temperatures in the annotated data set after filtering. (k) Proportion of species belonging to the three different taxonomic superkingdoms in the filtered data set. 1412

DOI: 10.1021/acssynbio.9b00099 ACS Synth. Biol. 2019, 8, 1411−1420

Research Article

ACS Synthetic Biology

this data set, all proteins from 5761 organisms from RefSeq (https://www.ncbi.nlm.nih.gov/refseq/) and Ensembl genomes (http://ensemblgenomes.org/) could be associated with an OGT value (we refer to this as the annotated data set), while proteins from an additional 1803 organisms could not be associated with an OGT value (we refer to this as the unannotated data set) (Figure 1). For each organism in both the annotated and unannotated data set we calculated the global amino acid monomer and dipeptide frequencies. However, some organisms in the data set contain only a small number of protein sequences, and as a consequence the amino acid composition obtained from those sequences may not represent the true amino acid composition of the complete proteome. To address this problem we applied a filtering step. As it was unclear how many protein sequences are required to obtain a stable amino acid composition, we designed three different metrics (see Methods for details) to test how much protein sequence data was needed to obtain a stable amino acid composition. (Figure 1b). For each organism in the annotated data set the three metrics were calculated for every protein sequence added in order to observe at which point the values stop fluctuating. Using this analysis on amino acid monomer frequencies we found that at least 105 amino acids are needed to get a stable amino acid composition (Figure 1c,d,e). Repeating this analysis for amino acid dipeptides resulted in the same threshold (Figure S1). A further concern was that the order in which proteins appear in the input files may affect the cutoff analysis. For this reason, proteins from 17 organisms with different sizes of available proteomes were randomly selected. For each of these organisms the order in which protein sequences appear was shuffled and the three metrics were calculated. The shuffling, with subsequent analysis, was repeated 100 times. As expected, the analysis shows a high initial variability, where few sequences have been analyzed, but with increasing numbers of averaged proteins the values stabilized and converged (Figure 1f,g,h). From this analysis it is clear that the arrangement of proteins in a proteome has a negligible effect when the proteome size is larger than 105, and we therefore only chose organisms with at least 105 amino acids in the data set for further analysis. This approach resulted in a training data set with 5532 organisms annotated with OGT, as well as a data set with 1438 unannotated organisms. The annotated training data set comprises 4974 bacteria, 222 archaea, and 337 eukarya (Figure 1) and is much larger than those used in other approaches, such as 22 bacteria,31 77 bacteria,34 or 204 prokaryotes.33 In the annotated data set the number of proteins in each organism follows a normal distribution centered around 3000 (Figure 1i). The OGT distribution is, however, highly skewed with the majority of organisms having an OGT in the range 25−30 °C and at 37 °C (Figure 1j). The number of organisms in the data set with an OGT higher than 40 °C is 425 (340 for higher than 50 °C). OGT Can Be Accurately Predicted from Amino Acid Composition of the Proteome. For each organism in the annotated data set we calculated the global amino acid monomer frequencies (20 features) as well as amino acid dipeptide frequencies (400 features). To get the best feature set and statistical model for the prediction of OGT, we tested six different regression models and compared their performance on the monomer data set and the dipeptide data set. As shown in Figure 2a, a 5-fold cross-validation was applied to evaluate the performance of different regression models. Using

Statistical tools, such as regression and classification, have been used to model the correlation between OGT and biological features. For example, the OGT of 22 bacteria could be predicted using a linear combination of either dinucleotide or amino acid composition.31 Additionally, Zeldovich found that the sum fraction of the seven amino acids I, V, Y, W, R, E, and L showed a correlation coefficient as high as 0.93 with OGT in a data set consisting of 204 proteomes of archaea and bacteria.33 Jensen et al. developed a Bayesian classifier to distinguish three thermophilicity classes (thermophiles, mesophiles, and psychrophiles) based on 77 bacteria with known OGT.34 Training data sets containing the OGTs for a large number of organisms have been hard to obtain, something which has prevented the development of state-of-the-art machine learning models for OGT prediction. While there are only a few published models predicting organism OGT, we know of no computational tools to estimate Topt. However, many methods for the estimation of protein stability have been developed. We wish to emphasize the difference between these two measures as stability is an indication of the folding state of the protein, without information regarding catalytic activity, whereas Topt implicitly assumes stability and instead indicates the temperature of optimal catalysis. Methods for predicting protein stability fall into two main categories: predicting the stability of whole proteins, and predicting the stability change in a protein upon amino acid substitutions. Machine learning has been used extensively for the prediction of stability change upon amino acid substitutions,35−39 while only a few methods have been developed for the prediction of stability of whole protein empirically.40−43 Computational prediction of protein stability is challenging since it usually necessitates an accurate calculation of Gibbs-free energy change of protein unfolding process,42,43 which relies mainly on high-quality protein structures. Such structures are limited in number, thereby reducing the applicability of these methods for identifying thermostable enzymes for industrial applications. Here, we address the challenge of identifying proteins active at high temperatures in three steps. First, we build a machine learning model to accurately predict OGT using features extracted from all proteins encoded by an organism’s genome. This model is used to assign OGT values for organisms without experimental data. Second, we significantly improve the prediction of enzyme Topt values by using OGT in combination with sequence information on individual enzymes. Those predictions are significantly more accurate than using OGT alone for the prediction. Finally, we make use of the predictive models to estimate Topt for 6.5 million enzymes, covering 4447 enzyme classes in the BRENDA44 database. The OGT model and enzyme Topt estimates are made freely available for reuse (https://github.com/ EngqvistLab/Tome).



RESULTS AND DISCUSSION Collection of Optimal Growth Temperature and Proteomes of Microorganisms. Protein amino acid composition is strongly correlated with OGT.31,33 For this reason we decided to train machine learning models using the amino acid composition as features. To build such a model we first established a training data set. To this end, we downloaded an OGT data set (https://doi.org/10.5281/ zenodo.1175608), which contains data for 21 498 microorganisms, including bacteria, archaea, and eukarya.22 Using 1413

DOI: 10.1021/acssynbio.9b00099 ACS Synth. Biol. 2019, 8, 1411−1420

Research Article

ACS Synthetic Biology

train models that are more generally applicable. We find that in general nonlinear models outperform linear models when using amino acid frequencies (Figure 2a). This suggests that the linear models used previously such as that from Nakashima et al.31 might be further improved by nonlinear regression to correlate the amino acid frequencies to OGT. We speculate that our approach to predict OGT from the proteome-wide dipeptide composition may be applicable for predicting other parameters in microorganisms, such as pH optimum or salt tolerance. The feasibility of developing such models ultimately depends on whether proteome-wide sequence adaptations to these growth conditions exist. Validation of the SVR Model for Growth Temperature Prediction. Leveraging the final SVR model, the OGT of 1438 organisms in the unannotated data set were predicted (Figure 1a). These OGT predictions were validated using two separate approaches. First, we performed a manual literature search to find experimentally obtained OGTs for a subset of the organisms (for which no experimental OGT was present in our original data set). We randomly sampled 54 of the organisms with predicted OGTs, in a manner that ensured even spread across temperatures. For 45 of the 54 organisms, OGT values could indeed be found in published peer-reviewed articles (Table S1). The agreement between the predicted OGT and the ones collected from literature is very high, with a Pearson correlation coefficient of 0.96 (Figure 2c). Second, we seized on the fact that the average temperature optimum of catalysis (Topt) of at least five enzymes from an organism shows a Pearson correlation above 0.75 with growth temperature.22 Of the 1438 organisms with predicted OGT only 23 were found to have at least five enzymes with Topt available in BRENDA. Plotting the arithmetic mean of these enzyme optima against the predicted OGT for each organism reveals a strong correlation, with a Pearson’s correlation coefficient of 0.77 (Figure 2d). Indeed, this correlation is the same as that obtained with experimentally determined organism OGTs,22 again showing that the predicted OGTs are very accurate. Improved Estimation of Enzyme Temperature Optima Using Machine Learning. In biotechnology and protein engineering OGT is typically used directly to guide the discovery of thermostable enzymes.3,21 We hypothesized that the accuracy of this estimation could be improved by also considering enzyme sequence information in a machine learning framework. To test this hypothesis, a training data set was generated by collecting 2609 enzymes that (1) have Topt and protein sequence data in the BRENDA database, and (2) come from organisms with an experimentally determined OGT (Figure 3a,b). We first tested the accuracy of directly using OGT as an estimation of Topt and found that only 25% of the enzyme Topt variance could be explained (Figure 3c, black bar). Then, to improve the accuracy of this estimation using machine learning, we extracted three feature sets from the enzyme sequences, namely, amino acid frequencies, dipeptide frequencies and other basic protein properties like length, isoelectric point, etc. (see Methods). Six regression models were trained and tested on these feature sets individually, as well as the two and three sets combined, with a 5-fold crossvalidation approach. As shown in Figure 3c, the best model (SVR) trained on amino acid frequencies achieved a slightly improved accuracy compared to OGT, as quantified by an R2

Figure 2. Model development for OGT prediction. (a) R2 score obtained by a 5-fold cross-validation for six different regression models. Error bars represent the standard deviation of R2 scores. (b) Performance of the final SVR (support vector regression) model trained on dipeptide data. The correlation between predicted organism growth temperatures and those present in the original annotated data set was evaluated. RMSE: root-mean-square error. Colors indicate the density of the points. (c) Correlation between literature values for growth temperatures and predicted growth temperatures. Species for unannotated data set were sampled at random, but with ensuring equal coverage over the temperature range. Growth temperatures for these organisms were obtained by manually searching the primary scientific literature. (d) Correlation between the mean enzyme temperature optima and predicted growth temperatures for each species present in both data sets. Only organisms with optima for at least five enzymes are shown. Error bars show the standard deviation. In (c) and (d) r denotes Pearson’s correlation coefficient.

the 20 amino acid frequencies, nonlinear models (SVR and Random forest) perform much better than linear models (Linear, Elastic net, Bayesian ridge regression). The superior performance of nonlinear models suggests that there are important nonlinear relationships between amino acid frequencies and OGT. In contrast, all models except decision tree show an almost identical performance when using dipeptide frequencies. Since models trained on each of the two data sets individually show good performance, we reasoned that models trained on the combined data sets may be even better performing. However, contrary to this expectation the six models trained using both monomer data set and dipeptide data set together do not show improvement (Figure 2a). A final SVR model was trained on the whole dipeptide data set, without cross-validation, and stored for further use. This model can explain an astounding 95% (88% by crossvalidation) of the variance in OGT (Figure 2b). This model has significantly higher predictive accuracy than other published models (Figure S2 and Figure S3). We propose that the high predictive accuracy results from two features of our approach; the size and quality of the training data used, and the use of nonlinear regression models. As a direct consequence of the increased size of the data set, we could 1414

DOI: 10.1021/acssynbio.9b00099 ACS Synth. Biol. 2019, 8, 1411−1420

Research Article

ACS Synthetic Biology

Figure 3. Model development for prediction of enzyme temperature optima. (a) Schematic overview of process to build a Topt prediction model. (b) The distribution of enzyme temperature optima in training data set. (c) 5-fold cross-validation results for five regression models on different feature sets. The Topt = OGT bar shows the explained variance when using OGT as the estimation of enzyme Topt. Error bars shows the standard deviation of R2 scores obtained in 5-fold cross validation. AA, amino acid frequencies; Dipeptide, dimer frequencies; OGT, optimal growth temperature of source organism; Basic, basic information on proteins, like length, isoelectric point etc., see details in Methods section. (d) Performance of the final random forest model trained on AA+OGT data. The correlation between predicted and experimental Topt was evaluated. RMSE: root-mean-square error. Colors indicate the density of the points. (e) The feature importance in the final random forest model. Error bars indicates the standard deviation of feature importance of 1000 estimators.

inclusion of other basic enzyme properties (see Methods) did not further improve the accuracy (Figure 3c). To generate a final model for the prediction of enzyme temperature optima the random forest model was retrained without cross-validationusing the full set of amino acid frequencies and OGT data (Figure 3d). In this final model, OGT of the source organism is the most informative individual feature, whereas the 20 amino acid frequencies combined contribute over half of the predictive power of the model (Figure 3e). This is a remarkable result that demonstrates a clear importance of combining physiological parameters, such

score of around 30%. Using dipeptide frequencies alone in combination with amino acid frequencies did not further improve the accuracy. Models separately trained on OGT and sequence-derived features produce estimates of similar accuracy (25% and 30%, respectively). Therefore, we tested whether their combined use could boost predictive power. In line with our original hypothesis the best model (random forest) trained on the combination of amino acid frequencies and OGT almost doubled the model predictive accuracy to over 50%. Further 1415

DOI: 10.1021/acssynbio.9b00099 ACS Synth. Biol. 2019, 8, 1411−1420

Research Article

ACS Synthetic Biology

Figure 4. Prediction of enzyme temperature optima. (a) Visual representation of the number of the enzymes with experimental Topt in BRENDA and the number of enzymes for which Topt was predicted leveraging experimental and predicted OGTs. Each box represents 33 050 enzymes. There are 12 115 011 enzymes in total. Pred. is an abbreviation of predicted. (b) A visual representation of the Topt temperature coverage for each EC number after annotation. The span between the highest and lowest Topt for each enzyme is indicated. Experimental (BRENDA) and predicted Topt values are shown in different colors. (c) Comparison between OGT of source organism and predicted and experimental Topt values of enzymes. Colors indicate the density of the points.

connected to a protein sequence. We made use of the Topt prediction model to provide Topt estimates for a majority of enzymes in BRENDA. First, experimentally determined OGTs22 and the 1438 OGTs predicted with the final OGT SVR model (Figure 2b) were combined to generate a data set containing the OGT of 22 936 microorganisms. Using this combined data set 6 507 076 out of 12 115 011 enzymes (54%) in BRENDA could be annotated with the OGT value of their source organism, of which 909 954 enzymes (14%) were contributed by the predicted OGT values. In a second step our Topt random forest model (Figure 3c) was applied to the 6.5 million OGT annotations combined with the amino acid frequencies of individual enzymes to estimate the Topt. The resulting predictions dramatically added to the Topt values in BRENDA, increasing them 197-fold (Figure 4a) and covering 4447 different EC numbers (Figure 4b). Moreover, the temperature coverage, i.e., the minimal and maximal Topt for an enzyme class, of the vast majority these EC numbers (3721 of 4447) were expanded (Figure 4b). The predicted enzyme Topt and annotated OGT values of these enzymes are freely available for download and reuse (https://zenodo.org/record/2539114, https://doi.org/10.5281/zenodo.2539114). As can be seen in Figure 4c, many of the predicted enzyme Topt values differ significantly from the OGT of the source organism. For enzymes from organisms with OGT below 40 °C many have Topt higher than the OGT. In contrast, enzymes from thermophiles generally have a lower Topt than the OGT. These results are in good agreement with previous findings comparing experimental OGT of organisms with average enzyme Topt.22 For three representative organisms we show

as OGT, with sequence information in the estimation of protein properties. For its predictive power our final Topt model is surprisingly simplistic, making use of only protein amino acid frequencies and OGT for a total of 21 features. We speculate that using more descriptive features may result in improved accuracy of future models. Indeed, in works where other enzyme properties are modeled, such as enantioselectivity, the importance of choosing an appropriate framework to represent enzyme sequence information is emphasized.45,46 The choice of which input features to use has a large impact on the quality of resulting models and must be tailored to each application. In terms of improving Topt estimates, pseudo-amino-acid composition,47 secondary structure content,48 amino acid physicochemical and biochemical properties49 as well as computed sequence embeddings 50−53 all provide rich representations of protein primary sequences, generating hundreds of features that could be used to create improved models. 3D-structures could also be used to obtain additional features.54,55 However, inclusion of such structural features would limit predictions to the small subset of proteins for which a high-resolution structure exists. Deep learning56,57 may also represent a superior algorithmic framework for enzyme Topt predictions. Regardless of the path taken, the R2 score of 51% obtained for Topt predictions in this study serves as a benchmark for future model development. Annotating Enzymes in BRENDA Using OGT and Predicted Topt. Currently, a main resource for enzyme data is the BRENDA database.44 However, there are approximately 12 million native protein sequences in BRENDA while there are only about 33 000 Topt records, many of which are not 1416

DOI: 10.1021/acssynbio.9b00099 ACS Synth. Biol. 2019, 8, 1411−1420

Research Article

ACS Synthetic Biology that the distribution of predicted Topt values are broadly consistent with experimental values, although the highest and lowest experimental values deviate more from the OGT than predicted ones (Figure S4). This pattern likely results from the heavy weighting of OGT in our model (Figure 3e), something that could be addressed by further improvement of the model as outlined in the section above. The predicted Topt values provided here represent a rich resource for identifying enzymes suitable for bioprocess carried out at high temperatures. Tome: A Command Line Tool for OGT Prediction and Identification of Enzyme Homologues with Different Topt. To ensure easy access to the OGT predictive model for the scientific community, as well as the enzyme data annotated with OGT of their source organism and estimated Topt, we developed the command line tool Tome (Temperature optima for microorganisms and enzymes). This tool is simple to use and has two fundamental applications: (1) prediction of OGT from a file containing protein sequences encoded by an organism’s genome; (2) identification of functional homologues within a specified temperature range for an enzyme of interest. For the prediction of OGT, a list of proteomes in fasta format58 is provided as input, and the temperature predictions are returned as an output. While this tool will perform predictions on any input given, we stress that the tool has been trained on bacteria, archaea, and a only small set of eukarya mostly fungi and protists. Predictions on organisms that do not fall into these categories may result in inaccurate results. For the identification of enzyme functional homologues with different estimated Topt, one can either simply specify an EC number and temperature range of interest to get all enzyme sequences from BRENDA matching the criteria. Alternatively, the sequence of an enzyme of interest can be provided in fasta format. The algorithm will then perform a protein BLAST,36 and an additional output file will be generated containing only homologous enzymes (default e-value cutoff is 10−10) within the specified temperature range. Full instructions regarding installation and usage of the Tome tool is available online (https://github.com/EngqvistLab/Tome).

In many cases the Ensembl Genomes and RefSeq data sets both contained information for the same organism, or for several strains of the same organism. Therefore, to combine the two data sets, the following steps were followed: First, where multiple strains from the same organism were present in the Ensembl Genomes data set, the strain with the largest file size, indicating the greatest number of amino acids in the downloaded fasta file, was selected for analysis. Other strains for that organism were discarded. Second, where the same organism was present in both the Ensembl Genomes and RefSeq data sets the one from Ensembl Genomes was retained and the one from RefSeq was discarded. In this way a protein data set comprising protein sequence data for 7565 microorganisms was obtained. Of these 5325 originated from Ensembl Genomes and 2240 originated from RefSeq. For each organism in the protein data set we attempted to annotate it with its optimal growth temperature. In this annotation procedure organism names were stemmed to the species level (ignoring strain designations) and crossreferenced with a published data set containing growth temperatures for 21 498 microorganisms (https://doi.org/10. 5281/zenodo.1175608). Growth temperatures could be associated with the protein sequence data from 5762 organisms, whereas 1803 were left unannotated. Estimation of Threshold. For each proteome, the total length of each protein was calculated. Then the amino acid frequencies and the total number of residues of the first n proteins (n = 1, 2, ..., N, where N is the total number of proteins) were calculated sequentially. The data points in the last one-third of all residues added were used to measure the stability of the calculated amino acid frequencies. Three different metrics were designed: (1) the absolute slope value (| ai|) in the linear regression between the number of residues and amino acid frequency; (2) frequency variance of these selected frequencies (σ2i ); and (3) varying range (Ri), the difference between maximal frequency and minimal frequency (i indicates each of 20 amino acids). Ideally, 0 was expected for all these three metrics if there is an absolutely stable amino acid frequency in a given proteome. Finally, for each proteome, the maximal |ai|, σ2i , and Ri of 20 amino acids of each proteome (rabs,max, σ2max, and Rmax) were used to measure whether frequencies were stable. To test the effect of the protein order in a proteome in the above analysis, a shuffling strategy was applied. First, equal coverage over the log10-transformed proteome size range 3− 7.5 was ensured by performing the random sampling in 20 bins. One proteome was randomly selected for each bin and this resulted in 17 selected proteomes as there is no proteome in 3 of these bins. The order of the proteins in each proteome was randomly shuffled and then rabs,max, σ2max, and Rmax were calculated. Each proteome was shuffled for 100 times. Machine Learning Workflow for OGT Model. Twenty amino acid frequencies and 400 dipeptide frequencies were extracted for each proteome. Then, each of these features were x −u normalized by xN , i = i δ i , where xi is the values of feature i,



METHODS Software. All machine learning analysis were conducted with scikit-learn package (version 0.19.1)59 using Python version 2.7.14. The module and model hyperparameters used for optimization are listed in Supplementary Table S2. Python code for proteome analysis, machine learning, and data visualization are available from the authors upon request. The source code for the Tome package is available under a permissive GPLv3 license at GitHub (https://github.com/ EngqvistLab/Tome). Proteome Data Set. The bulk of protein sequence data used in this work was obtained from Ensembl Genomes release 37, obtained in September 2017 (http://ensemblgenomes.org/ ). For all archaea and bacteria listed at ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/fasta/ fasta files containing protein sequences were downloaded. Similarly, fasta files containing protein sequences for all fungi listed at ftp:// ftp.ensemblgenomes.org/pub/fungi/release-37/fasta/ were downloaded. As a complement to the Ensembl Genome data we made use of protein data from RefSeq release 87, obtained in March 2017 (https://www.ncbi.nlm.nih.gov/refseq/). Fasta files containing a nonredundant set of protein sequences for each organism were downloaded from ftp://ftp.ncbi.nlm.nih.gov/refseq/release/ for archaea, bacteria, fungi and protozoa.

i

and ui and δi are mean and standard deviation of xi, respectively. The following six models were selected, and their performance was tested on the annotated and filtered proteome data set using single amino acid frequencies (AA), dipeptide frequencies (Dipeptide), or the two together (AA +Dipeptide): Linear regression (Linear), Bayesian ridge, elastic net, decision tree, support vector regression (SVR), and 1417

DOI: 10.1021/acssynbio.9b00099 ACS Synth. Biol. 2019, 8, 1411−1420

Research Article

ACS Synthetic Biology

the Machine Learning Workflow for OGT Model section. The following five models were tested on the resulting data set: Bayesian ridge, elastic net, decision tree, support vector regression (SVR), and random forest. The linear model was not used due to its poor performance on any data sets containing dipeptide frequencies (negative R2 scores in crossvalidation). The performance of the five regression models was tested using the same cross-validation strategy as for OGT. In addition, to test the accuracy of using OGT of the organism as an estimation of enzyme Topt, the R2 score for using organism’s OGT value as estimation of enzyme Topt was calculated. The model with the highest R2 score was chosen and trained on the full training data set. BRENDA Annotation. Protein sequence data for each EC class was obtained by downloading comma-separated flatfiles from the BRENDA database version 2018.2 (July 2018). Each sequence in these files contain information regarding source organism as well as unique UniProt identifiers. Where possible, each protein sequence was associated with an OGT value by mapping the source organism name to the OGT data set from https://doi.org/10.5281/zenodo.1175608. Those sequences were first mapped to the existing Topt values in BRENDA by matching EC-UniProt id pair. For those enzymes without any experimental Topt values, the amino acid frequencies were calculated (ignore all “X” in the sequence). All 20 amino acid frequencies as well as the OGT value were normalized by x −u xN , i = i δ i , where xi is the values of feature i. ui and δi are

random forest. 5-fold cross-validation was used for the calculation of R2 scores. For SVR, elastic net, decision tree, and random forest models, an additional 3-fold internal crossvalidation were used to optimize the hyperparameters with a grid-search strategy (Supplementary Table S2). The model with the highest R2 score was selected and trained on the whole data set. For the prediction of OGT for those unannotated organisms, dipeptide frequencies were normalized x −u by xN , i = i δ i , where xi is the values of feature i. ui and δi are i

mean and standard deviation of feature i in the training data set, respectively. OGT Prediction Validation. For validating the predicted OGTs of unannotated organisms we sampled 54 species with predicted growth temperatures (for which no growth temperatures were available in the original data set) at random. Equal coverage over the temperature range 0−100 °C was ensured by performing the random sampling in 10 bins, each spanning a 10 °C temperature range. The primary scientific literature was then manually searched to obtain documented experimental growth temperatures for the sampled organisms. For 45 organisms a documented growth temperature could be found, and for 9 organisms it could not. The accuracy of predicted OGT was assessed by computing the Pearson correlation with experimental OGT. In a second approach to validate the OGT prediction we used Python scripts and the Zolera SOAP package (https:// pypi.python.org/pypi/ZSI/) to extract all available experimentally determined enzyme temperature optima from the BRENDA enzyme database https://www.brenda-enzymes. org/ release 2018.2 (July 2018). Data coming from the same enzyme was deduplicated by averaging temperature optima from records with the same EC number and originating from the same organism. For each organism with catalytic optima for more than five enzymes the arithmetic mean of those optima were calculated. Those organisms present in both the BRENDA enzyme data as well as the data set with predicted OGT were identified through cross-referencing species names. The accuracy of predicted OGT was assessed by computing the Pearson correlation between predicted OGT and mean catalytic optima of enzymes. Machine Learning Workflow for Topt Model. UniProt identifiers for proteins with an experimentally determined catalytic optimum were obtained from the “TEMPERATURE OPTIMUM” table in the web pages of the BRENDA database, release 2018.2 (July 2018). These identifiers were filtered to retain only those associated with an organism with experimentally determined OGT. After further filtering to remove sequences containing “X” (unknown amino acid), a data set with 2609 enzymes was generated. The protein sequences for each of these identifiers were downloaded from the UniProt database in fasta format. The following features were extracted for each enzyme: (1) 20 amino acid frequencies (AA); (2) 400 dipeptide frequencies (Dipeptide); (3) OGT of its source organism; (4) basic features including protein length, isoelectric point, molecular weight, aromaticity,60 instability index,61 gravy,62 and fraction of three secondary structure units: helix, turn, and sheet. These features were extracted with the module Bio.SeqUtils.ProtParam.ProteinAnalysis in Biopython63 (version 1.70). Additionally, six binary features were extracted: EC = 1, 2, 3, 4, 5, 6. These numbers represent the first digit in a EC number. All features except binary features were normalized as described in

i

mean and standard deviation of feature i in the original training data set, respectively. Finally, the normalized values were used for the prediction of Topt by the previously generated random forest regressor trained on the AA+OGT data sets. The predicted enzyme Topt and annotated OGT values of these enzymes are freely available for download and reuse (https:// zenodo.org/record/2539114, https://doi.org/10.5281/ zenodo.2539114).



ASSOCIATED CONTENT

S Supporting Information *

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acssynbio.9b00099.



Figures S1−S4 and Tables S1−S2 (PDF)

AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]. Phone: +46 (0)31 772 8171. ORCID

Kersten S. Rabe: 0000-0001-7909-8191 Martin K. M. Engqvist: 0000-0003-2174-2225 Author Contributions

GL, JN, and MKME conceptualized the research. JN acquired funding to support the project. MKME generated the proteome and BRENDA data sets and performed data curation. GL performed the computational and statistical data analysis. GL, KSR, and MKME interpreted results. GL wrote the computer code for the Tome package. GL and MKME created the publication figures. GL and MKME wrote the initial draft of the paper. GL, KSR, JN, and MKME carried out revisions on the initial draft and wrote the final version. 1418

DOI: 10.1021/acssynbio.9b00099 ACS Synth. Biol. 2019, 8, 1411−1420

Research Article

ACS Synthetic Biology Notes

(15) Bloom, J. D., Labthavikul, S. T., Otey, C. R., and Arnold, F. H. (2006) Protein stability promotes evolvability. Proc. Natl. Acad. Sci. U. S. A. 103, 5869−5874. (16) Kan, S. B. J., Lewis, R. D., Chen, K., and Arnold, F. H. (2016) Directed evolution of cytochrome c for carbon-silicon bond formation: Bringing silicon to life. Science 354, 1048−1051. (17) Rigoldi, F., Donini, S., Redaelli, A., Parisini, E., and Gautieri, A. (2018) Review: Engineering of thermostable enzymes for industrial applications. APL Bioengineering 2, 011501. (18) Finch, A., and Kim, J. (2018) Thermophilic Proteins as Versatile Scaffolds for Protein Engineering. Microorganisms 6, 97. (19) Camps, M., Herman, A., Loh, E., and Loeb, L. A. (2007) Genetic constraints on protein evolution. Crit. Rev. Biochem. Mol. Biol. 42, 313−326. (20) Söhngen, C., Podstawka, A., Bunk, B., Gleim, D., Vetcininova, A., Reimer, L. C., Ebeling, C., Pendarovski, C., and Overmann, J. (2016) BacDive-The Bacterial Diversity Metadatabase in 2016. Nucleic Acids Res. 44, D581−585. (21) Pezeshgi Modarres, H., Mofrad, M. R., and Sanati-Nezhad, A. (2018) ProtDataTherm: A database for thermostability analysis and engineering of proteins. PLoS One 13, No. e0191222. (22) Engqvist, M. K. (2018) Correlating enzyme annotations with a large set of microbial growth temperatures reveals metabolic adaptations to growth at diverse temperatures. BMC Microbiol. 18, 177. (23) Richards, M. A., Cassen, V., Heavner, B. D., Ajami, N. E., Herrmann, A., Simeonidis, E., and Price, N. D. (2014) MediaDB: A Database of Microbial Growth Conditions in Defined Media. PLoS One 9, No. e103548. (24) Rappé, M. S., and Giovannoni, S. J. (2003) The uncultured microbial majority. Annu. Rev. Microbiol. 57, 369−394. (25) Dehouck, Y., Folch, B., and Rooman, M. (2008) Revisiting the correlation between proteins’ thermoresistance and organisms’ thermophilicity. Protein Eng., Des. Sel. 21, 275−278. (26) Hickey, D. A., and Singer, G. A. (2004) Genomic and proteomic adaptations to growth at high temperature. Genome Biol. 5, 117. (27) Saunders, N. F. W., et al. (2003) Mechanisms of thermal adaptation revealed from the genomes of the Antarctic Archaea Methanogenium frigidum and Methanococcoides burtonii. Genome Res. 13, 1580−1588. (28) Venev, S. V., and Zeldovich, K. B. (2018) Thermophilic Adaptation in Prokaryotes Is Constrained by Metabolic Costs of Proteostasis. Mol. Biol. Evol. 35, 211−224. (29) Sauer, D. B., and Wang, D.-N. (2019) Predicting the optimal growth temperatures of prokaryotes using only genome derived features. Bioinformatics, 1−8. (30) Forterre, P. (2002) A hot story from comparative genomics: reverse gyrase is the only hyperthermophile-specific protein. Trends Genet. 18, 236−237. (31) Nakashima, H., Fukuchi, S., and Nishikawa, K. (2003) Compositional changes in RNA, DNA and proteins for bacterial adaptation to higher and lower temperatures. J. Biochem. 133, 507− 513. (32) Galtier, N., and Lobry, J. R. (1997) Relationships between genomic G+C content, RNA secondary structures, and optimal growth temperature in prokaryotes. J. Mol. Evol. 44, 632−636. (33) Zeldovich, K. B., Berezovsky, I. N., and Shakhnovich, E. I. (2007) Protein and DNA Sequence Determinants of Thermophilic Adaptation. PLoS Comput. Biol. 3, No. e5. (34) Jensen, D. B., Vesth, T. C., Hallin, P. F., Pedersen, A. G., and Ussery, D. W. (2012) Bayesian prediction of bacterial growth temperature range based on genome sequences. BMC Genomics 13, S3. (35) Fariselli, P., Martelli, P. L., Savojardo, C., and Casadio, R. (2015) INPS: predicting the impact of non-synonymous variations on protein stability from sequence. Bioinformatics 31, 2816−2821.

The authors declare no competing financial interest.



ACKNOWLEDGMENTS The computations were performed on resources at Chalmers Centre for Computational Science and Engineering (C3SE) provided by the Swedish National Infrastructure for Computing (SNIC). We thank Pia Schwitters for her assistance with performing the manual literature search for organism growth temperatures. GL and JN have received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie program, project PAcMEN (Grant Agreement No. 722287). We also acknowledge funding from the Novo Nordisk Foundation (Grant No. NNF10CC1016517) and the Knut and Alice Wallenberg Foundation. KSR acknowledges financial support by the Helmholtz program “BioInterfaces in Technology and Medicine”.



REFERENCES

(1) Demirjian, D. C., Morís-Varas, F., and Cassidy, C. S. (2001) Enzymes from extremophiles. Curr. Opin. Chem. Biol. 5, 144−151. (2) Turner, P., Mamo, G., and Karlsson, E. N. (2007) Potential and utilization of thermophiles and thermostable enzymes in biorefining. Microb. Cell Fact. 6, 9. (3) Vieille, C., and Zeikus, G. J. (2001) Hyperthermophilic enzymes: sources, uses, and molecular mechanisms for thermostability. Microbiol. Mol. Biol. Rev. 65, 1−43. (4) Haki, G. D., and Rakshit, S. K. (2003) Developments in industrially important thermostable enzymes: a review. Bioresour. Technol. 89, 17−34. (5) Elleuche, S., Schröder, C., Sahm, K., and Antranikian, G. (2014) Extremozymes biocatalysts with unique properties from extremophilic microorganisms. Curr. Opin. Biotechnol. 29, 116−123. (6) Jemli, S., Ayadi-Zouari, D., Hlima, H. B., and Bejar, S. (2016) Biocatalysts: application and engineering for industrial purposes. Crit. Rev. Biotechnol. 36, 246−258. (7) Yeoman, C. J., Han, Y., Dodd, D., Schroeder, C. M., Mackie, R. I., and Cann, I. K. O. (2010) Thermostable enzymes as biocatalysts in the biofuel industry. Adv. Appl. Microbiol. 70, 1−55. (8) Kumar, S., Dangi, A. K., Shukla, P., Baishya, D., and Khare, S. K. (2019) Thermozymes: Adaptive strategies and tools for their biotechnological applications. Bioresour. Technol. 278, 372−382. (9) Chien, A., Edgar, D. B., and Trela, J. M. (1976) Deoxyribonucleic acid polymerase from the extreme thermophile Thermus aquaticus. J. Bacteriol. 127, 1550−1557. (10) Saiki, R. K., Gelfand, D. H., Stoffel, S., Scharf, S. J., Higuchi, R., Horn, G. T., Mullis, K. B., and Erlich, H. A. (1988) Primer-directed enzymatic amplification of DNA with a thermostable DNA polymerase. Science 239, 487−491. (11) Moser, M. J., DiFrancesco, R. A., Gowda, K., Klingele, A. J., Sugar, D. R., Stocki, S., Mead, D. A., and Schoenfeld, T. W. (2012) Thermostable DNA Polymerase from a Viral Metagenome Is a Potent RT-PCR Enzyme. PLoS One 7, No. e38371. (12) Chander, Y., Koelbl, J., Puckett, J., Moser, M. J., Klingele, A. J., Liles, M. R., Carrias, A., Mead, D. A., and Schoenfeld, T. W. (2014) A novel thermostable polymerase for RNA and DNA loop-mediated isothermal amplification (LAMP). Front. Microbiol. 5, 395. (13) Takahashi, M., Yamaguchi, E., and Uchida, T. (1984) Thermophilic DNA ligase. Purification and properties of the enzyme from Thermus thermophilus HB8. J. Biol. Chem. 259, 10041−10047. (14) Heller, R. C., Chung, S., Crissy, K., Dumas, K., Schuster, D., and Schoenfeld, T. W. (2019) Engineering of a thermostable viral polymerase using metagenome-derived diversity for highly sensitive and specific RT-PCR. Nucleic Acids Res. 47, 3619−3630. 1419

DOI: 10.1021/acssynbio.9b00099 ACS Synth. Biol. 2019, 8, 1411−1420

Research Article

ACS Synthetic Biology (36) Quan, L., Lv, Q., and Zhang, Y. (2016) STRUM: structure− based prediction of protein stability changes upon single-point mutation. Bioinformatics 32, 2936−2946. (37) Dehouck, Y., Grosfils, A., Folch, B., Gilis, D., Bogaerts, P., and Rooman, M. (2009) Fast and accurate predictions of protein stability changes upon mutations using statistical potentials and neural networks: PoPMuSiC-2.0. Bioinformatics 25, 2537−2543. (38) Guerois, R., Nielsen, J. E., and Serrano, L. (2002) Predicting changes in the stability of proteins and protein complexes: a study of more than 1000 mutations. J. Mol. Biol. 320, 369−387. (39) Chen, C.-W., Lin, J., and Chu, Y.-W. (2013) iStable: off-theshelf predictor integration for predicting protein stability changes. BMC Bioinf. 14, S5. (40) Ku, T., Lu, P., Chan, C., Wang, T., Lai, S., Lyu, P., and Hsiao, N. (2009) Predicting melting temperature directly from protein sequences. Comput. Biol. Chem. 33, 445−450. (41) Dill, K. A., Ghosh, K., and Schmit, J. D. (2011) Physical limits of cells and proteomes. Proc. Natl. Acad. Sci. U. S. A. 108, 17876− 17882. (42) Murphy, K. P., and Freire, E. (1993) Structural energetics of protein stability and folding cooperativity. Pure Appl. Chem. 65, 1939−1946. (43) Oobatake, M., and Ooi, T. (1993) Hydration and heat stability effects on protein unfolding. Prog. Biophys. Mol. Biol. 59, 237−284. (44) Jeske, L., Placzek, S., Schomburg, I., Chang, A., and Schomburg, D. (2019) BRENDA in 2019: a European ELIXIR core data resource. Nucleic Acids Res. 47, D542−D549. (45) Li, G., Dong, Y., and Reetz, M. (2019) Can Machine Learning Revolutionize Directed Evolution of Selective Enzymes? Adv. Synth. Catal., DOI: 10.1002/adsc.201900149. (46) Wu, Z., Kan, S. B. J., Lewis, R. D., Wittmann, B. J., and Arnold, F. H. (2019) Machine learning-assisted directed protein evolution with combinatorial libraries. Proc. Natl. Acad. Sci. U. S. A. 116, 8852− 8858. (47) Chou, K.-C. (2001) Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins: Struct., Funct., Genet. 43, 246−255. (48) Magnan, C. N., and Baldi, P. (2014) SSpro/ACCpro 5: almost perfect prediction of protein secondary structure and relative solvent accessibility using profiles, machine learning and structural similarity. Bioinformatics 30, 2592−2597. (49) Kawashima, S., Ogata, H., and Kanehisa, M. (1999) AAindex: amino acid index database. Nucleic Acids Res. 27, 368−369. (50) Asgari, E., and Mofrad, M. R. K. (2015) Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics. PLoS One 10, No. e0141287. (51) Yang, K. K., Wu, Z., Bedbrook, C. N., and Arnold, F. H. (2018) Learned protein embeddings for machine learning. Bioinformatics 34, 2642−2648. (52) Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M., and Church, G. M. (2019) Unified rational protein engineering with sequence only deep representation learning. bioRxiv 589333. (53) Heinzinger, M., Elnaggar, A., Wang, Y., Dallago, C., Nachaev, D., Matthes, F., and Rost, B. (2019) Modeling the Language of LifeDeep Learning Protein Sequences. bioRxiv 614313. (54) Pucci, F., and Rooman, M. (2017) Physical and molecular bases of protein thermal stability and cold adaptation. Curr. Opin. Struct. Biol. 42, 117−128. (55) Pugalenthi, G., Kumar, K. K., Suganthan, P. N., and Gangal, R. (2008) Identification of catalytic residues from protein structure using support vector machine with sequence and structural features. Biochem. Biophys. Res. Commun. 367, 630−634. (56) LeCun, Y., Bengio, Y., and Hinton, G. (2015) Deep learning. Nature 521, 436−444. (57) Min, S., Lee, B., and Yoon, S. (2016) Deep learning in bioinformatics. Briefings Bioinf. 18, 851−869. (58) Pearson, W. R., and Lipman, D. J. (1988) Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. U. S. A. 85, 2444−2448.

(59) Pedregosa, F., et al. (2011) Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 12, 2825−2830. (60) Lobry, J. R., and Gautier, C. (1994) Hydrophobicity, expressivity and aromaticity are the major trends of amino-acid usage in 999 Escherichia coli chromosome-encoded genes. Nucleic Acids Res. 22, 3174−3180. (61) Guruprasad, K., Reddy, B. V., and Pandit, M. W. (1990) Correlation between stability of a protein and its dipeptide composition: a novel approach for predicting in vivo stability of a protein from its primary sequence. Protein Eng., Des. Sel. 4, 155−161. (62) Kyte, J., and Doolittle, R. F. (1982) A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 157, 105−132. (63) Cock, P. J. A., Antao, T., Chang, J. T., Chapman, B. A., Cox, C. J., Dalke, A., Friedberg, I., Hamelryck, T., Kauff, F., Wilczynski, B., and de Hoon, M. J. L. (2009) Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422−1423.

1420

DOI: 10.1021/acssynbio.9b00099 ACS Synth. Biol. 2019, 8, 1411−1420