Article Cite This: J. Proteome Res. 2019, 18, 2735−2746

pubs.acs.org/jpr

PTML Model of Enzyme Subclasses for Mining the Proteome of Biofuel Producing Microorganisms

Riccardo Concu,*,† M. Natália D. S. Cordeiro,†,# Cristian R. Munteanu,‡,§,▽ and Humberto González-Díaz*,∥,⊥




† LAQV@REQUIMTE/Department of Chemistry and Biochemistry, Faculty of Sciences, University of Porto, 4169-007 Porto, Portugal
‡ RNASA-IMEDIR, Computer Science Faculty, University of A Coruña, 15071 A Coruña, Spain
§ INIBIC Biomedical Research Institute of Coruña, CHUAC University Hospital, 15006 A Coruña, Spain
∥ Department of Organic Chemistry II, University of Basque Country UPV/EHU, 48940 Leioa, Biscay, Spain
⊥ IKERBASQUE, Basque Foundation for Science, 48011 Bilbao, Biscay, Spain
* Supporting Information

ABSTRACT: Predicting enzyme function and enzyme subclasses is a key objective in fields such as biotechnology, biochemistry, medicinal chemistry, and physiology. The Protein Data Bank (PDB) is the largest information archive of biological macromolecular structures, with more than 150 000 entries for proteins, nucleic acids, and complex assemblies. Among these entries, there are more than 4000 proteins whose functions remain unknown because no detectable homology to proteins of known function has been found. The problem is that our ability to isolate proteins and identify their sequences far exceeds our ability to assign them a defined function. As a result, there is growing interest in this topic, and several alternative methods have been developed to identify protein function. In this work, we applied perturbation theory to an original data set of 19 187 enzymes representing all 59 subclasses present in the Protein Data Bank. In addition, we developed a series of artificial neural network models able to predict enzyme−enzyme pairs of query-template sequences with accuracy, specificity, and sensitivity greater than 90% in both training and validation series. As a likely application of this methodology, and to further validate our approach, we used our novel model to predict a set of enzymes belonging to the yeast Pichia stipitis. This yeast has been widely studied because it is commonly present in nature and produces a high ethanol yield by converting lignocellulosic biomass into bioethanol through the xylose reductase enzyme. On this premise, we tested our model on 222 enzymes including xylose reductase, the enzyme responsible for the conversion of biomass into bioethanol.
KEYWORDS: microorganism proteome, enzymes, biofuel production, Spathaspora sp., perturbation theory, machine learning

1. INTRODUCTION
The ability to predict protein and enzyme function is a key goal in bioinformatics. In point of fact, many proteins included in the Protein Data Bank do not as yet have an assigned function.1 Moreover, the ability to predict a protein's function enables researchers to develop proteins/enzymes with a specific, desired function or activity.2−7 In recent years, several research groups have developed effective and reliable approaches designed specifically to predict the function of proteins and enzymes. One of the most reliable is the Basic Local Alignment Search Tool (BLAST),8 a tool created by the National Center for Biotechnology Information (NCBI) that searches databases for similar sequences and computes a similarity score. Other relevant tools based on sequence similarity are the UniProtKB database,9 the Kyoto Encyclopedia of Genes and Genomes (KEGG),10 and the PEDANT protein database.11 In general terms, BLAST and other alignment-based tools have proved to be among the most powerful means of assigning a specific function to new or unknown sequences. However, as with all methods, they may fail under certain conditions. For instance, enzymes in a protein family are usually considered to be evolutionarily related. Even when they share highly related sequences, these enzymes may have similar but distinct functions. Indeed, some authors have demonstrated that enzymes sharing a sequence identity higher than 90% may have different functions and differ in the first digit of their

Received: December 12, 2018
Published: May 13, 2019

DOI: 10.1021/acs.jproteome.8b00949 J. Proteome Res. 2019, 18, 2735−2746


EC number.12−14 Conversely, some enzymes with a sequence identity below 30% share all four digits of their EC numbers. Such nonlinear relationships between function and sequence similarity make the identification of enzyme function a challenging task.15 The problem was clearly highlighted by Quester et al., who analyzed nine prokaryotic genomes and found inconsistency rates of ∼70% in enzyme predictions made with the main annotation resources, namely, NCBI, KEGG, and PEDANT.16 In only 29% of all annotations did the three main annotation databases predict identical enzyme functions. In a further 14% of cases, there was agreement between two of the three sources, and for 30% of all annotated genes, only one of the three databases assigned any function at all. On average, 19% of all genes with a predicted enzyme function were annotated only by the BLAST-based annotation and not in any of the other main annotation databases. For this reason, the authors developed the annotation pipeline EnzymeDetector. This tool automatically compares and evaluates the assigned enzyme functions from the main annotation databases and supplements them with its own function prediction. The approach proves to be particularly consistent and improves on the results of the individual databases, thereby avoiding inconsistencies; however, it currently covers only nine prokaryotic genomes. Other important works in this area are those of Li et al.17 and Nagao et al.15 In both cases, the authors developed systems to predict enzyme functions by applying similarity cutoffs while building their databases to avoid problems relating to low homology. The approach proposed by Nagao et al. showed a precision of 0.98 and a recall of 0.89 in a cross-validated benchmark assessment. In the case of the Li et al.
approach, the authors slightly improved on Nagao et al.'s cross-validation accuracy. In light of the above, one of the major aims of this work is to develop alternative, alignment-free methods that can be applied in a complementary manner, namely, by using alignment-free machine learning (ML) methods to predict protein function. As input, ML methods use alignment-free numerical parameters that quantify information about the 2D or 3D structure of proteins.18−21 Specifically, Graham, Bonchev, Marrero-Ponce, and others22−26 have used Shannon entropy measures to quantify relevant structural information in molecular systems. In addition, González-Díaz et al.27−29 introduced the so-called Markov−Shannon entropies (θk) to codify the structural information of large biomolecules and complex biosystems or networks. More recently, González-Díaz's group has combined perturbation theory (PT) operators with ML methods to create PTML algorithms that are useful in proteome research. PTML algorithms use the numerical parameters of a reference case (protein, gene, drug, etc.) to infer the biological function of a query protein without relying upon alignment.30,31 The input to a PTML model can be the θk values of the sequences (qθk and rθk) of all query and template proteins or other numerical parameters. Accordingly, PTML is expected to retain the advantages of BLAST (use of query proteins) without failing for lack of alignment. Among the more important PT operators in PTML models are the so-called moving averages (MAs). MA operators of different classes have been used by our group as input to PTML models of drugs, proteins, and nanoparticles.32−40 In the case of entropy indices, an MA operator has the form Δθk(cj) = θk − ⟨θk⟩cj. It measures the deviation of the information θk of one specific protein from the

expected value for a set of proteins of a certain class cj. However, these models are prone to overfitting and thus may fail under certain conditions. Overfitting occurs when a model shows high training performance but poor validation performance. This usually happens when models are developed with a large number of predictors and hidden-layer neurons relative to the number of cases, as in the case of Google Flu Trends (GFT).41,42 Finally, the past few years have witnessed fast-growing concern over environmental issues brought about by the extensive use of oil. For instance, the European transport sector is estimated to depend on oil for up to 97% of its fuel.43 Because of this widespread use of petroleum-based fuels, the depletion rate of world stocks is inevitably increasing. There is therefore a growing demand for new, alternative, renewable, and greener processes for producing biofuels to reduce our dependency on nonrenewable oil and mitigate climate change and its concomitant environmental urgencies. Recently, various efforts have been made to identify suitable raw materials as feedstocks. For example, lignocelluloses have received much attention owing to their abundance, renewability, and noncompetition with human demand.44 Other raw sources under scrutiny include pretreated wastewater and wood waste, among others.45 In addition, significant efforts have been made to identify potential bio-organisms able to process raw materials for the production of biofuels.
To this end, several microorganisms, such as Saccharomyces cerevisiae, Pichia stipitis, and Spathaspora passalidarum, have been explored in the search for new ways to produce biofuels.44,46 Among these, Pichia stipitis is one of the most interesting owing to its capacity to ferment lignocellulosic substrates; indeed, it can ferment a variety of sugars to ethanol using specific enzymes present in its proteome. A key factor in this process is that this yeast can utilize the pentose xylose contained in the hemicellulosic component of wood. Pichia stipitis is therefore considered an excellent candidate for producing ethanol from lignocelluloses. In consideration of the above, in this study we have developed the first linear PTML model able to predict the enzyme subclasses of a heterogeneous series of enzymes with different degrees of similarity. We also developed alternative nonlinear PTML models based on different artificial neural network (ANN) topologies. Details regarding how to use the PTML model in proteome research related to biofuel production are shown. To this end, we used the model to carry out an in silico high-throughput mining of the enzymes expressed in the proteome of Pichia stipitis. Using the model, we simulated an alanine mutation scanning of many of the enzymes present in this organism, focusing on the EC 1 enzymes and on D-xylose reductase (EC 1.1), which is commonly used for the fermentation of xylose into biofuel. Overall, we show that this kind of model has the potential to become a useful tool in the future design of new enzymes for biofuel production.

2. MATERIALS AND METHODS

Topological-Information Indices of Protein Sequences

First, we used the software S2SNet47 to transform each protein sequence into one sequence recurrence network (SRN). The SRN of a protein sequence may be constructed starting from either of two directions: (1) from a sequence graph with linear topology by adding amino acid recurrence information or (2) from a protein representation graph with star graph (SG)


Table 1. Different Types of PT Information Index Operators Used in the Model

operator | formula | type of information
difference | ΔTIk(q, r) = qTIk − rTIk | Perturbation (change) in the information of the query sequence (q-seq) with respect to the sequence of reference (r-seq)
query sequence moving average | ΔTIk(q, i) = qTIk − ⟨qTIk⟩i | Perturbation in the information of the q-seq with respect to the expected value of information for all q-seqs with the same class of activity ci
template sequence moving average | ΔTIk(r, j) = rTIk − ⟨rTIk⟩j | Perturbation in the information of the r-seq with respect to the expected value of information for all r-seqs with the same class of activity cj
double moving average | ΔΔTIk(i, j, q, r) = ΔTIk(q, i) − ΔTIk(r, j) = (qTIk − ⟨qTIk⟩i) − (rTIk − ⟨rTIk⟩j) | Difference between query and template deviations from their respective classes
cross covariance | ∇TIk(i, j, q, r) = ΔTIk(q, i)·ΔTIk(r, j) = (qTIk − ⟨qTIk⟩i)·(rTIk − ⟨rTIk⟩j) | Covariance for query and template deviations from their respective classes
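The operators in Table 1 are simple arithmetic on precomputed entropy values. A minimal sketch follows; the function and argument names are ours for illustration, not from the paper:

```python
def pt_operators(q_ti, r_ti, class_mean_q, class_mean_r):
    """Compute the PT information-index operators of Table 1 for one
    query/template pair at a single order k.

    q_ti, r_ti       -- TIk values of the query and template sequences
    class_mean_q/r   -- expected (mean) TIk over all sequences of the
                        query class ci and template class cj
    """
    diff = q_ti - r_ti                 # difference, ΔTIk(q, r)
    ma_q = q_ti - class_mean_q         # query moving average, ΔTIk(q, i)
    ma_r = r_ti - class_mean_r         # template moving average, ΔTIk(r, j)
    double_ma = ma_q - ma_r            # double moving average, ΔΔTIk(i, j, q, r)
    cross_cov = ma_q * ma_r            # cross covariance, ∇TIk(i, j, q, r)
    return {"diff": diff, "ma_q": ma_q, "ma_r": ma_r,
            "double_ma": double_ma, "cross_cov": cross_cov}

# toy example: entropies of one pair plus the two class means
ops = pt_operators(q_ti=2.10, r_ti=1.95, class_mean_q=2.00, class_mean_r=2.05)
```

All five descriptors for a pair (and for each k) come from four numbers, which is what makes large pairwise data sets cheap to featurize.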

topology by adding sequence information.48−52 Note that in both of these SRN representations of a protein sequence, the amino acids are the nodes, and a pair of nodes (na and nb) in the network is connected by a link (αab = 1) if they are adjacent or neighboring recurrent nodes. This means that αab = 1 if the topological distance between na and nb is dab = 1 (chemically bonded amino acids) or if they are the nearest-neighbor amino acids of the same type (A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V) with minimal topological distance dab = min(dab) between them. The first node in the sequence (the center of the SG) is a bias or dummy nonresidue vertex. Second, we need to transform the SRN of each sequence into one stochastic matrix 1Π. The elements of 1Π are the probabilities (pab) of reaching an amino acid (node nb) by walking from another amino acid (node na) through a walk of length dab = 1

pab = αab / ∑(b = 1 → n) αab   (1)

Note that the number of amino acids in the sequence is equal to the number of nodes (n) in the SRN graph, which in turn equals the number of rows and columns in 1Π, the length of the sequence (L), and the maximal topological distance in the sequence, max(dab). In this work, we quantified the information content of a peptide using the Shannon entropy values (TIk) of the kth natural powers of the Markov matrix 1Π. The same procedure was used to quantify the information on the q-seqs (qTIk) and r-seqs (rTIk). The generic formula for a Markov−topological index TIk(seq) is as follows

TIk(seq) = −∑(a = 0 → L) kpa·log(kpa)   (2)

where kpa represents the absolute probability of reaching any node (amino acid a) in the spectral graph by walking a walk of length k. Further details of this formula can be found in previous works.27−29

PTML Model Definition

The rationale behind the application of PT to enzymes is based on the development of a computational method able to predict new enzymes or enzyme complexes. Here our method predicts the presence of enzyme activity εq(ci) = 1 of subclass ci (or the absence of this activity, εq(ci) = 0) for a query protein with a known amino acid sequence. It performs an alignment-free comparison with respect to the information about the sequence and the values of activity εr(cj) of the subclass cj of a protein of reference. To apply PT, we first had to develop a data set whose entries are pairs of query versus reference sequences with their respective values of enzyme activity within the same subclass (ci = cj) or different subclasses (ci ≠ cj). In our data set, we represented all query and reference protein sequences as two vectors qTI ≡ [qTI0, qTI1, qTI2, ..., qTIk] and rTI ≡ [rTI0, rTI1, rTI2, ..., rTIk]. The components of these vectors are the Shannon information indices used to quantify information about the sequences of these proteins. Using these indices, we can calculate different PT operators to quantify the effects of changes (perturbations) in the sequences of the proteins or in the subclass of enzyme activity. We can then use all of these variables as input to train different ML algorithms to seek an accurate PTML model. The general formula for a PTML linear model of this type is shown in eq 3

Sqr(ci, cj) = e0 + e1·εr(cj) + ∑(k = 0 → 5) ke2·ΔTIk(q, r) + ∑(k = 0 → 5) ke3·ΔTIk(q, j) + ∑(k = 0 → 5) ke4·ΔΔTIk(i, j, q, r) + ∑(k = 0 → 5) ke5·∇TIk(i, j, q, r)   (3)

In this model, Sqr(ci, cj) is the score value of the query protein for the enzyme activity of class ci compared with the enzyme activity εr(cj) of the enzyme of reference. The coefficient e0 is the independent term. The parameter e1 is the coefficient of the first input variable of the model, εr(cj). This variable quantifies the presence (εr(cj) = 1) or absence (εr(cj) = 0) of the enzyme activity of subclass cj for the template protein used as a reference. The other types of variables are the PT information index operators ΔTIk(q, r), ΔTIk(q, j), ΔΔTIk(i, j, q, r), and ∇TIk(i, j, q, r). The parameters ke2, ke3, ke4, and ke5 are their coefficients. Table 1 summarizes the definitions of the different types of PT operators used in this work. The following equation contains the expanded formula of the PTML linear model using these definitions
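The sequence-to-entropy pipeline described above (eqs 1 and 2) can be sketched in a few lines. This is a simplified stand-in for S2SNet, not the actual software: the exact recurrence-link rule and the uniform starting distribution are our assumptions, and `srn_adjacency`/`shannon_ti` are hypothetical names.

```python
import math

def srn_adjacency(seq):
    """Toy SRN: link consecutive residues (chain bonds) plus each residue's
    nearest following recurrence of the same amino acid type."""
    n = len(seq)
    a = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        a[i][i + 1] = a[i + 1][i] = 1.0          # peptide-bond links
    for i, res in enumerate(seq):                # nearest recurrent neighbor
        for j in range(i + 1, n):
            if seq[j] == res:
                a[i][j] = a[j][i] = 1.0
                break
    return a

def markov_matrix(a):
    """Row-normalize the adjacency into the stochastic matrix 1Pi (eq 1)."""
    return [[x / sum(row) for x in row] for row in a]

def shannon_ti(seq, k):
    """TIk(seq) = -sum_a kpa*log(kpa), with kpa the absolute probability of
    residing at node a after k steps from a uniform start (eq 2)."""
    pi = markov_matrix(srn_adjacency(seq))
    n = len(seq)
    p = [1.0 / n] * n                            # uniform initial vector
    for _ in range(k):
        p = [sum(p[t] * pi[t][j] for t in range(n)) for j in range(n)]
    return -sum(pa * math.log(pa) for pa in p if pa > 0)

ti = [shannon_ti("ACADCA", k) for k in range(3)]   # TI0..TI2 of a toy peptide
```

At k = 0 the entropy is simply log(n) for the uniform start; higher orders mix in the recurrence topology, which is what distinguishes sequences with identical composition but different residue placement.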


Table 2. Relevant Statistics for the PTML-LDA Model

MCC | eigenvalue | canonical R | Wilk's lambda | χ2 | df | p value
0.64 | 0.793300 | 0.665108 | 0.557631 | 67976.39 | 25 | 0.00

Table 3. Results Obtained for Several PTML-ANN Models (correct/total and % correct; Sn = class 0, Sp = class 1)

Model 1. RBF-28-50-2
train: Sn 52867/52915 (99.9%), Sp 18126/27475 (66.0%), all 70993/80390 (82.9%)
validation: Sn 8938/8965 (99.7%), Sp 22498/31230 (72.0%), all 31436/40195 (85.9%)
overall: Sn 83163/83290 (99.8%), Sp 49460/77490 (63.8%), all 132623/160780 (81.8%)

Model 2. RBF-28-57-2
train: Sn 52897/52915 (100.0%), Sp 17967/27475 (65.4%), all 70864/80390 (82.7%)
validation: Sn 8965/8965 (100.0%), Sp 22441/31230 (71.9%), all 31406/40195 (85.9%)
overall: Sn 83265/83290 (100.0%), Sp 49525/77490 (63.9%), all 132790/160780 (81.9%)

Model 3. RBF-28-52-2
train: Sn 52909/52915 (100.0%), Sp 17990/27475 (65.5%), all 70899/80390 (82.7%)
validation: Sn 8962/8965 (100.0%), Sp 22526/31230 (72.1%), all 31488/40195 (86.0%)
overall: Sn 83275/83290 (100.0%), Sp 49532/77490 (63.9%), all 132807/160780 (82.0%)

Model 4. LNN 26:26-1:1
train: Sn 49792/52843 (94.2%), Sp 17623/27547 (64.0%), all 67415/80390 (79.1%)
validation: Sn 8456/8946 (94.5%), Sp 21450/31249 (68.6%), all 29906/40195 (81.6%)
overall: Sn 78558/83290 (94.3%), Sp 46695/77490 (60.3%), all 125253/160780 (77.3%)

Model 5. LNN 27:27-1:1
train: Sn 50056/52843 (94.7%), Sp 17705/27547 (64.3%), all 67761/80390 (79.5%)
validation: Sn 8494/8946 (94.9%), Sp 21546/31249 (68.9%), all 30040/40195 (81.9%)
overall: Sn 78976/83290 (94.8%), Sp 46961/77490 (60.6%), all 125937/160780 (77.7%)

Model 6. LNN 28:28-1:1
train: Sn 50871/52843 (96.3%), Sp 17833/27547 (64.7%), all 68704/80390 (80.5%)
validation: Sn 8585/8946 (96.0%), Sp 21931/31249 (70.2%), all 30516/40195 (83.1%)
overall: Sn 80104/83290 (96.2%), Sp 47627/77490 (61.5%), all 127731/160780 (78.8%)

Model 7. PNN 28:28-80390-2-2:1
train: Sn 51587/52914 (97.5%), Sp 4903/27476 (17.8%), all 56490/80390 (57.7%)
validation: Sn 8513/8866 (96.0%), Sp 5116/31329 (16.3%), all 13629/40195 (56.2%)
overall: Sn 80809/83290 (97.0%), Sp 12227/77490 (15.8%), all 93036/160780 (56.4%)

Models 8 and 9. PNN 28:28-80390-2-2:1 (identical results)
train: Sn 52914/52914 (100.0%), Sp 0/27476 (0.0%), all 52914/80390 (50.0%)
validation: Sn 8866/8866 (100.0%), Sp 0/31329 (0.0%), all 8866/40195 (50.0%)
overall: Sn 83290/83290 (100.0%), Sp 0/77490 (0.0%), all 83290/160780 (50.0%)



Sqr(ci, cj) = e0 + e1·εr(cj) + ∑(k = 0 → 5) ke2·(qTIk − rTIk) + ∑(k = 0 → 5) ke3·(qTIk − ⟨qTIk⟩i) + ∑(k = 0 → 5) ke4·[(qTIk − ⟨qTIk⟩i) − (rTIk − ⟨rTIk⟩j)] + ∑(k = 0 → 5) ke5·(qTIk − ⟨qTIk⟩i)·(rTIk − ⟨rTIk⟩j)   (4)

PTML Nonlinear Models

Subsequently, we used the ANN tools implemented in the software STATISTICA53 to find robust and reliable classification models. Standard statistics such as the specificity (Sp), sensitivity (Sn), probability of error (p), cross-validation, and the Matthews correlation coefficient (MCC)54 were used to assess the discriminatory power of the models. We also performed a global sensitivity analysis to monitor the influence of each molecular descriptor (MD) appearing in the model. In data mining and statistical model building, this analysis assesses the importance of the descriptor variables in the model; in effect, it determines how sensitive the model is to changes in its structure.55 To do so, the ANN module computes the sum-of-squares residuals or misclassification rates for the model when a descriptor variable is eliminated from the neural net. The descriptor variables can therefore be sorted by their importance or relevance for the particular neural net. This analysis is performed automatically by a specific module included in the STATISTICA software. Moreover, to avoid an albeit unlikely overfitting problem, we organized the data set to ensure that neither the reference nor the query structure appeared in both the training and validation series. To do so, we first assigned a random number to each enzyme and then generated random pairs, subsequently ordering the generated pairs by the assigned number. Finally, we assigned the first 70% of the entries to the training series and the last 30% to the validation series. To identify the best topology and architecture, different ANNs such as linear neural networks (LNN), radial basis functions (RBF), multilayer perceptrons (MLP), and probabilistic neural networks (PNN) were closely examined.53

Data Set of Enzyme Sequences

All of the enzyme sequences were retrieved from the Protein Data Bank (PDB).1 We collected a total of 19 187 sequences, each belonging to 1 of 59 enzyme subclasses.56 The enzymes were sorted by their EC number. As mentioned before, we initially uploaded all of the enzyme sequences to the software S2SNet.47 S2SNet constructed the SRNs of all sequences, created the recurrence/sequence stochastic matrices, and automatically calculated the iθk values for all of the enzyme sequences. Second, we used the Wichmann−Hill algorithm57 (the version implemented in Microsoft Excel) to create a pseudorandom list of 160 557 pairs of sequences. We generated a set of 83 067 positive cases, randomly pairing each sequence with a sequence that was 100 cases up or down in the original data set; the negative cases (77 490) were randomly generated. We selected pairs of two enzymes that were at a distance lower than five in the original data set. We then performed a data set curation, deleting all of the randomly generated duplicates. Each pair of sequences contains a query sequence (q-seq) and a template sequence of reference (r-seq). The complete data set is reported in the Supporting Information S1 (Pichia stipitis enzyme proteome and alanine scanning procedure). From UniProt, we retrieved a total of 222 enzymes belonging to EC class 1 (oxidoreductases) of Pichia stipitis, because the enzymes responsible for xylose fermentation belong to EC class 1 and are usually used to produce biofuels. We then applied the alanine scanning protocol, substituting alanine for proline, to evaluate the ability of the model to identify the mutation in the protein. The alanine scanning procedure is a technique usually applied to determine the contribution of a specific residue to the stability or function of a given protein.58 It basically consists of substituting a specific amino acid with alanine and in effect may be interpreted as the application of a perturbation, where the initial state is the reference enzyme and the alanine-substituted sequence is the query. In fact, we used the original sequence as the reference and the alanine-scanned sequence as the query. We decided to apply this technique for two main reasons: first, we wanted to test the predictivity of the model and validate it with another external series; second, we wanted to check the ability of the model to predict enzymes with a specific mutation, to verify whether the model was able to predict a new enzyme with a specific function.

3. RESULTS AND DISCUSSION

This study has essentially two main aims. First, we aim to develop a new model for predicting enzyme subclasses using an

Table 4. Performance of the Best MLP Model Found

Training series: obs. 0 → pred. 0: 52899, pred. 1: 4844; obs. 1 → pred. 0: 5330, pred. 1: 53328; Sp = 90.9%, Sn = 91.7%, Ac = 91.3% (106227 correct, 10174 incorrect, 116401 total)
Validation series: obs. 0 → pred. 0: 22513, pred. 1: 1620; obs. 1 → pred. 0: 2325, pred. 1: 17698; Sp = 90.6%, Sn = 91.6%, Ac = 91.1% (40211 correct, 3945 incorrect, 44156 total)
Overall: obs. 0 → pred. 0: 75412, pred. 1: 6464; obs. 1 → pred. 0: 7655, pred. 1: 71026; Sp = 90.8%, Sn = 91.7%, Ac = 91.2% (146438 correct, 14119 incorrect, 160557 total)

Obs., observed; pred., predicted; Sp, specificity; Sn, sensitivity; Ac, accuracy.
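The statistics above follow directly from confusion-matrix counts. A minimal sketch using the standard definitions (the paper tabulates Sp and Sn per predicted set, so those two differ slightly from the row-wise values here, but the accuracy and the reported MCC of 0.82 are reproduced):

```python
import math

def classification_stats(tn, fp, fn, tp):
    """Specificity, sensitivity, accuracy, and Matthews correlation
    coefficient from confusion-matrix counts
    (tn/fp = observed class 0, fn/tp = observed class 1)."""
    sp = tn / (tn + fp)
    sn = tp / (tp + fn)
    ac = (tp + tn) / (tp + tn + fp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return sp, sn, ac, mcc

# overall counts from Table 4 (class 0 taken as the "negative" slot)
sp, sn, ac, mcc = classification_stats(tn=75412, fp=6464, fn=7655, tp=71026)
```

MCC is the more conservative summary here because it penalizes imbalance between the two error types, which plain accuracy hides.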

innovative approach based on the combination of PT and ML tools. Second, we wanted to extend this methodology to the prediction of enzymes that may play a significant role in the production of biofuels, taking advantage of the metabolism of a natural yeast, Pichia stipitis. Consequently, we developed a new ML model that is able to assign enzymes to a specific subclass and, at the same time, can predict the function of some enzymes of the Pichia stipitis yeast, including D-xylose reductase, which plays an important role in the transformation of lignocellulosic biomass into biofuel. This enzyme is one of the most studied in this area owing to its high efficiency in transforming natural substrates into biofuel.
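The alanine scanning protocol described in the Methods (each proline replaced by alanine; the original sequence acts as the reference and the mutant as the query) can be sketched as follows. The function name and the toy sequence are hypothetical:

```python
def alanine_scan(seq, target="P"):
    """Generate (reference, query, position) triples in which each occurrence
    of the target residue (proline here, as in the protocol) is replaced by
    alanine, one mutation per generated query sequence."""
    pairs = []
    for i, res in enumerate(seq):
        if res == target:
            mutated = seq[:i] + "A" + seq[i + 1:]
            pairs.append((seq, mutated, i))   # reference, query, position
    return pairs

pairs = alanine_scan("MKPLVPQ")   # hypothetical 7-residue peptide
```

Each triple then yields one query/reference row for the PTML model, so a protein with m prolines contributes m perturbation cases.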


Figure 1. ROC curve of the best PTML model found.

PTML-LDA Linear Model

First, we used the linear discriminant analysis (LDA) algorithm implemented in the software STATISTICA to derive a linear PTML model able to discriminate all subclasses of enzymes. Because we started with more than 200 variables, we first performed a forward stepwise analysis to select only the variables with the greatest influence on the model. The best model found has specificity Sp = 80%, sensitivity Sn = 83%, and accuracy Ac = 82%. These values are promising for a linear model but could be improved with a nonlinear method (see the next section). The best PTML-LDA linear model found is reported at the end of this section.

PTML-ANN Nonlinear Models

As reported in the previous section, we ran a set of 50 different models to identify the best topology and algorithm for accurately classifying the two classes of our data set. Having investigated the LNN, RBF, MLP, and PNN topologies, we confirmed what we had demonstrated in previous works, namely that MLP usually performs better than the other topologies.59 In fact, the performance of the LNN, RBF, and PNN models was analogous to that of the linear model or even worse. For purposes of comparison, a selection of LNN, RBF, and PNN models is shown in Table 3. The statistics of the MLP models, in contrast, improve on those of the LDA and the other neural network models. The best model found has the topology MLP 28-51-2. It presented high values of goodness-of-fit statistics: Sp (specificity) = 90.8%, Sn (sensitivity) = 91.7%, and Ac (accuracy) = 91.2%. Indeed, in the training series, the model correctly classifies 52 899 out of 58 229 enzymes (Sp = 90.9%) and 53 328 out of 58 172 nonenzymes (Sn = 91.7%), for a total of 106 227 out of 116 401 (Ac = 91.3%). In the validation series, the model correctly classifies 22 513 out of 24 838 enzymes (Sp = 90.6%) and 17 698 out of 19 318 nonenzymes (Sn = 91.6%), for a total of 40 211 out of 44 156 (Ac = 91.1%). Table 4 reports all of these statistics. Moreover, the MCC value of 0.82 is in line with these statistics, as is the receiver operating characteristic (ROC) plot shown in Figure 1. To further evaluate the results of the PTML-ANN model, we performed a sensitivity analysis to assess the influence of each MD in the model. Although ANN models do not reduce to simple linear equations and their interpretation is challenging, sensitivity analysis can be used to analyze the influence of each MD on the model. In this sense, a higher sensitivity value for a parameter indicates that the system's performance can change drastically with small variations in that parameter. Parameters with sensitivity > 1 are considered necessary for the model. Note also that all of the input variables of this model have sensitivities in the range 1−16, which falls within the acceptable range.53 The sensitivity analysis results for the 28 MDs are reported in Table 5.
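A simplified stand-in for this kind of sensitivity analysis (not STATISTICA's actual algorithm) replaces each input by its mean and reports the degradation in classification error as a ratio; function names and the toy model below are ours:

```python
from statistics import mean

def sensitivity_ratios(predict, X, y):
    """For each input variable, replace it by its mean across the data,
    re-score the fixed model, and return
    error(variable neutralized) / error(baseline).
    Ratios > 1 mean the model degrades without that variable."""
    def error(rows):
        return sum(predict(r) != t for r, t in zip(rows, y)) / len(y)
    base = error(X) or 1e-9                      # guard a zero baseline error
    ratios = []
    for j in range(len(X[0])):
        mu = mean(r[j] for r in X)
        X_j = [r[:j] + [mu] + r[j + 1:] for r in X]
        ratios.append(error(X_j) / base)
    return ratios

# toy model: class 1 if the first feature exceeds 0.5; second feature is noise
X = [[0.1, 5.0], [0.9, 5.0], [0.2, 5.0], [0.8, 5.0]]
y = [0, 1, 0, 1]
ratios = sensitivity_ratios(lambda r: int(r[0] > 0.5), X, y)
```

The informative first feature yields a large ratio while the constant noise feature yields a ratio of zero, mirroring how the ranked sensitivities in Table 5 separate necessary descriptors from dispensable ones.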

EC = *0.13 + *0.04 + *0.05 + J(srn)q*0.15 + *0.02 + *0.13 + *0.05 + *0.13 + *0.05 + *0.05 + *0.05 + *0.05 + *0.05 + *−0.07 + *0.13 +