
ELM-MHC: An improved MHC Identification method with Extreme Learning Machine Algorithm
Yanjuan Li, Mengting Niu, and Quan Zou
J. Proteome Res., Just Accepted Manuscript. DOI: 10.1021/acs.jproteome.9b00012. Publication Date (Web): January 30, 2019.


ELM-MHC: An improved MHC Identification method with Extreme Learning Machine Algorithm

Yanjuan Li 1, Mengting Niu 1, Quan Zou 2,3,*

1 School of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China
2 Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
3 Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China

* Correspondence: [email protected]; Tel: +86-136-5600-9020

Abstract: The major histocompatibility complex (MHC) is the group of genes encoding the major histocompatibility antigens. MHC molecules bind peptide chains derived from pathogens and display them on the cell surface for recognition by T cells, supporting a range of immune functions; they are critical in transplantation, autoimmunity, infection, and tumor immunotherapy. More accurate recognition of MHC proteins, combining machine learning algorithms with bioinformatics analysis, is therefore an important task. This paper proposes a new MHC recognition method as an alternative to traditional biological methods and uses the constructed classifier to identify MHC I and MHC II. The classifier combines three feature representation methods, SVMProt 188D, bag-of-ngrams (BonG), and information theory (IT), with an extreme learning machine (ELM) that uses a linear kernel (Lin_kernel) as the activation function. Ten-fold cross-validation and independent test set validation were used to verify the accuracy of the constructed classifier, both for identifying MHC and for distinguishing MHC I from MHC II. Under ten-fold cross-validation, the proposed method achieved 91.66% accuracy when identifying MHC and 94.42% accuracy when distinguishing the MHC I and MHC II categories. Furthermore, an online identification website named ELM-MHC was constructed at the following URL: http://server.malab.cn/ELM-MHC/.

Keywords: major histocompatibility complex; extreme learning machine; MHC I; MHC II; identification; machine learning

1 Introduction

The major histocompatibility complex (MHC) is a genomic region containing many genes that encode a wide variety of molecules, including the highly polymorphic classical class I and class II molecules, and it is vital to the adaptive immune response of vertebrates[1]. These classical MHC molecules present peptides to thymus-derived (T) lymphocytes[2]. In addition to binding and presenting peptides, classical MHC molecules typically show high allelic polymorphism and sequence diversity[3]. MHC molecules are essential for transplantation, autoimmunity, infection, and tumor immunotherapy[4, 5]. MHC I and MHC II molecules are cell surface glycoproteins with three-dimensional structures that present peptide fragments to the immune system and provide targets for controlling diseases such as tumors[6]. Currently, there are many methods for predicting MHC-binding peptides[7]. Based on features extracted from the physicochemical and molecular properties of amino acids, deep convolutional neural networks (DCNN)[8, 9], Markov models[10, 11], and nonlinear neural networks have been used to predict MHC-peptide binding, and many directly usable prediction tools have been produced, such as SVMHC, NetMHCIIpan[12], HONNs[13], ARB[14], and MHCpred[15].

The MHC is one of the most variable and polygenic regions in the vertebrate genome[16]; in particular, the MHC I and II loci strongly influence the effectiveness of immune responses against specific pathogens, such as avian leukemia and other viruses. MHC I and MHC II are the two classic classes of MHC protein. MHC I is expressed on the surface of nucleated cells, although expression levels vary widely among tissue types; class I molecules have an antigen recognition function restricted to CD8+ T cells. Class II molecules occur mainly on antigen-presenting cells; their biological role is to present processed antigens to CD4+ T cells in the initial stages of the immune response. Class II molecules are mainly involved in the presentation of exogenous antigens, but under certain conditions they can also present endogenous antigens. Figure 1 shows the phylogenetic tree of the MHC.

Scientists have long worked to characterize MHC molecules in various vertebrate genomes[17-22]. Hopkins et al.[23] described a rat monoclonal antibody that recognizes a sheep MHC class II antigen and appears to bind a non-polymorphic determinant; using this antibody, they investigated the distribution of class II molecules in sheep and the changes in class II expression on cells in peripheral and efferent lymph produced by in vitro antigen vaccination. In a study of a group of cynomolgus monkeys from China, Westbrook et al.[24] used the SMRT-CCS method to characterize 60 new full-length MHC class I transcript sequences. In a study of the extent and types of MHC polymorphisms and haplotypes in the Philippine macaque population, 127 unrelated animals were genotyped and 112 different alleles were identified[25]. At the same time, to meet the needs of MHC research and of bioinformatics analysis, the International Society for Animal Genetics (ISAG) standardized the nomenclature and established the IPD-MHC database for the scientific management of current and future MHC gene and allele sequences from non-human organisms[26, 27].


Figure 1. Only the jawed vertebrates have an MHC, and its organization may vary greatly. This idealized phylogenetic tree shows the relationships among some organisms; on the right is an idealized representation of some genes in the MHC region (for some organisms, equivalent MHC paralogs). A solid horizontal line indicates a continuous sequence, and a question mark indicates an uncertain linkage. The horizontal lines indicate MHC class I and II, and the dashed vertical lines indicate antigen-processing and peptide-loading genes. Data from reference [28].

In the early stages, understanding of the discovery, gene composition, and function of the MHC was based on mouse experiments. Given the large amount of data now available and the development of machine learning, there is still a need for efficient, highly accurate models built with existing machine learning algorithms. We expect a deeper understanding of the function and mechanism of the MHC, using bioinformatics to meet and promote these developments. Therefore, this paper proposes an extreme learning machine method to identify the MHC and its types. The paper combines SVMProt 188D[29, 30], which extracts features from the physicochemical properties of amino acids and from amino acid frequencies; bag-of-ngrams (BonG), which uses an n-gram vector instead of the original single-word vector; and information theory (IT), which draws on information theory; these features are combined with an ELM to construct the classifier. Classification is evaluated by ten-fold cross-validation and on an independent data set.

2 Methods

The frame diagram of the MHC classifier in this paper is shown in Figure 2. After the classifier is constructed, the MHC is identified, and the classifier is then used to identify the specific categories of MHC: MHC I and II. This section introduces the data set, the feature representation methods, and the classifiers in detail.

[Figure 2 flowchart: query sequences and the training dataset each undergo feature representation (BonG, 188D, IT) and MRMR feature selection to form a feature matrix; the training-phase matrix is fed to the ELM to build the training model, which then predicts whether a query sequence is an MHC protein.]

Figure 2. The framework of the ELM-MHC classifier. The classifier includes two parts, training and testing. The feature representation for the data set is a hybrid method combining BonG, 188D, and IT. For the mixed feature matrix, the MRMR method is used for feature selection and dimensionality reduction to obtain the final feature matrix. The training model is then constructed by applying the extreme learning machine classification method to the data set. The same feature extraction and dimension reduction are performed on each prediction sequence.
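The MRMR feature-selection step in the pipeline is only named, not specified. Below is a minimal greedy sketch of the idea (pick the feature most relevant to the label, then repeatedly add the feature maximizing relevance minus mean redundancy to the already-selected set). It uses absolute Pearson correlation as a stand-in score, whereas classical mRMR uses mutual information; all function names and data here are illustrative, not the paper's implementation:

```python
import numpy as np

def mrmr_select(X, y, k):
    """Greedy mRMR-style selection: maximize |corr(feature, label)|
    minus the mean |corr(feature, already-selected features)|."""
    n_features = X.shape[1]
    # Relevance: absolute correlation of each feature with the label.
    relevance = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                          for j in range(n_features)])
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            # Redundancy: mean absolute correlation with selected features.
            redundancy = np.mean([abs(np.corrcoef(X[:, j], X[:, s])[0, 1])
                                  for s in selected])
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected

# Toy data: column 2 is strongly tied to the label, the rest are noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200).astype(float)
informative = y + 0.1 * rng.normal(size=200)
noise = rng.normal(size=(200, 4))
X = np.column_stack([noise[:, :2], informative, noise[:, 2:]])
picked = mrmr_select(X, y, 2)
print(picked[0])  # the informative column, index 2, is picked first
```

With mutual information in place of correlation this becomes the standard mRMR criterion; the greedy loop costs O(k · n_features) score evaluations.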

2.1. Dataset

Based on the biological function and location of the protein molecules, the MHC has two types: MHC I and MHC II. This article constructs a new data set; the build process is described below, and the data set can be downloaded from the ELM-MHC server.

The first step is to search for MHC sequences in the UniProt database. For the identification of MHC, the positive examples form a data set consisting of a mixture of MHC I and MHC II sequences. Using "MHC I" and "MHC II" as search terms, sequences matching the search requirements are retrieved and downloaded as files in fasta format, and the relevant protein sequences are selected to generate an MHC sequence file. Negative examples are then obtained through the PFAM family. The resulting fasta files are made non-redundant to obtain the final data set. In this paper, we use CD-HIT for de-duplication: it clusters all sequences according to the parameter settings, outputs the longest sequence in each cluster as the representative sequence, and reports the sequence names in each cluster, which can be used for similarity analysis. Care must be taken in setting the threshold (the default similarity is 0.9).

After the above steps, the paper constructed a data set, named DMHC, of 6712 MHC protein sequences (denoted Smhc) and 6776 non-MHC protein sequences (denoted Snon-mhc). Smhc is divided into two parts to predict the two types of MHC protein: the first part, containing 4370 MHC protein sequences, is used for training, and the second part, containing 2342 MHC protein sequences, is used for independent testing. For the identification of MHC I and MHC II, we chose the MHC I sequences as positive examples and the MHC II sequences as negative examples. In Smhc, the numbers of MHC I and II sequences were 3350 and 3362, respectively, and we use these as the data set for identifying MHC I and MHC II.

2.2. Feature Extraction

When using machine learning methods to identify protein types, feature extraction is very important [31-38]. A multi-feature hybrid representation is adopted in this paper, comprising three methods: the SVMProt 188D feature extraction method, the bag-of-ngrams (BonG) feature extraction method, and an information theory (IT) based method.

2.2.1 SVMProt 188D features

When predicting the type of a protein, its physicochemical properties are usually considered, and compositional characteristics are also widely applied in protein recognition; we therefore consider whether combining the two can produce better predictions. Dubchak first used a hybrid representation fusing two kinds of features, verified its effectiveness, and achieved good results in predicting protein folding patterns[39]. Later, more feature fusion methods emerged [38, 40-48]. The 188D representation likewise incorporates both the compositional and the physicochemical properties of amino acids.

The dimensions of 188D are as follows. The first 20 dimensions are the content of each amino acid (in alphabetical order "ACDEFGHIKLMNPQRSTVWY") in the sequence [49]. Dimensions 21-41 describe the hydrophobicity of the amino acids, divided into hydrophilic, neutral, and hydrophobic classes [50]: dimensions 21-23 are the content of hydrophilic ("RKEDQN"), neutral ("GASTPHY"), and hydrophobic ("CVLIMFW") amino acids; dimensions 24-26 are the transition frequencies among the three classes; and dimensions 27-31, 32-36, and 37-41 give the positions (first, 25%, 50%, 75%, and last) of the hydrophilic, neutral, and hydrophobic amino acids in the sequence, respectively. Dimensions 42-62 are van der Waals volume properties, 63-83 are amino acid polarity features, 84-104 are polarizability properties, 105-125 are charge properties, 126-146 are surface tension, 147-167 are secondary structure properties, and 168-188 are solvent accessibility. This yields a 188-dimensional feature vector; Figure 3 describes the 188-D features.

[Figure 3: bar chart of the number of dimensions contributed by each 188-D feature group: amino acid composition (20), hydrophobicity (21), normalized van der Waals volume (21), polarity (21), polarizability (21), charge (21), surface tension (21), secondary structure (21), and solvent accessibility (21).]

Figure 3. Structure of the 188-D features.
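As a concrete illustration of the first 23 dimensions described above (the 20 amino acid composition values plus the three hydrophobicity-group contents), here is a small sketch. The group strings are taken from the text; the function name is illustrative, and the remaining 165 dimensions are omitted:

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # alphabetical order, dims 1-20
GROUPS = {"hydrophilic": "RKEDQN",     # dims 21-23: group content
          "neutral": "GASTPHY",
          "hydrophobic": "CVLIMFW"}

def composition_features(seq):
    """Dims 1-20: frequency of each amino acid in the sequence;
    dims 21-23: fraction of residues in each hydrophobicity group."""
    n = len(seq)
    counts = Counter(seq)
    comp = [counts.get(aa, 0) / n for aa in AMINO_ACIDS]
    group = [sum(counts.get(aa, 0) for aa in g) / n for g in GROUPS.values()]
    return comp + group

feats = composition_features("ACDEFGHIKLMNPQRSTVWY")
print(len(feats))   # 23
print(feats[0])     # 0.05 (one A out of 20 residues)
```

The full 188-D vector adds the transition frequencies, the positional distribution descriptors, and the six further physicochemical groupings listed above.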

2.2.2 Bag-of-ngram feature representation


In information retrieval, the bag-of-words (BOW) model assumes that each word in a text occurs independently: word order and grammar are ignored, and the text is treated only as a collection of words. In the bag-of-words feature, a text document is converted into a vector (simply a collection of n numbers) containing the number of occurrences of each word in the vocabulary. Feature selection is also very important for improving classification performance while reducing redundancy [51-53]. Bag-of-ngrams (BonG) is a natural extension of BOW[54] that optimizes the model input: the input is no longer a simple word vector but an n-gram vector, drawing on the n-gram language model[55]. An n-gram is a sequence of n ordered tokens; a single word is a 1-gram, also known as a unigram. During tokenization, the counting mechanism can count individual words or count overlapping sequences as n-grams. In the BonG extension of BOW, the representation counts the occurrences in the document of every k-gram with k ≤ n.

2.2.3 Information Theory (IT)

Information theory is a branch of probability theory and mathematical statistics. It is used in information processing, information entropy, signal-to-noise ratio, and other related topics, and has gradually been applied to bioinformatics[56-65]. The information theory feature representation describes protein sequence characteristics from three aspects: information entropy, relative entropy, and information gain. The expressions for the three are given in equations (1)-(3)[66].


In information theory, entropy is the average amount of information per message and is a measure of uncertainty: the more random the source, the greater its entropy[67]. Information entropy (also known as Shannon entropy, denoted I) reflects the degree of disorder of a system; the more ordered a system, the lower its information entropy, and vice versa. The Shannon information entropy is calculated as shown in Equation 1:

I = - Σ_{i=1}^{20} p_i log2(p_i)    (1)

where p_i is the ratio of the number of occurrences of amino acid i to the length of the sequence. The relative entropy is expressed as shown in Equation 2:

REn = - Σ_{i=1}^{20} p_i log2(p_i / p_i^0)    (2)

where p_i^0 is the background frequency of amino acid i. Information gain, denoted G, measures the importance of a sample characteristic:

G = I - REn    (3)
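Equations (1)-(3) can be computed directly from a sequence's amino acid frequencies. In this sketch the background distribution p_i^0 in Equation (2) is assumed to be uniform (1/20), since the text does not specify it; the function name is illustrative:

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def it_features(seq, p0=1 / 20):
    """Return (I, REn, G) following equations (1)-(3)."""
    n = len(seq)
    p = [seq.count(aa) / n for aa in AMINO_ACIDS]
    # Eq. (1): Shannon entropy of the amino acid distribution.
    I = -sum(pi * math.log2(pi) for pi in p if pi > 0)
    # Eq. (2): relative entropy against the background p0 (uniform here).
    REn = -sum(pi * math.log2(pi / p0) for pi in p if pi > 0)
    # Eq. (3): information gain.
    G = I - REn
    return I, REn, G

I, REn, G = it_features("ACDEFGHIKLMNPQRSTVWY")
print(round(I, 4))   # 4.3219 = log2(20): a uniform sequence has maximal entropy
```

For a sequence whose composition matches the background exactly, REn is 0 and G equals I, as in this example.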

2.3. Classifier

To identify MHC proteins, this paper chose ELM as the classification algorithm. Since its introduction, the extreme learning machine has been applied to practical problems and has repeatedly been used successfully in bioinformatics [68-76]. ELM is a single-hidden-layer feedforward neural network that is easy to understand and convenient to use [77, 78].


[Figure 4: a single-hidden-layer network with d input nodes x, L hidden nodes with outputs G(a_i, b_i, x), output weights β_i, and m output nodes producing f(x).]

Figure 4. Structure diagram of the extreme learning machine.

For traditional neural networks, we need to set the training parameters of the network, and when seeking the optimal solution it is easy to fall into a local optimum. ELM, by contrast, simplifies the parameters that must be set: we only need to set the number of nodes in the hidden layer, and the optimization then yields a unique optimal solution. From Guang-Bin Huang's proposal of the ELM algorithm to its widespread use, ELM has proved to have high learning efficiency, fast model training, and good generalization performance. The structure of the extreme learning machine is shown in Figure 4. The network output is[77]:

f_L(x) = Σ_{i=1}^{L} β_i G(a_i, b_i, x)    (4)


where G is the activation function, which can be (but is not limited to) a sigmoid or RBF function; L is the number of hidden-layer neuron nodes; (a_i, b_i) are the parameters of the i-th hidden node; and β_i is an output weight vector whose dimension m equals the number of output nodes. The output of the neural network can then be expressed as equation (5)[79]:

Hβ = T    (5)

where H is the output of the hidden-layer nodes, β is the output weight matrix composed of the β_i, and T is the desired output:

H = [ g(W_1·X_1 + b_1)  ⋯  g(W_L·X_1 + b_L) ]
    [        ⋮          ⋱          ⋮        ]
    [ g(W_1·X_N + b_1)  ⋯  g(W_L·X_N + b_L) ]  (N × L),

β = [ β_1^T, ⋯, β_L^T ]^T (L × m),  T = [ T_1^T, ⋯, T_N^T ]^T (N × m)    (6)

The general flow of the ELM algorithm is described in Table 1.

Table 1. ELM algorithm.

Input: a training sample set {(x_i, t_i)}_{i=1}^{N}, a hidden-layer output function G(a_i, b_i, x), and the number of hidden-layer nodes L.
a) Randomly generate the hidden-layer node parameters (a_i, b_i), i = 1, ..., L.
b) Calculate the hidden-layer output matrix H.
c) Output the optimal network weights β = H†T, where H† is the Moore-Penrose generalized inverse of H.
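The three steps of Table 1 reduce to one random projection plus one least-squares solve. A minimal numpy sketch (sigmoid activation for G, with β obtained via the Moore-Penrose pseudoinverse); the toy data and all names are illustrative, not the paper's implementation:

```python
import numpy as np

def elm_train(X, T, L, rng):
    """Table 1 in code: (a) random hidden-node parameters,
    (b) hidden-layer output matrix H, (c) beta = pinv(H) @ T."""
    d = X.shape[1]
    W = rng.normal(size=(d, L))               # a_i: random input weights
    b = rng.normal(size=L)                    # b_i: random biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))    # G: sigmoid activation
    beta = np.linalg.pinv(H) @ T              # Moore-Penrose solve of H beta = T
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta

# Toy data: label is 1 when x0 + x1 > 0 (illustrative only).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
T = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)
W, b, beta = elm_train(X, T, L=50, rng=rng)
acc = ((elm_predict(X, W, b, beta) > 0.5) == T).mean()
print(f"training accuracy: {acc:.2f}")
```

Because only β is learned, training amounts to a single linear solve, which is why ELM trains much faster than backpropagation-based networks.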

In the original ELM work, Huang et al. proposed in 2004 an ELM algorithm for training single-hidden-layer networks with additive hidden nodes or radial basis function hidden nodes[79]. Drawing on the learning principle of support vector machines, Huang et al. proposed the kernel ELM in 2010[80]; the ELM algorithm constructed in this way has fewer constraints and performs better than the Extreme SVMs proposed by Liu et al.[81]. Therefore, this paper selects the kernel functions commonly used by support vector machines, the RBF and linear kernels, and compares the classification performance of the kernel ELM under the different kernel functions.

2.4. The ELM-MHC Online Identification Server

With the popularity of machine learning methods, bioinformatics has developed greatly [82-88], and machine learning prediction methods combined with online servers have high practical value [89-92]. Therefore, this paper developed the ELM-MHC online prediction server to identify the MHC; its access link is http://server.malab.cn/ELM-MHC/. On the website, users can submit protein sequences or protein sequence files in fasta format. ELM-MHC performs feature extraction on the submitted sequence and gives the probability that it is MHC or non-MHC; when a sequence is MHC, it further predicts whether the sequence is MHC I or MHC II. DMHC and the related running files can be downloaded from the server.

3 Results

This section gives a detailed account of the evaluation indicators, the recognition results for MHC, and the recognition results for MHC I and MHC II.

3.1. Measurement


To evaluate the prediction accuracy of ELM-MHC, this paper uses four indicators: sensitivity (SE), specificity (SP), accuracy (ACC), and the Matthews correlation coefficient (MCC), with expressions given in formulas (7)-(10)[93-104]:

SE = TP / (TP + FN)    (7)

SP = TN / (TN + FP)    (8)

ACC = (TN + TP) / (TN + FP + TP + FN)    (9)

MCC = (TP × TN - FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))    (10)

Here TP and TN denote the numbers of correctly predicted sequences among the positive and negative examples, respectively, and FP and FN denote the numbers of incorrectly predicted sequences among the negative and positive examples, respectively. SE is the proportion of true MHC sequences that are correctly identified, TP / (TP + FN); SP is the proportion of non-MHC sequences that are correctly identified, TN / (TN + FP); ACC is the overall prediction accuracy; and a classifier is perfect when the MCC is 1.

In this paper, we choose the extreme learning machine as the classification algorithm together with the mixed feature representation of SVMProt 188D, BonG, and information theory, and use both independent test set validation and ten-fold cross-validation, for identifying MHC and for classifying MHC I versus MHC II. In the independent test, 80% of the data set is used for training and 20% for validation. In ten-fold cross-validation, the data are divided into ten parts to obtain 10 classification results, and the accuracy is the average of the 10 results. First, we build a classifier that recognizes MHC protein sequences, and then identify MHC class I and II for the sequences identified as MHC.
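Equations (7)-(10) translate directly into code. A small sketch with an illustrative confusion matrix (the counts here are made up, not the paper's results):

```python
import math

def metrics(tp, tn, fp, fn):
    se = tp / (tp + fn)                         # Eq. (7): sensitivity
    sp = tn / (tn + fp)                         # Eq. (8): specificity
    acc = (tn + tp) / (tn + fp + tp + fn)       # Eq. (9): accuracy
    mcc = (tp * tn - fp * fn) / math.sqrt(      # Eq. (10): Matthews coefficient
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return se, sp, acc, mcc

se, sp, acc, mcc = metrics(tp=90, tn=85, fp=15, fn=10)
print(acc)   # 0.875
```

MCC is the most informative of the four on imbalanced data, since it accounts for all four cells of the confusion matrix.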


3.2 Two-Class Predictor for Identifying MHC

To classify MHC, the data sets used in this section are Smhc and Snon-mhc. The contrast experiments in this section consist of two parts: different feature representation methods and different classification algorithms.

3.2.1 Performance of Different Features

In this section, the paper uses the mixed feature representation of SVMProt 188D, BonG, and information theory. On the basis of the ELM classification algorithm, the different feature representation methods are applied in turn, and the classification results are compared through ten-fold cross-validation and independent test set validation. The compared representations are the different combinations of 188D, BonG, and IT, as well as each of the three as a single feature representation. We also select different activation functions, Lin_kernel and RBF_kernel, to verify the classification effect of the ELM.

[Figure 5: chart of accuracy (%) for the single and combined 188D, BonG, and IT feature representations under the RBF_kernel and Lin_kernel activation functions; the accuracy axis spans roughly 50-100%.]

Figure 5. Comparison of the accuracy of different feature representation methods.


Figure 5 shows the effects of 188D, BonG, IT, and the mixed representation methods. The hybrid of all three achieves the best effect: when the activation function is the linear function, the classification accuracy is the highest, 91.66%, exceeding the other feature representations. When the activation function is RBF_kernel, the accuracy of 188D is the lowest, 67.11%. Among the three single feature representations, 188D classifies better than the other two, which also shows that analyzing the physicochemical properties of amino acids helps in recognizing proteins. The figure further shows that the choice of activation function has a significant impact on the ELM's classification performance: the linear activation function (Lin_kernel) is clearly better than RBF_kernel. Therefore, we choose Lin_kernel as the activation function for the independent test set verification of the classifier.

Table 2. Experimental data for the different feature representation methods.

Method        | ACC (%) | MCC   | SE    | SP
BonG+188D+IT  | 93.1712 | 0.864 | 0.923 | 0.942
188-D+BonG    | 89.58   | 0.787 | 0.871 | 0.916
188-D+IT      | 89.4713 | 0.789 | 0.897 | 0.892
IT            | 83.421  | 0.672 | 0.882 | 0.787
188-D         | 91.231  | 0.829 | 0.862 | 0.963
BonG          | 74.57   | 0.540 | 0.535 | 0.955

Table 3. Experimental data for the different classifiers.

Classifier          | ACC (%) | MCC    | SE     | SP
ELM                 | 91.66   | 0.822  | 0.893  | 0.908
Random Forest       | 85.48   | 0.7140 | 0.9087 | 0.8015
Naive Bayes         | 80.86   | 0.6172 | 0.8156 | 0.8016
SGD                 | 82.08   | 0.6417 | 0.8266 | 0.8151
Nearest Neighbors   | 84.61   | 0.6963 | 0.7895 | 0.9021
Decision Tree       | 79.99   | 0.6963 | 0.7975 | 0.8022
LinearSVC           | 87.14   | 0.7428 | 0.8641 | 0.84785
Logistic Regression | 89.52   | 0.7911 | 0.8737 | 0.9166
LibSVM              | 73.28   | 0.5116 | 0.9380 | 0.5294
ExtraTrees          | 84.20   | 0.6894 | 0.9033 | 0.813
Bagging             | 86.32   | 0.7270 | 0.8823 | 0.8443
AdaBoost            | 87.03   | 0.7406 | 0.8683 | 0.8723
GradientBoosting    | 90.98   | 0.8200 | 0.8921 | 0.9272

3.2.2 Comparison with Other Classifiers

To verify the validity of the classification algorithm, its results were compared with those of other classification algorithms. Based on the preceding experiment, we selected the hybrid feature representation of 188D, BonG, and IT, which gave the highest classification accuracy. The classification performance of ELM was compared with that of random forest (RF)[105-109], support vector machines (LibSVM)[110-113], and ensemble classification algorithms such as AdaBoost. The ten-fold cross-validation results are shown in Table 3. When Lin_kernel is selected as the activation function, the accuracy of ELM reaches a maximum of 91.66%, which confirms the validity of choosing the extreme learning machine as the classifier.

3.3 Two-Class Predictor for Identifying MHC I and MHC II

MHC I and MHC II are the two classic classes of MHC proteins. We used the constructed classifier to distinguish MHC I from MHC II and analyzed the experimental results. In Smhc, the numbers of MHC I and MHC II sequences are 3350 and 3362, respectively; this set was used as the dataset for identifying MHC I and MHC II.

3.3.1 Performance of Different Features

[Figure 6: bar chart; x-axis: feature representation methods (IT, BonG, 188, and their combinations); y-axis: Accuracy (%); series: RBF_kernel and Lin_kernel.]

Figure 6. Comparison of the accuracy of different feature representation methods.

In this section, we use the hybrid feature representation combining SVMProt 188D, BonG, and information theory (IT). With the ELM classification algorithm fixed, different feature representation methods are applied in turn, and the classification results are compared through 10-fold cross-validation and independent-set validation. The compared representations are the combinations of 188D, BonG, and IT, as well as each single feature representation method. As before, we also compare the classification performance of the two activation functions, Lin_kernel and RBF_kernel, under ten-fold cross-validation. The ten-fold cross-validation results are shown in Figure 6, and the independent test set results are shown in Table 4.
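As a rough illustration of how such a hybrid descriptor is assembled, the sketch below concatenates an amino-acid composition vector (a simplified stand-in for the 188D descriptor), normalized bag-of-2-gram counts (BonG-style), and a Shannon-entropy feature (a stand-in for the information-theory descriptor). The exact descriptor definitions used by ELM-MHC differ; this only shows the concatenation pattern.

```python
import math
from collections import Counter
from itertools import product

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def composition(seq):
    """Fraction of each of the 20 amino acids (simplified stand-in for 188D)."""
    n = len(seq)
    counts = Counter(seq)
    return [counts[a] / n for a in AA]

def bag_of_ngrams(seq, n=2):
    """Normalized counts over all 400 possible 2-grams (BonG-style)."""
    grams = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = max(sum(grams.values()), 1)
    return [grams["".join(p)] / total for p in product(AA, repeat=n)]

def entropy(seq):
    """Shannon entropy of the residue distribution (information-theory-style feature)."""
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())

def hybrid_features(seq):
    # Concatenate the three descriptor families into one fixed-length vector.
    return composition(seq) + bag_of_ngrams(seq) + [entropy(seq)]

vec = hybrid_features("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
# 20 composition values + 400 2-gram values + 1 entropy value = 421 features
```

The fixed-length vector produced this way can be fed to any of the classifiers compared in this section.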


From the comparison in Figure 6, the extreme learning machine performs best when Lin_kernel is selected as the activation function, and the three-feature hybrid representation again achieves the best result, with a classification accuracy of 94.42%. For the independent test set verification, we select Lin_kernel as the activation function; Table 4 shows the results.

Table 4. Experimental results for different feature representation methods.

Method          ACC (%)    MCC      SE       SP
BonG+188D+IT    97.432     0.949    0.990    0.859
188-D+BonG      94.10      0.888    0.930    0.958
188-D+IT        92.02      0.829    0.862    0.963
IT              76.48      0.530    0.944    0.620
188-D           93.171     0.864    0.923    0.940
BonG            82.494     0.664    0.823    0.827

According to Table 4, the hybrid representation of the three features achieves the best results: the classification accuracy is 97.432%, and good results are also obtained on the other indicators such as SE and SP. This again confirms the validity of the hybrid feature representation of 188D, BonG, and IT. Verified by both ten-fold cross-validation and the independent set, the selected representation method recognizes MHC I and MHC II well.

3.3.2 Comparison with Other Classifiers

To verify the accuracy of the classifier, the experimental results were compared with those of other classifiers. Based on the preceding experiment, we selected the hybrid feature representation of 188D, BonG, and IT, which gave the highest classification accuracy. The classification performance of ELM was compared with that of random forest (RF), support vector machines (LibSVM), and ensemble classification algorithms such as AdaBoost, under ten-fold cross-validation. When Lin_kernel is selected as the activation function, the accuracy of ELM reaches a maximum of 94.442%, which again confirms the validity of the extreme learning machine.

Table 5. Experimental results for different classifiers.

Classifier            ACC (%)    MCC       SE        SP
ELM                   94.442     0.822     0.893     0.908
Random Forest         88.52      0.7722    0.9206    0.8495
Naive Bayes           82.60      0.6520    0.8333    0.8186
SGD                   82.12      0.6424    0.8268    0.8156
Nearest Neighbors     84.61      0.6963    0.7895    0.9021
Decision Tree         88.22      0.6963    0.7975    0.8022
LinearSVC             87.14      0.7656    0.8561    0.9086
Logistic Regression   90.31      0.8067    0.8859    0.9205
LibSVM                88.83      0.7820    0.8308    0.9464
ExtraTrees            85.46      0.7128    0.9063    0.8023
Bagging               89.77      0.7966    0.9256    0.8695
AdaBoost              90.19      0.8038    0.9004    0.9034
GradientBoosting      92.80      0.8562    0.9357    0.9357
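The four indicators reported in Tables 3 to 5 are the standard confusion-matrix measures. For reference, a small sketch of how they are computed from TP/TN/FP/FN counts (the example counts are made up for illustration):

```python
import math

def metrics(tp, tn, fp, fn):
    """ACC, MCC, SE (sensitivity/recall), and SP (specificity) from confusion counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    se = tp / (tp + fn)        # true positive rate
    sp = tn / (tn + fp)        # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, mcc, se, sp

acc, mcc, se, sp = metrics(tp=90, tn=85, fp=15, fn=10)
# acc = 0.875; se = 0.90; sp = 0.85
```

For a roughly balanced test set, ACC is close to the average of SE and SP, which is a useful sanity check when reading the tables above.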

4 Conclusions

In this paper, we designed a new classifier, ELM-MHC. It combines SVMProt 188-D, bag-of-n-grams (BonG), and information theory (IT) features with an extreme learning machine classifier. The constructed classifier was used to identify MHC proteins and, separately, the specific categories MHC I and MHC II. To verify the effect of ELM-MHC, we performed 10-fold cross-validation and independent test set verification and compared the results with other feature representations and classifiers. For identifying MHC proteins and for distinguishing MHC I from MHC II, 10-fold cross-validation achieved accuracies of 91.66% and 94.442%, respectively, and the independent test set accuracies were 93.1712% and 97.432%, respectively. ELM-MHC thus predicts better than the other feature extraction algorithms and classifiers considered, which shows that the ELM and the hybrid feature representation are effective for MHC recognition. The online server makes MHC prediction openly available; ELM-MHC can be accessed at http://server.malab.cn/ELM-MHC/.

In the future, high-accuracy classifiers could be combined into an integrated classifier to further optimize the prediction performance of ELM-MHC, and the server's parallel processing capability could be improved. Moreover, the use of computational intelligence computing platforms [114-117] and cloud computing [118-121] may improve the performance of the classification process.

Acknowledgement: The work was supported by the National Key R&D Program of China (2018YFC0910405), the Natural Science Foundation of China (No. 61771331, 61300098), and the Fundamental Research Funds for the Central Universities (No. 2572017CB33).

References

1. Kaufman, J., Unfinished Business: Evolution of the MHC and the Adaptive Immune System of Jawed Vertebrates. Annual Review of Immunology, 2018. 36, 383.
2. Monzón-Casanova, E., et al., The Forgotten: Identification and Functional Characterization of MHC Class II Molecules H2-Eb2 and RT1-Db2. Journal of Immunology, 2016. 196, 988.
3. Trowsdale, J. and J.C. Knight, Major histocompatibility complex genomics and human disease. Annual Review of Genomics & Human Genetics, 2013. 14, 301-323.
4. Rock, K.L., E. Reits, and J. Neefjes, Present Yourself! By MHC Class I and MHC Class II Molecules. Trends in Immunology, 2016. 37, 724-737.
5. Comber, J.D., et al., MHC Class I Presented T Cell Epitopes as Potential Antigens for Therapeutic Vaccine against HBV Chronic Infection. Hepatitis Research & Treatment, 2014. 2014, 423-431.
6. Nakayama, M., Antigen Presentation by MHC-Dressed Cells. Front Immunol, 2014. 5, 672.
7. Giguère, S., et al., MHC-NP: predicting peptides naturally processed by the MHC. Journal of Immunological Methods, 2013. 400-401, 30-36.
8. Han, Y. and D. Kim, Deep convolutional neural networks for pan-specific peptide-MHC class I binding prediction. BMC Bioinformatics, 2017. 18, 585.
9. Xu, Y., et al., Deep learning of the splicing (epi)genetic code reveals a novel candidate mechanism linking histone modifications to ESC fate decision. Nucleic Acids Res, 2017. 45, 12100-12112.
10. H, N., et al., Hidden Markov Model-Based Prediction of Antigenic Peptides That Interact with MHC Class II Molecules. Journal of Bioscience & Bioengineering, 2002. 94, 264-270.
11. Yu, L., et al., Prediction of Novel Drugs for Hepatocellular Carcinoma Based on Multi-Source Random Walk. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2017. 14, 966-977.
12. Andreatta, M., et al., Accurate pan-specific prediction of peptide-MHC class II binding affinity with improved binding core identification. Immunogenetics, 2015. 67, 641-650.

13. Kuksa, P.P., et al., High-order neural networks and kernel methods for peptide-MHC binding prediction. Bioinformatics, 2015. 31, 3600-3607.
14. Bui, H.H., et al., Automated generation and evaluation of specific MHC binding predictive tools: ARB matrix applications. Immunogenetics, 2005. 57, 304-314.
15. Doytchinova, I.A. and D.R. Flower, Towards the in silico identification of class II restricted T-cell epitopes: a partial least squares iterative self-consistent algorithm for affinity prediction. Bioinformatics, 2003. 19, 2263.
16. Kaufman, J.F., et al., Xenopus MHC class II molecules. I. Identification and structural characterization. The Journal of Immunology, 1985. 134, 3248.
17. Malmstrøm, M., et al., Evolution of the immune system influences speciation rates in teleost fishes. Nature Genetics, 2016. 48, 1204.
18. Edholm, E.S., M. Banach, and J. Robert, Evolution of innate-like T cells and their selection by MHC class I-like molecules. Immunogenetics, 2016, 525-536.
19. Hearn, C., et al., An MHC class I immune evasion gene of Marek's disease virus. Virology, 2015. 475, 88-95.
20. Li, Y.H., et al., Therapeutic target database update 2018: enriched resource for facilitating bench-to-clinic research of targeted therapeutics. Nucleic Acids Res, 2018. 46, D1121-D1127.
21. Li, B., et al., NOREVA: normalization and evaluation of MS-based metabolomics data. Nucleic Acids Res, 2017. 45, W162-W170.
22. Yu, L., et al., Inferring drug-disease associations based on known protein complexes. BMC Medical Genomics, 2015. 8, 13.
23. Hopkins, J., B.M. Dutia, and I. McConnell, Monoclonal antibodies to sheep lymphocytes. I. Identification of MHC class II molecules on lymphoid tissue and changes in the level of class II expression on lymph-borne cells following antigen stimulation in vivo. Immunology, 1986. 59, 433.
24. Westbrook, C.J., et al., No assembly required: Full-length MHC class I allele discovery by PacBio circular consensus sequencing. Human Immunology, 2015. 76, 891-896.
25. Shiina, T., et al., Discovery of novel MHC-class I alleles and haplotypes in Filipino cynomolgus macaques (Macaca fascicularis) by pyrosequencing and Sanger sequencing: Mafa-class I polymorphism. Immunogenetics, 2015. 67, 563-578.
26. Maccari, G., et al., IPD-MHC 2.0: an improved inter-species database for the study of the major histocompatibility complex. Nucleic Acids Research, 2017. 45, D860-D864.

27. Maccari, G., et al., IPD-MHC: nomenclature requirements for the non-human major histocompatibility complex in the next-generation sequencing era. Immunogenetics, 2018. 70, 619-623.
28. Dijkstra, J.M., et al., Comprehensive analysis of MHC class II genes in teleost fish genomes reveals dispensability of the peptide-loading DM system in a large part of vertebrates. BMC Evolutionary Biology, 2013. 13, 260.
29. Wei, L., et al., Improved prediction of protein-protein interactions using novel negative samples, features, and an ensemble classifier. Artificial Intelligence in Medicine, 2017. 83, 67-74.
30. Wei, L., et al., A novel hierarchical selective ensemble classifier with bioinformatics application. Artificial Intelligence in Medicine, 2017. 83, 82-90.
31. Niu, M., et al., RFAmyloid: A Web Server for Predicting Amyloid Proteins. International Journal of Molecular Sciences, 2018. 19, 2071.
32. Yang, H., et al., iRNA-2OM: A Sequence-Based Predictor for Identifying 2′-O-Methylation Sites in Homo sapiens. Journal of Computational Biology, 2018. 25, 1266-1277.
33. Liu, B., BioSeq-Analysis: a platform for DNA, RNA, and protein sequence analysis based on machine learning approaches. Briefings in Bioinformatics, 2018. DOI: 10.1093/bib/bbx165.
34. Chen, J., et al., A comprehensive review and comparison of different computational methods for protein remote homology detection. Briefings in Bioinformatics, 2018. 9, 231-244.
35. Zou, Q., et al., Pretata: predicting TATA binding proteins with novel features and dimensionality reduction strategy. BMC Systems Biology, 2016. 10, 114.
36. Xu, L., et al., A Novel Hybrid Sequence-Based Model for Identifying Anticancer Peptides. Genes, 2018. 9, 158.
37. Xu, Y., et al., Identify bilayer modules via pseudo-3D clustering: applications to miRNA-gene bilayer networks. Nucleic Acids Research, 2016. 44, e152.
38. Zhu, P.F., et al., Combining neighborhood separable subspaces for classification via sparsity regularized optimization. Information Sciences, 2016. 370, 270-287.

39. Dubchak, I., et al., Prediction of protein folding class using global description of amino acid sequence. Proceedings of the National Academy of Sciences of the United States of America, 1995. 92, 8700-8704.
40. Li, Y.H., et al., SVM-Prot 2016: A Web-Server for Machine Learning Prediction of Protein Functional Families from Sequence Irrespective of Similarity. PLoS One, 2016. 11, e0155290.
41. Chen, L., et al., Hierarchical Classification of Protein Folds Using a Novel Ensemble Classifier. PLoS One, 2013. 8, e56499.
42. Zou, Q., et al., An Approach for Identifying Cytokines Based on a Novel Ensemble Classifier. BioMed Research International, 2013. 2013, 686090.
43. Liu, B., H. Wu, and K.C. Chou, Pse-in-One 2.0: An Improved Package of Web Servers for Generating Various Modes of Pseudo Components of DNA, RNA, and Protein Sequences. Natural Science, 2017. 09, 67-91.
44. Wang, G., et al., BinMemPredict: a Web Server and Software for Predicting Membrane Protein Types. Current Proteomics, 2013. 10, 1-2.
45. Chen, J., et al., ProtDec-LTR2.0: An improved method for protein remote homology detection by combining pseudo protein and supervised Learning to Rank. Bioinformatics, 2017. 33, 3473-3476.
46. Wan, S., Y. Duan, and Q. Zou, HPSLPred: An Ensemble Multi-label Classifier for Human Protein Subcellular Location Prediction with Imbalanced Source. Proteomics, 2017. 17, 1700262.
47. Zhu, P.F., et al., Multi-view label embedding. Pattern Recognition, 2018. 84, 126-135.
48. Yu, L., J. Zhao, and L. Gao, Predicting Potential Drugs for Breast Cancer based on miRNA and Tissue Specificity. International Journal of Biological Sciences, 2018. 14, 971-980.
49. Zhang, J. and L. Kurgan, Review and comparative assessment of sequence-based predictors of protein-binding residues. Briefings in Bioinformatics, 2017.

50. Zhang, J., Z. Ma, and L. Kurgan, Comprehensive review and empirical analysis of hallmarks of DNA-, RNA- and protein-binding residues in protein chains. Briefings in Bioinformatics, 2017.
51. Zhu, P.F., et al., Co-regularized unsupervised feature selection. Neurocomputing, 2018. 275, 2855-2863.

52. Zhu, P.F., et al., Multi-label feature selection with missing labels. Pattern Recognition, 2018. 74, 488-502.
53. Zhu, P.F., et al., Subspace clustering guided unsupervised feature selection. Pattern Recognition, 2017. 66, 364-374.
54. Cummins, N., et al., Multimodal Bag-of-Words for Cross Domains Sentiment Analysis. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018.
55. Liu, B., et al., Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Research, 2015. 43, W65-W71.
56. Qu, K., et al., Identification of DNA-Binding Proteins Using Mixed Feature Representation Methods. Molecules, 2017. 22, 1602.
57. Wei, L., et al., ACPred-FL: a sequence-based predictor using effective feature representation to improve the prediction of anti-cancer peptides. Bioinformatics, 2018. 34, 4007-4016.
58. Wei, L., et al., PhosPred-RF: a novel sequence-based predictor for phosphorylation sites using sequential information only. IEEE Transactions on NanoBioscience, 2017. 16, 240-247.
59. Wei, L., et al., Fast prediction of protein methylation sites using a sequence-based feature selection technique. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2017, 1-1.

60. Liu, B. and S. Li, ProtDet-CCH: Protein remote homology detection by combining Long Short-Term Memory and ranking methods. IEEE/ACM Transactions on Computational Biology and Bioinformatics. DOI: 10.1109/TCBB.2018.2789880.

61. Deng, L., et al., PredHS: a web server for predicting protein-protein interaction hot spots by using structural neighborhood properties. Nucleic Acids Research, 2014. 42, W290-W295.
62. Peng, J.J., et al., A novel method to measure the semantic similarity of HPO terms. International Journal of Data Mining and Bioinformatics, 2017. 17, 173-188.
63. Hu, Y., et al., Identifying diseases-related metabolites using random walk. BMC Bioinformatics, 2018. 19, 116.
64. Cheng, L., et al., InfAcrOnt: calculating cross-ontology term similarities using information flow by a random walk. BMC Genomics, 2018. 19, 919.
65. Xu, Y., et al., A novel insight into Gene Ontology semantic similarity. Genomics, 2013. 101, 368-375.
66. Wei, L., et al., Enhanced protein fold prediction method through a novel feature extraction technique. IEEE Transactions on NanoBioscience, 2015. 14, 649-659.
67. Wei, L., et al., An Improved Protein Structural Classes Prediction Method by Incorporating Both Sequence and Structure Information. IEEE Transactions on NanoBioscience, 2015. 14, 339-349.
68. Cao, J., et al., Voting based extreme learning machine. Information Sciences, 2012. 185, 66-77.
69. Cao, J. and L. Xiong, Protein Sequence Classification with Improved Extreme Learning Machine Algorithms. BioMed Research International, 2014. 2014, 12.

70. Wang, D. and G.B. Huang, Protein sequence classification using extreme learning machine. In Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN '05), 2005.
71. Shen, Y., J. Tang, and F. Guo, Identification of protein subcellular localization via integrating evolutionary and physicochemical information into Chou's general PseAAC. Journal of Theoretical Biology, 2019. 462, 230-239.
72. Jiang, L., et al., FKL-Spa-LapRLS: an accurate method for identifying human microRNA-disease association. BMC Genomics, 2019. 19.
73. Ding, Y., J. Tang, and F. Guo, Identification of drug-side effect association via multiple information integration with centered kernel alignment. Neurocomputing, 2019. 325, 211-224.
74. Song, J., J. Tang, and F. Guo, Identification of Inhibitors of MMPS Enzymes via a Novel Computational Approach. International Journal of Biological Sciences, 2018. 14, 863-871.
75. Pan, G., et al., A Novel Computational Method for Detecting DNA Methylation Sites with DNA Sequence Information and Physicochemical Properties. International Journal of Molecular Sciences, 2018. 19, 511.

76. Jiang, L., et al., MDA-SKF: Similarity Kernel Fusion for Accurately Discovering miRNA-Disease Association. Frontiers in Genetics, 2018. 9.
77. Huang, G., et al., Semi-supervised and unsupervised extreme learning machines. IEEE Transactions on Cybernetics, 2014. 44, 2405.
78. Huang, G.B., Q.Y. Zhu, and C.K. Siew, Extreme learning machine: Theory and applications. Neurocomputing, 2006. 70, 489-501.
79. Huang, G.B., Q.Y. Zhu, and C.K. Siew, Extreme learning machine: a new learning scheme of feedforward neural networks. In Proceedings of the 2004 IEEE International Joint Conference on Neural Networks, 2004.
80. Huang, G.B., X. Ding, and H. Zhou, Optimization method based extreme learning machine for classification. Neurocomputing, 2010. 74, 155-163.
81. Frénay, B. and M. Verleysen, Using SVMs with randomised feature spaces: an extreme learning approach. In ESANN, 2010.
82. Zeng, X.X., et al., Prediction of potential disease-associated microRNAs using structural perturbation method. Bioinformatics, 2018. 34, 2425-2432.
83. Zeng, X., et al., Probability-based collaborative filtering model for predicting gene-disease associations. BMC Medical Genomics, 2017. 10, 76.
84. Zou, Q., et al., Similarity computation strategies in the microRNA-disease network: a survey. Briefings in Functional Genomics, 2016. 15, 55-64.
85. Zou, Q., et al., Reconstructing evolutionary trees in parallel for massive sequences. BMC Systems Biology, 2017. 11, 15-21.

86. Wang, X., et al., A Classification Method for Microarrays Based on Diversity. Current Bioinformatics, 2016. 11, 590-597.
87. Zou, Q., et al., Machine learning and graph analytics in computational biomedicine. Artificial Intelligence in Medicine, 2017. 83, 1.
88. Xuan, Z., et al., Meta-path methods for prioritizing candidate disease miRNAs. IEEE/ACM Transactions on Computational Biology & Bioinformatics, 2017. PP, 1-1.
89. Zeng, X., X. Zhang, and Q. Zou, Integrative approaches for predicting microRNA function and prioritizing disease-related microRNA using biological interaction networks. Briefings in Bioinformatics, 2016. 17, 193-203.
90. Liu, Y., et al., Inferring MicroRNA-Disease Associations by Random Walk on a Heterogeneous Network with Multiple Data Sources. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2017. 14, 905-915.
91. Xu, Y., et al., System-level insights into the cellular interactome of a non-model organism: inferring, modelling and analysing functional gene network of soybean (Glycine max). PLoS One, 2014. 9, e113907.
92. Xu, Y., et al., SoyFN: a knowledge database of soybean functional networks. Database, 2014. 2014.
93. Wei, L., et al., Prediction of human protein subcellular localization using deep learning. Journal of Parallel and Distributed Computing, 2018. 117, 212-217.
94. Wei, L., J. Tang, and Q. Zou, Local-DPP: An Improved DNA-binding Protein Prediction Method by Exploring Local Evolutionary Information. Information Sciences, 2017. 384, 135-144.

95. Yang, H., et al., iRSpot-Pse6NC: Identifying recombination spots in Saccharomyces cerevisiae by incorporating hexamer composition into general PseKNC. International Journal of Biological Sciences, 2018. 14, 883-891.
96. Tang, H., et al., HBPred: a tool to identify growth hormone-binding proteins. International Journal of Biological Sciences, 2018. 14, 957-964.
97. Su, Z.-D., et al., iLoc-lncRNA: predict the subcellular location of lncRNAs by incorporating octamer composition into general PseKNC. Bioinformatics, 2018. 34, 4196-4204.
98. Liu, B., et al., iRSpot-EL: identify recombination spots with an ensemble learning approach. Bioinformatics, 2017. 33, 35-41.
99. Zhang, J., et al., Ontological function annotation of long non-coding RNAs through hierarchical multi-label classification. Bioinformatics, 2018. 34, 1750-1757.
100. Peng, J.J., W.W. Hui, and X.Q. Shang, Measuring phenotype-phenotype similarity through the interactome. BMC Bioinformatics, 2018. 19, 114.
101. Cheng, L., et al., DincRNA: a comprehensive web-based bioinformatics toolkit for exploring disease associations and ncRNA function. Bioinformatics, 2018, bty002.
102. Cheng, L., et al., LncRNA2Target v2.0: a comprehensive database for target genes of lncRNAs in human and mouse. Nucleic Acids Research, 2018, gky1051.
103. Xu, L., et al., SeqSVM: A Sequence-Based Support Vector Machine Method for Identifying Antioxidant Proteins. International Journal of Molecular Sciences, 2018. 19, 1773.

104. Xu, L., et al., An Efficient Classifier for Alzheimer's Disease Genes Identification. Molecules, 2018. 23, 3140.
105. Basu, S., F. Söderquist, and B. Wallner, Proteus: a random forest classifier to predict disorder-to-order transitioning binding regions in intrinsically disordered proteins. Journal of Computer-Aided Molecular Design, 2017. 31, 453-466.
106. Zhang, N., et al., Computational prediction and analysis of protein γ-carboxylation sites based on a random forest method. Molecular Biosystems, 2012. 8, 2946-2955.
107. Shu, Y., et al., Predicting A-to-I RNA Editing by Feature Selection and Random Forest. PLoS One, 2014. 9, e110607.
108. Liu, B., et al., iRO-3wPseKNC: Identify DNA replication origins by three-window-based PseKNC. Bioinformatics, 2018. 34, 3086-3093.
109. Pan, Y., Z. Wang, W. Zhan, and L. Deng, Computational identification of binding energy hot spots in protein-RNA complexes using an ensemble approach. Bioinformatics, 2017. 34, 1473-1480.
110. Cai, C.Z., et al., SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Research, 2003. 31, 3692-3697.
111. Li, D., Y. Ju, and Q. Zou, Protein Folds Prediction with Hierarchical Structured SVM. Current Proteomics, 2016. 13, 79-85.

112. Wang, S.P., et al., Analysis and Prediction of Nitrated Tyrosine Sites with the mRMR Method and Support Vector Machine Algorithm. Current Bioinformatics, 2018. 13, 3-13.
113. Zhang, N., et al., Discriminating Ramos and Jurkat Cells with Image Textures from Diffraction Imaging Flow Cytometry Based on a Support Vector Machine. Current Bioinformatics, 2018. 13, 50-56.
114. Song, T., et al., A Parallel Workflow Pattern Modelling Using Spiking Neural P Systems with Colored Spikes. IEEE Transactions on NanoBioscience, 2018. 17, 474-484.
115. Song, T., et al., Spiking Neural P Systems with Colored Spikes. IEEE Transactions on Cognitive and Developmental Systems, 2018. 10, 1106-1115.
116. Hang, X., et al., An Evolutionary Algorithm Based on Minkowski Distance for Many-Objective Optimization. IEEE Transactions on Cybernetics, 1-12.
117. Hang, X., et al., MOEA/HD: A Multiobjective Evolutionary Algorithm Based on Hierarchical Decomposition. IEEE Transactions on Cybernetics, 2017. PP, 1-10.
118. Mrozek, D., P. Daniłowicz, and B. Małysiak-Mrozek, HDInsight4PSi: Boosting performance of 3D protein structure similarity searching with HDInsight clusters in Microsoft Azure cloud. Information Sciences, 2016. 349-350, 77-101.
119. Mrozek, D., P. Gosk, and B. Małysiak-Mrozek, Scaling Ab Initio Predictions of 3D Protein Structures in Microsoft Azure Cloud. Journal of Grid Computing, 2015. 13, 561-585.
120. Mrozek, D., B. Małysiak-Mrozek, and A. Kłapciński, Cloud4Psi: cloud computing for 3D protein structure similarity searching. Bioinformatics, 2014. 30, 2822-2825.
121. Zou, Q., et al., Survey of MapReduce frame operation in bioinformatics. Briefings in Bioinformatics, 2014. 15, 637-647.


Table of Contents graphic (TOC): input protein sequences (dataset input) -> three sequence-based feature descriptors (feature representation) -> feature selection strategy (feature selection) -> prediction models (prediction engine) -> predicted MHC, MHC I, and MHC II (predictor).
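The four TOC stages can be sketched as a simple function chain; every stage below is a toy stand-in (the threshold rule, feature choices, and function names are illustrative assumptions, not the ELM-MHC implementation):

```python
# Each stage is a plain function; all names and toy stages are illustrative.
def featurize(seq):
    # Feature representation: here, just length and the fraction of 'A' residues.
    return [len(seq), seq.count("A") / len(seq)]

def select(rows):
    # Feature selection: keep only the second feature.
    return [[r[1]] for r in rows]

def predict(rows):
    # Prediction engine: a threshold rule standing in for the trained ELM.
    return ["MHC" if r[0] > 0.2 else "non-MHC" for r in rows]

def elm_mhc_pipeline(sequences):
    # Dataset input -> feature representation -> feature selection -> predictor.
    return predict(select([featurize(s) for s in sequences]))

labels = elm_mhc_pipeline(["AAAK", "KKKK"])
# → ["MHC", "non-MHC"]
```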
