ELM-MHC: An improved MHC Identification method with Extreme Learning Machine Algorithm
Yanjuan Li, Mengting Niu, and Quan Zou
J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.9b00012 • Publication Date (Web): 30 Jan 2019
Downloaded from http://pubs.acs.org on January 31, 2019
ELM-MHC: An improved MHC Identification method with Extreme Learning Machine Algorithm

Yanjuan Li 1, Mengting Niu 1, Quan Zou 2,3,*

1 School of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China
2 Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
3 Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
* Correspondence:
[email protected]; Tel: +86-136-5600-9020
Abstract: The major histocompatibility complex (MHC) refers to the group of genes encoding the major histocompatibility antigens. MHC molecules bind peptide chains derived from pathogens and display them on the cell surface for recognition by T cells, thereby triggering a series of immune functions. MHC molecules are critical in transplantation, autoimmunity, infection, and tumor immunotherapy, so more accurate recognition of MHC proteins, achieved by combining machine learning algorithms with bioinformatics analysis techniques, is an important task. This paper proposes a new MHC recognition method as an alternative to traditional biological methods and uses the constructed classifier to identify MHC I and MHC II. The classifier combines three feature representation methods, SVMProt 188D, bag-of-ngrams (BonG), and information theory (IT), with an extreme learning machine (ELM) that uses a linear kernel as its activation function. Ten-fold cross-validation and independent test set validation were used to verify the accuracy of the constructed classifier, both when identifying MHC proteins and when identifying MHC I and MHC II separately. Under ten-fold cross-validation, the proposed algorithm obtained 91.66% accuracy when identifying MHC and 94.442% accuracy when identifying the MHC
categories. Furthermore, an online identification website named ELM-MHC was constructed at the following URL: http://server.malab.cn/ELM-MHC/.
Keywords: major histocompatibility complex; extreme learning machine; MHC I; MHC II; identification; machine learning
1 Introduction
The major histocompatibility complex (MHC) is a genomic region containing many genes that encode a wide variety of molecules, including the highly polymorphic classical class I and class II molecules, which are vital to the adaptive immune response of vertebrates[1]. These classical MHC molecules present peptides to thymus-derived (T) lymphocytes[2]. In addition to binding and presenting peptides, classical MHC molecules typically show high allelic polymorphism and sequence diversity[3]. MHC molecules are essential for transplantation, autoimmunity, infection, and tumor immunotherapy[4, 5]. MHC I and MHC II are three-dimensional cell surface glycoproteins that present peptide fragments to the immune system, providing a target for controlling diseases such as tumors[6]. Currently, there are many methods for predicting MHC-binding peptides[7]. Based on features extracted from amino acid physicochemical properties, molecular properties, and so on, DCNNs[8, 9], Markov models[10, 11], and nonlinear advanced neural networks are used to predict MHC-peptide binding, and many directly usable predictive tools, such as SVMHC, NetMHCIIpan[12], HONNs[13], ARB[14], and MHCpred[15], have been generated. The MHC is one of the most variable and polygenic regions in the vertebrate genome[16]; in particular, the MHC I and II loci strongly influence the effectiveness of immune responses against specific pathogens, such as avian leukemia and other
viruses. MHC I and MHC II are the two classic classes of MHC proteins. MHC I is expressed on the surface of nucleated cells, although expression levels vary widely among different tissues; class I molecules have an antigen recognition function restricted to CD8+ T cells. Class II molecules are found mainly on antigen-presenting cells; their biological role is to present processed antigens to CD4+ T cells in the initial stages of the immune response. Class II molecules are involved in the presentation of exogenous antigens and, under certain conditions, can also present endogenous antigens. Figure 1 shows the phylogenetic tree of the MHC. Scientists have long worked to discover MHC molecules in various vertebrate genomes[17-22]. Hopkins et al.[23] described a rat monoclonal antibody that recognizes a sheep MHC class II antigen and appears to recognize a non-polymorphic determinant; using this antibody, they investigated the distribution of class II molecules in sheep and the changes in class II expression in peripheral and efferent lymphocytes produced by in vitro antigen vaccination. In a study of a group of cynomolgus monkeys from China, Westbrook et al.[24] used the SMRT-CCS method to characterize 60 new full-length MHC class I transcript sequences. In a study of the extent and type of MHC polymorphisms and haplotypes in the Philippine macaque population, 127 unrelated animals were genotyped and 112 different alleles were identified[25]. At the same time, to meet the needs of MHC research and enable analysis with bioinformatics tools, the International Society for Animal Genetics (ISAG) standardized the nomenclature and established the IPD-MHC database for the scientific management of current and future MHC genes and allele sequences from non-human organisms[26, 27].
Figure 1. Only the lower vertebrates have an MHC, and its organization may vary greatly. This idealized phylogenetic tree shows the relationship of some organisms; on the right is an idealized representation of some genes in the MHC region (for some organisms, equivalent MHC paralogs). A solid horizontal line indicates a continuous sequence, and a question mark indicates an indeterminate link. The horizontal lines indicate MHC class I and II genes, and the dashed vertical lines indicate antigen processing and peptide loading genes. Data from reference [28].
In the early stages, understanding of the discovery, gene composition, and function of the MHC was based on mouse experiments. Faced with the large amount of data now available, and with the development of machine learning, there is still a need for efficient, highly accurate models built with existing machine learning algorithms. We expect a deeper understanding of the function and mechanism of the MHC, using bioinformatics to meet development needs. Therefore, this paper proposes an extreme learning machine method to identify MHC proteins and their types. In this paper, SVMProt 188D[29, 30], which extracts features from the physicochemical properties of amino acids and from amino acid frequencies; bag-of-ngrams (BonG), which uses an ngram vector instead
of the original single-word vector; and information theory (IT), which uses information-theoretic measures, are combined with the ELM to construct the classifier. Classification is evaluated by ten-fold cross-validation and an independent data set.
2 Methods
The frame diagram of the MHC classifier in this paper is shown in Figure 2. After the classifier is constructed, MHC proteins are identified, and the classifier is then used to identify the specific categories of MHC: MHC I and II. In this section, the paper introduces the data set, the feature representation methods, and the classifiers in detail.
[Figure 2 flow chart. Training phase: training dataset → feature representation (188D, BonG, IT) → MRMR algorithm for feature selection → feature matrix → ELM training model. Prediction phase: query sequences → the same feature representation and MRMR feature selection → trained model → prediction: MHC protein or not.]
Figure 2. The framework of the ELM-MHC classifier. ELM-MHC mainly includes two parts, training and testing. A hybrid feature representation combining BonG, 188D, and IT is applied to the data set. For the mixed feature matrix, the MRMR method is used for feature selection and dimensionality
reduction to obtain the final feature matrix. The training model is then constructed by applying the extreme learning machine classification method to the data set. The same feature extraction and dimensionality reduction are performed on the query sequences.
2.1. Dataset
Based on the biological function and location of the protein molecules, MHC proteins fall into two types: MHC I and MHC II. This article constructs a new data set, whose build process is described below; the data set can be downloaded from the ELM-MHC server. The first step is to search for MHC sequences in the UniProt database. For the identification of MHC, the positive set consists of a mixture of MHC I and MHC II sequences. Using "MHC I" and "MHC II" as search terms, matching sequences were retrieved, the corresponding protein sequences were selected, and sequence files in FASTA format were downloaded to generate the MHC sequence file. Negative examples were then obtained from the Pfam family database. The resulting FASTA files were made non-redundant to obtain the final data set. In this paper, we use CD-HIT for the de-duplication processing. Its working principle is to cluster all sequences according to the parameter settings, output the longest sequence in each cluster as the representative sequence, and list the sequence names under each cluster for similarity analysis. Attention must be paid to setting the threshold (the default similarity is 0.9). After the above steps, the paper obtained a data set, named DMHC, of 6712 MHC protein sequences (denoted Smhc) and 6776 non-MHC protein sequences (denoted Snon-mhc). To predict the two types of MHC protein, Smhc is divided into two parts: the first part, which contains 4370 MHC protein sequences, is used for training, and the
second part, which contains 2342 MHC protein sequences, is used for independent testing. For the identification of MHC I and MHC II, we chose the MHC I sequences as positive examples and the MHC II sequences as negative examples. In Smhc, the numbers of MHC I and II sequences are 3350 and 3362, respectively, and this subset is used as the data set for identifying MHC I and MHC II.
2.2. Feature Extraction
When using machine learning methods to identify protein types, feature extraction is very important[31-38]. A multi-feature hybrid representation method is adopted in this paper, combining three representation methods: the SVMProt 188D feature extraction method, the bag-of-ngrams (BonG) feature extraction method, and an information-theory-based method (IT).
2.2.1
SVMProt 188D features
When predicting the type of a protein, its unique physicochemical properties are usually considered. Compositional characteristics are also widely applied in protein recognition, so we consider whether combining the two can yield better predictions. Dubchak first used a hybrid representation fusing two kinds of features and verified its effectiveness, achieving good results in experiments predicting protein folding modes[39]. Later, more feature fusion methods emerged[38, 40-48]. The 188D representation likewise incorporates both the compositional and the physicochemical properties of amino acids. The characteristics of each dimension of the 188D vector are described in detail below. The first 20 dimensions are the content of each amino acid (in alphabetical order "ACDEFGHIKLMNPQRSTVWY") in the sequence[49]. Dimensions 21-41 are the
hydrophobicity features of the amino acids, covering the hydrophilic, neutral, and hydrophobic classes[50]. Among them, dimensions 21-23 are the content of hydrophilic ("RKEDQN"), neutral ("GASTPHY"), and hydrophobic ("CVLIMFW") amino acids; dimensions 24-26 are the transition frequencies among these three classes; dimensions 27-31 are the positions (first, 25%, 50%, 75%, and last) of the hydrophilic amino acids in the sequence; dimensions 32-36 are the corresponding positions of the neutral amino acids; and dimensions 37-41 are the corresponding positions of the hydrophobic amino acids. Dimensions 42-62 describe the van der Waals volume property, dimensions 63-83 the amino acid polarity, dimensions 84-104 the polarizability, dimensions 105-125 the charge, dimensions 126-146 the surface tension, dimensions 147-167 the secondary structure, and dimensions 168-188 the solvent accessibility. Altogether, this yields a 188-dimensional feature vector. Figure 3 illustrates the structure of the 188-D feature.
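As a minimal sketch of how the first dimensions of the 188-D vector are computed (the function names and the toy sequence are ours, not the authors'):

```python
# Amino acids in alphabetical order, as in the 188-D description.
AA = "ACDEFGHIKLMNPQRSTVWY"

# The three hydrophobicity classes used for dimensions 21-23.
GROUPS = [set("RKEDQN"), set("GASTPHY"), set("CVLIMFW")]

def composition_20d(seq):
    """Dimensions 1-20: fraction of each amino acid in the sequence."""
    n = len(seq)
    return [seq.count(a) / n for a in AA]

def group_composition_3d(seq):
    """Dimensions 21-23: fractions of hydrophilic, neutral, hydrophobic residues."""
    n = len(seq)
    return [sum(c in g for c in seq) / n for g in GROUPS]

seq = "MKTAYIAKQRQISFVK"   # hypothetical toy sequence
f20 = composition_20d(seq)
f3 = group_composition_3d(seq)
```

The remaining dimensions (transition frequencies, distribution positions, and the other six physicochemical property groups) follow the same composition/transition/distribution pattern over different amino-acid groupings.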
Figure 3. Structure of the 188-D feature: the number of dimensions devoted to amino acid composition, hydrophobicity, normalized van der Waals volume, polarity, polarizability, charge, surface tension, secondary structure, and solvent accessibility.
2.2.2 Bag-of-ngram feature representation
In information retrieval, the bag-of-words (BoW) model assumes that the occurrences of words in a text are independent: word order and grammar are ignored, and the text is treated only as a collection of words. In the bag-of-words feature, a text document is converted into a vector (a vector here is simply a collection of n numbers) containing the number of occurrences of each word in the vocabulary. Feature selection is also very important for improving classification performance while reducing redundancy[51-53]. Bag-of-N-grams, or bag-of-ngrams (BonG), is a natural extension of BoW[54] that optimizes the model input: the input is no longer a simple word vector but an ngram vector, which draws on the N-gram language model[55]. An n-gram is a sequence of n ordered tokens; a single word is a 1-gram, also known as a unigram. During tokenization, the counting mechanism can count individual words or count overlapping sequences as n-grams. The BonG extension of BoW counts the occurrences of every k-gram in the document for k ≤ n.
2.2.3 Information theory (IT)
Information theory is a branch of probability theory and mathematical statistics. It is used in information processing, information entropy, signal-to-noise ratio, and other related topics, and has gradually been applied to the field of bioinformatics[56-65]. The information-theory feature representation characterizes a protein sequence from three aspects: information entropy, relative entropy, and information gain, whose expressions are shown in equations (1)-(3)[66].
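The BonG counting described in Section 2.2.2 can be sketched as follows, treating a protein sequence as the token string (function name and toy sequence are ours):

```python
from collections import Counter

def bag_of_ngrams(seq, n=2):
    """Count every overlapping k-gram for k = 1..n, as in the BonG model."""
    bag = Counter()
    for k in range(1, n + 1):
        bag.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    return bag

bag = bag_of_ngrams("MKTMKT", n=2)  # counts unigrams and bigrams
```

In practice the counts would be mapped onto a fixed n-gram vocabulary to give each sequence a feature vector of the same length.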
In information theory, entropy is the average amount of information per message and is a measure of uncertainty: the more random the source, the greater its entropy[67]. Information entropy (also known as Shannon entropy, denoted I) reflects the degree of disorder of a system; the more ordered a system, the lower its information entropy, and vice versa. The Shannon information entropy is calculated as shown in Equation (1):

I = − Σ_{i=1}^{20} p_i log2(p_i)    (1)

where p_i is the ratio of the number of occurrences of amino acid i to the length of the sequence. The relative entropy, denoted REn, against a background distribution p_i^0 is given by Equation (2):

REn = Σ_{i=1}^{20} p_i log2(p_i / p_i^0)    (2)

Information gain, a measure of the importance of sample characteristics and denoted G, is given by Equation (3):

G = I − REn    (3)
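Equations (1)-(3) can be sketched as follows; since the paper does not specify the background distribution, this sketch assumes a uniform background p_i^0 = 1/20:

```python
import math
from collections import Counter

def it_features(seq):
    """Information entropy I (Eq. 1), relative entropy REn against a uniform
    background p0 = 1/20 (Eq. 2), and information gain G = I - REn (Eq. 3)."""
    n = len(seq)
    probs = [c / n for c in Counter(seq).values()]
    I = -sum(p * math.log2(p) for p in probs)
    REn = sum(p * math.log2(p / (1 / 20)) for p in probs)
    return I, REn, I - REn

I, REn, G = it_features("ACDE")  # toy sequence with four equiprobable residues
```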
2.3. Classifier
To identify MHC proteins, this paper chose the ELM as the classification algorithm. Since its introduction, the extreme learning machine has been applied to practical problems and has been used successfully in bioinformatics many times[68-76]. The ELM is a single-hidden-layer feedforward neural network that is easy to understand and convenient to use[77, 78].
[Figure 4 shows a single-hidden-layer network: a d-dimensional input x feeds L hidden nodes G(a_i, b_i, x), whose outputs are combined through weights β_1, …, β_L into an m-dimensional output f(x).]
Figure 4. Structure diagram of the extreme learning machine.
For traditional neural networks, we need to set the training parameters of the network, and it is easy to fall into a local optimum when seeking the optimal solution. The ELM, in contrast, simplifies the parameters that need to be set: we only need to set the number of nodes in the hidden layer, and the optimization then yields a unique optimal solution. From Huang Guangbin's original proposal of the ELM algorithm to its widespread use, the ELM has proved to have high learning efficiency, fast model training, and good generalization performance. The frame flow chart of the extreme learning machine is shown in Figure 4. The network output is[77]:

f_L(x) = Σ_{i=1}^{L} β_i G(a_i, b_i, x)    (4)
where G is the activation function (which can be, for example, the sigmoid or the RBF, but is not limited to these), L is the number of hidden-layer neuron nodes, (a_i, b_i) is the parameter vector of the i-th hidden node, and β_i is the output weight vector of the i-th hidden node. The output of the neural network can then be expressed as Equation (5)[79]:

Hβ = T    (5)

where H is the hidden-layer output matrix, β is the output weight matrix composed of the β_i, and T is the desired output:

H = [ g(W_1·X_1 + b_1)  ⋯  g(W_L·X_1 + b_L) ]
    [        ⋮          ⋱          ⋮        ]
    [ g(W_1·X_N + b_1)  ⋯  g(W_L·X_N + b_L) ]  (N × L),

β = [ β_1^T; … ; β_L^T ]  (L × m),    T = [ T_1^T; … ; T_N^T ]  (N × m)    (6)
The general flow of the ELM algorithm is described in Table 1.

Table 1. ELM algorithm.
Input: training sample set {(x_i, t_i)}_{i=1}^{N}, hidden-layer output function G(a_i, b_i, x), and number of hidden-layer nodes L.
a) Randomly generate the hidden-layer node parameters (a_i, b_i), i = 1, …, L.
b) Calculate the hidden-layer output matrix H.
c) Output the optimal network weights β = H†T, where H† is the Moore-Penrose generalized inverse of H.
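The three steps in Table 1 can be sketched in a few lines of NumPy. This is a minimal illustration with a sigmoid hidden layer and a toy data set of our own; it is not the authors' configuration (their best results use a linear kernel):

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_train(X, T, L=40):
    """Table 1: random hidden parameters (a_i, b_i), hidden output matrix H,
    then output weights beta solved via the Moore-Penrose pseudoinverse."""
    d = X.shape[1]
    W = rng.normal(size=(d, L))                # random input weights a_i
    b = rng.normal(size=L)                     # random biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))     # sigmoid hidden layer, N x L
    beta = np.linalg.pinv(H) @ T               # beta = H† T
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta

# Toy binary problem with +1 / -1 targets.
X = rng.normal(size=(200, 5))
T = np.sign(X[:, 0] + 0.1)[:, None]
W, b, beta = elm_train(X, T, L=40)
acc = float(np.mean(np.sign(elm_predict(X, W, b, beta)) == T))
```

Because only beta is learned, and in closed form, training is a single least-squares solve, which is why ELM training is fast.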
In the original ELM algorithm, proposed by Huang G. B. et al. in 2004, a single-hidden-layer network is trained with either additive hidden nodes or radial basis hidden nodes[79]. Drawing on the learning principle of support vector machines, Huang G. B. et al. proposed the ELM Kernel in 2010[80]. The ELM algorithm constructed by this method has fewer constraints and performs
better than the Extreme SVMs proposed by Liu Q.[81]. Therefore, this paper selects the kernel functions commonly used by support vector machines, the RBF and linear kernels, and compares the classification performance of the ELM Kernel under the different kernel functions.
2.4. The ELM-MHC Online Identification Server
With the popularity of machine learning methods, bioinformatics has developed greatly[82-88]. Machine learning prediction methods and the development of online servers have high practical value[89-92]. Therefore, this paper developed the ELM-MHC online prediction server to identify MHC proteins; its access link is http://server.malab.cn/ELM-MHC/. On the website, users can provide protein sequences or protein sequence files in FASTA format. ELM-MHC performs feature extraction on the submitted sequences and gives the probability that each is MHC or non-MHC. When a sequence is predicted to be MHC, a further prediction determines whether it is MHC I or MHC II. DMHC and the related running files can be downloaded from the server.
3 Results
In this section, the paper gives a detailed introduction to the evaluation indicators, the recognition results for MHC, and the recognition results for MHC I and MHC II.
3.1. Measurement
To evaluate the prediction accuracy of ELM-MHC, this paper uses four indicators: sensitivity (SE), specificity (SP), accuracy (ACC), and the Matthews correlation coefficient (MCC), whose expressions are given in equations (7)-(10)[93-104]:

SE = TP / (TP + FN)    (7)

SP = TN / (TN + FP)    (8)

ACC = (TP + TN) / (TP + TN + FP + FN)    (9)

MCC = (TP × TN − FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))    (10)
Here, TP and TN denote the numbers of positive and negative sequences, respectively, that are predicted correctly, while FP and FN denote the numbers of negative and positive sequences, respectively, that are predicted incorrectly. SE is the fraction of true MHC sequences that are correctly identified (TP out of TP + FN), and SP is the fraction of non-MHC sequences that are correctly identified (TN out of TN + FP). ACC is the overall prediction accuracy, and an MCC of 1 indicates a perfect classifier. In this paper, we choose the extreme learning machine as the classification algorithm, use the mixed feature representation of SVMProt 188D, BonG, and information theory, and validate with both an independent test set and ten-fold cross-validation, first identifying MHC proteins and then classifying them into MHC I and MHC II. In the independent test, 80% of the data set is used as the training set and 20% as the test set. In ten-fold cross-validation, the data are divided into ten parts, ten classification results are obtained, and the reported accuracy is the average of the ten results. First, we build a classifier that recognizes MHC protein sequences, and then identify MHC class I and II for the sequences identified as MHC.
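The four measures in equations (7)-(10) can be computed directly from the confusion-matrix counts; a small sketch with hypothetical counts:

```python
import math

def metrics(tp, fn, tn, fp):
    """SE, SP, ACC and MCC from confusion-matrix counts, per Eqs. (7)-(10)."""
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return se, sp, acc, mcc

# Hypothetical counts for illustration only.
se, sp, acc, mcc = metrics(tp=90, fn=10, tn=80, fp=20)
```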
3.2 Two-Class Predictor for Identifying MHC
To classify MHC, the data sets used in this section are Smhc and Snon-mhc. The contrast experiments in this section consist of two parts: different feature representation methods and different classification algorithms.
3.2.1 Performance of Different Features
In this section, the paper uses the mixed feature representation of SVMProt 188D, BonG, and information theory. On the basis of the ELM classification algorithm, the different feature representation methods are applied in turn, and their classification results are compared through ten-fold cross-validation and independent test set validation. The compared representations are the different combinations of 188D, BonG, and IT, and each of the three as a single representation. We also select different activation functions, Lin_kernel and RBF_kernel, to verify the classification effect of the ELM.
Figure 5. Comparison of the accuracy of different feature representation methods under the RBF_kernel and Lin_kernel activation functions.
According to Figure 5, the effects of 188D, BonG, IT, and the mixed representation methods can be compared. The hybrid of all three achieves the best effect: when the activation function is linear, the classification accuracy is the highest, 91.66%, higher than that of the other feature representation methods. When the activation function is RBF_kernel, the accuracy of 188D is the lowest, 67.11%. Among the three single feature representations, the classification effect of 188D is relatively better than the other two, which also shows that analyzing the physicochemical properties of amino acids benefits protein recognition. On the other hand, the figure shows that the choice of ELM activation function has a significant impact on classification performance: the trend of the lines makes clear that the linear activation function (Lin_kernel) is significantly better than RBF_kernel. Therefore, we choose Lin_kernel as the activation function for the independent test set verification of the classifier.

Table 2. Experimental results for different feature representation methods.
Method        | ACC (%) | MCC   | SE    | SP
BonG+188D+IT  | 93.1712 | 0.864 | 0.923 | 0.942
188-D+BonG    | 89.58   | 0.787 | 0.871 | 0.916
188-D+IT      | 89.4713 | 0.789 | 0.897 | 0.892
IT            | 83.421  | 0.672 | 0.882 | 0.787
188-D         | 91.231  | 0.829 | 0.862 | 0.963
BonG          | 74.57   | 0.540 | 0.535 | 0.955
Table 3. Experimental results for different classifiers.
Classifier          | ACC (%) | MCC    | SE     | SP
ELM                 | 91.66   | 0.822  | 0.893  | 0.908
Random Forest       | 85.48   | 0.7140 | 0.9087 | 0.8015
Naive Bayes         | 80.86   | 0.6172 | 0.8156 | 0.8016
SGD                 | 82.08   | 0.6417 | 0.8266 | 0.8151
Nearest Neighbors   | 84.61   | 0.6963 | 0.7895 | 0.9021
Decision Tree       | 79.99   | 0.6963 | 0.7975 | 0.8022
LinearSVC           | 87.14   | 0.7428 | 0.8641 | 0.84785
Logistic Regression | 89.52   | 0.7911 | 0.8737 | 0.9166
LibSVM              | 73.28   | 0.5116 | 0.9380 | 0.5294
ExtraTrees          | 84.20   | 0.6894 | 0.9033 | 0.813
Bagging             | 86.32   | 0.7270 | 0.8823 | 0.8443
AdaBoost            | 87.03   | 0.7406 | 0.8683 | 0.8723
GradientBoosting    | 90.98   | 0.8200 | 0.8921 | 0.9272
3.2.2 Comparison with Other Classifiers
To verify the validity of the classification algorithm, its results are compared with those of other classification algorithms. Based on the comparison results of the previous experiment, we select the hybrid feature representation of 188D, BonG, and IT, which has the highest classification accuracy. The classification effect of the ELM is compared with classifiers such as random forest (RF)[105-109], support vector machines (LibSVM)[110-113], and ensemble classification algorithms (AdaBoost). The ten-fold cross-validation results are given in Table 3: when Lin_kernel is selected as the activation function, the accuracy of the ELM reaches a maximum of 91.66%. The comparison with the other classification algorithms confirms the validity of the extreme learning machine classification method selected in this paper.
3.3 Two-Class Predictor for Identifying MHC I and MHC II
MHC I and MHC II are the two classic classes of MHC protein. The identification of MHC I and MHC II was performed using the constructed classifier, and the experimental results were analyzed. In Smhc, the numbers
of MHC I and II sequences are 3350 and 3362, respectively. We use this subset as the data set to identify MHC I and MHC II.
3.3.1 Performance of Different Features
Figure 6. Comparison of the accuracy of different feature representation methods.
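The BonG descriptor compared above is, in essence, a bag-of-n-grams over the amino-acid alphabet; the following is a generic sketch of such a descriptor (the paper's exact BonG construction, choice of n, and normalization may differ):

```python
from collections import Counter
from itertools import product

def bong_features(seq, n=2, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Bag-of-n-grams: normalized counts of every length-n window over the 20 amino acids."""
    grams = ["".join(p) for p in product(alphabet, repeat=n)]      # all 400 dipeptides for n=2
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = max(len(seq) - n + 1, 1)                               # number of windows
    return [counts[g] / total for g in grams]

vec = bong_features("ACDAC", n=2)  # windows: AC, CD, DA, AC
```

The resulting fixed-length vector (400 dimensions for dipeptides) can be concatenated with the 188D and IT descriptors to form the hybrid representation.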
In this section, we use the mixed feature representation combining SVMProt 188D, BonG, and information theory. With the ELM classification algorithm fixed, different feature representation methods are applied in turn, and the classification results are compared through ten-fold cross-validation and independent set validation. The compared representations are the different combinations of 188D, BonG, and IT, as well as each single feature representation. As before, we compare the classification performance of the two activation functions, Lin_kernel and RBF_kernel, under ten-fold cross-validation. The ten-fold cross-validation results are shown in Figure 6, and the independent test set results are shown in Table 4.
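The evaluation protocol described above — train on each feature combination and score by ten-fold cross-validation — can be sketched as follows. The feature matrices here are synthetic stand-ins, not real 188D/BonG/IT descriptors, and a scikit-learn logistic regression stands in for the ELM:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins for the three descriptor blocks; the real pipeline
# would compute 188D, BonG, and IT features from the protein sequences.
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, n)
f188 = y[:, None] + rng.normal(0, 1.0, (n, 20))   # informative block
fbong = y[:, None] + rng.normal(0, 2.0, (n, 15))  # weaker block
fit = rng.normal(0, 1.0, (n, 5))                  # noise-only block

feature_sets = {
    "188D": f188,
    "188D+BonG": np.hstack([f188, fbong]),
    "188D+BonG+IT": np.hstack([f188, fbong, fit]),
}
# Mean accuracy over 10 stratified folds for each feature combination
scores = {name: cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10).mean()
          for name, X in feature_sets.items()}
```

The same loop, with the ELM and the real descriptors substituted in, produces the comparison reported in Figure 6.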
From the comparison in Figure 6, the extreme learning machine performs best when Lin_kernel is selected as the activation function. Moreover, the hybrid representation of the three features again achieves the best result, with a classification accuracy of 94.42%. For the independent test set verification, we selected Lin_kernel as the activation function; Table 4 shows the results.

Table 4. Experimental results for different feature representation methods.

Method         ACC (%)   MCC     SE      SP
BonG+188D+IT   97.432    0.949   0.990   0.859
188-D+BonG     94.10     0.888   0.930   0.958
188-D+IT       92.02     0.829   0.862   0.963
IT             76.48     0.530   0.944   0.620
188-D          93.171    0.864   0.923   0.940
BonG           82.4943   0.664   0.823   0.827
According to Table 4, the mixed representation of the three features achieved the best results: the classification accuracy is 97.432%, and good results were also obtained on the other indicators, such as SE and SP. This again demonstrates the validity of the hybrid feature representation of 188D, BonG, and IT. Verified by both ten-fold cross-validation and the independent set, the accuracy of the selected representation method is well established, and MHC I and MHC II can be recognized reliably.

3.3.2 Comparison with Other Classifiers
To verify the accuracy of the classifier, the experimental results were compared with those of other classifiers. Based on the comparison in the previous experiment, we selected the hybrid feature representation of 188D,
BonG and IT with the highest classification accuracy. The classification performance of ELM was compared with that of random forest (RF), support vector machines (LibSVM), and ensemble classification algorithms (AdaBoost), among others. Table 5 reports the ten-fold cross-validation results. When Lin_kernel is selected as the activation function, the accuracy of ELM reaches a maximum of 94.442%. Compared with the other classification algorithms, the validity of the extreme learning machine is again confirmed.

Table 5. Experimental results for different classifiers.

Classifier            ACC (%)   MCC      SE       SP
ELM                   94.442    0.822    0.893    0.908
Random Forest         88.52     0.7722   0.9206   0.8495
Naive Bayes           82.60     0.6520   0.8333   0.8186
SGD                   82.12     0.6424   0.8268   0.8156
Nearest Neighbors     84.61     0.6963   0.7895   0.9021
Decision Tree         88.22     0.6963   0.7975   0.8022
LinearSVC             87.14     0.7656   0.8561   0.9086
Logistic Regression   90.31     0.8067   0.8859   0.9205
LibSVM                88.83     0.7820   0.8308   0.9464
ExtraTrees            85.46     0.7128   0.9063   0.8023
Bagging               89.77     0.7966   0.256    0.8695
AdaBoost              90.19     0.8038   0.9004   0.9034
GradientBoosting      92.80     0.8562   0.9357   0.9357
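The four indicators reported in Tables 3-5 (ACC, MCC, SE, SP) follow the standard confusion-matrix definitions; a small helper makes them explicit (the counts in the example are illustrative only):

```python
import math

def metrics(tp, tn, fp, fn):
    """ACC, MCC, SE (sensitivity), SP (specificity) from a 2x2 confusion matrix."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    se = tp / (tp + fn)                      # true-positive rate
    sp = tn / (tn + fp)                      # true-negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, mcc, se, sp

acc, mcc, se, sp = metrics(tp=45, tn=40, fp=10, fn=5)
# ACC = 0.85, SE = 0.90, SP = 0.80, MCC ≈ 0.70
```

MCC is the most conservative of the four, since it degrades whenever either class is predicted poorly, which is why it separates the classifiers in Table 5 more sharply than ACC does.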
4 Conclusions
In this paper, we designed a new classifier, ELM-MHC. We used a combination of SVMProt 188-D, bag-of-ngrams (BonG), and information theory (IT) to represent features, together with an extreme learning machine classifier. The constructed classifiers were used to identify MHC proteins and to distinguish the specific categories MHC I and MHC II. To
verify the effect of ELM-MHC, we performed ten-fold cross-validation and independent test set verification, and compared the results with other feature representations and classifiers. For identifying MHC and for distinguishing MHC I from MHC II, ten-fold cross-validation achieved accuracies of 91.66% and 94.442%, respectively; the accuracies on the independent test sets were 93.1712% and 97.432%, respectively. This shows that ELM-MHC predicts better than the other feature extraction algorithms and classifiers considered, and confirms that the ELM with the hybrid feature representation is effective for MHC recognition. An online server makes the predictor openly available; ELM-MHC can be accessed at http://server.malab.cn/ELM-MHC/. In future work, high-accuracy classifiers could be combined into an integrated classifier to further optimize the prediction performance of ELM-MHC, and the server's parallel processing capability could be improved. Moreover, the use of computational intelligence platforms [114-117] and cloud computing [118-121] may improve the performance of the
classification process.

Acknowledgement: The work was supported by the National Key R&D Program of China (2018YFC0910405), the Natural Science Foundation of China (Nos. 61771331 and 61300098), and the Fundamental Research Funds for the Central Universities (No. 2572017CB33).

References
1. Kaufman, J., Unfinished Business: Evolution of the MHC and the Adaptive Immune System of Jawed Vertebrates. Annual Review of Immunology, 2018. 36, 383.
2. Monzón-Casanova, E., et al., The Forgotten: Identification and Functional Characterization of MHC Class II Molecules H2-Eb2 and RT1-Db2. Journal of Immunology, 2016. 196, 988.
3. Trowsdale, J. and J.C. Knight, Major histocompatibility complex genomics and human disease. Annual Review of Genomics & Human Genetics, 2013. 14, 301-323.
4. Rock, K.L., E. Reits, and J. Neefjes, Present Yourself! By MHC Class I and MHC Class II Molecules. Trends in Immunology, 2016. 37, 724-737.
5. Comber, J.D., et al., MHC Class I Presented T Cell Epitopes as Potential Antigens for Therapeutic Vaccine against HBV Chronic Infection. Hepatitis Research & Treatment, 2014. 2014, 423-431.
6. Nakayama, M., Antigen Presentation by MHC-Dressed Cells. Frontiers in Immunology, 2014. 5, 672.
7. Giguère, S., et al., MHC-NP: predicting peptides naturally processed by the MHC. Journal of Immunological Methods, 2013. 400-401, 30-36.
8. Han, Y. and D. Kim, Deep convolutional neural networks for pan-specific peptide-MHC class I binding prediction. BMC Bioinformatics, 2017. 18, 585.
9. Xu, Y., et al., Deep learning of the splicing (epi)genetic code reveals a novel candidate mechanism linking histone modifications to ESC fate decision. Nucleic Acids Research, 2017. 45, 12100-12112.
10. H, N., et al., Hidden Markov Model-Based Prediction of Antigenic Peptides That Interact with MHC Class II Molecules. Journal of Bioscience & Bioengineering, 2002. 94, 264-270.
11. Yu, L., et al., Prediction of Novel Drugs for Hepatocellular Carcinoma Based on Multi-Source Random Walk. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2017. 14, 966-977.
12. Andreatta, M., et al., Accurate pan-specific prediction of peptide-MHC class II binding affinity with improved binding core identification. Immunogenetics, 2015. 67, 641-650.
13. Kuksa, P.P., et al., High-order neural networks and kernel methods for peptide-MHC binding prediction. Bioinformatics, 2015. 31, 3600-3607.
14. Bui, H.H., et al., Automated generation and evaluation of specific MHC binding predictive tools: ARB matrix applications. Immunogenetics, 2005. 57, 304-314.
15. Doytchinova, I.A. and D.R. Flower, Towards the in silico identification of class II restricted T-cell epitopes: a partial least squares iterative self-consistent algorithm for affinity prediction. Bioinformatics, 2003. 19, 2263.
16. Kaufman, J.F., et al., Xenopus MHC class II molecules. I. Identification and structural characterization. The Journal of Immunology, 1985. 134, 3248.
17. Malmstrøm, M., et al., Evolution of the immune system influences speciation rates in teleost fishes. Nature Genetics, 2016. 48, 1204.
18. Edholm, E.S., M. Banach, and J. Robert, Evolution of innate-like T cells and their selection by MHC class I-like molecules. Immunogenetics, 2016, 525-536.
19. Hearn, C., et al., An MHC class I immune evasion gene of Marek's disease virus. Virology, 2015. 475, 88-95.
20. Li, Y.H., et al., Therapeutic target database update 2018: enriched resource for facilitating bench-to-clinic research of targeted therapeutics. Nucleic Acids Research, 2018. 46, D1121-D1127.
21. Li, B., et al., NOREVA: normalization and evaluation of MS-based metabolomics data. Nucleic Acids Research, 2017. 45, W162-W170.
22. Yu, L., et al., Inferring drug-disease associations based on known protein complexes. BMC Medical Genomics, 2015. 8, 13.
23. Hopkins, J., B.M. Dutia, and I. McConnell, Monoclonal antibodies to sheep lymphocytes. I. Identification of MHC class II molecules on lymphoid tissue and changes in the level of class II expression on lymph-borne cells following antigen stimulation in vivo. Immunology, 1986. 59, 433.
24. Westbrook, C.J., et al., No assembly required: Full-length MHC class I allele discovery by PacBio circular consensus sequencing. Human Immunology, 2015. 76, 891-896.
25. Shiina, T., et al., Discovery of novel MHC-class I alleles and haplotypes in Filipino cynomolgus macaques (Macaca fascicularis) by pyrosequencing and Sanger sequencing: Mafa-class I polymorphism. Immunogenetics, 2015. 67, 563-578.
26. Maccari, G., et al., IPD-MHC 2.0: an improved inter-species database for the study of the major histocompatibility complex. Nucleic Acids Research, 2017. 45, D860-D864.
27. Maccari, G., et al., IPD-MHC: nomenclature requirements for the non-human major histocompatibility complex in the next-generation sequencing era. Immunogenetics, 2018. 70, 619-623.
28. Dijkstra, J.M., et al., Comprehensive analysis of MHC class II genes in teleost fish genomes reveals dispensability of the peptide-loading DM system in a large part of vertebrates. BMC Evolutionary Biology, 2013. 13, 260.
29. Wei, L., et al., Improved prediction of protein-protein interactions using novel negative samples, features, and an ensemble classifier. Artificial Intelligence in Medicine, 2017. 83, 67-74.
30. Wei, L., et al., A novel hierarchical selective ensemble classifier with bioinformatics application. Artificial Intelligence in Medicine, 2017. 83, 82-90.
31. Niu, M., et al., RFAmyloid: A Web Server for Predicting Amyloid Proteins. International Journal of Molecular Sciences, 2018. 19, 2071.
32. Yang, H., et al., iRNA-2OM: A Sequence-Based Predictor for Identifying 2′-O-Methylation Sites in Homo sapiens. Journal of Computational Biology, 2018. 25, 1266-1277.
33. Liu, B., BioSeq-Analysis: a platform for DNA, RNA, and protein sequence analysis based on machine learning approaches. Briefings in Bioinformatics, 2018. DOI: 10.1093/bib/bbx165.
34. Chen, J., et al., A comprehensive review and comparison of different computational methods for protein remote homology detection. Briefings in Bioinformatics, 2018. 9, 231-244.
35. Zou, Q., et al., Pretata: predicting TATA binding proteins with novel features and dimensionality reduction strategy. BMC Systems Biology, 2016. 10, 114.
36. Xu, L., et al., A Novel Hybrid Sequence-Based Model for Identifying Anticancer Peptides. Genes, 2018. 9, 158.
37. Xu, Y., et al., Identify bilayer modules via pseudo-3D clustering: applications to miRNA-gene bilayer networks. Nucleic Acids Research, 2016. 44, e152.
38. Zhu, P.F., et al., Combining neighborhood separable subspaces for classification via sparsity regularized optimization. Information Sciences, 2016. 370, 270-287.
39. Dubchak, I., et al., Prediction of protein folding class using global description of amino acid sequence. Proceedings of the National Academy of Sciences of the United States of America, 1995. 92, 8700-8704.
40. Li, Y.H., et al., SVM-Prot 2016: A Web-Server for Machine Learning Prediction of Protein Functional Families from Sequence Irrespective of Similarity. PLoS One, 2016. 11, e0155290.
41. Chen, L., et al., Hierarchical Classification of Protein Folds Using a Novel Ensemble Classifier. PLoS One, 2013. 8, e56499.
42. Zou, Q., et al., An Approach for Identifying Cytokines Based on a Novel Ensemble Classifier. BioMed Research International, 2013. 2013, 686090.
43. Liu, B., H. Wu, and K.C. Chou, Pse-in-One 2.0: An Improved Package of Web Servers for Generating Various Modes of Pseudo Components of DNA, RNA, and Protein Sequences. Natural Science, 2017. 09, 67-91.
44. Wang, G., et al., BinMemPredict: a Web Server and Software for Predicting Membrane Protein Types. Current Proteomics, 2013. 10, 1-2.
45. Chen, J., et al., ProtDec-LTR2.0: An improved method for protein remote homology detection by combining pseudo protein and supervised Learning to Rank. Bioinformatics, 2017. 33, 3473-3476.
46. Wan, S., Y. Duan, and Q. Zou, HPSLPred: An Ensemble Multi-label Classifier for Human Protein Subcellular Location Prediction with Imbalanced Source. Proteomics, 2017. 17, 1700262.
47. Zhu, P.F., et al., Multi-view label embedding. Pattern Recognition, 2018. 84, 126-135.
48. Yu, L., J. Zhao, and L. Gao, Predicting Potential Drugs for Breast Cancer based on miRNA and Tissue Specificity. International Journal of Biological Sciences, 2018. 14, 971-980.
49. Zhang, J. and L. Kurgan, Review and comparative assessment of sequence-based predictors of protein-binding residues. Briefings in Bioinformatics, 2017.
50. Zhang, J., Z. Ma, and L. Kurgan, Comprehensive review and empirical analysis of hallmarks of DNA-, RNA- and protein-binding residues in protein chains. Briefings in Bioinformatics, 2017.
51. Zhu, P.F., et al., Co-regularized unsupervised feature selection. Neurocomputing, 2018. 275, 2855-2863.
52. Zhu, P.F., et al., Multi-label feature selection with missing labels. Pattern Recognition, 2018. 74, 488-502.
53. Zhu, P.F., et al., Subspace clustering guided unsupervised feature selection. Pattern Recognition, 2017. 66, 364-374.
54. Cummins, N., et al., Multimodal Bag-of-Words for Cross Domains Sentiment Analysis. in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). 2018.
55. Liu, B., et al., Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Research, 2015. 43, W65-W71.
56. Qu, K., et al., Identification of DNA-Binding Proteins Using Mixed Feature Representation Methods. Molecules, 2017. 22, 1602.
57. Wei, L., et al., ACPred-FL: a sequence-based predictor using effective feature representation to improve the prediction of anti-cancer peptides. Bioinformatics, 2018. 34, 4007-4016.
58. Wei, L., et al., PhosPred-RF: a novel sequence-based predictor for phosphorylation sites using sequential information only. IEEE Transactions on NanoBioscience, 2017. 16, 240-247.
59. Wei, L., et al., Fast prediction of protein methylation sites using a sequence-based feature selection technique. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2017.
60. Liu, B. and S. Li, ProtDet-CCH: Protein remote homology detection by combining Long Short-Term Memory and ranking methods. IEEE/ACM Transactions on Computational Biology and Bioinformatics. DOI: 10.1109/TCBB.2018.2789880.
61. Deng, L., et al., PredHS: a web server for predicting protein-protein interaction hot spots by using structural neighborhood properties. Nucleic Acids Research, 2014. 42, W290-W295.
62. Peng, J.J., et al., A novel method to measure the semantic similarity of HPO terms. International Journal of Data Mining and Bioinformatics, 2017. 17, 173-188.
63. Hu, Y., et al., Identifying diseases-related metabolites using random walk. BMC Bioinformatics, 2018. 19, 116.
64. Cheng, L., et al., InfAcrOnt: calculating cross-ontology term similarities using information flow by a random walk. BMC Genomics, 2018. 19, 919.
65. Xu, Y., et al., A novel insight into Gene Ontology semantic similarity. Genomics, 2013. 101, 368-375.
66. Wei, L., et al., Enhanced protein fold prediction method through a novel feature extraction technique. IEEE Transactions on NanoBioscience, 2015. 14, 649-659.
67. Wei, L., et al., An Improved Protein Structural Classes Prediction Method by Incorporating Both Sequence and Structure Information. IEEE Transactions on NanoBioscience, 2015. 14, 339-349.
68. Cao, J., et al., Voting based extreme learning machine. Information Sciences, 2012. 185, 66-77.
69.
Cao, J. and L. Xiong, Protein Sequence Classification with Improved Extreme Learning Machine Algorithms. BioMed Research International, 2014. 2014, 12.
70.
Wang, D. and G.B. Huang. Protein sequence classification using extreme learning machine. in IEEE International Joint Conference on Neural Networks,
2005. IJCNN '05. Proceedings. 2005. 71.
Shen, Y., J. Tang, and F. Guo, Identification of protein subcellular localization via integrating evolutionary and physicochemical information into Chou’s general PseAAC. Journal of Theoretical Biology, 2019. 462, 230-239.
72.
Limin Jiang, Y.X., Yijie Ding, Jijun Tang, Fei Guo, FKL-Spa-LapRLS: an accurate method for identifying human microRNA-disease association. BMC Genomics, 2019. 19.
73.
Ding, Y., J. Tang, and F. Guo, Identification of drug-side effect association via multiple
information
integration
with
centered
kernel
alignment.
Neurocomputing, 2019. 325, 211-224. 74.
Song, J., J. Tang, and F. Guo, Identification of Inhibitors of MMPS Enzymes via a Novel Computational Approach. International Journal of Biological Sciences, 2018. 14, 863-871.
75.
Pan, G., et al., A Novel Computational Method for Detecting DNA Methylation Sites with DNA Sequence Information and Physicochemical Properties. International Journal of Molecular Sciences, 2018. 19, 511. 30
76. Jiang, L., et al., MDA-SKF: Similarity Kernel Fusion for Accurately Discovering miRNA-Disease Association. Frontiers in Genetics, 2018. 9.
77. Huang, G., et al., Semi-supervised and unsupervised extreme learning machines. IEEE Transactions on Cybernetics, 2014. 44, 2405.
78. Huang, G.B., Q.Y. Zhu, and C.K. Siew, Extreme learning machine: Theory and applications. Neurocomputing, 2006. 70, 489-501.
79. Huang, G.B., Q.Y. Zhu, and C.K. Siew, Extreme learning machine: a new learning scheme of feedforward neural networks. in Proceedings of the 2004 IEEE International Joint Conference on Neural Networks. 2004.
80. Huang, G.B., X. Ding, and H. Zhou, Optimization method based extreme learning machine for classification. Neurocomputing, 2010. 74, 155-163.
81. Frénay, B. and M. Verleysen, Using SVMs with randomised feature spaces: an extreme learning approach. in ESANN. 2010.
82. Zeng, X.X., et al., Prediction of potential disease-associated microRNAs using structural perturbation method. Bioinformatics, 2018. 34, 2425-2432.
83. Zeng, X., et al., Probability-based collaborative filtering model for predicting gene-disease associations. BMC Medical Genomics, 2017. 10, 76.
84. Zou, Q., et al., Similarity computation strategies in the microRNA-disease network: a survey. Briefings in Functional Genomics, 2016. 15, 55-64.
85. Zou, Q., et al., Reconstructing evolutionary trees in parallel for massive sequences. BMC Systems Biology, 2017. 11, 15-21.
86. Wang, X., et al., A Classification Method for Microarrays Based on Diversity. Current Bioinformatics, 2016. 11, 590-597.
87. Zou, Q., et al., Machine learning and graph analytics in computational biomedicine. Artificial Intelligence in Medicine, 2017. 83, 1.
88. Xuan, Z., et al., Meta-path methods for prioritizing candidate disease miRNAs. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2017.
89. Zeng, X., X. Zhang, and Q. Zou, Integrative approaches for predicting microRNA function and prioritizing disease-related microRNA using biological interaction networks. Briefings in Bioinformatics, 2016. 17, 193-203.
90. Liu, Y., et al., Inferring MicroRNA-Disease Associations by Random Walk on a Heterogeneous Network with Multiple Data Sources. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2017. 14, 905-915.
91. Xu, Y., et al., System-level insights into the cellular interactome of a non-model organism: inferring, modelling and analysing functional gene network of soybean (Glycine max). PLoS One, 2014. 9, e113907.
92. Xu, Y., et al., SoyFN: a knowledge database of soybean functional networks. Database, 2014. 2014.
93. Wei, L., et al., Prediction of human protein subcellular localization using deep learning. Journal of Parallel and Distributed Computing, 2018. 117, 212-217.
94. Wei, L., J. Tang, and Q. Zou, Local-DPP: An Improved DNA-binding Protein Prediction Method by Exploring Local Evolutionary Information. Information Sciences, 2017. 384, 135-144.
95. Yang, H., et al., iRSpot-Pse6NC: Identifying recombination spots in Saccharomyces cerevisiae by incorporating hexamer composition into general PseKNC. International Journal of Biological Sciences, 2018. 14, 883-891.
96. Tang, H., et al., HBPred: a tool to identify growth hormone-binding proteins. International Journal of Biological Sciences, 2018. 14, 957-964.
97. Su, Z.D., et al., iLoc-lncRNA: predict the subcellular location of lncRNAs by incorporating octamer composition into general PseKNC. Bioinformatics, 2018. 34, 4196-4204.
98. Liu, B., et al., iRSpot-EL: identify recombination spots with an ensemble learning approach. Bioinformatics, 2017. 33, 35-41.
99. Zhang, J., et al., Ontological function annotation of long non-coding RNAs through hierarchical multi-label classification. Bioinformatics, 2018. 34, 1750-1757.
100. Peng, J.J., W.W. Hui, and X.Q. Shang, Measuring phenotype-phenotype similarity through the interactome. BMC Bioinformatics, 2018. 19, 114.
101. Cheng, L., et al., DincRNA: a comprehensive web-based bioinformatics toolkit for exploring disease associations and ncRNA function. Bioinformatics, 2018, bty002.
102. Cheng, L., et al., LncRNA2Target v2.0: a comprehensive database for target genes of lncRNAs in human and mouse. Nucleic Acids Research, 2018, gky1051.
103. Xu, L., et al., SeqSVM: A Sequence-Based Support Vector Machine Method for Identifying Antioxidant Proteins. International Journal of Molecular Sciences, 2018. 19, 1773.
104. Xu, L., et al., An Efficient Classifier for Alzheimer's Disease Genes Identification. Molecules, 2018. 23, 3140.
105. Basu, S., F. Söderquist, and B. Wallner, Proteus: a random forest classifier to predict disorder-to-order transitioning binding regions in intrinsically disordered proteins. Journal of Computer-Aided Molecular Design, 2017. 31, 453-466.
106. Zhang, N., et al., Computational prediction and analysis of protein γ-carboxylation sites based on a random forest method. Molecular BioSystems, 2012. 8, 2946-2955.
107. Shu, Y., et al., Predicting A-to-I RNA Editing by Feature Selection and Random Forest. PLoS One, 2014. 9, e110607.
108. Liu, B., et al., iRO-3wPseKNC: Identify DNA replication origins by three-window-based PseKNC. Bioinformatics, 2018. 34, 3086-3093.
109. Pan, Y., Z. Wang, W. Zhan, and L. Deng, Computational identification of binding energy hot spots in protein-RNA complexes using an ensemble approach. Bioinformatics, 2017. 34, 1473-1480.
110. Cai, C.Z., et al., SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Research, 2003. 31, 3692-3697.
111. Li, D., Y. Ju, and Q. Zou, Protein Folds Prediction with Hierarchical Structured SVM. Current Proteomics, 2016. 13, 79-85.
112. Wang, S.P., et al., Analysis and Prediction of Nitrated Tyrosine Sites with the mRMR Method and Support Vector Machine Algorithm. Current Bioinformatics, 2018. 13, 3-13.
113. Zhang, N., et al., Discriminating Ramos and Jurkat Cells with Image Textures from Diffraction Imaging Flow Cytometry Based on a Support Vector Machine. Current Bioinformatics, 2018. 13, 50-56.
114. Song, T., et al., A Parallel Workflow Pattern Modelling Using Spiking Neural P Systems with Colored Spikes. IEEE Transactions on NanoBioscience, 2018. 17, 474-484.
115. Song, T., et al., Spiking Neural P Systems with Colored Spikes. IEEE Transactions on Cognitive and Developmental Systems, 2018. 10, 1106-1115.
116. Hang, X., et al., An Evolutionary Algorithm Based on Minkowski Distance for Many-Objective Optimization. IEEE Transactions on Cybernetics, 1-12.
117. Hang, X., et al., MOEA/HD: A Multiobjective Evolutionary Algorithm Based on Hierarchical Decomposition. IEEE Transactions on Cybernetics, 2017.
118. Mrozek, D., P. Daniłowicz, and B. Małysiak-Mrozek, HDInsight4PSi: Boosting performance of 3D protein structure similarity searching with HDInsight clusters in Microsoft Azure cloud. Information Sciences, 2016. 349-350, 77-101.
119. Mrozek, D., P. Gosk, and B. Małysiak-Mrozek, Scaling Ab Initio Predictions of 3D Protein Structures in Microsoft Azure Cloud. Journal of Grid Computing, 2015. 13, 561-585.
120. Mrozek, D., B. Małysiak-Mrozek, and A. Kłapciński, Cloud4Psi: cloud computing for 3D protein structure similarity searching. Bioinformatics, 2014. 30, 2822-2825.
121. Zou, Q., et al., Survey of MapReduce frame operation in bioinformatics. Briefings in Bioinformatics, 2014. 15, 637-647.
Table of Contents graphic (TOC)
Dataset input: input protein sequences
Feature representation: three sequence-based feature descriptors
Feature selection: feature selection strategy
Prediction engine: prediction models
Predictor: predicted MHC, MHC I, and MHC II