Learning from Semantically Dependent Multi-Tasks

Bin Liu∗†, Zenglin Xu∗†, Bo Dai‡, Haoli Bai∗†, Xianghong Fang∗, Yazhou Ren∗† and Shandian Zhe§
∗ Big Data Research Center, University of Electronic Science and Technology of China, Chengdu, China
† School of Computer Science & Engineering, University of Electronic Science and Technology of China, Chengdu, China
Email: {liu@std, zlxu@, bai@std}uestc.edu.cn, xianghong [email protected], [email protected]
‡ Georgia Institute of Technology, North Ave NW, Atlanta, USA. Email: [email protected]
§ Purdue University, West Lafayette, IN, USA. Email: [email protected]

Abstract—We consider a setting different from regular multi-task learning, where data from different tasks share no common instances and no common feature dictionary, while the features can be semantically correlated and all tasks share the same class space. For example, in the two tasks of identifying terrorism information from English news and Arabic news respectively, one associated dataset could be news from the Cable News Network (CNN), and the other news crawled from websites of Arabic countries. Intuitively, these two tasks could help each other, although they share no common feature space. This new setting brings obstacles to traditional multi-task learning algorithms and multi-view learning algorithms. We argue that these different data sources can be co-trained together by exploring the latent semantics among them. To this end, we propose a new graphical model based on sparse Gaussian Conditional Random Fields (GCRF) and the Hilbert-Schmidt Independence Criterion (HSIC). In addition to outputting predictions for each single task, it can also model (1) the dependency between the latent feature spaces of different tasks, (2) the dependency within the category space, and (3) the dependency between the latent feature space and the category space in each task. To make model inference effective, we provide an efficient variational EM algorithm. Experiments on both synthetic and real-world data sets indicate the feasibility and effectiveness of the proposed framework.
I. INTRODUCTION
Multi-task learning has been an active research topic in recent years. It usually assumes that data collected from multiple domains share a common feature space (a.k.a. dictionary), so that information across multiple domains can help improve learning performance [1]. Despite the success of multi-task learning in a number of applications, it is hard to apply in practical scenarios where data from different domains do not share a common feature dictionary, while all the domains correspond to the same label space and thus can be semantically dependent. As a motivating example, consider the task of categorizing English news on American football and news on soccer into two different categories. Thanks to their popularity, sports news can be found in many languages, such as French, Chinese or Spanish. Intuitively, the news in English can be used to construct classification tasks and to boost the task of classifying news in French. As another example, in biology, scientists from different countries may describe genomes from different groups of people via different measurements.
The above scenarios suggest a setting different from traditional multi-task learning and multi-view learning. Here, data from different domains usually have the following properties: (1) different data domains share no common instances, (2) different domains share no common feature dictionary, and (3) all domains share a common label (concept) space. The first property brings obstacles for standard multi-view learning techniques [2]–[5], which usually assume that an instance can be represented in different views. The second property brings challenges to standard multi-task learning techniques and transfer learning [6], most of which model the dependence between tasks in the same input feature space. This setting is also different from (Heterogeneous) Multi-task Multi-view Learning (MTMVL) [7]–[9], which is a combination of multi-task learning and multi-view learning. To differentiate from the literature on standard multi-task learning, we name this new setting Semantically Dependent Multi-Task Learning (SDMTL), which assumes that dependencies occur semantically in the latent feature spaces and in the output label (concept) space. The work presented in [10] also belongs to SDMTL, except that it requires the feature-label association graph to be known in advance. We illustrate the differences among these paradigms in Figure 1, where squares in dashed lines denote missing data views.
[Figure 1: four panels, each row a task and each column a data view: (a) Multi-task learning, (b) Multi-view learning, (c) (Heterogeneous) MTMVL, (d) SDMTL.]
Fig. 1. Comparison of four types of learning paradigms. Each row denotes a task and each column denotes a data view. Squares in dashed lines represent missing data views. In Heterogeneous MTMVL, one data view is missing; in SDMTL, two crossing views are missing, leading to no shared views.
In order to learn the dependence among tasks for better prediction, we propose a graphical model based on Multi-
task Sparse Gaussian Conditional Random Field [11], [12] and the Hilbert-Schmidt Independence Criterion (HSIC) [13]. In this graphical model, different data sources are generated from their representations in the latent space, where the dependence among the latent representations occurs. Different from the traditional literature, where the latent representations are shared, we introduce HSIC to model the dependence between latent feature spaces. Furthermore, in order to utilize the structure in the label space, we adopt a Sparse Gaussian Conditional Random Field (SGCRF) to describe the dependence within the category space and the association between the latent feature space and the category space. For efficient model inference, we develop a Variational Expectation-Maximization (VEM) framework. For model evaluation, we conduct experiments on both synthetic data and real-world data. Experimental results indicate that the discussed setting is meaningful and that the proposed framework can not only improve the classification accuracy for multi-category classification and multi-label classification, but also output the concept structure of the label space and the dependence graph between features and the label set. This paper is organized as follows. In Section II, we review related work in multi-task learning. In Section III, we present the proposed multi-task sparse Gaussian conditional random field, followed by the inference method in Section IV. We present experimental results in Section V and conclude in Section VI.
II. RELATED WORK
Multi-task learning (MTL) is an inductive transfer mechanism for improving generalization performance by incorporating knowledge from relevant tasks. Early MTL algorithms usually assume that all tasks are homogeneous and learn a common low-dimensional representation without considering the task relationship [14]–[17]. Since tasks are usually different, recent algorithms try to utilize prior knowledge on task relationships, e.g., hierarchical structures and graph structures in document or image classification, and learn model parameters such that similar tasks share similar parameters [18]–[22]. As a byproduct, the task relationships can also be learned [23]–[26]. Many multi-task learning methods assume that data from different tasks must lie in the same feature space, which is not always guaranteed. To address this problem, heterogeneous multi-task learning has been proposed, which projects data from different tasks into a shared common subspace [10], [24], [27]–[29]. Although the heterogeneity in heterogeneous multi-task learning methods takes different forms, such as missing features or data coming from different sources, most of these methods still require a shared common feature dictionary. Different from these methods, Semantically Dependent Multi-task Learning (SDMTL) relaxes this requirement: tasks can have no common feature dictionary, but are required to share a common concept structure.
Translated learning [30] and heterogeneous multi-task semantic feature learning [10] are similar to our work in that they do not require a shared common dictionary. However, the former requires strong dependencies between different feature dictionaries, and the latter requires the association graph between the class labels and the task features as a prior input. In contrast, SDMTL can control the dependence between the feature dictionaries with the HSIC measure and learn the association graph through the sparse Gaussian conditional random field. The new setting of SDMTL is also related to Self-Taught Learning (STL) [31], [32], which can also transfer knowledge from unlabeled data. Self-taught learning assumes that the basis information from the unlabeled data can be used to reconstruct the feature space of the labeled data. It is significantly different from the SDMTL discussed in this paper, since STL does not explore the label connections between the source data and the target data.
III. MODEL
Before presenting the model, we first introduce the notation. Suppose there are $T$ tasks, and the $t$-th task is represented by $\{X_t, F_t\}$, where $X_t \in \mathbb{R}^{k_t \times N_t}$ denotes $N_t$ examples in a $k_t$-dimensional space. We assume the labels are generated i.i.d. We consider two different settings, multi-category classification and multi-label classification, according to the data type of the labels. We denote the labels by $F_t \in \{0, 1\}^{D \times N_t}$ for both settings, where $D$ is the number of classes.
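As a concrete illustration of this data layout (a sketch of ours, not taken from the paper; the sizes and variable names are illustrative), each task contributes its own feature matrix with a task-specific dimensionality, while all tasks share the same D-dimensional label space:

```python
import numpy as np

D = 6                                              # number of classes, shared by all tasks
rng = np.random.default_rng(0)

# Each task t has its own feature dimensionality k_t and sample size N_t,
# but all tasks use the same D-dimensional label space.
tasks = []
for k_t, N_t in [(90, 100), (180, 200)]:           # illustrative sizes
    X_t = rng.normal(size=(k_t, N_t))              # features: k_t x N_t
    F_t = rng.integers(0, 2, size=(D, N_t))        # labels:   D   x N_t (multi-label case)
    tasks.append((X_t, F_t))
```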
[Figure 2: graphical model of the generative process for each task $t$, with variables $W_t$, $U_t$, $X_t$, $Z_t$, $Y_t$, $F_t$ and parameters $\sigma_{1t}$, $\sigma_{2t}$, $\beta_t$, $\Theta_t$, $\Lambda$.]
Fig. 2. The graphical illustration of the Sparse Multi-task Gaussian Conditional Random Field.
In order to model the dependence between different data sources, we propose a generative model, as shown in Figure 2, for Semantically Dependent Multi-task Learning (SDMTL), where different domains of data share no common feature dictionary and no common instances. For the $t$-th ($t = 1, \ldots, T$) task, we assume that a latent representation denoted by $U_t$ generates the observed data $X_t$ via the latent projection matrix $W_t$, where $U_t \in \mathbb{R}^{p \times N_t}$, $W_t \in \mathbb{R}^{k_t \times p}$, and $p$ denotes the number of latent dimensions. The label matrices are also generated from $U_t$, in a manner that will be given later. The latent variable $Z_t$ denotes the output
$F_t$ in a continuous space, and $Y_t$ is an auxiliary variable introduced for inference convenience. For convenience of presentation and without confusion, we use $X$ to denote the set $\{X_t\}$ for $t = 1, \ldots, T$; $U$, $W$, $Z$, $Y$ and $F$ are defined similarly. In detail, the whole generation procedure is described as follows.

A. Observation Generation
Draw each row of the projection matrix as $[W_t]_{i\cdot} \sim \mathcal{N}([W_t]_{i\cdot}; 0, \sigma_{1t} I_p)$, where $[W_t]_{i\cdot} \in \mathbb{R}^{1 \times p}$ and $\sigma_{1t}$ is the Gaussian noise parameter for $[W_t]_{i\cdot}$. Each instance, denoted by a column of the latent representation matrix $U_t$, is generated from a normal distribution,
$$[U_t]_{\cdot j} \sim \mathcal{N}([U_t]_{\cdot j}; 0, \sigma_{2t} I_p),$$
where $[U_t]_{\cdot j} \in \mathbb{R}^{p \times 1}$ and $\sigma_{2t}$ is the Gaussian noise parameter for $[U_t]_{\cdot j}$. Then we can write the data distribution for $X_t$ as $X_t \sim \mathcal{MN}_{k_t, N_t}(X_t; W_t U_t, \beta_t I_{k_t}, I_{N_t})$. For simplicity, we simplify the covariance structure of $X_t$ by re-parameterizing $\beta_t = \sigma_{1t} \times \sigma_{2t}$. For convenience of presentation, we use matrix-variate distributions [33]–[36] to describe the distributions of matrices. For example, if a matrix $A$ follows a matrix-variate normal distribution $\mathcal{MN}(A; 0, \Sigma_1, \Sigma_2)$, the vectorization of $A$ follows a normal distribution, i.e., $\mathrm{vec}(A) \sim \mathcal{N}(\mathrm{vec}(A); 0, \Sigma_1 \otimes \Sigma_2)$, where $\otimes$ denotes the Kronecker product, and $\Sigma_1$ and $\Sigma_2$ are the row and column covariance matrices, respectively.

B. Label Generation
In order to model both the structure of the label space and the interaction between the latent attributes and the labels, we employ a Gaussian Conditional Random Field (GCRF) to model the conditional distribution $p(Z_t \mid U_t)$. The conditional distribution can be expressed as
$$p(z_t \mid u_t) = \frac{1}{H(u_t)} \exp\Big(-\frac{1}{2} z_t^\top \Lambda z_t - u_t^\top \Theta_t z_t\Big), \qquad (1)$$
where $\Lambda$ denotes the inverse covariance matrix of the labels, whose non-zero elements correspond to the conditional dependence between two variables, and $\Theta_t$ describes the association between the latent factors and the label space in task $t$. The partition function $H(u_t)$ is defined through
$$\frac{1}{H(u_t)} = \mathrm{const} \cdot |\Lambda|^{\frac{1}{2}} \exp\Big(-\frac{1}{2} u_t^\top \Theta_t \Lambda^{-1} \Theta_t^\top u_t\Big),$$
where $|\cdot|$ is the determinant operation. We can employ a matrix-variate normal distribution to represent the conditional distribution $p(Z_t \mid U_t)$:
$$p(Z_t \mid U_t) = \mathcal{MN}_{D, N_t}\big(Z_t; -\Lambda^{-1} \Theta_t^\top U_t, \Lambda^{-1}, I_{N_t}\big).$$
Since $Z_t$ denotes the output label $F_t$ in the continuous space, in order to establish a connection between $Z_t$ and $F_t$, we introduce a link function $p([F_t]_{d,n} \mid [Z_t]_{d,n})$ as
$$p(F_t \mid Z_t) = \prod_{n=1}^{N_t} \prod_{d=1}^{D} p([F_t]_{d,n} \mid [Z_t]_{d,n}),$$
where $[F_t]_{d,n}$ is the $d$-th output label for the $n$-th example, and $[Z_t]_{d,n}$ is its corresponding latent variable. Note that the link function can differ across settings. For multi-label problems, a common way is to assign a binary probit noise function to each category. Under such a setting, however, direct inference on $Z_t$ is intractable, and we therefore follow [36]–[38] and introduce another auxiliary variable $Y_t$ as an augmented representation of the probit model:
$$p([F_t]_{d,n} \mid [Y_t]_{d,n}) = \delta([F_t]_{d,n} = 1)\,\delta([Y_t]_{d,n} > 0) + \delta([F_t]_{d,n} = 0)\,\delta([Y_t]_{d,n} \le 0), \qquad (2)$$
$$p([Y_t]_{d,n} \mid [Z_t]_{d,n}) = \mathcal{N}([Y_t]_{d,n}; [Z_t]_{d,n}, 1), \qquad (3)$$
where $\delta$ is the indicator function (its value is 1 if the statement inside is true, and 0 otherwise). For multi-category problems, we follow [39] and assign a multinomial probit regression to the link function $p([F_t]_{d,n} \mid [Z_t]_{d,n})$. By similarly introducing an auxiliary variable $Y_t$, we have
$$p([F_t]_{d,n} = 1 \mid [Y_t]_{\cdot,n}) = \delta([F_t]_{d,n} = 1)\,\delta([Y_t]_{d,n} > [Y_t]_{i,n}\ \forall i \ne d), \qquad (4)$$
$$p([Y_t]_{d,n} \mid [Z_t]_{d,n}) = \mathcal{N}([Y_t]_{d,n}; [Z_t]_{d,n}, 1). \qquad (5)$$
The introduction of the sparse Gaussian conditional random field brings two benefits: (1) the dependence between labels, which is shared by all tasks, can be explored by learning $\Lambda$ from data; and (2) for all tasks, the dependence between the feature space and the label space can be incorporated by exploiting the structure of $\hat{\Theta}_t = \Lambda^{-1} \Theta_t^\top$.

C. Joint Distribution
Denoting $\Theta = \{\Theta_1, \ldots, \Theta_T\}$, we can combine the model components (Eq. (1), (2), (3), (4), (5)) and obtain the joint distribution as
$$p(U, W, X, Z, Y, F \mid \beta, \Theta, \Lambda) = p(X \mid W, U; \beta)\, p(U)\, p(W)\, p(Z \mid U; \Theta, \Lambda)\, p(Y \mid Z)\, p(F \mid Y).$$
We can then calculate the marginal likelihood $L(\beta, \Lambda, \Theta)$, or equivalently $\log p(X, F)$, as
$$L(\beta, \Lambda, \Theta) = \log \int p(U, W, X, Z, Y, F)\, dU\, dW\, dZ\, dY. \qquad (6)$$
Given the observed data $X$ and $F$, the inference task is to estimate the latent variables and the hyper-parameters $\{\beta_t, \Lambda, \Theta\}$.
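To make the generative process concrete, the following sketch (ours, not from the paper; it assumes NumPy, illustrative sizes, and a randomly constructed positive-definite $\Lambda$) samples one task from the linear-Gaussian and probit components described above:

```python
import numpy as np

rng = np.random.default_rng(0)
k_t, N_t, p, D = 90, 100, 10, 6            # illustrative sizes (ours)
sigma1, sigma2 = 1.0, 1.0                  # Gaussian noise parameters sigma_{1t}, sigma_{2t}
beta = sigma1 * sigma2                     # beta_t = sigma_{1t} * sigma_{2t}

# A. Observation generation
W_t = rng.normal(0.0, np.sqrt(sigma1), size=(k_t, p))           # rows ~ N(0, sigma_{1t} I_p)
U_t = rng.normal(0.0, np.sqrt(sigma2), size=(p, N_t))           # columns ~ N(0, sigma_{2t} I_p)
X_t = W_t @ U_t + np.sqrt(beta) * rng.normal(size=(k_t, N_t))   # X_t ~ MN(W_t U_t, beta I, I)

# B. Label generation: Z_t | U_t ~ MN(-Lambda^{-1} Theta_t^T U_t, Lambda^{-1}, I)
A = rng.normal(size=(D, D))
Lam = A @ A.T + D * np.eye(D)              # a random positive-definite Lambda (assumption)
Theta_t = rng.normal(size=(p, D))          # association between latent factors and labels
Lam_inv = np.linalg.inv(Lam)
Z_t = -Lam_inv @ Theta_t.T @ U_t + np.linalg.cholesky(Lam_inv) @ rng.normal(size=(D, N_t))

# Probit link through the auxiliary variable Y_t, then threshold to obtain labels F_t
Y_t = Z_t + rng.normal(size=(D, N_t))
F_t = (Y_t > 0).astype(int)                # multi-label case; use argmax over rows for multi-class
```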
D. Posterior Regularization
Since different tasks may share some common knowledge at the concept level, it is important to incorporate this 'prior' knowledge into the model. There are three pieces of knowledge to consider: (1) the dependence matrix between the feature space and the label space should be sparse, (2) the dependence matrix between different categories should be sparse, and (3) the latent spaces of different tasks tend to be dependent. To this end, we employ the idea of posterior regularization [40] to satisfy these requirements. In detail, to prefer sparse solutions, we put $L_1$-norm constraints on $\Theta$ and $\Lambda$ and minimize the $L_1$-norms; and to model the dependence between the $U_t$ ($t = 1, \ldots, T$), we add an additional regularization term based on the Hilbert-Schmidt Independence Criterion (HSIC) [13] and maximize the dependence. The expected HSIC measure between the source domain $U_s$ and the target domain $U_t$ under the distribution $p(U_s, U_t)$ can be calculated as
$$\langle \Phi(U_t, U_s) \rangle_{p(U_s, U_t)} = \mathrm{trace}(K_t H K_s H),$$
where $H = I - \frac{1}{p} \mathbf{1}\mathbf{1}^\top$, and $K_s \in \mathbb{R}^{p \times p}$ and $K_t \in \mathbb{R}^{p \times p}$ are the kernel matrices calculated on $U_s$ and $U_t$, respectively. Although we can pick arbitrary kernels, we select the linear kernel because the integral operators are then easy to compute. Therefore, the marginal likelihood with posterior regularization can be written as
$$\max_{\beta, \Theta, \Lambda \in S_+} \; L(\beta, \Lambda, \Theta) - \lambda_3 \sum_t \|\Theta_t\|_1 - \lambda_2 \|\Lambda\|_1 + \lambda_1 \sum_{t, s \in P} \langle \Phi(U_t, U_s) \rangle_{p(U_s, U_t)},$$
where $\Lambda$ is constrained to be positive semi-definite (PSD), $S_+$ is the set of PSD matrices, and $P$ is the set of all non-exchangeable pairs of source and target tasks. Here $\lambda_1$, $\lambda_2$, and $\lambda_3$ are trade-off parameters. In particular, a large value of $\lambda_1$ will force $U_s$ and $U_t$ to be highly dependent on each other, and large values of $\lambda_2$ and $\lambda_3$ will output sparse graphs.
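With a linear kernel, the HSIC term above reduces to a simple trace computation. A minimal sketch (ours; it treats the rows of each latent matrix as the $p$ objects over which the kernels are built, as in the text) is:

```python
import numpy as np

def hsic_linear(U_t: np.ndarray, U_s: np.ndarray) -> float:
    """Empirical HSIC between two latent matrices U_t, U_s (each p x N_i),
    using linear kernels over the p latent dimensions, i.e. K = U U^T."""
    p = U_t.shape[0]
    H = np.eye(p) - np.ones((p, p)) / p   # centering matrix
    K_t = U_t @ U_t.T                     # p x p linear kernel on U_t
    K_s = U_s @ U_s.T                     # p x p linear kernel on U_s
    return float(np.trace(K_t @ H @ K_s @ H))

# toy usage: a larger value indicates stronger dependence between latent spaces
rng = np.random.default_rng(0)
U1 = rng.normal(size=(10, 100))
U2 = U1[:, :80] + 0.1 * rng.normal(size=(10, 80))        # dependent latent space
print(hsic_linear(U1, U2), hsic_linear(U1, rng.normal(size=(10, 80))))
```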
IV. INFERENCE
To estimate the latent variables (i.e., $U$, $W$, $Z$, $Y$) and the hyper-parameters (i.e., $\Lambda$, $\Theta$, and $\beta$), we use an expectation-maximization algorithm. The E-step computes the expected log probability of the joint model over the posterior distribution $p(U, W, Z, Y \mid X, F)$, and the M-step optimizes the expected log probability of the joint model over $\Lambda$, $\Theta$, and $\beta$. However, the integral in Eq. (6) is intractable, so we instead work with a variational lower bound on the marginal likelihood. The idea of variational inference is to approximate the posterior distribution $p(U, W, Z, Y \mid X, F)$ by a fully factorized distribution:
$$p(U, W, Z, Y) \approx q(W)\, q(U)\, q(Z)\, q(Y).$$
Therefore, we can obtain the evidence lower bound as
$$L(\beta, \Theta, \Lambda) \ge \tilde{L}(Q, \beta, \Theta, \Lambda) = \Big\langle \log \frac{p(U, W, X, Z, Y, F)}{q(W)\, q(U)\, q(Z)\, q(Y)} \Big\rangle_{q(W) q(U) q(Z) q(Y)},$$
where $Q$ denotes $q(W)$, $q(U)$, $q(Z)$ and $q(Y)$.

A. E-step
In the E-step, we iteratively update $q(W)$, $q(U)$, $q(Z)$, and $q(Y)$. More specifically, similar to a coordinate descent algorithm, the variational approach updates one approximate distribution at a time while keeping all the others fixed. Given the current $q(U)$, $q(Z)$, and $q(Y)$, we obtain the variational distribution of $W_t$ as
$$q(W_t) = \mathcal{MN}_{k_t, p}(W_t; \hat{W}_t, I, \Omega_w),$$
where $\Omega_w = \big(\frac{1}{\beta_t} U_t U_t^\top + I\big)^{-1}$ and $\hat{W}_t = \frac{1}{\beta_t} X_t U_t^\top \Omega_w$. Given the current $q(W)$, $q(Z)$, and $q(Y)$, we obtain the variational distribution of $U_t$ as
$$q(U_t) = \mathcal{MN}_{p, N_t}(U_t; \hat{U}_t, \Omega_u, I),$$
where
$$\Omega_u = \Big(\Theta_t \Lambda^{-1} \Theta_t^\top + \frac{1}{\beta_t} W_t^\top W_t - \lambda_1 \sum_{i \ne t} H U_i U_i^\top H + I\Big)^{-1}, \qquad \hat{U}_t = \Omega_u \big(\beta_t^{-1} W_t^\top X_t - \Theta_t Z_t\big).$$
Given the current $q(W)$, $q(U)$, and $q(Y)$, we obtain the variational distribution of $Z_t$ as
$$q(Z_t) = \mathcal{MN}_{D, N_t}(Z_t; \hat{Z}_t, \Omega_z, I),$$
where $\Omega_z = (\Lambda + I)^{-1}$ and $\hat{Z}_t = \Omega_z (Y_t - \Theta_t^\top U_t)$. Finally, if the data have multiple labels, we employ the multi-label binary probit noise model to update $Y_t^{ij}$ as
$$Y_t^{ij} = Z_t^{ij} + \frac{(2 F_t^{ij} - 1)\, \mathcal{N}_{Z_t^{ij}}(0, 1)}{\Phi\big((2 F_t^{ij} - 1) Z_t^{ij}\big)},$$
where the superscripts denote the element indices. The derivation is similar to that in [36], [37]. If the data labels are multi-class, following [39], we use the multinomial probit noise model. Assuming the $i$-th example has class index $k$, i.e., $f_t^i = k$, then for all $l \ne k$ we have
$$Y_t^{il} = Z_t^{il} - \frac{\mathbb{E}_{p(u)}\big[\mathcal{N}_u(Z_t^{il} - Z_t^{ik}, 1)\, \Phi_u^{i,k,l}\big]}{\mathbb{E}_{p(u)}\big[\Phi(u + Z_t^{ik} - Z_t^{il})\, \Phi_u^{i,k,l}\big]}, \qquad Y_t^{ik} = Z_t^{ik} + \sum_{j \ne k} \big(Z_t^{ij} - Y_t^{ij}\big),$$
where $\Phi_u^{i,k,l} = \prod_{j \ne k, l} \Phi(u + Z_t^{ik} - Z_t^{ij})$ and $p(u) = \mathcal{N}(u; 0, 1)$. Since this expectation cannot be calculated analytically, we compute it by sampling.
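For the binary probit case, the $q(Y)$ update above is simply the mean of a Gaussian truncated at zero. A small sketch of ours (using SciPy's standard normal pdf and cdf) is:

```python
import numpy as np
from scipy.stats import norm

def update_Y_binary(Z_t: np.ndarray, F_t: np.ndarray) -> np.ndarray:
    """Posterior mean of the auxiliary probit variable Y_t given Z_t and the
    binary labels F_t (both D x N_t): Y = Z + (2F-1) * phi(Z) / Phi((2F-1) Z)."""
    s = 2.0 * F_t - 1.0                       # +1 for label 1, -1 for label 0
    return Z_t + s * norm.pdf(Z_t) / norm.cdf(s * Z_t)

# toy check: for F = 1 the update pushes Y above Z, for F = 0 below Z
Z = np.array([[0.5, -0.3]])
F = np.array([[1, 0]])
print(update_Y_binary(Z, F))
```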
B. M-step
Based on the expectations of the latent variables, the M-step optimizes the hyper-parameters $\Theta$, $\Lambda$, and $\beta$:
$$\max_{\Lambda, \Theta, \beta} \; \mathbb{E}_q[\hat{L}] - \lambda_2 \|\Lambda\|_1 - \lambda_3 \sum_t \|\Theta_t\|_1,$$
where $\mathbb{E}_q[\hat{L}]$ is the expectation of $\hat{L}$. Eliminating the constants, we have the following optimization problem:
$$\max \; f(\Lambda, \Theta, \beta) = N \log|\Lambda| - \frac{1}{2} \sum_t \mathrm{tr}(A_t) + \sum_t \frac{N_t}{2} \log|D_t| - \lambda_2 \|\Lambda\|_1 - \lambda_3 \sum_t \|\Theta_t\|_1, \qquad (7)$$
where $N = \sum_t N_t$ and
$$A_t = Z_t^\top \Lambda Z_t + U_t^\top \Theta_t \Lambda^{-1} \Theta_t^\top U_t + 2 Z_t^\top \Theta_t^\top U_t, \qquad D_t = \Theta_t \Lambda^{-1} \Theta_t^\top + \beta_t^{-1} W_t^\top W_t - \lambda_1 \sum_{i \ne t} H U_i U_i^\top H + I.$$
Taking the derivative of $f(\Lambda, \Theta, \beta)$ in Eq. (7) with respect to $\Theta_t$ and $\Lambda$, we obtain the gradients
$$\frac{\partial f}{\partial \Theta_t} = U_t Z_t^\top + \Big(U_t U_t^\top - \frac{N_t}{2} D_t^{-1}\Big) \Theta_t \Lambda^{-1},$$
$$\frac{\partial f}{\partial \Lambda} = \sum_t \big(Z_t Z_t^\top - \Lambda^{-1} \Theta_t^\top U_t U_t^\top \Theta_t \Lambda^{-1}\big) + \frac{1}{2} \sum_t N_t\, \Lambda^{-1} \Theta_t^\top D_t^{-1} \Theta_t \Lambda^{-1} - \frac{N}{2} \Lambda^{-1}.$$
We can use L-BFGS [41] to optimize the above problem. When optimizing $\beta_t$, the objective function becomes
$$\max_{\beta_t} \; g(\beta_t) = -\frac{N_t}{2} \log|\beta_t^{-1} I| - \mathrm{tr}(B_t) + \frac{N_t}{2} \log|\beta_t^{-1} U_t U_t^\top + I| + \frac{k_t}{2} \log|D_t|,$$
where $B_t = \beta_t^{-1} X_t^\top X_t + \beta_t^{-1} U_t^\top W_t^\top W_t U_t - 2 \beta_t^{-1} X_t^\top W_t U_t$. Thus, the optimal $\beta_t$ can be obtained by solving the equation
$$\frac{N_t k_t}{2} \beta_t + E + \sum_{i=1}^{p} \frac{k_t \eta_i \beta_t}{2 (\eta_i + \beta_t)} + \frac{N_t}{2} \mathrm{tr}\big(D_t^{-1} W_t^\top W_t\big) = 0,$$
where $\eta_i$ is the $i$-th eigenvalue of $U_t U_t^\top$ and $E = \mathrm{tr}(X_t^\top X_t) + \mathrm{tr}(U_t^\top W_t^\top W_t U_t - 2 X_t^\top W_t U_t)$.
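As a sanity check on the M-step objective, the following sketch (ours) only evaluates $f$ of Eq. (7) for given E-step expectations; the actual $L_1$-regularized optimization would be handed to an OWL-QN/L-BFGS-style solver as in [41]:

```python
import numpy as np

def m_step_objective(Lam, Thetas, Zs, Us, Ds, lam2, lam3):
    """Evaluate f(Lambda, Theta, beta) of Eq. (7) from the E-step expectations.
    Lam: D x D, Thetas[t]: p x D, Zs[t]: D x N_t, Us[t]: p x N_t, Ds[t]: p x p."""
    N = sum(Z.shape[1] for Z in Zs)
    Lam_inv = np.linalg.inv(Lam)
    f = N * np.linalg.slogdet(Lam)[1]
    for Theta, Z, U, D_t in zip(Thetas, Zs, Us, Ds):
        # A_t collects the quadratic terms of the expected GCRF log-likelihood
        A_t = Z.T @ Lam @ Z + U.T @ Theta @ Lam_inv @ Theta.T @ U + 2.0 * Z.T @ Theta.T @ U
        f += -0.5 * np.trace(A_t) + 0.5 * Z.shape[1] * np.linalg.slogdet(D_t)[1]
    return f - lam2 * np.abs(Lam).sum() - lam3 * sum(np.abs(T).sum() for T in Thetas)
```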
V. EXPERIMENTS
We design experiments to evaluate the performance of the proposed method. Since existing multi-task learning methods cannot be applied in the setting discussed in this paper, we compare our method with several single-task methods: Support Vector Machine (SVM) [42], Logistic Regression (LR) [43], and the Sparse Gaussian Conditional Random Field (GCRF) [11]. Additionally, GCRF can also model the dependence between the feature space and the output label space.

A. Synthetic Data
Considering the types of the output, we generate synthetic data in two settings: the multi-category setting and the multi-label setting.
[Figure 3: 0-1 loss versus the number of latent dimensions p, for Task 1 and Task 2.]
Fig. 3. Variance of latent dimensions on multi-category synthetic data.
1) Multi-category setting: We first generate a synthetic two-task multi-class data set. For each task, the latent variable $U_t$ is sampled column-wise from a shared 10-dimensional multivariate Gaussian distribution. The projection matrix $W_t$ is sampled row-wise from a multivariate Gaussian distribution $\mathcal{N}(0, I_p)$. The observations $X_t$ are then obtained by projecting the latent variables $U_t$ via $W_t$. There are 6 categories in the data. We generate the class labels according to the GCRF, with parameters $\Lambda$ and $\Theta$ sampled from random Gaussian distributions; the class label is set to the index of the maximum output of the GCRF. We generate 100 instances with 90 dimensions for the first task, and 200 instances with 180 dimensions for the second task. We use the empirical zero-one loss to measure all methods in the multi-class setting, and use cross-validation to set all the parameters. For the number of latent dimensions (denoted by $p$), we show a sensitivity analysis in Figure 3. The results show that with $p = 8$ our method already achieves good performance, even though this is smaller than the true latent dimension, so we set $p$ to 8. We repeat each experiment 10 times and show the average results with standard deviations in Figure 4(a). Our method achieves significantly lower losses than the other algorithms on both tasks. This indicates the benefit of exploiting a semantically dependent task, even though the two tasks share no common dictionary and contain different instances.
2) Multi-label setting: We adopt a similar strategy to generate the synthetic data for the multi-label classification setting. There are 6 categories in both tasks, and we set a label to 1 if the corresponding output of the GCRF is greater than zero. The empirical Hamming loss between the prediction and the ground truth is used as the evaluation measure. Parameters are set in the same way. Experimental results are shown in Figure 4(b). Our method achieves a lower Hamming loss than the other models, demonstrating the effectiveness of modeling the semantic relation between the two tasks in the multi-label setting.
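The two evaluation measures used above are straightforward to compute; a short sketch of ours, assuming class-index vectors for the multi-category case and binary D x N label matrices for the multi-label case:

```python
import numpy as np

def zero_one_loss(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Empirical 0-1 loss for the multi-category setting (class indices, length N)."""
    return float(np.mean(y_true != y_pred))

def hamming_loss(F_true: np.ndarray, F_pred: np.ndarray) -> float:
    """Empirical Hamming loss for the multi-label setting (binary D x N matrices)."""
    return float(np.mean(F_true != F_pred))

# toy usage
print(zero_one_loss(np.array([0, 1, 2]), np.array([0, 2, 2])))                    # 0.333...
print(hamming_loss(np.array([[1, 0], [0, 1]]), np.array([[1, 1], [0, 1]])))       # 0.25
```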
[Figure 4: bar charts comparing SVM, Softmax, GCRF, and our method on Task 1 and Task 2; (a) 0-1 loss on the multi-category synthetic data, (b) Hamming loss on the multi-label synthetic data.]
Fig. 4. Experimental results on the synthetic datasets. (a) Multi-category synthetic data; (b) multi-label synthetic data. The color bars in (b) denote the same methods as in (a).
[Figure 5: bar charts of Hamming loss for SVM, Softmax, GCRF, and our method; (a) Corel5k (selected images vs. remaining images), (b) DMOZ (English vs. French).]
Fig. 5. Experimental results on the real-world datasets. (a) Corel5k; (b) DMOZ. The color bars in (b) denote the same methods as in (a).
[Figure 6: dependence graphs learned by the model. (a) Dependence among the label space (Physics, Technology, Chemistry, Math, Agriculture); (b) dependence between the Agriculture label and its latent feature space; (c) dependence between the Technology label and its latent feature space.]
Fig. 6. Graphical illustration of the dependence matrices. Labels are colored gray, and latent features are numbered and colored in light gray. The original words are in the outer layer.
B. Real-world Data
We use two real-world datasets to evaluate the performance of our proposed method. The first dataset, Corel5k, is a multi-label image dataset containing 5000 instances with a resolution of 192×168 (http://mulan.sourceforge.net/datasets-mlc.html). In this paper, we randomly select 500 images to form the first task, reducing their resolution to 96×64; the remaining images with the original resolution form the second task. We then extract SIFT features [44] from both subsets to form the input matrices $X_t$, setting the width of the histogram windows of the SIFT statistics to 30. The results in Figure 5(a) show that our method clearly outperforms the other algorithms on both tasks. Specifically, compared with SVM, LR and GCRF, the Hamming loss achieved by our method decreases by 31.6%, 31.5% and 6.5% on the first task, and by 36.8%, 35.8% and 8.3% on the second task, respectively.
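A hedged sketch of the SIFT feature extraction step (ours; it uses OpenCV's SIFT and a simple descriptor-statistics summary, since the paper's exact histogram construction with window width 30 is not reproduced here):

```python
import cv2
import numpy as np

def sift_descriptor_stats(image_path: str, grid_bins: int = 30) -> np.ndarray:
    """Extract SIFT keypoint descriptors and summarize them as a fixed-length
    feature vector (mean descriptor plus a coarse spatial histogram of keypoints)."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        raise FileNotFoundError(image_path)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(img, None)
    if descriptors is None:                              # no keypoints found
        return np.zeros(128 + grid_bins)
    xs = np.array([kp.pt[0] for kp in keypoints])
    hist, _ = np.histogram(xs, bins=grid_bins, range=(0, img.shape[1]))
    return np.concatenate([descriptors.mean(axis=0), hist.astype(float)])
```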
The second dataset, named DMOZ, is a collection of web pages in English and French crawled from the Science directory of DMOZ (http://www.dmoz.org/Science). It contains five classes, i.e., agriculture, chemistry, math, physics, and technology. The English web pages form the first task, and the French web pages form the second task. Both tasks can be seen as multi-label classification tasks, since each web page can be categorized into multiple categories. We extract 2,000 articles in English and 1,000 articles in French. After feature selection via TF-IDF [45] and ReliefF [46], the input feature dimension of $X_t$ is reduced to 2,000 for both languages. Again, the results plotted in Figure 5(b) show that our method achieves significantly lower, or at least equal, loss rates compared to the other algorithms.
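A hedged sketch of this text preprocessing pipeline (ours; it uses scikit-learn's TfidfVectorizer and a chi-squared filter as a stand-in for the ReliefF step, which would instead come from a package such as skrebate, and it assumes a single class label per document for simplicity):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

def build_task_features(documents, labels, n_features=2000):
    """TF-IDF weighting followed by supervised feature selection down to
    n_features dimensions; returns the selected feature matrix (N x n_features)."""
    tfidf = TfidfVectorizer()
    X = tfidf.fit_transform(documents)                     # N x V sparse TF-IDF matrix
    selector = SelectKBest(chi2, k=min(n_features, X.shape[1]))
    return selector.fit_transform(X, labels)               # keep the top-scoring terms
```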
In addition to making predictions, our method can output the label dependence as well as the dependence between features and the labels. We visualize the label dependence in Figure 6(a) by selecting the most significant links in the dependence matrix. Due to space limits, we only draw the dependence graphs of the "Agriculture" class and the "Technology" class, in Figures 6(b) and 6(c), respectively. Note that since $U_t$ can be regarded as topics over the original feature dimensions (i.e., words), and $\hat{\Theta}_t$ controls the dependence between the latent features and the output labels, we can establish a hierarchical connection in which labels are connected to latent features, and latent features are connected to the original words. The words associated with the two labels in Figures 6(b) and 6(c) are intuitively representative.

VI. CONCLUSION
In this paper, we deal with the setting of Semantically Dependent Multi-Task Learning (SDMTL), where different tasks share no common feature dictionary and no common instances, but share a common label set. In order to learn the dependencies between different dictionaries and between the label set and each feature space, we have designed a model based on the sparse Gaussian Conditional Random Field and the Hilbert-Schmidt Independence Criterion. Experiments on both simulated data and real-world data have shown improved classification accuracy and meaningful dependence graphs.

VII. ACKNOWLEDGEMENT
This paper was in part supported by Grants from the Natural Science Foundation of China (No. 61572111), the National High Technology Research and Development Program of China (863 Program) (No. 2015AA015408), the joint Foundation of the Ministry of Education of China and China Mobile Communication Corporation (No. MCM20150505), a Project funded by the China Postdoctoral Science Foundation (No. 2016M602674), a 985 Project of UESTC (No. A1098531023601041), and two Fundamental Research Funds for the Central Universities of China (Nos. ZYGX2014J058, A03012023601042).

REFERENCES
[1] R. Caruana, "Multitask learning," Mach. Learn., vol. 28, no. 1, pp. 41–75, 1997. [2] C. Xu, D. Tao, and C. Xu, "A survey on multi-view learning," CoRR, vol. abs/1304.5634, 2013. [Online]. Available: http://arxiv.org/abs/1304.5634 [3] S. Zhe, Z. Xu, Y. Qi, and P. Yu, "Sparse bayesian multiview learning for simultaneous association discovery and diagnosis of alzheimer's disease," in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA, 2015, pp. 1966–1972. [4] Z. Xu, S. Zhe, Y. Qi, and P. Yu, "Association discovery and diagnosis of alzheimer's disease with bayesian multiview learning," J. Artif. Intell. Res. (JAIR), vol. 56, pp. 247–268, 2016. [5] Y. Li, M. Yang, Z. Xu, and Z. M. Zhang, "Multi-view learning with limited and noisy tagging," in Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, 2016, pp. 1718–1724. [6] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Trans. on Knowl. and Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010. [7] J. He and R. Lawrence, "A graph-based framework for multi-task multiview learning," in Proc. of the 28th Int. Conf. on Mach. Learn., 2011, pp. 25–32.
[8] J. Zhang and J. Huan, “Inductive multi-task learning with multiple view data,” in Proc. of the ACM SIGKDD Int. Conf. on Knowl. Disc. and Data Mining, 2012, pp. 543–551. [9] H. Yang and J. He, “Learning with dual heterogeneity: A nonparametric bayes model,” in Proc. of the 20th ACM SIGKDD Int. Conf. on Knowl. Disc. and Data Mining, 2014, pp. 582–590. [10] X. Jin, F. Zhuang, S. J. Pan, C. Du, P. Luo, and Q. He, “Heterogeneous multi-task semantic feature learning for classification,” in Proc. of the 24th ACM Int. Conf. on Information and Knowl. Management, 2015, pp. 1847–1850. [11] K. Sohn and S. Kim, “Joint estimation of structured sparsity and output structure in multiple-output regression via inverse-covariance regularization,” in Proc. of the 15th Int. Conf. on Art. Intell. and Stat., 2012, pp. 1081–1089. [12] M. Wytock and Z. Kolter, “Sparse gaussian conditional random fields: Algorithms, theory, and application to energy forecasting,” in Proc. of the 30th Int. Conf. on Mach. Learn., 2013, pp. 1265–1273. [13] A. Gretton, O. Bousquet, A. J. Smola, and B. Sch¨olkopf, “Measuring statistical dependence with hilbert-schmidt norms,” in Proc. of Int. Conf. on Algorithmic Learn. Theory, 2005, pp. 63–77. [14] A. Argyriou, C. A. Micchelli, M. Pontil, and Y. Ying, “A spectral regularization framework for multi-task structure learning.” Advances in Neural Information Processing System, pp. 25–32, 2007. [15] H. Liu, M. Palatucci, and J. Zhang, “Blockwise coordinate descent procedures for the multi-task lasso, with applications to neural semantic basis discovery,” in Proc. of Int. Conf. on Mach. Learn., 2009, pp. 649– 656. [16] J. Liu, S. Ji, and J. Ye, “Multi-task feature learning via efficient L2,1norm minimization,” in Proc. of the 25th Conf. on Uncert. in Art. Intell., 2009, pp. 339–348. [17] A. Quattoni, X. Carreras, M. Collins, and T. Darrell, “An efficient projection for L1,∞ regularization,” in Proc. of the 26th Int. Conf. on Mach. Learn., 2009, pp. 857–864. [18] C. N. Silla and A. A. Freitas, “A survey of hierarchical classification across different application domains,” Data Mining and Knowl. Disc., vol. 22, no. 1-2, pp. 31–72, 2011. [19] R. Eisner, B. Poulin, D. Szafron, P. Lu, and R. Greiner, “Improving protein function prediction using the hierarchical structure of the gene ontology,” in IEEE Symposium on Computational Intell. in Bioinformatics and Computational Biology, 2005, pp. 1–10. [20] S. Kim, “Tree-guided group lasso for multi-task regression with structured sparsity,” in Proc. of the 27th Int. Conf. on Mach. Learn., 2010, pp. 543–550. [21] H. Wang, X. Shen, and W. Pan, “Large margin hierarchical classification with mutually exclusive class membership,” J. Mach. Learn. Res., vol. 12, no. 3, pp. 2721–2748, 2011. [22] A. Ahmed, A. Das, and A. J. Smola, “Scalable hierarchical multitask learning algorithms for conversion optimization in display advertising,” in Proc. of the 7th ACM Int. Conf. on Web Search and Data Mining, 2014, pp. 153–162. [23] L. Jacob, J.-p. Vert, and F. R. Bach, “Clustered multi-task learning: A convex formulation,” in Advances in Neural Information Processing Systems, 2008, pp. 745–752. [24] Y. Zhang and D.-Y. Yeung, “Multi-task learning in heterogeneous feature spaces,” in Proc. of the 25th AAAI Conf. on Art. Intell., 2011, pp. 63–77. [25] S. Han, X. Liao, and L. Carin, “Cross-domain multitask learning with latent probit models,” in Proc. of the 29th Int. Conf. on Mach. Learn., 2012, pp. 1463–1470. [26] G. Lee, E. Yang, and S. J. 
Hwang, “Asymmetric multi-task learning based on task relatedness and loss,” in Proc. of the 33rd Int. Conf. on Mach. Learn., 2016, pp. 230–238. [27] J. He, L. Yan, and Y. Qiang, “Linking heterogeneous input spaces with pivots for multi-task learning,” in Proc. of SIAM Int. Conf. on Data Mining, 2014, pp. 181–189. [28] M. G¨onen and A. A. Margolin, “Kernelized bayesian transfer learning,” in Proc. of the 28th AAAI Conf. on Art. Intell., 2014, pp. 1831–1839. [29] X. Jin, F. Zhuang, H. Xiong, C. Du, P. Luo, and Q. He, “Multi-task multi-view learning for heterogeneous tasks,” in Proc. of the 23rd ACM Int. Conf. on Information and Knowl. Management, 2014, pp. 441–450. [30] W. Dai, Y. Chen, G. R. Xue, Q. Yang, and Y. Yu, “Translated learning: Transfer learning across different feature spaces,” in Advances in Neural Information Processing Systems, 2008, pp. 353–360.
[31] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng, “Self-taught learning: Transfer learning from unlabeled data,” in Proc. of the 24th Int. Conf. on Mach. Learn., 2007, pp. 759–766. [32] K. Huang, Z. Xu, I. King, M. R. Lyu, and C. Campbell, “Supervised selftaught learning: Actively transferring knowledge from unlabeled data,” in International Joint Conference on Neural Networks, IJCNN 2009, Atlanta, Georgia, USA, 14-19 June 2009, 2009, pp. 1272–1277. [33] A. K. Gupta and D. K. Nagar, Matrix variate distributions. CRC Press, 2000. [34] Z. Xu, F. Yan, and Y. A. Qi, “Sparse matrix-variate t process blockmodels,” in Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2011, San Francisco, California, USA, August 7-11, 2011, 2011. [35] ——, “Infinite tucker decomposition: Nonparametric bayesian models for multiway data analysis,” in Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012, 2012. [36] F. Yan, Z. Xu, and Y. Qi, “Sparse matrix-variate Gaussian process blockmodels for network modeling,” in Proc. of the 27th Conf. in Uncert. in Art. Intell., 2011, pp. 745–752. [37] J. H. Albert and S. Chib, “Bayesian analysis of binary and polychotomous response data,” J. Americ. Stat. Ass., vol. 88, pp. 669–679, 1993. [38] Z. Xu, F. Yan, and Y. A. Qi, “Bayesian nonparametric models for
multiway data analysis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 2, pp. 475–487, 2015. [39] M. Girolami and S. Rogers, "Variational bayesian multinomial probit regression with gaussian process priors," Neural Computation, vol. 18, no. 8, pp. 1790–1817, 2006. [40] K. Ganchev, J. Graça, J. Gillenwater, and B. Taskar, "Posterior regularization for structured latent variable models," J. Mach. Learn. Res., vol. 11, pp. 2001–2049, Aug. 2010. [41] G. Andrew and J. Gao, "Scalable training of L1-regularized log-linear models," in Proc. of the 24th Int. Conf. on Mach. Learn., 2007, pp. 33–40. [Online]. Available: http://doi.acm.org/10.1145/1273496.1273501 [42] C. Cortes and V. Vapnik, "Support-vector networks," Mach. Learn., vol. 20, pp. 273–297, 1995. [43] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006. [44] D. G. Lowe, "Object recognition from local scale-invariant features," in Proc. of the 7th IEEE Int. Conf. on Computer Vision, 1999, pp. 1150–1157. [45] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Information Processing and Management, vol. 24, no. 5, pp. 513–523, 1988. [46] I. Kononenko, "Estimating attributes: Analysis and extensions of relief," Mach. Learn., vol. 784, pp. 171–182, 1994.