A Hybrid Model of VSM and LDA for Text Clustering

Xiaomeng Liu, Haitao Xiong, Nan Shen
School of Computer and Information Engineering, Beijing Technology and Business University, Beijing, China
e-mail: [email protected], [email protected], [email protected]

Abstract—The amount of web text is exploding, and the analysis of text remains a hot topic. When the traditional VSM model is used for weight statistics and similarity calculation, its excessively high dimensionality and lack of semantic understanding lead to inaccurate clustering. In view of this, this paper presents a hybrid model of VSM and LDA for text clustering. After collecting and filtering the texts, we apply statistical methods to compute similarities under the VSM model and the LDA model respectively. The two similarities are combined by linear weighting to obtain a mixed similarity. The texts are then clustered with the K-means algorithm, the clustering results of the three models are visualized, and the merits of the models are judged on that basis. The experimental results show that this hybrid model is effective.

Keywords-clustering; similarity; LDA; VSM
I. INTRODUCTION
With the rapid development of science and technology in the twenty-first century, information accumulates in large quantities and grows rapidly in people's daily life. For this massive and chaotic text information, how to mine valuable information has been a hot topic in the field of Natural Language Processing. As an unsupervised learning method, clustering[1] can divide massive unknown texts into the most appropriate clusters, making objects in the same cluster as similar as possible and objects in different clusters as different as possible. From the text collection we can then discover how the information is distributed, narrow the search range, and locate the target information directly. Because it is unsupervised, clustering technology finds it increasingly difficult to meet practical requirements. More and more researchers are aware of this problem, so they add known knowledge to the unsupervised learning process. Some researchers have proposed introducing existing semantic knowledge bases, combining word similarity, concept similarity and category similarity to calculate text similarity. The category information in Wikipedia has been used to collect strongly associated items, and this kind of information can also serve as a guide to text structure.
Besides external semantics, the semantics of the text itself is also very important. The vector space model (VSM)[2] is used to represent text: on the basis of VSM, TF-IDF values are used to weight the feature words, and each text is represented by these weights. This method makes it very convenient to extract the characteristic values of the text, but it inevitably produces vectors of very high dimensionality and sparse data. At the same time, we believe there may be latent semantic knowledge in the text; if we can find this knowledge, which comes from the text itself, it can better describe the content of the text. The VSM model cannot solve this problem, so we introduce the concept of a topic model to better describe the content of the text. The PLSA[3] model preceded the LDA model[4], but PLSA has some problems, which LDA was proposed to address. The advantages of LDA are that it has a rich internal structure and can be trained with probabilistic algorithms; moreover, LDA reduces dimensionality and is suitable for large-scale corpora, so it is used in many areas[4-7]. In this paper, the LDA model is used to model topics: the corpus is mapped into topic spaces so that we can find the relationship between topics and words in the text. The topic distribution of each text is obtained and used as feature values, together with the traditional vector space model, to calculate similarity. Finally, the similarity matrix of the document set is obtained and the documents are clustered.

II. MODEL
A. Vector Space Model (VSM)

At the end of the 1960s, Salton et al. proposed the vector space model (VSM)[2] for the first time. Because of its completeness and ease of implementation, it is widely used in related fields. It represents the text as a vector: a document is described as a vector of a series of keywords, so that, in a word, the text is abstracted into a vector that can be used to judge whether a text is one of interest. The vector is made up of many keywords, each with a weight, and different words affect the document according to their weights in it.
A document and its vector are written as:

Document = {keyword_1, keyword_2, ..., keyword_N}
Document Vector = {weight_1, weight_2, ..., weight_N}

The formula is as follows:

$V(d) = \{\, t_1 w_1(d);\ \ldots;\ t_n w_n(d) \,\}$

where $t_i$ ($i = 1 \ldots n$) is a series of different words and $w_i(d)$ is the corresponding weight of $t_i$ in document $d$. When selecting the feature words, we need to reduce the dimensionality and select the most representative feature words.
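As a concrete illustration of this representation, the following minimal Python sketch (ours, not from the paper; the toy documents and vocabulary are invented) builds such a keyword-weight vector, using raw counts as placeholder weights:

```python
from collections import Counter

def document_vector(tokens, vocabulary):
    """Represent a document as a vector of per-keyword weights.

    Here the weight is the raw term count; Section B replaces it
    with a normalized TF-IDF weight.
    """
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

# Toy example (illustrative data, not the paper's corpus).
docs = [["topic", "model", "text", "model"],
        ["vector", "space", "text"]]
vocabulary = sorted({t for d in docs for t in d})
vectors = [document_vector(d, vocabulary) for d in docs]
print(vocabulary)   # ['model', 'space', 'text', 'topic', 'vector']
print(vectors)      # [[2, 0, 1, 1, 0], [0, 1, 1, 0, 1]]
```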
B. TF-IDF

Term Frequency-Inverse Document Frequency (TF-IDF)[8] is a weighting technique commonly used in information processing and data mining. The method is based on statistics and is used to calculate the importance of a word in the corpus. Its advantage is that it can filter out common but insignificant words while retaining the important words that affect the whole text.

Usually the number of occurrences of a word in the document is used as its frequency, but for collections of documents with different lengths this statistic causes errors. In this paper a normalized form is chosen, which reduces the error brought by the differing document lengths:

$TF_{i,j} = \frac{tf_{i,j}}{tf_i^{\max}}$    (1)

where $tf_{i,j}$ is the frequency of word $j$ in text $i$ and $tf_i^{\max}$ is the frequency of the most frequent word in text $i$.
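A minimal sketch of this weighting (ours, not from the paper): since the paper does not spell out its IDF variant, the common log(N/df) form is assumed, and the toy documents are invented:

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF with the max-frequency-normalized TF of Eq. (1).

    TF_{i,j} = tf_{i,j} / tf_i^max; the IDF form is assumed to be
    the common log(N / df_j), since the paper does not state it.
    """
    n_docs = len(docs)
    df = Counter()                       # document frequency per term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        tf_max = max(tf.values())        # frequency of most frequent word
        weights.append({
            term: (count / tf_max) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

docs = [["topic", "model", "text", "model"],
        ["vector", "space", "text"]]
for w in tf_idf(docs):
    print({t: round(v, 3) for t, v in w.items()})
```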
C. Latent Dirichlet Allocation (LDA)

LDA was proposed by Blei in 2002; it is a probabilistic generative model used for latent semantic analysis[9]. Its basic assumption is that a text usually discusses several topics, and the specific terms in the text reflect the specific topics discussed. Thus LDA treats each text as a probability distribution over the topics of the text set, and each topic as a probability distribution over all keywords. The generative process is as follows:

Choose parameter θ ~ p(θ);
For each of the N words w_n:
    Choose a topic z_n ~ p(z|θ);
    Choose a word w_n ~ p(w|z);

α and β are corpus-level parameters. The vector α reflects the relative strength among the implicit topics of the corpus, and the matrix β describes the probability distribution of each implicit topic over the words. θ is a text-level parameter that represents the distribution of each text over the topics. w and z are word-level parameters: z represents the probability distribution over topics and w represents the distribution over words. N is the number of words and M is the number of documents. Figure 1 shows the LDA model.

Figure 1. LDA model.

The joint distribution is:

$p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$    (2)
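To make the generative story above concrete, here is a small simulation sketch (ours; the vocabulary and the α and β values are invented toy parameters, not fitted ones):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus-level parameters (illustrative assumptions):
# alpha controls topic mixing, beta[k] is topic k's word distribution.
vocab = ["gene", "cell", "stock", "market"]
alpha = np.array([0.5, 0.5])
beta = np.array([[0.45, 0.45, 0.05, 0.05],   # a "biology" topic
                 [0.05, 0.05, 0.45, 0.45]])  # a "finance" topic

def generate_document(n_words):
    """Follow LDA's generative process for one document."""
    theta = rng.dirichlet(alpha)             # choose theta ~ p(theta|alpha)
    words = []
    for _ in range(n_words):
        z = rng.choice(len(alpha), p=theta)  # choose topic z_n ~ p(z|theta)
        w = rng.choice(vocab, p=beta[z])     # choose word w_n ~ p(w|z, beta)
        words.append(w)
    return theta, words

theta, words = generate_document(8)
print(np.round(theta, 2), words)
```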
D. Gibbs Sampling

The parameter inference method based on Gibbs sampling[4] is easy to understand and easy to implement, and it is very effective for extracting topics from large-scale text collections; Gibbs sampling has therefore become the most popular method. In this paper, parameter estimation uses Gibbs sampling, a Markov-chain Monte Carlo method. With the text set known, the parameter values are obtained by parameter estimation. The probability of a text is:

$p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta$    (3)

Once the topic of each word is determined, the parameters can be computed from the count statistics, so the parameter estimation problem reduces to the conditional probability of the topics. Once the topic label of each word is obtained, the parameters are calculated as follows:

$\Phi_{k,t} = \frac{n_k^{(t)} + \beta_t}{\sum_{t=1}^{V} \left( n_k^{(t)} + \beta_t \right)}$    (4)

$\theta_{m,k} = \frac{n_m^{(k)} + \alpha_k}{\sum_{k=1}^{K} \left( n_m^{(k)} + \alpha_k \right)}$    (5)

$\Phi_{k,t}$ represents the probability of word $t$ in topic $k$, and $\theta_{m,k}$ represents the probability of topic $k$ in text $m$.
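In practice the estimated θ and Φ can be read off a fitted topic model. The sketch below uses gensim's LdaModel; note that gensim fits LDA by online variational Bayes rather than the Gibbs sampling used in this paper, so it is only a stand-in, and the toy corpus is invented:

```python
from gensim import corpora
from gensim.models import LdaModel

# Toy documents (illustrative only).
texts = [["gene", "cell", "cell"], ["stock", "market", "stock"],
         ["gene", "market", "cell"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# gensim estimates the model by variational inference, not Gibbs
# sampling; its outputs play the same role as Eqs. (4) and (5).
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               alpha="symmetric", eta=0.01, passes=10, random_state=0)

# theta_{m,k}: per-document topic distribution.
for m, bow in enumerate(corpus):
    print(m, lda.get_document_topics(bow, minimum_probability=0.0))

# phi_{k,t}: per-topic word distribution.
print(lda.get_topics().round(3))
```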
E. Text Similarity Calculation Based on LDA and VSM

Following the method of combining the VSM model with the LDA model by Wang et al.[10], the similarity idea of another Wang et al.[11], and Hu's[12] case of combining the VSM model with the LDA model, we extract the hidden-topic vector of the text and combine it with the TF-IDF-weighted word vector. The weighted-sum method fuses the two text vectors, so the similarity between texts is measured more effectively. Each document $d_i$, represented as a word vector with TF-IDF weights, is $d_{i\_V} = (w_1, w_2, \ldots, w_n)$, where $n$ is the dimensionality of the VSM. Define $Sim_V$ as the VSM similarity, the cosine similarity:

$Sim_V(d_i, d_j) = \frac{d_{i\_V} \cdot d_{j\_V}}{|d_{i\_V}|\,|d_{j\_V}|}$    (6)

The vector representation under the LDA model is $d_{i\_L} = (t_1, t_2, \ldots, t_K)$, where $K$ is the dimensionality of the topic space. Define $Sim_L(d_i, d_j)$ as the LDA similarity, also a cosine similarity, and define $Sim(d_i, d_j)$ as the hybrid similarity. The calculation formula is as follows:

$Sim(d_i, d_j) = k \cdot Sim_V(d_i, d_j) + (1 - k) \cdot Sim_L(d_i, d_j), \quad k \in [0, 1]$    (7)

where $k$ is a parameter expressing the weights of VSM and LDA in the weighted sum.
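A minimal sketch of Eqs. (6) and (7) (ours; the toy vectors are invented, and k = 0.85 is the value the experiments later report):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity of Eq. (6)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def hybrid_similarity(di_v, dj_v, di_l, dj_l, k=0.85):
    """Eq. (7): Sim = k * Sim_V + (1 - k) * Sim_L.

    di_v/dj_v are TF-IDF vectors, di_l/dj_l are LDA topic vectors;
    k = 0.85 is the mixing weight reported in the experiments.
    """
    return k * cosine(di_v, dj_v) + (1 - k) * cosine(di_l, dj_l)

# Toy vectors (illustrative):
di_v, dj_v = np.array([0.2, 0.7, 0.0]), np.array([0.1, 0.6, 0.3])
di_l, dj_l = np.array([0.9, 0.1]), np.array([0.8, 0.2])
print(round(hybrid_similarity(di_v, dj_v, di_l, dj_l), 3))
```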
III. PROCESS

In this part we detail the implementation of the clustering process; this introduction gives a basic understanding of the whole experiment.

Step 0: prepare the relevant materials required for the experiment, and install the software required for the configuration.

Step 1: data crawling and data processing. We crawl the corresponding text data from Wikipedia and use Python regular expressions to clean the text (see the sketch below). After processing, we obtain the document collection.
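The paper does not list its exact regular expressions, so the following cleaning sketch is only an assumed example of the kind of processing meant here:

```python
import re

def clean_wiki_text(raw):
    """Minimal cleaning of crawled wiki markup (assumed patterns;
    the paper does not state its actual regular expressions)."""
    text = re.sub(r"<[^>]+>", " ", raw)                    # strip HTML tags
    text = re.sub(r"\[\[|\]\]|\{\{[^}]*\}\}", " ", text)   # strip wiki markup
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)            # keep word characters
    return re.sub(r"\s+", " ", text).strip().lower()

print(clean_wiki_text("<p>The [[Great Wall]] {{cite}} is long!</p>"))
# -> "the great wall is long"
```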
Figure 2. The whole process of the experiment
Step 2: calculate the required data. The main task of this part is to model the data needed to compute the clusters.
Step 2.1: calculate TF-IDF.
Step 2.1.1: vectorize the text, extract keywords, build a matrix, and count the TF values.
Step 2.1.2: from the TF values and the TF-IDF formula, obtain the final TF-IDF values.
Step 2.2: calculate the values of Sim_V and Sim_L.
Step 2.2.1: perform the VSM and LDA modeling.
Step 2.2.2: compute Sim_V from the data obtained in Step 2.1 and the formula above; Sim_L is computed analogously, and the two calculations are independent of each other.
Step 2.3: calculate the mixed similarity Sim.
Step 2.3.1: from the values obtained in Step 2.2, calculate Sim according to the formula listed above (a sketch of this step follows).
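As referenced in Step 2.3.1, here is a sketch (ours) of assembling the full mixed similarity matrix, assuming the TF-IDF vectors and LDA topic vectors have already been computed; the toy inputs are invented:

```python
import numpy as np

def similarity_matrix(vectors):
    """Pairwise cosine similarity matrix for row vectors."""
    x = np.asarray(vectors, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return x @ x.T

def mixed_similarity(tfidf_vectors, topic_vectors, k=0.85):
    """Step 2.3: Sim = k * Sim_V + (1 - k) * Sim_L (Eq. 7)."""
    sim_v = similarity_matrix(tfidf_vectors)
    sim_l = similarity_matrix(topic_vectors)
    return k * sim_v + (1 - k) * sim_l

# Toy inputs standing in for the real TF-IDF and LDA outputs:
tfidf = [[0.2, 0.7, 0.0], [0.1, 0.6, 0.3], [0.9, 0.0, 0.1]]
topics = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
print(mixed_similarity(tfidf, topics).round(3))
```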
Step 3: cluster using the data calculated in Step 2, and visualize the clustering results to obtain the visual map. Finally, the results are compared and the conclusions are drawn.
Step 3.1: run the clustering calculation with the K-means[13] algorithm (see the sketch below).
Step 3.1.1: assign the K value according to the number of clusters, then run the algorithm.
Step 3.1.2: count the number of texts per cluster according to the labels and obtain the clustered results.
Step 3.2: from the clustering results, produce a visual map.
Step 3.3: perform a number of experiments and choose the best results.
Step 3.4: describe the results and draw conclusions.
Figure 2 is an intuitive display of the process.
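For Step 3.1, the paper does not say exactly how the similarity matrix is fed to K-means; one simple choice, shown below as an assumption of ours, is to treat each row of the mixed similarity matrix as that text's feature vector:

```python
import numpy as np
from sklearn.cluster import KMeans

# sim is the n x n mixed similarity matrix from Step 2.3 (toy values
# here). Using each row as a feature vector is our assumption, not a
# detail stated in the paper.
sim = np.array([[1.00, 0.85, 0.20],
                [0.85, 1.00, 0.25],
                [0.20, 0.25, 1.00]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(sim)
print(labels)               # e.g. [0 0 1]
print(np.bincount(labels))  # texts per cluster (Step 3.1.2)
```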
IV. EXPERIMENTAL RESULTS
The experimental data come from Wikipedia in four categories: characters, attractions, animals and countries, with 200 texts per category. K-means is used for clustering, and the evaluation criterion is the F value, a standard from information retrieval that combines the precision and recall indices to measure the quality of text clustering. The precision P(i,j) and recall R(i,j) are defined as:

$P(i,j) = \frac{n_{ij}}{n_j}, \qquad R(i,j) = \frac{n_{ij}}{n_i}$    (8)

where $n$ is the number of texts, $n_i$ is the number of texts in class $i$, $n_j$ is the number of texts in cluster $j$, and $n_{ij}$ is the number of texts of class $i$ in cluster $j$. The F value is defined as:

$F(i,j) = \frac{2\,P(i,j)\,R(i,j)}{P(i,j) + R(i,j)}$    (9)

The F value of the global clustering is defined as:

$F = \sum_i \frac{n_i}{n} \max_j F(i,j)$    (10)
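A small sketch of Eqs. (8)-(10) for checking a clustering against known classes (ours; the toy labels are invented):

```python
import numpy as np

def clustering_f_measure(true_classes, cluster_labels):
    """Global F value of Eqs. (8)-(10)."""
    true_classes = np.asarray(true_classes)
    cluster_labels = np.asarray(cluster_labels)
    n = len(true_classes)
    f_total = 0.0
    for i in np.unique(true_classes):
        n_i = np.sum(true_classes == i)
        best_f = 0.0
        for j in np.unique(cluster_labels):
            n_j = np.sum(cluster_labels == j)
            n_ij = np.sum((true_classes == i) & (cluster_labels == j))
            if n_ij == 0:
                continue
            p = n_ij / n_j                             # Eq. (8): precision
            r = n_ij / n_i                             # Eq. (8): recall
            best_f = max(best_f, 2 * p * r / (p + r))  # Eq. (9)
        f_total += (n_i / n) * best_f                  # Eq. (10)
    return f_total

# Toy example: 4 texts of class 0 and 4 of class 1.
print(round(clustering_f_measure([0, 0, 0, 0, 1, 1, 1, 1],
                                 [0, 0, 0, 1, 1, 1, 1, 1]), 3))
```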
We process the texts to obtain the space model and use the VSM model to calculate the similarity Sim_V(d_i, d_j); we then use the LDA model and calculate the similarity Sim_L(d_i, d_j). Finally, the similarity matrix is obtained from these calculations. In the LDA modeling process, Gibbs sampling is used for parameter estimation. In this paper we set the number of topics K = 50, the hyperparameters α = 50/K and β = 0.01, and select the mixing weight k = 0.85.

In this paper, we have done two experiments, comparing against the LDA model and against the VSM model respectively, and then describing the results. VSM is evaluated directly using Sim_V(d_i, d_j), and LDA likewise using Sim_L(d_i, d_j); VSM+LDA uses the method proposed in this paper. From the results, the LDA model alone performs relatively poorly, the VSM model achieves good accuracy, and the mixed model is the most effective. The following figures show the F value for each category and the model accuracy: Figure 3 shows the F values for each category, and Figure 4 shows the model accuracy.

Figure 3. F values for each category.

Figure 4. Model accuracy.

V. CONCLUSION

As can be seen from the above results, the VSM model is better than the LDA model when each is used alone. After linear weighting of the two, the accuracy of the hybrid model reaches a higher value. First, the proposed method is valid: the accuracy of text clustering improves obviously once the latent semantics of the text are considered. At the same time, it can be seen that the accuracy of the LDA model is unsatisfactory when it is used alone; given the LDA model's role in the mixed model, our future research direction lies in the LDA model.

ACKNOWLEDGMENT

This research was supported by the Beijing Natural Science Foundation under Grant No. 4172014.

REFERENCES

[1] Salton G. Automatic Text Processing. Boston: Addison Wesley Longman Publishing Company, 1998.
[2] Salton G, Wong A, Yang C S. A vector space model for automatic indexing. Communications of the ACM, 1975, 18(11): 613-620.
[3] Hofmann T. Unsupervised Learning by Probabilistic Latent Semantic Analysis. JASIS, 1990, 41(6): 391-407.
[4] Blei D M, Ng A Y, Jordan M I. Latent Dirichlet Allocation. Journal of Machine Learning Research, 2003, 3: 993-1022.
[5] Bhattacharya I, Sil J. Sparse representation based query classification using LDA topic modeling. Advances in Intelligent Systems and Computing, 2016, 469: 621-629.
[6] Liu Q, Chen E, Xiong H, Ge Y, Li Z, Wu X. A Cocktail Approach for Travel Package Recommendation. IEEE Transactions on Knowledge and Data Engineering, 2014, 26(2): 278-293.
[7] Liu Y, Wang S, Cao Q. Research on Commodities Classification Based on LDA. IMM 2015, Lancaster: DEStech Publications, 2015: 189-191.
[8] Wang C, Blei D. Collaborative Topic Modeling for Recommending Scientific Articles. Proc. 17th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, 2011: 448-456.
[9] Zheng L, Xu D. Research on Text Categorization Based on Improved TFIDF Algorithm. Computer and Modernization, 2014, 229(9): 6-15.
[10] Wang P, Gao C, Chen X. Research on LDA Model Based on Text Clustering. Information Science, 2015, 33(1): 63-68.
[11] Wang Z, He M, Du Y. Text Similarity Computing Based on Topic Model LDA. Computer Science, 2013, 40(12): 229-232.
[12] Hu X. Micro-blog topic drift detection based on VSM and LDA models. Journal of Lanzhou University of Technology, 2015, 41(5): 104-109.
[13] Chen L. Text Clustering Study with K-Means Algorithm of Different Distance Measures. Software, 2015, 36(1): 56-61.