Computing Similarity Between XML Documents for XML Mining

Jung-Won Lee and Seung-Soo Park

Dept. of Computer Science and Engineering, Ewha Womans University, 11-1 Daehyun-dong, Sudaemun-ku, Seoul, Korea
{jungwony,sspark}@ewha.ac.kr

Abstract. The self-describing feature of XML offers both challenges and opportunities in document management and data mining. We propose a new metric for computing similarity between XML documents for XML mining.

1 Introduction

We expect that many Web applications that process XML documents, such as grouping similar XML documents or searching for XML documents that match a sample XML document, will require techniques for clustering and classifying XML documents. If some of the rich semantics of XML can be taken into account, we should have a more powerful basis for XML mining. In this paper, we propose a new metric for computing similarity between XML documents for XML mining and present preliminary experimental results.

2 Pre-processing

XML documents must be preprocessed before the similarity between them can be determined quantitatively. Preprocessing consists of the following steps.¹

• Structure Discovery: The goal of discovering XML structures is to extract unique and minimized structures of XML documents. We formalize XML structures using finite automata and then apply a state-minimization algorithm to minimize them.
• Identification of Similar Elements: Many synonyms, compound words, or abbreviations may be used to define XML elements in multiple documents. Using WordNet, we generate extended-element vectors with synonym information for the elements in an XML document.
• Common Feature Extraction: From the paths of the minimized XML structures, together with the elements extended with synonyms, we extract the paths common to XML documents using sequential pattern mining algorithms.

¹ For further details of preprocessing for XML, see [1].
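The path-oriented part of preprocessing can be illustrated with a minimal sketch. It is not the authors' pipeline: it skips automaton minimization and WordNet synonym expansion, and it substitutes a plain set intersection of root-to-leaf tag paths for sequential pattern mining. The function names `element_paths` and `common_paths` are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Set, Tuple

Path = Tuple[str, ...]


def element_paths(xml_text: str) -> Set[Path]:
    """Return the unique root-to-leaf tag paths of an XML document.

    Assumption: collapsing repeated sibling structures into a set of
    distinct paths stands in for the structure minimization step.
    """
    root = ET.fromstring(xml_text)
    paths: Set[Path] = set()

    def walk(node: ET.Element, prefix: Path) -> None:
        current = prefix + (node.tag,)
        children = list(node)
        if not children:
            # A leaf element closes one path expression.
            paths.add(current)
        for child in children:
            walk(child, current)

    walk(root, ())
    return paths


def common_paths(doc_a: str, doc_b: str) -> Set[Path]:
    """Paths that occur in both documents (exact tag match only)."""
    return element_paths(doc_a) & element_paths(doc_b)
```

Because paths are stored in a set, repeated substructures such as several sibling elements with the same tag contribute a single path, which loosely mirrors the aim of extracting unique structures.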

E. Motta et al. (Eds.): EKAW 2004, LNAI 3257, pp. 492–493, 2004. © Springer-Verlag Berlin Heidelberg 2004


3 Similarity Metric

To quantify the similarity between XML structures, we define a new similarity metric based on common paths. The key concept of the metric is to assign a different weight to each element on a path: the more similar the paths that two documents share, the more weight is assigned.

$$\mathrm{Similarity} = \frac{1}{T} \sum_{i=1}^{T} \left( \frac{1}{2 \times L(PE_i)} \sum_{k=1}^{L(PE_i)} V(E_k) \right)$$

Here, T is the total number of paths in the base document, PE is a path expression, L(PE) is the total number of elements on PE, and E_k is the k-th element of PE. V(E_k) takes one of the values 0, 1, or 2 according to the degree of match between the elements of the two documents.
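As a rough illustration, the metric can be read as the average, over the T paths of the base document, of each path's summed match values divided by 2 × L(PE). The sketch below follows that reading. The text does not state how V(E_k) is assigned, so the interpretation in the comments (2 for an exact match, 1 for a partial match such as a synonym, 0 for no match) is an assumption, and the function names are hypothetical.

```python
from typing import Sequence


def path_score(match_values: Sequence[int]) -> float:
    """Score of one path expression: sum of V(E_k) over 2 * L(PE).

    match_values holds V(E_k) for each element on the path.
    Assumed meaning: 2 = exact match, 1 = partial match, 0 = no match.
    """
    if not match_values:
        raise ValueError("a path expression needs at least one element")
    if any(v not in (0, 1, 2) for v in match_values):
        raise ValueError("V(E_k) must be 0, 1, or 2")
    # Dividing by 2 * L(PE) keeps the score in the range [0, 1].
    return sum(match_values) / (2 * len(match_values))


def similarity(paths: Sequence[Sequence[int]]) -> float:
    """Average path score over the T paths of the base document."""
    if not paths:
        raise ValueError("the base document needs at least one path")
    return sum(path_score(p) for p in paths) / len(paths)
```

For example, a base document with three paths whose match values are `[2, 2, 2]`, `[2, 1, 0]`, and `[0, 0]` has path scores 1.0, 0.5, and 0.0, so `similarity` returns 0.5.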

4 Experimental Results

We collected 763 HTML pages from the Yahoo! site. The pages fall into two categories: charts and messages. We randomly selected 100 HTML pages from the collection and translated them into XML documents with meaningful elements. Fig. 1 shows the preliminary results of the similarity computation among all documents.

[Figure 1: plot of Similarity against Doc. No., with the categories Charts and Messages labelled.]

Fig. 1. Only documents whose similarity exceeds the 70% threshold lie above the line. We confirmed that these documents were grouped into their categories correctly.
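The thresholding described in the caption amounts to a simple filter. The sketch below assumes similarity scores are expressed as fractions in [0, 1]; the function name and the document identifiers used with it are hypothetical, and any scores passed in are illustrative inputs, not results from the experiment.

```python
from typing import Dict, List


def above_threshold(scores: Dict[str, float], threshold: float = 0.70) -> List[str]:
    """Return the documents whose similarity strictly exceeds the threshold."""
    return sorted(doc for doc, score in scores.items() if score > threshold)
```

With made-up scores such as `{"doc01": 0.92, "doc02": 0.70, "doc03": 0.41, "doc04": 0.75}`, only `doc01` and `doc04` are kept, since a score equal to the threshold is not counted as over the line.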

5 Discussion

Although the experimental dataset is small, our similarity metric grouped the XML documents into their categories with high accuracy. We will run further experiments with larger and more varied datasets and then revise our metric for computing similarity.

References

1. J. W. Lee, K. Lee, and W. Kim: Preparations for Semantics-based XML Mining. In: Proc. of the IEEE International Conference on Data Mining (ICDM), pp. 345–352, Nov./Dec. 2001.