Clustering with Noising Method

Yongguo Liu 1,2, Yan Liu 3, and Kefei Chen 1

1 Department of Computer Science and Engineering, Shanghai Jiaotong University, Shanghai 200030, P.R. China
{liu-yg, chen-kf}@cs.sjtu.edu.cn
2 State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, P.R. China
3 School of Applied Mathematics, University of Electronic Science and Technology of China, Chengdu 610054, P.R. China
[email protected]

Abstract. The minimum sum of squares clustering problem is a nonconvex program which possesses many locally optimal values, so that its solutions often fall into these traps. In this article, a recent metaheuristic technique, the noising method, is introduced to explore the proper clustering of data sets under the criterion of minimum sum of squares clustering. In addition, the K-means algorithm is integrated into the noising method as a local improvement operation to improve the performance of the clustering algorithm. Extensive computer simulations show that the proposed approach is feasible and effective.

1 Introduction

The clustering problem is a fundamental problem that frequently arises in a great variety of fields such as pattern recognition, machine learning, and data mining. In this article, we consider this problem stated as follows: Given N objects in R^m, allocate each object to one of K clusters such that the sum of squared Euclidean distances between each object and the center of the cluster it is allocated to is minimized. This clustering problem can be described mathematically as follows:

    \min_{W,C} F(W, C) = \sum_{i=1}^{N} \sum_{j=1}^{K} w_{ij} \, \| x_i - c_j \|^2,    (1)

where \sum_{j=1}^{K} w_{ij} = 1 for i = 1, \ldots, N. If object x_i is allocated to cluster C_j, then w_{ij} is equal to 1; otherwise w_{ij} is equal to 0. In Equation 1, N denotes the number of objects, K denotes the number of clusters, X = {x_1, \ldots, x_N} denotes the set of N objects, C = {C_1, \ldots, C_K} denotes the set of K clusters, and W denotes the N × K 0-1 assignment matrix. Cluster center c_j is calculated as follows:


    c_j = \frac{1}{n_j} \sum_{x_i \in C_j} x_i,    (2)

where n_j denotes the number of objects belonging to cluster C_j.

It is known that this problem is a nonconvex program which possesses many locally optimal values, so that local search methods easily fall into these traps. Many clustering approaches have been developed [1]. Among them, the K-means algorithm is a very important one, but it has been proved that it may fail to converge to a local minimum under certain conditions [2]. In [3], genetic algorithms are applied to the clustering problem; we call this method GAC in this paper. GAC encodes the clustering partition as a chromosome, and after a specified number of generations the best individual obtained is viewed as the final solution. In [4], tabu search is used to deal with this problem; we call it TSC in this paper. The clustering solution is encoded as a string similar to that in GAC, and the best solution obtained after a specified number of iterations is viewed as the clustering result. In [5], a simulated annealing based clustering method, called SAC in this paper, is proposed. By redistributing objects among clusters probabilistically, this approach can obtain the globally optimal solution under certain conditions.

The noising method, a recent metaheuristic first reported in [6], guides local heuristic search procedures to explore the solution space beyond local optimality. The noising method [7] has been successfully applied to the traveling salesman problem, scheduling problems, and multicriteria decision making, among others. In this article, our aim is to introduce the noising method to explore the proper clustering under the criterion of minimum sum of squares clustering. For other metaheuristics such as tabu search, researchers often combine the metaheuristic with local descent approaches in order to use it efficiently in various applications; in [8], the Nelder–Mead simplex algorithm, a classical local descent algorithm, and tabu search are hybridized to solve the global optimization problem of multiminima functions. This idea is adopted in this article: since the K-means algorithm is simple and computationally attractive, we view it as a local improvement operation and combine it with the noising method. We therefore give two ways to deal with the clustering problem, one without the K-means operation and one with it, called NMC and KNMC, respectively. The choice of the algorithm parameters is extensively discussed, and performance comparisons among six techniques are conducted on experimental data sets. As a result, while spending far fewer computational resources than GAC, TSC, and SAC, the noising method based clustering algorithm obtains feasible and effective clustering results.
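To make the objective concrete, the following sketch evaluates Eqs. (1)-(2) for a given assignment. Python is used here purely for illustration (the paper's own experiments were run in Matlab), and the function and variable names are our own, not taken from the paper.

```python
import numpy as np

def mssc_objective(X, labels, K):
    """Minimum-sum-of-squares clustering objective (Eqs. 1-2): the sum of
    squared Euclidean distances between each object and the center of the
    cluster it is assigned to. labels is a 1-D integer array of cluster ids."""
    total = 0.0
    for j in range(K):
        members = X[labels == j]                   # objects assigned to cluster j
        if len(members) > 0:
            c_j = members.mean(axis=0)             # cluster center, Eq. (2)
            total += ((members - c_j) ** 2).sum()  # contribution to Eq. (1)
    return total

# Example: six two-dimensional objects assigned to K = 2 clusters.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels = np.array([0, 0, 0, 1, 1, 1])
print(mssc_objective(X, labels, K=2))
```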

2 The Proposed Method

Instead of taking the genuine data into account directly, the noising method considers the optimal result as the outcome of a series of fluctuating data converging towards the genuine ones. Like some other metaheuristics, the noising method is based on a descent. The main difference from a plain descent is that, when the objective function value of a given solution is considered, a perturbation called a noise is added to this value. This noise is randomly chosen in an interval whose range decreases during the iteration process, and the final solution is the best solution computed during that process. In this article, noises are added to the variation of the objective function value so as to keep the clustering algorithm from being trapped by locally optimal values. This means that the original value of the noise rate r_n should be chosen such that, at the beginning of the iteration process, a bad neighboring solution may be accepted, as is also the case in simulated annealing for instance. Since the added noises are chosen in an interval centered on 0, a good neighboring solution may also be rejected, which differs from simulated annealing. A detailed discussion of the noising method can be found in [7]. Figure 1 gives the general description of NMC and KNMC; it can be seen that both follow the architecture of the noising method.

Fig. 1. General description of NMC (L) and KNMC (R)
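As a small illustration of the acceptance rule described above, the following sketch adds a noise drawn uniformly from [-r_n, +r_n] to the variation of the objective value and accepts the move when the noised variation indicates an improvement. This is a minimal sketch in Python with names of our own choosing; it is a reading of the mechanism described in the text, not code taken from Figure 1.

```python
import random

def accept_move(delta_f, r_n):
    """Noising-method acceptance test: delta_f is the variation of the
    objective value (candidate minus current). A noise drawn uniformly
    from [-r_n, +r_n] is added before testing for improvement, so while
    r_n is large a bad move may be accepted and a good move rejected."""
    noise = random.uniform(-r_n, r_n)
    return delta_f + noise < 0
```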

In this article, we define the solution representation in the same way as GAC, TSC, and SAC, which makes it convenient to compute the objective function value and to compare our methods with those methods. Two important concepts are used here: the probability threshold and the K-means operation.

In NMC and KNMC, the probability threshold is used to provide a proper neighboring solution for the noising method so as to avoid getting stuck in local optima and to find the optimal result. In [4], it is used to create the neighborhood of tabu search. It is described as follows: Given the current solution X_c = (x_1, ..., x_i, ..., x_N) with x_i = j, j = 1, ..., K, and the probability threshold P, for each i = 1, ..., N draw a random number p_i ~ u(0, 1). If p_i < P, then x_i' = x_i; otherwise x_i' = k, k = 1, ..., K, k ≠ j. Here X' ≠ X_c, where X' denotes the neighboring solution. In this paper, the probability threshold P is chosen to be 0.95, the value recommended by the computer simulations in [4].

Based on the structure of the noising method, KNMC combines the global optimization property of the noising method with the local search capability of the K-means operation. Here, the K-means operation is used to fine-tune the distribution of objects among clusters and to improve the similarity between objects and their centroids. It is described as follows: Given a solution X = (x_1, ..., x_i, ..., x_N), reassign object x_i to cluster C_k, k = 1, ..., K, iff

    \| x_i - c_k \| \le \| x_i - c_l \|, \quad l = 1, \ldots, K, \ k \neq l.    (3)

Then the new cluster centers c_1', ..., c_K' are calculated as follows:

    c_k' = \frac{1}{n_k} \sum_{x_i \in C_k} x_i,    (4)

where n_k denotes the number of objects belonging to cluster C_k. After this operation, the modified solution is viewed as the current solution X_c.
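The two operations just described can be sketched as follows. This is an illustrative Python rendering under our own naming; the probability threshold P = 0.95 follows the value quoted above, while the guard that forces at least one change (so that X' ≠ X_c) and the handling of empty clusters are our own simplifications, not details specified in the paper.

```python
import numpy as np

def neighbor(labels, K, P=0.95, rng=None):
    """Probability-threshold neighbor: each object keeps its cluster with
    probability P and is otherwise moved to a different, randomly chosen
    cluster; at least one object is always moved so X' differs from X_c.
    labels is a 1-D integer NumPy array of length N."""
    rng = rng or np.random.default_rng()
    new_labels = labels.copy()
    for i in range(len(labels)):
        if rng.random() >= P:
            new_labels[i] = rng.choice([k for k in range(K) if k != labels[i]])
    if np.array_equal(new_labels, labels):           # force X' != X_c
        i = rng.integers(len(labels))
        new_labels[i] = rng.choice([k for k in range(K) if k != labels[i]])
    return new_labels

def kmeans_operation(X, labels, K):
    """One K-means pass (Eqs. 3-4): recompute the centers of the non-empty
    clusters, then reassign every object to its nearest center."""
    ids, centers = [], []
    for k in range(K):
        members = X[labels == k]
        if len(members) > 0:                          # empty clusters are skipped
            ids.append(k)
            centers.append(members.mean(axis=0))      # Eq. (4)
    centers = np.asarray(centers)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.asarray(ids)[d2.argmin(axis=1)]         # Eq. (3) reassignment
```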

To explore the performance of NMC and KNMC, we discuss here the choice of the different parameters, as shown in Figure 2. Each experiment includes 20 independent trials.

Fig. 2. Comparison of different parameters: (a) noise range, (b) terminal noise rate, (c) number of iterations at a fixed noise rate, (d) NMC versus KNMC

The noise range, the first parameter we consider, determines the range in which the noise rate r_n varies. Based on the noise range and the terminal noise rate r_min, the original noise rate r_max can be calculated. In Figure 2(a), the average objective function values for different noise ranges are compared. It is found that an overlarge or oversmall value of this parameter reduces the performance of the proposed approach. When the size of the noise range is equal to 10, the best performance is attained, so we choose this parameter to be 10. In Figure 2(b), it is found that the larger the value of r_min, the worse the results. The reason is that an added noise is a random real number drawn with a uniform distribution in the interval [-r_n, +r_n], and the noise rate r_n is bounded by the two extreme values r_max and r_min; by making r_n decrease down to 0, we get back the genuine objective function at the end of the noising method. So, in this paper, we choose r_min to be 0, and then r_max is equal to 10. To control the decrease speed of the added noises, we examine the number of iterations at a fixed noise rate, denoted N_f, as shown in Figure 2(c). We find that a slow decrease speed obtains slightly better results than a quick one, and in this paper we take N_f to be 20. In Figure 2(d), NMC and KNMC are compared. It is seen that KNMC, equipped with the K-means operation, is clearly superior to NMC.
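Under the parameter choices just discussed (noise range 10, r_min = 0, hence r_max = 10, and N_f = 20), the decreasing sequence of noise rates can be generated as in the following sketch. The total of 1000 iterations is the value used in the experiments of Section 3; the linear decrease is our own simple reading of how r_n is driven down to r_min, since the exact schedule is not spelled out here.

```python
def noise_rate_schedule(r_max=10.0, r_min=0.0, n_fixed=20, n_total=1000):
    """Piecewise-constant noise rates: r_n is held for n_fixed iterations,
    then lowered, decreasing linearly from r_max to r_min so that the
    genuine objective is recovered at the end of the run."""
    n_levels = n_total // n_fixed                 # number of distinct noise rates
    rates = []
    for level in range(n_levels):
        r_n = r_max - (r_max - r_min) * level / max(n_levels - 1, 1)
        rates.extend([r_n] * n_fixed)
    return rates

# With the defaults above, the first 20 iterations use r_n = 10.0 and the
# last 20 use r_n = 0.0, i.e., a plain descent on the genuine objective.
```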

3 Experiment Analysis

Before conducting simulation experiments, we analyze the time complexities of the algorithms employed in this paper. For TSC, the time complexity is O(G N_n m N), where N_n is the size of the neighborhood and G is the number of iterations. For GAC, the time complexity is O(G P m N), where P denotes the population size and G denotes the number of generations. The time complexity of SAC is O(G N_s K m N), where N_s denotes the number of iterations at a fixed temperature and G denotes the number of iterations during the process in which the annealing temperature drops. In our methods, creating the neighboring solution takes O(N) time and the K-means operation takes O(K m N) time. Hence, the time complexity of NMC is O(N_t m N) and the time complexity of KNMC is O(N_t K m N), where N_t is the total number of iterations. It is seen that the computational cost of KNMC is higher than that of NMC. However, equipped with the K-means operation, the performance of the noising method based clustering algorithm is greatly improved. Furthermore, compared with TSC, GAC, and SAC, KNMC spends the least computational resources: its computational cost is K/60 of that of GAC, K/20 of that of TSC, and about 1/28 of that of SAC, respectively. In most cases, K is a small constant, so the computational cost of KNMC is still very low.

Performance comparisons between our methods and other techniques are conducted in Matlab on an Intel Pentium III processor running at 800 MHz with 128 MB of memory. Five data sets representing different distributions of objects are chosen to test the adaptability of the proposed method: two artificial data sets (Data-52, Data-62) and three real-life data sets (Iris, Crude Oil, and Vowel). Data-52 is a two-dimensional data set of 250 overlapping objects where the number of clusters is five. Data-62 is a two-dimensional data set of 300 nonoverlapping objects where the number of clusters is six.


Iris represents different categories of irises characterized by four feature values: sepal length, sepal width, petal length, and petal width in centimeters. It has three classes with 50 samples per class [9]. Crude Oil has 56 objects, five features, and three classes [10]. Vowel consists of 871 Indian Telugu vowel sounds with three features and six classes [11]. In the computer simulations, the experimental results of NMC and KNMC are obtained after 1000 iterations. Each experiment for all algorithms in this paper includes 20 independent trials. The detailed settings of the parameters in GAC, TSC, and SAC can be found in their corresponding references. The average and minimum values of the clustering results obtained by the six methods for the five data sets are shown in Table 1.

Table 1. Results of six clustering algorithms for five data sets

Method     Data-52 Avg(min)     Data-62 Avg(min)        Iris Avg(min)      Crude Oil Avg(min)    Vowel Avg(min)
K-means    488.95 (488.09)      1469.85 (543.17)        91.76 (78.94)      1656.98 (1647.19)     32782041.43 (30724312.47)
GAC        1464.19 (1269.87)    9959.12 (8714.13)       96.44 (83.40)      1649.99 (1647.19)     158113587.46 (143218656.39)
TSC        2590.21 (2517.35)    19155.37 (18406.54)     282.64 (256.70)    2122.05 (1952.38)     248267782.61 (245672828.41)
SAC        488.02 (488.02)      821.56 (543.17)         78.94 (78.94)      1647.24 (1647.19)     32243759.40 (30724196.02)
NMC        2654.52 (2557.31)    19303.58 (18005.98)     302.99 (242.15)    1995.44 (1787.43)     250796549.46 (245737316.31)
KNMC       488.69 (488.02)      1230.02 (543.17)        85.37 (78.94)      1647.27 (1647.19)     31554139.24 (30718120.60)

For Data-52, the optimal value is 488.02, which is found only by SAC and KNMC. Since the objects of this data set overlap, the other four approaches cannot reach the best result in any run. For Data-62, the optimal result is 543.17; K-means, SAC, and KNMC can attain this value, but in most cases K-means gets stuck at suboptimal values. For Iris, the best value is 78.94, which is attained by K-means, SAC, and KNMC. However, K-means achieves this value in only 4 of the 20 trials, whereas SAC and KNMC attain the best value more stably. For Crude Oil, the best value is 1647.19; in this experiment, the performance of KNMC is close to that of SAC. For Vowel, KNMC is the best among all methods. It is seen that, for most data sets, KNMC obtains the best values and is superior to all the other methods except SAC. Noticeably, GAC, TSC, and NMC fail to attain the best values for most data sets even once, and their best values are far worse than the optimal ones. However, we find that these three algorithms can still obtain improved results if more iterations are executed. Meanwhile, the performance of NMC is close to that of TSC while its computational cost is only 1/20 of that of TSC, which shows that NMC is promising to a certain extent.

According to Table 1, we find that combining the K-means algorithm with the noising method to deal with the clustering problem exploits their respective advantages: the global optimization ability of the noising method and the local search capability of the K-means algorithm. By combining these two methods, we obtain better results than those obtained by the K-means algorithm and NMC alone. Meanwhile, in most cases SAC is better than KNMC, but one should remember that the cost of KNMC is only about 1/28 of that of SAC, and the performance of KNMC is very close to, or even superior to, that of SAC.


Here, we could greatly increase the computational resources, such as the number of iterations, in order to attain results similar or superior to those of SAC, but we do not think that simply increasing this parameter is a good way to reach the optimal result. For example, in [3], the specified number of iterations at which GAC obtains the best result for Crude Oil is up to 10000, which corresponds to 20 times the computational resources of KNMC. Even so, there are still 22% of trials in which it cannot obtain the best result, and the performance of GAC is still inferior to that of KNMC.

Fig. 3. Comparison of NMC and KNMC for (a) Crude Oil and (b) Vowel

Since Iris has been used to choose the proper parameter settings for the clustering algorithm based on the noising method, we use Crude Oil and Vowel here to illustrate the iteration process and to better understand the performance of KNMC and NMC. One should remember that the definition of an iteration differs from one algorithm to another. In NMC and KNMC, one iteration corresponds to one neighbor, while it corresponds to 20 neighbors for TSC and to 6 neighbors for GAC. For SAC, one iteration also corresponds to one neighbor, but one temperature step is over only after 100 such iterations have been performed at the specified annealing temperature. The clustering results of KNMC and NMC for Crude Oil and Vowel are shown in Figures 3(a) and (b), respectively. It is seen that applying the noising method to the clustering problem under consideration is feasible and effective. Moreover, in order to improve the performance of the clustering algorithm based on the noising method and to accelerate its convergence, we introduce the K-means operation to modulate the distribution of objects among clusters; to avoid getting stuck in local optima, we adopt the probability threshold to provide diverse neighboring solutions and explore the globally optimal result. It is seen that KNMC, equipped with the K-means operation, attains better results for Crude Oil and Vowel much sooner than NMC. Compared with GAC and TSC, KNMC spends much less computational cost and achieves much better clustering results. Compared with SAC, the performance of KNMC is promising, and the more important fact is that the computational cost of KNMC is much less than that of SAC. In this article, the main goal is to attain the optimal result as often as possible within a finite number of iterations by properly designing the clustering algorithm; this also remains our aim in future research work.


4 Conclusions

In this paper, we introduce the noising method to solve the clustering problem under the criterion of minimum sum of squares clustering and develop two clustering approaches, NMC and KNMC. The choice of the algorithm parameters is extensively discussed, and performance comparisons between our methods and other techniques are conducted on experimental data sets. As a result, with much less computational cost than GAC, TSC, and SAC, KNMC obtains much better clustering results sooner than GAC and TSC and attains results close to those of SAC. In future work, the estimation of the number of clusters should be incorporated into NMC and KNMC, and different local search procedures should be tested within the noising method framework.

Acknowledgements

This research was partially supported by the National Natural Science Foundation of China (#90104005, #60273049) and the State Key Laboratory for Novel Software Technology at Nanjing University.

References

1. Jain, A.K., Dubes, R.: Algorithms for Clustering Data. Prentice-Hall, New Jersey (1988)
2. Selim, S.Z., Ismail, M.A.: K-means-type algorithms: a generalized convergence theorem and characterization of local optimality. IEEE Transactions on Pattern Analysis and Machine Intelligence 6 (1984) 81-87
3. Murthy, C.A., Chowdhury, N.: In search of optimal clusters using genetic algorithms. Pattern Recognition Letters 17 (1996) 825-832
4. Al-Sultan, K.S.: A tabu search approach to the clustering problem. Pattern Recognition 28 (1995) 1443-1451
5. Bandyopadhyay, S., Maulik, U., Pakhira, M.K.: Clustering using simulated annealing with probabilistic redistribution. International Journal of Pattern Recognition and Artificial Intelligence 15 (2001) 269-285
6. Charon, I., Hudry, O.: The noising method: a new method for combinatorial optimization. Operations Research Letters 14 (1993) 133-137
7. Charon, I., Hudry, O.: The noising method: a generalization of some metaheuristics. European Journal of Operational Research 135 (2001) 86-101
8. Chelouah, R., Siarry, P.: A hybrid method combining continuous tabu search and Nelder-Mead simplex algorithms for the global optimization of multiminima functions. European Journal of Operational Research 161 (2005) 636-654
9. Fisher, R.A.: The use of multiple measurements in taxonomic problems. Annals of Eugenics 7 (1936) 179-188
10. Johnson, R.A., Wichern, D.W.: Applied Multivariate Statistical Analysis. Prentice-Hall, New Jersey (1982)
11. Pal, S.K., Majumder, D.D.: Fuzzy sets and decision making approaches in vowel and speaker recognition. IEEE Transactions on Systems, Man, and Cybernetics SMC-7 (1977) 625-629