2019 IEEE 4th International Conference on Cloud Computing and Big Data Analytics
Research on Label Propagation Algorithms Based on Clustering Coefficient

Mengjie Wang, Yusheng Xu
School of Information Science and Engineering, Lanzhou University, Lanzhou, China
e-mail: {wangmj2017, xuyusheng}@lzu.edu.cn

Abstract: The label propagation algorithm (LPA) is one of the classical community detection algorithms. It is efficient and fast, and it needs no prior information. However, it is unstable, which makes its detection results random. To improve the stability of label propagation, this article proposes an algorithm with an adjustable parameter based on the clustering coefficient and label propagation. The algorithm has two steps. The first step ranks the nodes by degree and clustering coefficient and initializes the labels according to the ranking. During initialization, only nodes whose clustering coefficient lies in a certain range are selected, which filters out noisy nodes. The second step builds on the first: to avoid randomness, the neighbor nodes are sorted by clustering coefficient and degree, and the best neighbor is selected to update the label. Experiments on LFR artificial network data sets and real network data sets show that the algorithm reduces the randomness of label propagation and improves the stability and accuracy of the detection results, and that its adjustable parameter allows a good community division for various types of networks.

Keywords: clustering coefficient; label propagation; community detection

I. INTRODUCTION

The label propagation algorithm (LPA) is a graph-based machine learning algorithm that is widely used in text retrieval and classification, multimedia information retrieval, community detection and other fields [1]. In 2007, Raghavan et al. [2] first applied label propagation to community detection in complex networks. The main idea is to explore community structure using only the structure of the network itself. At first, each node is given a unique label. In each following iteration, the label of a node is updated to the label that occurs most often among its neighbors, and this update is executed repeatedly. A densely connected group of nodes thus moves from independent labels to a common label. Nodes with the same label are considered a community. The steps of the algorithm are label initialization and label update. LPA is concise, has near-linear time complexity, and leaves considerable room for improvement. Many researchers have studied it and have put forward various optimizations so that it can be better applied to community detection. Liu Shichao et al. [3] proposed LPPB, an algorithm for overlapping community discovery based on label propagation probability, which applies an improved label propagation algorithm to overlapping community detection. Li Lei and Ni Lin [4] proposed a label propagation community discovery algorithm based on modularity optimization, which improved the stability of label propagation. Zhang Meiqin et al. [5] proposed a label propagation algorithm based on weighted clustering ensemble, which significantly improved the accuracy of community detection. In addition, Zheng Shaoqiang et al. [6] proposed LPA_D_CC, an optimization based on clustering coefficients: the nodes are sorted by degree and clustering coefficient, an initial division is made according to this influence ranking, and label propagation is then carried out. LPA_D_CC is more stable and more accurate than LPA.

The clustering coefficient of a node reflects how tightly its neighbors are connected, which coincides with the community structure of complex networks: the more closely the neighbors of a node are connected, the more likely these nodes belong to one community. As early as 2014, Mariá C. V. Nascimento [7] proposed SPEC, a spectral heuristic based on clustering coefficients, which overcomes the scale defects of heuristics based on modularity maximization and performs well. Deng Xiaolong et al. [8] proposed an efficient directed network community partitioning algorithm based on the vector influence clustering coefficient. That work gives a relatively novel mathematical derivation for the triangular maximal community, the most basic component of community structure, and proposes an objective function for directed community partitioning with good accuracy.

This paper proposes the LPAc algorithm, which is based on the clustering coefficient and LPA and improves on LPA_D_CC. It changes the ranking rule for node influence so that nodes with small clustering coefficient and degree have higher priority, which avoids the formation of giant communities. The experimental data show that the algorithm gives better partition results for networks with more overlapping nodes.
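For reference, the basic LPA update described in the Introduction (unique initial labels, then repeated adoption of the most frequent neighbor label) can be sketched as below. This is a minimal illustration, not the implementation of [2]: the adjacency-dictionary input, the function name `label_propagation`, the `max_iter` and `seed` parameters, and the random tie-breaking are all assumptions made for the example.

```python
import random
from collections import Counter


def label_propagation(adj, max_iter=100, seed=0):
    """Basic asynchronous LPA on an undirected graph {node: set(neighbours)}."""
    rng = random.Random(seed)          # assumed: seeded only for repeatability
    labels = {v: v for v in adj}       # every node starts with a unique label
    nodes = list(adj)
    for _ in range(max_iter):
        rng.shuffle(nodes)             # random visiting order (source of instability)
        changed = False
        for v in nodes:
            if not adj[v]:
                continue               # isolated node keeps its own label
            counts = Counter(labels[u] for u in adj[v])
            top = max(counts.values())
            candidates = sorted(lab for lab, n in counts.items() if n == top)
            if labels[v] in candidates:
                continue               # current label is already a most frequent one
            labels[v] = rng.choice(candidates)   # ties are broken at random
            changed = True
        if not changed:
            break                      # no label changed during a full sweep
    return labels


# Two disconnected triangles (a made-up example graph).
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
print(label_propagation(graph))
```

The random visiting order and random tie-breaking are what make repeated runs of plain LPA differ on less clear-cut networks, which is the instability the paper sets out to reduce.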
978-1-7281-1410-1/19/$31.00 ©2019 IEEE
II. BASIC CONCEPTS

A. Clustering Coefficient
The clustering coefficient of a node is defined as

$$c(i) = \frac{E(i)}{T(i)} \tag{1}$$

Let $i$ be a node in the network and $k$ be the degree of node $i$. In (1), $T(i)$ is the maximum possible number of edges between the neighbors of node $i$, i.e. $T(i) = k(k-1)/2$, and $E(i)$ is the actual number of edges between the neighbors of node $i$. The clustering coefficient satisfies $c(i) \in [0,1]$. $c(i) = 0$ means that there is no connection between the neighbors of node $i$, or that the degree of the node is 0 or 1. $c(i) = 1$ indicates that all neighbors of node $i$ are connected to each other, that is, the node and its neighbors form a fully connected subgraph.

B. Modularity
Modularity, proposed by Newman et al. [9] in 2004, is an important index for measuring the quality of a community division. The basic idea is to compare the community partition with the corresponding null model. Modularity is defined as the difference between the number of intra-community edges in the network and the number of intra-community edges in the corresponding null model, divided by the total number of edges $M$ in the network, that is:

$$Q = \frac{Q_{real} - Q_{null}}{M} = \frac{1}{2M}\sum_{ij}\left(a_{ij} - p_{ij}\right)\delta(C_i, C_j) \tag{2}$$

In (2), $A = (a_{ij})$ is the adjacency matrix of the actual network, $p_{ij}$ is the expected number of connections between node $i$ and node $j$ in the null model, and $C_i$ and $C_j$ are the communities of node $i$ and node $j$ respectively: if they belong to the same community, $\delta$ is 1, otherwise $\delta$ is 0. A commonly used equivalent expression of modularity is:

$$Q = \sum_{v=1}^{n_c}\left[\frac{l_v}{M} - \left(\frac{d_v}{2M}\right)^2\right] \tag{3}$$

In (3), $n_c$ is the number of communities in the network, $l_v$ is the number of links within community $v$, and $d_v$ is the sum of the degrees of all nodes in community $v$ [10].

III. ALGORITHM

A. Algorithm Description
Before the labels are initialized, the clustering coefficient and degree of all nodes are calculated, and the nodes are sorted by clustering coefficient and degree. The ranking rule is: the smaller a node's clustering coefficient, the higher its priority; if the clustering coefficients are equal, the smaller the degree, the higher the priority; if the degrees are also equal, the nodes are ordered by ascending node number. This ordering is used in the label initialization process. The reason for this ranking is that the larger the clustering coefficient and degree of a node, the more nodes it affects. If nodes with higher clustering coefficient and degree had higher priority, a giant community would easily form during label propagation, which would give a poor community detection result.

Next, all nodes are traversed to initialize the labels. The unvisited node with the highest priority is selected first, all its neighbor nodes are traversed, and the unvisited ones are given the same label. Label initialization is complete once all nodes have been traversed. For networks with a clear community structure and few overlapping nodes, a good partition can be obtained without any restriction on the clustering coefficient during initialization. But for networks with a complex community structure and overlapping nodes, the partition will be poor without such a restriction. Therefore, during label initialization LPAc can adopt different strategies for different network characteristics, namely selecting only nodes whose clustering coefficient lies in a given range, so as to eliminate the interference of irrelevant nodes.

Next, the network in Fig. 1 is taken as an example (in Fig. 1, the number in the circle is the node number, and the number next to the circle is the clustering coefficient of the node). Sorting by clustering coefficient and degree gives the order 3, 1, 2, 4, 5, 0, 6. The initial label 3 is then assigned to node 3 and its neighbors, label 0 to node 0, and label 6 to node 6. The result of label initialization is shown in Fig. 2 (in Fig. 2, the number in the circle is the node number, and the number next to the circle is the node label).

Figure 1. Example network

Figure 2. Label initialization result

Then, label propagation is carried out. Starting from node 0, when there is more than one candidate label among the neighbor nodes, the node with the largest clustering coefficient is selected to update
the label. Taking Fig. 2 as an example, label propagation without intervention assigns the whole network to a single community. The cause is easy to see from the process above. Label initialization starts from node 3, and the same label 3 is assigned to nodes 1, 2, 3, 4 and 5. As a direct result, the whole network is affected by label 3 during propagation and finally forms one community.

A simple solution is to not start label initialization at node 3, which is achieved by adding the restriction $c(i) \ge 0.4$. The results of label initialization and label propagation with this restriction are shown in Fig. 3 (in Fig. 3, the number in the circle is the node number, and the number next to the circle is the node label). After label initialization, the network is divided into two communities after one round of label propagation. Similarly, more complex networks contain many nodes like node 3 in Fig. 1. Only by shielding these nodes first can good label groups form during initialization, so that a good community division is obtained during label propagation.

Figure 3. Label initialization result after adding the restriction c >= 0.4

B. Algorithm Implementation
LPAc Algorithm:
Input: G = (V, E), minimum clustering coefficient c
Output: number of communities nc, result: label

Step 1: Initialize label
  nc ← 0
  for i in V:
    label(i) ← i
  end for
  C ← V sorted by the ranking rule (ascending c(i), then degree(i), then node number)
  for i in V:
    visit(i) ← false
  end for
  for i in C:
    if visit(i) = false && c(i) ≥ c:
      visit(i) ← true
      for j in neigh(i):
        if visit(j) = false:
          label(j) ← i
          visit(j) ← true
        end if
      end for
      nc ← nc + 1
    end if
  end for

Step 2: Label propagation
  for i in V:
    cnt ← {0}, maxv ← ∅, maxnum ← 0, maxcnt ← 0   (reset for each node i)
    for j in neigh(i):
      cnt(label(j))++
      if maxnum < cnt(label(j)):
        maxnum ← cnt(label(j))
      end if
    end for
    for j in cnt:
      if cnt(j) = maxnum:
        maxv ← maxv ∪ j
        maxcnt ← maxcnt + 1
      end if
    end for
    if maxcnt = 1:
      label(i) ← maxv(0)
    else:
      maxcc ← -0.1
      maxj ← 0
      maxd ← 0
      for j in neigh(i):
        if (maxcc < c(j)) || (maxcc = c(j) && maxd < degree(j)):
          maxcc ← c(j)
          maxj ← j
          maxd ← degree(j)
        end if
      end for
      label(i) ← label(maxj)
    end if
  end for
  comms ← ∅
  nc ← 0
  for i in V:
    if label(i) not in comms:
      comms ← comms ∪ label(i)
      nc ← nc + 1
    end if
  end for
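The LPAc listing can be turned into a short runnable sketch. The version below is an illustration under stated assumptions rather than the authors' code: the graph is an adjacency dictionary with integer node ids, the names `clustering_coefficient`, `lpac` and `c_min` are chosen for the example, propagation is a single pass in ascending node order, and the test graph is made up (it is not the network of Fig. 1).

```python
from collections import Counter


def clustering_coefficient(adj, i):
    """c(i) = E(i) / T(i) as in equation (1); 0 when the degree is below 2."""
    nbrs = list(adj[i])
    k = len(nbrs)
    if k < 2:
        return 0.0
    # E(i): edges that actually exist between pairs of neighbours of i.
    e = sum(1 for a in range(k) for b in range(a + 1, k)
            if nbrs[b] in adj[nbrs[a]])
    return e / (k * (k - 1) / 2)


def lpac(adj, c_min=0.0):
    """LPAc sketch on an undirected graph given as {node: set(neighbours)}."""
    cc = {v: clustering_coefficient(adj, v) for v in adj}
    deg = {v: len(adj[v]) for v in adj}

    # Step 1: rank by ascending (clustering coefficient, degree, node id);
    # each unvisited seed with c(i) >= c_min labels its unvisited neighbours.
    label = {v: v for v in adj}
    visited = set()
    for i in sorted(adj, key=lambda v: (cc[v], deg[v], v)):
        if i in visited or cc[i] < c_min:
            continue
        visited.add(i)
        for j in adj[i]:
            if j not in visited:
                label[j] = i
                visited.add(j)

    # Step 2: one propagation pass in ascending node order (an assumption).
    for i in sorted(adj):
        if not adj[i]:
            continue  # isolated node keeps its own label
        counts = Counter(label[j] for j in adj[i])
        top = max(counts.values())
        winners = [lab for lab, n in counts.items() if n == top]
        if len(winners) == 1:
            label[i] = winners[0]
        else:
            # Tie: copy the label of the neighbour with the largest
            # (clustering coefficient, degree), as in the listing.
            best = max(sorted(adj[i]), key=lambda j: (cc[j], deg[j]))
            label[i] = label[best]
    return label


# Two triangles joined by the edge 2-3 (a made-up example, not Fig. 1).
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
         3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
labels = lpac(graph, c_min=0.0)
print(labels, "communities:", len(set(labels.values())))
```

On this small graph the sketch separates the two triangles for both `c_min=0.0` and `c_min=0.5`. Because every choice is made by a fixed ordering, repeated runs return the same labels.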
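Since the experiments are scored by modularity, it may help to see equation (3) evaluated on a concrete labeling. The sketch below assumes an adjacency-dictionary graph and a `{node: label}` mapping. The function name `modularity` and the example graph are illustrative choices, not part of the paper.

```python
from collections import defaultdict


def modularity(adj, label):
    """Q = sum_v [ l_v / M - (d_v / 2M)^2 ], following equation (3)."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2   # M: number of edges
    if m == 0:
        return 0.0
    twice_inner = defaultdict(int)   # 2 * l_v: each inner edge is seen from both ends
    degree_sum = defaultdict(int)    # d_v: sum of node degrees in community v
    for v, nbrs in adj.items():
        degree_sum[label[v]] += len(nbrs)
        for u in nbrs:
            if label[u] == label[v]:
                twice_inner[label[v]] += 1
    return sum(twice_inner[c] / (2 * m) - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Two triangles joined by the edge 2-3 (a made-up example graph).
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
         3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
merged = {v: "a" for v in graph}
print(modularity(graph, split), modularity(graph, merged))
```

Here $M = 7$, and each triangle has $l_v = 3$ and $d_v = 7$, so the split labeling gives $Q = 2\,(3/7 - 1/4) = 5/14 \approx 0.357$. Placing all nodes in one community gives $Q = 0$, which is why a giant community shows up as a modularity close to zero.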
C. Time Complexity of the Algorithm
For an undirected graph with $n$ nodes and $m$ edges, the time complexity of computing the clustering coefficients is $O(nd^2)$, where $d$ is the average degree of the nodes, $d = 2m/n$; the time complexity of initializing the labels is $O(n)$; and the time complexity of label propagation is $O(m)$. Therefore, the total time complexity of the algorithm is $O(m + n + nd^2)$. In general complex networks, $d \ll n < m$ and $d$ can be regarded as a constant, so the time complexity of the LPAc algorithm is $O(m + n)$.

IV. RESULT ANALYSIS

A. LFR Artificial Network Data Sets
The LFR network [11] is a classical artificial network data set in community detection. In the experiment, the LFR benchmark networks have 1000 nodes, an average node degree of 25, a maximum node degree of 50, and community sizes between 20 and 100 nodes. The three mixing coefficients are 0.30, 0.45 and 0.60. The larger the mixing coefficient, the more ambiguous the community structure of the network. The numbers of communities in the three networks are 19, 19 and 17, respectively.

TABLE I. COMPARISON OF MODULARITY OF LPA, LPA_D_CC AND LPAc ON LFR NETWORK

Algorithm    μ = 0.30      μ = 0.45      μ = 0.60
LPA          0.63          0.45          0.01
LPA_D_CC     0.63          0.48          0.17
LPAc         0.63 (c=0)    0.48 (c=0)    0.21 (c=0.04)

TABLE II. COMPARISON OF LPA, LPA_D_CC AND LPAc IN THE NUMBER OF COMMUNITIES ON LFR NETWORK

Algorithm    μ = 0.30      μ = 0.45      μ = 0.60
LPA          19.8          17.4          8
LPA_D_CC     20            20            34.4
LPAc         19 (c=0)      19 (c=0)      25 (c=0.04)

B. Real Network Data Sets
We choose four real network data sets for community detection: Karate [12], Dolphins [13], Lesmis [14] and Football [15]. Their basic information is given in Table III.

TABLE III. REAL NETWORK DATA SET INFORMATION

Data Set    Node number    Edge number    Description
Karate      34             77             Zachary Karate Club Network
Dolphins    62             159            Dolphin Social Network
Lesmis      77             254            The Network of Les Miserables
Football    115            613            American Football League Network

The three algorithms LPA, LPA_D_CC and LPAc are applied to these four real networks, and the resulting community partitions are obtained.

TABLE IV. COMPARISON OF LPA, LPA_D_CC AND LPAc IN THE MODULARITY OF THE COMMUNITY DIVISION ON REAL NETWORKS

Algorithm    Karate         Dolphins       Lesmis         Football
LPA          0.34           0.48           0.36           0.58
LPA_D_CC     0.34           0.48           0.41           0.58
LPAc         0.38 (c=0.1)   0.53 (c=0.3)   0.51 (c=0.4)   0.60 (c=0.3)

TABLE V. COMPARISON OF LPA, LPA_D_CC AND LPAc IN THE NUMBER OF COMMUNITIES ON REAL NETWORKS

[Table body not recoverable from the source.]

Before analyzing the results, it should be noted that LPA and LPA_D_CC are random, so their results are averages over many runs and the number of communities they produce can be fractional. LPAc is very stable and gives the same result in every experiment, so its community counts are integers. On the LFR artificial networks, the modularity of the LPAc and LPA_D_CC partitions is the same for μ = 0.30 and μ = 0.45, and differs little from that of LPA. But for μ = 0.60, the modularity of the LPA result is 0.01, that is, the partition is poor. Its number of communities is 8, about half of the true value of 17, so we can infer that a giant community has formed in its result. The modularity of LPA_D_CC is 0.17, while LPAc is better, with a modularity of 0.21, and its number of communities, 25, is closer to the true value than LPA_D_CC's 34.4. In the four real network experiments, the modularity of the LPAc partitions is greater than that of LPA and LPA_D_CC, which shows that the LPAc results are more reasonable. For the Karate network, the numbers of communities are similar. For the Dolphins network, LPA gives 3.4, LPA_D_CC gives 4.2 and LPAc gives 5, which shows that LPAc remains stable but produces more communities. The results for Lesmis and Football are similar. This reveals a disadvantage of LPAc: compared with other label propagation algorithms, it always produces more communities, which indicates that the process of label propagation ended
ahead of schedule. This is easy to understand: LPAc sorts the nodes during both label initialization and label propagation. These sorting strategies disrupt normal label propagation among nodes to a certain extent, so label propagation may end earlier. The experimental results show that, on both the LFR artificial networks and the real networks, the modularity of the LPAc results is no lower than that of the other two algorithms, and the number of communities is more accurate in some cases. Generally speaking, LPAc has good stability and accuracy, but it has the disadvantage of producing a large number of communities.

V. CONCLUSIONS AND FUTURE WORK

This paper proposes the LPAc algorithm, which is based on the label propagation algorithm and LPA_D_CC and adds the minimum clustering coefficient c as a parameter. This improves the applicability, stability and accuracy of the algorithm to a certain extent. Open problems remain. For different networks, it is unclear how to quickly find the best minimum clustering coefficient c and whether it is related to the clustering coefficient of the whole network. Since real complex networks contain many overlapping communities, it is also open whether the algorithm can be extended to overlapping community detection. In the next step, the algorithm will be further improved and optimized in these respects.

REFERENCES

[1] Zhang Jun-li, Chang Yan-li, Shi Wen. Overview on label propagation algorithm and applications [J]. Application Research of Computers, 2013, 30(01): 21-25.
[2] Raghavan U N, Albert R, Kumara S. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 2007, 76(3): 036106.
[3] Liu Shi-Chao, Zhu Fu-Xi, Gan Lin. A Label-Propagation-Probability-Based Algorithm for Overlapping Community Detection [J]. Chinese Journal of Computers, 2016, 39(04): 717-729.
[4] Li Lei, Ni Lin. Community Detection for Label Propagation with Modularity Optimization [J]. Computer Systems & Applications, 2016, 25(09): 212-215.
[5] Zhang Meiqin, Bai Liang, Wang Junbin. Label propagation algorithm based on weighted clustering ensemble. CAAI Transactions on Intelligent Systems, 2018, 13(06): 994-998.
[6] Zheng Shao-qiang, Zhao Zhong-ying, Feng Hui-zi, Li Chao. Highly Robust Community Detection Algorithm Based on Label Propagation [J]. Journal of Chinese Computer Systems, 2018, 39(08): 1809-1813.
[7] Nascimento, Mariá C. V. Community detection in networks via a spectral heuristic based on the clustering coefficient [J]. Discrete Applied Mathematics, 2014, 176: 89-99.
[8] Deng Xiaolong, Zhai Jiayu, Yin Luanyu. Vector Influence Clustering Coefficient Based Efficient Directed Community Detection Algorithm [J]. Journal of Electronics & Information Technology, 2017, 39(9): 2071-2080.
[9] Newman M E J, Girvan M. Finding and evaluating community structure in networks. Physical Review E, 2004, 69(2): 026113.
[10] Wang Xiao-fan, Li Xiang, Chen Guan-rong. Network Science: An Introduction [M]. Beijing: Higher Education Press, 2012.4: 131-133.
[11] Lancichinetti A, Fortunato S, Radicchi F. Benchmark graphs for testing community detection algorithms. Physical Review E, 2008, 78(4): 046110.
[12] W. W. Zachary, An information flow model for conflict and fission in small groups, Journal of Anthropological Research 33, 452-473 (1977).
[13] D. Lusseau, K. Schneider, O. J. Boisseau, P. Haase, E. Slooten, and S. M. Dawson, Behavioral Ecology and Sociobiology 54, 396-405 (2003).
[14] D. E. Knuth, The Stanford GraphBase: A Platform for Combinatorial Computing, Addison-Wesley, Reading, MA (1993).
[15] M. Girvan and M. E. J. Newman, Proc. Natl. Acad. Sci. USA 99, 7821-7826 (2002).