Multi-objective NSGA-II based community detection using dynamical evolution social network

ABSTRACT


INTRODUCTION
Community detection (CD) is one of the problems in social network (SN) applications [1]-[3]. It aims at partitioning the social network into a set of disjoint communities with maximum intra-community similarity and minimum inter-community similarity [4]. However, most CD approaches in the literature treat the SN from a static perspective, which makes them unsuitable for handling the progressive growth of the social network. Hence, it is important to develop a model that provides not only satisfactory modularity, which reflects the quality of the community detection decision, but also satisfactory smoothness, which indicates that the community partitioning changes smoothly over time [5]. Meta-heuristic searching is a family of randomized search algorithms that use heuristics to optimize a given set of objectives [6], [7]. A meta-heuristic search algorithm encodes the solution of the problem, creates a set of candidate solutions, and then evaluates them with the objective function. Good solutions are favoured and have more influence on the creation of new solutions, while the remaining solutions receive less attention. Well-known meta-heuristic search algorithms include particle swarm optimization [8] and genetic algorithms [9]. Many researchers have proposed meta-heuristic approaches for community detection [10]-[12]. However, most of these approaches have not tackled the problem from a multi-objective perspective that treats the evolution of the network as an aspect of performance. This gap opens a new direction for community detection methods based on non-dominated sorting or multi-objective meta-heuristic algorithms [13]. There are many multi-objective meta-heuristic searching algorithms.
Most of them use non-dominated sorting as the main criterion for ranking solutions during the search. One of the state-of-the-art approaches for multi-objective meta-heuristic searching is the non-dominated sorting genetic algorithm NSGA-II [14], [15]. It relies on two concepts to produce the final set of non-dominated solutions: non-dominated sorting and the crowding distance. The role of non-dominated sorting is to enable exploitation, and the role of the crowding distance is to enable exploration. The two are balanced by giving first priority to the rank of a solution, for the sake of exploitation or domination, and second priority to its crowding distance, for the sake of exploration. The goal of this article is to present a non-domination-aware search algorithm designated as non-dominated sorting based community detection with dynamical awareness (NDS-CD-DA). The problem is formulated as a multi-objective optimization problem. The objectives consider not only the static quality of community detection but also a dynamic term that promotes a smooth transition over time from one community detection decision to the next. The remainder of the article is organized as follows. Related work is presented in Section 2. The methodology is provided in Section 3. The evaluation and experimental work are given in Section 4. Lastly, the conclusion and future work are provided in Section 5.

RELATED WORKS
Numerous articles in the literature have tackled community detection from the perspective of multi-objective optimization. Many researchers have assumed convexity of the problem and solved it using a sum of objectives, such as the works of [16] and [17], while fewer have kept the objectives separate and handled the problem using the Pareto concept. Some researchers have formulated the problem using both inter- and intra-community distance and found a set of solutions that balances them under non-domination [18]. This does not follow the classical non-dominated search algorithms for multiple objectives. However, many researchers have adopted traditional multi-objective evolutionary optimization.
While some researchers have focused on the community structure and its influence on the resilience of the network as a whole [19], many researchers have focused on community detection from the perspective of multi-objective optimization. In the work of [16], a multi-objective evolutionary algorithm was used to identify overlapping community structure in a complex network. That work ignored the dynamic aspect of the network and its temporal changes. The work of [20] used multi-objective genetic optimization, with minimizing the external connectivity as one objective and maximizing the internal connectivity as another. Bello-Orgaz [20] measures external connectivity with expansion, separability and cut ratio, and internal connectivity with density, triangle participation ratio, clique number and clustering coefficient. However, their optimization does not include any term for the smoothness that is required to handle the dynamics. The work of [21] used temporal change as an objective along with modularity, which expresses the quality of the detection; however, their multi-objective optimization does not incorporate non-dominated sorting [22].
In the work of [23], the similarity function of the unsigned network was extended to cover the signed network. Their evolutionary multi-objective optimization has a crossover merging operator that enables cluster merging during the search. For the multiple objectives, they derived negatively correlated objective functions based on the signed network to achieve better performance. However, their work ignored the optimization of dynamic change in the evolving social network. One of the well-known community detection approaches, the label propagation-based algorithm (LPAm), was also developed using meta-heuristics.
In the work of [24], ant colony optimization with non-dominated sorting was used for community detection. The authors considered complex network structures and used more than one objective function for calculating the quality of community detection. The approach is competitive for static communities. However, the authors ignored the evolving aspect in their calculation, which is essential for identifying communities in an evolving network. In the work of [17], an evolutionary genetic search algorithm was developed for community detection with two objective functions: the first indicates the quality of CD, namely the modularity, and the second indicates the quality of the smoothness, defined by temporal change (TC). The proposed genetic algorithm uses mutation and migration operators, which provide better searching capability in terms of exploration and exploitation. However, it did not incorporate the crowding distance (CDIST), which plays a crucial role in exploration and leads to a more diverse Pareto front.
Overall, it is observed from the literature that the majority of the work proposed for community detection has not paid attention to the smoothness of the community solution over time. Some approaches have used NMI as an indicator of the smoothness between two consecutive time steps; however, they have not solved the problem using non-dominated sorting, which avoids the local minima that occur when a weighted average is used. The contributions of this research are as follows: i) it presents a novel framework for dynamic community detection in evolving social networks with multi-objective awareness; ii) it uses the well-known non-dominated sorting genetic algorithm with crowding distance, NSGA-II, to solve the problem of dynamic community detection; iii) it generates all multi-objective evaluation metrics and compares against the classical multi-objective genetic approach.

METHODOLOGY
This section presents the methodology developed for a dynamic-aware community detection system with a multi-objective perspective. We present the problem formulation in sub-section 3.1, the optimization framework in sub-section 3.2, the solution encoding in sub-section 3.3, and the objective functions in sub-section 3.4. Next, we present the algorithm of non-dominated sorting based community detection with dynamical awareness (NDS-CD-DA) in sub-section 3.5. Lastly, the evaluation metrics used for comparison are given in sub-section 3.6.

Problem formulation
Assume we have an evolving social network $SN = \{SN_1, SN_2, \ldots, SN_T\}$, where $SN_t$ represents the social network at moment $t$. Our goal is to provide a dynamic community detection (DCD) system that takes $SN_t$ at moment $t$ and provides the community detection at moment $t$, given as $CPSN = \{CPSN_1, CPSN_2, \ldots, CPSN_T\}$, where $CPSN_t$ denotes the community decision at moment $t$. Here $SN_t = (V_t, E_t)$, where $V_t$ denotes the vertices, $E_t$ denotes the edges, and $k_t$ denotes the number of detected communities at moment $t$. Our goal is to provide the decision while maintaining two aspects of quality: the snapshot quality, represented by modularity, which needs to be maximized, and the quality of the temporal change, represented by NMI, which also needs to be maximized. Hence, our problem is a multi-objective optimization problem. The first objective is the modularity at moment $t$, denoted $Q_t$ and given by (1):

$$Q_t = \sum_{c=1}^{k_t} \left[ \frac{l_c}{|E_t|} - \left( \frac{d_c}{2|E_t|} \right)^2 \right] \tag{1}$$

where $l_c$ denotes the number of edges in the community $c$ and $d_c$ denotes the sum of the degrees of the nodes inside the community $c$. The second objective is the normalized mutual information (NMI) between two partitions $A$ and $B$, given by (2):

$$NMI(A, B) = \frac{-2 \sum_{i=1}^{C_A} \sum_{j=1}^{C_B} C_{ij} \log\left( \frac{C_{ij}\,|V|}{C_{i\cdot}\, C_{\cdot j}} \right)}{\sum_{i=1}^{C_A} C_{i\cdot} \log\left( \frac{C_{i\cdot}}{|V|} \right) + \sum_{j=1}^{C_B} C_{\cdot j} \log\left( \frac{C_{\cdot j}}{|V|} \right)} \tag{2}$$

where $C_A$ denotes the number of communities in $A$; $C_B$ denotes the number of communities in $B$; $C$ denotes the confusion matrix whose element $C_{ij}$ counts the nodes that belong to both the $i$-th community of partition $A$ and the $j$-th community of partition $B$; $C_{i\cdot}$ denotes the sum of the elements of $C$ in row $i$; $C_{\cdot j}$ denotes the sum of the elements of $C$ in column $j$; and $|V|$ denotes the number of nodes in the network.
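As a minimal sketch of how the two objectives could be computed, the following assumes an undirected, unweighted snapshot given as an edge list and partitions given as node-to-community dictionaries. The function names `modularity` and `nmi` and these data layouts are illustrative assumptions, not the authors' implementation; the formulas are the standard community-sum form of modularity and the confusion-matrix form of NMI.

```python
import math
from collections import Counter


def modularity(edges, labels):
    """Modularity as a sum over communities of l_c/m - (d_c/(2m))^2.

    edges:  list of (u, v) pairs of an undirected, unweighted graph
    labels: dict mapping each node to its community id
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal = Counter()  # l_c: edges with both endpoints in community c
    degree = Counter()    # d_c: sum of the degrees of the nodes in community c
    for u, v in edges:
        degree[labels[u]] += 1
        degree[labels[v]] += 1
        if labels[u] == labels[v]:
            internal[labels[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


def nmi(labels_a, labels_b):
    """Normalized mutual information between two partitions of the same nodes."""
    nodes = list(labels_a)
    n = len(nodes)
    joint = Counter((labels_a[v], labels_b[v]) for v in nodes)  # confusion matrix
    row = Counter(labels_a[v] for v in nodes)                   # row sums
    col = Counter(labels_b[v] for v in nodes)                   # column sums
    num = -2.0 * sum(c * math.log(c * n / (row[i] * col[j]))
                     for (i, j), c in joint.items())
    den = (sum(c * math.log(c / n) for c in row.values())
           + sum(c * math.log(c / n) for c in col.values()))
    # Both partitions consist of a single community: treat them as identical.
    return 1.0 if den == 0 else num / den
```

For two triangles joined by a single edge and split into their natural communities, `modularity` returns 5/14, and `nmi` of a partition with itself returns 1.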
Assuming that $F = (f_1, f_2) = (Q_t, NMI_t)$, our goal is to achieve more dominance as defined by Definition A. The challenges of this problem can be summarized as follows. First, the social network evolves with high dynamics, which requires a trade-off between the smoothness of the community detection decision over time and the quality of the detected communities from a stand-alone clustering perspective. Second, the number of candidate solutions is high, that is, the solution space is huge, which requires a considerable number of iterations and solutions for convergence. However, it is also important to consider the continuous change in the social

Framework of optimization
The optimization that performs community detection is presented in the framework of Figure 1. As the figure shows, the framework takes the social network as input and produces the community partitioned social network (CPSN) as output. There are four essential blocks: the community detection (CD) block, the modularity block, the normalized mutual information (NMI) block, and the multi-objective optimization (MOO) block. The CD block provides the community partitioned social network based on two inputs: the social network and the decision provided by the MOO block. The MOO block performs the search with the assistance of the CD block by reading the values of the two objective functions, namely modularity and NMI. NMI is calculated from two inputs, $CPSN_t$ and $CPSN_{t-1}$, the latter of which is obtained from the memory block.

Solution encoding
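The goal of the solution encoding is to enable the optimization by defining an appropriate solution space. Based on the problem formulation presented in Section 3.1, a solution is encoded as a vector $x = [x_i]$, $i = 1, 2, \ldots, n$, with $1 \le x_i \le k$, where $n$ is the solution dimension and $k$ is the upper boundary of the search.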

Objective functions
Each solution in the solution space is evaluated according to the two objectives, and our goal is to find the set of non-dominated solutions, or Pareto front, for the two objective functions.
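To make the notion of a non-dominated set concrete, here is a small sketch that assumes each solution has already been evaluated to a pair (modularity, NMI) and that both objectives are maximized. The names `dominates` and `pareto_front` are illustrative, and the numbers in the usage note are made up for the example.

```python
def dominates(a, b):
    """True if objective vector a dominates b when every objective is maximized:
    a is no worse than b everywhere and strictly better somewhere."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(points):
    """Return the points that no other point dominates, in input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

For example, among the hypothetical evaluations `(0.40, 0.90)`, `(0.45, 0.80)`, `(0.35, 0.85)` and `(0.45, 0.70)`, only the first two survive: the third is beaten on both objectives by the first, and the fourth is beaten on NMI by the second at equal modularity.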

Non-dominated sorting based community detection with dynamical awareness (NDS-CD-DA)
This section presents the algorithm of NDS-CD-DA. The input of the algorithm is the evolving social network $SN = \{SN_1, SN_2, \ldots, SN_T\}$, and the output is the set of Pareto fronts of the community detection solutions of the social networks, $PF = \{PF_1, PF_2, \ldots, PF_T\}$. Hence, in each iteration, the algorithm runs NSGA-II on the corresponding social network and the result is a Pareto front. The dynamic awareness is obtained by maximizing the normalized mutual information (NMI) between the community result of the current social network and that of the previous social network. Algorithm 1 runs for a number of iterations equal to $T$. At each iteration, it determines the number of nodes in the social network, as given in line 7.1. Next, it creates the boundary of the search accordingly, as given in line 7.3. Besides, it sets a different solution dimension according to the maximum potential
We present the pseudocode of NSGA-II in Algorithm 2. It accepts the number of iterations maxIterations, the number of solutions numberofSolutions, the number of objective functions objectiveFuntions, the crossover parameter crossoverRate, and the mutation parameter mutationRate. The output is given in ParetoFront. At each iteration, the algorithm evaluates the entire population using the function evaluatePop. It also performs the non-dominated sorting using NonDominatedSorting(). After the non-dominated sorting, the algorithm selects the parents and generates the offspring, as given in lines 8.3 and 8.4 respectively. Next, the parents and offspring are combined to produce the extended population, as given in line 8.5. The extended population is evaluated using the objective functions and then sorted by non-domination, in lines 8.6 and 8.7 respectively. Afterwards, we perform the selection, giving first priority to the most dominating ranks and second priority to the highest crowding distance, as given in lines number 8
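The selection step described above, rank first and crowding distance second, can be sketched as follows. This is a simplified illustration rather than Algorithm 2 itself: it assumes all objectives are maximized, works on a list of already evaluated objective tuples, and uses a plain repeated-filtering sort instead of the bookkeeping of the fast non-dominated sort. All function names are hypothetical.

```python
def dominates(a, b):
    """a dominates b when all objectives are maximized."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def non_dominated_sort(objs):
    """Split indices into fronts; fronts[0] is the non-dominated set."""
    remaining = set(range(len(objs)))
    fronts = []
    while remaining:
        front = sorted(i for i in remaining
                       if not any(dominates(objs[j], objs[i]) for j in remaining))
        fronts.append(front)
        remaining -= set(front)
    return fronts


def crowding_distance(objs, front):
    """Crowding distance of each index in one front (larger = less crowded)."""
    dist = {i: 0.0 for i in front}
    for k in range(len(objs[0])):
        order = sorted(front, key=lambda i: objs[i][k])
        lo, hi = objs[order[0]][k], objs[order[-1]][k]
        # Boundary solutions are always kept, so they get infinite distance.
        dist[order[0]] = dist[order[-1]] = float("inf")
        if hi == lo:
            continue
        for prev, cur, nxt in zip(order, order[1:], order[2:]):
            dist[cur] += (objs[nxt][k] - objs[prev][k]) / (hi - lo)
    return dist


def select_survivors(objs, n):
    """Pick n indices: whole fronts by rank, then by crowding distance."""
    chosen = []
    for front in non_dominated_sort(objs):
        if len(chosen) + len(front) <= n:
            chosen.extend(front)
        else:
            dist = crowding_distance(objs, front)
            front.sort(key=lambda i: dist[i], reverse=True)
            chosen.extend(front[: n - len(chosen)])
            break
    return chosen
```

When the last admitted front does not fit entirely, the boundary solutions of that front are preferred, which is what keeps the retained Pareto front spread out.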

Dataset
This section presents the datasets that are used for evaluation. We use two real datasets, Last.fm and Douban, as well as one synthetic dataset named SYNFIX:
a. Last.fm: it was released in the framework of the 2nd international workshop on information heterogeneity and fusion in recommender systems [25]. Last.fm has more than 40 million active clients distributed over more than 190 countries. It allows clients to listen to music online and to extend friendship relations to other clients. This dataset only records the listening of each client to particular artists.
b. Douban: a social website launched on 6/3/2005 in China that provides book, music and movie recommendations based on user ratings and evaluations. It is similar to Facebook in that users can form friendship relations through email communication. The rating is secret and ranges from 1/5, which is useless, to 5/5, which is most useful [26].
c. SYNFIX: a dynamic dataset that contains 128 nodes in total [17]. It is divided into 4 communities of 32 nodes each. Each node has an average degree of 16 and shares zout edges with nodes in other communities. The community structure becomes weaker as zout increases. To simulate evolving community structure, three nodes are randomly selected and assigned to other communities at each time step. The number of edges that are shared with other nodes is 3. The SYN-FIX dynamic network is generated over 10 iterations (two variants were used, SYN-FIX3 and SYN-FIX5).
Last.fm and Douban are not evolving social networks. Hence, we use Algorithm 3 to convert each static social network into an evolving network. The algorithm takes the social network and the number of evolution time steps as inputs and provides the evolving social network as output. At each iteration, the algorithm selects some nodes randomly, breaks some of their links and adds new links.
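A possible reading of this conversion is sketched below. Algorithm 3 is not reproduced in the text, so the details here are assumptions made for illustration: the function name `evolve_network`, the parameter `nodes_per_step`, the choice of breaking exactly one link and adding exactly one link per selected node, and the fixed random seed.

```python
import random


def evolve_network(edges, steps, nodes_per_step=3, seed=0):
    """Turn a static undirected graph into a list of `steps` snapshots.

    Each snapshot is a set of edges stored as sorted (u, v) tuples.
    Assumption: every selected node loses one existing link (if it has any)
    and gains one link to a node it is not currently connected to.
    """
    rng = random.Random(seed)
    current = {tuple(sorted(e)) for e in edges}
    nodes = sorted({v for e in current for v in e})
    snapshots = [set(current)]  # snapshot 1 is the original network
    for _ in range(steps - 1):
        for u in rng.sample(nodes, min(nodes_per_step, len(nodes))):
            incident = sorted(e for e in current if u in e)
            if incident:
                current.discard(rng.choice(incident))  # break one link of u
            candidates = [v for v in nodes
                          if v != u and tuple(sorted((u, v))) not in current]
            if candidates:
                # add one new link from u to a randomly chosen non-neighbour
                current.add(tuple(sorted((u, rng.choice(candidates)))))
        snapshots.append(set(current))
    return snapshots
```

Keeping the node set fixed across snapshots means the partitions of consecutive snapshots cover the same nodes, so the NMI between them is well defined.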

Evaluation metrics
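The set coverage, also called the C-metric, compares two Pareto sets $Ps_1$ and $Ps_2$ as in (3):

$$C(Ps_1, Ps_2) = \frac{\left|\{\, b \in Ps_2 \mid \exists\, a \in Ps_1 : a \succ b \,\}\right|}{|Ps_2|} \tag{3}$$

where $a \succ b$ means that $a$ dominates $b$. $C(Ps_1, Ps_2)$ equals the ratio of the non-dominated solutions in $Ps_2$ that are dominated by non-dominated solutions in $Ps_1$ to the number of solutions in $Ps_2$. Thus, when evaluating a set $Ps$, it is important to minimize the value of $C(X, Ps)$ for all Pareto sets $X$.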
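The set coverage can be computed directly from its definition. The sketch below assumes both objectives are maximized and that each Pareto set is a list of (modularity, NMI) tuples; it counts strict domination, following the wording above, and the name `set_coverage` and the sample values are illustrative.

```python
def dominates(a, b):
    """a dominates b when all objectives are maximized."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def set_coverage(ps1, ps2):
    """Fraction of the solutions in ps2 dominated by at least one solution in ps1."""
    if not ps2:
        return 0.0
    covered = sum(1 for b in ps2 if any(dominates(a, b) for a in ps1))
    return covered / len(ps2)
```

The metric is not symmetric, so both directions are usually reported. With the made-up sets `ps1 = [(0.5, 0.9), (0.6, 0.8)]` and `ps2 = [(0.4, 0.85), (0.7, 0.5)]`, `set_coverage(ps1, ps2)` is 0.5 while `set_coverage(ps2, ps1)` is 0.0.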

EVALUATION AND EXPERIMENTAL WORK
This section presents the experimental work and evaluation of the developed NDS-CD-DA and its comparison with the benchmark DECS. As shown in Figure 2, NDS-CD-DA has achieved a domination percentage of 100% over DECS for almost all iterations. This is clearest in the Last.fm dataset, and the percentage is lower by only 10% at moment two of the social network evolution in the Douban dataset. This is explained by the non-domination awareness of the search in NDS-CD-DA, compared with the weighted-sum evaluation of DECS. Table 1 lists the parameter names and the values used in the proposed work.
The evaluation of community detection is decomposed into two parts. The first is the self-validation of the progress of our NDS-CD-DA algorithm. It shows the evolution of the social network with two graphs, one for the NMI metric and one for the modularity metric, while the algorithm is searching. We present these in Figures 3 to 5 for Last.fm, Douban and SYNFIX respectively.
The second part compares the set coverage of NDS-CD-DA over DECS for 10 iterations of the evolution of the social network; it is given in Figure 2 for Last.fm and Douban, and in Figure 6 for SYNFIX. We observe that over the entire evolution of the social network, NDS-CD-DA has a positive set coverage over DECS, reaching one in most cases. Hence, it can be said that NDS-CD-DA is more capable than DECS of attaining higher modularity and NMI. Figures 2 and 6 show that single-objective optimization for community detection, which focuses only on modularity as the metric for decomposing the social network into sub-groups named communities, leads to sub-optimality. The reason is that the method needs not only to provide the communities but also to smooth the decision over time, and these two goals might conflict. Hence, using the result of multi-objective optimization over the two objectives leads to higher optimality. This is why, for all snapshots of the social network in the figures, NDS-CD-DA dominates DECS.

Figure 6. Set coverage of NDS-CD-DA over DECS for the SYNFIX dataset (two variants were used, SYN-FIX3 and SYN-FIX5) for 10 iterations

CONCLUSION
This work has presented a novel community detection approach based on the well-known multi-objective optimization algorithm named non-dominated sorting genetic algorithm NSGA-II. It is designated as non-dominated sorting based community detection with dynamical awareness (NDS-CD-DA). It uses two objective functions: the first is the modularity, which indicates the quality of community detection on a single snapshot of the social network, and the second is the NMI, which indicates the quality of community detection with respect to the temporal change of the social network. The approach was implemented on several social network datasets, namely Last.fm, Douban and SYNFIX. SYNFIX is an evolving network, while Last.fm and Douban are snapshot datasets that were converted to evolving networks using the approach presented. The results reveal superior performance in terms of domination, as represented by the set coverage metric. The limitation is that the proposed optimization consumes a considerable number of iterations before convergence. Future work is, first, to reduce the number of iterations by starting from a random population close to the last search population, assuming that the dynamics of the evolving social network are not high, and second, to incorporate this model in other useful systems such as recommendation systems and implicit link prediction.