Community detection of political blogs network based on structure-attribute graph clustering model

ABSTRACT


INTRODUCTION
In recent years, social networks have come to be regarded as an important domain of complex networks, in which networks can be modeled as graphs [1], [2]. The graph structure is a useful approach for studying social networks: objects (such as people and authors) are modeled as nodes, and the relationships among objects are represented as edges connecting these nodes. In social network analysis, graph clustering [3] is widely used to partition a large network into several densely connected community structures based on similarity measures. As a result, the partitioned structures make a large network easier to visualize and analyze.
Graph clustering has been used in many domains of social network analysis, including biological networks [4], community detection [5]-[9], and website social networks [10]. There are several graph clustering techniques. Most of them consider only the similarity of the topological structures [11], [12], others focus on the attributes of the contents of the nodes [13], while few consider both aspects [8].
Weblogs have an increasing influence on people's lifestyles, especially during US election periods. It is therefore timely to develop methods that can detect the community structures within weblog networks and thereby make such networks easier to visualize and analyze. In this work, a new graph clustering method for community detection in social networks is proposed, called Structure-Attribute Similarity Clustering (SAS-Cluster), which takes into account the similarities of both the topological structures and the node attributes. Two concepts, Mean Gravity and Path Degree, are introduced to increase the cohesiveness of the community structure. The contributions of this paper are summarized below:
a. A new graph clustering algorithm is proposed that considers both structural similarity and node attributes.
b. Two concepts (Mean Gravity and Path Degree) are introduced to increase the cohesiveness of the cluster structure.
The remaining sections of this paper are arranged as follows. Section 2 reviews recent related work. Section 3 focuses on the graph clustering technique. Section 4 describes the Political Blogosphere Network. Section 5 introduces the proposed method. Section 6 presents the experimental results. Finally, Section 7 concludes the paper.

RELATED WORKS
This section summarizes recent related work on graph clustering methods for social networks. The goal of graph clustering is to group the nodes of the network that have denser connections among them. Some methods, such as the Clique Percolation Method (CPM), focus on counting internal and external edges [3] while ignoring interactions and vertex characteristics. In [14], the authors propose a clique method on a weighted co-purchase network graph to find micro-clusters. The algorithm works in two phases: graph polishing, which enumerates intersections of neighbors, and clique enumeration, which counts maximum cliques.
The Newman-Girvan algorithm [15] is a well-known divisive algorithm for community detection. It proceeds in two main steps: it first identifies edges based on the betweenness measure, then splits the network into communities based on the identified edges. Betweenness must be recalculated after each split, and the quality of the communities is measured using maximal modularity. However, the method is not suitable for large networks, and it suffers from the resolution limit.
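The modularity score used to judge a partition can be illustrated with a short sketch. The code below assumes the standard definition for an undirected, unweighted graph, Q = sum over communities of (L_c / m) - (d_c / 2m)^2, where L_c is the number of internal edges, d_c the total degree of the community, and m the number of edges. The function name, the toy graph, and the labels are illustrative assumptions, not taken from [15].

```python
from collections import defaultdict


def modularity(edges, membership):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    membership: dict mapping every vertex to its community label.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)    # edges with both endpoints in community c
    degree_sum = defaultdict(int)  # total degree of the vertices in community c
    for u, v in edges:
        degree_sum[membership[u]] += 1
        degree_sum[membership[v]] += 1
        if membership[u] == membership[v]:
            internal[membership[u]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Toy graph (hypothetical): two triangles joined by one bridge edge.
EDGES = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
TWO_GROUPS = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
ONE_GROUP = {v: "a" for v in range(6)}

if __name__ == "__main__":
    print(modularity(EDGES, TWO_GROUPS))
    print(modularity(EDGES, ONE_GROUP))
```

Splitting the toy graph into its two triangles scores higher than keeping everything in one community, which is the behavior a divisive method relies on when it picks the split with maximal modularity.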
In ABCD [6], the authors introduced a new algorithm based on bi-directional connections and node features to detect community attractiveness in online social networks (OSN). The algorithm was validated on the SNAP platform and compared with CNM [5]. According to the authors, ABCD outperforms CNM and can discover smaller communities than CNM. However, the work did not report comparative modularity values to prove its effectiveness.
ISCD+ [16] is a k-prototype algorithm, an iterative model for fast graph clustering. The authors introduce a new idea for detecting communities: the algorithm uses two factors, local importance and importance concentration, to select nodes with different weights to represent communities.
In [12], the authors propose a KNN-based directed weighted graph clustering algorithm for community detection. The algorithm considers network topology only and focuses on path traversal frequency and neighborhood nodes. Nevertheless, the method suffers from high computational complexity, since it is based on k-nearest-neighbor computations.
In [10], the authors introduced a new approach for community detection in social network websites. Besides structural similarity, it contributes frequent pattern mining of node contents. The algorithm is implemented in four steps: preprocessing, frequent pattern computation to obtain harmonious groups, extension of harmonious groups into small communities, and finally expansion of the small communities. However, the method has some disadvantages, such as the time complexity caused by the input parameters, since a trial-and-error approach was used to determine appropriate parameters.
Roy et al. [17] proposed a graph-based spectral clustering model. The method uses a novel affinity matrix for spatial clustering with the Mahalanobis distance. However, it has a limitation: the distance metric can only be measured from a single point, which reduces the quality of the results.
Jinarat et al. [18] introduced a graph clustering algorithm for web search results. The core idea is to combine web search results with external knowledge from Wikipedia to attain better clustering quality. The method uses graph-based construction for text clustering to connect related documents. Nevertheless, the similarity threshold parameter for subgraph detection must stay within a certain range: when the threshold increases, the clustering quality decreases.

GRAPH CLUSTERING TECHNIQUE
Consider an undirected weighted graph described by its vertices, edges, edge weights, and vertex attributes. The purpose of graph clustering is to partition the graph into k disjoint subgraphs according to measures of topological structure and attribute similarity. The communities should have the following properties:
a. Similar vertices should be placed in the same group, while dissimilar ones should go to different groups.
b. The vertices that belong to a community should be densely connected to each other and sparsely connected to vertices in other communities.
The goal of the proposed algorithm is to introduce a weighted measure that effectively reflects the characteristics of the network topology and the vertex features, in order to strengthen similarity cohesiveness. The clustering strategy and the similarity measure are discussed in the next section.

Contrast comparative method
W-Cluster [9] is derived from SA-Cluster and considers both structural and attribute aspects by applying a unified distance measure and a neighborhood random walk strategy. The method uses the probability that an edge belongs to a community to measure link strength, and the Jaccard similarity coefficient to estimate content similarity. W-Cluster automatically learns the degree of both topological similarity and attribute similarity by using the probability transition matrix to build the unified distance measure. The method partitions a large graph into a number of clusters.

Evaluation measures
To evaluate the quality of the clusters produced by SAS-Cluster, two evaluation measures are used: Density [1] and Entropy [19]. Both measures are defined as follows. The Density measure estimates the structural closeness within each cluster and is given in Equation (1).
For Entropy, given in Equation (2) and Equation (3), $i \in \{1, 2, 3, \ldots, k\}$ is the cluster index, $a_c$ is the attribute with index $c \in [1, 2, 3, \ldots, m]$, $n$ indexes the attribute values, $n_c$ is the number of attribute values, and $P_{cin}$ is the percentage of vertices in cluster $i$ that have the $n$-th attribute value on $a_c$.
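The two measures can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not necessarily the exact form of Equations (1) to (3): density is taken as the fraction of all edges whose two endpoints fall in the same cluster, and entropy as the Shannon entropy of each attribute's value distribution inside a cluster, weighted by cluster size and averaged over attributes with equal weights. All names and the toy data are hypothetical.

```python
import math
from collections import Counter


def density(edges, clusters):
    """Fraction of all edges whose two endpoints share a cluster."""
    label = {v: i for i, members in enumerate(clusters) for v in members}
    inside = sum(1 for u, v in edges if label[u] == label[v])
    return inside / len(edges)


def entropy(attributes, clusters):
    """Cluster-size-weighted Shannon entropy, averaged over attributes.

    attributes: dict vertex -> tuple of attribute values (all of length m).
    Equal attribute weights are an assumption of this sketch.
    """
    n_vertices = sum(len(members) for members in clusters)
    m = len(next(iter(attributes.values())))
    total = 0.0
    for c in range(m):                # attribute index
        for members in clusters:      # one term per cluster
            counts = Counter(attributes[v][c] for v in members)
            h = -sum((cnt / len(members)) * math.log2(cnt / len(members))
                     for cnt in counts.values())
            total += (len(members) / n_vertices) * h
    return total / m


# Toy data (hypothetical): two triangles joined by a bridge, one attribute.
EDGES = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
CLUSTERS = [{0, 1, 2}, {3, 4, 5}]
LEANING = {0: ("liberal",), 1: ("liberal",), 2: ("liberal",),
           3: ("conservative",), 4: ("conservative",), 5: ("conservative",)}

if __name__ == "__main__":
    print(density(EDGES, CLUSTERS))   # six of the seven edges are internal
    print(entropy(LEANING, CLUSTERS)) # each cluster is pure in the attribute
```

Under these definitions a higher density and a lower entropy both indicate a better clustering, which matches how the two measures are read in the experiments.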

POLITICAL BLOGOSPHERE NETWORK
Political weblogs have played an important role in US presidential elections since 2000, and they gained more influence in the 2004 US presidential election. A blog can be understood as an informational website on the World Wide Web (WWW) devoted to publishing diary-style text or posts, sometimes containing links to other websites.
According to Adamic et al. [20], 2004 saw a rapid increase in the popularity of blogs, and a significant fraction of internet traffic was directed to them. However, only 9% of internet users acknowledged that they read political blogs during the US political campaign. Therefore, weblogs may be followed by a small number of readers, but their influence extends beyond that readership.
To study the behavior of the blogosphere network, Adamic et al. [20] analyzed the landscape of the most influential political blogs in the two months before the US presidential election. The analysis was based on the topics of discussion and the linking structure among blogs. The top 40 influential blogs were considered and added to the original list.
According to Adamic et al., a set of URLs was gathered from seven online weblog directories, including eTalkingHead, BlogCatalog, CampaignLine, and Blogarama. Each URL represents a political weblog. A one-day snapshot of the URLs was taken; for each downloaded page, the citations were considered, and any newly discovered page was added to the list.
Next, the citations were counted for all discovered blogs. If a discovered blog was cited 17 times or more, its orientation was labeled manually based on its blogrolls and posts, and it was added to the original list. The final set consists of 1494 blogs, divided into 759 liberal and 735 conservative blogs. The linking pattern among the blogs was obtained by counting posts: each post in which one blog cites another is counted as an edge between the two blogs. However, the link was not duplicated if a blog was cited by another blog more than once within the same post [20]. In the network visualization, the colors reflect political orientation: red for conservatives, blue for liberals, purple for links from conservatives to liberals, and orange for links from liberals to conservatives. Finally, the authors concluded with a description of the network: each political leaning is more likely to talk about certain topics, and an interesting pattern is that conservative bloggers tend to link to other conservative blogs and are more densely linked (Table 1).
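The edge-counting rule described above can be sketched as follows. The input layout (one tuple per post, holding the citing blog and the list of blogs it cites), the function name, the blog names, and the choice to skip self-citations are all assumptions made for illustration; [20] does not prescribe this representation.

```python
from collections import Counter


def build_citation_edges(posts):
    """Count, per ordered blog pair, the posts in which source cites target.

    posts: list of (source_blog, cited_blogs) tuples, one tuple per post.
    A target cited several times within one post is counted only once.
    """
    edges = Counter()
    for source, cited in posts:
        for target in set(cited):   # drop repeated citations within a post
            if target != source:    # skipping self-citations is an assumption
                edges[(source, target)] += 1
    return edges


# Hypothetical posts from three blogs.
POSTS = [
    ("blog_a", ["blog_b", "blog_b", "blog_c"]),  # blog_b cited twice in one post
    ("blog_a", ["blog_b"]),
    ("blog_b", ["blog_a"]),
]

if __name__ == "__main__":
    for (source, target), count in sorted(build_citation_edges(POSTS).items()):
        print(source, "->", target, count)
```

Here the repeated citation of `blog_b` inside the first post contributes a single count, while the second post by `blog_a` adds another, so the pair accumulates one count per citing post.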

THE PROPOSED METHOD

Block diagram
Figure 2 shows the SAS-Cluster block diagram.
where ( ) ( ), thus one can determine which vertex is more important.

Definition 3 (Path Degree)
where the path is taken as the weighted shortest path between a pair of indirectly connected vertices. For the Structural/Attribute Similarity (SAS) of the proposed method, the Jaccard similarity coefficient is adopted to compute the similarity measure, as defined in Equation (7).
where $X$ and $Y$ are vertices with $X, Y \in V$. The Jaccard similarity in Equation (7) is a well-known similarity measure, so it is used here to determine the relevance among vertices. Two main similarity calculations are taken into account. The first, for direct connections, is Equation (8), which calculates the similarity between a pair of directly connected vertices.
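The two building blocks named here, the Jaccard coefficient and the weighted shortest path between indirectly connected vertices, can be sketched as below. The Jaccard formula |X ∩ Y| / |X ∪ Y| is standard; applying it to closed neighborhoods (a vertex together with its neighbors), treating edge weights as non-negative costs, and using Dijkstra's algorithm for the path are assumptions of this sketch, as are all function names and the toy graph.

```python
import heapq


def jaccard(x, y):
    """Jaccard coefficient |X ∩ Y| / |X ∪ Y| of two sets."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def neighbor_similarity(adj, u, v):
    """Jaccard similarity of the closed neighborhoods of u and v."""
    return jaccard(set(adj[u]) | {u}, set(adj[v]) | {v})


def shortest_path_length(adj, source, target):
    """Dijkstra: weight of the lightest path, or None if target is unreachable.

    adj: dict vertex -> dict neighbor -> non-negative edge weight.
    """
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry, a shorter route was already found
        for nxt, weight in adj[node].items():
            cand = dist + weight
            if cand < best.get(nxt, float("inf")):
                best[nxt] = cand
                heapq.heappush(heap, (cand, nxt))
    return None


# Toy weighted graph (hypothetical); "a" and "d" are only indirectly connected.
ADJ = {
    "a": {"b": 1.0, "c": 4.0},
    "b": {"a": 1.0, "c": 1.0, "d": 5.0},
    "c": {"a": 4.0, "b": 1.0, "d": 1.0},
    "d": {"b": 5.0, "c": 1.0},
}

if __name__ == "__main__":
    print(neighbor_similarity(ADJ, "a", "d"))
    print(shortest_path_length(ADJ, "a", "d"))
```

In the toy graph, `a` and `d` share the neighbors `b` and `c` but no edge, so their relation has to be scored through shared neighbors and through the path `a`, `b`, `c`, `d`.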
Next, the vertex attributes are considered. Each vertex is characterized by multiple attributes. Incorporating attribute similarity increases the cohesiveness among vertices. Each attribute value is either 1 or 0, reflecting the presence or absence of that attribute at a given vertex. The mathematical formulations of the attribute similarity are defined in Equation (11) and Equation (12). The algorithm produces $k$ clusters $C_1, C_2, \ldots, C_k$.
With reference to Figure 2, the SAS-Cluster algorithm requires two predetermined parameters: α and the number of clusters k. In step one, the algorithm reads the raw data to create the network and determines the number of attributes associated with each vertex. In step two, multiple paths among nodes are considered and several quantities must be computed: Mean Gravity, Equation (5), and Path Degree, Equation (6), are required to establish the SAS similarity in Equation (13). The distance values between each pair of vertices, Equation (14), are then computed from the SAS similarity results. All data are stored in the database.
In step four, the communities are extracted from the original graph according to the number of clusters k and the distance values. Finally, the results are evaluated using Density and Entropy, as given in Equations (1), (2), and (3).
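How α trades structure against attributes can be illustrated with a small sketch. It assumes, consistently with the experiments where α = 0 uses attributes only and α = 1 uses structure only, that the combined similarity is a convex combination of a structural score and an attribute score, and that distance is one minus similarity. The attribute score used here (share of matching 0/1 positions) and all function names are hypothetical stand-ins, not the paper's Equations (11) to (14).

```python
def attribute_similarity(a, b):
    """Share of positions where two equal-length 0/1 vectors agree."""
    if not a:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def sas_similarity(structural, attribute, alpha):
    """Convex combination: alpha weights structure, 1 - alpha attributes."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * structural + (1.0 - alpha) * attribute


def sas_distance(structural, attribute, alpha):
    """Distance derived from the combined similarity (assumed 1 - similarity)."""
    return 1.0 - sas_similarity(structural, attribute, alpha)


if __name__ == "__main__":
    structural = 0.5                                  # e.g. a Jaccard score
    attribute = attribute_similarity((1, 0), (1, 0))  # identical attributes
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, sas_distance(structural, attribute, alpha))
```

With identical attributes and a moderate structural score, the distance grows as α moves from 0 to 1, because the weaker structural evidence gradually replaces the perfect attribute match.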

EXPERIMENTAL RESULTS
In this section, extensive experiments are performed to evaluate the performance of the SAS-Cluster method. All experiments are conducted on a PC running Windows 10 Pro 64-bit with an i7-6700HQ CPU (2.60 GHz) and 16 GB of RAM. The programming environment is Python 3.6.2 (MSC v.1900 32 bit (Intel)).

Dataset
The Political Blogs dataset, a real network dataset, is used to evaluate and analyze the proposed method. The dataset is based on blog-to-blog connections [20]. It consists of 1,490 nodes, and each node carries an attribute describing its political leaning, which is either conservative or democrat.

Results
The proposed SAS-Cluster algorithm is evaluated against the state-of-the-art method W-Cluster [9] using two well-known evaluation measures, Density and Entropy. The density, given in Equation (1), reflects how tightly the vertices in each cluster are connected; a higher density value indicates a more cohesive community structure. The entropy, described in Equation (2) and Equation (3), rates the attribute relationship among vertices; low entropy indicates better relevance among the vertices in each cluster. Figure 3 and Figure 4 show the performance of SAS-Cluster in terms of Density and Entropy, where the number of clusters is $k = 3, 5, 7, 9$ and α is set in the range [0, 1], with $1 - \alpha$ as the complementary weight. The algorithm is run for at least three iterations.
Figure 3 reports the density values. When α is set to 0, the density value is the lowest, because the similarity of the structural topology is not taken into account. At $k = 3$, the density values decline when α is set to 0.6 or 0.7. At $k = 5, 7, 9$, the density values drop at the α threshold of 0.5.
Figure 4 reports the entropy values. The best values are obtained when α equals 0, since the algorithm then runs on attribute similarity only. In contrast, when α equals 1 the values are the worst, because attribute similarity is not taken into account. At $k = 3, 5, 7, 9$, the best results are obtained when α is set to 0.5 or 0.8, while the quality of the results tends to decrease when α exceeds 0.8. As illustrated in Figure 3 and Figure 4, SAS-Cluster performs best when α is either 0.5 or 0.8.
To show the effectiveness of the proposed method, SAS-Cluster is compared with the state-of-the-art method W-Cluster. Both methods are tested for a fixed number of clusters $k = 3, 5, 7, 9$, with α set to 0.5. Figure 5 and Figure 6 illustrate the comparison results for density and entropy, respectively. All results show that SAS-Cluster outperforms W-Cluster on both the density and the entropy measures.

CONCLUSION
Nowadays, social networks have become more influential on individuals' opinions, decisions, and lifestyles. Therefore, with the accelerating growth of social network data, it is important to adopt more reliable graph clustering methods for community detection. In this paper, a graph clustering method for community detection is proposed. The method introduces two concepts, Mean Gravity and Path Degree, to increase the structural similarity within the detected communities. In addition, the method combines structural similarity with the multiple attributes of nodes to attain more cohesive similarity. The experimental results have shown that SAS-Cluster performs better than W-Cluster according to the Density and Entropy evaluation measures.