Local network community detection with continuous optimization of conductance and weighted kernel k-means

Twan van Laarhoven and Elena Marchiori
(under review)

Abstract

Local network community detection is the task of finding a single community of nodes concentrated around few given seed nodes in a localized way. Conductance is a popular objective function used in many algorithms for local community detection. This paper studies a continuous relaxation of conductance. We show that continuous optimization of this objective still leads to discrete communities. We investigate the relation of conductance with weighted kernel k-means for a single community, which leads to the introduction of a new objective function, σ-conductance. Conductance is obtained by setting σ to 0. Two algorithms, EMcand {\pgd}, are proposed to locally optimize σ-conductance and automatically tune the parameter σ. They are based on expectation maximization and projected gradient descent, respectively. We prove locality and give performance guarantees for EMcand PGDcfor a class of dense and well separated communities centered around the seeds. Experiments are conducted on networks with ground-truth communities, comparing to state-of-the-art graph diffusion algorithms for conductance optimization. On large graphs, results indicate that EMcand PGDc stay localized and produce communities most similar to the ground, while graph diffusion algorithms generate large communities of lower quality.

Downloads

Source code

The source code can be used under a MIT style license. When using the method in a paper, please cite

Local network community detection with continuous optimization of conductance and weighted kernel k-means; Twan van Laarhoven, Elena Marchiori, (submitted) 2016

The code requires MATLAB or GNU Octave. Short summary of usage:

The load_dataset function can load any of the datasets. Example:
```
[A,C] = load_dataset('dblp');
```
The random_seeds function selects one or more random seeds from each cluster. Example:
```
R = random_seeds(C,1); % one seed per cluster
```
The optimize_cluster function performs clustering. By default the gradient descend method is used. Example
```
C2 = optimize_cluster(A,R);
C3 = optimize_cluster(A,R, 'method','em', 'sigma',0.2);
```
The run_methods_on_datasets function runs all experiments (this takes a long time). Subsequent calls to write_results_tables will write the results to a tex file (the tables included in the paper).

To use the heat kernel method the hkgrow matlab package must be installed and added to the path. After that it can be called from optimize_cluster(...,'method','hkgrow');

Datasets

We used datasets from the following sources:

Yeast Protein interaction based on Biogrid (version 3.4.13) and complexes from CYC2008.
The datasets can be downloaded in MATLAB matrix format here [1.6MB; zip].
This download contains the script used for constructing these datasets.
Other formats are available upon request.
Santo Fortunato's LFR benchmark generator (Package 1).
Stanford Network Analysis Project (SNAP). The source code contains a script for preprocessing these datasets to be suitable for MATLAB (this scrip requires perl).
flickr photo sharing dataset.
The datasets can be downloaded in MATLAB matrix format here [47MB; zip].
Mark Newman's network datasets.

Contact

Contact the authors at tvanlaarhoven@cs.ru.nl for any questions or comments.