GUISE - Uniform Sampling of Graphlets for Large Graph Analysis

[Abstract]

[1. Introduction]

key challenge for obtaining GFD : excessive computational cost
graphlet-size “five” is the best considering cost-benefit trade off
Existing tool regard to GFD (exact counting) GraphCrunch2 fails to count in real datas
exact counting is required for a few scenarios
- desining graph kernels
- resconstruct the counts from the count of a specific one graphlet
contribution
- propose GUISE: construct graphlet frequency distribution (use MCMC)

[2. Background]

GFD
- $log(\frac{f(i)+1}{\sum_{i=1}^{29} f(i)+29 })$ where f(i) is the frequency of graphlet $g_i$
- deal with graphlets of size 3,4 and 5
Markov chains
- random variable X has a range of values (states) that are defined in a state space S
- $X_t$ : X at (discrete) time t
- the random variable is called Markov process if the transition probabilities between a pair of states in S depends only on the current value(state) of X
- The matrix of transition probabilities (state i → state j) : T(i, j)
  - each entry is between 0 and 1
  - for a fixed i, the sum of T(i, j) is equal to 1
- said to reach stationary distribution $\pi$ when $\pi = \pi T$ (eigenvalue)
- said to reversible when $\pi(i)T(i,j)=\pi(j)T(j,i)$ (sufficient condition for $\pi$ to be a stationary distribution of the Markov chain)

[3. Method]

[A. Uniform samplming of graphlets for GFD construction]

the number of all graphlets : $\abs{S}$
choose each graphlet with the probability : $1/\abs{S}$
want to sample a graphlet uniformly without explicitly enumerating all the embeddings of all graphlets ⇒ solution : MCMC algorithms
MCMC : perform a random walk on the sample space with a locally computable transition probability matrix in such as a manner that the stationary distribution of the random walk aligns with the desired probability distribution
construction GFD by GUISE
- keeps one counter for each of the graphlets (a total of 29 counters all initialized to 1)
- calls the sampler repeatedly for a large number of iterations
Lemma 1: When the size of the sample set approaches to infinity, GUISE returns the correct GFD for a graph

[B. MCMC algorithm for Uniform Sampling of a Graphlet]

⇒ MCMC는 마르코프 체인에 기반한 확률 분포로부터 원하는 분포의 정적 분포를 갖는 표본을 추출하는 알고리즘의 한 종류이다

neighboring graphlets: For a k-graphlet, all the graphlets of size k-1, k, k+1 having k-1, k-1, k nodes in common, respectively, are its neighboring graph
define the sample space, the state transition process, the transition probability matrix, desired probability distribution
states : embeddings of any 3-, 4-, 5- graphlets
time T, state S → walks to one of neighboroing states with the probability that is defined by an appropriate state transition probability matrix T
add vertex → (k-1)- to k- graphlet
delete vertex → k- to (k-1)- graphlet
replace vertex → k- to k- graphlet
transition probability T of graphlet p and q that are not neighbors of each other : 0
for uniform sampling, the stationary distribiution : [1/m, .., 1/m] where $m=\abs{S}$
- to make this, symmetric transition probability might be helpful
- set T(p, q) and T(q, p) : = min(1/d(p), 1/d(q))
  - if the sum of row entries is not 1, allocate the remaining probability as a self-loop to the resident state
- symmetric transition probability matrix == doubly stochastic matrix
Lemma 2: An ergodic random walk achieves a uniform stationary distribution iff transition probability matrix is doubly stochastic
Lemma 3: The random walk that GUISE (self-loop) uses is ergodic
mixing time : since social network graphs have small diameter, the diameter of transition graph is small, so the mixing time of random walk is small → converge very fast

[C. pseudo-code]

Untitled