[Abstract]
- This paper deals with GFD (graphlet frequency distribution).
- Propose MCMC sampling method for constructing approx GFD.
[1. Introduction]
- key challenge for obtaining GFD : excessive computational cost
- graphlet-size “five” is the best considering cost-benefit trade off
- Existing tool regard to GFD (exact counting) GraphCrunch2 fails to count in real datas
- exact counting is required for a few scenarios
- desining graph kernels
- resconstruct the counts from the count of a specific one graphlet
- contribution
- propose GUISE: construct graphlet frequency distribution (use MCMC)
[2. Background]
- GFD
- $log(\frac{f(i)+1}{\sum_{i=1}^{29} f(i)+29 })$ where f(i) is the frequency of graphlet $g_i$
- deal with graphlets of size 3,4 and 5
- Markov chains
- random variable X has a range of values (states) that are defined in a state space S
- $X_t$ : X at (discrete) time t
- the random variable is called Markov process if the transition probabilities between a pair of states in S depends only on the current value(state) of X
- The matrix of transition probabilities (state i → state j) : T(i, j)
- each entry is between 0 and 1
- for a fixed i, the sum of T(i, j) is equal to 1
- said to reach stationary distribution $\pi$ when $\pi = \pi T$ (eigenvalue)
- said to reversible when $\pi(i)T(i,j)=\pi(j)T(j,i)$ (sufficient condition for $\pi$ to be a stationary distribution of the Markov chain)
[3. Method]
[A. Uniform samplming of graphlets for GFD construction]
- the number of all graphlets : $\abs{S}$
- choose each graphlet with the probability : $1/\abs{S}$
- want to sample a graphlet uniformly without explicitly enumerating all the embeddings of all graphlets ⇒ solution : MCMC algorithms
- MCMC : perform a random walk on the sample space with a locally computable transition probability matrix in such as a manner that the stationary distribution of the random walk aligns with the desired probability distribution
- construction GFD by GUISE
- keeps one counter for each of the graphlets (a total of 29 counters all initialized to 1)
- calls the sampler repeatedly for a large number of iterations
- Lemma 1: When the size of the sample set approaches to infinity, GUISE returns the correct GFD for a graph
[B. MCMC algorithm for Uniform Sampling of a Graphlet]
⇒ MCMC는 마르코프 체인에 기반한 확률 분포로부터 원하는 분포의 정적 분포를 갖는 표본을 추출하는 알고리즘의 한 종류이다
- neighboring graphlets: For a k-graphlet, all the graphlets of size k-1, k, k+1 having k-1, k-1, k nodes in common, respectively, are its neighboring graph
- define the sample space, the state transition process, the transition probability matrix, desired probability distribution
- states : embeddings of any 3-, 4-, 5- graphlets
- time T, state S → walks to one of neighboroing states with the probability that is defined by an appropriate state transition probability matrix T
- add vertex → (k-1)- to k- graphlet
- delete vertex → k- to (k-1)- graphlet
- replace vertex → k- to k- graphlet
- transition probability T of graphlet p and q that are not neighbors of each other : 0
- for uniform sampling, the stationary distribiution : [1/m, .., 1/m] where $m=\abs{S}$
- to make this, symmetric transition probability might be helpful
- set T(p, q) and T(q, p) : = min(1/d(p), 1/d(q))
- if the sum of row entries is not 1, allocate the remaining probability as a self-loop to the resident state
- symmetric transition probability matrix == doubly stochastic matrix
- Lemma 2: An ergodic random walk achieves a uniform stationary distribution iff transition probability matrix is doubly stochastic
- Lemma 3: The random walk that GUISE (self-loop) uses is ergodic
- mixing time : since social network graphs have small diameter, the diameter of transition graph is small, so the mixing time of random walk is small → converge very fast
[C. pseudo-code]
- choose $g_y$ with the probability of $1/\abs{d_{g_x}}$
- accept with the probability of $min(\abs{d_{g_x}}/\abs{d_{g_y}}, 1)$