[Abstract]

  • This paper deals with GFD (graphlet frequency distribution).
  • Propose MCMC sampling method for constructing approx GFD.

[1. Introduction]

  • key challenge for obtaining GFD : excessive computational cost
  • graphlet-size “five” is the best considering cost-benefit trade off
  • Existing tool regard to GFD (exact counting) GraphCrunch2 fails to count in real datas
  • exact counting is required for a few scenarios
    • desining graph kernels
    • resconstruct the counts from the count of a specific one graphlet
  • contribution
    • propose GUISE: construct graphlet frequency distribution (use MCMC)

[2. Background]

  • GFD
    • $log(\frac{f(i)+1}{\sum_{i=1}^{29} f(i)+29 })$ where f(i) is the frequency of graphlet $g_i$
    • deal with graphlets of size 3,4 and 5
  • Markov chains
    • random variable X has a range of values (states) that are defined in a state space S
    • $X_t$ : X at (discrete) time t
    • the random variable is called Markov process if the transition probabilities between a pair of states in S depends only on the current value(state) of X
    • The matrix of transition probabilities (state i → state j) : T(i, j)
      • each entry is between 0 and 1
      • for a fixed i, the sum of T(i, j) is equal to 1
    • said to reach stationary distribution $\pi$ when $\pi = \pi T$ (eigenvalue)
    • said to reversible when $\pi(i)T(i,j)=\pi(j)T(j,i)$ (sufficient condition for $\pi$ to be a stationary distribution of the Markov chain)

[3. Method]

[A. Uniform samplming of graphlets for GFD construction]

  • the number of all graphlets : $\abs{S}$
  • choose each graphlet with the probability : $1/\abs{S}$
  • want to sample a graphlet uniformly without explicitly enumerating all the embeddings of all graphlets ⇒ solution : MCMC algorithms
  • MCMC : perform a random walk on the sample space with a locally computable transition probability matrix in such as a manner that the stationary distribution of the random walk aligns with the desired probability distribution
  • construction GFD by GUISE
    • keeps one counter for each of the graphlets (a total of 29 counters all initialized to 1)
    • calls the sampler repeatedly for a large number of iterations
  • Lemma 1: When the size of the sample set approaches to infinity, GUISE returns the correct GFD for a graph

[B. MCMC algorithm for Uniform Sampling of a Graphlet]

⇒ MCMC는 마르코프 체인에 기반한 확률 분포로부터 원하는 분포의 정적 분포를 갖는 표본을 추출하는 알고리즘의 한 종류이다

  • neighboring graphlets: For a k-graphlet, all the graphlets of size k-1, k, k+1 having k-1, k-1, k nodes in common, respectively, are its neighboring graph
  • define the sample space, the state transition process, the transition probability matrix, desired probability distribution
  • states : embeddings of any 3-, 4-, 5- graphlets
  • time T, state S → walks to one of neighboroing states with the probability that is defined by an appropriate state transition probability matrix T
  • add vertex → (k-1)- to k- graphlet
  • delete vertex → k- to (k-1)- graphlet
  • replace vertex → k- to k- graphlet
  • transition probability T of graphlet p and q that are not neighbors of each other : 0
  • for uniform sampling, the stationary distribiution : [1/m, .., 1/m] where $m=\abs{S}$
    • to make this, symmetric transition probability might be helpful
    • set T(p, q) and T(q, p) : = min(1/d(p), 1/d(q))
      • if the sum of row entries is not 1, allocate the remaining probability as a self-loop to the resident state
    • symmetric transition probability matrix == doubly stochastic matrix
  • Lemma 2: An ergodic random walk achieves a uniform stationary distribution iff transition probability matrix is doubly stochastic
  • Lemma 3: The random walk that GUISE (self-loop) uses is ergodic
  • mixing time : since social network graphs have small diameter, the diameter of transition graph is small, so the mixing time of random walk is small → converge very fast

[C. pseudo-code]

Untitled

  • choose $g_y$ with the probability of $1/\abs{d_{g_x}}$
  • accept with the probability of $min(\abs{d_{g_x}}/\abs{d_{g_y}}, 1)$