randomized color coding algorithms for estimating graph motif counts

  1. Introduction
    • Graphlets, motifs: inuced subgraphs
    • approximate counting
    • goal: size가 5~6인 motif도 counting하는 practical algorithm 개발
    • ongoing research (sampling)
      • random walk - inefficient
      • CC algorithm - use color coding technique
        1. build an abstract “urn(바구니)” which conatins a sub population of all the k-trees of G
        2. task of sampling k-graphlets -(reduced)→ sampling k-trees from the urn

        ⇒ Therefore, 1. build-up phase, 2. sampling phase.

      → CC algorithm outperforms random walk

    • What problems CC have?
      1. time and space limit
      2. taking s samples gives additive 1/s- approximation ⇒ 1/s 이상의 fraction을 가진 motif만 count 가능
    • contributions of the paper
      1. 따라서 이 논문에서는 color coding 기법을 사용하여 단점을 극복하고자 함. How?
        1. 하나의 machine word로 up to 16개의 node를 가진 colored rooted trees를 표현하는 simple color coding data structure을 소개하고자 함 (build phase)
        2. large graph에서는, biased coloring trick을 제시하고자 함 (space-time trade-off)
        3. describe set of architectural and implementation optimizations
      2. for the sampling phase
        • 가장 frequent한 motif를 estimate하고, urn에서 제거. 이를 반복
        • small relative error가 있을 수 있음

        ⇒ AGS: adative graphlet sampling 알고리즘 제시

    • result of the work: MOTIVO
      • scalable, takes little time

1.1 Related work

  • ESCAPE: used for computing ground-truth counts here
  • sampling path or random walk: only counts frequencies
  • CC: not fittable for k>7
  • clique counting algorithm [16]: only for cliques
  1. Color coding and CC
    • detect paths and trees
    • consists of a build-up phases and a sampling phase

2.1 The build-up phase

  • goal: build a treelet count table (abstract urn, used for sampling)
  • steps
    1. do coloring of G
    2. look at treelets of G that are colorful (non-induced copies)
    3. for each v, initialize $c(T_c, v)=1$, where T is the trivial treelet on 1 node and $C=\set{c_v}$.
    4. count $c(T_C, v)$ via dynamic programming
      • Theorem 1.1: build-up takes time $O(a^k m)$ and space $O(a^kn)$.

2.2 The sampling phase

  • goal: sample colorful graphlet copies using treelet count table from the build-up phase
  • multi-stage
    1. draw a node and $T_C$.
    2. decompose $T_C$ and recursively sample a copy of $T_{C’}’$ and so on.
    3. combine

    ⇒ This gives a colorful copy of T drawn uniformly at random from G

    • Using $E[x_i]=c_i\sigma_i/t$ where $\sigma_i$: number of spanning trees in $H_i$, t: total number of colorful k-treelets (both can be computed quickly), let $\hat{c_i}=t\sigma_i^{-1}x_i$. (unbiased)
  • How to estimate the number of total (uncolored) copies $g_i$ of $H_i$?

    ⇒ $p_k$=possibility that a fixed subset of k nodes in G becomes colorful

    ⇒ $E[c_i]=p_kg_i$, so that $\hat{g_i}=c_i/p_k$. (unbiased)

  1. Speeding up color coding
    • data structures

3.1 Succint data structures

  • a hash table maps the pointer of each $T_C$ to the count $c(T_C, v)$.
  • The internals of CC
    • check-and-merge step: for every u~v, counts c(T’, v) and c(T’’, u) in the hash tables of v and u, check their intersection is emty, and merge

      → takes 30%~70% of build-up phase

  • Motivo’s treelet
    • encode T → binary string $s_T$
      • perform DFS starting from r and i-th bit becomes 0 or 1
      • merging T: concantenating 1, $s_{T’’}$, $s_{T’}$.
    • so we can get followings easily:
      • getsize()
      • merge(T’, T’’)
      • decomp(T)
      • sub(T)