randomized color coding algorithms for estimating graph motif counts
- Introduction
- Graphlets, motifs: inuced subgraphs
- approximate counting
- goal: size가 5~6인 motif도 counting하는 practical algorithm 개발
- ongoing research (sampling)
- random walk - inefficient
- CC algorithm - use color coding technique
- build an abstract “urn(바구니)” which conatins a sub population of all the k-trees of G
- task of sampling k-graphlets -(reduced)→ sampling k-trees from the urn
⇒ Therefore, 1. build-up phase, 2. sampling phase.
→ CC algorithm outperforms random walk
- What problems CC have?
- time and space limit
- taking s samples gives additive 1/s- approximation ⇒ 1/s 이상의 fraction을 가진 motif만 count 가능
- contributions of the paper
- 따라서 이 논문에서는 color coding 기법을 사용하여 단점을 극복하고자 함. How?
- 하나의 machine word로 up to 16개의 node를 가진 colored rooted trees를 표현하는 simple color coding data structure을 소개하고자 함 (build phase)
- large graph에서는, biased coloring trick을 제시하고자 함 (space-time trade-off)
- describe set of architectural and implementation optimizations
- for the sampling phase
- 가장 frequent한 motif를 estimate하고, urn에서 제거. 이를 반복
- small relative error가 있을 수 있음
⇒ AGS: adative graphlet sampling 알고리즘 제시
- 따라서 이 논문에서는 color coding 기법을 사용하여 단점을 극복하고자 함. How?
- result of the work: MOTIVO
- scalable, takes little time
1.1 Related work
- ESCAPE: used for computing ground-truth counts here
- sampling path or random walk: only counts frequencies
- CC: not fittable for k>7
- clique counting algorithm [16]: only for cliques
- Color coding and CC
- detect paths and trees
- consists of a build-up phases and a sampling phase
2.1 The build-up phase
- goal: build a treelet count table (abstract urn, used for sampling)
- steps
- do coloring of G
- look at treelets of G that are colorful (non-induced copies)
- for each v, initialize $c(T_c, v)=1$, where T is the trivial treelet on 1 node and $C=\set{c_v}$.
- count $c(T_C, v)$ via dynamic programming
- Theorem 1.1: build-up takes time $O(a^k m)$ and space $O(a^kn)$.
2.2 The sampling phase
- goal: sample colorful graphlet copies using treelet count table from the build-up phase
- multi-stage
- draw a node and $T_C$.
- decompose $T_C$ and recursively sample a copy of $T_{C’}’$ and so on.
- combine
⇒ This gives a colorful copy of T drawn uniformly at random from G
- Using $E[x_i]=c_i\sigma_i/t$ where $\sigma_i$: number of spanning trees in $H_i$, t: total number of colorful k-treelets (both can be computed quickly), let $\hat{c_i}=t\sigma_i^{-1}x_i$. (unbiased)
-
How to estimate the number of total (uncolored) copies $g_i$ of $H_i$?
⇒ $p_k$=possibility that a fixed subset of k nodes in G becomes colorful
⇒ $E[c_i]=p_kg_i$, so that $\hat{g_i}=c_i/p_k$. (unbiased)
- Speeding up color coding
- data structures
3.1 Succint data structures
- a hash table maps the pointer of each $T_C$ to the count $c(T_C, v)$.
- The internals of CC
-
check-and-merge step: for every u~v, counts c(T’, v) and c(T’’, u) in the hash tables of v and u, check their intersection is emty, and merge
→ takes 30%~70% of build-up phase
-
- Motivo’s treelet
- encode T → binary string $s_T$
- perform DFS starting from r and i-th bit becomes 0 or 1
- merging T: concantenating 1, $s_{T’’}$, $s_{T’}$.
- so we can get followings easily:
- getsize()
- merge(T’, T’’)
- decomp(T)
- sub(T)
- encode T → binary string $s_T$