randomized color coding algorithms for estimating graph motif counts
- Introduction
    
- Graphlets, motifs: inuced subgraphs
 - approximate counting
 - goal: size가 5~6인 motif도 counting하는 practical algorithm 개발
 - ongoing research (sampling)
        
- random walk - inefficient
 - CC algorithm - use color coding technique
            
- build an abstract “urn(바구니)” which conatins a sub population of all the k-trees of G
 - task of sampling k-graphlets -(reduced)→ sampling k-trees from the urn
 
⇒ Therefore, 1. build-up phase, 2. sampling phase.
 
→ CC algorithm outperforms random walk
 - What problems CC have?
        
- time and space limit
 - taking s samples gives additive 1/s- approximation ⇒ 1/s 이상의 fraction을 가진 motif만 count 가능
 
 - contributions of the paper
        
- 따라서 이 논문에서는 color coding 기법을 사용하여 단점을 극복하고자 함. How?
            
- 하나의 machine word로 up to 16개의 node를 가진 colored rooted trees를 표현하는 simple color coding data structure을 소개하고자 함 (build phase)
 - large graph에서는, biased coloring trick을 제시하고자 함 (space-time trade-off)
 - describe set of architectural and implementation optimizations
 
 - for the sampling phase
            
- 가장 frequent한 motif를 estimate하고, urn에서 제거. 이를 반복
 - small relative error가 있을 수 있음
 
⇒ AGS: adative graphlet sampling 알고리즘 제시
 
 - 따라서 이 논문에서는 color coding 기법을 사용하여 단점을 극복하고자 함. How?
            
 - result of the work: MOTIVO
        
- scalable, takes little time
 
 
 
1.1 Related work
- ESCAPE: used for computing ground-truth counts here
 - sampling path or random walk: only counts frequencies
 - CC: not fittable for k>7
 - clique counting algorithm [16]: only for cliques
 
- Color coding and CC
    
- detect paths and trees
 - consists of a build-up phases and a sampling phase
 
 
2.1 The build-up phase
- goal: build a treelet count table (abstract urn, used for sampling)
 - steps
    
- do coloring of G
 - look at treelets of G that are colorful (non-induced copies)
 - for each v, initialize $c(T_c, v)=1$, where T is the trivial treelet on 1 node and $C=\set{c_v}$.
 - count $c(T_C, v)$ via dynamic programming
        
- Theorem 1.1: build-up takes time $O(a^k m)$ and space $O(a^kn)$.
 
 
 
2.2 The sampling phase
- goal: sample colorful graphlet copies using treelet count table from the build-up phase
 - multi-stage
    
- draw a node and $T_C$.
 - decompose $T_C$ and recursively sample a copy of $T_{C’}’$ and so on.
 - combine
 
⇒ This gives a colorful copy of T drawn uniformly at random from G
- Using $E[x_i]=c_i\sigma_i/t$ where $\sigma_i$: number of spanning trees in $H_i$, t: total number of colorful k-treelets (both can be computed quickly), let $\hat{c_i}=t\sigma_i^{-1}x_i$. (unbiased)
 
 - 
    
How to estimate the number of total (uncolored) copies $g_i$ of $H_i$?
⇒ $p_k$=possibility that a fixed subset of k nodes in G becomes colorful
⇒ $E[c_i]=p_kg_i$, so that $\hat{g_i}=c_i/p_k$. (unbiased)
 
- Speeding up color coding
    
- data structures
 
 
3.1 Succint data structures
- a hash table maps the pointer of each $T_C$ to the count $c(T_C, v)$.
 - The internals of CC
    
- 
        
check-and-merge step: for every u~v, counts c(T’, v) and c(T’’, u) in the hash tables of v and u, check their intersection is emty, and merge
→ takes 30%~70% of build-up phase
 
 - 
        
 - Motivo’s treelet
    
- encode T → binary string $s_T$
        
- perform DFS starting from r and i-th bit becomes 0 or 1
 - merging T: concantenating 1, $s_{T’’}$, $s_{T’}$.
 
 - so we can get followings easily:
        
- getsize()
 - merge(T’, T’’)
 - decomp(T)
 - sub(T)
 
 
 - encode T → binary string $s_T$