$$ \newcommand{\floor}[1]{\left\lfloor #1 \right\rfloor} \newcommand{\ceil}[1]{\left\lceil #1 \right\rceil} \newcommand{\N}{\mathbb{N}} \newcommand{\R}{\mathbb{R}} \newcommand{\Z}{\mathbb{Z}} \newcommand{\Q}{\mathbb{Q}} \newcommand{\C}{\mathbb{C}} \renewcommand{\L}{\mathcal{L}} \newcommand{\x}{\times} \newcommand{\contra}{\scalebox{1.5}{$\lightning$}} \newcommand{\inner}[2]{\left\langle #1 , #2 \right\rangle} \newcommand{\st}{\text{ such that }} \newcommand{\for}{\text{ for }} \newcommand{\Setcond}[2]{ \left\{\, #1 \mid #2 \, \right\}} \newcommand{\setcond}[2]{\Setcond{#1}{#2}} \newcommand{\seq}[1]{ \left\langle #1 \right\rangle} \newcommand{\Set}[1]{ \left\{ #1 \right\}} \newcommand{\set}[1]{ \Set{#1} } \newcommand{\sgn}{\text{sign}} \newcommand{\halfline}{\vspace{0.5em}} \newcommand{\diag}{\text{diag}} \newcommand{\legn}[2]{\left(\frac{#1}{#2}\right)} \newcommand{\ord}{\text{ord}} \newcommand{\di}{\mathrel{|}} \newcommand{\gen}[1] \newcommand{\irr}{\mathrm{irr }} \renewcommand{\deg}{\mathrm{deg }} \newcommand{\nsgeq}{\trianglelefteq} \newcommand{\nsg}{\triangleleft} \newcommand{\argmin}{\mathrm{argmin}} \newcommand{\argmax}{\mathrm{argmax}} \newcommand{\minimize}{\mathrm{minimize}} \newcommand{\maximize}{\mathrm{maximize}} \newcommand{\subto}{\mathrm{subject\ to}} \newcommand{\DKL}[2]{D_{\mathrm{KL}}\left(#1 \di\di #2\right)} \newcommand{\ReLU}{\mathrm{ReLU}} \newcommand{\E}{\mathsf{E}} \newcommand{\V}{\mathsf{Var}} \newcommand{\Corr}{\mathsf{Corr}} \newcommand{\Cov}{\mathsf{Cov}} \newcommand{\covariance}[1]{\Cov\left(#1\right)} \newcommand{\variance}[1]{\V\left[#1\right]} \newcommand{\variancewith}[1]{\V\left[#1\right]} \newcommand{\expect}[1]{\E\left[#1\right]} \newcommand{\expectwith}[2]{\E_{#1}\left[#2\right]} \renewcommand{\P}{\mathsf{P}} \newcommand{\uniform}[2]{\mathrm{Uniform}\left(#1 \dots #2\right)} \newcommand{\gdist}[2]{\mathcal{N}\left(#1, #2\right)} \DeclarePairedDelimiter{\norm}{\lVert}{\rVert} $$ \everymath{\displaystyle}

MOTIVO - fast motif counting via succinct color coding and adaptive sampling | Hyunju Kim

MOTIVO - fast motif counting via succinct color coding and adaptive sampling

M. Bressan et al. / 2019 / VLDB

April 4, 2022

randomized color coding algorithms for estimating graph motif counts

Introduction
- Graphlets, motifs: inuced subgraphs
- approximate counting
- goal: size가 5~6인 motif도 counting하는 practical algorithm 개발
- ongoing research (sampling)
  - random walk - inefficient
  - CC algorithm - use color coding technique
    1. build an abstract “urn(바구니)” which conatins a sub population of all the k-trees of G
    2. task of sampling k-graphlets -(reduced)→ sampling k-trees from the urn
    ⇒ Therefore, 1. build-up phase, 2. sampling phase.
  → CC algorithm outperforms random walk
- What problems CC have?
  1. time and space limit
  2. taking s samples gives additive 1/s- approximation ⇒ 1/s 이상의 fraction을 가진 motif만 count 가능
- contributions of the paper
  1. 따라서 이 논문에서는 color coding 기법을 사용하여 단점을 극복하고자 함. How?
    1. 하나의 machine word로 up to 16개의 node를 가진 colored rooted trees를 표현하는 simple color coding data structure을 소개하고자 함 (build phase)
    2. large graph에서는, biased coloring trick을 제시하고자 함 (space-time trade-off)
    3. describe set of architectural and implementation optimizations
  2. for the sampling phase
    - 가장 frequent한 motif를 estimate하고, urn에서 제거. 이를 반복
    - small relative error가 있을 수 있음
    ⇒ AGS: adative graphlet sampling 알고리즘 제시
- result of the work: MOTIVO
  - scalable, takes little time

1.1 Related work

ESCAPE: used for computing ground-truth counts here
sampling path or random walk: only counts frequencies
CC: not fittable for k>7
clique counting algorithm [16]: only for cliques

Color coding and CC
- detect paths and trees
- consists of a build-up phases and a sampling phase

2.1 The build-up phase

goal: build a treelet count table (abstract urn, used for sampling)
steps
1. do coloring of G
2. look at treelets of G that are colorful (non-induced copies)
3. for each v, initialize $c(T_c, v)=1$, where T is the trivial treelet on 1 node and $C=\set{c_v}$.
4. count $c(T_C, v)$ via dynamic programming
  - Theorem 1.1: build-up takes time $O(a^k m)$ and space $O(a^kn)$.

2.2 The sampling phase

goal: sample colorful graphlet copies using treelet count table from the build-up phase
multi-stage
1. draw a node and $T_C$.
2. decompose $T_C$ and recursively sample a copy of $T_{C’}’$ and so on.
3. combine
⇒ This gives a colorful copy of T drawn uniformly at random from G
- Using $E[x_i]=c_i\sigma_i/t$ where $\sigma_i$: number of spanning trees in $H_i$, t: total number of colorful k-treelets (both can be computed quickly), let $\hat{c_i}=t\sigma_i^{-1}x_i$. (unbiased)
How to estimate the number of total (uncolored) copies $g_i$ of $H_i$?

⇒ $p_k$=possibility that a fixed subset of k nodes in G becomes colorful

⇒ $E[c_i]=p_kg_i$, so that $\hat{g_i}=c_i/p_k$. (unbiased)

Speeding up color coding
- data structures

3.1 Succint data structures

a hash table maps the pointer of each $T_C$ to the count $c(T_C, v)$.
The internals of CC
- check-and-merge step: for every u~v, counts c(T’, v) and c(T’’, u) in the hash tables of v and u, check their intersection is emty, and merge
  
  → takes 30%~70% of build-up phase
Motivo’s treelet
- encode T → binary string $s_T$
  - perform DFS starting from r and i-th bit becomes 0 or 1
  - merging T: concantenating 1, $s_{T’’}$, $s_{T’}$.
- so we can get followings easily:
  - getsize()
  - merge(T’, T’’)
  - decomp(T)
  - sub(T)