# Estimating Transitive Closure via Sampling

In this post, I describe an algorithm due to Edith Cohen that estimates the size of the transitive closure of a given directed graph in near-linear time. This simple but extremely clever algorithm uses ideas somewhat similar to the Flajolet–Martin algorithm for estimating the number of distinct elements in a stream, and to the MinHash sketch of Broder¹.

Suppose we have a large directed graph with $n$ vertices and $m$ directed edges. For a vertex $v$, let $R_v$ denote the set of vertices that are reachable from $v$. There are two well-known ways to compute the sets $R_v$ (all at the same time):

• Perform a Depth-First Search (DFS) from each vertex. This takes time $O(nm)$, which is the best known bound for sparse graphs;
• Use fast matrix multiplication, which takes time $O(n^{2.37\ldots})$. This algorithm is better for dense graphs.

Can we do better? It turns out we can, if we are OK with merely approximating the size of every $R_v$. Namely, the following theorem was proved back in 1994:

Theorem 1. There exists a randomized algorithm for computing a $(1 + \varepsilon)$-multiplicative approximation of every $|R_v|$ with running time $O(\varepsilon^{-2}\cdot m \cdot \mathrm{poly}(\log n))$.

Instead of spelling out the full proof, I will present it as a sequence of problems: each of them will likely be doable for a mathematically mature reader. Going through the problems should be fun, and besides, it will save me some typing.

Problem 1. Let $f \colon V \to [0, 1]$ be a function that assigns random independent and uniform reals between 0 and 1 to every vertex. Let us define $g(v) = \min_{w \in R_v} f(w)$. Show how to compute the values $g(v)$ for all vertices $v$ at once in time $m \cdot \mathrm{poly}(\log n)$.
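To make the statement concrete without spoiling the problem, here is a brute-force reference implementation of $f$ and $g$ (the function names and graph representation are my own). It runs one DFS per vertex, so it is far slower than the $m \cdot \mathrm{poly}(\log n)$ bound the problem asks for, but it is handy for testing a fast solution against:

```python
import random

def reachable(adj, v):
    """Return R_v, the set of vertices reachable from v (including v), via DFS."""
    seen = {v}
    stack = [v]
    while stack:
        u = stack.pop()
        for w in adj.get(u, []):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

def brute_force_g(adj, vertices, rng=random):
    """Draw f(v) ~ Uniform[0, 1] independently for each vertex and
    compute g(v) = min over w in R_v of f(w), naively."""
    f = {v: rng.random() for v in vertices}
    return {v: min(f[w] for w in reachable(adj, v)) for v in vertices}
```

Note that if $u$ is reachable from $v$ then $R_u \subseteq R_v$, so $g(v) \le g(u)$; this monotonicity is a quick sanity check for any implementation.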

Problem 2. For a positive integer $k$, let $U_k$ denote the distribution of the minimum of $k$ independent uniform reals between 0 and 1. Suppose we receive several independent samples from $U_k$ for an unknown value of $k$. Show that we can obtain a $(1 + \varepsilon)$-multiplicative approximation of $k$ with probability $1 - \delta$ using as few as $O(\log(1 / \delta) / \varepsilon^2)$ samples.
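As a hint (not necessarily the intended solution), one natural estimator uses the fact that $\mathbb{E}[U_k] = 1/(k+1)$: average the samples and invert. A quick empirical check of this idea, with made-up parameters:

```python
import random

def estimate_k(samples):
    """Estimate k from i.i.d. samples of U_k using E[U_k] = 1/(k+1):
    invert the empirical mean."""
    mean = sum(samples) / len(samples)
    return 1.0 / mean - 1.0

rng = random.Random(42)
k = 100
# Each sample is the minimum of k independent uniforms, i.e. a draw from U_k.
samples = [min(rng.random() for _ in range(k)) for _ in range(10000)]
est = estimate_k(samples)
```

Turning this into the stated $O(\log(1/\delta)/\varepsilon^2)$ guarantee (e.g. via a Chernoff bound or a median-of-means step) is the actual content of the problem.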

Problem 3. Combine the solutions of the two previous problems and prove Theorem 1.
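The key observation connecting the two problems is that $g(v)$ is exactly a sample from $U_{|R_v|}$, so repeating Problem 1 with fresh randomness yields the samples Problem 2 needs. Below is a brute-force end-to-end sketch (my own naming and parameters); it computes each $R_v$ by plain DFS where the real algorithm would use the fast routine from Problem 1:

```python
import random

def reachable_sets(adj, vertices):
    """Compute R_v for every vertex v by DFS (the slow baseline)."""
    sets = {}
    for v in vertices:
        seen = {v}
        stack = [v]
        while stack:
            u = stack.pop()
            for w in adj.get(u, []):
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        sets[v] = seen
    return sets

def estimate_reach_sizes(adj, vertices, num_trials, rng):
    """Estimate |R_v| for all v. Each trial draws a fresh f and records
    g(v) = min over w in R_v of f(w), which is a sample from U_{|R_v|};
    the sizes are then recovered via E[U_k] = 1/(k+1)."""
    R = reachable_sets(adj, vertices)  # stand-in for the fast Problem 1 routine
    mins = {v: [] for v in vertices}
    for _ in range(num_trials):
        f = {v: rng.random() for v in vertices}
        for v in vertices:
            mins[v].append(min(f[w] for w in R[v]))
    return {v: 1.0 / (sum(ms) / len(ms)) - 1.0 for v, ms in mins.items()}
```

With the fast Problem 1 routine plugged in, each trial costs $m \cdot \mathrm{poly}(\log n)$, and Problem 2 says $O(\log(1/\delta)/\varepsilon^2)$ trials suffice, which gives the running time claimed in Theorem 1 (with $\delta$ chosen as $1/\mathrm{poly}(n)$ for a union bound over all vertices).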

### Footnotes

1. These similarities explain my extreme enthusiasm towards the algorithm. Sketching-based techniques are useful for a problem covered in 6.006, yay!

Ilya

## 3 thoughts on “Estimating Transitive Closure via Sampling”

1. Urbanowicz says:

Why do we require $f$ to be special (in terms of randomness) in Problem 1? It seems like it will work for any reasonable $f$.

• Yes, the first problem can be solved for any $f$. But see Problem 2. :)

2. That’s a very nice summary of the paper: thanks!

I once asked on CSTheory how to do this. One comment pointed me to the paper you mention. There’s also an answer pointing to some more recent work on how to compute not just cardinalities but also the sets $R_v$: http://cstheory.stackexchange.com/q/553/236