In this post, I describe an algorithm of Edith Cohen, which estimates the size of the transitive closure of a given directed graph in near-linear time. This simple but extremely clever algorithm uses ideas somewhat similar to the Flajolet–Martin algorithm for estimating the number of distinct elements in a stream, and to the MinHash sketch of Broder^{1}.

Suppose we have a large directed graph with $n$ vertices and $m$ directed edges. For a vertex $v$, let $R(v)$ denote the set of vertices that are *reachable* from $v$. There are two known ways to compute the sets $R(v)$ (all at the same time):

- Perform Depth-First Search (DFS) from each vertex. This takes time $O(nm)$, which is the best known bound for *sparse* graphs;
- Use fast matrix multiplication, which takes time $O(n^{\omega})$, where $\omega \approx 2.37$ is the exponent of matrix multiplication. This algorithm is better for *dense* graphs.
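For concreteness, here is a minimal sketch of the first (DFS-based) baseline; the function name `reachable_set_sizes` and the edge-list input format are my own choices, not anything from the paper:

```python
from collections import defaultdict

def reachable_set_sizes(n, edges):
    """Exact |R(v)| for every vertex v, via one DFS per vertex.

    Total running time is O(n * (n + m)): each of the n searches
    touches every edge at most once.
    """
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
    sizes = []
    for s in range(n):
        seen = {s}          # a vertex is always reachable from itself
        stack = [s]
        while stack:
            u = stack.pop()
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        sizes.append(len(seen))
    return sizes
```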

Can we do better? It turns out we can, if we are OK with merely *approximating the size* of every $R(v)$. Namely, the following theorem was proved back in 1994:

**Theorem 1.** There exists a randomized algorithm for computing a $(1 \pm \varepsilon)$-multiplicative approximation of every $|R(v)|$ with running time $O((n + m) \log n / \varepsilon^2)$.

Instead of spelling out the full proof, I will present it as a sequence of *problems*: each of them will likely be doable for a mathematically mature reader. Going through the problems should be fun, and besides, it will save me some typing.

**Problem 1.** Let $f$ be a function that assigns random independent and uniform reals between 0 and 1 to every vertex. Let us define $g(v) = \min_{u \in R(v)} f(u)$. Show how to compute the values $g(v)$ *for all vertices $v$* *at once* in time $O((n + m) \log n)$.
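One possible solution sketch for Problem 1 (one way to do it, not necessarily the intended one): process vertices in increasing order of $f$, and from each still-unlabeled vertex search the *reverse* graph, stopping at vertices that already received a label. Every vertex is labeled exactly once, so the total time is $O(n \log n + m)$. The name `min_reachable_labels` is mine:

```python
import random
from collections import defaultdict

def min_reachable_labels(n, edges, f=None):
    """For each vertex v, compute g(v) = min of f over all of R(v).

    Vertices are processed in increasing order of their label f; a
    reverse-graph search from u assigns f(u) to every vertex that can
    reach u and is not yet labeled. Since smaller labels are handled
    first, the label a vertex receives is exactly its minimum.
    """
    if f is None:
        f = [random.random() for _ in range(n)]
    radj = defaultdict(list)            # reverse adjacency lists
    for u, v in edges:
        radj[v].append(u)
    g = [None] * n
    for u in sorted(range(n), key=lambda x: f[x]):
        if g[u] is not None:
            continue                    # already reached by a smaller label
        g[u] = f[u]
        stack = [u]
        while stack:
            x = stack.pop()
            for y in radj[x]:
                if g[y] is None:
                    g[y] = f[u]
                    stack.append(y)
    return g, f
```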

**Problem 2.** For a positive integer $k$, let $D_k$ denote the distribution of the minimum of $k$ independent and uniform reals between 0 and 1. Suppose we receive several independent samples from $D_k$ with an unknown value of $k$. Show that we can obtain a $(1 \pm \varepsilon)$-multiplicative approximation of $k$ with probability $1 - \delta$ using as few as $O(\log(1/\delta) / \varepsilon^2)$ samples.
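One natural estimator here (a sketch of the idea, not the full concentration argument the problem asks for): the minimum of $k$ uniform reals has expectation $1/(k+1)$, so inverting the sample mean recovers $k$ approximately. The helper `estimate_k` is my own name:

```python
import random

def estimate_k(samples):
    """Estimate k from i.i.d. samples of min(U_1, ..., U_k).

    Since E[min of k uniforms on [0, 1]] = 1/(k + 1), the inverse of
    the sample mean, minus one, is a natural estimator for k.
    """
    mean = sum(samples) / len(samples)
    return 1.0 / mean - 1.0

# Sanity check: draw s samples from D_k and recover k up to a few percent.
random.seed(0)
k, s = 50, 10000
samples = [min(random.random() for _ in range(k)) for _ in range(s)]
est = estimate_k(samples)
```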

**Problem 3.** Combine the solutions to the two previous problems and prove **Theorem 1**.
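Putting the two pieces together, a self-contained (and deliberately unoptimized) sketch of the whole estimator might look as follows. All names are mine; for every $v$, each run of the Problem-1 routine yields one sample of $g(v) \sim D_{|R(v)|}$, and the Problem-2 estimator turns $s$ such samples into an estimate of $|R(v)|$:

```python
import random
from collections import defaultdict

def estimate_reach_sizes(n, edges, s=200, seed=0):
    """Estimate |R(v)| for every vertex v of a directed graph.

    Repeat s times: draw fresh uniform labels f, compute
    g(v) = min of f over R(v) by processing vertices in increasing
    label order and searching the reverse graph (Problem 1). The s
    values of g(v) are i.i.d. samples from D_{|R(v)|}, so inverting
    their mean (Problem 2) estimates |R(v)|. Total time is
    O(s * (n log n + m)).
    """
    rng = random.Random(seed)
    radj = defaultdict(list)            # reverse adjacency lists
    for u, v in edges:
        radj[v].append(u)
    sums = [0.0] * n
    for _ in range(s):
        f = [rng.random() for _ in range(n)]
        g = [None] * n
        for u in sorted(range(n), key=lambda x: f[x]):
            if g[u] is not None:
                continue
            g[u] = f[u]
            stack = [u]
            while stack:
                x = stack.pop()
                for y in radj[x]:
                    if g[y] is None:
                        g[y] = f[u]
                        stack.append(y)
        for v in range(n):
            sums[v] += g[v]
    # E[g(v)] = 1/(|R(v)| + 1), so invert the sample mean.
    return [s / sums[v] - 1.0 for v in range(n)]
```

Driving the per-vertex error below $\varepsilon$ with high probability requires $s = O(\log n / \varepsilon^2)$ repetitions, which gives the running time in Theorem 1.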

### Footnotes

- These similarities explain my extreme enthusiasm towards the algorithm. Sketching-based techniques are useful for a problem covered in 6.006, yay!

—Ilya

Why do we require $f$ to be special (in terms of randomness) in Problem 1? It seems like it would work for any reasonable $f$.

Yes, the first problem can be solved for any $f$. But see Problem 2. :)

That’s a very nice summary of the paper: thanks!

I once asked on CSTheory how to do this. One comment pointed me to the paper you mention. There’s also an answer pointing to some more recent work on how to compute not just the cardinalities but also the sets $R(v)$: http://cstheory.stackexchange.com/q/553/236