In this post, I describe an algorithm of Edith Cohen that estimates the size of the transitive closure of a given directed graph in near-linear time. This simple but extremely clever algorithm uses ideas somewhat similar to the Flajolet–Martin algorithm for estimating the number of distinct elements in a stream, and to the MinHash sketch of Broder.[^1]
Suppose we have a large directed graph $G = (V, E)$ with $n$ vertices and $m$ directed edges. For a vertex $v \in V$, let $R(v)$ denote the set of vertices that are reachable from $v$. There are two known ways to compute the sets $R(v)$ (all at the same time):
- Perform a Depth-First Search (DFS) from each vertex. This takes time $O(n(n + m))$, which is the best known bound for sparse graphs;
- Use fast matrix multiplication, which takes time $O(n^{\omega})$, where $\omega < 2.38$ is the matrix multiplication exponent. This algorithm is better for dense graphs.
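For concreteness, here is a minimal sketch of the first approach (the function name `reachable_set_sizes` and its interface are my own illustrative choices):

```python
from collections import defaultdict

def reachable_set_sizes(n, edges):
    """Compute |R(v)| for every vertex v of a directed graph by running
    an iterative DFS from each vertex: O(n(n + m)) total time."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
    sizes = []
    for s in range(n):
        seen = {s}          # every vertex reaches itself
        stack = [s]
        while stack:
            u = stack.pop()
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        sizes.append(len(seen))
    return sizes
```

On the path $0 \to 1 \to 2$, for instance, this returns `[3, 2, 1]`.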
Can we do better? It turns out we can, if we are OK with merely approximating the size $|R(v)|$ of every reachability set. Namely, the following theorem was proved back in 1994:
Theorem 1. There exists a randomized algorithm that, with high probability, computes a $(1 \pm \varepsilon)$-multiplicative approximation of $|R(v)|$ for every vertex $v$, with running time $\tilde{O}\big((n + m) / \varepsilon^2\big)$, where $\tilde{O}(\cdot)$ hides factors polylogarithmic in $n$.
Instead of spelling out the full proof, I will present it as a sequence of problems: each of them will likely be doable for a mathematically mature reader. Going through the problems should be fun, and besides, it will save me some typing.
Problem 1. Let $f \colon V \to [0, 1]$ be a function that assigns random independent and uniform reals between 0 and 1 to every vertex. Let us define $g(v) = \min_{u \in R(v)} f(u)$. Show how to compute the values of $g(v)$ for all vertices at once in time $O((n + m) \log n)$.
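If you want to check your solution to Problem 1 (spoiler ahead), one standard approach is to sort the vertices by their $f$-values and flood-fill the *reversed* graph in increasing order, so that every vertex is labeled exactly once. A sketch, with names of my own choosing and vertices numbered $0, \dots, n-1$:

```python
import random
from collections import defaultdict

def min_reachable_labels(n, edges, seed=0):
    """Assign f(v) ~ U[0, 1] to each vertex and compute
    g(v) = min of f(u) over all u reachable from v.
    Vertices are processed in increasing order of f; a flood fill of the
    reversed graph labels every still-unlabeled vertex that can reach the
    current one.  Each vertex is labeled once and each reversed edge is
    scanned once, so the cost is an O(n log n) sort plus O(n + m)."""
    rng = random.Random(seed)
    f = [rng.random() for _ in range(n)]
    radj = defaultdict(list)              # reversed adjacency: u -> v stored as v -> u
    for u, v in edges:
        radj[v].append(u)
    g = [None] * n
    for u in sorted(range(n), key=f.__getitem__):
        if g[u] is not None:
            continue                      # already reaches a smaller f-value
        g[u] = f[u]
        stack = [u]
        while stack:
            w = stack.pop()
            for p in radj[w]:             # p has an edge into w, so p reaches u
                if g[p] is None:
                    g[p] = f[u]
                    stack.append(p)
    return f, g
```

Correctness follows by induction on the order of processing: when a vertex is labeled, no smaller $f$-value is reachable from it.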
Problem 2. For a positive integer $k$, denote by $\mathcal{D}_k$ the distribution of the minimum of $k$ independent and uniform reals between 0 and 1. Suppose we receive several independent samples from $\mathcal{D}_k$ with an unknown value of $k$. Show that we can obtain a $(1 \pm \varepsilon)$-multiplicative approximation of $k$ with probability $1 - \delta$ using as few as $O\big(\log(1/\delta) / \varepsilon^2\big)$ samples.
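For Problem 2, one convenient estimator (my choice for illustration; the original proof may use a different one) exploits the fact that if $X \sim \mathcal{D}_k$, then $-\log(1 - X)$ is exponentially distributed with rate $k$, since $\Pr[X > x] = (1 - x)^k$:

```python
import math
import random

def estimate_k(samples):
    """Estimate k from iid samples of the minimum of k uniform reals.
    Since -log(1 - X) ~ Exponential(k), the sum of s transformed samples
    concentrates around s / k, suggesting the estimate k_hat = s / sum."""
    total = sum(-math.log(1.0 - x) for x in samples)
    return len(samples) / total

# Simulated sanity check with a known k (an illustration, not a proof):
rng = random.Random(1)
samples = [min(rng.random() for _ in range(100)) for _ in range(4000)]
```

With 4000 samples and $k = 100$, the standard deviation of the estimate is roughly $100 / \sqrt{4000} \approx 1.6$, so `estimate_k(samples)` lands close to 100; standard Chernoff-type bounds for sums of exponentials turn this into the $O\big(\log(1/\delta)/\varepsilon^2\big)$ sample guarantee.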
Problem 3. Combine the solutions to the two previous problems and prove Theorem 1.
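For completeness, the pieces combine into the following self-contained sketch. The key observation is that $g(v)$ from Problem 1 is distributed exactly as $\mathcal{D}_{|R(v)|}$, so independent rounds of Problem 1 feed the estimator of Problem 2. The round count `s` and the exponential-transform estimator below are my illustrative choices; the proof sets the number of rounds to $O\big(\log(1/\delta)/\varepsilon^2\big)$ with $\delta = 1/\mathrm{poly}(n)$:

```python
import math
import random
from collections import defaultdict

def estimate_reachability_sizes(n, edges, s=64, seed=0):
    """Sketch of the size estimator: run s independent rounds; round j
    assigns fresh uniform labels f_j and computes
    g_j(v) = min of f_j over R(v) by flood-filling the reversed graph in
    increasing label order.  For each v the values g_j(v) are iid samples
    from D_{|R(v)|}, and |R(v)| is read off via the transform -log(1 - g)."""
    rng = random.Random(seed)
    radj = defaultdict(list)
    for u, v in edges:
        radj[v].append(u)
    totals = [0.0] * n                    # sum over rounds of -log(1 - g_j(v))
    for _ in range(s):
        f = [rng.random() for _ in range(n)]
        g = [None] * n
        for u in sorted(range(n), key=f.__getitem__):
            if g[u] is None:
                g[u] = f[u]
                stack = [u]
                while stack:
                    w = stack.pop()
                    for p in radj[w]:
                        if g[p] is None:
                            g[p] = f[u]
                            stack.append(p)
        for v in range(n):
            totals[v] += -math.log(1.0 - g[v])
    return [s / t for t in totals]        # estimate of each |R(v)|
```

Each round costs $O(n \log n + m)$, so with $s = O(\varepsilon^{-2} \log n)$ rounds the total running time matches the near-linear bound of Theorem 1.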
[^1]: These similarities explain my extreme enthusiasm towards the algorithm. Sketching-based techniques are useful for a problem covered in 6.006, yay!