# Estimating Transitive Closure via Sampling

In this post, I describe an algorithm of Edith Cohen that estimates the size of the transitive closure of a given directed graph in near-linear time. This simple but extremely clever algorithm uses ideas similar to those in the Flajolet–Martin algorithm for estimating the number of distinct elements in a stream, and in the MinHash sketch of Broder¹.

Suppose we have a large directed graph with $n$ vertices and $m$ directed edges. For a vertex $v$, let us denote by $R_v$ the set of vertices that are reachable from $v$. There are two known ways to compute the sets $R_v$ (for all vertices at once):

• Perform Depth-First Search (DFS) from each vertex. This takes time $O(nm)$, which is the best known bound for sparse graphs;
• Use fast matrix multiplication, which takes time $O(n^{2.37\ldots})$. This algorithm is better for dense graphs.
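For concreteness, here is a minimal sketch of the first (DFS-based) approach in Python; the graph representation and function name are my own choices, not from the original post:

```python
from collections import defaultdict

def reachable_sets(n, edges):
    """Compute R_v exactly for every vertex v by running a DFS from each
    vertex.  With n searches over m edges this takes O(n * m) time,
    matching the bound quoted above.  By convention, v is in R_v."""
    adj = defaultdict(list)
    for u, w in edges:
        adj[u].append(w)
    R = {}
    for v in range(n):
        seen = {v}
        stack = [v]
        while stack:
            u = stack.pop()
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        R[v] = seen
    return R
```

For the path graph $0 \to 1 \to 2$ this gives $R_0 = \{0, 1, 2\}$ and $R_2 = \{2\}$.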

Can we do better? It turns out we can, if we are OK with merely approximating the size of every $R_v$. Namely, the following theorem was proved back in 1994:

Theorem 1. There exists a randomized algorithm that computes a $(1 + \varepsilon)$-multiplicative approximation of every $|R_v|$ in time $\varepsilon^{-2}\cdot m \cdot \mathrm{poly}(\log n)$.

Instead of spelling out the full proof, I will present it as a sequence of problems: each of them will likely be doable for a mathematically mature reader. Going through the problems should be fun, and besides, it will save me some typing.

Problem 1. Let $f \colon V \to [0, 1]$ be a function that assigns random independent and uniform reals between 0 and 1 to every vertex. Let us define $g(v) = \min_{w \in R_v} f(w)$. Show how to compute values of $g(v)$ for all vertices $v$ at once in time $m \cdot \mathrm{poly}(\log n)$.
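(Spoiler alert: a solution sketch follows, so skip it if you want to solve the problem yourself.) The idea is to process the vertices in increasing order of their labels and to search backwards along reversed edges, never expanding a vertex that has already been assigned a value. The code below is my own illustration of this, with names chosen for readability:

```python
import random
from collections import defaultdict

def min_reachable_labels(n, edges, f=None):
    """For each vertex v, compute g(v) = min of f(w) over all w in R_v.

    Trick: process vertices in increasing order of their labels and search
    backwards (along reversed edges) from each, stopping at vertices that
    already have a value.  A vertex still unassigned when its turn comes
    cannot reach any smaller label, so g(v) = f(v) for it."""
    if f is None:
        f = {v: random.random() for v in range(n)}
    radj = defaultdict(list)          # reversed adjacency lists
    for u, w in edges:
        radj[w].append(u)
    g = {}
    for s in sorted(range(n), key=lambda v: f[v]):
        if s in g:
            continue
        # every yet-unassigned vertex that can reach s gets g = f[s]
        g[s] = f[s]
        stack = [s]
        while stack:
            u = stack.pop()
            for p in radj[u]:
                if p not in g:
                    g[p] = f[s]
                    stack.append(p)
    return f, g
```

Every vertex is expanded at most once, so, apart from sorting the labels, the total work is linear in $n + m$.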

Problem 2. For a positive integer $k$, denote by $U_k$ the distribution of the minimum of $k$ independent and uniform reals between 0 and 1. Suppose we receive several independent samples from $U_k$ with an unknown value of $k$. Show that we can obtain a $(1 + \varepsilon)$-multiplicative approximation of $k$ with probability $1 - \delta$ using as few as $O(\log(1 / \delta) / \varepsilon^2)$ samples.
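(Again, spoiler alert.) One route to Problem 2: if $X \sim U_k$, then $-\ln(1 - X)$ is exponentially distributed with rate $k$, so averaging these transformed samples and taking the reciprocal estimates $k$; a median over independent groups then gives the $\log(1/\delta)$ dependence. A sketch, with the number of groups as an illustrative parameter:

```python
import math
import random
from statistics import median

def sample_Uk(k):
    """One sample from U_k: the minimum of k independent Uniform(0,1) reals."""
    return min(random.random() for _ in range(k))

def estimate_k(samples, groups=9):
    """Median-of-means estimator for the unknown k.

    If X ~ U_k, then -ln(1 - X) is exponential with rate k and mean 1/k,
    so k is estimated as the reciprocal of the empirical mean of the
    transformed samples; the median over independent groups boosts the
    success probability, which is where log(1/delta) enters the bound."""
    size = len(samples) // groups
    ests = []
    for i in range(groups):
        chunk = samples[i * size:(i + 1) * size]
        mean = sum(-math.log(1.0 - x) for x in chunk) / len(chunk)
        ests.append(1.0 / mean)
    return median(ests)
```

For example, with $k = 100$ and a few thousand samples the estimate typically lands within a few percent of the truth.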

Problem 3. Combine the solutions to the two previous problems and prove Theorem 1.
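(Spoiler alert once more.) Putting everything together: each repetition draws fresh labels and computes $g(v)$ for all $v$; across repetitions, the values $g(v)$ are independent samples from $U_{|R_v|}$, to which the estimator from Problem 2 applies. A self-contained sketch of the whole pipeline (parameter values are illustrative, not tuned to the theorem's constants):

```python
import math
import random
from collections import defaultdict
from statistics import median

def estimate_reachability(n, edges, reps=600, groups=6):
    """Estimate |R_v| for every vertex v.

    Each repetition draws fresh random labels f and computes
    g(v) = min_{w in R_v} f(w) for all v via backward searches in
    increasing label order.  Across repetitions, the g(v) values are
    independent samples of the minimum of |R_v| uniforms, so |R_v| is
    recovered with a median-of-means estimator."""
    radj = defaultdict(list)
    for u, w in edges:
        radj[w].append(u)

    samples = defaultdict(list)       # samples[v] = list of g(v) values
    for _ in range(reps):
        f = [random.random() for _ in range(n)]
        g = {}
        for s in sorted(range(n), key=lambda v: f[v]):
            if s in g:
                continue
            g[s] = f[s]
            stack = [s]
            while stack:
                u = stack.pop()
                for p in radj[u]:
                    if p not in g:
                        g[p] = f[s]
                        stack.append(p)
        for v in range(n):
            samples[v].append(g[v])

    est = {}
    size = reps // groups
    for v in range(n):
        ests = []
        for i in range(groups):
            chunk = samples[v][i * size:(i + 1) * size]
            mean = sum(-math.log(1.0 - x) for x in chunk) / len(chunk)
            ests.append(1.0 / mean)
        est[v] = median(ests)
    return est
```

On the path graph $0 \to 1 \to \cdots \to 9$ this returns roughly $10$ for vertex $0$ and roughly $1$ for vertex $9$.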

### Footnotes

1. These similarities explain my extreme enthusiasm towards the algorithm. Sketching-based techniques are useful for a problem covered in 6.006, yay!

Ilya

# Better Circuit Lower Bounds for Explicit Functions

Below is a post by Ilya on a new circuit lower bound.

Let $f \colon \{0, 1\}^n \to \{0, 1\}$ be a Boolean function of $n$ arguments. What is the smallest circuit that can compute $f$ if the set of allowed gates consists of all the unary and binary functions? The size of a circuit is simply the number of gates in it.

It’s easy to show that a random function requires circuits of size $\Omega(2^n / n)$ (and this is tight in the worst case), but a big open question in computational complexity is to exhibit an explicit, simple enough function that requires large circuits to compute.

Basically, if one comes up with a function that lies in $\mathrm{NP}$ and requires circuits of size super-polynomial in $n$, then $\mathrm{P} \ne \mathrm{NP}$, and one of the Millennium Prize Problems is solved!

How far are we from accomplishing this? Well, until recently, the best lower bound for a function from NP was $3n$ [Blum 1984] (remember that eventually we are aiming at super-polynomial lower bounds!).

I’m very excited to report that very recently a better lower bound has been proved by Magnus Gausdal Find, Sasha Golovnev, Edward Hirsch and Sasha Kulikov! Is it super-polynomial in $n$? Is it at least super-linear in $n$?

Well, not really. The new lower bound is $3.011n$! You may ask: why should one be excited about such a seemingly weak improvement (which, nevertheless, took the authors several years of hard work, as far as I know)?

One reason is that this is the first progress on one of the central questions in computational complexity in more than 30 years. In general, researchers like to invent new models and problems and tend to forget good old “core” questions very easily. I’m happy to see this not being the case here.

Besides, the new result is, in fact, much more general than, say, Blum’s. The authors prove that any function that is non-constant on every affine subspace of sufficiently high dimension requires large circuits. Then they simply invoke a recent result of Ben-Sasson and Kopparty, who construct an explicit function with exactly this property. That is, the paper shows that a certain pseudo-randomness property is enough for better lower bounds. Hopefully, there will be more developments in this direction.

Is this result one more step towards proving $\mathrm{P} \ne \mathrm{NP}$? Time will tell.

# Beyond Locality-Sensitive Hashing

This is an extended version of (the only) post in my personal blog.

In this post I will introduce locality-sensitive hashing (for which Andrei Broder, Moses Charikar and Piotr Indyk have recently been awarded the Paris Kanellakis Theory and Practice Award) and sketch recent developments by Alexandr Andoni, Piotr Indyk, Nguyen Le Huy and myself (see also the video by Alexandr Andoni, where he explains the same result from a somewhat different perspective).

One problem that one encounters a lot in machine learning, databases and other areas is the near neighbor search problem (NN). Given a set of points $P$ in a $d$-dimensional space and a threshold $r > 0$, the goal is to build a data structure that, given a query $q$, reports any point from $P$ within distance at most $r$ from $q$.

Unfortunately, all known data structures for NN suffer from the so-called “curse of dimensionality”: if the query time is $o(n)$ (hereinafter we denote by $n$ the number of points in our dataset $P$), then either the space or the query time must be $2^{\Omega(d)}$.

To overcome this obstacle one can consider the approximate near neighbor search problem (ANN). Now, in addition to $P$ and $r$, we are also given an approximation parameter $c > 1$. The goal is, given a query $q$, to report a point from $P$ within distance $cr$ from $q$, provided that the neighborhood of radius $r$ is not empty.

It turns out that one can overcome the curse of dimensionality for ANN (see, for example, this paper and its references). If one insists on having near-linear (in $n$) memory and being subexponential in the dimension, then the only known technique for ANN is locality-sensitive hashing (LSH). Let us give some definitions. We say a hash family $\mathcal{H}$ on a metric space $\mathcal{M} = (X, D)$ is $(r, cr, p_1, p_2)$-sensitive if for every two points $x, y \in X$:

• if $D(x, y) \leq r$, then $\mathrm{Pr}_{h \sim \mathcal{H}}[h(x) = h(y)] \geq p_1$;
• if $D(x, y) \geq cr$, then $\mathrm{Pr}_{h \sim \mathcal{H}}[h(x) = h(y)] \leq p_2$.

Of course, for $\mathcal{H}$ to be meaningful, we should have $p_1 > p_2$. Informally speaking, the closer two points are, the more likely they are to collide.

Let us construct a simple LSH family for hypercube $\{0, 1\}^d$, equipped with Hamming distance. We set $\mathcal{H} = \{h_1, h_2, \ldots, h_d\}$, where $h_i(x) = x_i$. It is easy to check that this family is $(r, cr, 1 - r / d, 1 - cr / d)$-sensitive.
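To see why: a uniformly random $h \in \mathcal{H}$ just reads a uniformly random coordinate, so two points at Hamming distance $t$ collide with probability exactly $(d - t)/d$. A quick empirical check (the function name is mine):

```python
import random

def bit_sample_collision(x, y, trials=20000):
    """Empirically estimate Pr_{h ~ H}[h(x) = h(y)] for the bit-sampling
    family H = {h_1, ..., h_d} with h_i(x) = x_i: drawing a uniformly
    random h means picking a uniformly random coordinate i."""
    d = len(x)
    hits = 0
    for _ in range(trials):
        i = random.randrange(d)
        if x[i] == y[i]:
            hits += 1
    return hits / trials
```

For two points in $\{0,1\}^{10}$ at Hamming distance 3, the estimate concentrates around $1 - 3/10 = 0.7$.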

In 1998 Piotr Indyk and Rajeev Motwani proved the following theorem. Suppose we have a $(r, cr, p_1, p_2)$-sensitive hash family $\mathcal{H}$ for the metric we want to solve ANN for. Moreover, assume that we can sample and evaluate a function from $\mathcal{H}$ relatively quickly, store it efficiently, and that $p_1 = 1 / n^{o(1)}$. Then, one can solve ANN for this metric with space roughly $O(n^{1 + \rho})$ and query time $O(n^{\rho})$, where $\rho = \ln(1 / p_1) / \ln(1 / p_2)$. Plugging in the family from the previous paragraph, we are able to solve ANN for Hamming distance in space around $O(n^{1 + 1 / c})$ and query time $O(n^{1/c})$. More generally, in the same paper it was proved that one can achieve $\rho \leq 1 / c$ for $\ell_p$ norms with $1 \leq p \leq 2$ (via an embedding by William Johnson and Gideon Schechtman). In 2006 Alexandr Andoni and Piotr Indyk proved that one can achieve $\rho \leq 1 / c^2$ for the $\ell_2$ norm.
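As a worked example, here is what $\rho = \ln(1/p_1)/\ln(1/p_2)$ evaluates to for the bit-sampling family, with $p_1 = 1 - r/d$ and $p_2 = 1 - cr/d$; since $\ln(1 - x) \approx -x$ for small $x$, the ratio approaches $1/c$ as $r/d \to 0$:

```python
import math

def rho(r_over_d, c):
    """rho = ln(1/p1) / ln(1/p2) for the bit-sampling family,
    with p1 = 1 - r/d and p2 = 1 - c*r/d (requires c * r/d < 1)."""
    p1 = 1.0 - r_over_d
    p2 = 1.0 - c * r_over_d
    return math.log(1.0 / p1) / math.log(1.0 / p2)
```

For instance, `rho(0.001, 2)` comes out slightly below the bound $1/c = 0.5$.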

Thus, a natural question arises: how optimal are the abovementioned bounds on $\rho$ (provided that $p_1$ is not too tiny)? This question was resolved in 2011 by Ryan O’Donnell, Yi Wu and Yuan Zhou: they showed a lower bound of $\rho \geq 1/c - o(1)$ for $\ell_1$ and $\rho \geq 1/c^2 - o(1)$ for $\ell_2$, matching the upper bounds. So the simple LSH family for the hypercube described above is, in fact, optimal!

Is it the end of the story? Not quite. The catch is that the definition of LSH families is actually too strong. The real property that is used in the ANN data structure is the following: for every pair of points $x \in P, y \in X$ we have

• if $D(x, y) \leq r$, then $\mathrm{Pr}_{h \sim \mathcal{H}}[h(x) = h(y)] \geq p_1$;
• if $D(x, y) \geq cr$, then $\mathrm{Pr}_{h \sim \mathcal{H}}[h(x) = h(y)] \leq p_2$.

The difference from the definition of a $(r, cr, p_1, p_2)$-sensitive family is that we now restrict one of the points to lie in the prescribed set $P$. And it turns out that one can indeed exploit this dependency on the data to get a slightly improved LSH family. Namely, we are able to achieve $\rho \leq 7 / (8c^2) + O(1 / c^3) + o(1)$ for $\ell_2$, which by a simple embedding of $\ell_1$ into $\ell_2$-squared gives $\rho \leq 7 / (8c) + O(1 / c^{3/2}) + o(1)$ for $\ell_1$ (in particular, for Hamming distance over the hypercube). This is nice for two reasons. First, we are able to overcome the natural LSH barrier. Second, this result shows that what “practitioners” have been doing for some time (namely, data-dependent space partitioning) can give an advantage in theory, too.

In the remaining text let me briefly sketch the main ideas of the result. From now on, assume that our metric is $\ell_2$. The first ingredient is an LSH family that simplifies and improves upon $\rho = 1 / c^2$ for the case, when all data points and queries lie in a ball of radius $O(cr)$. This scheme has strong parallels with an SDP rounding scheme of David Karger, Rajeev Motwani and Madhu Sudan.

The second (and the main) ingredient is a two-level hashing scheme that leverages the abovementioned better LSH family. First, let us recall how the standard LSH data structure works. We start from a $(r, cr, p_1, p_2)$-sensitive family $\mathcal{H}$ and then consider the following simple “tensoring” operation: we sample $k$ functions $h_1, h_2, \ldots, h_k$ from $\mathcal{H}$ independently and then hash a point $x$ into the tuple $(h_1(x), h_2(x), \ldots, h_k(x))$. It is easy to see that the new family is $(r, cr, p_1^k, p_2^k)$-sensitive. Let us denote this family by $\mathcal{H}^k$. Now we choose $k$ to achieve the following collision probabilities:

• $1/n$ at distance $cr$;
• $1/n^{\rho}$ at distance $r$

(actually, we cannot set $k$ to achieve these probabilities exactly, since $k$ must be an integer; that’s exactly why we need the condition $p_1 = 1 / n^{o(1)}$). Now we hash all the points from the dataset using a random function from $\mathcal{H}^k$, and to answer a query $q$ we hash $q$ and enumerate all the points in the corresponding bucket until we find anything within distance $cr$. To analyze this simple data structure, we observe that the expected number of “outliers” (points at distance more than $cr$) we encounter is at most one, due to the choice of $k$. On the other hand, we find any near neighbor (within distance at most $r$) with probability at least $n^{-\rho}$, so, to boost this to a constant, we build $O(n^{\rho})$ independent hash tables. As a result, we get a data structure with space $O(n^{1 + \rho})$ and query time $O(n^{\rho})$.
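The scheme just described can be sketched in code; the following is a minimal, unoptimized illustration for Hamming distance with the bit-sampling family (the class and parameter names are my own):

```python
import random

class HammingLSH:
    """A sketch of the standard LSH data structure for Hamming distance:
    each of the L tables hashes a point by k random coordinates, which
    is exactly the "tensoring" H^k of the bit-sampling family."""

    def __init__(self, points, k, L):
        d = len(points[0])
        # one tuple of k sampled coordinates per table
        self.keys = [tuple(random.randrange(d) for _ in range(k))
                     for _ in range(L)]
        self.tables = []
        for key in self.keys:
            table = {}
            for p in points:
                table.setdefault(tuple(p[i] for i in key), []).append(p)
            self.tables.append(table)

    def query(self, q, cr):
        """Scan the L buckets q falls into; return some point within
        distance cr of q, or None."""
        for key, table in zip(self.keys, self.tables):
            for p in table.get(tuple(q[i] for i in key), []):
                if sum(a != b for a, b in zip(p, q)) <= cr:
                    return p
        return None
```

Here $k$ and $L = O(n^{\rho})$ would be chosen as in the analysis above; the sketch leaves them as explicit parameters.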

Now let us show how to build a similar two-level data structure, which achieves somewhat better parameters. First, we apply the LSH family $\mathcal{H}$ for $\ell_2$ with $\rho \approx 1/c^2$, but only partially. Namely, we choose a constant parameter $\tau > 1$ and $k$ such that the collision probabilities are as follows:

• $1/n$ at distance $\tau cr$;
• $1/n^{1/\tau^2}$ at distance $cr$;
• $1/n^{1/(\tau c)^2}$ at distance $r$.

Now we hash all the data points with $\mathcal{H}^k$ and argue that with high probability every bucket has diameter $O(cr)$. But now given this “bounded buckets” condition, we can utilize the better family designed above! Namely, we hash every bucket using our new family to achieve the following probabilities:

• $1/n^{1 - 1/\tau^2}$ at distance $cr$;
• $1/n^{(1 - \Omega_{\tau}(1))(1 - 1 / \tau^2) / c^2}$ at distance $r$.

Overall, the data structure consists of an outer hash table that uses the LSH family of Andoni and Indyk, and then every bucket is hashed using the new family. Due to independence, the collision probabilities multiply, and we get

• $1/n$ at distance $cr$;
• $1/n^{(1 - \Omega_{\tau}(1)) / c^2}$ at distance $r$.
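As a sanity check on the exponent bookkeeping (probabilities multiply, so exponents of $1/n$ add), one can plug in concrete numbers; `eps` below stands for the unspecified $\Omega_{\tau}(1)$ gain of the inner family, and the values are purely illustrative:

```python
# Exponents of 1/n for the two levels, at distances cr and r.
tau, c, eps = 2.0, 3.0, 0.125   # illustrative values, not from the paper

outer_cr, inner_cr = 1 / tau**2, 1 - 1 / tau**2
outer_r = 1 / (tau * c)**2
inner_r = (1 - eps) * (1 - 1 / tau**2) / c**2

exp_cr = outer_cr + inner_cr    # combined exponent at distance cr
exp_r = outer_r + inner_r       # combined exponent at distance r

print(exp_cr)        # 1.0, i.e. collision probability 1/n
print(exp_r * c**2)  # ≈ 1 - eps * (1 - 1/tau^2) < 1
```

The exponent at distance $cr$ comes out to exactly $1$, and the exponent at distance $r$ is $(1 - \varepsilon(1 - 1/\tau^2))/c^2$, i.e. strictly below $1/c^2$, as claimed.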

Then we argue as before and conclude that we can achieve $\rho \leq (1 - \Omega(1)) / c^2$.

After carefully optimizing all the parameters, we achieve, in fact, $\rho \approx 7 / (8c^2)$. Then we go further, and consider a multi-level scheme with several distance scales. Choosing these scales carefully, we achieve $\rho \approx 1 / (2 c^2 \ln 2)$.