# Estimating Transitive Closure via Sampling

In this post, I describe an algorithm of Edith Cohen, which estimates the size of the transitive closure of a given directed graph in near-linear time. This simple, but extremely clever algorithm uses ideas somewhat similar to the algorithm of Flajolet–Martin for estimating the number of distinct elements in a stream, and to MinHash sketch of Broder1.

Suppose we have a large directed graph with $n$ vertices and $m$ directed edges. For a vertex $v$, let us denote $R_v$ the set of vertices that are reachable from $v$. There are two known ways to compute sets $R_v$ (all at the same time):

• Perform Depth-First Search (DFS) from each vertex. This takes time $O(nm)$, which is the best known bound for sparse graphs;
• Use fast matrix multiplication, which takes time $O(n^{2.37\ldots})$. This algorithm is better for dense graphs.

Can we do better? Turns out we can, if we are OK with merely approximating the size of every $R_v$. Namely, the following theorem was proved back in 1994:

Theorem 1. There exists a randomized algorithm for computing $(1 + \varepsilon)$-multiplicative approximation for every $|R_v|$ with running time $\varepsilon^{-2}\cdot m \cdot \mathrm{poly}(\log n)$.

Instead of spelling out the full proof, I will present it as a sequence of problems: each of them will likely be doable for a mathematically mature reader. Going through the problems should be fun, and besides, it will save me some typing.

Problem 1. Let $f \colon V \to [0, 1]$ be a function that assigns random independent and uniform reals between 0 and 1 to every vertex. Let us define $g(v) = \min_{w \in R_v} f(w)$. Show how to compute values of $g(v)$ for all vertices $v$ at once in time $m \cdot \mathrm{poly}(\log n)$.

Problem 2. For a positive integer $k$, denote $U_k$ the distribution of the minimum of $k$ independent and uniform reals between 0 and 1. Suppose we receive several independent samples from $U_k$ with an unknown value of $k$. Show that we can obtain a $(1 + \varepsilon)$-multiplicative approximation of $k$ with probability $1 - \delta$ using as few as $O(\log(1 / \delta) / \varepsilon^2)$ samples.

Problem 3. Combine the solutions for two previous problems and prove Theorem 1.

### Footnotes

1. These similarities explain my extreme enthusiasm towards the algorithm. Sketching-based techniques are useful for a problem covered in 6.006, yay!

Ilya

# Distribution Testing: Do It With Class!

After a long break since the last time we touched upon distribution testing, we are back to discuss two papers that appeared this month on the arXiv — and hopefully some of their implications.

• [ADK15] Optimal testing for properties of distributions.
• [CDRG15] Testing Shape Restrictions of Discrete Distributions.

# Sketching and Embedding are Equivalent for Norms

## Summary

In this post I will show that any normed space that allows good sketches is necessarily embeddable into an $\ell_p$ space with $p$ close to $1$. This provides a partial converse to a result of Piotr Indyk, who showed how to sketch metrics that embed into $\ell_p$ for $0 < p \le 2$. A cool bonus of this result is that it gives a new technique for obtaining sketching lower bounds.

This result appeared in a recent paper of mine that is a joint work with Alexandr Andoni and Robert Krauthgamer. I am pleased to report that it has been accepted to STOC 2015.

## Sketching

One of the exciting relatively recent paradigms in algorithms is that of sketching. The high-level idea is as follows: if we are interested in working with a massive object $x$, let us start with compressing it to a short sketch $\mathrm{sketch}(x)$ that preserves properties of $x$ we care about. One great example of sketching is the Johnson-Lindenstrauss lemma: if we work with $n$ high-dimensional vectors and are interested in Euclidean distances between them, we can project the vectors on a random $O(\varepsilon^{-2} \cdot \log n)$-dimensional subspace, and this will preserve with high probability all the pairwise distances up to a factor of $1 + \varepsilon$.

It would be great to understand, for which computational problems sketching is possible, and how efficient it can be made. There are quite a few nice results (both upper and lower bounds) along these lines (see, e.g., graph sketching or a recent book about sketching for numerical linear algebra), but the general understanding has yet to emerge.

## Sketching for metrics

One of the main motivations to study sketching is fast computation and indexing of similarity measures $\mathrm{sim}(x, y)$ between two objects $x$ and $y$. Often times similarity between objects is modeled by some metric $d(x, y)$ (but not always! think KL divergence): for instance the above example of the Euclidean distance falls into this category. Thus, instantiating the above general question one can ask: for which metric spaces there exist good sketches? That is, when is it possible to compute a short sketch $\mathrm{sketch}(x)$ of a point $x$ such that, given two sketches $\mathrm{sketch}(x)$ and $\mathrm{sketch}(y)$, one is able to estimate the distance $d(x, y)$?

The following communication game captures the question of sketching metrics. Alice and Bob each have a point from a metric space $X$ (say, $x$ and $y$, respectively). Suppose, in addition, that either $d_X(x, y) \le r$ or $d_X(x, y) > D \cdot r$ (where $r$ and $D$ are the parameters known from the beginning). Both Alice and Bob send messages $\mathrm{sketch}(x)$ and $\mathrm{sketch}(y)$ that are $s$ bits long to Charlie, who is supposed to distinguish two cases (whether $d_X(x, y)$ is small or large) with probability at least $0.99$. We assume that all three parties are allowed to use shared randomness. Our main goal is to understand the trade-off between $D$ (approximation) and $s$ (sketch size).

Arguably, the most important metric spaces are $\ell_p$ spaces. Formally, for $1 \leq p \leq \infty$ we define $\ell_p^d$ to be a $d$-dimensional space equipped with distance

$\|x - y\|_p = \Bigl(\sum_{i=1}^d |x_i - y_i|^p\Bigr)^{1/p}$

(when $p = \infty$ this expression should be understood as $\max_{1 \leq i \leq d} |x_i - y_i|$). One can similarly define $\ell_p$ spaces for $0 < p < 1$; even if the triangle inequality does not hold for this case, it is nevertheless a meaningful notion of distance.

It turns out that $\ell_p$ spaces exhibit very interesting behavior, when it comes to sketching. Indyk showed that for $0 < p \le 2$ one can achieve approximation $D = 1 + \varepsilon$ and sketch size $s = O(1 / \varepsilon^2)$ for every $\varepsilon > 0$ (for $1 \le p \le 2$ this was established before by Kushilevitz, Ostrovsky and Rabani). It is quite remarkable that these bounds do not depend on the dimension of a space. On the other hand, for $\ell_p^d$ with $p > 2$ the dependence on the dimension is necessary. It turns out that for constant approximation $D = O(1)$ the optimal sketch size is $\widetilde{\Theta}(d^{1 - 2/p})$.

Are there any other examples of metrics that admit efficient sketches (say, with constant $D$ and $s$)? One simple observation is that if a metric embeds well into $\ell_p$ for $0 < p \le 2$, then one can sketch this metric well. Formally, we say that a map between metric spaces $f \colon X \to Y$ is an embedding with distortion $\widetilde{D}$, if

$d_X(x_1, x_2) \leq C \cdot d_Y\bigl(f(x_1), f(x_2)\bigr) \leq \widetilde{D} \cdot d_X(x_1, x_2)$

for every $x_1, x_2 \in X$ and for some $C > 0$. It is immediate to see that if a metric space $X$ embeds into $\ell_p$ for $0 < p \le 2$ with distortion $O(1)$, then one can sketch $X$ with $s = O(1)$ and $D = O(1)$. Thus, we know that any metric that embeds well into $\ell_p$ with $0 < p \le 2$ is efficiently sketchable. Are there any other examples? The amazing answer is that we don’t know!

## Our results

Our result shows that for a very important class of metrics—normed spaces—embedding into $\ell_p$ is the only possible way to obtain good sketches. Formally, if a normed space $X$ allows sketches of size $s$ for approximation $D$, then for every $\varepsilon > 0$ the space $X$ embeds into $\ell_{1 - \varepsilon}$ with distortion $O(sD / \varepsilon)$. This result together with the above upper bound by Indyk provides a complete characterization of normed spaces that admit good sketches.

Taking the above result in the contrapositive, we see that non-embeddability implies lower bounds for sketches. This is great, since it potentially allows us to employ many sophisticated non-embeddability results proved by geometers and functional analysts. Specifically, we prove two new lower bounds for sketches: for the planar Earth Mover’s Distance (building on a non-embeddability theorem by Naor and Schechtman) and for the trace norm (non-embeddability was proved by Pisier). In addition to it, we are able to unify certain known results: for instance, classify $\ell_p$ spaces and the cascaded norms in terms of “sketchability”.

## Overview of the proof

Let me outline the main steps of the proof of the implication “good sketches imply good embeddings”. The following definition is central to the proof. Let us call a map $f \colon X \to Y$ between two metric spaces $(s_1, s_2, \tau_1, \tau_2)$-threshold, if for every $x_1, x_2 \in X$:

• $d_X(x_1, x_2) \leq s_1$ implies $d_Y\bigl(f(x_1), f(x_2)\bigr) \leq \tau_1$,
• $d_X(x_1, x_2) \geq s_2$ implies $d_Y\bigl(f(x_1), f(x_2)\bigr) \geq \tau_2$.

One should think of threshold maps as very weak embeddings that merely
preserve certain distance scales.

The proof can be divided into two parts. First, we prove that for a normed space $X$ that allows sketches of size $s$ and approximation $D$ there exists a $(1, O(sD), 1, 10)$-threshold map to a Hilbert space. Then, we prove that the existence of such a map implies the existence of an embedding into $\ell_{1 - \varepsilon}$ with distortion $O(sD / \varepsilon)$.

The first half goes roughly as follows. Assume that there is no $(1, O(sD), 1, 10)$-threshold map from $X$ to a Hilbert space. Then, by convex duality, this implies certain Poincaré-type inequalities on $X$. This, in turn, implies sketching lower bounds for $\ell_{\infty}^k(X)$ (the direct sum of $k$ copies of $X$, where the norm is definied as the maximum of norms of the components) by a result of Andoni, Jayram and Pătrașcu (which is based on a very important notion of information complexity). Then, crucially using the fact that $X$ is a normed space, we conclude that $X$ itself does not have good sketches (this step follows from the fact that every normed space is of type $1$ and is of cotype $\infty$).

The second half uses tools from nonlinear functional analysis. First, building on an argument of Johnson and Randrianarivony, we show that for normed spaces $(1, O(sD), 1, 10)$-threshold map into a Hilbert space implies a uniform embedding into a Hilbert space—that is, a map $f \colon X \to H$, where $H$ is a Hilbert space such that

$L\bigl(\|x_1 - x_2\|_X\bigr) \leq \bigl\|f(x_1) - f(x_1)\bigr\|_H \leq U\bigl(\|x_1 - x_2\|_X\bigr),$

where $L, U \colon [0; \infty) \to [0; \infty)$ are non-decreasing functions such that $L(t) > 0$ for every $t > 0$ and $U(t) \to 0$ as $t \to 0$. Both $L$ and $U$ are allowed to depend only on $s$ and $D$. This step uses a certain Lipschitz extension-type theorem and averaging via bounded invariant means. Finally, we conclude the proof by applying theorems of Aharoni-Maurey-Mityagin and Nikishin and obtain a desired (linear) embedding of $X$ into $\ell_{1 - \varepsilon}$.

## Open problems

Let me finally state several open problems.

The first obvious open problem is to extend our result to as large class of general metric spaces as possible. Two notable examples one should keep in mind are the Khot-Vishnoi space and the Heisenberg group. In both cases, a space admits good sketches (since both spaces are embeddable into $\ell_2$-squared), but neither of them is embeddable into $\ell_1$. I do not know, if these spaces are embeddable into $\ell_{1 - \varepsilon}$, but I am inclined to suspect so.

The second open problem deals with linear sketches. For a normed space, one can require that a sketch is of the form $\mathrm{sketch}(x) = Ax$, where $A$ is a random matrix generated using shared randomness. Our result then can be interpreted as follows: any normed space that allows sketches of size $s$ and approximation $D$ allows a linear sketch with one linear measurement and approximation $O(sD)$ (this follows from the fact that for $\ell_{1 - \varepsilon}$ there are good linear sketches). But can we always construct a linear sketch of size $f(s)$ and approximation $g(D)$, where $f(\cdot)$ and $g(\cdot)$ are some (ideally, not too quickly growing) functions?

Finally, the third open problem is about spaces that allow essentially no non-trivial sketches. Can one characterize $d$-dimensional normed spaces, where any sketch for approximation $O(1)$ must have size $\Omega(d)$? The only example I can think of is a space that contains a subspace that is close to $\ell_{\infty}^{\Omega(d)}$. Is this the only case?

Ilya

# Faster, I say! The race for the fastest SDD linear system solver – STOC 2014 Recaps (Part 9)

In the next post in our series of STOC 2014 recaps, Adrian Vladu tells us about some of the latest and greatest in Laplacian and SDD linear system solvers. There’s been a flurry of exciting results in this line of work, so we hope this gets you up to speed.

The Monday morning session was dominated by a nowadays popular topic, symmetric diagonally dominant (SDD) linear system solvers. Richard Peng started by presenting his work with Dan Spielman, the first parallel solver with near linear work and poly-logarithmic depth! This is exciting, since parallel algorithms are used for large scale problems in scientific computing, so this is a result with great practical applications.

The second talk was given by Jakub Pachocki and Shen Chen Xu from CMU, which was a result of merging two papers. The first result is a new type of trees that can be used as preconditioners. The second one is a more efficient solver, which together with the trees shaved one more $\log n$ factor in the race for the fastest solver.

Before getting into more specific details, it might be a good idea to provide a bit of background on the vast literature of Laplacian solvers.

Typically, linear systems $Ax = b$ are easier to solve whenever $A$ has some structure on it. A particular class we care about are positive semidefinite (PSD) matrices. They work nicely because the solution is the minimizer of the quadratic form $f(x) = \frac{1}{2} x^T A x - b x$ , which happens to be a convex function due to the PSD-ness of $A$. Hence we can use various versions of gradient descent, the convergence of which depends usually on the condition number of $A$.

A subset of PSD matrices are Laplacian matrices, which are nothing else but graph Laplacians; using an easy reduction, one can show that any SDD system can be reduced to a Laplacian system. Laplacians are great because they carry a lot of combinatorial structure. Instead of having to suffer through a lot of scary algebra, this is the place where we finally get to solve some fun problems on graphs. The algorithms we aim for have running time close to linear in the sparsity of the matrix.

One reason why graphs are nice is that we know how to approximate them with other simpler graphs. More specifically, when given a graph $G$, we care about finding a sparser graph $H$ such that $L_H \preceq L_G \preceq \alpha \cdot L_H$a, for some small $\alpha$ (the smaller, the better). The point is that whenever you do gradient descent in order to minimize $f(x)$, you can take large steps by solving a system in the sparser $L_H$. Of course, this requires another linear system solve, only that this only needs to be done on a sparser graph. Applying this idea recursively eventually yields efficient solvers. A lot of combinatorial work is spent on understanding how to compute these sparser graphs.

In their seminal work, Spielman and Teng used ultrasparsifiersb as their underlying combinatorial structure, and after many pages of work they obtained a near linear algorithm with a large polylogarithmic factor in the running time. Eventually, Koutis, Miller and Peng came up with a much cleaner construction, and showed how to construct a chain of sparser and sparser graphs, which yielded a solver that was actually practical. Subsequently, people spent a lot of time trying to shave log factors from the running time, see [9], [6], [7], [12] (the last of which was presented at this STOC), and the list will probably continue.

After this digression, we can get back to the conference program and discuss the results.

How do we solve a system $L_G x = b$? We need to find away to efficiently apply the operator $L_G^+$ to $b$. Even Laplacians are not easy to invert, and what’s worse, their pseudoinverses might not even be sparse. However, we can still represent $L_G^+$ as a product of sparse matrices which are easy to compute.

We can gain some inspiration from trying to numerically approximate the inverse of $(1-x)$ for some small real $x$. Taking the Taylor expansion we get that $(1-x)^{-1} = \sum_{i \geq 0} x^i = \prod_{k \geq 0} (1+x^{2^k})$. Notice that in order to get $\epsilon$ precision, we only need to take the product of the first $\Theta(\log 1/\epsilon)$ factors. It would be great if we could approximate matrix inverses the same way. Actually, we can, since for matrices of norm less than $1$ we have the identity $(I-A)^{-1} = \prod_{k\geq 0}(I + A^{2^k})$. At this point we’d be tempted to think that we’re almost done, since we can just write $L_G = D - A$, and try to invert $I - D^{-1/2} A D^{-1/2}$. However we would still need to compute matrix powers, and those matrices might again not even be sparse, so this approach needs more work.

Richard presented a variation of this idea that is more amenable to SDD matrices. He writes

$(D-A)^{-1} = \frac{1}{2}( D^{-1} + (I+D^{-1}A)(D - AD^{-1}A)^{-1}(I+AD^{-1} ) )$

The only hard part of applying this inverse operator $(D-A)^{-1}$ to a vector consists of left multiplying by $(D-AD^{-1}A)^{-1}$. How to do this? One crucial ingredient is the fact that $M = D-AD^{-1}A$ is also SDD! Therefore we can recurse, and solve a linear system in $M$. You might say that we won’t be able to do it efficiently, since $M$ is not sparse. But with a little bit of work and the help of existing spectral sparsification algorithms $M$ can be approximated with a sparse matrix.

Notice that at the $k^{th}$ level of recursion, the operator we need to apply is $\left(D-(D^{-1}A)^{2^k}\right)^{-1}$. A quick calculation shows that if the condition number of $D-A$ is $\kappa$, then the condition number of $D^{-1}A$ is $1-1/\kappa$. This means that after $k = O(\log \kappa)$ iterations, the eigenvalues of $(D^{-1}A)^{2^k}$ are close to $0$, so we can just approximate the operator with $D^{-1}$ without paying too much for the error.

There are a few details left out. Sparsifying $M$ requires a bit of understanding of its underlying structure. Also, in order to do this in parallel, the authors originally employed the spectral sparsification algorithm of Spielman and Teng, combined with a local clustering algorithm of Orecchia, Sachdeva and Vishnoi. Blackboxing these two sophisticated results might question the practicality of the algorithm. Fortunately, Koutis recently produced a simple self-contained spectral sparsification algorithm, which parallelizes and can replace all the heavy machinery in Richard and Dan’s paper.

Jakub Pachocki and Shen Chen Xu talked about two results, which together yield the fastest SDD system solver to date. The race is still on!

Let me go through a bit of more background. Earlier on I mentioned that graph preconditioners are used to take long steps while doing gradient descent. A dual of gradient descent on the quadratic function is the Richardson iteration. This is yet another iterative method, which refines a coarse approximation to the solution of a linear system. Let $L_G$ be the Laplacian of our given graph, and $L_H$ be the Laplacian of its preconditioner. Let us assume that we have access to the inverse of $L_H$. The Richardson iteration computes a sequence $x_0, x_1, x_2, \dots$, which converges to the solution of the system. It starts with a weak estimate for the solution, and iteratively attempts to decrease the norm of the residue $b- L_G x_k$ by updating the current solution with a coarse approximation to the solution of the system $L_G x = b -L_G x_k$. That coarse approximation is computed using $L_H^{-1}$. Therefore steps are given by

$x_{k+1} = x_k + \alpha \cdot L_H^{-1} ( b - L_G x_k )$

where $\alpha \in (0,1]$ is a parameter that adjusts the length of the step. The better $L_H$ approximates $L_G$, the fewer steps we need to make.

The problem that Jakub and Shen talked about was finding these good preconditioners. The way they do it is by looking more closely at the Richardson iteration, and weakening the requirements. Instead of having the preconditioner approximate $L_G$ spectrally, they only impose some moment bounds. I will not describe them here, but feel free to read the paper. Proving that these moment bounds can be satisfied using a sparser preconditioner than those that have been so far used in the literature constitutes the technical core of the paper.

Just like in the past literature, these preconditioners are obtained by starting with a good tree, and sampling extra edges. Traditionally, people used low stretch spanning trees. The issue with them is that the number of edges in the preconditioner is determined by the average stretch of the tree, and we can easily check that for the square grid this is $\Omega(\log n)$. Unfortunately, in general we don’t know how to achieve this bound yet; the best known result is off by a $\log \log n$ factor. It turns out that we can still get preconditioners by looking at a different quantity, called the $\ell_p$ stretch ($0 < p < 1$), which can be brought down to $O(\log^p n)$. This essentially eliminates the need for computing optimal low stretch spanning trees. Furthermore, these trees can be computed really fast, $O(m \log \log n)$ time in the RAM model, and the algorithm parallelizes.

This result consists of a careful combination of existing algorithms on low stretch embeddings of graphs into tree metrics and low stretch spanning trees. I will talk more about these embedding results in a future post.

a. $\preceq$ is also known as the Lowner partial order. $A \preceq B$ is equivalent to $0 \preceq B-A$, which says that $B-A$ is PSD.
b. A $(k, h)$-ultrasparsifier $H$ of $G$ is a graph with $n-k+1$ edges such that $L_H \preceq L_G \preceq h \cdot L_H$. It turns out that one is able to efficiently construct $(k, n/k\ \mathrm{polylog}(n))$ ultrasparsifiers. So by adding a few edges to a spanning tree, you can drastically reduce the relative condition number with the initial graph.

[1] Approaching optimality for solving SDD systems, Koutis, Miller, Peng (2010).
[2] Nearly-Linear Time Algorithms for Graph Partitioning, Graph Sparsification, and Solving Linear Systems, Spielman, Teng (2004).
[3] A Local Clustering Algorithm for Massive Graphs and its Application to Nearly-Linear Time Graph Partitioning, Spielman, Teng (2013).
[4] Spectral Sparsification of Graphs, Spielman, Teng (2011).
[5] Nearly-Linear Time Algorithms for Preconditioning and Solving Symmetric, Diagonally Dominant Linear Systems, Spielman, Teng (2014).
[6] A Simple, Combinatorial Algorithm for Solving SDD Systems in Nearly-Linear Time, Kelner, Orecchia, Sidford, Zhu (2013).
[7] Efficient Accelerated Coordinate Descent Methods and Faster Algorithms for Solving Linear Systems, Lee, Sidford (2013).
[8] A nearly-mlogn time solver for SDD linear systems, Koutis, Miller, Peng (2011).
[9] Improved Spectral Sparsification and Numerical Algorithms for SDD Matrices, Koutis, Levin, Peng (2012).
[10] Simple parallel and distributed algorithms for spectral graph sparsification, Koutis (2014).
[11] Preconditioning in Expectation, Cohen, Kyng, Pachocki, Peng, Rao (2014).
[12] Stretching Stretch, Cohen, Miller, Pachocki, Peng, Xu (2014).
[13] Solving SDD Linear Systems in Nearly mlog1/2n Time, Cohen, Kyng, Miller, Pachocki, Peng, Rao, Xu (2014).
[14] Approximating the Exponential, the Lanczos Method and an $\tilde{O}(m)$-Time Spectral Algorithm for Balanced Separator, Orecchia, Sachdeva, Vishoni (2012).

# TCS in the Big Apple – STOC 2014 Recaps (Part 1)

Ah, summertime. For many NotSoGITCSers, it means no classes, warm weather, ample research time… and a trip to STOC. This year the conference was held in New York City, only four hours away from Cambridge via sketchy Chinatown tour bus. Continuing in the great tradition of conference blogging by TCS blogs (see here, and here), NotSoGITCS will share some of our favorite papers from the conference. I’ll start this off with my recap on Rothvoss’s paper on extension complexity. Over the next few posts, we’ll hear from Clément Canonne and Gautam Kamath on Efficient Distribution Estimation, Pritish Kamath on the latest results in algebraic circuit complexity, Justin Holmgren on indistinguishability obfuscation, and Sunoo Park on coin flipping!

Henry Yuen on , by Thomas Rothvoss

The Best Paper award for this year’s STOC went to this paper, which is another milestone result in the recent flurry of activity in lower bounds for extension complexity of combinatorial problems. This area represents a fruitful “meeting in the middle” between complexity theory and algorithms. Although a proof of P ≠ NP — a conjecture about the limitation of all polynomial time algorithms — seems astronomically distant, we can instead shoot for a closer goal: show that polynomial time algorithms we know of now cannot solve NP-complete problems. The area of extension complexity is about understanding the limitations of some of the finest weapons in our polynomial-time toolbox: LP and SDP solvers. As many of you know, these algorithms are tremendously powerful and versatile, in both theory and practice alike.

A couple years ago, STOC 2012 (also held in New York!) saw the breakthrough result of Fiorini, et al. titled Linear vs. Semideﬁnite Extended Formulations: Exponential Separation and Strong Lower Bounds. This paper showed that while powerful, LP solvers cannot naturally solve NP-complete problems in polynomial time. What do I mean by naturally solve? Let’s take the Traveling Salesman Problem for a moment (which was the subject of the aforementioned paper). One way to try to solve TSP on an n-city instance using linear programming is to try to optimize over the polytope PTSP where each vertex corresponds to a tour of the complete graph on n nodes. Turns out, this polytope has exponentially many facets, and so is a rather unwieldy object to deal with — simply writing down a system of linear inequalities defining PTSP takes exponential time! Instead, people would like to optimize over another polytope in higher dimension QTSP that projects down to PTSP (such a QTSP would be called a linear extension of PTSP) except now QTSP has a polynomial number of facets. Does such a succinct linear extension exist? If so, that would imply P = NP! The Fiorini, et al. paper showed that one can’t solve the Traveling Salesman Problem this way: the TSP polytope has exponential extension complexity — meaning that any linear extension of QTSP of PTSP necessarily has exponentially many facets [1].

In a sense, one shouldn’t be too surprised by this result: after all, we all know that P ≠ NP, so of course the TSP polytope has exponential extension complexity. But we’ve learned more than just that LP solvers can’t solve NP-complete problems: we’ve gained some insight into the why. However, after the Fiorini et al. result, the natural question remained: is this exponential complexity of polytopes about NP-hardness, or is it something else? In particular, what is the extension complexity of the perfect matching polytope PPM (the convex hull of all perfect matchings in $K_n$)? This is interesting question for a number of reasons: this polytope, like PTSP, has an exponential number of facets. However, unlike the TSP polytope, optimizing linear objective functions over PPM is actually polynomial-time solvable! This is due to the seminal work of Jack Edmonds, who, in the 1960’s, effectively brought the notion of “polynomial time computability” into the (computer science) public consciousness by demonstrating that the maximum matching problem admits an efficient solution.

As you might have guessed from the title, Thomas Rothvoss showed that actually the matching polytope cannot be described as the projection of a polynomially-faceted polytope! So, while we have LP-based algorithms for optimizing over PPM, these algorithms must nontrivially rely on the structure of the matching problem, and cannot be expressed as generically optimizing over some succinct linear extension. I’ll say a few words about how he shows this. To lower bound the extension complexity of a polytope, one starts by leveraging the Yannakakis’s Factorization Theorem, proved in his 1991 paper that started the whole area of extension complexity. His theorem says that, for a polytope $P$, $\mathrm{xc}(P) = \mathrm{rank}_{+}(S_P)$, where the left-hand side denotes the extension complexity of $P$ [2], and the right-hand side denotes the non-negative rank of the slack matrix of $P$ [3]. Now, instead of worrying about every possible linear extension of $P$, we “only” have to focus our attention on the non-negative rank of $S_P$. It turns out that fortuitously, we have techniques of lower bounding the non-negative rank of matrices from communication complexity (a connection that was also pointed out by Yannakakis). Roughly speaking, communication complexity tells us that the non-negative rank of a matrix is high if the matrix cannot be “described” as a small collection of combinatorial rectangles.

This idea is captured in what Rothvoss calls the “Hyperplane separation lower bound”: let $S$ be an arbitrary non-negative $m\times n$ matrix; and $W$ be an arbitrary matrix of the same dimensions (not necessarily non-negative). Then $\mathrm{rank}_+(S) \geq \langle W, S \rangle/ (\|S \|_\infty \alpha_W)$, where  $\langle W, S \rangle$ is the Frobenius inner product between $W$ and $S$, $\|S \|_\infty$ is the largest-magnitude entry of $S$, and $\alpha_W := \max \{ \langle W, R \rangle : R \in \{0,1\}^{m \times n}\text{ is rank 1} \}$. Intuitively, in order to show that the non-negative rank of $S$ is large, you want to exhibit a matrix $W$ (a “hyperplane”) that has large correlation with $S$, but has small correlation with any rectangle $R$. Rothvoss presents such a hyperlane $W$ that proves the matching polytope has exponential extension complexity; the hard part is showing that $\langle W, R \rangle \leq 2^{-\Omega(n)}$ for all rectangles $R$. To do so, Rothvoss follows a similar strategy to Razborov’s proof of the $\Omega(n)$ lower bound on the randomized communication complexity of DISJOINTNESS, with substantial modifications to fit the matching problem.

There are a number of reasons why I like these papers in extension complexity: they’re concrete, solid steps towards the dream of proving P ≠ NP. They’re short and elegantly written: the Fiorini, et al. paper is 24 pages; the Rothvoss paper is less than 20. They also demonstrate the unity of theoretical computer science: drawing upon ideas from algorithms, combinatorics, communication complexity, and even quantum information.

-Henry Yuen

[1] Well, now we know that these guys couldn’t have used a succinct extended formulation.

[2] $\mathrm{xc}(P)$ is defined to be the minimum number of facets of any polytope that linearly projects to $P$.

[3] Let $S$ be a non-negative matrix of size $m\times n$. Then $\mathrm{rank}_+(S) = k$ if there exists a factorization $S = TU$ where $T$ and $U$ are non-negative matrices, and $T$ and $U$ have dimensions $m\times k$ and $k\times n$, respectively. Let $P = \{x : Ax \leq b \}$ is a polytope defined by $m$ inequalities, with vertices $v_1,\ldots,v_n$. Then the $(i,j)$th entry of the slack matrix $S_P$ of $P$ is defined to be $b_j - A_j v_i$, with $A_j$ the $j$th row of $A$.

# Beyond Locality-Sensitive Hashing

This is an extended version of (the only) post in my personal blog.

In this post I will introduce locality-sensitive hashing (for which Andrei Broder, Moses Charikar and Piotr Indyk have been recently awarded Paris Kanellakis Theory and Practice Award) and sketch recent developments by Alexandr Andoni, Piotr Indyk, Nguyen Le Huy and myself (see video by Alexandr Andoni, where he explains the same result from somewhat different perspective).

One problem that one encounters a lot in machine learning, databases and other areas is the near neighbor search problem (NN). Given a set of points $P$ in a $d$-dimensional space and a threshold $r > 0$ the goal is to build a data structure that given a query $q$ reports any point from $P$ within distance at most $r$ from $q$.

Unfortunately, all known data structures for NN suffer from the so-called “curse of dimensionality”: if the query time is $o(n)$ (hereinafter we denote $n$ the number of points in our dataset $P$), then either space or query time is $2^{\Omega(d)}$.

To overcome this obstacle one can consider the approximate near neighbor search problem (ANN). Now in addition to $P$ and $r$ we are also given an approximation parameter $c > 1$. The goal is given a query $q$ report a point from $P$ within distance $cr$ from $q$, provided that the neighborhood of radius $r$ is not empty.

It turns out that one can overcome the curse of dimensionality for ANN (see, for example, this paper and its references). If one insists on having near-linear (in $n$) memory and being subexponential in the dimension, then the only known technique for ANN is locality-sensitive hashing. Let us give some definitions. Say a hash family $\mathcal{H}$ on a metric space $\mathcal{M} = (X, D)$ is $(r, cr, p_1, p_2)$-sensitive, if for every two points $x,y \in X$

• if $D(x, y) \leq r$, then $\mathrm{Pr}_{h \sim \mathcal{H}}[h(x) = h(y)] \geq p_1$;
• if $D(x, y) \geq cr$, then $\mathrm{Pr}_{h \sim \mathcal{H}}[h(x) = h(y)] \leq p_2$.

Of course, for $\mathcal{H}$ to be meaningful, we should have $p_1 > p_2$. Informally speaking, the closer two points are, the larger probability of their collision is.

Let us construct a simple LSH family for hypercube $\{0, 1\}^d$, equipped with Hamming distance. We set $\mathcal{H} = \{h_1, h_2, \ldots, h_d\}$, where $h_i(x) = x_i$. It is easy to check that this family is $(r, cr, 1 - r / d, 1 - cr / d)$-sensitive.

In 1998 Piotr Indyk and Rajeev Motwani proved the following theorem. Suppose we have a $(r, cr, p_1, p_2)$-sensitive hash family $\mathcal{H}$ for the metric we want to solve ANN for. Moreover, assume that we can sample and evaluate a function from $\mathcal{H}$ relatively quickly, store it efficiently, and that $p_1 = 1 / n^{o(1)}$. Then, one can solve ANN for this metric with space roughly $O(n^{1 + \rho})$ and query time $O(n^{\rho})$, where $\rho = \ln(1 / p_1) / \ln(1 / p_2)$. Plugging the family from the previous paragraph, we are able to solve ANN for Hamming distance in space around $O(n^{1 + 1 / c})$ and query time $O(n^{1/c})$. More generally, in the same paper it was proved that one can achieve $\rho \leq 1 / c$ for the case of $\ell_p$ norms for $1 \leq p \leq 2$ (via an embedding by William Johnson and Gideon Schechtman). In 2006 Alexandr Andoni and Piotr Indyk proved that one can achieve $\rho \leq 1 / c^2$ for the $\ell_2$ norm.

Thus, the natural question arises: how optimal are the abovementioned bounds on $\rho$ (provided that $p_1$ is not too tiny)? This question was resolved in 2011 by Ryan O’Donnell, Yi Wu and Yuan Zhou: they showed a lower bound $\rho \geq 1/c - o(1)$ for $\ell_1$ and $\rho \geq 1/c^2-o(1)$ for $\ell_2$ matching the upper bounds. Thus, the above simple LSH family for the hypercube is in fact, optimal!

Is it the end of the story? Not quite. The catch is that the definition of LSH families is actually too strong. The real property that is used in the ANN data structure is the following: for every pair of points $x \in P, y \in X$ we have

• if $D(x, y) \leq r$, then $\mathrm{Pr}_{h \sim \mathcal{H}}[h(x) = h(y)] \geq p_1$;
• if $D(x, y) \geq cr$, then $\mathrm{Pr}_{h \sim \mathcal{H}}[h(x) = h(y)] \leq p_2$.

The difference with the definition of $(r,cr,p_1,p_2)$-sensitive family is that we now restrict one of the points to be in a prescribed set $P$. And it turns out that one can indeed exploit this dependency on data to get a slightly improved LSH family. Namely, we are able to achieve $\rho \leq 7 / (8c^2) + O(1 / c^3) + o(1)$ for $\ell_2$, which by a simple embedding of $\ell_1$ into $\ell_2$-squared gives $\rho \leq 7 / (8c) + O(1 / c^{3/2}) + o(1)$ for $\ell_1$ (in particular, Hamming distance over the hypercube). This is nice for two reasons. First, we are able to overcome the natural LSH barrier. Second, this result shows that what “practitioners” have been doing for some time (namely, data-dependent space partitioning) can give advantage in theory, too.

In the remaining text let me briefly sketch the main ideas of the result. From now on, assume that our metric is $\ell_2$. The first ingredient is an LSH family that simplifies and improves upon $\rho = 1 / c^2$ for the case, when all data points and queries lie in a ball of radius $O(cr)$. This scheme has strong parallels with an SDP rounding scheme of David Karger, Rajeev Motwani and Madhu Sudan.

The second (and the main) ingredient is a two-level hashing scheme that leverages the abovementioned better LSH family. First, let us recall, how the standard LSH data structure works. We start from a $(r, cr, p_1, p_2)$-sensitive family $\mathcal{H}$ and then consider the following simple “tensoring” operation: we sample $k$ functions $h_1, h_2, \ldots, h_k$ from $\mathcal{H}$ independently and then we hash a point $x$ into a tuple $(h_1(x), h_2(x), \ldots, h_k(x))$. It is easy to see that the new family is $(r, cr, p_1^k, p_2^k)$-sensitive. Let us denote this family by $\mathcal{H}^k$. Now we choose $k$ to have the following collision probabilities:

• $1/n$ at distance $cr$;
• $1/n^{\rho}$ at distance $r$

(actually, we can not set $k$ to achieve these probabilities exactly, since $k$ must be integer, that’s exactly why we need the condition $p_1 = 1 / n^{o(1)}$). Now we hash all the points from the dataset using a random function from $\mathcal{H}^k$, and to answer a query $q$ we hash $q$ and enumerate all the points in the corresponding bucket, until we find anything within distance $cr$. To analyze this simple data structure, we observe that the average number of “outliers” (points at distance more than $cr$) we encounter is at most one due to the choice of $k$. On the other hand, for any near neighbor (within distance at most $r$) we find it with probability at least $n^{-\rho}$, so, to boost it to constant, we build $O(n^{\rho})$ independent hash tables. As a result, we get a data structure with space $O(n^{1 + \rho})$ and query time $O(n^{\rho})$.

Now let us show how to build a similar two-level data structure, which achieves somewhat better parameters. First, we apply the LSH family $\mathcal{H}$ for $\ell_2$ with $\rho \approx 1/c^2$, but only partially. Namely, we choose a constant parameter $\tau > 1$ and $k$ such that the collision probabilities are as follows:

• $1/n$ at distance $\tau cr$;
• $1/n^{1/\tau^2}$ at distance $cr$;
• $1/n^{1/(\tau c)^2}$ at distance $r$.

Now we hash all the data points with $\mathcal{H}^k$ and argue that with high probability every bucket has diameter $O(cr)$. But now given this “bounded buckets” condition, we can utilize the better family designed above! Namely, we hash every bucket using our new family to achieve the following probabilities:

• $1/n^{1 - 1/\tau^2}$ at distance $cr$;
• $1/n^{(1 - \Omega_{\tau}(1))(1 - 1 / \tau^2) / c^2}$ at distance $r$.

Overall, the data structure consists of an outer hash table that uses the LSH family of Andoni and Indyk, and then every bucket is hashed using the new family. Due to independence, the collision probabilities multiply, and we get

• $1/n$ at distance $cr$;
• $1/n^{(1 - \Omega_{\tau}(1)) / c^2}$ at distance $r$.

Then we argue as before and conclude that we can achieve $\rho \leq (1 - \Omega(1)) / c^2$.

After carefully optimizing all the parameters, we achieve, in fact, $\rho \approx 7 / (8c^2)$. Then we go further, and consider a multi-level scheme with several distance scales. Choosing these scales carefully, we achieve $\rho \approx 1 / (2 c^2 \ln 2)$.