Sketching and Embedding are Equivalent for Norms


In this post I will show that any normed space that allows good sketches is necessarily embeddable into an \ell_p space with p close to 1. This provides a partial converse to a result of Piotr Indyk, who showed how to sketch metrics that embed into \ell_p for 0 < p \le 2. A cool bonus of this result is that it gives a new technique for obtaining sketching lower bounds.

This result appeared in a recent paper of mine that is a joint work with Alexandr Andoni and Robert Krauthgamer. I am pleased to report that it has been accepted to STOC 2015.


One of the exciting relatively recent paradigms in algorithms is that of sketching. The high-level idea is as follows: if we are interested in working with a massive object x, let us start by compressing it to a short sketch \mathrm{sketch}(x) that preserves the properties of x we care about. One great example of sketching is the Johnson-Lindenstrauss lemma: if we work with n high-dimensional vectors and are interested in Euclidean distances between them, we can project the vectors onto a random O(\varepsilon^{-2} \cdot \log n)-dimensional subspace, and this will preserve with high probability all the pairwise distances up to a factor of 1 + \varepsilon.
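To make the Johnson-Lindenstrauss example concrete, here is a minimal sketch in plain Python. It is only an illustration under my own assumptions: instead of projecting onto a random subspace, it multiplies by a matrix of independent Gaussian entries (a standard variant of the same idea), and the function names random_projection and max_relative_error are hypothetical.

```python
import math
import random


def random_projection(points, k, seed=0):
    """Map each point (a list of floats) to k dimensions.

    Assumption: we use a k x d matrix with i.i.d. N(0, 1/k) entries rather
    than an exact projection onto a random subspace.
    """
    rng = random.Random(seed)
    d = len(points[0])
    scale = 1.0 / math.sqrt(k)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]
    return [[sum(row[j] * p[j] for j in range(d)) for row in matrix] for p in points]


def euclidean(x, y):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def max_relative_error(points, projected):
    """Largest relative change of a pairwise distance (points must be distinct)."""
    worst = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            before = euclidean(points[i], points[j])
            after = euclidean(projected[i], projected[j])
            worst = max(worst, abs(after / before - 1.0))
    return worst
```

Calling max_relative_error on a handful of random points and their images under random_projection gives a quick empirical feel for how the distortion shrinks as k grows.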

It would be great to understand for which computational problems sketching is possible and how efficient it can be made. There are quite a few nice results (both upper and lower bounds) along these lines (see, e.g., graph sketching or a recent book about sketching for numerical linear algebra), but a general understanding has yet to emerge.

Sketching for metrics

One of the main motivations to study sketching is fast computation and indexing of similarity measures \mathrm{sim}(x, y) between two objects x and y. Oftentimes similarity between objects is modeled by some metric d(x, y) (but not always! think KL divergence): for instance, the above example of the Euclidean distance falls into this category. Thus, instantiating the above general question, one can ask: for which metric spaces do good sketches exist? That is, when is it possible to compute a short sketch \mathrm{sketch}(x) of a point x such that, given two sketches \mathrm{sketch}(x) and \mathrm{sketch}(y), one is able to estimate the distance d(x, y)?

The following communication game captures the question of sketching metrics. Alice and Bob each have a point from a metric space X (say, x and y, respectively). Suppose, in addition, that either d_X(x, y) \le r or d_X(x, y) > D \cdot r (where r and D are parameters known from the beginning). Alice and Bob send messages \mathrm{sketch}(x) and \mathrm{sketch}(y), each s bits long, to Charlie, who is supposed to distinguish between the two cases (whether d_X(x, y) is small or large) with probability at least 0.99. We assume that all three parties are allowed to use shared randomness. Our main goal is to understand the trade-off between D (approximation) and s (sketch size).

Arguably, the most important metric spaces are \ell_p spaces. Formally, for 1 \leq p \leq \infty we define \ell_p^d to be a d-dimensional space equipped with distance

\|x - y\|_p = \Bigl(\sum_{i=1}^d |x_i - y_i|^p\Bigr)^{1/p}

(when p = \infty this expression should be understood as \max_{1 \leq i \leq d} |x_i - y_i|). One can similarly define \ell_p spaces for 0 < p < 1; even if the triangle inequality does not hold for this case, it is nevertheless a meaningful notion of distance.
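As a small illustration of this definition, the following sketch computes the \ell_p distance for any p > 0, including p = \infty. The function name lp_distance is my own choice, not something from the paper.

```python
import math


def lp_distance(x, y, p):
    """Return the l_p distance between equal-length sequences x and y.

    p may be any positive real or math.inf. For 0 < p < 1 the value is
    still well defined, but the triangle inequality can fail.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same dimension")
    if p <= 0:
        raise ValueError("p must be positive")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == math.inf:
        # The limiting case: the largest coordinate-wise difference.
        return max(diffs, default=0.0)
    return sum(t ** p for t in diffs) ** (1.0 / p)
```

For p = 1/2 and the points (0, 0), (1, 0), (1, 1), the distance between the first and the last point is 4, while the two intermediate distances are both 1, which shows concretely how the triangle inequality breaks for p < 1.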

It turns out that \ell_p spaces exhibit very interesting behavior when it comes to sketching. Indyk showed that for 0 < p \le 2 one can achieve approximation D = 1 + \varepsilon and sketch size s = O(1 / \varepsilon^2) for every \varepsilon > 0 (for 1 \le p \le 2 this was established before by Kushilevitz, Ostrovsky and Rabani). It is quite remarkable that these bounds do not depend on the dimension of the space. On the other hand, for \ell_p^d with p > 2 the dependence on the dimension is necessary. It turns out that for constant approximation D = O(1) the optimal sketch size is \widetilde{\Theta}(d^{1 - 2/p}).
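To give a feeling for how such sketches can look, here is a minimal illustration of the idea behind Indyk's sketch for p = 1, based on the 1-stability of the Cauchy distribution. This is a simplified version under my own assumptions: the measurements are real numbers rather than bits, the estimator is a plain median, and the names cauchy_matrix, sketch and estimate_l1 are hypothetical.

```python
import math
import random
import statistics


def cauchy_matrix(k, d, seed=0):
    """A k x d matrix of i.i.d. standard Cauchy entries (the shared randomness)."""
    rng = random.Random(seed)
    return [[math.tan(math.pi * (rng.random() - 0.5)) for _ in range(d)]
            for _ in range(k)]


def sketch(matrix, x):
    """Linear sketch of x: k inner products with the rows of the matrix."""
    return [sum(a * t for a, t in zip(row, x)) for row in matrix]


def estimate_l1(sketch_x, sketch_y):
    """Estimate ||x - y||_1 from the two sketches alone.

    Each coordinate of sketch_x - sketch_y is distributed as ||x - y||_1
    times a standard Cauchy variable, and the median of |Cauchy| is 1.
    """
    return statistics.median(abs(a - b) for a, b in zip(sketch_x, sketch_y))
```

The number of measurements k plays the role of the sketch size: more measurements make the median concentrate better around the true distance, and nothing in the estimator depends on the dimension d.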

Are there any other examples of metrics that admit efficient sketches (say, with constant D and s)? One simple observation is that if a metric embeds well into \ell_p for 0 < p \le 2, then one can sketch this metric well. Formally, we say that a map between metric spaces f \colon X \to Y is an embedding with distortion \widetilde{D}, if

d_X(x_1, x_2) \leq C \cdot d_Y\bigl(f(x_1), f(x_2)\bigr) \leq \widetilde{D}  \cdot d_X(x_1, x_2)

for every x_1, x_2 \in X and for some C > 0. It is immediate to see that if a metric space X embeds into \ell_p for 0 < p \le 2 with distortion O(1), then one can sketch X with s = O(1) and D = O(1). Thus, we know that any metric that embeds well into \ell_p with 0 < p \le 2 is efficiently sketchable. Are there any other examples? The amazing answer is that we don’t know!
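On a finite set of points, the best possible \widetilde{D} in this definition is simply the ratio between the largest and the smallest expansion factor d_Y / d_X over all pairs. The following sketch computes it; the function name distortion and its signature are my own, chosen for illustration.

```python
import itertools
import math


def distortion(points, f, d_x, d_y):
    """Smallest distortion of the map f restricted to a finite set of points.

    d_x and d_y are distance functions on the source and target spaces.
    The optimal scaling C equals 1 / (smallest expansion factor).
    """
    ratios = []
    for a, b in itertools.combinations(points, 2):
        source = d_x(a, b)
        if source == 0:
            raise ValueError("points must be pairwise distinct")
        ratios.append(d_y(f(a), f(b)) / source)
    if min(ratios) == 0:
        return math.inf  # f collapses two distinct points
    return max(ratios) / min(ratios)
```

For example, the identity map from \ell_1 to \ell_2 on the three points (0, 0), (1, 0), (1, 1) has distortion \sqrt{2}.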

Our results

Our result shows that for a very important class of metrics—normed spaces—embedding into \ell_p is the only possible way to obtain good sketches. Formally, if a normed space X allows sketches of size s for approximation D, then for every \varepsilon > 0 the space X embeds into \ell_{1 - \varepsilon} with distortion O(sD / \varepsilon). This result together with the above upper bound by Indyk provides a complete characterization of normed spaces that admit good sketches.

Taking the above result in the contrapositive, we see that non-embeddability implies lower bounds for sketches. This is great, since it potentially allows us to employ many sophisticated non-embeddability results proved by geometers and functional analysts. Specifically, we prove two new lower bounds for sketches: for the planar Earth Mover’s Distance (building on a non-embeddability theorem by Naor and Schechtman) and for the trace norm (non-embeddability was proved by Pisier). In addition to it, we are able to unify certain known results: for instance, classify \ell_p spaces and the cascaded norms in terms of “sketchability”.

Overview of the proof

Let me outline the main steps of the proof of the implication “good sketches imply good embeddings”. The following definition is central to the proof. Let us call a map f \colon X \to Y between two metric spaces (s_1, s_2, \tau_1, \tau_2)-threshold, if for every x_1, x_2 \in X:

  • d_X(x_1, x_2) \leq s_1 implies d_Y\bigl(f(x_1), f(x_2)\bigr) \leq \tau_1,
  • d_X(x_1, x_2) \geq s_2 implies d_Y\bigl(f(x_1), f(x_2)\bigr) \geq \tau_2.

One should think of threshold maps as very weak embeddings that merely preserve certain distance scales.

The proof can be divided into two parts. First, we prove that for a normed space X that allows sketches of size s and approximation D there exists a (1, O(sD), 1, 10)-threshold map to a Hilbert space. Then, we prove that the existence of such a map implies the existence of an embedding into \ell_{1 - \varepsilon} with distortion O(sD / \varepsilon).

The first half goes roughly as follows. Assume that there is no (1, O(sD), 1, 10)-threshold map from X to a Hilbert space. Then, by convex duality, this implies certain Poincaré-type inequalities on X. This, in turn, implies sketching lower bounds for \ell_{\infty}^k(X) (the direct sum of k copies of X, where the norm is defined as the maximum of the norms of the components) by a result of Andoni, Jayram and Pătrașcu (which is based on the very important notion of information complexity). Then, crucially using the fact that X is a normed space, we conclude that X itself does not have good sketches (this step follows from the fact that every normed space is of type 1 and is of cotype \infty).

The second half uses tools from nonlinear functional analysis. First, building on an argument of Johnson and Randrianarivony, we show that for normed spaces a (1, O(sD), 1, 10)-threshold map into a Hilbert space implies a uniform embedding into a Hilbert space, that is, a map f \colon X \to H, where H is a Hilbert space, such that

L\bigl(\|x_1 - x_2\|_X\bigr) \leq \bigl\|f(x_1) - f(x_2)\bigr\|_H \leq U\bigl(\|x_1 - x_2\|_X\bigr),

where L, U \colon [0; \infty) \to [0; \infty) are non-decreasing functions such that L(t) > 0 for every t > 0 and U(t) \to 0 as t \to 0. Both L and U are allowed to depend only on s and D. This step uses a certain Lipschitz extension-type theorem and averaging via bounded invariant means. Finally, we conclude the proof by applying theorems of Aharoni-Maurey-Mityagin and Nikishin and obtain a desired (linear) embedding of X into \ell_{1 - \varepsilon}.

Open problems

Let me finally state several open problems.

The first obvious open problem is to extend our result to as large a class of general metric spaces as possible. Two notable examples one should keep in mind are the Khot-Vishnoi space and the Heisenberg group. In both cases, the space admits good sketches (since both spaces are embeddable into \ell_2-squared), but neither of them is embeddable into \ell_1. I do not know whether these spaces are embeddable into \ell_{1 - \varepsilon}, but I am inclined to suspect so.

The second open problem deals with linear sketches. For a normed space, one can require that a sketch is of the form \mathrm{sketch}(x) = Ax, where A is a random matrix generated using shared randomness. Our result then can be interpreted as follows: any normed space that allows sketches of size s and approximation D allows a linear sketch with one linear measurement and approximation O(sD) (this follows from the fact that for \ell_{1 - \varepsilon} there are good linear sketches). But can we always construct a linear sketch of size f(s) and approximation g(D), where f(\cdot) and g(\cdot) are some (ideally, not too quickly growing) functions?

Finally, the third open problem is about spaces that allow essentially no non-trivial sketches. Can one characterize d-dimensional normed spaces, where any sketch for approximation O(1) must have size \Omega(d)? The only example I can think of is a space that contains a subspace that is close to \ell_{\infty}^{\Omega(d)}. Is this the only case?


Beyond Locality-Sensitive Hashing

This is an extended version of (the only) post in my personal blog.

In this post I will introduce locality-sensitive hashing (for which Andrei Broder, Moses Charikar and Piotr Indyk have recently been awarded the Paris Kanellakis Theory and Practice Award) and sketch recent developments by Alexandr Andoni, Piotr Indyk, Nguyen Le Huy and myself (see the video by Alexandr Andoni, where he explains the same result from a somewhat different perspective).

One problem that one encounters a lot in machine learning, databases and other areas is the near neighbor search problem (NN). Given a set of points P in a d-dimensional space and a threshold r > 0, the goal is to build a data structure that, given a query q, reports any point from P within distance at most r from q.

Unfortunately, all known data structures for NN suffer from the so-called “curse of dimensionality”: if the query time is o(n) (hereinafter we denote by n the number of points in our dataset P), then either space or query time is 2^{\Omega(d)}.

To overcome this obstacle one can consider the approximate near neighbor search problem (ANN). Now, in addition to P and r, we are also given an approximation parameter c > 1. The goal is, given a query q, to report a point from P within distance cr from q, provided that the neighborhood of radius r is not empty.

It turns out that one can overcome the curse of dimensionality for ANN (see, for example, this paper and its references). If one insists on having near-linear (in n) memory and being subexponential in the dimension, then the only known technique for ANN is locality-sensitive hashing. Let us give some definitions. Say a hash family \mathcal{H} on a metric space \mathcal{M} = (X, D) is (r, cr, p_1, p_2)-sensitive, if for every two points x,y \in X

  • if D(x, y) \leq r, then \mathrm{Pr}_{h \sim \mathcal{H}}[h(x) = h(y)] \geq p_1;
  • if D(x, y) \geq cr, then \mathrm{Pr}_{h \sim \mathcal{H}}[h(x) = h(y)] \leq p_2.

Of course, for \mathcal{H} to be meaningful, we should have p_1 > p_2. Informally speaking, the closer two points are, the larger the probability of their collision is.

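Let us construct a simple LSH family for the hypercube \{0, 1\}^d equipped with the Hamming distance. We set \mathcal{H} = \{h_1, h_2, \ldots, h_d\}, where h_i(x) = x_i. Two points at Hamming distance t agree in exactly d - t coordinates, so a uniformly random h_i collides on them with probability 1 - t / d. Hence this family is (r, cr, 1 - r / d, 1 - cr / d)-sensitive.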
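The collision probability of this bit-sampling family can be computed exactly by enumerating all d hash functions. Here is a minimal sketch of that check; the helper names hamming and collision_probability are mine.

```python
from fractions import Fraction


def hamming(x, y):
    """Number of coordinates in which x and y differ."""
    return sum(a != b for a, b in zip(x, y))


def collision_probability(x, y):
    """Exact probability that h_i(x) == h_i(y) for a uniformly random i.

    Here h_i(x) = x[i], so we simply count the agreeing coordinates.
    """
    d = len(x)
    agreeing = sum(1 for i in range(d) if x[i] == y[i])
    return Fraction(agreeing, d)
```

For any two points x and y of the hypercube, collision_probability(x, y) equals 1 - hamming(x, y) / d, which is exactly the claimed sensitivity.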

In 1998 Piotr Indyk and Rajeev Motwani proved the following theorem. Suppose we have an (r, cr, p_1, p_2)-sensitive hash family \mathcal{H} for the metric we want to solve ANN for. Moreover, assume that we can sample and evaluate a function from \mathcal{H} relatively quickly, store it efficiently, and that p_1 = 1 / n^{o(1)}. Then, one can solve ANN for this metric with space roughly O(n^{1 + \rho}) and query time O(n^{\rho}), where \rho = \ln(1 / p_1) / \ln(1 / p_2). Plugging in the family from the previous paragraph, we are able to solve ANN for Hamming distance in space around O(n^{1 + 1 / c}) and query time O(n^{1/c}). More generally, in the same paper it was proved that one can achieve \rho \leq 1 / c for the case of \ell_p norms for 1 \leq p \leq 2 (via an embedding by William Johnson and Gideon Schechtman). In 2006 Alexandr Andoni and Piotr Indyk proved that one can achieve \rho \leq 1 / c^2 for the \ell_2 norm.
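The exponent \rho is easy to evaluate numerically. The following sketch does this for the bit-sampling family on the hypercube; the function names are hypothetical, and the parameters in the usage note are arbitrary values I picked for illustration.

```python
import math


def lsh_exponent(p1, p2):
    """rho = ln(1 / p1) / ln(1 / p2) for collision probabilities p1 > p2."""
    if not 0 < p2 < p1 < 1:
        raise ValueError("need 0 < p2 < p1 < 1")
    return math.log(1 / p1) / math.log(1 / p2)


def bit_sampling_exponent(d, r, c):
    """rho of the bit-sampling family on {0, 1}^d (requires 0 < r and c * r < d)."""
    return lsh_exponent(1 - r / d, 1 - c * r / d)
```

For instance, with d = 1000, r = 10 and c = 2 the exponent comes out slightly below 1 / c = 0.5. This is not a coincidence: by Bernoulli's inequality (1 - r/d)^c \geq 1 - cr/d, which gives \rho \leq 1 / c for every valid choice of parameters.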

Thus, a natural question arises: how optimal are the abovementioned bounds on \rho (provided that p_1 is not too tiny)? This question was resolved in 2011 by Ryan O’Donnell, Yi Wu and Yuan Zhou: they showed a lower bound \rho \geq 1/c - o(1) for \ell_1 and \rho \geq 1/c^2-o(1) for \ell_2, matching the upper bounds. Thus, the above simple LSH family for the hypercube is, in fact, optimal!

Is it the end of the story? Not quite. The catch is that the definition of LSH families is actually too strong. The real property that is used in the ANN data structure is the following: for every pair of points x \in P, y \in X we have

  • if D(x, y) \leq r, then \mathrm{Pr}_{h \sim \mathcal{H}}[h(x) = h(y)] \geq p_1;
  • if D(x, y) \geq cr, then \mathrm{Pr}_{h \sim \mathcal{H}}[h(x) = h(y)] \leq p_2.

The difference from the definition of an (r,cr,p_1,p_2)-sensitive family is that we now restrict one of the points to be in a prescribed set P. And it turns out that one can indeed exploit this dependency on data to get a slightly improved LSH family. Namely, we are able to achieve \rho \leq 7 / (8c^2) + O(1 / c^3) + o(1) for \ell_2, which by a simple embedding of \ell_1 into \ell_2-squared gives \rho \leq 7 / (8c) + O(1 / c^{3/2}) + o(1) for \ell_1 (in particular, Hamming distance over the hypercube). This is nice for two reasons. First, we are able to overcome the natural LSH barrier. Second, this result shows that what “practitioners” have been doing for some time (namely, data-dependent space partitioning) can give an advantage in theory, too.
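For the hypercube, the passage from \ell_2 to \ell_1 can be unpacked as follows. This is my own spelling-out of the step, and c' is a symbol I introduce for the resulting \ell_2 approximation factor.

```latex
\begin{aligned}
&\|x - y\|_1 = \sum_{i=1}^d |x_i - y_i| = \sum_{i=1}^d |x_i - y_i|^2 = \|x - y\|_2^2
  \quad \text{for } x, y \in \{0, 1\}^d, \\
&\|x - y\|_1 \leq r \iff \|x - y\|_2 \leq \sqrt{r},
  \qquad
  \|x - y\|_1 \geq cr \iff \|x - y\|_2 \geq \sqrt{c} \cdot \sqrt{r}, \\
&\text{so } c' = \sqrt{c} \text{ and }
  \rho \leq \frac{7}{8 c'^2} + O\Bigl(\frac{1}{c'^3}\Bigr) + o(1)
       = \frac{7}{8c} + O\Bigl(\frac{1}{c^{3/2}}\Bigr) + o(1).
\end{aligned}
```

In other words, an \ell_1 instance with approximation c becomes an \ell_2 instance with approximation \sqrt{c}, which is why every c^2 in the \ell_2 bound turns into c.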

In the remaining text let me briefly sketch the main ideas of the result. From now on, assume that our metric is \ell_2. The first ingredient is an LSH family that simplifies and improves upon \rho = 1 / c^2 for the case when all data points and queries lie in a ball of radius O(cr). This scheme has strong parallels with an SDP rounding scheme of David Karger, Rajeev Motwani and Madhu Sudan.

The second (and the main) ingredient is a two-level hashing scheme that leverages the abovementioned better LSH family. First, let us recall how the standard LSH data structure works. We start from an (r, cr, p_1, p_2)-sensitive family \mathcal{H} and then consider the following simple “tensoring” operation: we sample k functions h_1, h_2, \ldots, h_k from \mathcal{H} independently and then we hash a point x into a tuple (h_1(x), h_2(x), \ldots, h_k(x)). It is easy to see that the new family is (r, cr, p_1^k, p_2^k)-sensitive. Let us denote this family by \mathcal{H}^k. Now we choose k to have the following collision probabilities:

  • 1/n at distance cr;
  • 1/n^{\rho} at distance r

(actually, we cannot set k to achieve these probabilities exactly, since k must be an integer; that is exactly why we need the condition p_1 = 1 / n^{o(1)}). Now we hash all the points from the dataset using a random function from \mathcal{H}^k, and to answer a query q we hash q and enumerate all the points in the corresponding bucket until we find a point within distance cr. To analyze this simple data structure, we observe that the expected number of “outliers” (points at distance more than cr) we encounter is at most one due to the choice of k. On the other hand, any near neighbor (within distance at most r) is found with probability at least n^{-\rho}, so, to boost this to a constant, we build O(n^{\rho}) independent hash tables. As a result, we get a data structure with space O(n^{1 + \rho}) and query time O(n^{\rho}).
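The data structure just described can be sketched in a few lines for the hypercube with the bit-sampling family. This is a toy illustration under my own assumptions, not an optimized implementation: the class name HammingLSH is hypothetical, the number of tables is taken to be roughly 1 / p_1^k without any extra constant, and it requires r > 0 and cr < d.

```python
import math
import random


def hamming(x, y):
    """Number of coordinates in which x and y differ."""
    return sum(a != b for a, b in zip(x, y))


class HammingLSH:
    """Standard LSH index over {0, 1}^d built from bit sampling."""

    def __init__(self, points, r, c, seed=0):
        rng = random.Random(seed)
        self.points, self.r, self.c = points, r, c
        n, d = len(points), len(points[0])
        p1, p2 = 1 - r / d, 1 - c * r / d
        # Tensoring: choose k so that far points collide w.p. about 1/n.
        self.k = max(1, math.ceil(math.log(n) / math.log(1 / p2)))
        # Near points collide w.p. p1^k, so use about 1 / p1^k tables.
        self.num_tables = math.ceil(1 / p1 ** self.k)
        self.tables = []
        for _ in range(self.num_tables):
            coords = [rng.randrange(d) for _ in range(self.k)]
            buckets = {}
            for idx, p in enumerate(points):
                buckets.setdefault(tuple(p[i] for i in coords), []).append(idx)
            self.tables.append((coords, buckets))

    def query(self, q):
        """Return some data point within distance c * r of q, or None."""
        for coords, buckets in self.tables:
            for idx in buckets.get(tuple(q[i] for i in coords), []):
                if hamming(self.points[idx], q) <= self.c * self.r:
                    return self.points[idx]
        return None
```

Points are tuples of 0/1 values. A query scans only the buckets it hashes to, so the work per table is dominated by the outliers it happens to collide with, matching the analysis above.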

Now let us show how to build a similar two-level data structure, which achieves somewhat better parameters. First, we apply the LSH family \mathcal{H} for \ell_2 with \rho \approx 1/c^2, but only partially. Namely, we choose a constant parameter \tau > 1 and k such that the collision probabilities are as follows:

  • 1/n at distance \tau cr;
  • 1/n^{1/\tau^2} at distance cr;
  • 1/n^{1/(\tau c)^2} at distance r.

Now we hash all the data points with \mathcal{H}^k and argue that with high probability every bucket has diameter O(cr). But now given this “bounded buckets” condition, we can utilize the better family designed above! Namely, we hash every bucket using our new family to achieve the following probabilities:

  • 1/n^{1 - 1/\tau^2} at distance cr;
  • 1/n^{(1 - \Omega_{\tau}(1))(1 - 1 / \tau^2) / c^2} at distance r.

Overall, the data structure consists of an outer hash table that uses the LSH family of Andoni and Indyk, and then every bucket is hashed using the new family. Due to independence, the collision probabilities multiply, and we get

  • 1/n at distance cr;
  • 1/n^{(1 - \Omega_{\tau}(1)) / c^2} at distance r.
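The multiplication of the two levels can be written out explicitly. Below, \delta is my shorthand for the \Omega_{\tau}(1) gain of the inner family.

```latex
\begin{aligned}
\text{distance } cr:\quad
  & n^{-1/\tau^2} \cdot n^{-(1 - 1/\tau^2)} = n^{-1}, \\
\text{distance } r:\quad
  & n^{-1/(\tau c)^2} \cdot n^{-(1 - \delta)(1 - 1/\tau^2)/c^2}
    = n^{-\bigl[1/\tau^2 + (1 - \delta)(1 - 1/\tau^2)\bigr]/c^2}
    = n^{-\bigl(1 - \delta (1 - 1/\tau^2)\bigr)/c^2}.
\end{aligned}
```

Since \tau > 1 is a constant, \delta (1 - 1/\tau^2) is again \Omega_{\tau}(1), which gives the exponent (1 - \Omega_{\tau}(1)) / c^2 at distance r.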

Then we argue as before and conclude that we can achieve \rho \leq (1 - \Omega(1)) / c^2.

After carefully optimizing all the parameters, we achieve, in fact, \rho \approx 7 / (8c^2). Then we go further, and consider a multi-level scheme with several distance scales. Choosing these scales carefully, we achieve \rho \approx 1 / (2 c^2 \ln 2).
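To get a feeling for the size of these improvements, one can compare the leading terms of the three exponents numerically. The sketch below ignores all lower-order terms, and the function names are mine.

```python
import math


def rho_classical(c):
    """Leading term of the data-independent LSH exponent for l_2."""
    return 1 / c ** 2


def rho_two_level(c):
    """Leading term of the two-level scheme: 7 / (8 c^2)."""
    return 7 / (8 * c ** 2)


def rho_multi_level(c):
    """Leading term of the multi-level scheme: 1 / (2 c^2 ln 2)."""
    return 1 / (2 * c ** 2 * math.log(2))
```

For every c the three values are ordered as rho_multi_level(c) < rho_two_level(c) < rho_classical(c), since 1 / (2 \ln 2) is roughly 0.72 while 7 / 8 = 0.875.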

Ilya Razenshteyn