Better Circuit Lower Bounds for Explicit Functions

Below is a post by Ilya on a new circuit lower bound.


Let f \colon \{0, 1\}^n \to \{0, 1\} be a Boolean function of n arguments. What is the smallest circuit that can compute f if the set of allowed gates consists of all the unary and binary functions? The size of a circuit is simply the number of gates in it.

It’s easy to show that a random function requires circuits of size \Omega(2^n / n) (this is tight in the worst case), but a big question in computational complexity is to exhibit an explicit (that is, simple to describe) function that requires sufficiently large circuits to compute.
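For intuition, here is a loose version of the standard counting argument behind the \Omega(2^n / n) bound (a sketch, with constants not optimized). A circuit of size s can be described by specifying, for each of its s gates, which of the 16 binary operations it computes and which two of the at most n + s other nodes feed into it, so the number of circuits of size s is at most roughly

(16 \cdot (n + s)^2)^s = 2^{O(s \log (n + s))}.

If s = o(2^n / n), this is 2^{o(2^n)}, which is much smaller than the number 2^{2^n} of all Boolean functions of n arguments; hence most functions require circuits of size \Omega(2^n / n).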

Basically, if one comes up with a function that lies in \mathrm{NP} and requires circuits of size super-polynomial in n, then \mathrm{P} \ne \mathrm{NP}, and one of the Millennium Problems is solved!

How far are we from accomplishing this? Well, until recently, the best lower bound for a function from NP was 3n [Blum 1984] (remember that eventually we are aiming at super-polynomial lower bounds!).

I’m very excited to report that very recently a better lower bound has been proved by Magnus Gausdal Find, Sasha Golovnev, Edward Hirsch and Sasha Kulikov! Is it super-polynomial in n? Is it at least super-linear in n?

Well, not really. The new lower bound is 3.011n! You may ask: why should one be excited about such a seemingly weak improvement (which, nevertheless, took the authors several years of hard work, as far as I know)?

One reason is that this is the first progress on one of the central questions in computational complexity in more than 30 years. In general, researchers like to invent new models and problems and tend to forget good old “core” questions very easily. I’m happy to see that this is not the case here.

Besides, the new result is, in fact, much more general than, say, Blum’s. The authors prove that any function that is non-constant on every affine subspace of sufficiently high dimension requires large circuits. Then, they simply invoke the recent result of Ben-Sasson and Kopparty, who construct an explicit function with this property. That is, the paper shows that a certain pseudo-randomness property suffices for better lower bounds. Hopefully, there will be more developments in this direction.

Is this result one more step towards proving \mathrm{P} \ne \mathrm{NP}? Time will tell.

Sketching and Embedding are Equivalent for Norms

Summary

In this post I will show that any normed space that allows good sketches is necessarily embeddable into an \ell_p space with p close to 1. This provides a partial converse to a result of Piotr Indyk, who showed how to sketch metrics that embed into \ell_p for 0 < p \le 2. A cool bonus of this result is that it gives a new technique for obtaining sketching lower bounds.

This result appeared in a recent paper of mine that is a joint work with Alexandr Andoni and Robert Krauthgamer. I am pleased to report that it has been accepted to STOC 2015.

Sketching

One of the exciting relatively recent paradigms in algorithms is that of sketching. The high-level idea is as follows: if we are interested in working with a massive object x, let us start with compressing it to a short sketch \mathrm{sketch}(x) that preserves properties of x we care about. One great example of sketching is the Johnson-Lindenstrauss lemma: if we work with n high-dimensional vectors and are interested in Euclidean distances between them, we can project the vectors on a random O(\varepsilon^{-2} \cdot \log n)-dimensional subspace, and this will preserve with high probability all the pairwise distances up to a factor of 1 + \varepsilon.
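To make this concrete, here is a toy numpy illustration of Johnson-Lindenstrauss-style sketching (my own example, not code from any of the papers discussed; the target dimension and the constant in front of \varepsilon^{-2} \log n are illustrative choices):

```python
# Toy illustration of Johnson-Lindenstrauss sketching: a random projection to
# roughly eps^{-2} log n dimensions approximately preserves pairwise distances.
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 100, 10_000, 0.2
k = int(np.ceil(8 * np.log(n) / eps ** 2))    # the constant 8 is an arbitrary illustrative choice

X = rng.standard_normal((n, d))               # n high-dimensional points
G = rng.standard_normal((d, k)) / np.sqrt(k)  # random projection matrix (shared randomness)
Y = X @ G                                     # sketch of each point: k real numbers

def pairwise_distances(M):
    sq = (M * M).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * (M @ M.T)
    return np.sqrt(np.maximum(d2, 0.0))

iu = np.triu_indices(n, 1)                    # all unordered pairs of points
ratios = pairwise_distances(Y)[iu] / pairwise_distances(X)[iu]
print(f"k = {k}: sketched/true distance ratios lie in [{ratios.min():.3f}, {ratios.max():.3f}]")
```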

It would be great to understand for which computational problems sketching is possible, and how efficient it can be made. There are quite a few nice results (both upper and lower bounds) along these lines (see, e.g., graph sketching or a recent book about sketching for numerical linear algebra), but the general understanding has yet to emerge.

Sketching for metrics

One of the main motivations to study sketching is fast computation and indexing of similarity measures \mathrm{sim}(x, y) between two objects x and y. Oftentimes, similarity between objects is modeled by some metric d(x, y) (but not always! think KL divergence): for instance, the above example of the Euclidean distance falls into this category. Thus, instantiating the above general question, one can ask: for which metric spaces do there exist good sketches? That is, when is it possible to compute a short sketch \mathrm{sketch}(x) of a point x such that, given two sketches \mathrm{sketch}(x) and \mathrm{sketch}(y), one is able to estimate the distance d(x, y)?

The following communication game captures the question of sketching metrics. Alice and Bob each have a point from a metric space X (say, x and y, respectively). Suppose, in addition, that either d_X(x, y) \le r or d_X(x, y) > D \cdot r (where r and D are the parameters known from the beginning). Both Alice and Bob send messages \mathrm{sketch}(x) and \mathrm{sketch}(y) that are s bits long to Charlie, who is supposed to distinguish two cases (whether d_X(x, y) is small or large) with probability at least 0.99. We assume that all three parties are allowed to use shared randomness. Our main goal is to understand the trade-off between D (approximation) and s (sketch size).

Arguably, the most important metric spaces are \ell_p spaces. Formally, for 1 \leq p \leq \infty we define \ell_p^d to be a d-dimensional space equipped with distance

\|x - y\|_p = \Bigl(\sum_{i=1}^d |x_i - y_i|^p\Bigr)^{1/p}

(when p = \infty this expression should be understood as \max_{1 \leq i \leq d} |x_i - y_i|). One can similarly define \ell_p spaces for 0 < p < 1; although the triangle inequality does not hold in this case, this is nevertheless a meaningful notion of distance.

It turns out that \ell_p spaces exhibit very interesting behavior, when it comes to sketching. Indyk showed that for 0 < p \le 2 one can achieve approximation D = 1 + \varepsilon and sketch size s = O(1 / \varepsilon^2) for every \varepsilon > 0 (for 1 \le p \le 2 this was established before by Kushilevitz, Ostrovsky and Rabani). It is quite remarkable that these bounds do not depend on the dimension of a space. On the other hand, for \ell_p^d with p > 2 the dependence on the dimension is necessary. It turns out that for constant approximation D = O(1) the optimal sketch size is \widetilde{\Theta}(d^{1 - 2/p}).
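To give a feel for the 0 < p \le 2 regime, here is a toy version of a p-stable sketch for p = 1 (in the spirit of Indyk's construction, with illustrative parameters that are my own choices): project onto a few random Cauchy directions and estimate the \ell_1 distance from the median of the coordinate-wise differences of the two sketches. Charlie can then compare this estimate against the threshold r to decide which of the two cases he is in.

```python
# Toy 1-stable (Cauchy) sketch for l_1 distances: each coordinate of
# sketch(x) - sketch(y) is distributed as ||x - y||_1 times a standard Cauchy,
# so the median of the absolute values estimates ||x - y||_1.
import numpy as np

rng = np.random.default_rng(1)
d, m = 1_000, 200                      # dimension and sketch length (illustrative)
C = rng.standard_cauchy((m, d))        # shared randomness between Alice, Bob and Charlie

def sketch(x):
    return C @ x                       # m real numbers per point

def estimate_l1(sx, sy):
    return np.median(np.abs(sx - sy))

x = rng.standard_normal(d)
y = x + 0.1 * rng.standard_normal(d)
print("true l_1 distance     :", np.abs(x - y).sum())
print("estimate from sketches:", estimate_l1(sketch(x), sketch(y)))
```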

Are there any other examples of metrics that admit efficient sketches (say, with constant D and s)? One simple observation is that if a metric embeds well into \ell_p for 0 < p \le 2, then one can sketch this metric well. Formally, we say that a map between metric spaces f \colon X \to Y is an embedding with distortion \widetilde{D}, if

d_X(x_1, x_2) \leq C \cdot d_Y\bigl(f(x_1), f(x_2)\bigr) \leq \widetilde{D}  \cdot d_X(x_1, x_2)

for every x_1, x_2 \in X and for some C > 0. It is immediate to see that if a metric space X embeds into \ell_p for 0 < p \le 2 with distortion O(1), then one can sketch X with s = O(1) and D = O(1). Thus, we know that any metric that embeds well into \ell_p with 0 < p \le 2 is efficiently sketchable. Are there any other examples? The amazing answer is that we don’t know!

Our results

Our result shows that for a very important class of metrics—normed spaces—embedding into \ell_p is the only possible way to obtain good sketches. Formally, if a normed space X allows sketches of size s for approximation D, then for every \varepsilon > 0 the space X embeds into \ell_{1 - \varepsilon} with distortion O(sD / \varepsilon). This result together with the above upper bound by Indyk provides a complete characterization of normed spaces that admit good sketches.

Taking the above result in the contrapositive, we see that non-embeddability implies lower bounds for sketches. This is great, since it potentially allows us to employ many sophisticated non-embeddability results proved by geometers and functional analysts. Specifically, we prove two new lower bounds for sketches: for the planar Earth Mover’s Distance (building on a non-embeddability theorem by Naor and Schechtman) and for the trace norm (non-embeddability was proved by Pisier). In addition, we are able to unify certain known results: for instance, classify \ell_p spaces and the cascaded norms in terms of “sketchability”.

Overview of the proof

Let me outline the main steps of the proof of the implication “good sketches imply good embeddings”. The following definition is central to the proof. Let us call a map f \colon X \to Y between two metric spaces an (s_1, s_2, \tau_1, \tau_2)-threshold map if, for every x_1, x_2 \in X:

  • d_X(x_1, x_2) \leq s_1 implies d_Y\bigl(f(x_1), f(x_2)\bigr) \leq \tau_1,
  • d_X(x_1, x_2) \geq s_2 implies d_Y\bigl(f(x_1), f(x_2)\bigr) \geq \tau_2.

One should think of threshold maps as very weak embeddings that merely
preserve certain distance scales.

The proof can be divided into two parts. First, we prove that for a normed space X that allows sketches of size s and approximation D there exists a (1, O(sD), 1, 10)-threshold map to a Hilbert space. Then, we prove that the existence of such a map implies the existence of an embedding into \ell_{1 - \varepsilon} with distortion O(sD / \varepsilon).

The first half goes roughly as follows. Assume that there is no (1, O(sD), 1, 10)-threshold map from X to a Hilbert space. Then, by convex duality, this implies certain Poincaré-type inequalities on X. This, in turn, implies sketching lower bounds for \ell_{\infty}^k(X) (the direct sum of k copies of X, where the norm is defined as the maximum of the norms of the components) by a result of Andoni, Jayram and Pătrașcu (which is based on a very important notion of information complexity). Then, crucially using the fact that X is a normed space, we conclude that X itself does not have good sketches (this step follows from the fact that every normed space is of type 1 and is of cotype \infty).

The second half uses tools from nonlinear functional analysis. First, building on an argument of Johnson and Randrianarivony, we show that for normed spaces a (1, O(sD), 1, 10)-threshold map into a Hilbert space implies a uniform embedding into a Hilbert space—that is, a map f \colon X \to H, where H is a Hilbert space, such that

L\bigl(\|x_1 - x_2\|_X\bigr) \leq \bigl\|f(x_1) - f(x_2)\bigr\|_H \leq U\bigl(\|x_1 - x_2\|_X\bigr),

where L, U \colon [0; \infty) \to [0; \infty) are non-decreasing functions such that L(t) > 0 for every t > 0 and U(t) \to 0 as t \to 0. Both L and U are allowed to depend only on s and D. This step uses a certain Lipschitz extension-type theorem and averaging via bounded invariant means. Finally, we conclude the proof by applying theorems of Aharoni-Maurey-Mityagin and Nikishin and obtain the desired (linear) embedding of X into \ell_{1 - \varepsilon}.

Open problems

Let me finally state several open problems.

The first obvious open problem is to extend our result to as large a class of general metric spaces as possible. Two notable examples one should keep in mind are the Khot-Vishnoi space and the Heisenberg group. In both cases, a space admits good sketches (since both spaces are embeddable into \ell_2-squared), but neither of them is embeddable into \ell_1. I do not know whether these spaces are embeddable into \ell_{1 - \varepsilon}, but I am inclined to suspect so.

The second open problem deals with linear sketches. For a normed space, one can require that a sketch is of the form \mathrm{sketch}(x) = Ax, where A is a random matrix generated using shared randomness. Our result can then be interpreted as follows: any normed space that allows sketches of size s and approximation D allows a linear sketch with one linear measurement and approximation O(sD) (this follows from the fact that for \ell_{1 - \varepsilon} there are good linear sketches). But can we always construct a linear sketch of size f(s) and approximation g(D), where f(\cdot) and g(\cdot) are some (ideally, not too quickly growing) functions?

Finally, the third open problem is about spaces that allow essentially no non-trivial sketches. Can one characterize d-dimensional normed spaces, where any sketch for approximation O(1) must have size \Omega(d)? The only example I can think of is a space that contains a subspace that is close to \ell_{\infty}^{\Omega(d)}. Is this the only case?

Ilya

Insensitive Intersections of Halfspaces – STOC 2014 Recaps (Part 11)

In the eleventh and final installment of our STOC 2014 recaps, Jerry Li tells us about a spectacularly elegant result by Daniel Kane. It’s an example of what I like to call a “one-page wonder” — this is a bit of a misnomer, since Kane’s paper is slightly more than five pages long, but the term refers to any beautiful paper which is (relatively) short and readable.

We hope you’ve enjoyed our coverage of this year’s STOC. We were able to write about a few of our favorite results, but there’s a wealth of other interesting papers that deserve your attention. I encourage you to peruse the proceedings and discover some favorites of your own.


Jerry Li on The Average Sensitivity of an Intersection of Half Spaces, by Daniel Kane

The Monday afternoon sessions kicked off with Daniel Kane presenting his work on the average sensitivity of an intersection of halfspaces. Usually, FOCS/STOC talks can’t even begin to fit all the technical details from the paper, but unusually, Daniel’s talk included a complete proof of his result, without omitting any details. Amazingly, his result is very deep and general, so something incredible is clearly happening somewhere.

At the highest level, Daniel deals with the study of a certain class of Boolean functions. When we classically think about Boolean functions, we think of things such as CNFs, DNFs, decision trees, etc., which map into things like \mathbb{F}_2, \{0, 1\}, or \{\mbox{True}, \mbox{False}\}, but today, and often in the study of Boolean analysis, we will think of functions as mapping \{-1, 1\}^n to \{-1, 1\}, which is roughly equivalent for many purposes (O’Donnell has a nice rule of thumb as to when to use one convention or the other here). Given a function f:\{-1, 1\}^n \to \{-1, 1\}, we can define two important measures of sensitivity. The first is the average sensitivity (or, for those of you like me who grew up with O’Donnell’s book, the total influence) of the function, namely,

\mathbb{AS}(f) = \mathbb{E}_{x \sim \{-1, 1\}^n} [| \{i: f(x^{i \to 1}) \neq f(x^{i \to -1}) \} |]

where x^{i \to a} is simply x with its ith coordinate set to a. The second is the noise sensitivity of the function, which is defined similarly: for a parameter \varepsilon \in (0, 1), it is the probability that if we sample x uniformly at random from \{-1, 1\}^n, then independently flip each of its bits with probability \varepsilon, the value of f at these two inputs is different. We denote this quantity \mathbb{NS}_\varepsilon (f). When we generate a string y from a string x in this way we say they are 1 - 2\varepsilon-correlated. The weird function of \varepsilon in that expression is because often we equivalently think of y being generated from x by independently keeping each coordinate of x fixed with probability 1 - 2\varepsilon, and uniformly rerandomizing that bit with probability 2 \varepsilon.
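For concreteness, here is a small Monte-Carlo illustration of both definitions (my own toy code, with the majority function as the running example; sample sizes are arbitrary):

```python
# Toy Monte-Carlo estimates of average sensitivity and noise sensitivity for a
# Boolean function f : {-1,1}^n -> {-1,1}; majority is used as the example.
import numpy as np

rng = np.random.default_rng(0)

def majority(x):
    return 1 if x.sum() > 0 else -1

def average_sensitivity(f, n, samples=10_000):
    total = 0
    for _ in range(samples):
        x = rng.choice([-1, 1], size=n)
        fx = f(x)
        for i in range(n):          # count the coordinates whose flip changes f(x)
            x[i] *= -1
            total += (f(x) != fx)
            x[i] *= -1
    return total / samples

def noise_sensitivity(f, n, eps, samples=10_000):
    disagree = 0
    for _ in range(samples):
        x = rng.choice([-1, 1], size=n)
        y = np.where(rng.random(n) < eps, -x, x)   # flip each bit independently with prob eps
        disagree += (f(x) != f(y))
    return disagree / samples

n = 31
print("AS(majority)     ~", average_sensitivity(majority, n))   # grows like sqrt(n) for majority
print("NS_0.1(majority) ~", noise_sensitivity(majority, n, 0.1))
```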

Why are these two measures important? If we have a concept class of functions \mathcal{F}, then it turns out that bounds on these two quantities can often be translated directly into learning algorithms for these classes. By Fourier analytic arguments, good bounds on the noise sensitivity of a function immediately imply that the function has good Fourier concentration on low degree terms, which in turn implies that the so-called “low-degree algorithm” can efficiently learn the class of functions in the PAC model, with random access. Unfortunately, I can’t really say more here without getting quite a bit more technical; see [2] for a good introduction to the topic.

Now why is the average sensitivity of a Boolean function important? First of all, trust me when I say that it’s a fundamental measure of the robustness of the function. If we identify f with f^{-1}(1), then the average sensitivity is (up to normalization by 2^n) the number of hypercube edges that cross from this subset to its complement, so it is fundamentally related to the surface area of subsets of the hypercube, which comes up all over the place in Boolean analysis.

Secondly, in some cases, we can translate between one measure and the other by considering restrictions of functions. To the best of my knowledge, this appears to be a technique first introduced by Peres in 1999, though his paper is from 2004 [3]. Let f: \{-1, +1\}^n \to \mathbb{R}. We wish to bound the noise sensitivity of f, so we need to see how it behaves when we generate x uniformly at random, then y as 1 - 2\varepsilon-correlated to x. Suppose \varepsilon = 1/m for some integer m (if not, just round it). Fix a partition of the n coordinates into m bins B_1, \ldots, B_m, and a z \in \{-1, 1\}^n. Then, for any string s \in \{-1, 1\}^m, we associate it with the string x \in \{-1, 1\}^n whose ith coordinate is the ith coordinate of z times the jth coordinate of s, if i \in B_j. Why are we doing this? Well, after some thought, it’s not too hard to convince yourself that if we choose the m bins and the strings z, s uniformly at random, then we get a uniformly random string x. Moreover, to generate a string y which is 1 - 2\varepsilon-correlated with x, it suffices, after having already randomly chosen the bins, z, and s, to randomly pick a coordinate of s and flip its sign to produce a new string s', and produce a new string y with these choices of the bins, z, and s'. Thus, importantly, we can reduce the process of producing 1 - 2\varepsilon-correlated strings to the process of flipping one random bit of the input to some weird new function; but this is exactly the process we consider when we consider the average sensitivity! Thus the noise sensitivity of f equals \varepsilon times the expected (over the random choice of the bins and z) average sensitivity of this weird restricted function.
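Here is a toy simulation of that reduction (again my own illustrative code, with majority standing in for f): generating (x, y) via random bins, a random z and s, and one flipped coordinate of s should give the same answer as flipping each bit of x independently with probability \varepsilon = 1/m.

```python
# Toy check of the Peres-style reduction: the "bins" process for producing the
# pair (x, y) reproduces 1 - 2*eps correlated strings, so both estimates of
# NS_eps(f) should agree (up to sampling error).
import numpy as np

rng = np.random.default_rng(0)

def majority(x):
    return 1 if x.sum() > 0 else -1

n, m, samples = 31, 10, 20_000          # eps = 1/m

def ns_direct(f):
    bad = 0
    for _ in range(samples):
        x = rng.choice([-1, 1], size=n)
        y = np.where(rng.random(n) < 1 / m, -x, x)   # flip each bit with prob eps
        bad += (f(x) != f(y))
    return bad / samples

def ns_via_bins(f):
    bad = 0
    for _ in range(samples):
        bins = rng.integers(m, size=n)   # a uniformly random bin for each coordinate
        z = rng.choice([-1, 1], size=n)
        s = rng.choice([-1, 1], size=m)
        x = z * s[bins]                  # x_i = z_i * s_{bin(i)}
        s2 = s.copy()
        s2[rng.integers(m)] *= -1        # flip one random coordinate of s
        y = z * s2[bins]
        bad += (f(x) != f(y))
    return bad / samples

print("direct NS_(1/m)(majority):", ns_direct(majority))
print("via random bins          :", ns_via_bins(majority))
```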

Since the title of the paper includes the phrase “intersection of halfspaces,” at some point I should probably define what an intersection of halfspaces is. First of all, a halfspace (or linear threshold function) is a Boolean function of the form f(x) = \text{sgn}\left(w \cdot x - \theta\right) where w \in \mathbb{R}^n, \theta \in \mathbb{R} and for concreteness let’s say \text{sgn}\left(0\right) = 1 (however, it’s not too hard to see that any halfspace has a representation so that the linear function inside the sign is never zero on the hypercube). Intuitively, take the hyperplane in \mathbb{R}^n with normal vector w, then assign the value +1 to all points on the same side of the hyperplane as w, and -1 to the rest. Halfspaces are an incredibly rich family of Boolean functions which include some of the most important objects in Boolean analysis, such as the dictator functions, the majority function, etc. There is basically a mountain of work on halfspaces, due to their importance in learning theory, and as elementary objects which capture a surprising amount of structure.

Secondly, the intersection of k functions f_1, \ldots, f_k is the function which is 1 at x if and only if f_i (x) = 1 for all i, and -1 otherwise. If we think of each f_i as a \{ \mbox{True}, \mbox{False}\} predicate on the boolean cube, then their intersection is simply their AND (or NOR, depending on your map from \{-1, 1\} to \{ \mbox{True}, \mbox{False}\}).

Putting these together gives us the family of functions that Daniel’s work concerns. I don’t know what else to say other than they are a natural algebraic generalization of halfspaces. Hopefully you think these functions are interesting, but even if you don’t, it’s (kinda) okay, because, amazingly, it turns out Kane’s main result barely uses any properties of halfspaces! In fact, it only uses the fact that halfspaces are unate, that is, they are either monotone increasing or decreasing in each coordinate. Indeed, he proves the following, incredibly general, theorem:

Theorem. [Kane14]
Let f:\{-1, 1\}^n \to \{-1, 1\} be an intersection of k unate functions. Then

\mathbb{AS}(f) = O(\sqrt{n \log k}).
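As a quick sanity check of the statement (purely illustrative, my own code; the constant hidden in the O(\cdot) is not computed here), one can build an intersection of a few random halfspaces and estimate its average sensitivity by sampling:

```python
# Toy Monte-Carlo check: the estimated average sensitivity of an intersection
# of k random halfspaces stays well below sqrt(n log k) (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

def intersection_of_halfspaces(W, theta):
    # f(x) = 1 iff sgn(w_j . x - theta_j) = 1 for every j, and -1 otherwise
    def f(x):
        return 1 if np.all(W @ x - theta >= 0) else -1
    return f

def average_sensitivity(f, n, samples=5_000):
    total = 0
    for _ in range(samples):
        x = rng.choice([-1, 1], size=n)
        fx = f(x)
        for i in range(n):
            x[i] *= -1
            total += (f(x) != fx)
            x[i] *= -1
    return total / samples

n, k = 50, 8
W = rng.standard_normal((k, n))          # random normal vectors
theta = np.zeros(k)                      # thresholds through the origin, for simplicity
f = intersection_of_halfspaces(W, theta)
print("estimated AS   :", average_sensitivity(f, n))
print("sqrt(n log k)  :", (n * np.log(k)) ** 0.5)
```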

I’m not going to go into too much detail about the proof; unfortunately, despite my best efforts there’s not much intuition I can compress out of it (in his talk, Daniel himself admitted that there was a lemma which was mysterious even to him). Plus it’s only roughly two pages of elementary (but extremely deep) calculations, just read it yourself! At a very, very high level, the idea is that intersecting an intersection of k - 1 halfspaces with one more can only increase the average sensitivity by a small factor.

The really attentive reader might have also figured out why I gave that strange reduction between noise sensitivity and average sensitivity. This is because, importantly, when we apply this weird process of randomly choosing bins to an intersection of halfspaces, the resulting function is still an intersection of halfspaces, just over fewer coordinates (besides their unate-ness, this is the only property of intersections of halfspaces that Daniel uses). Thus, since we now know how to bound the average sensitivity of intersections of halfspaces, we also know tight bounds for the noise sensitivity of intersections of halfspaces, namely, the following:

Theorem. [Kane14]
Let f:\{-1, 1\}^n \to \{-1, 1\} be an intersection of k halfspaces. Then

\mathbb{NS}_\varepsilon (f) = O(\sqrt{\varepsilon \log k}).

Finally, this gives us good learning algorithms for intersections of halfspaces.

The paper is remarkable; there had been previous work by Nazarov (see [4]) proving optimal bounds for sensitivities of intersections of halfspaces in the Gaussian setting, which is a more restricted setting than the Boolean setting (intuitively because we can simulate Gaussians by sums of independent Boolean variables), and there were some results in the Boolean setting, but they were fairly suboptimal [5]. Furthermore, all these proofs were scary: they were incredibly involved, used powerful machinery from real analysis, drew heavily on the properties of halfspaces, etc. On the other hand, Daniel’s proof of his main result (which I would say builds on past work in the area, except it doesn’t use anything besides elementary facts), well, I think Erdos would say this proof is from “the Book”.

[1] The Average Sensitivity of an Intersection of Half Spaces, Kane, 2014.
[2] Analysis of Boolean Functions, O’Donnell, 2014.
[3] Noise Stability of Weighted Majority, Peres, 2004.
[4] On the maximal perimeter of a convex set in \mathbb{R}^n with respect to a Gaussian measure, Nazarov, 2003.
[5] An Invariance Principle for Polytopes, Harsha, Klivans, Meka, 2010.

Algebraic circuit lower bounds bonanza – STOC 2014 Recaps (Part 3)

Pritish Kamath has kindly agreed to survey the exciting flurry of algebraic complexity theory results in STOC, which, to many of us, is in a permanent state of mystery. But Pritish did some valiant depth-reduction on the state of affairs in the area, and I hope you’ll appreciate his determined effort.

Pritish Kamath on the latest and greatest in algebraic circuit lower bounds

The most natural and intuitive way to compute a polynomial is via an arithmetic circuit. In this model the inputs are variables x_1, x_2, \cdots, x_n and the computation is performed using the operations + and \times (where we allow the + gates to take arbitrary linear combinations of their inputs). The output of the circuit is a formal polynomial over the base field \mathbb{F} (as opposed to a function, which is what a boolean circuit computes).

The size of an arithmetic circuit is defined as the number of + and \times gates in the circuit, and is typically measured with respect to the number of variables. Proving super-polynomial lower bounds on the size of arithmetic circuits computing any explicit polynomial is one of the most important open problems in the area of algebraic complexity theory. The best known lower bound as of today is a modest \Omega(n \log n) due to Baur-Strassen [BS83] from 30 years ago!
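To fix ideas, here is a minimal sketch (my own illustration, not taken from any of the papers) of an arithmetic circuit as a DAG of + and \times gates, evaluated as a formal polynomial with sympy; the number of gates is exactly the size measure discussed above:

```python
# A tiny arithmetic circuit over Q: gates are + (arbitrary linear combinations)
# and *, referring to earlier nodes by index; the output is a formal polynomial.
import sympy as sp

x = sp.symbols('x1 x2 x3 x4')
nodes = list(x)                  # nodes 0..3 are the input variables

# ('add', [(c, i), ...]) computes sum_i c * node_i ; ('mul', i, j) computes node_i * node_j.
gates = [
    ('add', [(1, 0), (1, 1)]),   # node 4 = x1 + x2
    ('add', [(1, 2), (-2, 3)]),  # node 5 = x3 - 2*x4
    ('mul', 4, 5),               # node 6 = (x1 + x2) * (x3 - 2*x4)
    ('mul', 6, 6),               # node 7 = node6^2  (reusing a node: a circuit, not a formula)
]

for g in gates:
    if g[0] == 'add':
        nodes.append(sum(c * nodes[i] for c, i in g[1]))
    else:
        nodes.append(nodes[g[1]] * nodes[g[2]])

print("size (number of gates):", len(gates))
print("output polynomial     :", sp.expand(nodes[-1]))
```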

Depth-reduction results

Since proving lower bounds for general arithmetic circuits is such a daunting task, it is natural to focus on proving lower bounds for restricted models. One such restriction is to constant-depth circuits. Although this seems like a severe restriction, a surprising result due to Agrawal-Vinay [AV08], subsequently strengthened by Koiran [Koi12] and Tavenas [Tav13], showed that if a polynomial is computed by poly-sized circuits, then it is in fact computed by not-so-large depth-4 circuits. In particular, they showed the following theorem,

Theorem.[AV08, Koi12, Tav13]
Let p_n(\mathbf{x}) \in \mathbb{F}[\mathbf{x}] be a polynomial in n variables, of degree d = n^{O(1)}. If there is a \mathrm{poly}(n) sized arithmetic circuit computing p_n(\mathbf{x}), then there is a hom-\Sigma\Pi\Sigma\Pi[\sqrt{d}]-circuit of size 2^{O(\sqrt{d} \log n)} computing p_n.

hom-\Sigma\Pi\Sigma\Pi[\sqrt{d}] denotes homogeneous depth-4 circuits with the fan-in of the bottom product gates bounded by \sqrt{d}. The contrapositive of the above depth-reduction result says that proving a 2^{\omega(\sqrt{d} \log n)} lower bound on the size of hom-\Sigma\Pi\Sigma\Pi[\sqrt{d}]-circuits computing p_n(\mathbf{x}) suffices to prove that p_n(\mathbf{x}) cannot be computed by n^{O(1)}-sized arithmetic circuits!

Results from STOC 2014, before and after…

Proving lower bounds for hom-\Sigma\Pi\Sigma\Pi[\sqrt{d}]-circuits seems like a conceptually much simpler task than proving lower bounds for general circuits. Using the new technique of shifted partial derivatives, Gupta et al [GKKS13] showed that hom-\Sigma\Pi\Sigma\Pi[\sqrt{d}]-circuits computing the d \times d determinant (\mathrm{Det}_d) require size 2^{\Omega(\sqrt{d})}, which is only an \omega(\log n) factor away (in the exponent) from the depth-reduction result!

Thus, after the above result was established, there were two possible approaches for proving super-polynomial lower bounds for general circuits,

  (a) Improve the lower bound on hom-\Sigma\Pi\Sigma\Pi[\sqrt{d}]-circuits
  (b) Improve the depth-reduction result from general circuits to hom-\Sigma\Pi\Sigma\Pi[\sqrt{d}]-circuits

Post-[GKKS13], there was a lot of excitement, and a flurry of activity followed, proving stronger lower bounds (in fact the improved depth reduction of Tavenas [Tav13] came after [GKKS13]). The magnitude of this activity is well illustrated by the fact that there was a sequence of four papers in STOC 2014, each showing significant improvements over the previous ones, all produced in a span of just 6 months or so!

[KSS14] A super-polynomial lower bound for regular arithmetic formulas

In this paper, Kayal-Saha-Saptharishi prove a lower bound of 2^{\Omega(\sqrt{d} \log n)} on the size of hom-\Sigma\Pi\Sigma\Pi[\sqrt{d}]-circuits computing an explicit polynomial (neatly constructed using Nisan-Wigderson designs), denoted by \mathrm{NW}_{n,d}. After this result, we are only an \omega(1) factor away (in the exponent) from our goal!

[FLMS14] Lower bounds for depth 4 formulas computing iterated matrix multiplication

In this paper, Fournier-Limaye-Malod-Srinivasan prove a lower bound of 2^{\Omega(\sqrt{d} \log n)} on the size of hom-\Sigma\Pi\Sigma\Pi[\sqrt{d}]-circuits computing the iterated matrix product of d generic n \times n matrices, denoted by \mathrm{IMM}_{n,d}. This result improved upon [KSS14] in the sense that \mathrm{IMM}_{n,d} is in \mathrm{VP}, whereas \mathrm{NW}_{n,d} was only in \mathrm{VNP}. This shows that the depth-reduction of [Tav13] is in fact tight, and hence rules out approach (b) stated above!

[KS14a] The Limits of Depth Reduction for Arithmetic Formulas: It’s all about the top fan-in

Taking it one step further than [FLMS14], Kumar-Saraf prove a 2^{\Omega(\sqrt{d} \log n)} lower bound for an explicit polynomial that can be computed by \mathrm{poly}(n)-sized formulas, thus showing that the depth-reduction of Tavenas is in fact tight even for formulas (formulas are weaker models than circuits, in that the fan-out of any gate in a formula is at most one; thus, formulas are representable by trees, whereas circuits are represented by directed acyclic graphs).

Another question that had been tantalizingly open for a long time was to prove super-polynomial lower bounds on the size of general hom-\Sigma\Pi\Sigma\Pi-circuits (that is, homogeneous depth-4 circuits without the bottom fan-in restriction). Towards this direction, Kumar-Saraf considered the problem of proving lower bounds on the size of hom-\Sigma\Pi\Sigma\Pi-circuits with bounded top fan-in, and in this paper they also prove a super-polynomial lower bound on the size of such circuits when the top fan-in is o(\log n).

[KLSS14] Super-polynomial lower bounds for depth-4 homogeneous arithmetic formulas

And finally, Kayal-Limaye-Saha-Srinivasan prove a super-polynomial lower bound of n^{\Omega(\log n)} on the size of general hom-\Sigma\Pi\Sigma\Pi-circuits computing \mathrm{IMM}_{n,d}. Interestingly, Kumar-Saraf in a different paper [KS13] had independently proved a weaker lower bound of n^{\Omega(\log \log n)} on the same circuit model computing a Nisan-Wigderson type polynomial (similar to the one in [KSS14]).

Subsequent results

Now that super-polynomial lower bounds for hom-\Sigma\Pi\Sigma\Pi-circuits had been shown, it remained open to prove exponential lower bounds (such as 2^{\Omega(\sqrt{d} \log n)}) for such circuits. Since the STOC deadline in November 2013, more improvements have come out!

Kayal-Limaye-Saha-Srinivasan [KLSS14] have proved a lower bound of 2^{\Omega(\sqrt{d} \log n)} on the size of hom-\Sigma\Pi\Sigma\Pi-circuits computing an explicit Nisan-Wigderson type polynomial (similar to the ones in [KSS14] and [KS13]).

Completing this picture, Kumar-Saraf [KS14b] have proved a similar lower bound of 2^{\Omega(\sqrt{d} \log n)} on the size of hom-\Sigma\Pi\Sigma\Pi-circuits computing \mathrm{IMM}_{n,d}, which shows that the depth-reduction of [Tav13] is tight even if the reduction was to general \Sigma\Pi\Sigma\Pi-circuits (without bottom fan-in condition).

Perhaps we will see these papers in FOCS 2014. ;)

Bird’s eye view of the techniques

To prove lower bounds on any class \mathcal{C} of circuits, a natural idea (pun intended!) used in most lower bound techniques has been to introduce a measure \Gamma : \mathbb{F}[\mathbf{x}] \rightarrow \mathbb{Z}_+ such that \Gamma(p) is large for a certain polynomial p, but \Gamma is small for every polynomial computed by a small-sized circuit C \in \mathcal{C}, and hence no small-sized circuit in \mathcal{C} can compute p.

The main technique used in all the papers above is a variant of the shifted partial derivatives measure introduced in [Kay12] and [GKKS13], itself inspired by the partial derivatives measure introduced by Nisan-Wigderson [NW95]. The measure is defined as follows,

\Gamma_{k, \ell}(f) = \mathrm{dim}(\mathbf{x}^{\le \ell} \partial^{=k} f)

Essentially, this measure looks at the dimension of the subspace spanned by the polynomials obtained by multiplying all k-th order partial derivatives of f by monomials of degree at most \ell (where we think of polynomials as vectors of coefficients).
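For tiny examples the measure can be computed directly; the following sympy sketch (my own illustration, with small, arbitrarily chosen polynomials) forms all k-th order partial derivatives, multiplies them by all monomials of degree at most \ell, and takes the rank of the resulting coefficient matrix:

```python
# Toy computation of the shifted partial derivatives measure
# Gamma_{k,l}(f) = dim( x^{<=l} * partial^{=k} f ), for very small examples.
import itertools
import sympy as sp

def gamma(f, variables, k, l):
    # all k-th order partial derivatives of f
    derivs = [sp.diff(f, *c) for c in itertools.combinations_with_replacement(variables, k)]
    # all monomials of degree at most l (sp.Mul() of an empty tuple is 1)
    shifts = [sp.Mul(*c) for d in range(l + 1)
              for c in itertools.combinations_with_replacement(variables, d)]
    shifted = []
    for der in derivs:
        for mon in shifts:
            q = sp.expand(mon * der)
            if q != 0:
                shifted.append(q)
    # write each shifted derivative as a vector of coefficients and take the rank
    polys = [sp.Poly(p, *variables) for p in shifted]
    monoms = sorted({mono for p in polys for mono in p.monoms()})
    rows = [[p.as_dict().get(mono, 0) for mono in monoms] for p in polys]
    return sp.Matrix(rows).rank()

x1, x2, x3, x4 = sp.symbols('x1 x2 x3 x4')
f = x1*x2*x3 + x2*x3*x4            # a small multilinear polynomial
g = (x1 + x2 + x3 + x4)**3         # a polynomial with a very small circuit
print("Gamma_{1,1}(f) =", gamma(f, (x1, x2, x3, x4), 1, 1))
print("Gamma_{1,1}(g) =", gamma(g, (x1, x2, x3, x4), 1, 1))
```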

The other new ideas used in the papers described above are random restrictions (somewhat reminiscent of their use in Håstad’s switching lemma) and novel variants of shifted partial derivatives obtained by projecting onto a smaller space of polynomials. In the interest of keeping this post short, we won’t go into those details here.

TCS in the Big Apple – STOC 2014 Recaps (Part 1)

Ah, summertime. For many NotSoGITCSers, it means no classes, warm weather, ample research time… and a trip to STOC. This year the conference was held in New York City, only four hours away from Cambridge via sketchy Chinatown tour bus. Continuing in the great tradition of conference blogging by TCS blogs (see here and here), NotSoGITCS will share some of our favorite papers from the conference. I’ll start this off with my recap of Rothvoss’s paper on extension complexity. Over the next few posts, we’ll hear from Clément Canonne and Gautam Kamath on Efficient Distribution Estimation, Pritish Kamath on the latest results in algebraic circuit complexity, Justin Holmgren on indistinguishability obfuscation, and Sunoo Park on coin flipping!

Henry Yuen on The matching polytope has exponential extension complexity, by Thomas Rothvoss

The Best Paper award for this year’s STOC went to this paper, which is another milestone result in the recent flurry of activity in lower bounds for extension complexity of combinatorial problems. This area represents a fruitful “meeting in the middle” between complexity theory and algorithms. Although a proof of P ≠ NP — a conjecture about the limitation of all polynomial time algorithms — seems astronomically distant, we can instead shoot for a closer goal: show that polynomial time algorithms we know of now cannot solve NP-complete problems. The area of extension complexity is about understanding the limitations of some of the finest weapons in our polynomial-time toolbox: LP and SDP solvers. As many of you know, these algorithms are tremendously powerful and versatile, in both theory and practice alike.

A couple years ago, STOC 2012 (also held in New York!) saw the breakthrough result of Fiorini, et al. titled Linear vs. Semidefinite Extended Formulations: Exponential Separation and Strong Lower Bounds. This paper showed that while powerful, LP solvers cannot naturally solve NP-complete problems in polynomial time. What do I mean by naturally solve? Let’s take the Traveling Salesman Problem for a moment (which was the subject of the aforementioned paper). One way to try to solve TSP on an n-city instance using linear programming is to try to optimize over the polytope P_{\mathrm{TSP}} where each vertex corresponds to a tour of the complete graph on n nodes. Turns out, this polytope has exponentially many facets, and so is a rather unwieldy object to deal with — simply writing down a system of linear inequalities defining P_{\mathrm{TSP}} takes exponential time! Instead, people would like to optimize over another polytope Q_{\mathrm{TSP}} in higher dimension that projects down to P_{\mathrm{TSP}} (such a Q_{\mathrm{TSP}} would be called a linear extension of P_{\mathrm{TSP}}), except that now Q_{\mathrm{TSP}} has a polynomial number of facets. Does such a succinct linear extension exist? If so, that would imply P = NP! The Fiorini, et al. paper showed that one can’t solve the Traveling Salesman Problem this way: the TSP polytope has exponential extension complexity — meaning that any linear extension Q_{\mathrm{TSP}} of P_{\mathrm{TSP}} necessarily has exponentially many facets [1].

In a sense, one shouldn’t be too surprised by this result: after all, we all know that P ≠ NP, so of course the TSP polytope has exponential extension complexity. But we’ve learned more than just that LP solvers can’t solve NP-complete problems: we’ve gained some insight into the why. However, after the Fiorini et al. result, the natural question remained: is this exponential complexity of polytopes about NP-hardness, or is it something else? In particular, what is the extension complexity of the perfect matching polytope P_{\mathrm{PM}} (the convex hull of all perfect matchings in K_n)? This is an interesting question for a number of reasons: this polytope, like P_{\mathrm{TSP}}, has an exponential number of facets. However, unlike the TSP polytope, optimizing linear objective functions over P_{\mathrm{PM}} is actually polynomial-time solvable! This is due to the seminal work of Jack Edmonds, who, in the 1960’s, effectively brought the notion of “polynomial time computability” into the (computer science) public consciousness by demonstrating that the maximum matching problem admits an efficient solution.

As you might have guessed from the title, Thomas Rothvoss showed that actually the matching polytope cannot be described as the projection of a polynomially-faceted polytope! So, while we have LP-based algorithms for optimizing over P_{\mathrm{PM}}, these algorithms must nontrivially rely on the structure of the matching problem, and cannot be expressed as generically optimizing over some succinct linear extension. I’ll say a few words about how he shows this. To lower bound the extension complexity of a polytope, one starts by leveraging Yannakakis’s Factorization Theorem, proved in his 1991 paper that started the whole area of extension complexity. His theorem says that, for a polytope P, \mathrm{xc}(P) = \mathrm{rank}_{+}(S_P), where the left-hand side denotes the extension complexity of P [2], and the right-hand side denotes the non-negative rank of the slack matrix of P [3]. Now, instead of worrying about every possible linear extension of P, we “only” have to focus our attention on the non-negative rank of S_P. It turns out that, fortuitously, we have techniques for lower bounding the non-negative rank of matrices from communication complexity (a connection that was also pointed out by Yannakakis). Roughly speaking, communication complexity tells us that the non-negative rank of a matrix is high if the matrix cannot be “described” as a small collection of combinatorial rectangles.

This idea is captured in what Rothvoss calls the “Hyperplane separation lower bound”: let S be an arbitrary non-negative m\times n matrix and W an arbitrary matrix of the same dimensions (not necessarily non-negative). Then \mathrm{rank}_+(S) \geq \langle W, S \rangle / (\|S \|_\infty \alpha_W), where \langle W, S \rangle is the Frobenius inner product between W and S, \|S \|_\infty is the largest-magnitude entry of S, and \alpha_W := \max \{ \langle W, R \rangle : R \in \{0,1\}^{m \times n}\text{ is rank 1} \}. Intuitively, in order to show that the non-negative rank of S is large, you want to exhibit a matrix W (a “hyperplane”) that has large correlation with S, but has small correlation with any rectangle R. Rothvoss presents such a hyperplane W that proves the matching polytope has exponential extension complexity; the hard part is showing that \langle W, R \rangle \leq 2^{-\Omega(n)} for all rectangles R. To do so, Rothvoss follows a similar strategy to Razborov’s proof of the \Omega(n) lower bound on the randomized communication complexity of DISJOINTNESS, with substantial modifications to fit the matching problem.
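As a toy illustration of this inequality (with a hand-picked W, nothing like the one in Rothvoss's proof), one can take the slack matrix of a small polytope, say the square [0, 1]^2, and brute-force \alpha_W over all combinatorial rectangles; for this toy example the bound works out to 4, matching the number of facets of the square:

```python
# Toy illustration of the hyperplane separation lower bound
#   rank_+(S) >= <W, S> / (||S||_inf * alpha_W)
# applied to the slack matrix of the square [0,1]^2 (nothing to do with matchings).
import itertools
import numpy as np

# The square as {x : Ax <= b}, together with its vertices; S[i][j] = b[j] - A[j] . V[i].
A = np.array([[1, 0], [-1, 0], [0, 1], [0, -1]])
b = np.array([1, 0, 1, 0])
V = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
S = b[None, :] - V @ A.T                # 4 x 4 non-negative slack matrix

W = 2 * S - 1                           # hand-picked "hyperplane": +1 on the support of S, -1 elsewhere

def alpha(W):
    """Brute force over all combinatorial rectangles (non-zero 0/1 rank-1 matrices)."""
    m, _ = W.shape
    best = W.max()                      # a single entry is always a valid rectangle
    for size in range(1, m + 1):
        for rows in itertools.combinations(range(m), size):
            col_sums = W[list(rows)].sum(axis=0)
            if (col_sums > 0).any():    # for a fixed row set, take exactly the profitable columns
                best = max(best, col_sums[col_sums > 0].sum())
    return best

bound = (W * S).sum() / (S.max() * alpha(W))
print("slack matrix of the square:\n", S)
print("hyperplane separation bound on rank_+(S):", bound)
```

Rothvoss's actual argument applies the same inequality to the (exponentially large) slack matrix of the matching polytope, with a far more delicate choice of W.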

There are a number of reasons why I like these papers in extension complexity: they’re concrete, solid steps towards the dream of proving P ≠ NP. They’re short and elegantly written: the Fiorini, et al. paper is 24 pages; the Rothvoss paper is less than 20. They also demonstrate the unity of theoretical computer science: drawing upon ideas from algorithms, combinatorics, communication complexity, and even quantum information.

-Henry Yuen

[1] Well, now we know that these guys couldn’t have used a succinct extended formulation.

[2] \mathrm{xc}(P) is defined to be the minimum number of facets of any polytope that linearly projects to P.

[3] Let S be a non-negative matrix of size m\times n. Then \mathrm{rank}_+(S) = k if k is the smallest integer such that there exists a factorization S = TU where T and U are non-negative matrices of dimensions m\times k and k\times n, respectively. Let P = \{x : Ax \leq b \} be a polytope defined by m inequalities, with vertices v_1,\ldots,v_n. Then the (i,j)th entry of the slack matrix S_P of P is defined to be b_j - A_j v_i, with A_j the jth row of A.