Insensitive Intersections of Halfspaces – STOC 2014 Recaps (Part 11)

In the eleventh and final installment of our STOC 2014 recaps, Jerry Li tells us about a spectacularly elegant result by Daniel Kane. It’s an example of what I like to call a “one-page wonder” — this a bit of a misnomer, since Kane’s paper is slightly more than five pages long, but the term refers to any beautiful paper which is (relatively) short and readable.

We hope you’ve enjoyed our coverage of this year’s STOC. We were able to write about a few of our favorite results, but there’s a wealth of other interesting papers that deserve your attention. I encourage you to peruse the proceedings and discover some favorites of your own.

Jerry Li on The Average Sensitivity of an Intersection of Half Spaces, by Daniel Kane

The Monday afternoon sessions kicked off with Daniel Kane presenting his work on the average sensitivity of an intersection of halfspaces. Usually, FOCS/STOC talks can’t even begin to fit all the technical details from the paper, but unusually, Daniel’s talk included a complete proof of his result, without omitting any details. Amazingly, his result is very deep and general, so something incredible is clearly happening somewhere.

At the highest level, Daniel deals with the study of a certain class of Boolean functions. When we classically think about Boolean functions, we think of things such as CNFs, DNFs, decision trees, etc., which map into things like \mathbb{F}_2, \{0, 1\}, or \{\mbox{True}, \mbox{False}\}, but today, and often in the study of Boolean analysis, we will think of functions as mapping \{-1, 1\}^n to \{-1, 1\}, which is roughly equivalent for many purposes (O’Donnell has a nice rule of thumb as when to use one convention or the other here). Given a function f:\{-1, 1\}^n \to \{-1, 1\}, we can define two important measures of sensitivity. The first is the average sensitivity (or, for those of you like me who grew up with O’Donnell’s book, the total influence) of the function, namely,

\mathbb{AS}(f) = \mathbb{E}_{x \sim \{-1, 1\}^n} [| \{i: f(x^{i \to 1}) \neq f(x^{i \to -1}) \} |]

where x^{i \to a} is simply x with its ith coordinate set to a. The second is the noise sensitivity of the function, which is defined similarly: for a parameter \varepsilon \in (0, 1), it is the probability that if we sample x uniformly at random from \{-1, 1\}^n, then independently flip each of its bits with probability \varepsilon, the value of f at these two inputs is different. We denote this quantity \mathbb{NS}_\varepsilon (f). When we generate a string y from a string x in this way we say they are 1 - 2\varepsilon-correlated. The weird function of \varepsilon in that expression is because often we equivalently think of y being generated from x by independently keeping each coordinate of x fixed with probability 1 - 2\varepsilon, and uniformly rerandomizing that bit with probability 2 \varepsilon.

Why are these two measures important? If we have a concept class of functions \mathcal{F}, then it turns out that bounds on these two quantities can often be translated directly into learning algorithms for these classes. By Fourier analytic arguments, good bounds on the noise sensitivity of a function immediately imply that the function has good Fourier concentration on low degree terms, which in turn imply that the so-called “low-degree algorithm” can efficiently learn the class of functions in the PAC model, with random access. Unfortunately, I can’t really give any more detail here without a lot more technical detail, see [2] for a good introduction to the topic.

Now why is the average sensitivity of a Boolean function important? First of all, trust me when I say that it’s a fundamental measure of the robustness of the function. If we identify f with f^{-1} (1), then the average sensitivity is how many edges cross from one subset into another (over 2^n), so it is fundamentally related to the surface area of subsets of the hypercube, which comes up all over the place in Boolean analysis. Secondly, in some cases, we can translate between one measure and the other by considering restrictions of functions. To the best of my knowledge, this appears to be a technique first introduced by Peres in 1999, though his paper is from 2004 [3]. Let f: \{-1, +1\}^n \to \mathbb{R}. We wish to bound the noise sensitivity of f, so we need to see how it behaves when we generate x uniformly at random, then y as 1 - 2 \varepsilon-correlated to x. Suppose \varepsilon = 1/m for some integer m (if not, just round it). Fix a partition of the n coordinates into m bins B_1, \ldots, B_m, and a z \in \{-1, 1\}^n. Then, for any string s \in \{-1, 1\}^m, we associate it with the string x \in \{-1, 1\}^n whose ith coordinate is the ith coordinate of z times the jth coordinate of s, if i \in B_j. Why are we doing this? Well, after some thought, it’s not too hard to convince yourself that if we choose the m bins and the strings z, s uniformly at random, then we get a uniformly random string x. Moreover, to generate a string y which is 1 - 2 \varepsilon-correlated with x, it suffices to, after having already randomly chosen the bins, z, and s, to randomly pick a coordinate of s and flip its sign to produce a new string s', and produce a new string y with these choices of the bins, z, and s'. Thus, importantly, we can reduce the process of producing 1 - 2 \varepsilon-correlated strings to the process of randomly flipping one bit of some weird new function–but this is the process we consider when we consider the average sensitivity! Thus noise sensitivity of f is exactly equal to the expected (over the random choice of the bins and z) average sensitivity of this weird restricted thing. Why this is useful will (hopefully) become clear later.

Since the title of the paper includes the phrase “intersection of halfspaces,” at some point I should probably define what an intersection of halfspaces is. First of all, a halfspace (or linear threshold function) is a Boolean function of the form f(x) = \text{sgn}\left(w \cdot x - \theta\right) where w \in \mathbb{R}^n, \theta \in \mathbb{R} and for concreteness let’s say \text{sgn}\left(0\right) = 1 (however, it’s not too hard to see that any halfspace has a representation so that the linear function inside the sign is never zero on the hypercube). Intuitively, take the hyperplane in \mathbb{R}^n with normal vector w, then assign to all points which are in the same side as w of the hyper plane the value +1, and the rest -1. Halfspaces are an incredibly rich family of Boolean functions which include arguably some of the important objects in Boolean analysis, such as the dictator functions, the majority function, etc. There is basically a mountain of work on halfspaces, due to their importance in learning theory, and as elementary objects which capture a surprising amount of structure.

Secondly, the intersection of k functions f_1, \ldots, f_k is the function which is 1 at x if and only f_i (x) = 1 for all i, and -1 otherwise. If we think of each f_i as a \{ \mbox{True}, \mbox{False}\} predicate on the boolean cube, then their intersection is simply their AND (or NOR, depending on your map from \{-1, 1\} to \{ \mbox{True}, \mbox{False}\}).

Putting these together gives us the family of functions that Daniel’s work concerns. I don’t know what else to say other than they are a natural algebraic generalization of halfspaces. Hopefully you think these functions are interesting, but even if you don’t, it’s (kinda) okay, because, amazingly, it turns out Kane’s main result barely uses any properties of halfspaces! In fact, it only uses the fact that halfspaces are unate, that is, they are either monotone increasing or decreasing in each coordinate. In fact, he proves the following, incredibly general, theorem:

Theorem. [Kane14]
Let f:\{-1, 1\}^n \to \{-1, 1\} be an intersection of k unate functions. Then

\mathbb{AS}(f) = O(\sqrt{n \log k}).

I’m not going to go into too much detail about the proof; unfortunately, despite my best efforts there’s not much intuition I can compress out of it (in his talk, Daniel himself admitted that there was a lemma which was mysterious even to him). Plus it’s only roughly two pages of elementary (but extremely deep) calculations, just read it yourself! At a very, very high level, the idea is that intersecting a intersection of k - 1 halfspaces with one more can only increase the average sensitivity by a small factor.

The really attentive reader might have also figured out why I gave that strange reduction between noise sensitivity and average sensitivity. This is because, importantly, when we apply this weird process of randomly choosing bins to an intersection of halfspaces, the resulting function is still an intersection of halfspaces, just over fewer coordinates (besides their unate-ness, this is the only property of intersections of halfspaces that Daniel uses). Thus, since we now know how to bound the average sensitivity of halfspaces, we also know tight bounds for the noise sensitivities of intersection of halfspaces, namely, the following:

Theorem. [Kane14]
Let f:\{-1, 1\}^n \to \{-1, 1\} be an intersection of k halfspaces. Then

\mathbb{NS}_\varepsilon (f) = O(\sqrt{\varepsilon \log k}).

Finally, this gives us good learning algorithms for intersections of halfspaces.

The paper is remarkable; there had been previous work by Nazarov (see [4]) proving optimal bounds for sensitivities of intersections of halfspaces in the Gaussian setting, which is a more restricted setting than the Boolean setting (intuitively because we can simulate Gaussians by sums of independent Boolean variables), and there were some results in the Boolean setting, but they were fairly suboptimal [5]. Furthermore, all these proofs were scary: they were incredibly involved, used powerful machinery from real analysis, drew heavily on the properties of halfspaces, etc. On the other hand, Daniel’s proof of his main result (which I would say builds on past work in the area, except it doesn’t use anything besides elementary facts), well, I think Erdos would say this proof is from “the Book”.

[1] The Average Sensitivity of an Intersection of Half Spaces, Kane, 2014.
[2] Analysis of Boolean Functions, O’Donnell, 2014.
[3] Noise Stability of Weighted Majority, Peres, 2004.
[4] On the maximal perimeter of a convex set in \mathbb{R}^n with respect to a Gaussian measure, Nazarov, 2003.
[5] An Invariance Principle for Polytopes, Harsha, Klivans, Meka, 2010.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s