Insensitive Intersections of Halfspaces – STOC 2014 Recaps (Part 11)

In the eleventh and final installment of our STOC 2014 recaps, Jerry Li tells us about a spectacularly elegant result by Daniel Kane. It’s an example of what I like to call a “one-page wonder” — this is a bit of a misnomer, since Kane’s paper is slightly more than five pages long, but the term refers to any beautiful paper which is (relatively) short and readable.

We hope you’ve enjoyed our coverage of this year’s STOC. We were able to write about a few of our favorite results, but there’s a wealth of other interesting papers that deserve your attention. I encourage you to peruse the proceedings and discover some favorites of your own.

Jerry Li on The Average Sensitivity of an Intersection of Half Spaces, by Daniel Kane

The Monday afternoon sessions kicked off with Daniel Kane presenting his work on the average sensitivity of an intersection of halfspaces. Usually, FOCS/STOC talks can’t even begin to fit in all the technical details from the paper; unusually, Daniel’s talk included a complete proof of his result, without omitting any details. And yet the result is very deep and general, so something incredible is clearly happening somewhere.

At the highest level, Daniel’s paper studies a certain class of Boolean functions. When we classically think about Boolean functions, we think of things such as CNFs, DNFs, decision trees, etc., which map into things like \mathbb{F}_2, \{0, 1\}, or \{\mbox{True}, \mbox{False}\}, but today, and often in the study of Boolean analysis, we will think of functions as mapping \{-1, 1\}^n to \{-1, 1\}, which is roughly equivalent for many purposes (O’Donnell has a nice rule of thumb as to when to use one convention or the other here). Given a function f:\{-1, 1\}^n \to \{-1, 1\}, we can define two important measures of sensitivity. The first is the average sensitivity (or, for those of you like me who grew up with O’Donnell’s book, the total influence) of the function, namely,

\mathbb{AS}(f) = \mathbb{E}_{x \sim \{-1, 1\}^n} [| \{i: f(x^{i \to 1}) \neq f(x^{i \to -1}) \} |]

where x^{i \to a} is simply x with its ith coordinate set to a. The second is the noise sensitivity of the function, which is defined similarly: for a parameter \varepsilon \in (0, 1), it is the probability that if we sample x uniformly at random from \{-1, 1\}^n, then independently flip each of its bits with probability \varepsilon, the value of f at these two inputs is different. We denote this quantity \mathbb{NS}_\varepsilon (f). When we generate a string y from a string x in this way, we say they are (1 - 2\varepsilon)-correlated. The weird function of \varepsilon in that expression appears because we often equivalently think of y as being generated from x by independently keeping each coordinate of x fixed with probability 1 - 2\varepsilon, and uniformly rerandomizing it with probability 2\varepsilon.
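
To make these definitions concrete, here is a small Python sketch (mine, not from the paper) that estimates both quantities for the majority function by sampling; the true values are \Theta(\sqrt{n}) for the average sensitivity and \Theta(\sqrt{\varepsilon}) for the noise sensitivity.

```python
import random

def majority(x):
    # x is a list of +/-1 values; ties are broken towards +1
    return 1 if sum(x) >= 0 else -1

def avg_sensitivity(f, n, samples=20000):
    # AS(f) = E_x[ number of coordinates whose flip changes f(x) ]
    total = 0
    for _ in range(samples):
        x = [random.choice([-1, 1]) for _ in range(n)]
        fx = f(x)
        for i in range(n):
            y = x[:]
            y[i] = -y[i]
            total += (f(y) != fx)
    return total / samples

def noise_sensitivity(f, n, eps, samples=20000):
    # NS_eps(f) = Pr[f(x) != f(y)], y obtained from x by flipping each bit w.p. eps
    disagree = 0
    for _ in range(samples):
        x = [random.choice([-1, 1]) for _ in range(n)]
        y = [(-xi if random.random() < eps else xi) for xi in x]
        disagree += (f(x) != f(y))
    return disagree / samples

n = 51
print("AS(majority)     ~", avg_sensitivity(majority, n))        # about sqrt(2n/pi)
print("NS_0.1(majority) ~", noise_sensitivity(majority, n, 0.1)) # Theta(sqrt(eps))
```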

Why are these two measures important? If we have a concept class of functions \mathcal{F}, then it turns out that bounds on these two quantities can often be translated directly into learning algorithms for these classes. By Fourier analytic arguments, good bounds on the noise sensitivity of a function immediately imply that the function has good Fourier concentration on low degree terms, which in turn implies that the so-called “low-degree algorithm” can efficiently learn the class of functions in the PAC model, with random access. Unfortunately, I can’t really say more here without getting into a lot of technical detail; see [2] for a good introduction to the topic.

Now why is the average sensitivity of a Boolean function important? First of all, trust me when I say that it’s a fundamental measure of the robustness of the function. If we identify f with f^{-1} (1), then the average sensitivity is how many edges cross from one subset into another (over 2^n), so it is fundamentally related to the surface area of subsets of the hypercube, which comes up all over the place in Boolean analysis. Secondly, in some cases, we can translate between one measure and the other by considering restrictions of functions. To the best of my knowledge, this appears to be a technique first introduced by Peres in 1999, though his paper is from 2004 [3]. Let f: \{-1, +1\}^n \to \mathbb{R}. We wish to bound the noise sensitivity of f, so we need to see how it behaves when we generate x uniformly at random, then y as (1 - 2\varepsilon)-correlated to x. Suppose \varepsilon = 1/m for some integer m (if not, just round it). Fix a partition of the n coordinates into m bins B_1, \ldots, B_m, and a z \in \{-1, 1\}^n. Then, for any string s \in \{-1, 1\}^m, we associate it with the string x \in \{-1, 1\}^n whose ith coordinate is the ith coordinate of z times the jth coordinate of s, if i \in B_j. Why are we doing this? Well, after some thought, it’s not too hard to convince yourself that if we choose the m bins and the strings z, s uniformly at random, then we get a uniformly random string x. Moreover, to generate a string y which is (1 - 2\varepsilon)-correlated with x, it suffices, after having already randomly chosen the bins, z, and s, to randomly pick a coordinate of s and flip its sign to produce a new string s', and produce a new string y with these choices of the bins, z, and s'. Thus, importantly, we can reduce the process of producing (1 - 2\varepsilon)-correlated strings to the process of randomly flipping one bit of the input to some weird new function – but this is exactly the process we consider when we consider the average sensitivity! Thus the noise sensitivity of f is exactly equal to the expected (over the random choice of the bins and z) average sensitivity of this weird restricted thing. Why this is useful will (hopefully) become clear later.
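
Here is a toy numerical check of this reduction (again mine, not from [3]): estimate the noise sensitivity of majority once directly, and once by drawing random bins, z and s, building x, and then flipping the sign of a single random coordinate of s. The two estimates should agree up to sampling error.

```python
import random

def majority(x):
    return 1 if sum(x) >= 0 else -1

def ns_direct(f, n, eps, samples=20000):
    disagree = 0
    for _ in range(samples):
        x = [random.choice([-1, 1]) for _ in range(n)]
        y = [(-xi if random.random() < eps else xi) for xi in x]
        disagree += (f(x) != f(y))
    return disagree / samples

def ns_via_bins(f, n, m, samples=20000):
    # Peres' reduction with eps = 1/m: random bins, z and s give a uniform x,
    # and flipping one random coordinate of s gives a (1 - 2/m)-correlated y.
    disagree = 0
    for _ in range(samples):
        bins = [random.randrange(m) for _ in range(n)]
        z = [random.choice([-1, 1]) for _ in range(n)]
        s = [random.choice([-1, 1]) for _ in range(m)]
        x = [z[i] * s[bins[i]] for i in range(n)]
        j = random.randrange(m)
        s2 = s[:]
        s2[j] = -s2[j]
        y = [z[i] * s2[bins[i]] for i in range(n)]
        disagree += (f(x) != f(y))
    return disagree / samples

n, m = 51, 10
print(ns_direct(majority, n, 1.0 / m))   # direct estimate of NS_{1/m}
print(ns_via_bins(majority, n, m))       # should be close to the line above
```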

Since the title of the paper includes the phrase “intersection of halfspaces,” at some point I should probably define what an intersection of halfspaces is. First of all, a halfspace (or linear threshold function) is a Boolean function of the form f(x) = \text{sgn}\left(w \cdot x - \theta\right) where w \in \mathbb{R}^n, \theta \in \mathbb{R} and for concreteness let’s say \text{sgn}\left(0\right) = 1 (however, it’s not too hard to see that any halfspace has a representation so that the linear function inside the sign is never zero on the hypercube). Intuitively, take the hyperplane in \mathbb{R}^n with normal vector w, then assign the value +1 to all points on the same side of the hyperplane as w, and -1 to the rest. Halfspaces are an incredibly rich family of Boolean functions which include arguably some of the most important objects in Boolean analysis, such as the dictator functions, the majority function, etc. There is basically a mountain of work on halfspaces, due to their importance in learning theory, and as elementary objects which capture a surprising amount of structure.

Secondly, the intersection of k functions f_1, \ldots, f_k is the function which is 1 at x if and only if f_i(x) = 1 for all i, and -1 otherwise. If we think of each f_i as a \{ \mbox{True}, \mbox{False}\} predicate on the Boolean cube, then their intersection is simply their AND (or NOR, depending on your map from \{-1, 1\} to \{ \mbox{True}, \mbox{False}\}).
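
For concreteness, here is a minimal sketch of these two definitions in code (the weights below are arbitrary; they only serve to produce some intersection of halfspaces to play with).

```python
import random

def halfspace(w, theta):
    # f(x) = sgn(w . x - theta), with sgn(0) taken to be +1
    def f(x):
        return 1 if sum(wi * xi for wi, xi in zip(w, x)) - theta >= 0 else -1
    return f

def intersection(fs):
    # +1 iff every f_i outputs +1; this is the AND in the +/-1 world
    def f(x):
        return 1 if all(fi(x) == 1 for fi in fs) else -1
    return f

# Example: the intersection of 3 random halfspaces over {-1, 1}^20
n, k = 20, 3
fs = [halfspace([random.gauss(0, 1) for _ in range(n)], 0.0) for _ in range(k)]
g = intersection(fs)
x = [random.choice([-1, 1]) for _ in range(n)]
print(g(x))
```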

Putting these together gives us the family of functions that Daniel’s work concerns. I don’t know what else to say other than they are a natural algebraic generalization of halfspaces. Hopefully you think these functions are interesting, but even if you don’t, it’s (kinda) okay, because, amazingly, it turns out Kane’s main result barely uses any properties of halfspaces! In fact, it only uses the fact that halfspaces are unate, that is, they are either monotone increasing or decreasing in each coordinate. In fact, he proves the following, incredibly general, theorem:

Theorem. [Kane14]
Let f:\{-1, 1\}^n \to \{-1, 1\} be an intersection of k unate functions. Then

\mathbb{AS}(f) = O(\sqrt{n \log k}).

I’m not going to go into too much detail about the proof; unfortunately, despite my best efforts there’s not much intuition I can compress out of it (in his talk, Daniel himself admitted that there was a lemma which was mysterious even to him). Plus, it’s only roughly two pages of elementary (but extremely deep) calculations; just read it yourself! At a very, very high level, the idea is that intersecting an intersection of k - 1 halfspaces with one more halfspace can only increase the average sensitivity by a small factor.

The really attentive reader might have also figured out why I gave that strange reduction between noise sensitivity and average sensitivity. This is because, importantly, when we apply this weird process of randomly choosing bins to an intersection of halfspaces, the resulting function is still an intersection of halfspaces, just over fewer coordinates (besides their unate-ness, this is the only property of intersections of halfspaces that Daniel uses). Thus, since we now know how to bound the average sensitivity of an intersection of halfspaces, we also know tight bounds for the noise sensitivity of intersections of halfspaces, namely, the following:

Theorem. [Kane14]
Let f:\{-1, 1\}^n \to \{-1, 1\} be an intersection of k halfspaces. Then

\mathbb{NS}_\varepsilon (f) = O(\sqrt{\varepsilon \log k}).

Finally, via the connection between noise sensitivity, Fourier concentration, and the low-degree algorithm described earlier, this gives us good learning algorithms for intersections of halfspaces.

The paper is remarkable; there had been previous work by Nazarov (see [4]) proving optimal bounds for sensitivities of intersections of halfspaces in the Gaussian setting, which is a more restricted setting than the Boolean setting (intuitively because we can simulate Gaussians by sums of independent Boolean variables), and there were some results in the Boolean setting, but they were fairly suboptimal [5]. Furthermore, all these proofs were scary: they were incredibly involved, used powerful machinery from real analysis, drew heavily on the properties of halfspaces, etc. On the other hand, Daniel’s proof of his main result (which I would say builds on past work in the area, except it doesn’t use anything besides elementary facts), well, I think Erdős would say this proof is from “the Book”.

[1] The Average Sensitivity of an Intersection of Half Spaces, Kane, 2014.
[2] Analysis of Boolean Functions, O’Donnell, 2014.
[3] Noise Stability of Weighted Majority, Peres, 2004.
[4] On the maximal perimeter of a convex set in \mathbb{R}^n with respect to a Gaussian measure, Nazarov, 2003.
[5] An Invariance Principle for Polytopes, Harsha, Klivans, Meka, 2010.

So Alice and Bob want to flip a coin… – STOC 2014 Recaps (Part 10)

A major use case for coin flipping (with actual coins) is when you’re with friends, and you have to decide where to eat. This agonizing decision process can be elegantly avoided when randomness is used. But who’s doing the coin flipping? How do you know your friend isn’t secretly choosing which restaurant to go to? Cryptography offers a solution to this, and Sunoo will tell us about how this solution is actually equivalent to one-way functions!

Sunoo Park on Coin Flipping of Any Constant Bias Implies One-Way Functions, by Itay Berman, Iftach Haitner, and Aris Tentes

In this post, we look at a fundamental problem that has plagued humankind since long before theoretical computer science: if I don’t trust you and you don’t trust me, how can we make a fair random choice? This problem was once solved in the old-fashioned way of flipping a coin, but theoretical computer science has made quite some advances since.

What are the implications of this? In their recent STOC paper, Berman, Haitner, and Tentes show that the ability for two parties to flip a (reasonably) fair coin means that one-way functions exist. This, in turn, has far-reaching cryptographic implications.

A function f is one-way if it is “easy” to compute f(x) given any input x, and it is “hard”, given the image y of a random input, to find a preimage x' such that f(x')=y. The existence of one-way functions implies a wide range of fundamental cryptographic primitives, including pseudorandom generators, pseudorandom functions, symmetric-key encryption, bit commitments, and digital signatures – and vice versa: the seminal work of Impagliazzo and Luby [1] showed that the existence of cryptography based on complexity-theoretic hardness assumptions – encompassing all of the aforementioned primitives – implies that one-way functions exist.

For coin-flipping protocols, however, only somewhat more restricted results were known. Coin-flipping has long been a subject of interest in cryptography, since the early work of Blum [2], which described the following problem:

“Alice and Bob want to flip a coin by telephone. (They have just divorced, live in different cities, want to decide who gets the car.)”

More formally, a coin-flipping protocol is a protocol in which two players interact by exchanging messages and, upon completion, output a single bit interpreted as the outcome of a coin flip. The aim is that the coin should be (close to) unbiased, even if one of the players “cheats” and tries to bias the outcome towards a certain value. We say that a protocol has bias \delta if no cheating player can make the outcome equal to their preferred value with probability greater than 1/2 + \delta; the protocol has constant bias if \delta is a constant strictly smaller than 1/2, independent of the security parameter.
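
To make the setting concrete, here is a toy sketch of a Blum-style protocol built from a commitment scheme. The hash-based commitment is a heuristic stand-in chosen purely for illustration (it is not what the paper assumes or constructs), and a real protocol must also specify what happens when a player aborts.

```python
import os, hashlib

def commit(bit):
    # Heuristic commitment: hash the bit together with fresh randomness.
    r = os.urandom(16)
    c = hashlib.sha256(bytes([bit]) + r).hexdigest()
    return c, r                          # c is sent now; (bit, r) opens it later

def check_opening(c, bit, r):
    return hashlib.sha256(bytes([bit]) + r).hexdigest() == c

# Alice commits to a random bit a, Bob replies with a random bit b,
# Alice opens her commitment, and the outcome of the flip is a XOR b.
a = os.urandom(1)[0] & 1
c, r = commit(a)                         # Alice -> Bob: c
b = os.urandom(1)[0] & 1                 # Bob -> Alice: b
assert check_opening(c, a, r)            # Alice -> Bob: (a, r); Bob verifies
print("outcome:", a ^ b)

# Note: a cheating Alice can still refuse to open after seeing b, so some
# convention for aborts is needed; handling this is part of why achieving
# small bias is subtle.
```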

Impagliazzo and Luby’s original paper showed that coin-flipping protocols achieving negligible bias (that is, they are very close to perfect!) imply one-way functions. Subsequently, Maji, Prabhakaran and Sahai [3] proved that coin-flipping protocols with a constant number of rounds (and any non-trivial bias, i.e. 1/2-o(1)) also imply one-way functions. Yet more recently, Haitner and Omri [4] showed that the same holds for any coin-flipping protocol with a constant bias (namely, a bias of \frac{\sqrt{2}-1}{2}-o(1)). Finally, Berman, Haitner and Tentes proved that coin-flipping of any constant bias implies one-way functions. The remainder of this post will give a flavor of the main ideas behind their proof.

The high-level structure of the argument is as follows: given any coin-flipping protocol \Pi=(\mathsf{A},\mathsf{B}) between players \mathsf{A} and \mathsf{B}, we first define a (sort of) one-way function, then show that an adversary capable of efficiently inverting that function must be able to achieve a significant bias in \Pi. The one-way function used is the transcript function, which maps the players’ random coins to a protocol transcript. The two main neat insights are these:

  • Consider the properties of a coin-flipping protocol when interpreted as a zero-sum game between two players: Alice wins if the outcome is 1, and Bob wins otherwise. If the players play optimally, who wins? It turns out that from the winner, we can deduce that there is a set of protocol transcripts where the outcome is bound to be the winning outcome, no matter what the losing player does: that is, transcripts that are “immune” to the losing player.
  • The second insight is a recursive variant of the biasing attack proposed by Haitner and Omri in [4]. The new attack can be employed by the adversary in order to generate a transcript that lies in the “immune” set with probability close to 1 – so, this adversary (who has access to an inverter for the transcript function) can bias the protocol outcome with high probability.

The analysis is rather involved, and there are some fiddly details to resolve, such as how a bounded adversary can simulate the recursive attack efficiently, and ensuring that the inverter will work for the particular query distribution of the adversary. Without going into all those details, the last part of this post describes the optimal recursive attack.

Let \mathsf{Inv} be the function that takes as input a pair (\mathbf{t},b), where \mathbf{t} is a random transcript of a partial (incomplete) honest execution of \Pi and b\in\{0,1\}, and outputs a random pair of coin flips (r_1,r_2) for the players, satisfying the following two conditions: (1) they are consistent with \mathbf{t}, that is, the coin flips could be plausible next coin flips given the partial transcript \mathbf{t}; and (2) there exists a continuation of the protocol after \mathbf{t} and the next coin flips (r_1,r_2) that leads to the protocol outcome b.

It seems that an effective way for the adversary to use \mathsf{Inv} might be to call \mathsf{Inv}(\mathbf{t},1) for each partial transcript \mathbf{t} at which the relevant player has to make a move, and then to behave according to the outputted coins. We call this strategy the biased-continuation attack, which is the crucial attack underlying the result of [4].

The new paper proposes a recursive biased-continuation attack that adds an additional layer of sophistication. Let \mathsf{A}^{(0)}=\mathsf{A} be the honest first player’s strategy. Now, define \mathsf{A}^{(i)} to be the attacker which, rather than sampling a random 1-continuation among all the possible honest continuations of the protocol (\mathsf{A},\mathsf{B}), instead samples a random 1-continuation among all continuations of (\mathsf{A}^{(i-1)},\mathsf{B}). Note that \mathsf{A}^{(1)} is the biased-continuation attacker described above! It turns out that as the number of recursions grows, the probability that the resulting transcript will land in the “immune” set approaches 1 – meaning a successful attack! Naïvely, this attack may require exponential time due to the many recursions required; however, the paper circumvents this by analyzing the probability that the transcript will land in a set which is “almost immune”, and finding that this probability approaches 1 significantly faster.
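
Here is a toy rendition of the recursive attack (my own sketch, not the paper's construction). The real attacker cannot compute these conditional probabilities on its own; it samples the biased continuations using the inverter for the transcript function. On a protocol small enough to enumerate, though, we can compute the attack exactly and watch the bias grow with the recursion depth.

```python
import functools

# Toy protocol: A and B alternate single bits for ROUNDS rounds each (A moves on
# even steps), and the coin is 1 iff strictly more than half of the bits are 1.
# Honest players send uniform bits; the protocol is not exactly fair even when
# played honestly, but it is enough to watch the attack at work.
ROUNDS = 3

def outcome(transcript):
    return 1 if sum(transcript) > ROUNDS else 0

@functools.lru_cache(maxsize=None)
def strategy(i, transcript):
    """Probability that attacker A^(i) sends the bit 1 at this partial transcript."""
    if i == 0:
        return 0.5                       # A^(0) is the honest player
    # A^(i) samples a continuation of (A^(i-1), honest B) conditioned on the
    # outcome being 1 and plays its next message, so the next bit b is chosen
    # with probability proportional to Pr[A^(i-1) sends b] * Pr[outcome 1 | b].
    p1 = strategy(i - 1, transcript)
    w1 = p1 * success(i - 1, transcript + (1,))
    w0 = (1 - p1) * success(i - 1, transcript + (0,))
    return p1 if w0 + w1 == 0 else w1 / (w0 + w1)

@functools.lru_cache(maxsize=None)
def success(i, transcript):
    """Pr[outcome = 1] from this transcript when A plays A^(i) and B is honest."""
    step = len(transcript)
    if step == 2 * ROUNDS:
        return outcome(transcript)
    if step % 2 == 0:                    # A's move
        p1 = strategy(i, transcript)
        return (1 - p1) * success(i, transcript + (0,)) + p1 * success(i, transcript + (1,))
    return 0.5 * (success(i, transcript + (0,)) + success(i, transcript + (1,)))

for i in range(5):
    print("A^(%d): Pr[outcome = 1] = %.4f" % (i, success(i, ())))
```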

[1] One-Way Functions Are Essential for Complexity Based Cryptography. Impagliazzo, Luby (FOCS 1989).
[2] Coin Flipping by Telephone: A Protocol for Solving Impossible Problems. Blum (CRYPTO 1981).
[3] On the Computational Complexity of Coin Flipping. Maji, Prabhakaran, Sahai (FOCS 2010).
[4] Coin Flipping with Constant Bias Implies One-Way Functions. Haitner, Omri (FOCS 2011).

Faster, I say! The race for the fastest SDD linear system solver – STOC 2014 Recaps (Part 9)

In the next post in our series of STOC 2014 recaps, Adrian Vladu tells us about some of the latest and greatest in Laplacian and SDD linear system solvers. There’s been a flurry of exciting results in this line of work, so we hope this gets you up to speed.

The Monday morning session was dominated by a nowadays popular topic, symmetric diagonally dominant (SDD) linear system solvers. Richard Peng started by presenting his work with Dan Spielman, the first parallel solver with near linear work and poly-logarithmic depth! This is exciting, since parallel algorithms are used for large scale problems in scientific computing, so this is a result with great practical applications.

The second talk was given by Jakub Pachocki and Shen Chen Xu from CMU; it presented the result of merging two papers. The first result is a new type of tree that can be used as a preconditioner. The second one is a more efficient solver, which together with these trees shaved one more \log n factor in the race for the fastest solver.

Before getting into more specific details, it might be a good idea to provide a bit of background on the vast literature of Laplacian solvers.

Typically, linear systems Ax = b are easier to solve whenever A has some structure on it. A particular class we care about are positive semidefinite (PSD) matrices. They work nicely because the solution is the minimizer of the quadratic form f(x) = \frac{1}{2} x^T A x - b^T x, which happens to be a convex function due to the PSD-ness of A. Hence we can use various versions of gradient descent, the convergence of which usually depends on the condition number of A.

A subset of PSD matrices are Laplacian matrices, which are nothing else but graph Laplacians; using an easy reduction, one can show that any SDD system can be reduced to a Laplacian system. Laplacians are great because they carry a lot of combinatorial structure. Instead of having to suffer through a lot of scary algebra, this is the place where we finally get to solve some fun problems on graphs. The algorithms we aim for have running time close to linear in the sparsity of the matrix.

One reason why graphs are nice is that we know how to approximate them with other simpler graphs. More specifically, when given a graph G, we care about finding a sparser graph H such that L_H \preceq L_G \preceq \alpha \cdot L_H (see footnote a), for some small \alpha (the smaller, the better). The point is that whenever you do gradient descent in order to minimize f(x), you can take large steps by solving a system in the sparser L_H. Of course, this requires another linear system solve, but one that only needs to be done on a sparser graph. Applying this idea recursively eventually yields efficient solvers. A lot of combinatorial work is spent on understanding how to compute these sparser graphs.

In their seminal work, Spielman and Teng used ultrasparsifiers (see footnote b) as their underlying combinatorial structure, and after many pages of work they obtained a near linear algorithm with a large polylogarithmic factor in the running time. Eventually, Koutis, Miller and Peng came up with a much cleaner construction, and showed how to construct a chain of sparser and sparser graphs, which yielded a solver that was actually practical. Subsequently, people spent a lot of time trying to shave log factors from the running time; see [9], [6], [7], [12] (the last of which was presented at this STOC), and the list will probably continue.

After this digression, we can get back to the conference program and discuss the results.

An Efficient Parallel Solver for SDD Linear Systems by Richard Peng and Daniel Spielman

How do we solve a system L_G x = b? We need to find a way to efficiently apply the operator L_G^+ to b. Even Laplacians are not easy to invert, and what’s worse, their pseudoinverses might not even be sparse. However, we can still represent L_G^+ as a product of sparse matrices which are easy to compute.

We can gain some inspiration from trying to numerically approximate the inverse of (1-x) for some small real x. Taking the Taylor expansion we get that (1-x)^{-1} = \sum_{i \geq 0} x^i = \prod_{k \geq 0} (1+x^{2^k}). Notice that in order to get \epsilon precision, we only need to take the product of the first \Theta(\log 1/\epsilon) factors. It would be great if we could approximate matrix inverses the same way. Actually, we can, since for matrices of norm less than 1 we have the identity (I-A)^{-1} = \prod_{k\geq 0}(I + A^{2^k}). At this point we’d be tempted to think that we’re almost done, since we can just write L_G = D - A, and try to invert I - D^{-1/2} A D^{-1/2}. However we would still need to compute matrix powers, and those matrices might again not even be sparse, so this approach needs more work.
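
A quick numerical sanity check of this expansion (an illustration only, not the paper's algorithm): truncating the product after a handful of factors already gives an excellent approximation when \|A\| is bounded away from 1.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((50, 50))
A = 0.4 * B / np.linalg.norm(B, 2)            # scale so that ||A||_2 < 1

approx = np.eye(50)
P = A.copy()
for _ in range(5):                             # 5 factors: error ~ ||A||^(2^5)
    approx = approx @ (np.eye(50) + P)
    P = P @ P                                  # P = A^(2^k)

exact = np.linalg.inv(np.eye(50) - A)
print(np.linalg.norm(approx - exact, 2))       # tiny
```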

Richard presented a variation of this idea that is more amenable to SDD matrices. He writes

(D-A)^{-1} = \frac{1}{2}( D^{-1} + (I+D^{-1}A)(D - AD^{-1}A)^{-1}(I+AD^{-1} ) )

The only hard part of applying this inverse operator (D-A)^{-1} to a vector consists of left multiplying by (D-AD^{-1}A)^{-1}. How to do this? One crucial ingredient is the fact that M = D-AD^{-1}A is also SDD! Therefore we can recurse, and solve a linear system in M. You might say that we won’t be able to do it efficiently, since M is not sparse. But with a little bit of work and the help of existing spectral sparsification algorithms M can be approximated with a sparse matrix.

Notice that at the k^{th} level of recursion, the operator we need to apply is \left(D\left(I-(D^{-1}A)^{2^k}\right)\right)^{-1}. A quick calculation shows that if the condition number of D-A is \kappa, then the largest eigenvalue (in absolute value) of D^{-1}A is roughly 1-1/\kappa. This means that after k = O(\log \kappa) iterations, the eigenvalues of (D^{-1}A)^{2^k} are close to 0, so we can just approximate the operator with D^{-1} without paying too much for the error.
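
Here is a dense, brute-force sketch of the recursion (my own illustration; there is no sparsification and no parallelism here, so it says nothing about efficiency, only about how the identity is applied).

```python
import numpy as np

def solve(D, A, b, depth):
    # (D - A)^{-1} b via the identity above; at the bottom of the recursion,
    # (D^{-1} A)^(2^k) is negligible, so D^{-1} alone is a good enough inverse.
    Dinv = np.diag(1.0 / np.diag(D))
    if depth == 0:
        return Dinv @ b
    n = len(b)
    inner = solve(D, A @ Dinv @ A, (np.eye(n) + A @ Dinv) @ b, depth - 1)
    return 0.5 * (Dinv @ b + (np.eye(n) + Dinv @ A) @ inner)

# A small strictly diagonally dominant (hence nonsingular) tridiagonal system M x = b.
n = 8
M = 3 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
D = np.diag(np.diag(M))
A = D - M
b = np.random.default_rng(1).standard_normal(n)

x = solve(D, A, b, depth=6)
print(np.linalg.norm(M @ x - b))               # tiny residual
```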

There are a few details left out. Sparsifying M requires a bit of understanding of its underlying structure. Also, in order to do this in parallel, the authors originally employed the spectral sparsification algorithm of Spielman and Teng, combined with a local clustering algorithm of Orecchia, Sachdeva and Vishnoi. Blackboxing these two sophisticated results might call into question the practicality of the algorithm. Fortunately, Koutis recently produced a simple self-contained spectral sparsification algorithm, which parallelizes and can replace all the heavy machinery in Richard and Dan’s paper.

Solving SDD Linear Systems in Nearly m \log^{1/2} n Time by Michael B. Cohen, Rasmus Kyng, Gary L. Miller, Jakub W. Pachocki, Richard Peng, Anup B. Rao, and Shen Chen Xu

Jakub Pachocki and Shen Chen Xu talked about two results, which together yield the fastest SDD system solver to date. The race is still on!

Let me go through a bit more background. Earlier on I mentioned that graph preconditioners are used to take long steps while doing gradient descent. A dual of gradient descent on the quadratic function is the Richardson iteration. This is yet another iterative method, which refines a coarse approximation to the solution of a linear system. Let L_G be the Laplacian of our given graph, and L_H be the Laplacian of its preconditioner. Let us assume that we have access to the inverse of L_H. The Richardson iteration computes a sequence x_0, x_1, x_2, \dots, which converges to the solution of the system. It starts with a weak estimate for the solution, and iteratively attempts to decrease the norm of the residual b - L_G x_k by updating the current solution with a coarse approximation to the solution of the system L_G z = b - L_G x_k. That coarse approximation is computed using L_H^{-1}. Therefore, the steps are given by

x_{k+1} = x_k + \alpha \cdot L_H^{-1} ( b - L_G x_k )

where \alpha \in (0,1] is a parameter that adjusts the length of the step. The better L_H approximates L_G, the fewer steps we need to make.
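
Here is a minimal sketch of the iteration itself; the preconditioner below is just the diagonal, standing in for the tree-based preconditioners that these papers actually construct.

```python
import numpy as np

n = 8
M = 3 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # the matrix playing the role of L_G
H = np.diag(np.diag(M))                                  # a crude stand-in for L_H
Hinv = np.diag(1.0 / np.diag(H))

b = np.random.default_rng(3).standard_normal(n)
x = np.zeros(n)
for _ in range(50):
    x = x + 1.0 * Hinv @ (b - M @ x)     # x_{k+1} = x_k + alpha * L_H^{-1}(b - L_G x_k)
print(np.linalg.norm(M @ x - b))         # the residual shrinks geometrically
```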

The problem that Jakub and Shen talked about was finding these good preconditioners. The way they do it is by looking more closely at the Richardson iteration, and weakening the requirements. Instead of having the preconditioner approximate L_G spectrally, they only impose some moment bounds. I will not describe them here, but feel free to read the paper. Proving that these moment bounds can be satisfied using a sparser preconditioner than those that have been so far used in the literature constitutes the technical core of the paper.

Just like in the past literature, these preconditioners are obtained by starting with a good tree, and sampling extra edges. Traditionally, people used low stretch spanning trees. The issue with them is that the number of edges in the preconditioner is determined by the average stretch of the tree, and we can easily check that for the square grid this is \Omega(\log n). Unfortunately, in general we don’t know how to achieve this bound yet; the best known result is off by a \log \log n factor. It turns out that we can still get preconditioners by looking at a different quantity, called the \ell_p stretch (0 < p < 1), which can be brought down to O(\log^p n). This essentially eliminates the need for computing optimal low stretch spanning trees. Furthermore, these trees can be computed really fast, O(m \log \log n) time in the RAM model, and the algorithm parallelizes.

This result consists of a careful combination of existing algorithms on low stretch embeddings of graphs into tree metrics and low stretch spanning trees. I will talk more about these embedding results in a future post.

a. \preceq is also known as the Löwner partial order. A \preceq B is equivalent to 0 \preceq B-A, which says that B-A is PSD.
b. A (k, h)-ultrasparsifier H of G is a graph with n-1+k edges such that L_H \preceq L_G \preceq h \cdot L_H. It turns out that one is able to efficiently construct (k, n/k\ \mathrm{polylog}(n)) ultrasparsifiers. So by adding a few edges to a spanning tree, you can drastically reduce the relative condition number with respect to the initial graph.

[1] Approaching optimality for solving SDD systems, Koutis, Miller, Peng (2010).
[2] Nearly-Linear Time Algorithms for Graph Partitioning, Graph Sparsification, and Solving Linear Systems, Spielman, Teng (2004).
[3] A Local Clustering Algorithm for Massive Graphs and its Application to Nearly-Linear Time Graph Partitioning, Spielman, Teng (2013).
[4] Spectral Sparsification of Graphs, Spielman, Teng (2011).
[5] Nearly-Linear Time Algorithms for Preconditioning and Solving Symmetric, Diagonally Dominant Linear Systems, Spielman, Teng (2014).
[6] A Simple, Combinatorial Algorithm for Solving SDD Systems in Nearly-Linear Time, Kelner, Orecchia, Sidford, Zhu (2013).
[7] Efficient Accelerated Coordinate Descent Methods and Faster Algorithms for Solving Linear Systems, Lee, Sidford (2013).
[8] A nearly-mlogn time solver for SDD linear systems, Koutis, Miller, Peng (2011).
[9] Improved Spectral Sparsification and Numerical Algorithms for SDD Matrices, Koutis, Levin, Peng (2012).
[10] Simple parallel and distributed algorithms for spectral graph sparsification, Koutis (2014).
[11] Preconditioning in Expectation, Cohen, Kyng, Pachocki, Peng, Rao (2014).
[12] Stretching Stretch, Cohen, Miller, Pachocki, Peng, Xu (2014).
[13] Solving SDD Linear Systems in Nearly mlog1/2n Time, Cohen, Kyng, Miller, Pachocki, Peng, Rao, Xu (2014).
[14] Approximating the Exponential, the Lanczos Method and an \tilde{O}(m)-Time Spectral Algorithm for Balanced Separator, Orecchia, Sachdeva, Vishnoi (2012).

An Encore: More Learning and Testing – STOC 2014 Recaps (Part 8)

Thought you were rid of us? Not quite: in a last hurrah, Clément and I come back with a final pair of distribution estimation recaps — this time on results from the actual conference!

Gautam Kamath on Efficient Density Estimation via Piecewise Polynomial Approximation by Siu-On Chan, Ilias Diakonikolas, Rocco A. Servedio, and Xiaorui Sun

Density estimation is the question on everyone’s mind. It’s as simple as it gets – we receive samples from a distribution and want to figure out what the distribution looks like. The problem rears its head in almost every setting you can imagine — fields as diverse as medicine, advertising, and compiler design, to name a few. Given its ubiquity, it’s embarrassing to admit that we didn’t have a provably good algorithm for this problem until just now.

Let’s get more precise. We’ll deal with the total variation distance metric (AKA statistical distance). Given distributions with PDFs f and g, their total variation distance is d_{\mathrm{TV}}(f,g) = \frac12\|f - g\|_1. Less formally but more intuitively, it upper bounds the difference in probabilities for any given event. With this metric in place, we can define what it means to learn a distribution: given sample access to a distribution X, we would like to output a distribution \hat X such that d_{\mathrm{TV}}(f_X,f_{\hat X}) \leq \varepsilon.

This paper presents an algorithm for learning t-piecewise degree-d polynomials. Wow, that’s a mouthful — what does it mean? A t-piecewise degree-d polynomial is a function where the domain can be partitioned into t intervals, such that the function is a degree-d polynomial on each of these intervals. The main result says that a distribution with a PDF described by a t-piecewise degree-d polynomial can be learned to accuracy \varepsilon using O((d+1)t/\varepsilon^2) samples and polynomial time. Moreover, the sample complexity is optimal up to logarithmic factors.

[Figure: a 4-piecewise degree-3 polynomial. Lifted from Ilias’ slides.]

Now this is great and all, but what good are piecewise polynomials? How many realistic distributions are described by something like “-x^2 + 1 for x \in [0,1) but \frac12x - 1 for x \in [1,2) and 2x^4 - x^2 +…”? The answer turns out to be a ton of distributions — as long as you squint at them hard enough.

The wonderful thing about this result is that it’s semi-agnostic. Many algorithms in the literature are God-fearing subroutines, and will sacrifice their first-born child to make sure they receive samples from the class of distributions they’re promised — otherwise, you can’t make any guarantees about the quality of their output. But our friend here is a bit more skeptical. He deals with a funny class of distributions, and knows true piecewise polynomial distributions are few and far between — if you get one on the streets, who knows if it’s pure? Our friend is resourceful: no matter the quality, he makes it work.

Let’s elaborate, in slightly less blasphemous terms. Suppose you’re given sample access to a distribution \mathcal{D} which is at total variation distance \leq \tau from some t-piecewise degree-d polynomial (you don’t need to know which one). Then the algorithm will output a (2t-1)-piecewise degree-d polynomial which is at distance \leq 4\tau + O(\varepsilon) from \mathcal{D}. In English: even if the algorithm isn’t given a piecewise polynomial, it’ll still produce something that’s (almost) as good as you could hope for.

With this insight under our cap, let’s ask again — where do we see piecewise polynomials? They’re everywhere: this algorithm can handle distributions which are log-concave, bounded monotone, Gaussian, t-modal, monotone hazard rate, and Poisson Binomial. And the kicker is that it can handle mixtures of these distributions too. Usually, algorithms fail catastrophically when considering mixtures, but this algorithm keeps chugging and handles them all — and near optimally, most of the time.

The analysis is tricky, but I’ll try to give a taste of some of the techniques. One of the key tools is the Vapnik-Chervonenkis (VC) inequality. Without getting into the details, the punchline is that if we output a piecewise polynomial which is “close” to the empirical distribution (under a weaker metric than total variation distance), it’ll give us our desired learning result. In this setting, “close” means (roughly) that the CDFs don’t stray too far from each other (though in a sense that is stronger than the Kolmogorov distance metric).

Let’s start with an easy case – what if the distribution is a 1-piecewise polynomial? By the VC inequality, we just have to match the empirical CDF. We can do this by setting up a linear program which outputs a linear combination of the Chebyshev polynomials, constrained to resemble the empirical distribution.

It turns out that this subroutine is the hardest part of the algorithm. In order to deal with multiple pieces, we first discretize the support into small intervals which are roughly equal in probability mass. Next, in order to discover a good partition of these intervals, we run a dynamic program. This program uses the subroutine from the previous paragraph to compute the best polynomial approximation over each contiguous set of the intervals. Then, it stitches the solutions together in the minimum cost way, with the constraint that it uses fewer than 2t - 1 pieces.
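
Here is a heavily simplified sketch of the stitching step (my own, not the paper's algorithm). The real subroutine fits each candidate interval with a linear program against the empirical CDF under a VC-type metric; the stand-in below fits a least-squares polynomial to a histogram, which is enough to show how the dynamic program chooses the breakpoints.

```python
import numpy as np

def fit_cost(hist, lo, hi, degree):
    # Cost of approximating hist[lo:hi] by a single low-degree polynomial
    # (least squares here, in place of the paper's LP-based fit).
    xs = np.arange(lo, hi, dtype=float)
    xs -= xs.mean()                                      # centring, for conditioning
    coeffs = np.polyfit(xs, hist[lo:hi], min(degree, hi - lo - 1))
    return np.abs(np.polyval(coeffs, xs) - hist[lo:hi]).sum()

def stitch(hist, pieces, degree):
    # Dynamic program over breakpoints: cheapest way to cover the support with
    # at most `pieces` contiguous intervals, each fit by one polynomial.
    n = len(hist)
    cost = {(p, 0): 0.0 for p in range(pieces + 1)}
    for p in range(1, pieces + 1):
        for j in range(1, n + 1):
            cost[(p, j)] = min(cost[(p - 1, i)] + fit_cost(hist, i, j, degree)
                               for i in range(j) if (p - 1, i) in cost)
    return cost[(pieces, n)]

# Demo: a histogram of samples from a mixture of two Gaussians.
rng = np.random.default_rng(4)
samples = np.concatenate([rng.normal(-2, 1, 5000), rng.normal(3, 0.5, 5000)])
hist, _ = np.histogram(samples, bins=40, range=(-6, 6), density=True)
print("best total fitting error:", stitch(hist, pieces=7, degree=3))
```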

In short, this result essentially closes the problem of density estimation for an enormous class of distributions — it turns existential approximations (by piecewise polynomials) into approximation algorithms. But there’s still a lot more work to do — while this result gives us improper learning, we yearn for proper learning algorithms. For example, this algorithm lets us approximate a mixture of Gaussians using a piecewise polynomial, but can we output a mixture of Gaussians as our hypothesis instead? Looking at the sample complexity, the answer is yes, but we don’t know of any computationally efficient way to solve this problem yet. Regardless, there are many exciting directions to go — I’m looking forward to where the authors will take this line of work!


Clément Canonne on L_p-Testing, by Piotr Berman, Sofya Raskhodnikova, and Grigory Yaroslavtsev [1]

Almost every — if not all — work in property testing of functions is concerned with the Hamming distance between functions, that is, the fraction of inputs on which they disagree. Very natural when we deal, for instance, with Boolean functions f\colon\{0,1\}^d\to\{0,1\}, this distance becomes highly questionable when the codomain is, say, the real line: sure, f\colon x\mapsto x^2-\frac{\sin^2 x}{2^{32}} and g\colon x\mapsto x^2 technically disagree on almost every single input, but should they be considered two completely different functions?

This question, Grigory answered in the negative; he presented (joint work with Piotr Berman and Sofya Raskhodnikova [2]) a new framework for testing real-valued functions f\colon X^d\to[0,1], less sensitive to these sorts of annoying “technicalities” (i.e., noise). Instead of the usual Hamming/L_0 distance between functions, they suggest the more robust L_p (p\geq 1) distance

\mathrm{d}_p(f,g) = \frac{ \left(\int_{X^d} \lvert f(x)-g(x) \rvert^p dx\right)^{\frac{1}{p}} }{ \left(\int_{X^d} \mathbf{1} dx \right)^{\frac{1}{p}} } = \mathbb{E}_{x\sim\mathcal{U}(X^d)}\left[\lvert f(x)-g(x) \rvert^p\right]^{\frac{1}{p}}\in [0,1]

(think of X^d as being the hypercube \{0,1\}^d or the hypergrid [n]^d, and p being 1 or 2. In this case, the denominator is just a normalizing factor \lvert X^d\rvert or \sqrt{\lvert X^d\rvert}.)
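
To see how stark the difference is, here is the example from above made concrete on a grid (a quick illustration, not from the paper): the two functions are at Hamming distance essentially 1, yet their L_1 distance is about 2^{-32}.

```python
import numpy as np

n = 10**5
x = np.linspace(0, 1, n)
f = x**2 - np.sin(x)**2 / 2**32
g = x**2

print("Hamming distance:", np.mean(f != g))           # essentially 1
print("L_1 distance:    ", np.mean(np.abs(f - g)))    # about 2^-32
```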

Now, erm… why?

  • because it is much more robust to noise in the data;
  • because it is much more robust to outliers;
  • because it plays well (as a preprocessing step for model selection) with existing variants of PAC-learning under L_p norms;
  • because L_1 and L_2 are pervasive in (machine) learning;
  • because they can.

Their results and methods turn out to be very elegant: to outline only a few, they

  • give the first example of a monotonicity testing problem (de facto, for the L_1 distance) where adaptivity provably helps; that is, a testing algorithm that selects its future queries as a function of the answers it previously got can outperform any tester that commits in advance to all its queries (whether adaptivity helps has been a longstanding question for testing monotonicity with respect to the Hamming distance);
  • improve several general results for property testing, also applicable to Hamming testing (e.g. Levin’s investment strategy [3]);
  • provide general relations between sample complexity of testing (and tolerant testing) for various norms (L_0, L_1,L_2,L_p);
  • have quite nice and beautiful algorithms (e.g., testing via partial learning) for testing monotonicity and Lipschitz property;
  • give close-to-tight bounds for the problems they consider;
  • have slides in which the phrase “Big Data” and a mention of stock markets appear (!);
  • have an incredibly neat reduction between L_1 and Hamming testing of monotonicity.

I will hereafter only focus on the last of these bullets, one which really tickled my fancy (gosh, my fancy is so ticklish) — for the other ones, I urge you to read the paper. It is a cool paper. Really.

Here is the last bullet, in a slightly more formal fashion — recall that a function f defined on a partially ordered set is monotone if for all comparable inputs x,y such that x\preceq y, one has f(x)\leq f(y); and that a one-sided tester is an algorithm which will never reject a “good” function: it can only err on “bad” functions (that is, it may sometimes accept, with small probability, a function far from monotone, but will never reject a monotone function).

Suppose one has a one-sided, non-adaptive tester T for monotonicity of Boolean functions f\colon X\to \{0,1\} with respect to Hamming distance, with query complexity q. Then the very same T is also a tester for monotonicity of real-valued functions f\colon X\to [0,1] with respect to L_1 distance.

Almost too good to be true: we can recycle testers! How? The idea is to express our real-valued f as some “mixture” of Boolean functions, and use T as if we were accessing these. More precisely, let f\colon [n]^d \to [0,1] be a function which one intends to test for monotonicity. For all thresholds t\in[0,1], the authors define the Boolean function f_t by

f_t(x) = \begin{cases}  1 & \text{ if } f(x) \geq t\\  0 & \text{ if } f(x) < t  \end{cases}

All these f_t are Boolean; and one can verify that for all x, f(x)=\int_0^1 f_t(x) dt. Here comes the twist: one can also show that the distance of f to monotone satisfies

\mathrm{d}_1(f,\mathcal{M}) = \int_0^1 \mathrm{d}_0(f_t,\mathcal{M}) dt

i.e. the L_1 distance of f to monotone is the integral of the Hamming distances of the f_t‘s to monotone. And by a very simple averaging argument, if f is far from monotone, then at least one of the f_t‘s must be…
How does that help? Well, take your favorite Boolean, Hamming one-sided (non-adaptive) tester for monotonicity, T: being one-sided, it can only reject a function if it has some “evidence” it is not monotone — indeed, if it sees some violation: i.e., a pair x,y with x\prec y but f(x) > f(y).

Feed this tester, instead of the Boolean function it expected, our real-valued f; as one of the f_{t^\ast}‘s is far from monotone, our tester would reject f_{t^\ast}; so it would find a violation of monotonicity by f_{t^\ast} if it were given access to f_{t^\ast}. But being non-adaptive, the tester does exactly the same queries on f as it would have done on this f_{t^\ast}! And it is not difficult to see that a violation for f_{t^\ast} is still a violation for f: so the tester finds a proof that f is not monotone, and rejects. \square
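
Here is a small brute-force check of the key identity on the line [n] (my own sketch, not the paper's algorithm): the L_1 distance to monotone is computed by an isotonic-regression dynamic program, and the integral over thresholds reduces to a finite sum because f_t only changes when t crosses a value of f.

```python
import random

def dist0_to_monotone(bits):
    # Min fraction of entries to change so that a 0/1 sequence is non-decreasing:
    # the result must look like 0...01...1, so try every split point.
    n = len(bits)
    ones_prefix = [0]
    for b in bits:
        ones_prefix.append(ones_prefix[-1] + b)
    total_ones = ones_prefix[-1]
    best = min(ones_prefix[k] + ((n - k) - (total_ones - ones_prefix[k]))
               for k in range(n + 1))
    return best / n

def dist1_to_monotone(f):
    # L_1 isotonic regression by DP; an optimal monotone g may be assumed to take
    # its values among the values of f.
    n, vals = len(f), sorted(set(f))
    prev = [abs(f[0] - v) for v in vals]
    for i in range(1, n):
        best_so_far, cur = float("inf"), []
        for j, v in enumerate(vals):
            best_so_far = min(best_so_far, prev[j])
            cur.append(best_so_far + abs(f[i] - v))
        prev = cur
    return min(prev) / n

random.seed(0)
f = [random.random() for _ in range(12)]

# f_t is constant for t between consecutive values of f, so the integral over t
# of the Hamming distances is a finite sum.
integral, prev_v = 0.0, 0.0
for v in sorted(set(f)):
    integral += (v - prev_v) * dist0_to_monotone([1 if fx >= v else 0 for fx in f])
    prev_v = v

print(dist1_to_monotone(f), integral)   # the two quantities coincide
```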


— Clément.

Final, small remark: one may notice a similarity between L_1 testing of functions f\colon[n]\to[0,1] and the “usual” testing (with relation to total variation distance, \propto L_1) of distributions D\colon [n]\to[0,1]. There is actually a quite important difference, as in the latter the distance is not normalized by n (because distributions have to sum to 1 anyway). In this sense, there is no direct relation between the two, and the work presented here is indeed novel in every respect.

Edit: thanks to Sofya Raskhodnikova for spotting an imprecision in the original review.

[1] Slides available here:
[3] See e.g. Appendix A.2 in “On Multiple Input Problems in Property Testing”, Oded Goldreich. 2013.

Efficient Distribution Estimation 4: Greek Afternoon – STOC 2014 Recaps (Part 7)

This post is the finale on our series on the Efficient Distribution Estimation workshop. See also, part 1, part 2, and part 3.

After one last caffeination, we gathered once more for the exciting finale to the day — we were in for a very Greek afternoon, with back-to-back tutorials by Ilias Diakonikolas and Costis Daskalakis. As a wise man once said a few blog posts ago, “if you can’t beat them… change what ‘them’ means”. That was the running theme of this session – they showed us how to exploit the structure of a distribution to design more efficient algorithms.

Gautam Kamath on Beyond Histograms: Exploiting Structure in Distribution Estimation

Ilias focused on exploiting the structure of the distribution – in particular, on k-flat distributions. A k-flat distribution can be described by k intervals, over which the probability mass function is constant – it literally looks like k flat pieces. Warm up: what if we knew where each piece started and ended? Then the learning problem would be easy: by grouping samples that fall into each interval together, we reduce the support size from n to k. Now we can use the same algorithm as for the general case – only this time, it takes O(k/\varepsilon^2) samples.
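
Here is a sketch of this warm-up in code (an illustration, not from the tutorial): with known intervals we just estimate k empirical frequencies and spread each interval's mass uniformly inside it.

```python
import numpy as np

def learn_known_intervals(samples, boundaries, n):
    # boundaries: sorted right endpoints of the k intervals over {0, ..., n-1}
    est = np.zeros(n)
    left = 0
    for right in boundaries:
        mass = np.mean((samples >= left) & (samples <= right))
        est[left:right + 1] = mass / (right - left + 1)    # flat on the interval
        left = right + 1
    return est

# Demo on a 3-flat distribution over {0, ..., 29}.
rng = np.random.default_rng(5)
true_p = np.concatenate([np.full(10, 0.05), np.full(10, 0.03), np.full(10, 0.02)])
samples = rng.choice(30, size=5000, p=true_p)
est = learn_known_intervals(samples, boundaries=[9, 19, 29], n=30)
print("TV distance:", 0.5 * np.abs(est - true_p).sum())
```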

But we’re rarely lucky enough to know the correct intervals. What should we do? Guess the intervals? Too expensive. Guess the intervals approximately? A bit better, but still pricey. Make the problem a bit easier and allow ourselves to use k/\varepsilon intervals, instead of only k? This actually works — while lesser mortals would be satisfied with this solution and call it a day, Ilias refused to leave us with anything less than the grand prize. Can we efficiently output a k-flat distribution using only O(k/\varepsilon^2) samples?

Yes, we can, using some tools from Vapnik-Chervonenkis (VC) theory. The VC inequality is a useful tool which allows us to relate the empirical distribution and the true distribution, albeit under a weaker metric than the total variation distance. Skipping some technical details, the key idea is that we want to output a k-flat distribution that is close to the empirical distribution under this weaker metric. Using the triangle inequality, specific properties of this metric, and the fact that we’re comparing two k-flat distributions, this will give us a k-flat distribution which is close to the target in total variation distance, as desired. Computing a k-flat distribution that’s close to the empirical one isn’t actually too tough – a careful dynamic program works.

We can learn k-flat distributions – so what? This class might strike you as rather narrow, but this result leads to algorithms for a variety of classes of distributions, including monotone, t-modal, monotone hazard rate, and log-concave distributions. These classes are all close to k-flat, and this algorithm is fine with that. In this sense, this tool captures all these classes at the same time — One algorithm to rule them all, so to speak. This algorithm even directly generalizes to mixtures of these distributions, which is huge — studying mixtures usually makes the problem much more difficult.

Alright, but what’s the catch? Not all distributions are that close to k-flat. For example, this algorithm requires \tilde O(1/\varepsilon^3) samples to learn a log-concave distribution, even though the optimal sample-complexity is O(1/\varepsilon^{5/2}). It turns out that log-concave distributions are close to k-linear, rather than k-flat, and we must use a k-linear approximation if we want a near-optimal sample complexity.

“Please sir, I want some more.”

You can have it, Oliver — it’s actually possible to learn distributions which are close to mixtures of piecewise polynomials! But this topic is juicy enough that it deserves its own blog post.

Open problems

  • The perennial question – what can we do in high dimensions?
  • Most of these results are fundamentally improper – they approximate a distribution with a distribution which may be from a different class. Can these techniques lead to computationally efficient proper learning algorithms, where we output a hypothesis from the same class? We already know sample-efficient algorithms, the trouble is the running time.

Gautam Kamath on Beyond Berry Esseen: Structure and Learning of Sums of Random Variables

Finally, we had Costis, who talked about sums of random variables. We would like to understand a class of distributions \mathcal{D} in three (highly related) ways:

  • Structure: What “simpler” class of distributions approximates \mathcal{D}?
  • Covering: Can we generate a small set of distributions S, such that for any distribution D \in \mathcal{D}, we have a distribution D^* \in S such that D^* is \varepsilon-close to D?
  • Learning: Given sample access to a distribution D \in \mathcal{D}, can we output a distribution D^* such that D^* is \varepsilon-close to D?

Understanding the structure often implies covering results, since we can usually enumerate the simpler class of distributions. And covering implies learning, since generic results allow us to find a “good” distribution from S at the cost of O(\log |S|) samples. It’s not that easy though, since either the details are not obvious, or we’d like to go beyond these generic results.

Consider Poisson Binomial Distributions*, or PBDs for short. A PBD is the sum X = \sum_i X_i of n independent Bernoulli random variables, where X_i is \mathrm{Bernoulli}(p_i). This is like a Binomial distribution on steroids: the Binomial distribution is the special case when all p_i are equal.

PBDs pop up everywhere in math and computer science, so there’s a plethora of classical structural results – with a few catches. For example, check out a result by Le Cam, which approximates a PBD by a Poisson distribution with the same mean:

d_{TV}\left(\sum_i X_i, \mathrm{Poisson}\left(\sum_i p_i\right)\right) \leq \sum_i p_i^2

Catch one: this structural result describes only a small fraction of PBDs – it gives a good approximation when the p_i values are really small. Catch two: we’re approximating a PBD with a Poisson, a distribution from a different family – we’d ideally like to have a proper approximation, that is, approximate the PBD with another PBD.
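
Here is a quick numerical check of Le Cam's bound (an illustration; the p_i are deliberately chosen small, which is exactly the regime where the Poisson approximation is good).

```python
import numpy as np
from math import exp

rng = np.random.default_rng(6)
p = rng.uniform(0, 0.05, size=200)            # 200 Bernoullis with small means

# PBD pmf, obtained by convolving in the Bernoulli(p_i)'s one at a time.
pbd = np.array([1.0])
for pi in p:
    nxt = np.zeros(len(pbd) + 1)
    nxt[:-1] += (1 - pi) * pbd
    nxt[1:] += pi * pbd
    pbd = nxt

# Poisson(lambda) pmf on {0, ..., n}, computed iteratively to avoid overflow.
lam = p.sum()
pois = np.zeros(len(pbd))
pois[0] = exp(-lam)
for k in range(1, len(pbd)):
    pois[k] = pois[k - 1] * lam / k

tv = 0.5 * np.abs(pbd - pois).sum()           # ignores the negligible Poisson tail beyond n
print("TV =", tv, "<= sum p_i^2 =", (p ** 2).sum())
```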

Recent results from our community show that you can get around both of these issues at once (woo, chalk one up for the computer scientists!). Structurally, we know the following: any PBD is \varepsilon-close to either a Binomial distribution or a shifted PBD with the parameter n replaced by O(1/\varepsilon^3) — a much smaller number of “coin flips”. While the original distribution had n different p_i parameters, the approximating distribution has only O(1/\varepsilon^3) parameters. Since n is usually way bigger than 1/\varepsilon, this is a huge reduction in complexity. By carefully enumerating the distributions in this structural result, we can get a small cover. Specifically, the size of the cover is polynomial in n and quasi-polynomial in 1/\varepsilon, while the naive cover is exponential in n. Now, using the generic reduction mentioned before, we can learn the distribution with \mathrm{polylog}(n) samples. Again though, let’s pretend that n is so big that it makes André the Giant feel insecure – can we take it out of the picture? If you’ve been paying any attention to my rhetorical style, you might predict the answer is yes (and you’d be right): O(1/\varepsilon^2) samples suffice for learning.

Is that the end of the story? No, we need to go deeper — let’s generalize PBDs once more: Sums of Independent Integer Random Variables (SIIRVs)**. A k-SIIRV is the sum X = \sum_i X_i of n independent random variables (k-IRVs), where each X_i \in \{1, \dots, k\}. Note that when k = 2, this is essentially equivalent to a PBD. PBDs are actually quite tame, and have many nice properties – for example, they’re unimodal and log-concave. However, even a 3-SIIRV is far from being k-modal or log-concave. After some careful analysis and modular arithmetic, it turns out that a k-SIIRV with sufficiently large variance is approximated by cZ + Y, where c is some integer \in \{1, \dots, k-1\}, Z is a discretized Gaussian, Y is a c-IRV, and Y and Z are independent. On the other hand, if the variance is not large, the distribution has a sparse support – it can be approximated by a (k/\varepsilon)-IRV. This structural observation leads to a learning result with sample complexity which is polynomial in k and 1/\varepsilon, but once again independent of n.

[Figure: a particularly sinister 3-SIIRV, far from being 3-modal or log-concave. Image taken from Costis’ slides.]

Open problems

  • Current results for PBDs are quasi-polynomial in 1/\varepsilon, for both the cover size and the running time for the proper learning result. Can we make this polynomial?
  • For k-SIIRVs, our current understanding is fundamentally improper. Can we obtain proper covers or learning algorithms for this class?
  • What about higher dimensions? We have some preliminary results on the multidimensional generalization of PBDs (joint work with Costis and Christos Tzamos), but we’re totally in the dark when it comes to generalizing k-SIIRVs.


*For maximum confusion, Poisson Binomial Distributions have nothing to do with Poisson Distributions, except that they were both named after our good friend Siméon.
**For another perspective on the SIIRV result, see this post by my partner in crime, Clément.


Spaghetti code crypto – STOC 2014 Recaps (Part 6)

We return to our STOC recap programming with Justin Holmgren, who will tell us about an exciting new result in program obfuscation. (However, he doesn’t tell us how to say “indistinguishability obfuscation” 10 times fast).

Justin Holmgren on How to Use Indistinguishability Obfuscation: Deniable Encryption, and More, by Amit Sahai and Brent Waters

One very recent topic of study in cryptography is program obfuscation – that is, how to make a program maximally unintelligible, while still preserving the functionality. We will represent programs as circuits, which for most purposes is equivalent to any other representation. At some level, obfuscation of a program seems like it should be an easy task – any programmer knows that it’s easy to end up with unreadable spaghetti code, but theoretical treatment of the issue was lacking, and finding a construction with provable security properties has proven quite difficult.

In 2001, Barak et al. [1] precisely formulated black-box obfuscation, which says that if you can compute something given the obfuscated program, you could also compute it given black-box access to the program. This is a natural and very strong notion of obfuscation. If we could obfuscate programs according to this definition, we could easily solve many difficult problems in cryptography. For example, a public-key fully homomorphic encryption scheme could be constructed from an ordinary symmetric-key encryption scheme by obfuscating a program which decrypts two input ciphertexts, and outputs the encryption of their NAND. Unfortunately, [1] also showed that black-box obfuscation is impossible to achieve.

One weaker definition of obfuscation that [1] did not rule out is called indistinguishability obfuscation (iO). It requires that if two circuits C_1 and C_2 are functionally equivalent (for all x, C_1(x) = C_2(x)), then their obfuscations are indistinguishable. In 2013, Garg et al. [2] gave the first candidate construction for iO. This construction is not proven secure under standard assumptions (e.g. the hardness of factoring), but no attacks on the construction are currently known, and it has been proven that certain types of generic attacks cannot possibly work.

It isn’t immediately clear whether iO is useful. After all, what information is the obfuscation really hiding if it only guarantees indistinguishability from functionally equivalent programs? “How to use Indistinguishability Obfuscation: Deniable Encryption and More”, by Amit Sahai and Brent Waters [3], shows several applications of iO to building cryptographic primitives, and proposes iO as a “hub of cryptography”, meaning that constructions which use iO as a building block would be able to benefit from improvements in the efficiency or underlying computational assumptions of iO constructions.

As a first example of how to use iO, they show how to construct an IND-CPA public key cryptosystem based on one-way functions and iO. This uses a few other basic and well-studied cryptographic primitives (it is known that all of them exist if one-way functions exist). First, a pseudorandom generator is a deterministic function which takes as input a random string s, and computes a longer string r which is indistinguishable from truly random. The length of r can be polynomially larger than the length of s, but we will let r be twice the length of s. A pseudorandom function family is parameterized by a secret key k and an input x, and produces outputs of some fixed length. Oracle access to f_k(\cdot) for some random unknown k is computationally indistinguishable from oracle access to a truly random function. The main special property that [2] require is a puncturable pseudorandom function family. That is, for a secret key k and an input x, a “punctured” key k\{x\} can be computed. k\{x\} allows one to compute f_k(y) for all y \neq x, but essentially reveals no information about f_k(x).

The power of iO comes from combining its indistinguishability guarantees with those of other cryptographic primitives, in what is known as the hybrid method. We will start with the security game played with a real scheme, and give a sequence of other games, each indistinguishable from the previous one to the adversary, such that the adversary obviously has no advantage in the last game. This will prove that the scheme is secure.

For example, a simple public key encryption scheme for one-bit messages (let G be a PRG, and F be a puncturable pseudorandom function family) would be the following; a toy code sketch follows the list.

  • The secret key SK is just a puncturable PRF key k.
  • The public key PK is the obfuscation of an Encrypt program. Encrypt takes a message m and randomness r, and outputs (G(r), F_k(G(r)) \oplus m).
  • To decrypt a ciphertext (c_1, c_2) given SK, one just computes F_k(c_1)  \oplus c_2.
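Putting the three bullets together, a toy sketch of the whole scheme might look as follows. This is purely illustrative: Obfuscate is a placeholder that returns the program unchanged (real iO is, of course, the whole difficulty), and the hash-based G and F are stand-ins for a PRG and a puncturable PRF.

```python
# Toy sketch of the one-bit public key scheme; illustration only, not secure.
import hashlib, os

SEED_LEN = 16                                  # |s|; the PRG output has length 2|s|

def G(r):                                      # stand-in "PRG"
    return hashlib.sha256(b"prg" + r).digest()[:2 * SEED_LEN]

def F(k, x):                                   # stand-in "PRF" with one-bit output
    return hashlib.sha256(b"prf" + k + x).digest()[0] & 1

def Obfuscate(program):                        # placeholder: real iO is the hard part
    return program

def key_gen():
    k = os.urandom(SEED_LEN)                   # SK: a (puncturable) PRF key
    def encrypt(m, r):                         # the program whose obfuscation is PK
        tag = G(r)
        return (tag, F(k, tag) ^ m)
    return k, Obfuscate(encrypt)

def decrypt(k, ciphertext):
    c1, c2 = ciphertext
    return F(k, c1) ^ c2

sk, pk_encrypt = key_gen()
ct = pk_encrypt(1, os.urandom(SEED_LEN))
assert decrypt(sk, ct) == 1
```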

In the IND-CPA security game, there is a challenger and an adversary. The challenger generates the public and secret key, and gives the public key to the adversary. The challenger also generates an encryption of a random bit b, and gives this to the adversary. The adversary wins if he can guess the bit that the challenger chose.

In the first hybrid, the challenger encrypts by choosing a truly random \tilde{r}, and generating the ciphertext (\tilde{r}, F_k(\tilde{r}) \oplus b). This hybrid is indistinguishable from the real game by the security of the pseudorandom generator G.

In the next hybrid, the program Encrypt which is obfuscated and given to the adversary uses a punctured key k\{\tilde{r}\} instead of the full PRF key k. This is with high probability functionally equivalent, because \tilde{r} is most likely not in the range of G (a uniformly random string of length 2|s| lies in the image of G with probability at most 2^{-|s|}). So by the security of iO, this hybrid is indistinguishable from the previous one.

Next, the ciphertext is generated by choosing a truly random bit p, and generating the ciphertext (\tilde{r}, p \oplus b). This is also indistinguishable to the adversary, because the punctured key reveals nothing about F_k(\tilde{r}). That is, if these two hybrids were distinguishable, then given only k\{\tilde{r}\}, one could distinguish between a truly random bit p and the true value F_k(\tilde{r}).

In this hybrid, it is clear (by analogy to a one-time pad) that no adversary can distinguish a ciphertext encrypting 0 from a ciphertext encrypting 1. So the original scheme is also secure.

Sahai and Waters achieve more than just public-key encryption – they also solve the long-standing open problem of publicly deniable encryption. This is an encryption system for which, if Encrypt(m ; r) = c, one can also compute a “fake” randomness r' for any other message m' such that Encrypt(m'; r') is also equal to c, and r' is indistinguishable from r. The motivation is that after a message is encrypted, coercing the encrypter into revealing his message is futile – there is no way of verifying whether what he says is the truth. Note that of course the decrypter could be coerced to reveal his secret key, but there’s nothing that can be done about that.

The basic idea of the deniable encryption scheme is that the encryption public key is again an obfuscated program – this program takes a message and randomness string. If the randomness string has some rare special structure (it is a hidden sparse trigger), it is interpreted as encoding a particular ciphertext, and the program will output that ciphertext. Otherwise, the program encrypts normally. There is also a public obfuscated Explain program, which allows anybody to sample a random hidden sparse trigger, but not test whether or not a string is a hidden sparse trigger.

At present, iO is known to imply some very powerful primitives, including for example multi-input functional encryption. It is still open to construct fully homomorphic encryption from iO, or even collision-resistant hash functions from iO, or to show that no such reduction is possible.

[1] On the (Im)possibility of Obfuscating Programs, Barak, Goldreich, Impagliazzo, Rudich, Sahai, Vadhan, Yang, 2001.
[2] Candidate Indistinguishability Obfuscation, Garg, Gentry, Halevi, Raykova, Sahai, Waters, 2013.
[3] How to Use Indistinguishability Obfuscation: Deniable Encryption, and More, Sahai, Waters, 2014.

Efficient Distribution Estimation 3: Rocco’s Modern Life – STOC 2014 Recaps (Part 5)

The road twists, as Clément takes us down a slightly different path in the penultimate post of our Efficient Distribution Estimation workshop summary series. See also: Part 1 and Part 2.

Clément Canonne on A complexity theoretic perspective on unsupervised learning

And now, for something completely different… Rocco Servedio presented work (joint with Anindya De and Ilias Diakonikolas [1]) that does not deal with distribution testing. Showing there is a world outside distribution testing, and that this world is filled with wonder, probabilities and Boolean functions. A world of inverse problems.

I will only try to give the main idea, maybe the main ideas, of the problem they consider, why it is hard, and how they overcame the difficulties. To start, fix an unknown Boolean function f\colon\{0,1\}^n\to\{0,1\} in some class \mathcal{C} — think of the class of linear threshold functions, for instance (LTFs, a.k.a. weighted majorities, or halfspaces). f is unknown, but it brings with it the set of its satisfying inputs, I\stackrel{\rm{}def}{=} f^{-1}(\{1\}); and you can get uniform samples from this I. You press a button, you get x\in I.

Can you do this a few times, and then output a new device with its own button, that will generate on demand i.i.d. samples from some distribution D close (in total variation distance, as usual) to the uniform distribution on I? In other words, can you approximate the uniform distribution on I, by observing i.i.d. samples from I (but without knowing f)?

Of course, one way to do this would be to use a learning algorithm for \mathcal{C} (but using only what we get: only positive examples): learn an approximate version \hat{f} of f, and then study it “offline” to find out exactly what \hat{f}^{-1}(\{1\}) is, before spitting out uniform samples from it. Clearly, if \hat{f} is close enough to f, this will work!

But does it always? Recall that here, we want to be close to uniform on I, that is, we only consider the error relative to the positive examples of the function; while the general results for (PAC-)learning of functions under the uniform distribution only care about the overall error. This means \hat{f} may need to be very close to f for this approach to succeed… As an example, take f to be \mathrm{AND}_n, i.e. 0 everywhere but on the single point x=1^n. The function \hat{f}= \lnot\mathrm{OR}_n is a very good approximation: it has error only 1/2^{n-1}; and yet, this would lead to a distribution supported on the single point 0^n, which is thus 1-far from the actual distribution (supported on 1^n)!
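For concreteness, here is a throwaway numerical check of this example for a small n (my own snippet, nothing from the paper): the additive error of \hat{f} is tiny, yet the two conditional distributions have disjoint supports.

```python
# AND_n vs. not-OR_n: tiny PAC error, maximal error on the inverse problem.
from itertools import product

n = 10
points = list(product([0, 1], repeat=n))

f    = lambda x: int(all(x))       # AND_n: 1 only on the all-ones point
fhat = lambda x: int(not any(x))   # not-OR_n: 1 only on the all-zeros point

# Additive (PAC-style) error under the uniform distribution: 1/2^(n-1).
pac_error = sum(f(x) != fhat(x) for x in points) / len(points)
print(pac_error)

# But the supports f^{-1}(1) and fhat^{-1}(1) are disjoint, so the uniform
# distributions over them are at total variation distance 1.
support_f    = {x for x in points if f(x)}
support_fhat = {x for x in points if fhat(x)}
print(len(support_f), len(support_fhat), len(support_f & support_fhat))
```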

Intuitively, the problem here is that PAC-learning gives a sort of additive guarantee:

\mathbb{P}\{\hat{f}(x) \neq f(x)\} \leq \varepsilon

while we need some sort of multiplicative one:

\mathbb{P}\{\hat{f}(x) \neq f(x)\} \leq \varepsilon\cdot \mathbb{P}\{f(x) = 1\}

which is much harder to obtain when f has very few satisfying assignments.

This being said, the discussion above shows that things look simpler in the dense case, i.e., when f has “many” satisfying inputs (an inverse polynomial fraction of the hypercube). In this case, the idea is to combine two steps — and show they can be performed:
(1) SQ learning: get a hypothesis \hat{f}. Upshot: one has to show that it’s possible to use an SQ learner, even from positive examples only.
(2) Hypothesis testing: test if \hat{f} is good enough (if not, try again),
before then using the now explicitly known “good” \hat{f} to approximate the set of satisfying inputs of f.

But then, once this “simple” (erm) case is dealt with, how to handle the general case, where f could have an exponentially small fraction of satisfying inputs? Rocco explained one of their main ingredients — what they called a “densifier”, i.e. an algorithm that will start from positive examples from f and output a “densified function” g which agrees with f on most of f’s satisfying inputs, and does not have many more (so that f^{-1}(\{1\}) is “dense” in g^{-1}(\{1\})). Provided such a densifier exists (and can be efficiently simulated) for a class \mathcal{C}, they show how to combine it with (1) and (2) above to iteratively “densify” the sparse set f^{-1}(\{1\}) and reduce to the “easy”, dense case.

Now that all the bricks are in place, the next step in order to play Lego is, of course, to show that such densifiers do exist… for interesting classes, and in particular classes which also have SQ learning algorithms (otherwise, brick (1) is still missing!). In the paper, the authors describe how to build such a densifier for (for instance) the class of LTFs. Combining this with the ingredients sketched above, this gives them a polynomial time algorithm for learning the uniform distribution over satisfying assignments to an unknown LTF. As another instantiation of their approach, the authors also give a densifier for DNF formulas and a resulting quasi-polynomial time distribution learning algorithm.

Two remarks here:

  • maybe surprisingly, the hardness of this problem stems from the Boolean setting — indeed, the same question, dealing with functions f\colon\mathbb{R}^n\to\{0,1\} on the Gaussian space, becomes way easier;
  • the related “direct” problem (on input the full specification of f, generate close-to-uniform samples from f^{-1}(\{1\})) is neither harder nor simpler: the authors provide examples where (under some suitable well-believed assumption) the former is easy and the latter hard, and vice-versa. In particular, if f^{-1}(\{1\}) is a finite cyclic group, getting even one positive example allows the algorithm to easily recover the whole of f^{-1}(\{1\}) “for free”.

[1] Inverse Problems in Approximate Uniform Generation, De, Diakonikolas, Servedio, 2012.

Efficient Distribution Estimation 2: The Valiant Sequel – STOC 2014 Recaps (Part 4)

This is a continuation of the Efficient Distribution Estimation STOC workshop summary series by G and Clément (see Part 1 here).

Gautam Kamath on Distributional Property Testing and Estimation: Past, Present, and Future

After lunch, we were lucky enough to witness back-to-back talks by the Valiant brothers, Greg and Paul. If you aren’t familiar with their work, first of all, how have you been dealing with all of your everyday property testing needs? But second of all, over the past few years, they’ve touched a number of classic statistical problems and left most of them with matching optimal upper and lower bounds.

Greg kicked off the show and focused on property estimation. Given a property of interest and access to independent draws from a fixed distribution \mathcal{D} over \{1, \dots, n\}, how many draws are necessary to estimate the property accurately?

Of course, this is an incredibly broad problem. For example, properties of a distribution which could be of interest include the mean, the L^{3} norm, or the probability that the samples will spell out tomorrow’s winning lottery numbers. In order to tame this mighty set of problems, we focus on a class of properties which are slightly easier to understand — symmetric properties. These are properties which are invariant to permuting the labels on the support — for instance, if we decided to say 3 was now called 12 and vice-versa, the property wouldn’t change. Notable examples of symmetric properties include the support size*, entropy, and distance between two distributions. Still a broad class of properties, but surprisingly, we can make statements about all of them at once.

What’s the secret? For symmetric properties, it turns out that the fingerprint of a set of samples contains all the relevant information. The fingerprint of a set of samples X is a vector \mathcal{F}_X whose ith component is the number of elements in the domain which occur exactly i times in X. The CliffNotes version of the rest of this post is that using linear programming on the fingerprint gives us an unreasonable amount of power**.
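To make this concrete, here is a tiny snippet (my own, purely illustrative) computing the fingerprint of a set of samples; since the properties we care about are symmetric, an estimator may as well take this vector as its only input.

```python
# Fingerprint of a multiset of samples: F[i] = number of distinct domain
# elements that appear exactly i times among the samples.
from collections import Counter

def fingerprint(samples):
    counts = Counter(samples)            # element -> number of occurrences
    return Counter(counts.values())      # multiplicity -> number of such elements

samples = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]
print(fingerprint(samples))              # Counter({1: 4, 2: 2, 3: 1})
```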

First off, Greg talked about a surprising dichotomy which demonstrates the power of linear estimators. A linear estimator of a property is an estimator which outputs a linear combination of the fingerprint. Given a property \pi, a number of samples k, and a distance \varepsilon, exactly one of the following holds:

  • There exist distributions y_1 and y_2 such that |\pi(y_1) - \pi(y_2)| > \varepsilon but no estimator (linear or not) that uses only k samples can distinguish the two with probability \geq 51\%.
  • There exists a linear estimator which requires k(1 + o(1)) samples and estimates the property to within \varepsilon(1 + o(1)) with probability \geq 99\%.

In other words, if there exists an estimator for a property, there is a linear estimator which is almost as good. Even better, their result is constructive, for both cases! A linear program is used to construct the estimator, where the variables are the estimator coefficients. Roughly, this linear program attempts to minimize the bias of the estimator while keeping the variance relatively small. On the other hand, when we take the dual of this linear program, we construct two distributions which have similar fingerprint expectations, but differ in their value for the property — we construct an explicit counter-example to any estimator which claims to use k samples and estimate the property to accuracy \varepsilon.

But here’s the $64 question – do these estimators work in practice? Sadly, the answer is no, which actually isn’t that surprising here. The estimator is defined in terms of the worst-case instance for each property — in other words, it’s oblivious to the particular distribution it receives samples from, which can be wasteful in many cases.

Let’s approach the problem again. What if we took the fingerprint of our samples and computed the property for the empirical distribution? This actually works great – the empirical distribution optimally estimates the portion of the distribution which we have seen. The downside is that we have to see the entire distribution, which takes \Omega(n) samples. If we want to break past this linear barrier, we must estimate the unseen portion. It turns out that, once again, linear programming saves the day. We solve a program which describes all distributions whose expected fingerprint is close to the observed fingerprint. It can be shown that the diameter of this set is small, meaning that any distribution in this space will approximate the target distribution for all symmetric properties at the same time! Furthermore, this estimator takes only O(n/\log n) samples, which turns out to be optimal.
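To illustrate the naive half of this paragraph (a sketch of my own, not code from the papers): the plug-in estimator just evaluates the property, say the entropy, on the empirical distribution, which is itself a function of the fingerprint. The linear program that corrects for the unseen portion of the distribution is the actual contribution, and is not attempted here.

```python
# Plug-in (empirical) entropy estimate computed from the fingerprint of m samples.
# This only accounts for the *seen* part of the distribution; fixing up the
# unseen part is where the linear-programming approach earns its keep.
import math

def plugin_entropy(fingerprint, m):
    """fingerprint: dict mapping i -> number of elements seen exactly i times."""
    h = 0.0
    for i, num_elements in fingerprint.items():
        p = i / m                        # empirical probability of such an element
        h -= num_elements * p * math.log2(p)
    return h

print(plugin_entropy({1: 4, 2: 2, 3: 1}, 11))   # entropy of the empirical distribution
```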

Again, we ask the same question – do these estimators work in practice? This time, the answer is yes! While performance plots are often conspicuously missing from theory papers, Greg was happy to compare their results to estimators used in practice — their approach outperformed the benchmarks in almost all instances and regimes.

Finally, Greg mentioned a very recent result on instance-by-instance optimal identity testing. Recall that in identity testing, given a known distribution P and an unknown distribution Q, we wish to test if they are equal or \varepsilon-far (in total variation distance). As mentioned before, when P is uniform, \Theta(\sqrt{n}/\varepsilon^2) samples are necessary and sufficient for this problem. But what about when P is \mathrm{Bin}(n,\frac12), or when P assigns a probability to i which is proportional to the ith digit of \pi — do we have to design an algorithm from scratch for every P?

Thankfully, the answer is no. The Valiants provide an algorithm which is optimal up to constant factors for all but the most pathological instances of P. Somewhat unexpectedly, the optimal number of samples turns out to be \Theta(\|P\|_{2/3}/\varepsilon^2) – aside from the fact that it matches the uniform case, it’s not obvious why the 2/3-norm is the “correct” norm here. Their estimator is reasonably simple – it involves a small tweak to Pearson’s classical \chi^2 test.
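To unpack the bound a little (my own snippet, not the Valiants’ tester): \|P\|_{2/3} is the 2/3-quasinorm of the vector of probabilities, and for the uniform distribution over [n] it evaluates to \sqrt{n}, recovering the bound mentioned above.

```python
# The 2/3-quasinorm appearing in the instance-optimal sample complexity:
# ||P||_{2/3} = (sum_i p_i^(2/3))^(3/2).
def norm_two_thirds(P):
    return sum(p ** (2.0 / 3.0) for p in P) ** 1.5

n = 10000
uniform = [1.0 / n] * n
print(norm_two_thirds(uniform))   # sqrt(n) = 100 here (up to floating point error)
```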


Clément Canonne on Lower Bound Techniques for Statistical Estimation Tasks

At this point, Greg tagged out, allowing Paul to give us a primer in “Lower Bound Techniques for Statistical Estimation Tasks”: roughly speaking, the other side of Greg’s coin. Indeed, all the techniques used to derive upper bounds (general, systematic testers for symmetric properties) can also be turned into their evil counterparts: ways to show that no algorithm testing a given property can be “too efficient”.

The dish is very different, yet the ingredients similar: without going into too much detail, let’s sketch the recipe. In the following, keep in mind that we will mostly cook against symmetric properties, that is, properties \mathcal{P} invariant under relabeling of the domain elements: if D\in\mathcal{P}, then for any permutation \pi of [n] one has D\circ\pi\in\mathcal{P}.

  • elements do not matter: all the information any tester can use can be found in the fingerprint of the samples it obtained. In other words, we can completely forget the actual samples for the sake of our lower bound, and only prove things about their fingerprint, fingerprint of fingerprint, and so on.
  • Poissonization is our friend: if we take m samples from a distribution, then the fingerprint, a tuple of integer random variables X_1,\dots, X_m, lacks a very nice feature: the X_i‘s are not independent (for instance, the weighted sum \sum_i i\cdot X_i has to equal m). But if instead of this we were to take m^\prime\sim\mathrm{Poisson}(m) samples, then the number of occurrences of each domain element becomes an independent \mathrm{Poisson} random variable, which makes the fingerprint much nicer to analyze (a small simulation of this appears right after this list). Even better, many nice properties of the Poisson distribution will apply: the random variable m^\prime will be tightly concentrated around its mean m, for instance. (This applies whether the property is symmetric or not. When in doubt, Poissonize. Independence!)
  • Central Limit Theorems: the title ain’t witty here, but the technique is. At a very high level, imagine we get two random variables T, S corresponding to the “transcript” of what a tester/algorithm sees from taking m samples from the two possible distributions it should distinguish. We would like to argue that the distributions of T and S are so close (in a well-defined sense) that distinguishing them is information-theoretically impossible. This can be very difficult; a general idea is to prove that there exists another distribution G, depending on very few parameters — think of a Gaussian: only the mean and (co)variance (matrix) — such that both S and T are close to G. Then, by the triangle inequality, they must be close to each other, and we are done. Sweeping a lot of details under the rug, what Paul presented is this very elegant idea of proving new Central Limit Theorems that give exactly that: theorems stating that asymptotically (with quantitative bounds on the convergence rate), some class of distributions of interest will “behave like” a multivariate Gaussian or Multinomial with the “right parameters” — where “behave like” refers to a convenient metric on distributions (e.g. Earthmover or Total Variation), and “right parameters” means “as few as possible” (in order to keep as many degrees of freedom as possible when designing the distributions of S and T). Proving such CLTs is no easy task, but — on the bright side — they have many applications, even outside the field of distribution testing.
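As promised above, here is a quick numerical illustration of the Poissonization trick (a throwaway simulation of my own, nothing from the papers): with exactly m samples the per-element counts are negatively correlated, whereas with \mathrm{Poisson}(m) samples they are independent.

```python
# Poissonization: with Poisson(m) samples, the count of each domain element j is
# an independent Poisson(m * p_j) variable; with exactly m samples it is not.
import numpy as np

rng = np.random.default_rng(0)
n, m, trials = 5, 100, 20000
p = np.full(n, 1.0 / n)

fixed_counts   = np.array([rng.multinomial(m, p) for _ in range(trials)])
poisson_counts = np.array([rng.poisson(m * p)    for _ in range(trials)])

# Correlation between the counts of elements 0 and 1 across trials.
print(np.corrcoef(fixed_counts.T)[0, 1])     # noticeably negative (about -0.25 here)
print(np.corrcoef(poisson_counts.T)[0, 1])   # approximately 0
```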

There would be much, much more to cover here, but I’m past 600 words already, and was only allowed \mathrm{Poisson}(500).


*In order to avoid pathological cases for estimating support size, we assume none of the probabilities lie in the range (0,\frac{1}{n}).

**At least, when channeled by sufficiently talented theoreticians.

[1] Estimating the unseen: A sublinear-sample canonical estimator of distributions, Valiant, Valiant (2010).
[2] A CLT and tight lower bounds for estimating entropy, Valiant, Valiant (2010).
[3] Estimating the Unseen: An n/log(n)-sample Estimator for Entropy and Support Size, Shown Optimal via New CLTs, Valiant, Valiant (2011).
[4] Estimating the Unseen: Improved Estimators for Entropy and other Properties, Valiant, Valiant (2013).
[5] An Automatic Inequality Prover and Instance Optimal Identity Testing, Valiant, Valiant (2014).
[6] Testing Symmetric Properties of Distributions, Valiant (2011).
See also, Greg’s TCS+ talk, which covers [1,2,3,4].

Algebraic circuit lower bounds bonanza – STOC 2014 Recaps (Part 3)

Pritish Kamath has kindly agreed to survey the exciting flurry of algebraic complexity theory results in STOC, which, to many of us, is in a permanent state of mystery. But Pritish did some valiant depth-reduction on the state of affairs in the area, and I hope you’ll appreciate his determined effort.

Pritish Kamath on the latest and greatest in algebraic circuit lower bounds

The most natural and intuitive way to compute a polynomial is via an arithmetic circuit. In this model the inputs are variables x_1, x_2, \cdots, x_n and the computation is performed using the operations + and \times (where we allow the + gates to take arbitrary linear combinations of their inputs). The output of the circuit is a formal polynomial over the base field \mathbb{F} (as opposed to the functions computed by boolean circuits).

The size of an arithmetic circuit is defined as the number of + and \times gates in the circuit, and is typically measured with respect to the number of variables. Proving super-polynomial lower bounds on the size of arithmetic circuits computing any explicit polynomial is one of the most important open problems in the area of algebraic complexity theory. The best known lower bound as of today is a modest \Omega(n \log n) due to Baur-Strassen [BS83] from 30 years ago!

Depth-reduction results

Since proving lower bounds for general arithmetic circuits is such a daunting task, it is natural to focus on proving lower bounds for restricted models. One such restriction is to constant-depth circuits. Although this seems like a severe restriction, a surprising result due to Agrawal-Vinay [AV08], subsequently strengthened by Koiran [Koi12] and Tavenas [Tav13], showed that if a polynomial is computed by a poly-sized circuit, then it is in fact computed by a not-so-large depth-4 circuit. In particular, they showed the following theorem,

Theorem.[AV08, Koi12, Tav13]
Let p_n(\mathbf{x}) \in \mathbb{F}[\mathbf{x}] be a polynomial over n-variables, of degree d = n^{O(1)}. If there is a \mathrm{poly}(n) sized arithmetic circuit computing p_n(\mathbf{x}), then there is a hom-\Sigma\Pi\Sigma\Pi[\sqrt{d}]-circuit of size 2^{O(\sqrt{d} \log n)} computing p_n.

hom-\Sigma\Pi\Sigma\Pi[\sqrt{d}] denotes homogeneous depth-4 circuits with fan-in of bottom product gates bounded by \sqrt{d}. The contrapositive of the above depth-reduction result says that proving a 2^{\omega(\sqrt{d} \log n)} lower bound on the size of hom-\Sigma\Pi\Sigma\Pi[\sqrt{d}]-circuits computing p_n(\mathbf{x}), suffices to prove that p_n(\mathbf{x}) cannot be computed by n^{O(1)}-sized arithmetic circuits!

Results from STOC 2014, before and after…

Proving lower bounds for hom-\Sigma\Pi\Sigma\Pi[\sqrt{d}]-circuits seems like a conceptually much simpler task than proving lower bounds for general circuits. Using the new technique of shifted partial derivatives, Gupta et al. [GKKS13] showed that hom-\Sigma\Pi\Sigma\Pi[\sqrt{d}]-circuits computing the d \times d determinant (\mathrm{Det}_d) require size 2^{\Omega(\sqrt{d})}, which is only an \omega(\log n) factor away (in the exponent) from the depth-reduction result!

Thus, after the above result was established, there were two possible approaches for proving super-polynomial lower bounds for general circuits,

  1. Improve the lower bound on hom-\Sigma\Pi\Sigma\Pi[\sqrt{d}]-circuits
  2. Improve the depth-reduction result from general circuits to hom-\Sigma\Pi\Sigma\Pi[\sqrt{d}]-circuits

Post-[GKKS13], there was a lot of excitement, and a flurry of activity followed, proving stronger lower bounds (in fact the improved depth reduction of Tavenas [Tav13] came after [GKKS13]). The magnitude of this activity is well illustrated by the fact that there was a sequence of four papers in STOC 2014, each showing significant improvements over the previous ones, all produced in a span of just 6 months or so!

[KSS14] A super-polynomial lower bound for regular arithmetic formulas

In this paper, Kayal-Saha-Saptharishi prove a lower bound of 2^{\Omega(\sqrt{d} \log n)} on the size of hom-\Sigma\Pi\Sigma\Pi[\sqrt{d}]-circuits computing an explicit polynomial (neatly constructed using Nisan-Wigderson designs), denoted by \mathrm{NW}_{n,d}. After this result, we are only an \omega(1) factor away (in the exponent) from our goal!

[FLMS14] Lower bounds for depth 4 formulas computing iterated matrix multiplication

In this paper, Fournier-Limaye-Malod-Srinivasan prove a lower bound of 2^{\Omega(\sqrt{d} \log n)} on the size of hom-\Sigma\Pi\Sigma\Pi[\sqrt{d}]-circuits computing the iterated matrix product of d generic n \times n matrices, denoted by \mathrm{IMM}_{n,d}. This result improved upon [KSS14] in the sense that \mathrm{IMM}_{n,d} is in \mathrm{VP}, whereas \mathrm{NW}_{n,d} is only known to be in \mathrm{VNP}. This shows that the depth-reduction of [Tav13] is in fact tight, and hence rules out approach (2) stated above!

[KS14a] The Limits of Depth Reduction for Arithmetic Formulas: It’s all about the top fan-in

Taking it one step further than [FLMS14], Kumar-Saraf prove a 2^{\Omega(\sqrt{d} \log n)} lower bound for an explicit polynomial that can be computed by \mathrm{poly}(n)-sized formulas, thus showing that the depth-reduction of Tavenas is in fact tight even for formulas (formulas are weaker models than circuits, in that the fan-out of any gate in a formula is at most one; thus, formulas are representable by trees, whereas circuits are represented by directed acyclic graphs).

Another question that had been tantalizingly open for a long time was to prove super-polynomial lower bounds on the size of general hom-\Sigma\Pi\Sigma\Pi-circuits (that is, homogeneous depth-4 circuits without the bottom fan-in restriction). In this direction, Kumar-Saraf considered the problem of proving lower bounds on the size of hom-\Sigma\Pi\Sigma\Pi-circuits with bounded top fan-in, and in this paper, they also prove a super-polynomial lower bound on the size of such circuits when the top fan-in is o(\log n).

[KLSS14] Super-polynomial lower bounds for depth-4 homogeneous arithmetic formulas

And finally, Kayal-Limaye-Saha-Srinivasan prove a super-polynomial lower bound of n^{\Omega(\log n)} on the size of general hom-\Sigma\Pi\Sigma\Pi-circuits computing \mathrm{IMM}_{n,d}. Interestingly, Kumar-Saraf in a different paper [KS13] had independently proved a weaker lower bound of n^{\Omega(\log \log n)} on the same circuit model computing a Nisan-Wigderson type polynomial (similar to the one in [KSS14]).

Subsequent results

Now that super-polynomial lower bounds for hom-\Sigma\Pi\Sigma\Pi-circuits had been shown, it remained open to prove exponential lower bounds (such as 2^{\Omega(\sqrt{d} \log n)}) for such circuits. Since the STOC deadline in November 2013, more improvements have come out!

Kayal-Limaye-Saha-Srinivasan [KLSS14] have proved a lower bound of 2^{\Omega(\sqrt{d} \log n)} on the size of hom-\Sigma\Pi\Sigma\Pi-circuits computing an explicit Nisan-Wigderson type polynomial (similar to the ones in [KSS14] and [KS13]).

Completing this picture, Kumar-Saraf [KS14b] have proved a similar lower bound of 2^{\Omega(\sqrt{d} \log n)} on the size of hom-\Sigma\Pi\Sigma\Pi-circuits computing \mathrm{IMM}_{n,d}, which shows that the depth-reduction of [Tav13] is tight even if the reduction was to general \Sigma\Pi\Sigma\Pi-circuits (without bottom fan-in condition).

Perhaps we will see these papers in FOCS 2014. ;)

Bird’s eye view of the techniques

To prove lower bounds on any class \mathcal{C} of circuits, a natural idea (pun intended!) used in most lower bound techniques has been to introduce a measure \Gamma : \mathbb{F}[\mathbf{x}] \rightarrow \mathbb{Z}_+ such that \Gamma(p) is large for a certain polynomial p, but \Gamma(C) is small for any small-sized C \in \mathcal{C}, and hence no small-sized circuit in \mathcal{C} can compute p.

The main technique used in all the papers above is a set of variants of the shifted partial derivatives measure introduced in [Kay12] and [GKKS13], itself inspired by the partial derivatives measure introduced by Nisan-Wigderson [NW95]. The measure is defined as follows,

\Gamma_{k, \ell}(f) = \mathrm{dim}(\mathbf{x}^{\le \ell} \partial^{=k} f)

Essentially, this measure looks at the dimension of the sub-space spanned by the polynomials obtained by multiplying monomials of degree at most \ell with all k-th order partial derivatives of f (where we think of polynomials as vectors of coefficients).
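As a toy illustration (my own example, not one from the papers): take f = x_1x_2 over two variables, with k = \ell = 1. The first-order partial derivatives of f span \{x_1, x_2\}, and shifting them by the monomials of degree at most 1, namely \{1, x_1, x_2\}, yields the span of \{x_1, x_2, x_1^2, x_2^2, x_1x_2\}, so \Gamma_{1,1}(f) = 5. The lower bound proofs show that the target polynomial has a much larger measure than any polynomial computed by a small circuit of the restricted form, which is exactly the template described above.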

The other new ideas used in the papers described above are random restrictions (reminiscent of their use in the context of Håstad’s switching lemma), and novel variants of shifted partial derivatives obtained by projecting onto a smaller space of polynomials. In the interest of the length of this post, we won’t go into those details here.

Efficient Distribution Estimation – STOC 2014 Recaps (Part 2)

The STOC 2014 saga continues (check out Part 1 here), with Clément Canonne and Gautam “G” Kamath covering the day-long workshop on Efficient Distribution Estimation. There’s a wealth of interesting stuff from the workshop, so we’ve broken it up into episodes. Enjoy this one, and stay tuned for the next installment!

Clément Canonne on Taming distributions over big domains

Talk slides here

Before STOC officially began, before the talks and the awe and the New Yorker Hotel and all the neat and cool results the reviews on this blog will try to give an idea and a sense of, before even this very long and cumbersome and almost impossible — almost, only? to understand sentence, before all this came Saturday.

Three simultaneous sessions, all three very tempting: one covering “Recent Advances on the Lovasz Local Lemma”, one giving tools and insight on how to “[overcome] the intractability bottleneck in unsupervised machine learning”, and finally one — yay! — on “Efficient Distribution Estimation“. I will hereby focus on the third one, organized by Ilias Diakonikolas and Gregory Valiant, as it is the one I attended, and in particular 3 of the 6 talks of the workshop; G (Gautam) will review the other ones, I bet with shorter sentences and more relevance.

It all started with Ronitt Rubinfeld, from MIT and Tel Aviv University, describing the flavour of problems this line of work deals with, what exactly the goals are, and why one should care. Imagine we have samples from a space of data — a Big space of Data; whether it be records from car sales, or the past ten years of results for the New Jersey lottery, or even logs of network traffic; and we want to know something about the underlying distribution: has the trend in cars changed recently? Is the lottery fixed? Is someone making unusual requests to our server?

How can we tell?

Let us rephrase the problem slightly: there exists a probability distribution D on a set of n elements [n]=\{1,\dots,n\}; given i.i.d. (independent, identically distributed) samples from D, can we decide with as few of these samples as possible whether, for instance,

  • D is uniform, or “far”* from it?
  • D is equal to a known, fully specified distribution (say, the distribution of car sales from 10 years ago), or far from it?
  • D has large entropy, or not?

Of course, there exist many tools in statistics for this type of task: the chi-squared test, for instance. Yet, they usually have a fatal flaw: namely, to give good guarantees they require a number of samples at least linear in n. And n is big. Like, huge. (And anyway, one could always actually learn D to high accuracy with O(n) samples.) No, the question is really: can one do anything interesting with o(n) samples?

As Ronitt explained: yes, one can. Testing uniformity can be done with (only!) O(\sqrt{n}) samples**; one can test identity to a known distribution with O(\sqrt{n}) as well; and testing whether two unknown distributions are equal or far from each other only requires O(n^{2/3}) samples from each. All these results (see e.g. [1] for a survey of this huge body of research) are quite recent; and come from works in TCS spanning the last 15 years or so.
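To give a flavour of how such sublinear testers work, here is a rough sketch of the classic collision-based approach to uniformity testing (an illustration of my own; the constants and the precise sample complexity, which also depends on the distance parameter, are deliberately glossed over, and this is not the exact algorithm of any one paper below). The idea: the collision probability of D is \|D\|_2^2, which equals 1/n exactly when D is uniform and is noticeably larger when D is far from uniform.

```python
# Collision-based uniformity tester (rough sketch). If D is eps-far from uniform
# in total variation distance, then ||D||_2^2 >= (1 + 4*eps^2)/n, so comparing the
# empirical collision rate to a threshold between 1/n and (1 + 4*eps^2)/n works.
from collections import Counter
import random

def test_uniformity(sample, n, eps):
    m = len(sample)
    counts = Counter(sample)
    collisions = sum(c * (c - 1) // 2 for c in counts.values())
    collision_rate = collisions / (m * (m - 1) / 2)
    return "accept" if collision_rate <= (1 + 2 * eps ** 2) / n else "reject"

n, eps = 10000, 0.25
m = int(10 * n ** 0.5 / eps ** 2)                 # O(sqrt(n)) samples, constants fudged
uniform_sample = [random.randrange(n) for _ in range(m)]
skewed_sample  = [random.randrange(n // 2) for _ in range(m)]    # far from uniform
print(test_uniformity(uniform_sample, n, eps))    # "accept" with high probability
print(test_uniformity(skewed_sample, n, eps))     # "reject" with high probability
```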

“And they saw they could test distributions in o(n) samples; and they saw that this was good, and they also saw that most of these results were tight”. But as always, the same question arose: can we do better? This may sound strange: if the sample complexities are tight, how could we even hope to do better? If there is a lower bound, how can we even ask to beat it?

This was the second part of Ronitt’s talk: bypassing impossibility results. She presented two different approaches, focusing on the second:

  • “if you can’t beat them… change what ‘them’ means”. Indeed, getting better algorithms for general, arbitrary distributions may be impossible; but in practice, things are rarely arbitrary. What if we knew something more about this unknown D — for instance, that it is a mixture of simple distributions? Or that it has a particular “shape” (e.g., its probability mass function is monotone)? Or that it is obtained by summing many “nice” and simple random variables? Under these assumptions, can the structure of D be leveraged to get better, more sample-efficient algorithms? (In short: yes. See for instance [2], where the sample complexity drops from n^c in general to \mathrm{poly}(\log n).)
  • “if playing by the rules is too hard, change the rules”. All the results and lower bounds above assume the algorithm has (only) access to i.i.d. samples from the distribution D. But in many cases, it could get more: for instance, there are scenarios where a tester can focus on some subset of the domain, and only get samples from it (think about “stratified sampling”, or just the situation where a pollster specifically targets voters between the ages of 25 and 46). Given this new type of access to the “conditional distributions” induced by D, can algorithms be more efficient? And if the data comes from a known, preprocessed (even sorted!) dataset, as is the case for example with Google and the n-grams (frequencies and statistics of words in a vast body of English literature), the testing algorithm can actually ask the exact frequency (equivalently, the probability value) of any given element of its choosing, in addition to just sampling a random word! This type of access, with additional or modified types of queries to D, has been studied for instance in [3,4,5,6]; and, in a nutshell, the answer is positive.

Yes, Virginia, there is a Santa Claus. And he has constant sample complexity!

Open problems

  • Study more types of structure or assumptions on the distribution: namely, for what natural assumptions on D do the testing algorithms get a significant speedup (in terms of sample complexity)?
  • What is the relation between these new types of access? Are they related? Can one simulate another? What other natural sorts of access can one consider?
  • What about learning with these new types of access? (So far, we were interested in testing whether D has some property of interest; but — maybe — learning the distribution also becomes easier?)

Such excitement — and this was only the beginning!


* Far in total variation (or statistical) distance: this is essentially (up to a factor 2) the \ell_1 distance between two probability distributions.
** and this is tight.

(this bibliography is far from comprehensive; if any reader wants additional pointers, I’ll be happy to provide more!)

[1] Taming big probability distributions, R. Rubinfeld (2012).
[2] Testing k-Modal Distributions: Optimal Algorithms via Reductions, Daskalakis, Diakonikolas, Servedio, Valiant, Valiant (2011).
[3] On the Power of Conditional Samples in Distribution Testing, Chakraborty, Fischer, Goldhirsh, Matsliah (2012).
[4] Testing probability distributions using conditional samples, Canonne, Ron, Servedio (2012).
[5] Streaming and Sublinear Approximation of Entropy and Information Distances, Guha, McGregor, Venkatasubramanian (2005).
[6] Testing probability distributions underlying aggregated data, Canonne, Rubinfeld (2014).