FOCS 2013 Recaps, Part 2

We continue our FOCS 2013 recaps (see Part 1) with contributions from Clément (a guest blogger from Columbia) and Michael.

Clément Canonne on Learning sums of independent integer random variables by Constantinos Daskalakis, Ilias Diakonikolas, Ryan O’Donnell, Rocco Servedio, and Li-Yang Tan.

Background: the problem of inferring the distribution of some data, given only random elements sampled from it, is a fundamental and longstanding question in statistics, science, life, and most likely even more. Although a recent line of work in the TCS community has improved our understanding of this question, the general case — \epsilon-approximating an unknown distribution on n elements, with no assumption on it — is well-known to be (a) hard, and (b) hard (that is, \Theta(n/\epsilon^2) samples are needed, and well… n is big). For this reason, a lot of focus has been drawn to the study of particular, “natural” types of distributions that may arise: e.g., mixtures of well-known distributions, or distributions with some promise on their “shape” (monotone, unimodal, log-concave…), the hope being to do better, or to prove that one can’t.

For instance, take a biased coin. For quite a few decades, it’s been known that estimating the bias to \pm\epsilon accuracy requires \Theta(1/\epsilon^2) samples. What about a sum of n independent coins, each of them with its own bias†? Odd as it may seem, no non-trivial upper bound was known until last year [DDS12] — it turns out that for this sort of random variable, O(1/\epsilon^3) samples are enough.

(A quick note here: of course, learning the n different biases would require a dependence on n; the problem here is to approximate the density of the distribution, i.e. to come up with any distribution which is statistically close to the unknown one.)
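For concreteness, here is a minimal Python sketch (mine, not from any of the papers) of the single-coin baseline: estimating one bias to \pm\epsilon via the empirical mean, with the O(1/\epsilon^2) sample count coming from Hoeffding’s inequality. All names are my own.

```python
import math
import random

def estimate_bias(sample_coin, eps, delta=0.05):
    # Hoeffding: n >= ln(2/delta) / (2 eps^2) samples give |estimate - bias| <= eps
    # with probability at least 1 - delta.
    n = math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))
    return sum(sample_coin() for _ in range(n)) / n

# A coin with true bias 0.3, estimated to within +/- 0.05.
coin = lambda: 1 if random.random() < 0.3 else 0
print(estimate_bias(coin, eps=0.05))
```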

In this talk, Li-Yang presented an exciting result for an even more general class of distributions, glamorously named k-SIIRVs. If you remember the example two paragraphs above, there was a sum of n coins, each with its own bias; now, replace the coins by dice — biased dice, each different, each with k faces. To get an outcome, you throw them all, sum everything, and only remember the sum. More formally, a k-sum of integer-valued independent random variables is of the form S = X_1+\dots + X_n, where the X_j‘s are arbitrary independent random variables taking values in \{0,\dots,k-1\} (think of k as a small constant, e.g. 20 or 10^{47}). The question they ask is not “can we learn the distribution of S?” (yes, of course: \Theta(kn/\epsilon^2) samples suffice) but rather “can we learn this distribution with o(n) samples?” And — surprisingly — the answer is yes. In more detail: k-SIIRVs can be learned in time and sample complexity \mathrm{poly}(k,1/\epsilon), independent of n. You can throw as many dice as you want!
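To make the object concrete, here is a small Python sketch (mine, not from the paper) that draws samples from a k-SIIRV: n independent biased dice over \{0,\dots,k-1\}, summed.

```python
import random

def sample_ksiirv(dice, trials=1):
    # dice[j] is the probability vector of the j-th die over {0, ..., k-1};
    # each trial throws every die independently and records only the sum.
    k = len(dice[0])
    return [sum(random.choices(range(k), weights=die)[0] for die in dice)
            for _ in range(trials)]

# Example: n = 1000 dice, each with k = 3 faces and its own arbitrary bias.
n, k = 1000, 3
dice = [[random.random() for _ in range(k)] for _ in range(n)]
dice = [[w / sum(die) for w in die] for die in dice]   # normalize each die
print(sample_ksiirv(dice, trials=5))
```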

A first observation: it’s not obvious. Intuitively, one may think something along the lines of “sure! we can do it with k=2 (the Poisson Binomial Distribution result mentioned above), let’s just generalize to k=k“. Turns out it does not work that well — a PBD is still nice (i.e., unimodal, log-concave, well-behaved), while even a 3-SIIRV can be a nightmare (\Omega(1)-far from unimodal, \Omega(1)-far from log-concave, \Omega(1)-far from anything you wish for). So they came up with something else, and proved the following key result: if S is a k-SIIRV, it can essentially be decomposed as S \approx cZ + Y, where Z is a discrete Gaussian with the “right” mean, Y is a k-sparse (“nice”) random variable, and c\in[k-1] is an integer. With this characterization in hand, the learning algorithm essentially follows: for all possible choices of c, try to learn the corresponding Z and Y; this yields a pool of k candidates for the final hypothesis, at least one of which is good. Now let them fight, pick the winner.
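As a toy illustration of the final “pick the winner” step, here is a simplified stand-in in Python. The paper uses a more careful hypothesis-selection tournament with provable guarantees; the minimum-distance rule below, and all names, are mine.

```python
import random
from collections import Counter

def pick_winner(candidates, samples):
    # Return the candidate distribution (a dict value -> probability) closest in
    # total variation distance to the empirical distribution of the samples.
    emp, m = Counter(samples), len(samples)
    def tv_to_empirical(q):
        support = set(q) | set(emp)
        return 0.5 * sum(abs(q.get(v, 0.0) - emp[v] / m) for v in support)
    return min(candidates, key=tv_to_empirical)

# Two candidate guesses for a biased coin, judged on 2000 samples from the truth.
samples = random.choices([0, 1], weights=[0.25, 0.75], k=2000)
candidates = [{0: 0.5, 1: 0.5}, {0: 0.3, 1: 0.7}]
print(pick_winner(candidates, samples))
```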

Of course, the actual analysis and proofs are way more involved, and for this they use machinery beyond the scope of my memory, talent, and this small note; but — and that was the great force of this talk — the speaker explained their work clearly, with simple examples and catchy pictures, and managed to give a very good sense of what was going on — while nonchalantly sipping a cup of tea.

(Or coffee, maybe.)

† a.k.a. a Poisson Binomial Distribution, which has nothing to do with Poisson distributions, of course.

Michael Forbes on Polar Codes: Speed of polarization and polynomial gap to capacity, by Venkatesan Guruswami and Patrick Xia.

Consider the binary symmetric channel: you can send bits through this channel, and each will be flipped independently with probability p. We want an optimal scheme for sending information reliably through this channel. In particular, the scheme should take a parameter \epsilon > 0 and devise a way to send N bits through the channel to reliably transmit (1-H(p)-\epsilon)N actual bits of information, so that the failure probability is at most \epsilon.  Further, the scheme should be explicit, and N should be \mathrm{poly}(1/\epsilon).
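For a sense of the numbers, here is a tiny Python sketch (mine) computing the binary entropy H(p) and the target number of information bits for a block of N channel uses.

```python
import math

def binary_entropy(p):
    # H(p) = -p log2(p) - (1-p) log2(1-p), the entropy of a Bernoulli(p) bit.
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# A BSC with flip probability p = 0.11 has capacity 1 - H(p), roughly 0.5;
# the scheme should carry (1 - H(p) - eps) * N information bits per N channel uses.
p, eps, N = 0.11, 0.01, 1 << 20
print(binary_entropy(p), (1 - binary_entropy(p) - eps) * N)
```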

Arikan (2008) gave a scheme called polar codes meeting these requirements, except that N had to be very large for the scheme to succeed.  Indeed, no concrete bounds were actually known, and the work of Guruswami-Xia shows that N=\mathrm{poly}(1/\epsilon) suffices.

To understand polar codes, first consider the following simpler problem. Given N Bernoulli(p) bits (each being 1 independently with probability p), devise a map from N bits to \approx H(p)N bits that is injective with high probability.  It is not too hard to do so, since there are only \approx 2^{H(p)N} N-bit strings that appear with noticeable probability, so we can just list the noticeable strings in some enumeration (so an index takes \approx H(p)N bits), and discard the negligible strings.
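A rough numerical sanity check of this counting argument, in Python (mine; the slack parameter is an ad hoc stand-in for “noticeable”, and the gap to H(p)N shrinks as N grows and the slack is tightened):

```python
import math

N, p, slack = 200, 0.11, 0.05
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
W = int((p + slack) * N)              # keep only strings with at most W ones

count = sum(math.comb(N, w) for w in range(W + 1))      # how many strings we keep
mass = sum(math.comb(N, w) * p**w * (1 - p)**(N - w) for w in range(W + 1))

# The kept strings carry almost all the probability mass, yet indexing them takes
# roughly H(p)*N bits rather than N bits.
print(f"kept mass = {mass:.3f}, log2(#kept) = {math.log2(count):.1f}, "
      f"H(p)*N = {H*N:.1f}, N = {N}")
```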

Now suppose you had such a map that was linear over \mathbb{F}_2. The claim is that this would yield a good scheme for coding over the binary symmetric channel.  Specifically, let the map be defined by the H(p)N \times N matrix A.  Let your “messages” be the strings in the \mathbb{F}_2-nullspace of A.  It is not hard to see that there are \approx 2^{(1-H(p))N} such strings.  Suppose now we send such a message m through the channel.  Then the channel will yield m+e, where e is a binary string with each bit independently being 1 with probability p. Applying A then yields A(m+e)=Ae.  But A is injective with high probability on such strings e, so Ae will determine e, from which point we can determine m since we have m+e.
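Here is a toy numpy sketch (mine; tiny sizes, and a random binary matrix standing in for an actual compression matrix) of the algebra in this reduction: the receiver’s “syndrome” A(m+e) depends only on the noise e.

```python
import numpy as np

rng = np.random.default_rng(0)
N, R = 12, 5                                   # N channel bits, R ~ H(p)N compressed bits
A = rng.integers(0, 2, size=(R, N))            # a random binary matrix, standing in for A

# A message m is any vector in the F2-nullspace of A, i.e. A m = 0 (mod 2);
# brute-force one such nonzero vector for this toy size.
vectors = (np.array(list(np.binary_repr(i, N)), dtype=int) for i in range(1, 2 ** N))
m = next(v for v in vectors if not (A @ v % 2).any())

e = (rng.random(N) < 0.1).astype(int)          # BSC noise: each bit flipped w.p. 0.1
received = (m + e) % 2

# The syndrome of the received word equals the syndrome of the noise alone.
assert ((A @ received) % 2 == (A @ e) % 2).all()
print("syndrome:", (A @ received) % 2)
```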

The above thus reduces the problem of communication over the binary symmetric channel to that of compression.  Arikan’s polar codes then solve this compression problem as follows.  First, an N\times N matrix A is devised so that, given the random vector e, the resulting random vector Ae has the polarization property: if you look at the coordinates of Ae in order, then each coordinate (as a random variable) is either essentially fully determined by the previous coordinates or essentially fully random given them.  One can then see that the coordinates of Ae that are fully determined are not needed for an injective map, so they can be discarded.  By conservation of entropy this discards about (1-H(p))N rows of the matrix, leaving about H(p)N rows, which solves the compression problem.

To construct such a matrix A, one can start with 2 Bernoulli(p) bits x,y.  If we then transform those to x'=x\oplus y and y'=y, we first see that x' is now Bernoulli(2p(1-p)), and 2p(1-p) > p (for p < 1/2). Thus, the entropy of x' is larger than that of x.  Individually, the entropy of y' is clearly the same as the entropy of y, but when we read the bits in order things become interesting. That is, while x and y are independent, x' and y' are not, and in particular reading x' first lowers the conditional entropy of y'.
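A quick numerical check of this step in Python (mine): since the map (x, y) \mapsto (x', y') is invertible, H(x') + H(y'|x') = 2H(p), so pushing H(x') up necessarily pushes H(y'|x') down.

```python
import math

def h(p):
    # Binary entropy in bits.
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

p = 0.11
q = 2 * p * (1 - p)                           # x' = x XOR y is Bernoulli(2p(1-p))
H_x_new = h(q)
H_y_new_given_x_new = 2 * h(p) - H_x_new      # chain rule, since the map is invertible

print(f"H(x) = H(y)  = {h(p):.3f}")
print(f"H(x')        = {H_x_new:.3f}   (pushed up)")
print(f"H(y' | x')   = {H_y_new_given_x_new:.3f}   (pushed down)")
```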

This 2-bit process induces a slight polarization, and now one recurses. That is, given 4 bits, first apply the 2-bit process to two pairs separately.  Then, one has 2 of the bits being “higher entropy” and two bits being “lower entropy”, at which point the process can be repeated under this new grouping.  The challenge in the analysis is: does this eventually lead to near full polarization? If so, how does the speed of polarization depend on N?
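For concreteness, here is one way to code the recursion just described, in Python (mine; the exact index ordering of the output varies across presentations of polar codes, so treat this as a sketch rather than the construction from the papers).

```python
def polar_transform(bits):
    # Apply the 2-bit step (x, y) -> (x XOR y, y) to adjacent pairs, then recurse
    # separately on the "combined" (higher-entropy) and "kept" (lower-entropy) halves.
    if len(bits) == 1:
        return list(bits)
    combined = [bits[i] ^ bits[i + 1] for i in range(0, len(bits), 2)]
    kept = [bits[i + 1] for i in range(0, len(bits), 2)]
    return polar_transform(combined) + polar_transform(kept)

# Example on N = 8 bits (N must be a power of two).
print(polar_transform([1, 0, 1, 1, 0, 0, 1, 0]))
```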

Arikan showed that the answer to the first question is a resounding yes, but his proof used the martingale convergence theorem, which does not address the second question. Guruswami-Xia address this second concern by studying the process further and giving quantitative bounds on how the polarization improves in each step.

FOCS 2013 Recaps, Part 1

Editor’s note: don’t forget to add the blog to your favorite RSS reader (link: https://mittheory.wordpress.com/feed/), and follow our twitter account!

The Foundations of Computer Science — or FOCS for short — conference needs no introduction. Along with STOC and SODA, it is one of the juggernaut meetups in the world of theoretical computer science. This year it was held at the Berkeley Marina, which had the benefit that, in those spare moments between talks, we could meander outside the hotel and soak in the rather pleasant view of the Berkeley yacht club, with a misty downtown San Francisco far in the distance.


Of course, we weren’t there for (just) the pretty views and warm weather: from workshops on “real stable polynomials” (including their application to solving the Kadison-Singer Problem!), to OSNAPs in numerical linear algebra, to nearly maximum flows in nearly linear time, there was plenty of theory to be had.

If for some unfortunate reason you weren’t able to attend this year, then I regretfully must inform you that you missed a concert featuring Christos Papadimitriou and the Positive Eigenvalues! But not all is lost, because several students who did attend have selflessly agreed to share their “best of” recaps of FOCS 2013, to tell you about their favorite papers and a brief summary of the talks.

Today, Matt Weinberg will tell us about “A Polynomial Time Algorithm for Lossy Population Recovery”, “The Simple Economics of Approximately Optimal Auctions”, and “Interlacing Families I: Bipartite Ramanujan Graphs of All Degrees”. Next time Clément Canonne and Michael Forbes will share their recaps.

Recaps by Matt Weinberg

A Polynomial Time Algorithm for Lossy Population Recovery
Ankur Moitra and Michael Saks

This paper studies the following problem: imagine that species in a population can be represented as k-bit strings. Then each k-bit string has some probability of occurring in the population. Given access to samples from the population, one would like to recover these probabilities. If this is the model, then the obvious simple algorithm is the best one can do. Instead, the paper assumes that the samples are corrupted in the following way: each bit of the sample is replaced with a “?” independently with probability p. One would still like to use these corrupted samples to learn the population. The main result of this paper is an algorithm that learns the distribution to within \epsilon and requires only a polynomial number of samples (in k and \epsilon^{-1}) for any constant corruption probability p < 1.
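To fix the model, here is a tiny Python sketch (mine) of the corruption process on a toy population.

```python
import random

def corrupt(sample, p):
    # Replace each bit independently with '?' with probability p.
    return ''.join('?' if random.random() < p else bit for bit in sample)

# A toy population over 5-bit strings, and a few corrupted samples from it.
population = {'00000': 0.6, '10110': 0.3, '11111': 0.1}
strings, probs = zip(*population.items())
print([corrupt(random.choices(strings, weights=probs)[0], p=0.2) for _ in range(5)])
```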

Here are the high level ideas of the proof:

  • There’s a known reduction that says if you can approximately recover the probability of the all-zero string, then you can approximately recover the probability of all strings.
  • Let b be a vector where b_j denotes the probability that an uncorrupted sample from the population has exactly j ones, and let c be a vector where c_i denotes the probability that a corrupted sample has exactly i ones. Then there is an easily computable linear transformation A such that Ab = c, where A_{ij} is just the probability that exactly i of j ones remain after corruption (a small numerical illustration appears after this list). So we can also compute A^{-1}. We can then take corrupted samples to estimate c, and estimate b = A^{-1}c. Unfortunately, the coefficients in A^{-1} are so big that we’d need an exponentially good approximation to c in order to get any meaningful approximation to b by this approach.
  • Instead, they add some error matrix E to A^{-1}. Doing so has two effects: one, it changes the magnitude of terms in A^{-1} and two, it adds additional error to the recovered b_0 term. They write an LP to see if any error matrix exists that both makes the terms of A^{-1} + E small enough so that they don’t need too many samples, and doesn’t add too much additional error to the recovered term.
  • They show that the dual of this LP corresponds to searching for a certain kind of polynomial. Any such polynomial that exists implies an upper bound on how good an error matrix can be.
  • Using tools from complex analysis (specifically, Hadamard Three-Circle Theorem), they show that no polynomial exists that can imply a strong upper bound. Therefore, there are good error matrices and their LP will find one.
  • So in summary, their algorithm takes samples to estimate this vector c, where c_i is the probability that a corrupted sample has exactly i ones. Then, they solve their LP to find a good error matrix E. Then they compute (A^{-1} + E)c to recover a decent approximation to the vector b, where b_i is the probability that an uncorrupted sample had exactly i ones. Then they plug this into the reduction that we mentioned in the beginning.
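Here is the promised numerical illustration (mine, with an arbitrary choice of k and p) of the matrix A from the second bullet, and of why its naive inverse is useless.

```python
import numpy as np
from math import comb

def survival_matrix(k, p):
    # A[i, j] = probability that exactly i of j ones remain (are not erased)
    # when each bit is replaced by '?' independently with probability p.
    A = np.zeros((k + 1, k + 1))
    for j in range(k + 1):
        for i in range(j + 1):
            A[i, j] = comb(j, i) * (1 - p) ** i * p ** (j - i)
    return A

k, p = 20, 0.5
A_inv = np.linalg.inv(survival_matrix(k, p))
# The entries of A^{-1} are astronomically large, so an ordinary-accuracy estimate
# of c gets blown up; this is what the error-matrix LP works around.
print(f"largest |entry| of A^-1 for k = {k}: {np.abs(A_inv).max():.3e}")
```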

The Simple Economics of Approximately Optimal Auctions
Saeed Alaei, Hu Fu, Nima Haghpanah, and Jason Hartline

This paper studies the following problem: A salesman has multiple goods to sell to m unit-demand buyers. There may be some feasibility constraints on which agents can simultaneously receive goods. The seller wishes to auction the goods in a way that maximizes his revenue. With only a single good to sell, an interpretation of Myerson’s optimal auction by Bulow and Roberts shows that the optimal auction has a particularly simple form and can be interpreted as maximizing “marginal revenue” (discussed below). The main result of this paper is a characterization of optimal auctions with multiple goods as maximizing marginal revenue as well.

To understand their result, first consider the single-item case. When selling a single item to a single buyer, it’s well known that the best thing to do is to set a take-it-or-leave-it price, which the buyer accepts with some probability q. For any q in [0,1], one can define \mathrm{REV}(q) to be the revenue obtained by choosing the price that the buyer accepts with probability exactly q. One can also define the marginal revenue to be \mathrm{REV}'(q). The Bulow-Roberts interpretation of Myerson’s optimal auction says exactly that the optimal auction for m buyers is to ask buyers to report bids b_i, map bids to probabilities q_i (the probability that buyer i would purchase the item at price b_i), and then give the item to the agent maximizing \mathrm{REV}'(q_i). In other words, always give the item to the agent with the maximum marginal revenue.
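As a quick worked example (mine, not from the talk): suppose a single buyer’s value is uniform on [0,1]. A price of 1-q sells with probability exactly q, so \mathrm{REV}(q) = q(1-q) and \mathrm{REV}'(q) = 1-2q. Writing v = 1-q for the corresponding price, the marginal revenue is 2v-1, which is exactly Myerson’s virtual value v - \frac{1-F(v)}{f(v)} for the uniform distribution; so “maximize marginal revenue” and “maximize virtual value” pick the same winner.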

In the multi-item case, not much is known even about how to sell many items to a single agent, so it’s not even clear that an analogue of marginal revenue makes sense. The paper proposes defining \mathrm{REV}(q) to be the maximum revenue that could possibly be obtained by selling the items to a single agent in a way such that the agent will receive something with probability exactly q. Specialized to the single-item case, this recovers exactly the original definition. In addition, they give a natural mapping from bids (which are now vectors) to probabilities, and define the marginal revenue mechanism to first map bids to probabilities q_i, then select the feasible set of agents with the maximum value of \mathrm{REV}'(q_i) to be awarded an item. The item they receive is exactly what they would get in the single-agent scheme that defines \mathrm{REV}(q_i). Unlike the single-item setting, the marginal revenue mechanism may not always be optimal. But the main result of the paper shows that the mechanism is in fact optimal if a natural condition (called revenue-linearity) on the input instance holds. Their result is robust to approximation as well: if their condition only holds approximately, then the mechanism is approximately optimal. They also give some specific instances of multi-dimensional settings where revenue-linearity holds exactly and approximately.

Interlacing Families I: Bipartite Ramanujan Graphs of All Degrees (Best Paper Award)
Adam Marcus, Daniel Spielman and Nikhil Srivastava

A d-regular graph is said to be Ramanujan if all of its non-trivial eigenvalues (the ones that aren’t d or -d) are at most 2\sqrt{d-1} in absolute value. This paper addresses the question of whether an infinite sequence of d-regular Ramanujan graphs exists for every d > 2, and shows that such a sequence of bipartite Ramanujan graphs does exist for every d > 2.
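As a quick empirical illustration (mine, assuming numpy and networkx are available; a random regular graph is not literally Ramanujan, but by Friedman’s theorem it comes close to the bound with high probability):

```python
import numpy as np
import networkx as nx

d, n = 3, 1000
G = nx.random_regular_graph(d, n, seed=1)
eigs = np.linalg.eigvalsh(nx.to_numpy_array(G))

# Drop the trivial eigenvalues at +/- d and compare the rest to 2*sqrt(d-1).
nontrivial = [x for x in eigs if abs(abs(x) - d) > 1e-8]
print(f"largest non-trivial |eigenvalue| = {max(abs(x) for x in nontrivial):.4f}, "
      f"Ramanujan bound = {2 * np.sqrt(d - 1):.4f}")
```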

They obtain their result by proving a conjecture of Bilu and Linial on the existence of certain kinds of graphs. Specifically, the conjecture is that every d-regular graph has a signing (that is, a weighting of all edges to be either +1 or -1), such that all eigenvalues of the signed graph are at most 2 \sqrt{d-1} in absolute value. The paper proves a restricted version of this conjecture for bipartite graphs. Once this conjecture is proved, existing machinery by Bilu and Linial yields the main theorem. Below is a high level overview of the proof:

  • Consider randomly assigning each edge to be +1 or -1 independently with probability 1/2. One could then compute the expected largest eigenvalue of the resulting signed adjacency matrix and hope that it’s at most 2\sqrt{d-1}. Unfortunately, this isn’t the case. Instead, they compute the expected characteristic polynomial, that is, the average over all 2^{|E|} signings of the characteristic polynomial of the signed adjacency matrix.
  • It turns out that this expected polynomial was already computed by Godsil and Gutman: it is the matching polynomial of the original graph. That is, the coefficient of x^{n-2i} is (-1)^i times the number of matchings with exactly i edges. (A small numerical check of this, and of the root bound in the next item, appears after this list.)
  • Now that they have the expected polynomial, they want to compute its largest root. Fortunately, it turns out that this has also already been done: Heilmann and Lieb showed that all roots are real and that the largest root is at most 2\sqrt{d-1}.
  • So now they have to relate the roots of the expected characteristic polynomial to the roots of actual signings. The question they address is: when do the roots of an average polynomial tell you anything about the roots of the polynomials being averaged? The answer is usually “absolutely nothing”, but the main technical result of this paper is a sufficient condition for the answer to be “something useful.” They give a definition called interlacing families of polynomials, and show that for every interlacing family there exists a polynomial in the family whose largest root is no larger than the largest root of the averaged polynomial.
  • The penultimate step is showing that the characteristic polynomials of the signings form an interlacing family (which seems quite involved). With this, they’ve shown that there exists a signing whose characteristic polynomial has largest root no larger than the largest root of the expected polynomial, which is at most 2\sqrt{d-1}. The last step is observing that if G is bipartite, then the eigenvalues of any signing come in pairs of the form \{\lambda, -\lambda\}. Therefore, every bipartite d-regular graph has a signing with all eigenvalues having absolute value at most 2\sqrt{d-1}.
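Here is the promised numerical check of the second and third bullets (my code, assuming numpy and networkx; the brute-force enumeration is exponential, so toy graphs only): compute the matching polynomial of a small 3-regular bipartite graph and verify that its roots are real and at most 2\sqrt{d-1}.

```python
import itertools
import numpy as np
import networkx as nx

def matching_polynomial_coeffs(G):
    # Coefficient of x^(n-2i) is (-1)^i times the number of matchings with i edges.
    # Brute force over edge subsets: fine for toy graphs only.
    n, edges = G.number_of_nodes(), list(G.edges())
    coeffs = np.zeros(n + 1)                   # coeffs[j] multiplies x^(n-j)
    for i in range(n // 2 + 1):
        m_i = sum(1 for subset in itertools.combinations(edges, i)
                  if len({v for e in subset for v in e}) == 2 * i)
        coeffs[2 * i] = (-1) ** i * m_i
    return coeffs

G = nx.hypercube_graph(3)                      # a 3-regular bipartite graph on 8 vertices
roots = np.roots(matching_polynomial_coeffs(G))
print("max |imaginary part| of a root:", np.abs(roots.imag).max())
print("largest root:", roots.real.max(), " vs 2*sqrt(d-1) =", 2 * np.sqrt(2))
```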