FOCS 2013 Recaps, Part 2

We continue our FOCS 2013 recaps (Part 1) with recaps by Clément (a guest blogger from Columbia) and Michael.

Clément Canonne on Learning sums of independent integer random variables by Constantinos Daskalakis, Ilias Diakonikolas, Ryan O’Donnell, Rocco Servedio, and Li-Yang Tan.

Background: the problem of inferring the distribution of some data, given only random elements sampled from it, is a fundamental and longstanding question in statistics, science, life and most likely even more. Although a recent line of work in the TCS community has helped understanding this question, the general case — \epsilon-approximating an unknown distribution on n elements, with no assumption on it — is well-known to be (a) hard, and (b) hard (that is, \Theta(n/\epsilon^2) samples needed, and well… n is big). For this reason, a lot of focus has been drawn to the study of particular, “natural” types of distributions that may arise: e.g, mixtures of well-known distributions, or distributions with some promise on their “shape” (monotone, unimodal, log-concave…) the hope being to do better, or to prove one can’t.

For instance, take a biased coin. For quite a few decades, it’s been known that estimating the bias to \pm\epsilon accuracy required \Theta(1/\epsilon^2) samples. What about a sum of n independent coins, each of them with their own bias†? Odd as it may seem, up till last year a non-trivial upper bound was not known [DDS12] — it turns out that for this sort of random variables, O(1/\epsilon^3) samples are enough.

(a quick note here: of course, learning the n different biases would require a dependence on n; the problem here is to approximate the density of the distribution, i.e. come up with any type of distribution which is statistically close to the unknown one)

In this talk, Li-Yang presented an exciting result for an even more general class of distributions, glamorously named kSIIRV. If you remember the example two paragraphs above, there was a sum of n differently-biased biased coins; now, replace the coins by dice — biased dice, each different, each with k faces. To get an outcome, you throw them all, sum everything, and only remember the sum. More formally, a k-sum of integer-valued independent random variables is of the form S = X_1+\dots + X_n, where the X_j‘s are arbitrary independent random variables taking value in \{0,\dots,k-1\} (think of k as a small constant, e.g. 20 or 10^{47}). The question they ask is not can we learn the distribution of S? (yes, of course: \Theta(kn/\epsilon^2)) but rather can we learn this distribution with o(n) samples?. And — surprisingly — the answer is yes. In more details: you can learn k-SIIRVs in time and sample complexity \mathrm{poly}(k,1/\epsilon) independent of n. You can throw as many dice as you want!

A first observation: it’s not obvious. Intuitively, one may think something along the lines of “sure! we can do it with k=2 (the Poisson Binomial Distribution result mentioned below), let’s just generalize to k=k“. Turns out it does not work that well — a PBD is still nice (i.e., unimodal, log-concave, well-behaved): even a 3-SIIRV can be a nightmare (\Omega(1)-far from unimodal, \Omega(1)-far from log-concave, \Omega(1)-far from anything you wish for). So they came up with something else, and proved the following key result: if S is a k-SIIRV, it can essentially be decomposed asS \approx cZ + Y where Z is a discrete Gaussian with the “right” mean, Y is a k-sparse (“nice”) random variable, and c\in[k-1] is an integer. With this characterization in hand, the learning algorithm essentially follows: for all possible choices of c, try to learn the corresponding Z and Y; this yields a pool of k candidates for the final hypothesis, one of them at least being good. Now let them fight, pick the winner.

Of course, the actual analysis and proofs are way more involved, and they use for this machinery beyond the scope of my memory, talent and this small note; but — and that was the great force of this talk — the speaker explained their work clearly, with simple examples and catchy pictures, and managed to give a very good sense of what was going on — while nonchalantly sipping a cup of tea.

(Or coffee, maybe.)

† a.k.a. a Poisson Binomial Distribution: which has nothing to do with Poisson distributions, of course.

Michael Forbes on Polar Codes: Speed of polarization and polynomial gap to capacity, by Venkatesan Guruswami and Patrick Xia.

Consider the binary symmetric channel: you can send bits through this channel, and each will be flipped independently with probability p. We want an optimal scheme for sending information reliably through this channel. In particular, the scheme should take a parameter \epsilon > 0 and devise a way to send N bits through the channel to reliably transmit (1-H(p)-\epsilon)N actual bits of information, so that the failure probability is at most \epsilon.  Further, the scheme should be explicit, and N should be \mathrm{poly}(1/\epsilon).

Arikan (2008) gave a scheme called polar codes meeting these requirements, except that N had to be very large for the scheme to succeed.  Indeed, no concrete bounds were actually known, and the work of Guruswami-Xia shows that N=\mathrm{poly}(1/\epsilon) suffices.

To understand polar codes, first consider the following simpler problem. Given N Bernoulli(p) bits (each being 1 independently with probability p) devise a map from N bits to \approx H(p)N bits that is injective with high probability.  It is not to hard to do so, since there are only \approx 2^{H(p)N} N-bit strings that appear with noticable probability, so we can just list the noticable strings in some enumerate (taking \approx H(p)N bits), and discard the neglible strings.

Now suppose you had such a map that was linear over \mathbb{F}_2. The claim is that this would yield a good scheme for coding in the binary symmetric channel.  Specifically, let the map be defined by the H(p)N \times N matrix A.  Let your “messages” be those strings in the \mathbb{F}_2-nullspace of A.  It is not hard to see that are \approx 2^{(1-H(p))N} such strings.  Suppose now we send such a message m through the channel.  Then the channel will yield m+e, where e is a binary string with each bit independently being 1 with probability p. It follows then that applying A yiels A(m+e)=Ae.  But A is injective with high probability on such strings e, so Ae will determine e, from which point we can determine m since we have m+e.

The above thus reduces the problem of communication over the binary symmetric channel to that of compression.  Arikan’s polar codes then solve this compression problem as follows.  First, an N\times N matrix A is devised so that given the random vector e, the resulting random vector Ae has the polarization property.  That is, if you look at the coordinates of Ae in order, then each coordinate (as a random variable) is either fully determined or fully independent of the previous coordinates (as random variables).  Then, one can see that those coordinates of Ae that are fully determined are not helpful in establishing an injective map, so can be discarded.  This will discard H(p)N rows of the matrix, by conservation of entropy, and thus will solve the compression problem.

To construct such a matrix A, one can start with 2 Bernoulli(p) bits x,y.  If we then transform those to x'=x\oplus y and y'=y, we first see that x' is now Bernoulli(2p(1-p)), and 2p(1-p)\gg p. Thus, the entropy of x' is larger than that of x.  Individually, the entropy of y' is clearly the same as the entropy of y, but when we look at the order of the bits things become interesting. That is, while x and y are independent, x' and y' are not, and in particular reading x' will lower the entropy of y'.

This 2-bit process induces a slight polarization, and now one recurses. That is, given 4 bits, first apply the 2-bit process to two pairs separately.  Then, one has 2 of the bits being “higher entropy” and two bits being “lower entropy”, at which point the process can be repeated under this new grouping.  The challenge in the analysis is: does this eventually lead to near full polarization? If so, how does the speed of polarization depend on N?

Arikan showed that this first question is a resounding yes, but used the martingale convergence theorem which does not address the second question. Guruswami-Xia address this second concern, by studying this process further and giving quantatitive bounds for how the polarization improves in each step.


6 thoughts on “FOCS 2013 Recaps, Part 2

  1. Hi Notso (can I call you Notso?),
    There’s a point I’m don’t quite understand in the Polar Code protocol: namely, whether the matrix A is known beforehand by the two parties, or if they also have to communicate it.
    I reckon it boils down to the fact that I don’t really get what is meant by “The above thus reduces the problem of communication over the binary symmetric channel to that of compression.” Is the challenge to design an efficient way of computing a good A, or two compress such an A in a very succinct representation?

    • Both parties are assumed to know the protocol for communication, and this is a typical phenomenon in coding theory. Even in the non-constructive proofs of achieving channel capacity (via the probabilistic method), one can assume this as the two parties can use brute-force methods to find the lexicographically first coding scheme that works.

      In polar codes, the scheme for achieving polarization – the construction of the NxN matrix – is deterministic. The discarding of redundant rows is somewhat more complicated. One method would be to use estimation/sampling procedures are run to determine which rows polarize to zero (conditional) entropy. This sampling requires randomness, and one might be concerned that if two parties used the same procedure then they would discard different rows and thus result in different coding schemes (and thus fail to communicate). I imagine this probability of error can be made sufficiently small.

      However, Guruswami-Xia do show that this estimation can be done deterministically, so that the construction of NxN matrix, as well as the choice of which H(p)N rows to discard, are completely determined just by the parameters of the channel. This way both parties can run this procedure to get the same matrix A. Then the sender can sample codewords from the nullspace of this matrix, and the receiver can use the decoding algorithm to remove the noise (I didn’t describe this in the post, but there is a quasi-linear time decoding algorithm for polar codes, that was known before).

      • Thanks — this leaves me, I assume, with yet another point I don’t get: shouldn’t be then approximately $(1-H(p))N$ rows of the original $N\times N$ matrix A that end up being discarded? (right now, it reads $H(p)N$)

      • (seems I can’t reply directly to the latest question.)

        Sorry, yes, you are correct. In the fourth paragraph I indicate (correctly) that the final matrix will be H(p)NxN, but then later in the fifth paragraph indicate that we should discard H(p)N rows, when we should really be keeping that number.

  2. Welcome! – Not so Great Ideas in Theoretical Computer Science

  3. Efficient Distribution Estimation 4: Greek Afternoon – STOC 2014 Recaps (Part 7) – Not so Great Ideas in Theoretical Computer Science

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s