Background: the problem of inferring the distribution of some data, given only random elements sampled from it, is a fundamental and longstanding question in statistics, science, life and most likely even more. Although a recent line of work in the TCS community has helped understanding this question, the general case — -approximating an unknown distribution on elements, with no assumption on it — is well-known to be (a) hard, and (b) hard (that is, samples needed, and well… is big). For this reason, a lot of focus has been drawn to the study of particular, “natural” types of distributions that may arise: e.g, mixtures of well-known distributions, or distributions with some promise on their “shape” (monotone, unimodal, log-concave…) the hope being to do better, or to prove one can’t.
For instance, take a biased coin. For quite a few decades, it’s been known that estimating the bias to accuracy required samples. What about a sum of independent coins, each of them with their own bias†? Odd as it may seem, up till last year a non-trivial upper bound was not known [DDS12] — it turns out that for this sort of random variables, samples are enough.
(a quick note here: of course, learning the different biases would require a dependence on ; the problem here is to approximate the density of the distribution, i.e. come up with any type of distribution which is statistically close to the unknown one)
In this talk, Li-Yang presented an exciting result for an even more general class of distributions, glamorously named -SIIRV. If you remember the example two paragraphs above, there was a sum of differently-biased biased coins; now, replace the coins by dice — biased dice, each different, each with faces. To get an outcome, you throw them all, sum everything, and only remember the sum. More formally, a -sum of integer-valued independent random variables is of the form , where the ‘s are arbitrary independent random variables taking value in (think of as a small constant, e.g. 20 or ). The question they ask is not can we learn the distribution of ? (yes, of course: ) but rather can we learn this distribution with samples?. And — surprisingly — the answer is yes. In more details: you can learn -SIIRVs in time and sample complexity independent of . You can throw as many dice as you want!
A first observation: it’s not obvious. Intuitively, one may think something along the lines of “sure! we can do it with (the Poisson Binomial Distribution result mentioned below), let’s just generalize to “. Turns out it does not work that well — a PBD is still nice (i.e., unimodal, log-concave, well-behaved): even a -SIIRV can be a nightmare (-far from unimodal, -far from log-concave, -far from anything you wish for). So they came up with something else, and proved the following key result: if is a -SIIRV, it can essentially be decomposed as where is a discrete Gaussian with the “right” mean, is a -sparse (“nice”) random variable, and is an integer. With this characterization in hand, the learning algorithm essentially follows: for all possible choices of , try to learn the corresponding and ; this yields a pool of candidates for the final hypothesis, one of them at least being good. Now let them fight, pick the winner.
Of course, the actual analysis and proofs are way more involved, and they use for this machinery beyond the scope of my memory, talent and this small note; but — and that was the great force of this talk — the speaker explained their work clearly, with simple examples and catchy pictures, and managed to give a very good sense of what was going on — while nonchalantly sipping a cup of tea.
(Or coffee, maybe.)
† a.k.a. a Poisson Binomial Distribution: which has nothing to do with Poisson distributions, of course.
Michael Forbes on Polar Codes: Speed of polarization and polynomial gap to capacity, by Venkatesan Guruswami and Patrick Xia.
Consider the binary symmetric channel: you can send bits through this channel, and each will be flipped independently with probability . We want an optimal scheme for sending information reliably through this channel. In particular, the scheme should take a parameter and devise a way to send bits through the channel to reliably transmit actual bits of information, so that the failure probability is at most . Further, the scheme should be explicit, and should be .
Arikan (2008) gave a scheme called polar codes meeting these requirements, except that had to be very large for the scheme to succeed. Indeed, no concrete bounds were actually known, and the work of Guruswami-Xia shows that suffices.
To understand polar codes, first consider the following simpler problem. Given Bernoulli() bits (each being 1 independently with probability ) devise a map from bits to bits that is injective with high probability. It is not to hard to do so, since there are only -bit strings that appear with noticable probability, so we can just list the noticable strings in some enumerate (taking bits), and discard the neglible strings.
Now suppose you had such a map that was linear over . The claim is that this would yield a good scheme for coding in the binary symmetric channel. Specifically, let the map be defined by the matrix . Let your “messages” be those strings in the -nullspace of . It is not hard to see that are such strings. Suppose now we send such a message through the channel. Then the channel will yield , where is a binary string with each bit independently being 1 with probability . It follows then that applying yiels . But is injective with high probability on such strings , so will determine , from which point we can determine since we have .
The above thus reduces the problem of communication over the binary symmetric channel to that of compression. Arikan’s polar codes then solve this compression problem as follows. First, an matrix is devised so that given the random vector , the resulting random vector has the polarization property. That is, if you look at the coordinates of in order, then each coordinate (as a random variable) is either fully determined or fully independent of the previous coordinates (as random variables). Then, one can see that those coordinates of that are fully determined are not helpful in establishing an injective map, so can be discarded. This will discard rows of the matrix, by conservation of entropy, and thus will solve the compression problem.
To construct such a matrix , one can start with 2 Bernoulli() bits . If we then transform those to and , we first see that is now Bernoulli(), and . Thus, the entropy of is larger than that of . Individually, the entropy of is clearly the same as the entropy of , but when we look at the order of the bits things become interesting. That is, while and are independent, and are not, and in particular reading will lower the entropy of .
This 2-bit process induces a slight polarization, and now one recurses. That is, given 4 bits, first apply the 2-bit process to two pairs separately. Then, one has 2 of the bits being “higher entropy” and two bits being “lower entropy”, at which point the process can be repeated under this new grouping. The challenge in the analysis is: does this eventually lead to near full polarization? If so, how does the speed of polarization depend on ?
Arikan showed that this first question is a resounding yes, but used the martingale convergence theorem which does not address the second question. Guruswami-Xia address this second concern, by studying this process further and giving quantatitive bounds for how the polarization improves in each step.