We continue our FOCS 2013 recaps (Part 1) with recaps by Clément (a guest blogger from Columbia) and Michael.
Clément Canonne on Learning sums of independent integer random variables by Constantinos Daskalakis, Ilias Diakonikolas, Ryan O’Donnell, Rocco Servedio, and Li-Yang Tan.
Background: the problem of inferring the distribution of some data, given only random elements sampled from it, is a fundamental and longstanding question in statistics, science, life and most likely even more. Although a recent line of work in the TCS community has helped in understanding this question, the general case — $\epsilon$-approximating an unknown distribution on $n$ elements, with no assumption on it — is well-known to be (a) hard and (b) hard (that is, $\Theta(n/\epsilon^2)$ samples are needed, and well… $n$ is big). For this reason, a lot of focus has been drawn to the study of particular, "natural" types of distributions that may arise: e.g., mixtures of well-known distributions, or distributions with some promise on their "shape" (monotone, unimodal, log-concave…), the hope being to do better, or to prove one can't.
For instance, take a biased coin. For quite a few decades, it's been known that estimating the bias to accuracy $\epsilon$ requires $\Theta(1/\epsilon^2)$ samples. What about a sum of $n$ independent coins, each of them with their own bias†? Odd as it may seem, up till last year a non-trivial upper bound was not known [DDS12] — it turns out that for this sort of random variables, $\mathrm{poly}(1/\epsilon)$ samples are enough.
(a quick note here: of course, learning the different biases would require a dependence on $n$; the problem here is to approximate the density of the distribution, i.e. come up with any distribution which is statistically close to the unknown one)
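As a toy illustration of the single-coin baseline (a minimal sketch, not from the paper; the constant in the sample size is arbitrary, and a Chernoff/Hoeffding bound is what makes the $\Theta(1/\epsilon^2)$ claim precise):

```python
import random

def estimate_bias(flip, eps, c=4):
    """Estimate a coin's bias to additive accuracy eps from O(1/eps^2) flips.
    `flip` is a zero-argument function returning 0 or 1; the constant c is
    arbitrary -- a Chernoff/Hoeffding bound pins down the right value."""
    m = int(c / eps ** 2)
    return sum(flip() for _ in range(m)) / m

true_bias = 0.37                                    # unknown to the learner
coin = lambda: 1 if random.random() < true_bias else 0
print(estimate_bias(coin, eps=0.05))                # typically within 0.05 of 0.37
```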
In this talk, Li-Yang presented an exciting result for an even more general class of distributions, glamorously named $k$-SIIRVs. If you remember the example two paragraphs above, there was a sum of $n$ differently-biased coins; now, replace the coins by dice — biased dice, each different, each with $k$ faces. To get an outcome, you throw them all, sum everything, and only remember the sum. More formally, a $k$-sum of integer-valued independent random variables is of the form $S = X_1 + \dots + X_n$, where the $X_i$'s are arbitrary independent random variables taking value in $\{0, 1, \dots, k-1\}$ (think of $k$ as a small constant, e.g. 20). The question they ask is not can we learn the distribution of $S$? (yes, of course: $O(nk/\epsilon^2)$ samples suffice) but rather can we learn this distribution with $\mathrm{poly}(k, 1/\epsilon)$ samples? And — surprisingly — the answer is yes. In more detail: you can learn $k$-SIIRVs in time and sample complexity $\mathrm{poly}(k, 1/\epsilon)$, independent of $n$. You can throw as many dice as you want!
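To make the model concrete, here is a minimal Python sketch of the sampling process (the values of $n$, $k$ and the per-die biases are made up for illustration); the learner only ever gets to see the resulting sums.

```python
import random

def sample_k_siirv(dists, rng=random):
    """One draw of S = X_1 + ... + X_n, where dists[i] is the probability
    vector of X_i over {0, 1, ..., k-1}."""
    return sum(rng.choices(range(len(p)), weights=p)[0] for p in dists)

# Hypothetical parameters, purely for illustration: n = 1000 dice with k = 3
# faces, each die carrying its own arbitrary bias.
n, k = 1000, 3
raw = [[random.random() for _ in range(k)] for _ in range(n)]
dists = [[w / sum(row) for w in row] for row in raw]
samples = [sample_k_siirv(dists) for _ in range(500)]   # all the learner sees
```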
A first observation: it's not obvious. Intuitively, one may think something along the lines of "sure! we can do it for $k=2$ (the Poisson Binomial Distribution result mentioned below), let's just generalize to bigger $k$". Turns out it does not work that well — a PBD is still nice (i.e., unimodal, log-concave, well-behaved): even a $3$-SIIRV can be a nightmare ($\Omega(1)$-far from unimodal, $\Omega(1)$-far from log-concave, $\Omega(1)$-far from anything you wish for). So they came up with something else, and proved the following key result: if $S$ is a $k$-SIIRV, it can essentially be decomposed as $S \approx cZ + Y$, where $Z$ is a discrete Gaussian with the "right" mean and variance, $Y$ is a sparse ("nice") random variable, and $c \in \{1, \dots, k-1\}$ is an integer. With this characterization in hand, the learning algorithm essentially follows: for all possible choices of $c$, try to learn the corresponding $Z$ and $Y$; this yields a small pool of candidates for the final hypothesis, one of them at least being good. Now let them fight, pick the winner.
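The "let them fight" step is a standard hypothesis-selection tournament. The paper's actual procedure is more careful, but a generic Scheffé-style pairwise contest, sketched below in Python, conveys the flavor (the functions and their inputs are my own illustration, not the paper's interface).

```python
def scheffe_winner(p, q, samples):
    """Generic Scheffe-style contest between two candidate distributions p and q
    (dicts mapping integer outcomes to probabilities), judged with fresh samples
    from the unknown distribution.  The candidate that better predicts the mass
    of the "Scheffe set" W = {x : p(x) > q(x)} wins."""
    support = set(p) | set(q)
    w = {x for x in support if p.get(x, 0.0) > q.get(x, 0.0)}   # the Scheffe set W
    emp_w = sum(1 for s in samples if s in w) / len(samples)    # empirical mass of W
    p_w = sum(p.get(x, 0.0) for x in w)                         # p's predicted mass of W
    q_w = sum(q.get(x, 0.0) for x in w)                         # q's predicted mass of W
    return p if abs(p_w - emp_w) <= abs(q_w - emp_w) else q

def tournament(candidates, samples):
    """Single-elimination version of "let them fight": run pairwise contests and
    keep the current winner.  (The precise guarantees need a more careful
    analysis, and the paper's selection procedure differs in the details.)"""
    best = candidates[0]
    for cand in candidates[1:]:
        best = scheffe_winner(best, cand, samples)
    return best
```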
Of course, the actual analysis and proofs are way more involved, and they use for this machinery beyond the scope of my memory, talent and this small note; but — and that was the great force of this talk — the speaker explained their work clearly, with simple examples and catchy pictures, and managed to give a very good sense of what was going on — while nonchalantly sipping a cup of tea.
(Or coffee, maybe.)
† a.k.a. a Poisson Binomial Distribution: which has nothing to do with Poisson distributions, of course.
Michael Forbes on Polar Codes: Speed of polarization and polynomial gap to capacity, by Venkatesan Guruswami and Patrick Xia.
Consider the binary symmetric channel: you can send bits through this channel, and each will be flipped independently with probability $p$. We want an optimal scheme for sending information reliably through this channel. In particular, the scheme should take a parameter $\epsilon$ and devise a way to send $N$ bits through the channel to reliably transmit $(1 - H(p) - \epsilon)N$ actual bits of information (where $H$ is the binary entropy function), so that the failure probability is at most $o(1)$. Further, the scheme should be explicit, and $N$ should be $\mathrm{poly}(1/\epsilon)$.
Arikan (2008) gave a scheme called polar codes meeting these requirements, except that $N$ had to be very large for the scheme to succeed. Indeed, no concrete bounds were actually known, and the work of Guruswami-Xia shows that $N = \mathrm{poly}(1/\epsilon)$ suffices.
To understand polar codes, first consider the following simpler problem. Given $N$ Bernoulli($p$) bits (each being 1 independently with probability $p$), devise a map from $N$ bits to $\approx H(p)N$ bits that is injective with high probability. It is not too hard to do so, since there are only $\approx 2^{H(p)N}$ $N$-bit strings that appear with noticeable probability, so we can just list the noticeable strings in some enumeration (taking $H(p)N$ bits), and discard the negligible strings.
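For a quick numerical sense of this counting (a minimal sketch; the values of $p$ and $N$ are made up): out of all $2^N$ strings, only about $2^{H(p)N}$ carry noticeable probability, so roughly $H(p)N$ bits suffice to index them.

```python
import math

def binary_entropy(p):
    """H(p) in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

p, N = 0.11, 1000                  # illustrative values only
print(binary_entropy(p) * N)       # ~500: the target length (in bits) of the compression
```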
Now suppose you had such a map that was linear over $\mathbb{F}_2$. The claim is that this would yield a good scheme for coding in the binary symmetric channel. Specifically, let the map be defined by the $H(p)N \times N$ matrix $A$. Let your "messages" be those strings in the $\mathbb{F}_2$-nullspace of $A$. It is not hard to see that there are $\approx 2^{(1 - H(p))N}$ such strings. Suppose now we send such a message $x$ through the channel. Then the channel will yield $y = x + e$, where $e$ is a binary string with each bit independently being 1 with probability $p$. It follows then that applying $A$ yields $Ay = Ax + Ae = Ae$. But $A$ is injective with high probability on such strings $e$, so $Ae$ will determine $e$, from which point we can determine $x$ since we have $y$.
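Here is a tiny numpy sketch of just this algebra; the matrix $A$ below is a random $\mathbb{F}_2$ matrix standing in for a good compressor (nothing polar about it), chosen only to check the identity $Ay = Ae$ and the decoding-by-compression idea.

```python
import numpy as np

rng = np.random.default_rng(0)
N, rows = 12, 6                             # toy sizes; `rows` plays the role of H(p)N
A = rng.integers(0, 2, size=(rows, N))      # stand-in F_2-linear compressor (random, not polar)

# Messages = vectors in the F_2-nullspace of A (brute force is fine for N = 12).
codewords = [x for x in (np.array(list(np.binary_repr(i, N)), dtype=int)
                         for i in range(2 ** N))
             if not (A @ x % 2).any()]

x = codewords[1]                            # some nonzero codeword (the message)
e = (rng.random(N) < 0.1).astype(int)       # BSC noise: each bit flipped w.p. 0.1
y = (x + e) % 2                             # what the receiver sees
assert ((A @ y) % 2 == (A @ e) % 2).all()   # A y = A x + A e = A e over F_2
# If A is injective on typical noise patterns, Ae pins down e, and then x = y + e.
```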
The above thus reduces the problem of communication over the binary symmetric channel to that of compression. Arikan's polar codes then solve this compression problem as follows. First, an $N \times N$ matrix $M$ is devised so that given the random vector $x$ of $N$ Bernoulli($p$) bits, the resulting random vector $Mx$ has the polarization property. That is, if you look at the coordinates of $Mx$ in order, then each coordinate (as a random variable) is either fully determined by, or fully independent of, the previous coordinates (as random variables). Then, one can see that those coordinates of $Mx$ that are fully determined are not helpful in establishing an injective map, so they can be discarded. This discards $(1 - H(p))N$ rows of the matrix, by conservation of entropy, leaving the desired $H(p)N \times N$ compression matrix and thus solving the compression problem.
To construct such a matrix $M$, one can start with 2 Bernoulli($p$) bits $x_1, x_2$. If we then transform those to $u_1 = x_1 \oplus x_2$ and $u_2 = x_2$, we first see that $u_1$ is now Bernoulli($2p(1-p)$), and $2p(1-p) > p$ (for $p < 1/2$). Thus, the entropy of $u_1$ is larger than that of $x_1$. Individually, the entropy of $u_2$ is clearly the same as the entropy of $x_2$, but when we look at the order of the bits things become interesting. That is, while $x_1$ and $x_2$ are independent, $u_1$ and $u_2$ are not, and in particular reading $u_1$ will lower the (conditional) entropy of $u_2$.
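The bookkeeping for this 2-bit step is easy to check numerically (a small sketch with an illustrative value of $p$); the identity $H(u_1) + H(u_2 \mid u_1) = 2H(p)$ is exactly the conservation of entropy invoked above.

```python
import math

def H(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

p = 0.11                        # illustrative crossover probability, p < 1/2
h_x = H(p)                      # entropy of each original bit x1, x2
h_u1 = H(2 * p * (1 - p))       # u1 = x1 XOR x2 is Bernoulli(2p(1-p)): the "worse" bit
h_u2_given_u1 = 2 * h_x - h_u1  # conservation of entropy: H(u1) + H(u2|u1) = 2 H(p)

print(h_u1, ">", h_x, ">", h_u2_given_u1)   # one bit got "worse", the other "better"
```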
This 2-bit process induces a slight polarization, and now one recurses. That is, given 4 bits, first apply the 2-bit process to two pairs separately. Then, one has 2 of the bits being "higher entropy" and two bits being "lower entropy", at which point the process can be repeated under this new grouping. The challenge in the analysis is: does this eventually lead to near full polarization? If so, how does the speed of polarization depend on $N$?
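Concretely, writing the 2-bit step as a matrix over $\mathbb{F}_2$, the recursion amounts to taking Kronecker powers of that $2 \times 2$ kernel; the rough sketch below builds the resulting $N \times N$ transform (Arikan's construction also reorders the coordinates by a bit-reversal permutation, which I omit here).

```python
import numpy as np

# The 2-bit step (x1, x2) -> (u1, u2) = (x1 XOR x2, x2), written as a matrix over F_2.
K = np.array([[1, 1],
              [0, 1]])

def recursive_transform(n):
    """The N x N transform (N = 2^n) obtained by recursing the 2-bit step:
    the n-fold Kronecker power of K, up to a reordering of the coordinates."""
    M = np.array([[1]])
    for _ in range(n):
        M = np.kron(K, M) % 2
    return M

print(recursive_transform(3))   # an 8 x 8 polarizing transform over F_2
```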
Arikan showed that the answer to the first question is a resounding yes, but used the martingale convergence theorem, which does not address the second question. Guruswami-Xia address this second concern by studying the process further and giving quantitative bounds on how the polarization improves in each step.
Hi Notso (can I call you Notso?),
There's a point I don't quite understand in the Polar Code protocol: namely, whether the matrix A is known beforehand by the two parties, or if they also have to communicate it.
I reckon it boils down to the fact that I don't really get what is meant by "The above thus reduces the problem of communication over the binary symmetric channel to that of compression." Is the challenge to design an efficient way of computing a good A, or to compress such an A into a very succinct representation?
Both parties are assumed to know the protocol for communication, and this is a typical phenomenon in coding theory. Even in the non-constructive proofs of achieving channel capacity (via the probabilistic method), one can assume this as the two parties can use brute-force methods to find the lexicographically first coding scheme that works.
In polar codes, the scheme for achieving polarization – the construction of the NxN matrix – is deterministic. The discarding of redundant rows is somewhat more complicated. One method would be to run estimation/sampling procedures to determine which rows polarize to zero (conditional) entropy. This sampling requires randomness, and one might be concerned that if the two parties used the same procedure then they would discard different rows and thus end up with different coding schemes (and thus fail to communicate). I imagine this probability of error can be made sufficiently small.
However, Guruswami-Xia do show that this estimation can be done deterministically, so that the construction of the NxN matrix, as well as the choice of which H(p)N rows to discard, are completely determined just by the parameters of the channel. This way both parties can run this procedure to get the same matrix A. Then the sender can sample codewords from the nullspace of this matrix, and the receiver can use the decoding algorithm to remove the noise (I didn't describe this in the post, but there is a quasi-linear time decoding algorithm for polar codes, which was known before).
Thanks — this leaves me, I assume, with yet another point I don't get: shouldn't it then be approximately $(1-H(p))N$ rows of the original $N\times N$ matrix A that end up being discarded? (right now, it reads $H(p)N$)
(seems I can’t reply directly to the latest question.)
Sorry, yes, you are correct. In the fourth paragraph I indicate (correctly) that the final matrix will be H(p)NxN, but then later in the fifth paragraph indicate that we should discard H(p)N rows, when we should really be keeping that number.