Efficient Distribution Estimation 4: Greek Afternoon – STOC 2014 Recaps (Part 7)

This post is the finale of our series on the Efficient Distribution Estimation workshop. See also part 1, part 2, and part 3.

After one last caffeination, we gathered once more for the exciting finale to the day — we were in for a very Greek afternoon, with back-to-back tutorials by Ilias Diakonikolas and Costis Daskalakis. As a wise man once said a few blog posts ago, “if you can’t beat them… change what ‘them’ means”. That was the running theme of this session – they showed us how to exploit the structure of a distribution to design more efficient algorithms.

Gautam Kamath on Beyond Histograms: Exploiting Structure in Distribution Estimation

Ilias focused on learning structured distributions – in particular, k-flat distributions. A k-flat distribution can be described by k intervals, over which the probability mass function is constant – it literally looks like k flat pieces. Warm up: what if we knew where each piece started and ended? Then the learning problem would be easy: by grouping the samples that fall into each interval together, we reduce the support size from n to k. Now we can use the same algorithm as for the general case – only this time, it takes O(k/\varepsilon^2) samples.
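For concreteness, here's what the warm-up looks like in code – a quick Python sketch (the function name and interface are my own, not from the talk): pool the samples landing in each known interval, then spread each interval's empirical mass uniformly across it.

```python
def learn_flat_known_pieces(samples, breakpoints, n):
    """Estimate a k-flat distribution over {0, ..., n-1} when the k
    intervals are known. breakpoints: sorted interval start points,
    e.g. [0, 3, 7]. Returns a length-n list of probabilities that is
    constant on each interval."""
    k = len(breakpoints)
    ends = breakpoints[1:] + [n]
    # Count how many samples fall into each interval.
    counts = [0] * k
    for s in samples:
        for i in range(k):  # linear scan for clarity
            if breakpoints[i] <= s < ends[i]:
                counts[i] += 1
                break
    m = len(samples)
    est = [0.0] * n
    for i in range(k):
        width = ends[i] - breakpoints[i]
        mass = counts[i] / m
        for x in range(breakpoints[i], ends[i]):
            est[x] = mass / width  # spread the interval's mass uniformly
    return est
```

Grouping the samples this way is exactly the support-size reduction from n to k described above; everything else is the usual empirical estimator.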

But we’re rarely lucky enough to know the correct intervals. What should we do? Guess the intervals? Too expensive. Guess the intervals approximately? A bit better, but still pricey. Make the problem a bit easier and allow ourselves to use k/\varepsilon intervals, instead of only k? This actually works — while lesser mortals would be satisfied with this solution and call it a day, Ilias refused to leave us with anything less than the grand prize. Can we efficiently output a k-flat distribution using only O(k/\varepsilon^2) samples?

Yes, we can, using some tools from Vapnik-Chervonenkis (VC) theory. The VC inequality is a useful tool which allows us to relate the empirical distribution and the true distribution, albeit under a weaker metric than the total variation distance. Skipping some technical details, the key idea is that we want to output a k-flat distribution that is close to the empirical distribution under this weaker metric. Using the triangle inequality, specific properties of this metric, and the fact that we’re comparing two k-flat distributions, this will give us a k-flat distribution which is close to the target in total variation distance, as desired. Computing a k-flat distribution that’s close to the empirical one isn’t actually too tough – a careful dynamic program works.
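To give a flavor of that last step, here's a toy dynamic program in Python (entirely my own simplification): it fits the best k-piece piecewise-constant function to an empirical pmf under squared error, a stand-in for the weaker VC-based metric the real algorithm optimizes, but the segmentation recursion has the same shape.

```python
def best_k_flat_fit(p, k):
    """Fit the best k-piece piecewise-constant function to the
    empirical pmf p under squared error, via dynamic programming.
    Returns (total cost, sorted list of k interval start points)."""
    n = len(p)
    # Prefix sums let us price any constant segment in O(1).
    pre, pre2 = [0.0], [0.0]
    for x in p:
        pre.append(pre[-1] + x)
        pre2.append(pre2[-1] + x * x)

    def cost(i, j):
        # Squared error of the best constant on p[i:j].
        s, s2, m = pre[j] - pre[i], pre2[j] - pre2[i], j - i
        return s2 - s * s / m

    INF = float("inf")
    # dp[j][t] = min cost of covering p[:j] with t pieces.
    dp = [[INF] * (k + 1) for _ in range(n + 1)]
    cut = [[0] * (k + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for j in range(1, n + 1):
        for t in range(1, k + 1):
            for i in range(t - 1, j):
                c = dp[i][t - 1] + cost(i, j)
                if c < dp[j][t]:
                    dp[j][t], cut[j][t] = c, i
    # Walk the cut table backwards to recover the breakpoints.
    bps, j = [], n
    for t in range(k, 0, -1):
        j = cut[j][t]
        bps.append(j)
    return dp[n][k], sorted(bps)
```

The real algorithm swaps the segment cost for one based on the VC-style metric, but the "best split of a prefix into t pieces" recursion is the same.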

We can learn k-flat distributions – so what? This class might strike you as rather narrow, but this result leads to algorithms for a variety of classes of distributions, including monotone, t-modal, monotone hazard rate, and log-concave distributions. These classes are all close to k-flat, and this algorithm is fine with that. In this sense, this tool captures all these classes at the same time — One algorithm to rule them all, so to speak. This algorithm even directly generalizes to mixtures of these distributions, which is huge — studying mixtures usually makes the problem much more difficult.

Alright, but what’s the catch? Not all distributions are that close to k-flat. For example, this algorithm requires \tilde O(1/\varepsilon^3) samples to learn a log-concave distribution, even though the optimal sample complexity is O(1/\varepsilon^{5/2}). It turns out that log-concave distributions are close to k-linear rather than k-flat, and we must use a k-linear approximation if we want near-optimal sample complexity.

“Please sir, I want some more.”

You can have it, Oliver — it’s actually possible to learn distributions which are close to mixtures of piecewise polynomials! But this topic is juicy enough that it deserves its own blog post.

Open problems

  • The perennial question – what can we do in high dimensions?
  • Most of these results are fundamentally improper – they approximate a distribution with a distribution which may be from a different class. Can these techniques lead to computationally efficient proper learning algorithms, where we output a hypothesis from the same class? We already know sample-efficient algorithms, the trouble is the running time.

Gautam Kamath on Beyond Berry Esseen: Structure and Learning of Sums of Random Variables

Finally, we had Costis, who talked about sums of random variables. We would like to understand a class of distributions \mathcal{D} in three (highly related) ways:

  • Structure: What “simpler” class of distributions approximates \mathcal{D}?
  • Covering: Can we generate a small set of distributions S, such that for any distribution D \in \mathcal{D}, we have a distribution D^* \in S such that D^* is \varepsilon-close to D?
  • Learning: Given sample access to a distribution D \in \mathcal{D}, can we output a distribution D^* such that D^* is \varepsilon-close to D?

Understanding the structure often implies covering results, since we can usually enumerate the simpler class of distributions. And covering implies learning, since generic results allow us to find a “good” distribution from S at the cost of O(\log |S|) samples. It’s not always that easy, though – the details are often non-trivial, and sometimes we’d like to go beyond these generic results.
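That generic covering-to-learning step is typically a pairwise tournament. Here's a stripped-down Python sketch of the classical Scheffé-style selection rule (names and interface mine): for each pair of candidate pmfs, compare each one's mass on the set where it beats the other against the empirical mass of that set, and return the hypothesis that wins the most contests.

```python
def scheffe_select(hypotheses, samples):
    """Pick a hypothesis from a finite set via pairwise Scheffé contests.
    hypotheses: list of pmfs over {0, ..., n-1}; samples: list of draws."""
    m, n = len(samples), len(hypotheses[0])
    wins = [0] * len(hypotheses)
    for a in range(len(hypotheses)):
        for b in range(a + 1, len(hypotheses)):
            p, q = hypotheses[a], hypotheses[b]
            W = {x for x in range(n) if p[x] > q[x]}  # Scheffe set of (p, q)
            emp = sum(1 for s in samples if s in W) / m  # empirical mass of W
            # Whichever hypothesis's mass on W better matches the data wins.
            if abs(sum(p[x] for x in W) - emp) <= abs(sum(q[x] for x in W) - emp):
                wins[a] += 1
            else:
                wins[b] += 1
    return max(range(len(hypotheses)), key=lambda i: wins[i])
```

Since each contest only needs the empirical mass of one set to be accurate, a union bound over the |S|^2 pairs is where the O(\log |S|) sample cost comes from.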

Consider Poisson Binomial Distributions*, or PBDs for short. A PBD is the sum X = \sum_i X_i of n independent Bernoulli random variables, where X_i is \mathrm{Bernoulli}(p_i). This is like a Binomial distribution on steroids: the Binomial distribution is the special case when all p_i are equal.

PBDs pop up everywhere in math and computer science, so there’s a plethora of classical structural results – with a few catches. For example, check out a result by Le Cam, which approximates a PBD by a Poisson distribution with the same mean:

d_{TV}\left(\sum_i X_i, \mathrm{Poisson}\left(\sum_i p_i\right)\right) \leq \sum_i p_i^2
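Le Cam's bound is easy to sanity-check numerically. The Python sketch below (function names mine) builds the exact PBD pmf by convolving in one Bernoulli at a time, lines it up against the Poisson pmf with the same mean, and compares their total variation distance to \sum_i p_i^2.

```python
import math

def pbd_pmf(ps):
    """Exact pmf of a Poisson Binomial Distribution with parameters ps,
    built by convolving in one Bernoulli(p_i) at a time."""
    pmf = [1.0]
    for p in ps:
        new = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            new[k] += mass * (1 - p)   # X_i = 0
            new[k + 1] += mass * p     # X_i = 1
        pmf = new
    return pmf

def le_cam_check(ps, pad=25):
    """Return (d_TV(PBD, Poisson(sum p_i)), sum p_i^2)."""
    lam = sum(ps)
    # Pad the PBD with zeros so part of the Poisson tail is counted too.
    pbd = pbd_pmf(ps) + [0.0] * pad
    poi = [math.exp(-lam) * lam ** k / math.factorial(k)
           for k in range(len(pbd))]
    tv = 0.5 * sum(abs(a - b) for a, b in zip(pbd, poi))
    return tv, sum(p * p for p in ps)
```

For instance, with fifty coins of bias 0.01, the computed TV distance comes out comfortably below the bound of 50 \cdot (0.01)^2 = 0.005, matching Le Cam's promise for small p_i.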

Catch one: this structural result describes only a small fraction of PBDs – it gives a good approximation only when the p_i values are all small. Catch two: we’re approximating a PBD with a Poisson, a distribution from a different family – we’d ideally like a proper approximation, that is, to approximate the PBD with another PBD.

Recent results from our community show that you can get around both of these issues at once (woo, chalk one up for the computer scientists!). Structurally, we know the following: any PBD is \varepsilon-close to either a Binomial distribution or a shifted PBD with the parameter n replaced by O(1/\varepsilon^3) — a much smaller number of “coin flips”. While the original distribution had n different p_i parameters, the approximating distribution has only O(1/\varepsilon^3) parameters. Since n is usually way bigger than 1/\varepsilon, this is a huge reduction in complexity. By carefully enumerating the distributions in this structural result, we can get a small cover. Specifically, the size of the cover is polynomial in n and quasi-polynomial in 1/\varepsilon, while the naive cover is exponential in n. Now, using the generic reduction mentioned before, we can learn the distribution with \mathrm{polylog}(n) samples. Again though, let’s pretend that n is so big that it makes André the Giant feel insecure – can we take it out of the picture? If you’ve been paying any attention to my rhetorical style, you might predict the answer is yes (and you’d be right): O(1/\varepsilon^2) samples suffice for learning.

Is that the end of the story? No, we need to go deeper — let’s generalize PBDs once more: Sums of Independent Integer Random Variables (SIIRVs)**. A k-SIIRV is the sum X = \sum_i X_i of n independent random variables (k-IRVs), where each X_i \in \{0, 1, \dots, k - 1\}. Note that when k = 2, this is exactly a PBD. PBDs are actually quite tame, and have many nice properties – for example, they’re unimodal and log-concave. However, even a 3-SIIRV can be far from unimodal or log-concave. After some careful analysis and modular arithmetic, it turns out that a k-SIIRV with sufficiently large variance is well-approximated by cZ + Y, where c is some integer \in \{1, \dots, k-1\}, Z is a discretized Gaussian, Y is a c-IRV, and Y and Z are independent. On the other hand, if the variance is small, the distribution has sparse support – it can be approximated by a (k/\varepsilon)-IRV. This structural observation leads to a learning algorithm with sample complexity polynomial in k and 1/\varepsilon, but once again independent of n.


A particularly sinister 3-SIIRV, far from being 3-modal or log-concave. Image taken from Costis’ slides.
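A pmf with this sinister shape is easy to cook up yourself. The Python sketch below (my own toy construction, not from the talk) convolves 50 i.i.d. 3-IRVs taking values in \{0, 1, 2\} with almost all their mass on the even values; the resulting pmf oscillates between even and odd points near its mode, hinting at the cZ + Y structure with c = 2.

```python
def siirv_pmf(dists):
    """pmf of a sum of independent integer random variables;
    each element of dists maps a value to its probability."""
    pmf = {0: 1.0}
    for d in dists:
        new = {}
        for s, m in pmf.items():
            for v, q in d.items():
                new[s + v] = new.get(s + v, 0.0) + m * q
        pmf = new
    return pmf

# 50 i.i.d. 3-IRVs, heavily favoring the even values 0 and 2:
# the sum's pmf zig-zags between neighboring even and odd points.
spiky = siirv_pmf([{0: 0.495, 1: 0.01, 2: 0.495}] * 50)
```

Near the mean (50), the mass at even points visibly exceeds the mass at their odd neighbors, so the distribution is neither unimodal nor log-concave.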

Open problems

  • Current results for PBDs are quasi-polynomial in 1/\varepsilon, for both the cover size and the running time for the proper learning result. Can we make this polynomial?
  • For k-SIIRVs, our current understanding is fundamentally improper. Can we obtain proper covers or learning algorithms for this class?
  • What about higher dimensions? We have some preliminary results on the multidimensional generalization of PBDs (joint work with Costis and Christos Tzamos), but we’re totally in the dark when it comes to generalizing k-SIIRVs.


*For maximum confusion, Poisson Binomial Distributions have nothing to do with Poisson Distributions, except that they were both named after our good friend Siméon.
**For another perspective on the SIIRV result, see this post by my partner in crime, Clément.

