Efficient Distribution Estimation – STOC 2014 Recaps (Part 2)

The STOC 2014 saga continues (check out Part 1 here), with Clément Canonne and Gautam “G” Kamath covering the day-long workshop on Efficient Distribution Estimation. There’s a wealth of interesting stuff from the workshop, so we’ve broken it up into episodes. Enjoy this one, and stay tuned for the next installment!

Clément Canonne on Taming distributions over big domains

Talk slides here

Before STOC officially began, before the talks and the awe and the New Yorker Hotel and all the neat and cool results the reviews on this blog will try to give an idea and a sense of, before even this very long and cumbersome and almost impossible — almost, only? to understand sentence, before all this came Saturday.

Three simultaneous sessions, all three very tempting: one covering “Recent Advances on the Lovasz Local Lemma”, one giving tools and insight on how to “[overcome] the intractability bottleneck in unsupervised machine learning”, and finally one — yay! — on “Efficient Distribution Estimation“. I will hereby focus on the third one, organized by Ilias Diakonikolas and Gregory Valiant, as it is the one I attended, and in particular 3 of the 6 talks of the workshop; G (Gautam) will review the other ones, I bet with shorter sentences and more relevance.

It all started with Ronitt Rubinfeld, from MIT and Tel Aviv University, describing the flavour of problems this line of work was dealing with, what exactly were the goals, and why should one care about this. Imagine we have samples from a space of data — a Big space of Data; whether it be records from car sales, or the past ten years of results for the New Jersey lottery, or even logs of network traffic; and we want to know something about the underlying distribution: has the trend in cars changed recently? Is the lottery fixed? Is someone doing unusual requests to our server?

How can we tell?

Let us rephrase slightly the problem: there exists a probability distribution D on a set of n elements [n]=\{1,\dots,n\}; given i.i.d. (independent, identically distributed) samples from D, can we decide with as few of these samples as possible if, for instance,

  • D is uniform, or “far”* from it?
  • D is equal to a known, fully specified distribution (say, the distribution of car sales from 10 years ago), or far from it?
  • has D big entropy, or not?

Of course, there exist many tools in statistics for this type of tasks: the Chi-Square test, for instance. Yet, they usually have a fatal flaw: namely, to give good guarantees they require a number of samples at least linear in n. And n is big. Like, huge. (and anyway, one could always actually learn D to high accuracy with O(n) samples) No, the question is really can one do anything interesting with o(n) samples?

As Ronitt explained: yes, one can. Testing uniformity can be done with (only!) O(\sqrt{n}) samples**; one can test identity to a known distribution with O(\sqrt{n}) as well; and testing whether two unknown distributions are equal or far from each other only requires O(n^{2/3}) samples from each. All these results (see e.g. [1] for a survey of this huge body of research) are quite recent; and come from works in TCS spanning the last 15 years or so.

“And they saw they could test distributions in o(n) samples; and they saw that this was good, and they also saw that most of these results were tight”. But as always, the same question arose: can we do better? This may sound strange: if the sample complexities are tight, how could we even hope to do better? If there is a lower bound, how can we even ask to beat it?

This was the second part of Ronitt’s talk: bypassing impossibility results. She presented two different approaches, focusing on the second:

  • “if you can’t beat them… change what ‘them’ means”. Indeed, getting better algorithms for general, arbitrary distributions may be impossible; but in practice, things are rarely arbitrary. What if we knew something more about this unknown D — for instance, it is a mixture of simple distributions? Or it has a particular “shape” (e.g., its probability mass function is monotone)? Or it is obtained by summing many “nice” and simple random variables? Under these assumptions, can the structure of D leveraged to get better, more sample efficients algorithms? (in short: yes. See for instance [2], where the sample complexity, from n^c in general, becomes \mathrm{poly}(\log n))
  • “if playing by the rules is too hard, change the rules”. All the results and lower bound above suppose the algorithms have (only) access to i.i.d. samples from the distribution D. But in many cases, they could get more: for instance, there are scenarii where a tester can focus on some subset of the domain, and only get samples from it (think about “stratified sampling”, or just the situation where a pollster targets in particular voters between age 25 and 46). Given this new type of access to the “conditional distributions” induced by D, can algorithms be more efficient? And if the data comes from a known, preprocessed (even sorted!) dataset, as it is the case for example with Google and the n-grams (frequencies and statistics of words in a vast body of English literature), the testing algorithm can actually ask the exact frequency (equivalently, the probability value) of any given element of its choosing, in addition to just sampling a random word! This type of access, with additionial or modified types of query to D, has been studied for instance in [3,4,5,6]; and, in a nutshell, the answer is positive.

Yes, Virginia, there is a Santa Claus. And he has constant sample complexity!

Open problems

  • Study more types of structures or assumptions on the distribution: namely, what type of natural assumptions on D can we get, for which the testing algorithms get a significant speedup? (in terms of sample complexity)
  • What is the relation between the new type of access? Are they related? Can one simulate the other? What other natural sorts of access can one consider?
  • What about learning with these new types of access? (so far, we were interested in testing whether D had some property of interest; but — maybe — learning the distribution also became easier?)

Such excitement — and this was only the beginning!


* Far in total variation (or statistical) distance: this is essentially (up to a factor 2) the \ell_1 distance between two probability distributions.
** and this is tight.

(this bibliography is far from comprehensive; if any reader wants additional pointers, I’ll be happy to provide more!)

[1] Taming big probability distributions, R. Rubinfeld (2012).
[2] Testing k-Modal Distributions: Optimal Algorithms via Reductions, Daskalakis, Diakonikolas, Servedio, Valiant, Valiant (2011).
[3] On the Power of Conditional Samples in Distribution Testing, Chakraborty, Fischer, Goldhirsh, Matsliah (2012).
[4] Testing probability distributions using conditional samples, Canonne, Ron, Servedio (2012).
[5] Streaming and Sublinear Approximation of Entropy and Information Distances, Guha, McGregor, Venkatasubramanian (2005).
[6] Testing probability distributions underlying aggregated data, Canonne, Rubinfeld (2014).

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s