Gautam Kamath on Distributional Property Testing and Estimation: Past, Present, and Future
After lunch, we were lucky enough to witness back-to-back talks by the Valiant brothers, Greg and Paul. If you aren’t familiar with their work, first of all, how have you been dealing with all of your everyday property testing needs? But second of all, over the past few years, they’ve touched a number of classic statistical problems and left most of them with matching optimal upper and lower bounds.
Greg kicked off the show and focused on property estimation. Given a property of interest and access to independent draws from a fixed distribution over , how many draws are necessary to estimate the property accurately?
Of course, this is an incredibly broad problem. For example, properties of a distribution which could be of interest include the mean, the norm, or the probability that the samples will spell out tomorrow’s winning lottery numbers. In order to tame this mighty set of problems, we focus on a class of properties which are slightly easier to understand — symmetric properties. These are properties which are invariant to permuting the labels on the support — for instance, if we decided to say 3 was now called 12 and vice-versa, the property wouldn’t change. Notable examples of symmetric properties include the support size*, entropy, and distance between two distributions. Still a broad class of properties, but surprisingly, we can make statements about all of them at once.
What’s the secret? For symmetric properties, it turns out that the fingerprint of a set of samples contains all the relevant information. The fingerprint of a set of samples is a vector whose th component is the number of elements in the domain which occur exactly times in . The CliffNotes version of the rest of this post is that using linear programming on the fingerprint gives us an unreasonable amount of power**.
First off, Greg talked about a surprising dichotomy which demonstrates the power of linear estimators. A linear estimator of a property is an estimator which outputs a linear combination of the fingerprint. Given a property , a number of samples , and a distance , exactly one of the following holds:
- There exist distributions and such that but no estimator (linear or not) that uses only samples can distinguish the two with probability .
- There exists a linear estimator which requires samples and estimates the property to within with probability .
In other words, if there exists an estimator for a property, there is a linear estimator which is almost as good. Even better, their result is constructive, for both cases! A linear program is used to construct the estimator, where the variables are the estimator coefficients. Roughly, this linear program attempts to minimize the bias of the estimator while keeping the variance relatively small. On the other hand, when we take the dual of this linear program, we construct two distributions which have similar fingerprint expectations, but differ in their value for the property — we construct an explicit counter-example to any estimator which claims to use samples and estimate the property to accuracy .
But here’s the $64 question – do these estimators work in practice? Sadly, the answer is no, which actually isn’t that surprising here. The estimator is defined in terms of the worst-case instance for each property — in other words, it’s oblivious to the particular distribution it receives samples from, which can be wasteful in many cases.
Let’s approach the problem again. What if we took the fingerprint of our samples and computed the property for the empirical distribution? This actually works great – the empirical distribution optimally estimates the portion of the distribution which we have seen. The downside is that we have to see the entire distribution, which takes samples. If we want to break past this linear barrier, we must estimate the unseen portion. It turns out that, once again, linear programming saves the day. We solve a program which describes all distributions whose expected fingerprint is close to the observed fingerprint. It can be shown that the diameter of this set is small, meaning that any distribution in this space will approximate the target distribution for all symmetric properties at the same time! Furthermore, this estimator takes only samples, which turns out to be optimal.
Again, we ask the same question – do these estimators work in practice? This time, the answer is yes! While performance plots are often conspicuously missing from theory papers, Greg was happy to compare their results to estimators used in practice — their approach outperformed the benchmarks in almost all instances and regimes.
Finally, Greg mentioned a very recent result on instance-by-instance optimal identity testing. Recall that in identity testing, given a known distribution and an unknown distribution , we wish to test if they are equal or -far (in total variation distance). As mentioned before, when is uniform, are necessary and sufficient for this problem. But what about when is , or when assigns a probability to which is proportional to the th digit of — do we have to design an algorithm by scratch for every ?
Thankfully, the answer is no. The Valiants provide an algorithm which is optimal up to constant factors for all but the most pathological instances of . Somewhat unexpectedly, the optimal number of samples turns out to be – aside from the fact it matches the uniform case, it’s not obvious why the -norm is the “correct” norm here. Their estimator is reasonably simple – it involves a small tweak to Pearson’s classical -squared test.
Clément Canonne on Lower Bound Techniques for Statistical Estimation Tasks
At this point, Greg tagged out, allowing Paul to give us a primer in “Lower Bound Techniques for Statistical Estimation Tasks”: roughly speaking, the other side of Greg’s coin. Indeed, all the techniques used to derive upper bounds (general, systematic testers for symmetric properties) can also been turned into their evil counterparts: how to show no algorithm supposed to test a given property can be “too efficient”.
The dish is very different, yet the ingredients similar: without going into too much details, let’s sketch the recipe. In the following, keep in mind that we will mostly cook against symmetric properties, that is properties invariant by relabeling of the domain elements: if , then for any permutation of one has .
- elements do not matter: all information any tester can use can be found in the fingerprint of the samples it obtained. In other terms, we can completely forget the actual samples for the sake of our lower bound, and only prove things about their fingerprint, fingerprint of fingerprint, and so on.
- Poissonization is our friend: if we take samples from a distribution, then the fingerprint, a tuple of integer random variables , lacks a very nice feature: the ‘s are not independent (they do kind of have to sum to , for instance). But if instead of this, we were to take samples, then the individual components of the fingerprint will be independent! Even better, many nice properties of the Poisson distribution will apply: the random variable will be tightly concentrated around its mean , for instance. (this applies whether the property is symmetric or not. When in doubt, Poissonize. Independence!)
- Central Limit Theorems: the title ain’t witty here, but the technique is. As a very high-level, imagine we get to random variables corresponding to the “transcript” of what a tester/algorithm sees from taking samples from two possible distributions it should distinguish. We would like to argue that the distributions of and are close, so close (in a well-defined sense) that it is information-theoretically impossible to do so. This can be very difficult; a general idea would be to prove that there exists another distribution , depending on very few parameters — think of a Gaussian: only the mean and (co)variance (matrix) — such that both and are close to . Then, by the triangle inequality, they must be close, and we are done.Sweeping a lot of details under the rug, what Paul presented is this very elegant idea of proving new Central Limit Theorems that exactly give that: theorems that state that asymptotically (with quantitative bounds on the convergence rate), some class of distributions of interest will “behave like” a multivariate Gaussian or Multinomial with the “right parameters” — where “behave like” refers to a convient metric on distributions (e.g. Earthmover or Total Variation), and “right parameters” means “as few as possible”, in order to keep as many degrees of freedom when designing the distributions of and ).Proving such CLT’s is no easy task, but — on the bright side — they have many applications, even outside the field of distribution testing.
There would be much, much more to cover here, but I’m past 600 words already, and was only allowed .
*In order to avoid pathological cases for estimating support size, we assume none of the probability are in the range .
**At least, when channeled by sufficiently talented theoreticians.
 Estimating the unseen: A sublinear-sample canonical estimator of distributions, Valiant, Valiant (2010).
 A CLT and tight lower bounds for estimating entropy, Valiant, Valiant (2010).
 Estimating the Unseen: An n/log(n)-sample Estimator for Entropy and Support Size, Shown Optimal via New CLTs, Valiant, Valiant (2011).
 Estimating the Unseen: Improved Estimators for Entropy and other Properties, Valiant, Valiant (2013).
 An Automatic Inequality Prover and Instance Optimal Identity Testing, Valiant, Valiant (2014).
 Testing Symmetric Properties of Distributions, Valiant (2011).
See also, Greg’s TCS+ talk, which covers [1,2,3,4].