After a long break since the last time we touched upon distribution testing, we are back to discuss two papers that appeared this month on the arXiv — and hopefully some of their implications.
- [ADK15] Optimal testing for properties of distributions.
- [CDRG15] Testing Shape Restrictions of Discrete Distributions.
As a warmup, recall that in distribution testing (“property testing of probability distributions,” but shorter), one is given access to independent samples from an unknown distribution over some fixed domain (say, the set of integers , and wishes to learn one bit of information about :
Does have the property I’m interested in, or is it far from every distribution that has this property?
In particular, the question is how to do things more efficiently (in terms of the number of samples required) than by fully learning the distribution (which would “cost” samples): surely, one bit of information is not that expensive, is it? Initiated by the seminal work of Batu, Fortnow, Rubinfeld, Smith, and White [BFRSW01], the field of distribution testing has since then aimed at answering that question by a resounding “No, it isn’t! Mostly. Sort of?” (And has had a fair amount of success in doing so.)
For more on property testing and distribution testing in general, we now take the liberty to refer the interested reader to [Ron08] and [Rub12,Can15] respectively.
Testing Membership: why, how, err… what?
Let be a property of distributions, which is a fancy way of saying a subset or class of probability distributions. Think for instance of as “the single uniform distribution on elements,” or “the class of all Binomial distributions.”
We mentioned above that the goal was, given a distance parameter as well as sample access to an arbitrary, unknown distribution over , to determine the (asymptotic) minimum number of samples needed from to distinguish with high probability between the following two cases. (a) (it has the property); versus (b) ( is at distance at least from every distribution that has the property).
The metric here is the distance, or equivalently the total variation distance. Now, many results, spanning the last decade, have culminated in pinpointing the exact sample complexity for various properties, such as (being the uniform distribution), or (being equal to a fixed distribution, specified in advance), or even (being a Poisson Binomial Distribution, a generalization of the Binomial distributions). There has also been work in testing if a joint distribution is a product distribution (“independence”), or if a distribution has a monotone probability mass function.
Note that in the above, two trends appear: the first is consider a property that is a singleton, that is to test if the distribution is equal to something fixed in advance; while the second looks at a bigger set of distributions, typically a “class” of distributions that exhibits some structure; and amounts to testing if belongs to this class. (To be consistent with this change of terminology, we switch from now on from — property — to — class.)
This latter type of result has many applications, e.g. for hypothesis testing, or model selection (“Is my statistical model completely off if I assume it’s Gaussian?”, or “I know how to deal with histograms. Histograms are nice. Can we say it’s a histogram?”). But for many classes of interest, the optimal sample complexity (as a function of and ) remained either completely open, or some gap subsisted between the known upper and lower bounds. (For instance, the cases of monotone distributions and histograms was addressed in [BKR04] and [ILR12], but the results left room for improvement.)
Up until now, roughly. Two very recent papers tackle this question, and simultaneously solve it for many such distribution classes at once. The first, [CDGR15], gives a generic “meta-algorithm” that does ‘quite well’ for a whole range of classes: that is, a one-fits-all tester which has nearly-tight sample complexity (with regard to ). Their main theorem has the following flavor:
Let be a class of distributions that all exhibit some nice structure. If it can be tested, there is a good chance this algorithm can — maybe not optimally, but close enough.
The second, [ADK15] describes and instantiates a general method of “testing-by-learning” for distributions, which yields optimal sample complexity for several of these classes (with respect to both and , for a wide range of parameters). Interestingly, any such testing-by-learning approach was, until today, thought to be a dead-end for distribution testing — strong known impossibility results seemed to doom it. By coming up with an elegant twist, [ADK15] shows that, well, impossibility can only doom so much. Roughly and not quite accurately speaking, here is what they manage to show:
Let be a class of distributions which all exhibit some nice structure. If one can efficiently -learn it, then one can optimally test it.
A Unified Approach to Things
One Tester to Rule Them All
At the root of the generic tester of [CDGR15] is the observation that the testing algorithm of [BKR04], specifically tailored, built, and analyzed for testing whether a distribution is monotone, actually need not be. Namely, one can abstract and generalize the core idea of Batu, Kumar, and Rubinfeld; and by modifying it carefully obtain an algorithm that can apply to any class that satisfies a structural assumption:
Definition 1. (Structural Criterion) Any distribution in is close (in a specific strong sense) to some piecewise-constant distribution on ‘pieces’, where is small (say ).
By “close,” here we mean that on each of the “pieces” either the distribution puts very little probability weight, or stays within an factor of the piecewise-constant distribution . Moreover quite crucially one does not need to be able to compute this decomposition in pieces explicitly: it is sufficient to prove existence, for each , of a good decomposition.
Theorem 2. ([CDGR15, Main Theorem]) There exists an algorithm which, given sampling access to an unknown distribution over and parameter , can distinguish with probability 2/3 between (a) versus (b) , for any class that satisfies the above structural criterion. Moreover, for many such properties this algorithm is computationally efficient, and its sample complexity is optimal (up to logarithmic factors and the exact dependence on ).
As it turns out, this assumption is satisfied by many interesting classes of distributions: monotone, -modal, log-concave, monotone hazard rate, Poisson Binomials, histograms, to cite only few… moreover, it’s easy to see that any mixture of distributions from these “structural” classes will also satisfy the above criterion (for a corresponding value of ).
To apply the tester to some class you may fancy testing, it then only remains to find the best possible for the structural criterion (i.e., to prove an existential result for the class, independent on any algorithmic aspect), and plug this into Theorem 2. This will immediately yield an upper bound — maybe not optimal, but maybe not far from it.
For an intuition of how the algorithm works, we paraphrase the authors’ description:
The algorithm proceeds in 3 stages:
- the decomposition step attempts to recursively construct a partition of the domain in a small number of intervals [that is, ], with a very strong guarantee [the distribution is almost uniform on each piece]. If this succeeds, then the unknown distribution will be close (in ) to its “flattening” on the partition; while if it fails (too many intervals have to be created), this serves as evidence that and we can reject.
- The second stage, the approximation step, learns this — which can be done with few samples since by construction we do not have many intervals.
- The last stage is purely computational, the projection step: we verify that is indeed close to .
If all three stages succeed, then by the triangle inequality is close to and by the structural assumption on the class, if then it will admit succinct enough partitions, and all three stages will go through.
Upper bounds are fine, but lower bounds?
As a counterpart to generically proving testing is not too hard, it would be nice to have an equally generic way of establishing it is actually not very easy either… Canonne, Diakonikolas, Gouleakis, and Rubinfeld also tackle this question, and provide a general “testing-by-narrowing” reduction that enables to do just that. We won’t dwelve into the details here (the reader is encouraged to read — or skim — the paper to learn more, including slightly more general statements and extensions to tolerant testing), but roughly speaking the theorem goes as follows:
Theorem 3. ([CDGR15, Lower Bound Theorem]) Let be a class of distributions that can be efficiently (agnostically) learned. Then, testing membership to is at least as hard as testing identity to any fixed distribution in .
While this may seem intuitive, there is actually no obvious reason why it should hold — and it actually fails if one removes the “efficiently learnable” assumption. (As an extreme case, consider the class of all distributions (which is hard to learn). Testing if a distribution is in there can trivially be done with no sample at all, yet there are elements in for which identity testing requires samples.)
As examples, [CDGR15] then instantiates Theorem 3 to show lower bounds for all the classes considered — choosing a suitably right “hard instance” in each case.
So… we’re done?
Well, not quite. As said above, this paper provides “nearly-tight” results for many classes, basically a Swiss army knife for class testing. Depending on the situation, it may be more than enough; but sometimes more fine-grained tools are needed… what about an optimal testing algorithm for these classes? What about getting the sample complexity right? And what about testing in higher dimensions?
What about all of these? There is a next section, you know.
Testing by Learning: “Yes, we can!”
A naive first approach
As mentioned before, learning an arbitrary distribution on is expensive, requiring samples. However, if we assume the distribution belongs to some class , we may be able to do better. In particular, for many classes , we can efficiently solve the proper learning problem, in which we output a near distribution in the same class. Formally stated:
Given sample access to , output such that .
This leads to the following naive algorithm for testing. First, regardless of whether or not is actually in the class, attempt to properly learn it up to accuracy . If is in the class, then we will learn a distribution which is actually -close. If it is not, since , by assumption, and are -far. Thus, we have reduced to the following testing problem:
Given sample access to and a description of a distribution , identify whether they are -close in -distance or -far in -distance.
Unfortunately, this is where we hit a brick wall: Valiant and Valiant showed that this problem requires samples, even for the simplest distribution possible, the uniform distribution [VV10]. All this work, only to achieve a barely sublinear tester?
The solution: relax
It turns out that one can circumvent this lower bound by considering a relaxation of the previous problem. For this purpose, we shall use the -distance (which can be seen as a pointwise nonuniform rescaling of ): . The main feature of this distance we require is that . As such, we can now consider the following (easier) testing problem:
Given sample access to and a description of a distribution , identify whether they are -close in -distance or -far in -distance.
Now, the upshot: surprisingly, this new problem can be solved using samples, significantly bypassing the previous lower bound. The actual test is a slight modification of the -test by Pearson, a statistician so far ahead of his time that he casually introduced techniques which lead to optimal algorithms for major problems over a century later (and this isn’t the first time, either).
The fruits of our labor
Given this powerful primitive, we have a framework, which is a slight modification of the naive approach described above. Instead of: proper learn, then test; we proper learn in -distance, then test:
1. Use samples to obtain an explicit s.t.
– if , small
– if , large
2. Perform the -close vs -far test.
The learning problem is now slightly harder than before, but still requires samples for a large number of classes in the parameter regime of interest. As a result, this leads to optimal algorithms for testing many classes, including monotonicity, unimodality, log-concavity, and monotone hazard rate.
The story doesn’t end here — this framework also applies for testing multivariate discrete distributions. While the previous literature on testing one-dimensional distributions was quite mature (i.e., for the previously studied classes, we sort of “knew” the right answer to be , up to log factors and the precise dependence on epsilon), fairly little was known about higher dimensional testing problems. This work manages to give optimal algorithms for testing monotonicity and independence of marginals over the -dimensional hypergrid. In particular, for monotonicity, it gives a quadratic improvement in the sample complexity, reducing it from to the optimal .
The hope is that these frameworks may find applications for many other problems, in distribution testing and beyond. The world is yours…
[ADK15] Optimal testing for properties of distributions. J. Acharya, C. Daskalakis, and G. Kamath. arXiv, 2015.
[BFRSW01] Testing that distributions are close. T. Batu, L. Fortnow, R. Rubinfeld, W. D. Smith, and P. White, FOCS’01.
[BKR04] Sublinear algorithms for testing monotone and unimodal distributions. T. Batu, R. Kumar, and R. Rubinfeld, STOC’04.
[Can15] A Survey on Distribution Testing: Your Data is Big. But is it Blue?. C. Canonne. ECCC, 2015.
[CDGR15] Testing Shape Restrictions of Discrete Distributions. C. Canonne, I. Diakonikolas, T. Gouleakis, and R, Rubinfeld. arXiv, 2015.
[ILR12] Approximating and Testing -Histogram Distributions in Sub-linear Time. P. Indyk, R. Levi, and R. Rubinfeld, PODS’12.
[Ron08] Property Testing: A Learning Theory Perspective. D. Ron. FTML, 2008.
[Rub12] Taming Big Probability Distributions. R. Rubinfeld. XRDS, 2012.
[VV10] A CLT and tight lower bounds for estimating entropy. G. Valiant and P. Valiant. ECCC, 2010.