# Distribution Testing: Do It With Class!

After a long break since the last time we touched upon distribution testing, we are back to discuss two papers that appeared this month on the arXiv — and hopefully some of their implications.

• [ADK15] Optimal testing for properties of distributions.
• [CDRG15] Testing Shape Restrictions of Discrete Distributions.

As a warmup, recall that in distribution testing (“property testing of probability distributions,” but shorter), one is given access to independent samples from an unknown distribution $p$ over some fixed domain (say, the set of integers $[n] = \{1,\dots,n\}$, and wishes to learn one bit of information about $p$:

Does $p$ have the property I’m interested in, or is it far from every distribution that has this property?

In particular, the question is how to do things more efficiently (in terms of the number of samples required) than by fully learning the distribution $p$ (which would “cost” $\Theta(n)$ samples): surely, one bit of information is not that expensive, is it? Initiated by the seminal work of Batu, Fortnow, Rubinfeld, Smith, and White [BFRSW01], the field of distribution testing has since then aimed at answering that question by a resounding “No, it isn’t! Mostly. Sort of?” (And has had a fair amount of success in doing so.)

For more on property testing and distribution testing in general, we now take the liberty to refer the interested reader to [Ron08] and [Rub12,Can15] respectively.

# Testing Membership: why, how, err… what?

Let $\mathcal{P}$ be a property of distributions, which is a fancy way of saying a subset or class of probability distributions. Think for instance of $\mathcal{P}$ as “the single uniform distribution on $[n]$ elements,” or “the class of all Binomial distributions.”

We mentioned above that the goal was, given a distance parameter $\varepsilon$ as well as sample access to an arbitrary, unknown distribution $p$ over $[n]$, to determine the (asymptotic) minimum number of samples needed from $p$ to distinguish with high probability between the following two cases. (a) $p\in\mathcal{P}$ (it has the property); versus (b) $\ell_1(p,\mathcal{P})\geq \varepsilon$ ($p$ is at distance at least $\varepsilon$ from every distribution that has the property).

The metric here is the $\ell_1$ distance, or equivalently the total variation distance. Now, many results, spanning the last decade, have culminated in pinpointing the exact sample complexity for various properties, such as $\{\mathcal{U}_n\}$ (being the uniform distribution), or $\{p^\ast\}$ (being equal to a fixed distribution, specified in advance), or even $\mathcal{PBD}$ (being a Poisson Binomial Distribution, a generalization of the Binomial distributions). There has also been work in testing if a joint distribution is a product distribution (“independence”), or if a distribution has a monotone probability mass function.

Note that in the above, two trends appear: the first is consider a property that is a singleton, that is to test if the distribution $p$ is equal to something fixed in advance; while the second looks at a bigger set of distributions, typically a “class” of distributions that exhibits some structure; and amounts to testing if $p$ belongs to this class. (To be consistent with this change of terminology, we switch from now on from $\mathcal{P}$ — property — to $\mathcal{C}$ — class.)

“We’re Binomials. Are you one of us?”

This latter type of result has many applications, e.g. for hypothesis testing, or model selection (“Is my statistical model completely off if I assume it’s Gaussian?”, or “I know how to deal with histograms. Histograms are nice. Can we say it’s a histogram?”). But for many classes of interest, the optimal sample complexity (as a function of $n$ and $\varepsilon$) remained either completely open, or some gap subsisted between the known upper and lower bounds. (For instance, the cases of monotone distributions and histograms was addressed in [BKR04] and [ILR12], but the results left room for improvement.)

Up until now, roughly. Two very recent papers tackle this question, and simultaneously solve it for many such distribution classes at once. The first, [CDGR15], gives a generic “meta-algorithm” that does ‘quite well’ for a whole range of classes: that is, a one-fits-all tester which has nearly-tight sample complexity (with regard to $n$). Their main theorem has the following flavor:

Let $\mathcal{C}$ be a class of distributions that all exhibit some nice structure. If it can be tested, there is a good chance this algorithm can — maybe not optimally, but close enough.

The second, [ADK15] describes and instantiates a general method of “testing-by-learning” for distributions, which yields optimal sample complexity for several of these classes (with respect to both $n$ and $\varepsilon$, for a wide range of parameters). Interestingly, any such testing-by-learning approach was, until today, thought to be a dead-end for distribution testing — strong known impossibility results seemed to doom it. By coming up with an elegant twist, [ADK15] shows that, well, impossibility can only doom so much. Roughly and not quite accurately speaking, here is what they manage to show:

Let $\mathcal{C}$ be a class of distributions which all exhibit some nice structure. If one can efficiently $\chi^2$-learn it, then one can optimally test it.

# A Unified Approach to Things

### One Tester to Rule Them All

At the root of the generic tester of [CDGR15] is the observation that the testing algorithm of [BKR04], specifically tailored, built, and analyzed for testing whether a distribution is monotone, actually need not be. Namely, one can abstract and generalize the core idea of Batu, Kumar, and Rubinfeld; and by modifying it carefully obtain an algorithm that can apply to any class $\mathcal{C}$ that satisfies a structural assumption:

Definition 1. (Structural Criterion) Any distribution in $\mathcal{C}$ is close (in a specific strong sense) to some piecewise-constant distribution on $L=L(\mathcal{C})$ ‘pieces’, where $L$ is small (say $\mathrm{poly}\log n$).

By “close,” here we mean that on each of the $L$ “pieces” either the distribution $p$ puts very little probability weight, or stays within an $(1\pm \varepsilon)$ factor of the piecewise-constant distribution $\bar{q} =\bar{q}(p)$. Moreover quite crucially one does not need to be able to compute this decomposition in pieces explicitly: it is sufficient to prove existence, for each $p\in\mathcal{C}$, of a good decomposition.

Theorem 2. ([CDGR15, Main Theorem]) There exists an algorithm which, given sampling access to an unknown distribution $p$ over $[n]$ and parameter $\varepsilon\in(0,1]$, can distinguish with probability 2/3 between (a) $p\in\mathcal{C}$ versus (b) $\ell_1(p, \mathcal{C}) > \varepsilon$, for any class $\mathcal{C}$ that satisfies the above structural criterion. Moreover, for many such properties this algorithm is computationally efficient, and its sample complexity is optimal (up to logarithmic factors and the exact dependence on $\varepsilon$).

As it turns out, this assumption is satisfied by many interesting classes of distributions: monotone, $t$-modal, log-concave, monotone hazard rate, Poisson Binomials, histograms, to cite only few… moreover, it’s easy to see that any mixture of distributions from these “structural” classes will also satisfy the above criterion (for a corresponding value of $L$).

To apply the tester to some class $\mathcal{C}$ you may fancy testing, it then only remains to find the best $L=L(n,\varepsilon)$ possible for the structural criterion (i.e., to prove an existential result for the class, independent on any algorithmic aspect), and plug this into Theorem 2. This will immediately yield an upper bound — maybe not optimal, but maybe not far from it.

For an intuition of how the algorithm works, we paraphrase the authors’ description:

The algorithm proceeds in 3 stages:

1. the decomposition step attempts to recursively construct a partition of the domain in a small number of intervals [that is, $L(\mathcal{C})$], with a very strong guarantee [the distribution is almost uniform on each piece]. If this succeeds, then the unknown distribution $p$ will be close (in $\ell_1$) to its “flattening” $q$ on the partition; while if it fails (too many intervals have to be created), this serves as evidence that $p\notin\mathcal{C}$ and we can reject.
2. The second stage, the approximation step, learns this $q$ — which can be done with few samples since by construction we do not have many intervals.
3. The last stage is purely computational, the projection step: we verify that $q$ is indeed close to $\mathcal{C}$.

If all three stages succeed, then by the triangle inequality $p$ is close to $\mathcal{C};$ and by the structural assumption on the class, if $p\in\mathcal{C}$ then it will admit succinct enough partitions, and all three stages will go through.

### Upper bounds are fine, but lower bounds?

As a counterpart to generically proving testing is not too hard, it would be nice to have an equally generic way of establishing it is actually not very easy either… Canonne, Diakonikolas, Gouleakis, and Rubinfeld also tackle this question, and provide a general “testing-by-narrowing” reduction that enables to do just that. We won’t dwelve into the details here (the reader is encouraged to read — or skim — the paper to learn more, including slightly more general statements and extensions to tolerant testing), but roughly speaking the theorem goes as follows:

Theorem 3. ([CDGR15, Lower Bound Theorem]) Let  $\mathcal{C}$ be a class of distributions that can be efficiently (agnostically) learned. Then,  testing membership to $\mathcal{C}$ is at least as hard as testing identity to any fixed distribution in $\mathcal{C}$.

While this may seem intuitive, there is actually no obvious reason why it should hold — and it actually fails if one removes the “efficiently learnable” assumption. (As an extreme case, consider the class $\mathcal{C}$ of all distributions (which is hard to learn). Testing if a distribution is in there can trivially be done with no sample at all, yet there are elements in $\mathcal{C}$ for which identity testing requires $\Omega(\sqrt{n})$ samples.)

As examples, [CDGR15] then instantiates Theorem 3 to show lower bounds for all the classes considered — choosing a suitably right “hard instance” in each case.

### So… we’re done?

Well, not quite. As said above, this paper provides “nearly-tight” results for many classes, basically a Swiss army knife for class testing. Depending on the situation, it may be more than enough; but sometimes more fine-grained tools are needed…  what about an optimal testing algorithm for these classes? What about getting the sample complexity right? And what about testing in higher dimensions?

What about all of these? There is a next section, you know.

# Testing by Learning: “Yes, we can!”

### A naive first approach

As mentioned before, learning an arbitrary distribution on $[n]$ is expensive, requiring $\Theta(n)$ samples. However, if we assume the distribution belongs to some class $\mathcal{C}$, we may be able to do better. In particular, for many classes $\mathcal{C}$, we can efficiently solve the proper learning problem, in which we output a near distribution in the same class. Formally stated:

Given sample access to $p \in \mathcal{C}$, output $q \in \mathcal{C}$ such that $\ell_1(p,q) \leq \varepsilon$.

This leads to the following naive algorithm for testing. First, regardless of whether or not $p$ is actually in the class, attempt to properly learn it up to accuracy $\varepsilon/2$. If $p$ is in the class, then we will learn a distribution $q$ which is actually $\varepsilon/2$-close. If it is not, since $q \in \mathcal{C}$, by assumption, $p$ and $q$ are $\varepsilon$-far. Thus, we have reduced to the following testing problem:

Given sample access to $p$ and a description of a distribution $q$, identify whether they are $\varepsilon/2$-close in $\ell_1$-distance or $\varepsilon$-far in $\ell_1$-distance.

Unfortunately, this is where we hit a brick wall: Valiant and Valiant showed that this problem requires $\Omega(n/\log n)$ samples, even for the simplest distribution $q$ possible, the uniform distribution [VV10]. All this work, only to achieve a barely sublinear tester?

### The solution: relax

It turns out that one can circumvent this lower bound by considering a relaxation of the previous problem. For this purpose, we shall use the $\chi^2$-distance (which can be seen as a pointwise nonuniform rescaling of $\ell_2^2$): $\chi^2(p,q) = \sum_{i=1}^n \frac{(p(i)-q(i))^2}{q(i)}$. The main feature of this distance we require is that $\chi^2(p,q) \leq \ell_1^2(p,q)$. As such, we can now consider the following (easier) testing problem:

Given sample access to $p$ and a description of a distribution $q$, identify whether they are $\varepsilon^2/2$-close in $\chi^2$-distance or $\varepsilon$-far in $\ell_1$-distance.

Now, the upshot: surprisingly, this new problem can be solved using $O(\sqrt{n}/\varepsilon^2)$ samples, significantly bypassing the previous $\Omega(n/\log n)$ lower bound. The actual test is a slight modification of the $\chi^2$-test by Pearson, a statistician so far ahead of his time that he casually introduced techniques which lead to optimal algorithms for major problems over a century later (and this isn’t the first time, either).

Another brick off the wall!

### The fruits of our labor

Given this powerful primitive, we have a framework, which is a slight modification of the naive approach described above. Instead of: proper learn, then test; we proper learn in $\chi^2$-distance, then test:

1. Use samples to obtain an explicit $q$ s.t.
– if $p\in \mathcal{C}$, $\chi^2(p,q)$ small
– if $p\notin \mathcal{C}$, $\ell_1(p,q)$ large
2. Perform the $\chi^2$-close vs $\ell_1$-far test.

The learning problem is now slightly harder than before, but still requires $o(\sqrt{n}/\varepsilon^2)$ samples for a large number of classes in the parameter regime of interest. As a result, this leads to optimal algorithms for testing many classes, including monotonicity, unimodality, log-concavity, and monotone hazard rate.

The story doesn’t end here — this framework also applies for testing multivariate discrete distributions. While the previous literature on testing one-dimensional distributions was quite mature (i.e., for the previously studied classes, we sort of “knew” the right answer to be $\Theta(\sqrt{n})$, up to log factors and the precise dependence on epsilon), fairly little was known about higher dimensional testing problems. This work manages to give optimal algorithms for testing monotonicity and independence of marginals over the $d$-dimensional hypergrid. In particular, for monotonicity, it gives a quadratic improvement in the sample complexity, reducing it from $O(n^{d - \frac12})$ to the optimal $\Theta(n^{\frac{d}{2}})$.

# Conclusions

The hope is that these frameworks may find applications for many other problems, in distribution testing and beyond. The world is yours…

# References

[ADK15] Optimal testing for properties of distributions. J. Acharya, C. Daskalakis, and G. Kamath. arXiv, 2015.
[BFRSW01] Testing that distributions are close. T. Batu, L. Fortnow, R. Rubinfeld, W. D. Smith, and P. White,  FOCS’01.
[BKR04] Sublinear algorithms for testing monotone and unimodal distributions. T. Batu, R. Kumar, and R. Rubinfeld, STOC’04.
[Can15] A Survey on Distribution Testing: Your Data is Big. But is it Blue?. C. Canonne. ECCC, 2015.
[CDGR15] Testing Shape Restrictions of Discrete Distributions. C. Canonne, I. Diakonikolas, T. Gouleakis, and R, Rubinfeld. arXiv, 2015.
[ILR12] Approximating and Testing $k$-Histogram Distributions in Sub-linear Time. P. Indyk, R. Levi, and R. Rubinfeld, PODS’12.
[Ron08] Property Testing: A Learning Theory Perspective. D. Ron. FTML, 2008.
[Rub12] Taming Big Probability Distributions. R. Rubinfeld. XRDS, 2012.
[VV10] A CLT and tight lower bounds for estimating entropy. G. Valiant and P. Valiant. ECCC, 2010.