# An Encore: More Learning and Testing – STOC 2014 Recaps (Part 8)

Thought you were rid of us? Not quite: in a last hurrah, Clément and I come back with a final pair of distribution estimation recaps — this time on results from the actual conference!

Density estimation is the question on everyone’s mind. It’s as simple as it gets – we receive samples from a distribution and want to figure out what the distribution looks like. The problem rears its head in almost every setting you can imagine — fields as diverse as medicine, advertising, and compiler design, to name a few. Given its ubiquity, it’s embarrassing to admit that we didn’t have a provably good algorithm for this problem until just now.

Let’s get more precise. We’ll deal with the total variation distance metric (AKA statistical distance). Given distributions with PDFs $f$ and $g$, their total variation distance is $d_{\mathrm{TV}}(f,g) = \frac12\|f - g\|_1$. Less formally but more intuitively, it upper bounds the difference in probabilities for any given event. With this metric in place, we can define what it means to learn a distribution: given sample access to a distribution $X$, we would like to output a distribution $\hat X$ such that $d_{\mathrm{TV}}(f_X,f_{\hat X}) \leq \varepsilon$.

This paper presents an algorithm for learning $t$-piecewise degree-$d$ polynomials. Wow, that’s a mouthful — what does it mean? A $t$-piecewise degree-$d$ polynomial is a function where the domain can be partitioned into $t$ intervals, such that the function is a degree-$d$ polynomial on each of these intervals. The main result says that a distribution with a PDF described by a $t$-piecewise degree-$d$ polynomial can be learned to accuracy $\varepsilon$ using $O((d+1)kt/\varepsilon^2)$ samples and polynomial time. Moreover, the sample complexity is optimal up to logarithmic factors.

A 4-piecewise degree-3 polynomial.
Lifted from Ilias’ slides.

Now this is great and all, but what good are piecewise polynomials? How many realistic distributions are described by something like “$-x^2 + 1$ for $x \in [0,1)$ but $\frac12x - 1$ for $x \in [1,2)$ and $2x^4 - x^2 +$…”? The answer turns out to be a ton of distributions — as long as you squint at them hard enough.

The wonderful thing about this result is that it’s semi-agnostic. Many algorithms in the literature are God-fearing subroutines, and will sacrifice their first-born child to make sure they receive samples from the class of distributions they’re promised — otherwise, you can’t make any guarantees about the quality of their output. But our friend here is a bit more skeptical. He deals with a funny class of distributions, and knows true piecewise polynomial distributions are few and far between — if you get one on the streets, who knows if it’s pure? Our friend is resourceful: no matter the quality, he makes it work.

Let’s elaborate, in slightly less blasphemous terms. Suppose you’re given sample access to a distribution $\mathcal{D}$ which is at total variation distance $\leq \tau$ from some $t$-piecewise degree-$d$ polynomial (you don’t need to know which one). Then the algorithm will output a $(2t-1)$-piecewise degree-$d$ polynomial which is at distance $\leq 4\tau + O(\varepsilon)$ from $\mathcal{D}$. In English: even if the algorithm isn’t given a piecewise polynomial, it’ll still produce something that’s (almost) as good as you could hope for.

With this insight under our cap, let’s ask again — where do we see piecewise polynomials? They’re everywhere: this algorithm can handle distributions which are log-concave, bounded monotone, Gaussian, $t$-modal, monotone hazard rate, and Poisson Binomial. And the kicker is that it can handle mixtures of these distributions too. Usually, algorithms fail catastrophically when considering mixtures, but this algorithm keeps chugging and handles them all — and near optimally, most of the time.

The analysis is tricky, but I’ll try to give a taste of some of the techniques. One of the key tools is the Vapnik-Chervonenkis (VC) inequality. Without getting into the details, the punchline is that if we output a piecewise polynomial which is “close” to the empirical distribution (under a weaker metric than total variation distance), it’ll give us our desired learning result. In this setting, “close” means (roughly) that the CDFs don’t stray too far from each (though in a sense that is stronger than the Kolmogorov distance metric).

Let’s start with an easy case – what if the distribution is a $1$-piecewise polynomial? By the VC inequality, we just have to match the empirical CDF. We can do this by setting up a linear program which outputs a linear combination of the Chebyshev polynomials, constrained to resemble the empirical distribution.

It turns out that this subroutine is the hardest part of the algorithm. In order to deal with multiple pieces, we first discretize the support into small intervals which are roughly equal in probability mass. Next, in order to discover a good partition of these intervals, we run a dynamic program. This program uses the subroutine from the previous paragraph to compute the best polynomial approximation over each contiguous set of the intervals. Then, it stitches the solutions together in the minimum cost way, with the constraint that it uses fewer than $2t - 1$ pieces.

In short, this result essentially closes the problem of density estimation for an enormous class of distributions — they turn existential approximations (by piecewise polynomials) into approximation algorithms. But there’s still a lot more work to do — while this result gives us improper learning, we yearn for proper learning algorithms. For example, this algorithm lets us approximate a mixture of Gaussians using a piecewise polynomial, but can we output a mixture of Gaussians as our hypothesis instead? Looking at the sample complexity, the answer is yes, but we don’t know of any computationally efficient way to solve this problem yet. Regardless, there’s many exciting directions to go — I’m looking forward to where the authors will take this line of work!

-G

Clément Canonne on $L_p$-Testing, by Piotr Berman, Sofya Raskhodnikova, and Grigory Yaroslavtsev [1])

Almost every — if not all — work in property testing of functions are concerned with the Hamming distance between functions, that is the fraction of inputs on which they disagree. Very natural when we deal for instance with Boolean functions $f\colon\{0,1\}^d\to\{0,1\}$, this distance becomes highly arguable when the codomain is, say, the real line: sure, $f\colon x\mapsto x^2-\frac{\sin^2 x}{2^{32}}$ and $g\colon x\mapsto x^2$ technically disagree on almost every single input, but should they be considered two completely different functions?

This question, Grigory answered by the negative; and presented (joint work with Piotr Berman and Sofya Raskhodnikova [2]) a new framework for testing real-valued functions $f\colon X^d\to[0,1]$, less sensitive to this sort of annoying “technicalities” (i.e., noise). Instead of the usual Hamming/$L_0$ distance between function, they suggest the more robust $L_p$ ($p\geq 1$) distance

$\mathrm{d}_p(f,g) = \frac{ \left(\int_{X^d} \lvert f(x)-g(x) \rvert^p dx\right)^{\frac{1}{p}} }{ \left(\int_{X^d} \mathbf{1} dx \right)^{\frac{1}{p}} } = \mathbb{E}_{x\sim\mathcal{U}(X^d)}\left[\lvert f(x)-g(x) \rvert^p\right]^{\frac{1}{p}}\in [0,1]$

(think of $X^d$ as being the hypercube $\{0,1\}^d$ or the hypergrid $[n]^d$, and $p$ being 1 or 2. In this case, the denominator is just a normalizing factor $\lvert X\rvert$ or $\sqrt{\lvert X\rvert}$)

Now, erm… why?

• because it is much more robust to noise in the data;
• because it is much more robust to outliers;
• because it plays well (as a preprocessing step for model selection) with existing variants of PAC-learning under $L_p$ norms;
• because $L_1$ and $L_2$ are pervasive in (machine) learning;
• because they can.

Their results and methods turn out to be very elegant: to outline only a few, they

• give the first example of testing monotonicity testing (de facto, for the $L_1$ distance) when adaptivity provably helps; that is, a testing algorithm that selects its future queries as a function of the answers it previously got can outperform any tester that commits in advance to all its queries. This settles a longstanding question for testing monotonicity with respect to Hamming distance;
• improve several general results for property testing, also applicable to Hamming testing (e.g. Levin’s investment strategy [3]);
• provide general relations between sample complexity of testing (and tolerant testing) for various norms ($L_0, L_1,L_2,L_p$);
• have quite nice and beautiful algorithms (e.g., testing via partial learning) for testing monotonicity and Lipschitz property;
• give close-to-tight bounds for the problems they consider;
• have slides in which the phrase “Big Data” and a mention to stock markets appear (!);
• have an incredibly neat reduction between $L_1$ and Hamming testing of monotonicity.

I will hereafter only focus on the last of these bullets, one which really tickled my fancy (gosh, my fancy is so ticklish) — for the other ones, I urge you to read the paper. It is a cool paper. Really.

Here is the last bullet, in a slightly more formal fashion — recall that a function $f$ defined on a partially ordered set is monotone if for all comparable inputs $x,y$ such that $x\preceq y$, one has $f(x)\leq f(y)$; and that a one-sided tester is an algorithm which will never reject a “good” function: it can only err on “bad” functions (that is, it may sometimes accept, with small probability, a function far from monotone, but will never reject a monotone function).

Theorem.
Suppose one has a one-sided, non-adaptive tester $T$ for monotonicity of Boolean functions $f\colon X\to \{0,1\}$ with respect to Hamming distance, with query complexity $q$. Then the very same $T$ is also a tester for monotonicity of real-valued functions $f\colon X\to [0,1]$ with respect to $L_1$ distance.

Almost too good to be true: we can recycle testers! How? The idea is to express our real-valued $f$ as some “mixture” of Boolean functions, and use $T$ as if we were accessing these. More precisely, let $f\colon [n]^d \to [0,1]$ be a function which one intends to test for monotonicity. For all thresholds $t\in[0,1]$, the authors define the Boolean function $f_t$ by

$f_t(x) = \begin{cases} 1 & \text{ if } f(x) \geq t\\ 0 & \text{ if } f(x) < t \end{cases}$

All these $f_t$ are Boolean; and one can verify that for all $x$, $f(x)=\int_0^1 f_t(x) dt$. Here comes the twist: one can also show that the distance of $f$ to monotone satisfies

$\mathrm{d}_1(f,\mathcal{M}) = \int_0^1 \mathrm{d}_0(f_t,\mathcal{M}) dt$

i.e. the $L_1$ distance of $f$ to monotone is the integral of the Hamming distances of the $f_t$‘s to monotone. And by a very simple averaging argument, if $f$ is far from monotone, then at least one of the $f_t$‘s must be…
How does that help? Well, take your favorite Boolean, Hamming one-sided (non-adaptive) tester for monotonicity, $T$: being one-sided, it can only reject a function if it has some “evidence” it is not monotone — indeed, if it sees some violation: i.e., a pair $x,y$ with $x\prec y$ but $f(x) > f(y)$.

Feed this tester, instead of the Boolean function it expected, our real-valued $f$; as one of the $f_{t^\ast}$‘s is far from monotone, our tester would reject $f_{t^\ast}$; so it would find a violation of monotonicity by $f_{t^\ast}$ if it were given access to $f_{t^\ast}$. But being non-adaptive, the tester does exactly the same queries on $f$ as it would have done on this $f_{t^\ast}$! And it is not difficult to see that a violation for $f_{t^\ast}$ is still a violation for $f$: so the tester finds a proof that $f$ is not monotone, and rejects. $\square$

Wow.

— Clément.

Final, small remark: one may notice a similarity between $L_1$ testing of functions $f\colon[n]\to[0,1]$ and the “usual” testing (with relation to total variation distance, $\propto L_1$) of distributions $D\colon [n]\to[0,1]$. There is actually a quite important difference, as in the latter the distance is not normalized by $n$ (because distributions have to sum to $1$ anyway). In this sense, there is no direct relation between the two, and the work presented here is indeed novel in every respect.

Edit: thanks to Sofya Raskhodnikova for spotting an imprecision in the original review.

[1] Slides available here: http://grigory.github.io/files/talks/BRY-STOC14.pptm
[2] http://dl.acm.org/citation.cfm?id=2591887
[3] See e.g. Appendix A.2 in “On Multiple Input Problems in Property Testing”, Oded Goldreich. 2013. http://eccc-preview.hpi-web.de/report/2013/067/