# Efficient Distribution Estimation 3: Rocco’s Modern Life – STOC 2014 Recaps (Part 5)

The road twists, as Clément takes us down a slightly different path in the penultimate post of our Efficient Distribution Estimation workshop summary series. See also: Part 1 and Part 2.

Clément Canonne on A complexity theoretic perspective on unsupervised learning

And now, for something completely different… Rocco Servedio presented a (joint with Ilias Diakonikolas and Anindya De [1]) work, not dealing with distribution testing. Showing there is a world outside distribution testing, and that this world is filled with wonder, probabilities and Boolean functions. A world of inverse problems.

I will only try to give the main idea, maybe the main ideas, of the problem they consider, why it is hard, and how they overcame the difficulties. To start, fix an unknown Boolean function $f\colon\{0,1\}^n\to\{0,1\}$ in some class $\mathcal{C}$ — think of the class of linear threshold functions, for instance (LTFs, a.k.a. weighted majorities, or halfspaces). $f$ is unknown, but it brings with it the set of its satisfying inputs, $I\stackrel{\rm{}def}{=} f^{-1}(\{1\})$; and you can get uniform samples from this $I$. You press a button, you get $x\in I$.

Can you do this a few time, and then output a new device with its own button, that will generate on demand i.i.d. samples from some distribution $D$ close (in total variation distance, as usual) to the uniform distribution on $I$? In other terms, can you approximate the uniform distribution on $I$, by observing i.i.d. samples from $I$ (but without knowing $f$)?

Of course, one way to do this would be to use a learning algorithm for $\mathcal{C}$ (but using only what we get: only positive examples): learn an approximate version $\hat{f}$ of $f$, and then study it “offline” to find out exactly what $\hat{f}^{-1}(\{1\})$ is, before spitting uniform samples from it. Clearly, if $\hat{f}$ is close enough to ${f}$, this will work!

But does it always? Recall that here, we want to be close to uniform on $I$, that is we only consider the error with relation to the positive examples of the function; while the general results for (PAC-)learning of functions under the uniform distribution only care about the overall error. This means $\hat{f}$ may need to be very close to $f$ for this approach to succeed… As an example, take $f$ to be $\mathrm{AND}_n$, i.e. 0 everywhere but on the single point $x=1^n$. The function $\hat{f}= \lnot\mathrm{OR}_n$ is a very good approximation: it has error only $1/2^{n-1}$; and yet, this would lead to a distribution supported on the single point $0^n$, which is thus 1-far from the actual distribution (supported on $1^n$)!

Intuitively, the problem here is that PAC-learning gives a sort of additive guarantee:

$\mathbb{P}\{\hat{f}(x) \neq f(x)\} \leq \varepsilon$

while we need some some of multiplicative one:

$\mathbb{P}\{\hat{f}(x) \neq f(x)\} \leq \varepsilon\cdot \mathbb{P}\{f(x) = 1\}$

which is much harder to obtain when $f$ has very few satisfying assignments.

This being said, the discussion above shows that when we are in the dense case, things look simpler: i.e., when $f$ has “many” (an inverse polynomial fraction of the hypercube) satisfying inputs. In this case, the idea is to combine two steps — and show they can be performed:
(1) SQ Learning: get a hypothesis $\hat{f}$. Upshot: one has to show that it’s possible to use a SQ learner, even from positive examples only.
(2) Hypothesis testing: test if $\hat{f}$ is good enough (if not, try again)
before then using the now explicitly known “good” $\hat{f}$ to approximate the set of satisfying inputs of $f$.

But then, once this “simple” (erm) case is dealt with, how to handle the general case, where $f$ could have an exponentially small fraction of satisfying inputs? Rocco explained one of their main ingredient — what they called a “densifier”, i.e. an algorithm that will start from positive examples from $f$ and output a “densified function” $g$ which agrees with $f$ on most of $f$‘s satisfying inputs, and does not have many more (so that $f^{-1}(\{1\})$ is “dense” in $g^{-1}(\{1\})$). Provided such a densifier exists (and can be efficiently simulated) for a class $\mathcal{C}$, they show how to combine it with (1) and (2) above to iteratively “densify” the sparse set $f^{-1}(\{1\})$ and reduce to the “easy”, dense case.

Now that all the bricks are in place, the next step in order to play Lego is, of course, to show that such densifiers do exist… for interesting classes, and in particular classes who also have SQ learning algorithms (otherwise, brick (1) is still missing!). In the paper, the authors describe how to build such a densifier for (for instance) the class of LTFs. Combining this with the ingredients sketched above, this gives them a polynomial time algorithm for learning the uniform distribution over satisfying assignments to an unknown LTF. As another instantiation of their approach, the authors also give a densifier for DNF formulas and a resulting quasi-polynomial time distribution learning algorithm.

Two remarks here:

• maybe surprisingly, the hardness of this problem stems from the Boolean setting — indeed, the same question, dealing with functions $f\colon\mathbb{R}^n\to\{0,1\}$ on the Gaussian space, becomes way easier;
• the related “direct” problem (on input the full specification of $f$, generate close-to-uniform samples from $f^{-1}(\{1\})$), is neither harder nor simpler: the authors provide examples where (under some suitable well-believed assumption) the former is easy and the latter hard, and vice-versa. In particular, if $f^{-1}(\{1\})$ is a finite cyclic group, getting even one positive example allows the algorithm to get easily the whole $f^{-1}(\{1\})$ “for free”.

[1] Inverse Problems in Approximate Uniform Generation, De, Diakonikolas, Servedio, 2012.

## 2 thoughts on “Efficient Distribution Estimation 3: Rocco’s Modern Life – STOC 2014 Recaps (Part 5)”

1. J says:

Nice post! Tiny correction: I think you mean \hat{f}=\neg OR, not \hat{f} = \neg AND.