Clément Canonne on A complexity theoretic perspective on unsupervised learning
And now, for something completely different… Rocco Servedio presented a (joint with Ilias Diakonikolas and Anindya De ) work, not dealing with distribution testing. Showing there is a world outside distribution testing, and that this world is filled with wonder, probabilities and Boolean functions. A world of inverse problems.
I will only try to give the main idea, maybe the main ideas, of the problem they consider, why it is hard, and how they overcame the difficulties. To start, fix an unknown Boolean function in some class — think of the class of linear threshold functions, for instance (LTFs, a.k.a. weighted majorities, or halfspaces). is unknown, but it brings with it the set of its satisfying inputs, ; and you can get uniform samples from this . You press a button, you get .
Can you do this a few time, and then output a new device with its own button, that will generate on demand i.i.d. samples from some distribution close (in total variation distance, as usual) to the uniform distribution on ? In other terms, can you approximate the uniform distribution on , by observing i.i.d. samples from (but without knowing )?
Of course, one way to do this would be to use a learning algorithm for (but using only what we get: only positive examples): learn an approximate version of , and then study it “offline” to find out exactly what is, before spitting uniform samples from it. Clearly, if is close enough to , this will work!
But does it always? Recall that here, we want to be close to uniform on , that is we only consider the error with relation to the positive examples of the function; while the general results for (PAC-)learning of functions under the uniform distribution only care about the overall error. This means may need to be very close to for this approach to succeed… As an example, take to be , i.e. 0 everywhere but on the single point . The function is a very good approximation: it has error only ; and yet, this would lead to a distribution supported on the single point , which is thus 1-far from the actual distribution (supported on )!
Intuitively, the problem here is that PAC-learning gives a sort of additive guarantee:
while we need some some of multiplicative one:
which is much harder to obtain when has very few satisfying assignments.
This being said, the discussion above shows that when we are in the dense case, things look simpler: i.e., when has “many” (an inverse polynomial fraction of the hypercube) satisfying inputs. In this case, the idea is to combine two steps — and show they can be performed:
(1) SQ Learning: get a hypothesis . Upshot: one has to show that it’s possible to use a SQ learner, even from positive examples only.
(2) Hypothesis testing: test if is good enough (if not, try again)
before then using the now explicitly known “good” to approximate the set of satisfying inputs of .
But then, once this “simple” (erm) case is dealt with, how to handle the general case, where could have an exponentially small fraction of satisfying inputs? Rocco explained one of their main ingredient — what they called a “densifier”, i.e. an algorithm that will start from positive examples from and output a “densified function” which agrees with on most of ‘s satisfying inputs, and does not have many more (so that is “dense” in ). Provided such a densifier exists (and can be efficiently simulated) for a class , they show how to combine it with (1) and (2) above to iteratively “densify” the sparse set and reduce to the “easy”, dense case.
Now that all the bricks are in place, the next step in order to play Lego is, of course, to show that such densifiers do exist… for interesting classes, and in particular classes who also have SQ learning algorithms (otherwise, brick (1) is still missing!). In the paper, the authors describe how to build such a densifier for (for instance) the class of LTFs. Combining this with the ingredients sketched above, this gives them a polynomial time algorithm for learning the uniform distribution over satisfying assignments to an unknown LTF. As another instantiation of their approach, the authors also give a densifier for DNF formulas and a resulting quasi-polynomial time distribution learning algorithm.
Two remarks here:
- maybe surprisingly, the hardness of this problem stems from the Boolean setting — indeed, the same question, dealing with functions on the Gaussian space, becomes way easier;
- the related “direct” problem (on input the full specification of , generate close-to-uniform samples from ), is neither harder nor simpler: the authors provide examples where (under some suitable well-believed assumption) the former is easy and the latter hard, and vice-versa. In particular, if is a finite cyclic group, getting even one positive example allows the algorithm to get easily the whole “for free”.
 Inverse Problems in Approximate Uniform Generation, De, Diakonikolas, Servedio, 2012.