# Purifying spoiled randomness with spoiled randomness

Recently a preprint was posted to ECCC, with the pulse-quickening title “Explicit Two-Source Extractors and Resilient Functions“. If the result is correct, then it really is — shall I say it — a breakthrough in theoretical computer science. One of my favorite things about TCS is how fresh it is. It seems like every other week, an exciting result in all sorts of subareas are announced on arxiv or ECCC. I thought I’d use this as an opportunity to (briefly) explain what this breakthrough is about, for those who aren’t theoretical computer scientists.

Computer programs need sequences of random numbers all the time. Encryption, games, scientific applications need it. But these applications usually assume that they have access to completely unpredictable random number sequences that have no internal patterns or regularities.

But where in the world would we find such perfect, pristine sources of random numbers? The methods that people devise to generate sources of randomness is a fascinating subject in its own right, but long story short, it’s actually really, really hard to get our hands on such good random numbers. Pure randomness, as it turns out, is a valuable resource that is as rare as — perhaps even rarer than — gold or diamond.

On the other hand, it seems much easier to get access to weak randomness, which are number sequences that are pretty unpredictable, but there might be subtle (or not-so-subtle) correlations or relations between the different numbers. For example, one way you might try to get random numbers is to take a digital thermometer that’s accurate to within, say, 2 decimal places — and always take the last digit of the temperature reading whenever you need a random number.
Since the temperature of the air around the thermometer is always fluctuating, this will give some randomness, but it’s probably not perfectly random. If the temperature at this second is 74.32 F, in the next second, it’s probably more likely to be 74.34 F instead of 74.39 F. So there are patterns and correlations within these random numbers.

You might try to devise more elaborate ways to mix up the numbers, but it’s pretty difficult to actually get your scheme to produce perfect unpredictability. Like a pernicious disease, corrupting correlations and predictable patterns can run deep within a sequence of random digits, and removing these unwanted blemishes is a tough task confounded by the fact that you can never be sure if what you’re seeing is a pattern or merely an artifact of the randomness.

But randomness extractors do just do that. They are careful surgeons that expertly cut out the malignant patterns, and leave fresh, pure unpredictability. More technically: randomness extractors are algorithms that are designed to read in weak random number sequences, and output much higher quality randomness — in fact, something that is very, very close to perfectly unpredictable.

Extractors are beautiful algorithms, powered by elegant mathematics and wonderful ideas that have taken decades to understand. And we’re still just beginning to understand them, as this breakthrough result shows.

For a long time now, we’ve known that extractors exist. Via nonconstructive mathematical arguments, we know these algorithms are out there, whose job it is to transform weak randomness into perfect randomness. The only problem was, for quite some time we didn’t know what these algorithms looked like! No one could actually write down a description of the algorithm!

It took a long while, but we finally learned how to construct a certain type of extractor algorithm — called a seeded extractor. Basically, what these algorithms do is to take in, say, 1000 weakly random numbers (for example, temperature readings), and then 10 perfectly random numbers (called the seed), and combine these two sources of numbers to produce 500 perfectly random numbers. So they give you a lot more perfectly unpredictable numbers than you started with.

That’s really nice, but where would you get these 10 perfectly random numbers from? There’s a chicken and the egg problem here (albeit with a smaller chicken and egg each time). Now what theoretical computer scientists have been deep in the hunt for are two-source extractors — the subject of this breakthrough work.

As you might guess, these are extractors that take two unrelated sources of weak randomness, and combine them together to produce ONE perfect random source. So you might imagine taking temperature readings, and then have another apparatus that reads in the direction of the wind, and combine these two weak sources of randomness, to get pristine randomness.

Again, we’ve known that two source extractors exist “in the wild” — but actual sightings of such magical beasts have been rare. Until recently, the best two source extractor we knew how to construct required that the two weak sources weren’t actually that weak — these weak sources needed to be incredibly random, possessing very very few correlations and patterns. Bourgain’s extractor, as it is known, takes these almost-perfect random sources and combines them to be perfect.

It might not sound like much, but Bourgain’s extractor was a tremendous achievement, involving some of the newest ideas from arithmetic combinatorics, a branch of mathematics. And for 10 years, this was the best known.

Until today. Chattopadhyay and Zuckerman have told us how to create a two-source extractor that works with sources that are quite weak. There can be a littany of cross-correlations and patterns and relations between the numbers, but as long as the the random numbers have some unpredictability, their two-source extractor will crunch the numbers, clean up the patterns, clean up the correlations, and produce something that’s pristine, perfect, and unpredictable.

It’s not the end of the story, though. Their algorithm only produces ONE random bit. The next step — and I’m sure people at this very moment are working hard on this — is to improve their algorithm to produce MANY perfectly random bits.

Update (August 15, 2015): It didn’t take long for Xin Li, one of the world’s leading extractor designers, to extend the Chattopadhyay and Zuckerman’s construction to output multiple bits. See his followup paper here. You’re watching science unfold, folks!

## 3 thoughts on “Purifying spoiled randomness with spoiled randomness”

1. Could you point us to some source code implementing this?

2. npc says:

This doesn’t seem as exciting to me as you make it sound, so I’d like to understand what this is a little better, if you’ll indulge me.

It seems to me that a weak source of randomness means you have less than 1 bit of entropy for each bit of data in a “random” string. Consequently, the trick is to “squeeze” out the predictable portion. My understanding is that in practice the most common way to do this is to use a cryptographically secure hash function. Why is this new method better? Is it because it doesn’t rely on the strength of the cryptography? Even if the hash could be reversed, it would merely reveal one (of several, probably) weakly random input strings. How does that help one predict the next set of random bits? Yes, there’s a correlation bit-by-bit with the next input string to be hashed, but the whole idea of hashing it is to eliminate that correlation, right? What am I missing? Should I be preparing to rewrite my RNG code? Feel free to point me to some place that already explains this. I couldn’t find such a thing on my own.

• Henry Yuen says:

Hi npc,

Sorry for the late response!

Yes, you can think of weak randomness is that on average, there is less than 1 bit of entropy per bit of data. One example of this might be, even though the first three bits of the string are completely uniform, the fourth bit is a complicated function of the first three. However you have no way of knowing what that complicated function is, nor where this “dependent bit” may be.

Re: using cryptographic hash function. I think something like this is used in practice, but the problem is that this isn’t a theoretically sound method for extracting randomness. The problem is that if you’re given a fixed function (say, for example, SHA1 or AES), then there always *exists* a weak random source X such that SHA1(X) or AES(X) is not uniformly random. Whether we as human beings can find X or not is another question, but theoretically it is not a hard task.

This work instead focuses on getting sound theoretical guarantees. You need a function that takes in two independent sources of randomness and combines them together, such that no matter what sorts of dependencies exist within each source, the function will squeeze it out. Furthermore, there are no cryptographic assumptions needed!

Do you have to rewrite your RNG code? It depends on what sorts of guarantees you’re looking for. If you’re looking for sound mathematical guarantees that your randomness extractor is always going to give you near uniform randomness, then the RNG in the C compiler is not going to cut it, nor is using some fixed function such as SHA1 or AES. However, “in practice”, probably using a hash function such as SHA will work for most applications. It’s only when you’re writing code that will protect, say, a power plant that you might want to look into something a little more secure.