Randomness for computer programmers
Random numbers appear in almost everything that involves computers. The cryptographic algorithms that underpin almost all online commerce are based on random number generators, in academic use there is the entire field of Monte-Carlo simulations that use random numbers to generate and test configurations, and many computer games use random number generators to generate the rich behaviour that is observed without having a human have to program everything manually.
Despite their ubuquity though, random number generators (RNGs) are a little regarded element of computational techniques in most areas of academic use. Before I start, I want to stress something if you have been working with random numbers and haven't worried about the things that I'm talking about you are probably fine. You have to be careful with RNGs, but mostly any ones that you will encounter in the wild are good enough.
The first thing to do is to distinguish between true (or hardware) random number generators and psuedorandom number generators. True random number generators are hardware devices that convert random physical properties into random electrical signals and then convert them into random numbers. Even these are split into ones that use truly random sources (like the decay of radioactive nucleii) or chaotic sources (like atmospheric turbulence), but in general they produce sequences of random numbers that are non-repeating and unpredictable. You can read more about them here. True RNGs produce a signal that has maximum entropy for a signal of that length.
You might remember the concept of entropy from physics where it is a measure of disorder, so a system that is disordered (like a gas) has higher entropy than an ordered system (like a solid). Information entropy is very similar and represents a measure of surprise. If you have a signal that reads "1, 1, 1, 1, 1, 1, 1, 1" then being told that the next number is also "1" isn't much of a surprise, so that signal has low information entropy. If you have the signal "1, 1, 0, 1, 0, 1, 0, 0" then you'd be much harder put to guess what the next symbol is going to be so that signal has a much higher entropy. Information entropy is actually a fairly complex concept that can be measured in different ways (you can see more here) and explaining it in detail is well beyond the scope of a blog post, but you can easily see some of the complexity by an analogy. You could say that my second signal ("1, 1, 0, 1, 0, 1, 0, 0") was higher entropy than my first ("1, 1, 1, 1, 1, 1, 1, 1") because it has as many "1"s as "0"s, but what about the signal "1, 0, 1, 0, 1, 0, 1, 0"? On the one hand it has as many "1"s as "0"s, but on the other hand it can be represented by a very simple pattern and so you'd be less surprised if the next digit was "1" than if it was "0".
While true RNGs are extremely important in the real world, particularly in areas such as cryptography, most academic users don't use true random number generators, or use them only in combination with pseudorandom number generators (PRNGs).
PRNGs are algorithms that take a simple seed value and then operate on it to generate a sequence of numbers that has as many of the properties of a true RNG as possible. They are, by definition, not random. If you give them the same seed then they will always generate the same sequence, so they cannot contain any more entropy than is contained in the combination of the seed value and the algorithm. This means that they always perform worse than true RNGs. They do however work well enough for almost all scientific and statistical purposes if you choose a good one. Unless you are working on a system that requires Cryptographically Secure PseudoRandom Numbers (and if you are I'd hope that you aren't getting your information from a blog post!) then almost any modern PRNG is a good one. The only problem is that language standards don't update very fast, so some programming languages don't insist on a modern one in the standard library. Java uses an old algorithm called a Linear Congruential Generator, the C and Fortran standards don't require anything specific about the algorithm to be implemented and the popular GFortran compiler uses a very modern algorithm if you use the modern Fortran RANDOM_NUMBER interface and a very old one if you use the older RAND interface. The main thing to take away is that you should be very careful using language supplied random number generator. You cannot rely on the quality.
The ultimate limit on the use of PRNGs is that they inevitably have a period of repetition. That is a finite number of random numbers that it will generate until the same sequence of numbers will repeat. If you use n bits for the seed then the longest possible number of iterations until it repeats is 2n. This means that if you use a single byte (8 bits) for your seed you will see a repeating sequence after 256 random numbers have been drawn. That is far too low for most practical purposes. On the other hand if you use a 64 bit value then you will have a bit less than 2 x 1019 numbers before you see the sequence repeat, which is large enough for almost any use that you can think of. Most real implementations of PRNGs use large enough seed values that this isn't a problem in the real world. It is important to remember that repetition of a particular sequence doesn't mean that the RNG has looped and is repeating. You can easily imagine that a PRNG will produce a lot of sequences that go "1,1,0,1,0" in 2 x 1019numbers.
There are a lot of tests for randomness (the popular TestU01 test suite has 160 tests in it's "Big Crush" test), but the most important ones for scientific use (in my opinion at least) are
- Chi-Square test - Make sure that all symbols that the PRNG can produce are produced at the expected rate
- Serial-correlation test - Make sure that the autocorrelation (degree of relatedness between pairs of numbers) is zero. This ensures that you can't predict that you will always get the number 8 in three iterations if you get 3 now
- Spectral test - How densely the numbers generated by the PRNG fill the space of all possible numbers (that is a massive simplification of the spectral test, see here and links for more details)
These three tests are well passed by modern PRNGs so for most non cryptographic (I stress this again!) purposes will be fine. If you are doing cryptography follow the best practices of the cryptographic community. Better yet do not write your own cryptographic code!
The final question is : how do you choose a good seed to start from? That is quite easy for most scientific or technical problems. If you want a random sequence but want to be able to reproduce it reliably you should choose a seed youself by any mechanism that you want, for a good PRNG there isn't a bad seed. If you want it to be random every time you run your code you need some source of entropy to bootstrap the PRNG. The classic bootstrapping entropy is to use the system clock, since the exact time at which you run your problem is probably random. The "better" way is to use one of the system entropy sources such as /dev/random and /dev/urandom on POSIX systems or CryptGenRandom on Windows. These will make use of any hardware RNGs that the system has (where appropriate) and also generates entropy based on other sources of randomness available to the system.
Final note: IT IS VERY IMPORTANT TO RUN YOUR PROBLEM WITH DIFFERENT SEEDS! Unless your result is robust to different random initial conditions it isn't a result, it is an artifact of a particular random sequence that shouldn't priviliged compared to other random sequences.
PRNGs are an efficient way of generating uniformly distributed random numbers, but what if you don't want uniformly distributed random numbers? That will be covered in a later blog post.