#
All entries for May 2018

## May 31, 2018

### More randomness and a bit of non–randomness

## Normally distributed random numbers

The last blog post was about getting high quality, uniformly distributed random numbers. In this post, I'm going to address the question of what to do if you want to have non uniformly distributed random numbers, that is random numbers where some are more likely than others. One of the ways that we're going to be looking at this is by looking at **Probability Distribution Functions(PDF)**. These graphs are very like histograms, but they instead say "I have a source of random data U. For the entire source U, what is the probability that any given value is in the range x to x + dx, where dx is a finite but usually small deviation from x". In practice, what you do is you split up the range of U into a series of bins of width "dx" and then just count the number of particles in each bin and divide by the total number of particles.

The above graph shows the probability distribution for a uniform random number when you have 1,000 bins between 0 and 1. This number of bins explains why the graph is showing a rough line centred around 0.1% probability. Since a uniform random number generator should produce values in all bins equally, you would expect (100/nbins)% (so 0.1% for 1,000 bins) to lie in each bin. The graph shows that the random number generator that I'm using is working reasonably well.

This graph shows the same thing for a normally distributed random number (also called Gaussian, or in some fields Maxwellian distributions and sometimes bell curves). It is strongly peaked at zero and extends out to ±6. Normal distributions are very important in many areas of study (see here for some examples), and is found in things as diverse as the distribution of speeds of particles in a gas to the distribution of heights of members of a sports club. As a consequence there is a lot of interest in generating normally distributed random numbers. This is often called "sampling" the distribution. The graphs above were created by getting random numbers and counting how many fell into each bin on the graph, but often you want to go the other way. You want to draw the graph and then get random numbers such that their distribution will match the graph.

Fortunately there's a very easy way of converting from a uniform random number generator to a normally distributed random number generator via an algorithm called the Box-Muller Transform. I'm not going to go into the maths of it which are detailed on the linked Wikipedia page, but if you want normally distributed random numbers the Box-Muller Transform is your friend (usually the specific variant called the Polar Box-Muller Transform). Lots of languages already have built in implementations of normally distributed random number generators, generally using that transform so it's worth looking for those first, especially in compiled languages like Python where the built in function will probably be much faster than yours. However, in languages like C or Fortran which don't have a built in one you probably want to find or write an implementation of the Box-Muller Transform if you need normally distributed random numbers.

## Arbitrary distributions in one variable

Sometimes, however, you have a more complicated distribution function that you want to sample. Sometimes there are other transforms like Box-Muller, but quite often there aren't. At which point the most powerful tool you have is called Inverse Transform Sampling. This introduces a new idea the **Cumulative Distribution Function (CDF)**. The CDF is very much like the PDF in many senses, but instead of saying "how much of U is between x and x+dx" it says "how much of U is between min(U) and x", hence the cumulative in the name. It is the accumulation of the probability of the value being at x or lower.

The graph this time compares the PDF (left) to the CDF (right) for a normally distributed number. The CDF has a value of 100% at the right because everything has to be less than the largest value. You can create the CDF in a number of ways, but on a computer a very common way is to define a grid of values (x(i), say) and calculate the PDF (PDF(i)) for each of them. This does mean discretising your distribution but that is often sufficient and you can often interpolate the result. The CDF is then simply defined by

CDF(i) = SUM(PDF(0:i))

where i is the index into a zero based array and is evaluated from 0 to n, where n is the number of elements in the array.

Once you have the CDF sampling it is surprisingly easy. You simply pick a uniform random number between 0 and 100 and then find out which value of i is the cell in which the CDF exceeds that value. You then have sampled the value x(i). You keep repeating this until you have sufficient values. If you then check the distribution of the values it will match the PDF that you specified. Effectively this divides the PDF into bins of equal **area under the curve**, and selects one bin at a time with equal probability. On the CDF graph, equal area bins correspond to a uniform division of the y axis.

The actual implementation of this can be a bit messier because while the easiest way of finding which cell of the CDF exceeds your random value is simply to start at i=0 and count up, the *fastest*way is to use binary bisection on the index. This works because the CDF is guaranteed to increase from left to right, so if you pick any point you know if the point that you want is to the left of that point if the value is smaller and to the right of the point if the value is larger. So you can keep dividing the space into sections until you have found the value that you want. But that is an implementation issue rather than a conceptual one.

Note also that if you can integrate the PDF to obtain a CDF, which you can then invert, you can do all of this analytically, but even simple PDFs like a Normal distribution don't have an invertible CDF.

## Distributions in more than one variable

Distributions in more than one variable can be tricky. But if the function is separable, that is it can be written as two functions multiplied together, each of which is in a single variable (f(x, y) = g(x) * h(y)) each part can be sampled as above. For each complete sample you take one random sample of the function g, and one of h, and multiply them together.

Functions in more than one variable which don't break down like this are generally treated using a technique called accept-reject sampling, or some more sophisticated technique in tricky cases. More on this next time.

## May 16, 2018

### Randomness for computer programmers

Random numbers appear in almost everything that involves computers. The cryptographic algorithms that underpin almost all online commerce are based on random number generators, in academic use there is the entire field of Monte-Carlo simulations that use random numbers to generate and test configurations, and many computer games use random number generators to generate the rich behaviour that is observed without having a human have to program everything manually.

Despite their ubuquity though, random number generators (RNGs) are a little regarded element of computational techniques in most areas of academic use. Before I start, I want to stress something **if you have been working with random numbers and haven't worried about the things that I'm talking about you are probably fine**. You have to be careful with RNGs, but mostly any ones that you will encounter in the wild are good enough.

The first thing to do is to distinguish between true (or hardware) random number generators and psuedorandom number generators. True random number generators are hardware devices that convert random physical properties into random electrical signals and then convert them into random numbers. Even these are split into ones that use truly random sources (like the decay of radioactive nucleii) or chaotic sources (like atmospheric turbulence), but in general they produce sequences of random numbers that are non-repeating and unpredictable. You can read more about them here. True RNGs produce a signal that has maximum **entropy **for a signal of that length.

You might remember the concept of entropy from physics where it is a measure of disorder, so a system that is disordered (like a gas) has higher entropy than an ordered system (like a solid). **Information entropy** is very similar and represents a measure of **surprise**. If you have a signal that reads "1, 1, 1, 1, 1, 1, 1, 1" then being told that the next number is also "1" isn't much of a surprise, so that signal has low information entropy. If you have the signal "1, 1, 0, 1, 0, 1, 0, 0" then you'd be much harder put to guess what the next symbol is going to be so that signal has a much higher entropy. Information entropy is actually a fairly complex concept that can be measured in different ways (you can see more here) and explaining it in detail is well beyond the scope of a blog post, but you can easily see some of the complexity by an analogy. You could say that my second signal ("1, 1, 0, 1, 0, 1, 0, 0") was higher entropy than my first ("1, 1, 1, 1, 1, 1, 1, 1") because it has as many "1"s as "0"s, but what about the signal "1, 0, 1, 0, 1, 0, 1, 0"? On the one hand it has as many "1"s as "0"s, but on the other hand it can be represented by a very simple pattern and so you'd be less surprised if the next digit was "1" than if it was "0".

While true RNGs are extremely important in the real world, particularly in areas such as cryptography, most academic users don't use true random number generators, or use them only in combination with **pseudorandom number generators (PRNGs).**

PRNGs are algorithms that take a simple **seed **value and then operate on it to generate a sequence of numbers that has as many of the properties of a true RNG as possible. They are, by definition, not random. If you give them the same seed then they will always generate the same sequence, so they cannot contain any more entropy than is contained in the combination of the seed value and the algorithm. This means that they always perform worse than true RNGs. They do however work well enough for almost all scientific and statistical purposes **if you choose a good one**. Unless you are working on a system that requires Cryptographically Secure PseudoRandom Numbers (and if you are I'd hope that you aren't getting your information from a blog post!) then almost any **modern** PRNG is a good one. The only problem is that language standards don't update very fast, so some programming languages don't insist on a modern one in the standard library. Java uses an old algorithm called a Linear Congruential Generator, the C and Fortran standards don't require anything specific about the algorithm to be implemented and the popular GFortran compiler uses a very modern algorithm if you use the modern Fortran **RANDOM_NUMBER** interface and a very old one if you use the older **RAND** interface. The main thing to take away is that you should be very careful using language supplied random number generator. You cannot rely on the quality.

The ultimate limit on the use of PRNGs is that they inevitably have a period of repetition. That is a finite number of random numbers that it will generate until the same sequence of numbers will repeat. If you use *n* bits for the seed then the longest possible number of iterations until it repeats is 2^{n}. This means that if you use a single byte (8 bits) for your seed you will see a repeating sequence after 256 random numbers have been drawn. That is far too low for most practical purposes. On the other hand if you use a 64 bit value then you will have a bit less than 2 x 10^{19} numbers before you see the sequence repeat, which is large enough for almost any use that you can think of. Most real implementations of PRNGs use large enough seed values that this isn't a problem in the real world. It is important to remember that repetition of a particular sequence doesn't mean that the RNG has looped and is repeating. You can easily imagine that a PRNG will produce a lot of sequences that go "1,1,0,1,0" in 2 x 10^{19}numbers.

There are** a lot **of tests for randomness (the popular TestU01 test suite has 160 tests in it's "Big Crush" test), but the most important ones for scientific use (in my opinion at least) are

- Chi-Square test - Make sure that all symbols that the PRNG can produce are produced at the expected rate
- Serial-correlation test - Make sure that the autocorrelation (degree of relatedness between pairs of numbers) is zero. This ensures that you can't predict that you will always get the number 8 in three iterations if you get 3 now
- Spectral test - How densely the numbers generated by the PRNG fill the space of all possible numbers (that is a massive simplification of the spectral test, see here and links for more details)

These three tests are well passed by modern PRNGs so for most **non cryptographic** (I stress this again!) purposes will be fine. If you are doing cryptography follow the best practices of the cryptographic community. **Better yet do not write your own cryptographic code**!

The final question is : how do you choose a good seed to start from? That is quite easy for most scientific or technical problems. If you want a random sequence but want to be able to reproduce it reliably you should choose a seed youself by any mechanism that you want, for a good PRNG there isn't a bad seed. If you want it to be random every time you run your code you need some source of entropy to bootstrap the PRNG. The classic bootstrapping entropy is to use the system clock, since the exact time at which you run your problem is probably random. The "better" way is to use one of the system entropy sources such as /dev/random and /dev/urandom on POSIX systems or CryptGenRandom on Windows. These will make use of any hardware RNGs that the system has (where appropriate) and also generates entropy based on other sources of randomness available to the system.

Final note: **IT IS VERY IMPORTANT TO RUN YOUR PROBLEM WITH DIFFERENT SEEDS**! Unless your result is robust to different random initial conditions it isn't a result, it is an artifact of a particular random sequence that shouldn't priviliged compared to other random sequences.

PRNGs are an efficient way of generating uniformly distributed random numbers, but what if you don't want uniformly distributed random numbers? That will be covered in a later blog post.

## May 02, 2018

### Hierarchies of Git(s)

## Version Control and Git

Version control of some sort is essential in a software project. Suppose you make a change and your code no longer works - how can you go back? Suppose you find something you don't understand - who made that change? If you publish work using a code, can you recover the exact state at a later time? While carefully maintained records can go a long way, these exact problems led to the development of sophisticated version control systems which could do these things almost automtically. One of the most popular systems in use now is named 'git'.

For a basic rundown of version control, git, and why and how to use it, we have some notes here. *This post is a (very brief) introduction to a useful but relatively uncommon feature, git submodules. *

The name git, by the way, is no coincidence. It is a handy, pronounceable, 3-letter combination, that wasn't already in use for any Linux utilities, but also a convenient descriptor for the project creator, Linus Torvalds of Linux fame.

## Great Fleas have Smaller Fleas

On one occasion we had a request for support in which a user was getting a cryptic error messages about `fatal: bad revision 'HEAD'` when attempting any git commands in any directories. In particular, he was unable to build our code because it used a `git describe` to obtain version information. After a bit of digging, we were able to work out that somewhere in a directory above our source code, he must have run a `git init` command.

This left our git repository in a invalid state and most git commands unusable. The user recalled typing the command, but not where he did it, so first we had to find the bad directory level. Since commands were failing, we could simply work upwards one directory at a time running `git describe` until we instead saw `fatal: Not a git repository (or any of the parent directories)`. A better solution is to use the command `git rev-parse --show-toplevel` which displays the current top-level git repository name.

Running this, it appears he had managed to do this in his `home` directory, leaving any and all git repositories among his files in this state. Thankfully the final remedy is simple: remove the invalid git directory by simply deleting the `.git/` folder at the wrongly initialised level. Much easier than feared!

## Submodules

The proper way to have this sort of nested git project is using a relatively uncommon feature called submodules. These are separate git projects which can be imported and used, and are commonly used for deeply embedded library code. A submodule is a repository in its own right and can be checked out alone, and work on it should generally be done separately in this way.

For most people their only encounter with submodules will be the following two commands. When cloning a project that uses them, clone all submodules also using:

git clone --recursive [repo]

When making and removing small, temporary alterations within the submodule, some times when the submodule changes in the repository you cloned from, or any time you see a message like

-Subproject commit abc123xyz +Subproject commit def456uvw

in a git diff, the command:

git submodule update --recursive

will update all submodules to their current versions. The 'recursive' flag ensures all submodules, including any within other submodules, are updated.

## But What Actually Are Submodules?

Basically, a submodule is a separate repository, connected to the main project. This connection is via a single commit, and a submodule within a main project is in a detached-head state. As of git 1.8.2 submodules included in a larger project can specify which branch of the submodule project to link to.

For all the gory details of working with submodules, see resources such as

https://www.git-scm.com/docs/gitsubmodules

https://git.wiki.kernel.org/index.php/GitSubmoduleTutorial

and https://blog.github.com/2016-02-01-working-with-submodules/

For now, you can think of them as a way to include 'library' type code which is under active development into several other projects and also allow it to stand alone.

## Rescuing a Bad Chain

Sadly it is not particularly difficult to make a small alteration in a submodule (such as temporarily amending a compiler flag) and find yourself with a main repository insisting that there are uncommitted changes in the submodule, or worse, an unexpected detached head somewhere. Don't lose your head over this (sorry). Sometimes the submodule update command is all you need. Other times you will have to reset all the directories. First, commit or stash any changes you want to keep! Then the command

git submodule foreach --recursive git reset --hard

will remove all changes in **all** submodules, and a final

git submodule update --recursive

should put everything back to normal.

## Summary

Git submodules are a powerful, sometimes tricky feature, which basically allow the inclusion of shared code into several other projects while also letting it to stand alone, to be released as a library and/or developed as a separate project. If you encounter submodules it will probably be as simple as somebody's repo needing to be cloned with the --recursive flag. To contribute to the submodule, clone it and only it instead. If you get stuff in your `git status` about changes in a submodule within a larger project try the update command.