June 27, 2018

More randomness

Dealing with randomness and random samples of distributions is a huge area, but we're going to finish these posts by considering one last big thing that we haven't touched on at all : random distributions in multiple variables. These look at the question of what happens if you have a set of related variables where the probability of one having a given value of one variable affects the probability of having a given value in another variable. You do have to be a little bit careful though because not all probability functions in more than one variable have this property. A classical example from physics of one that doesn't is the distribution of velocities of particles in a gas.

Particles in a gas are free to move in any direction that they want to, so they have three velocity components since the universe is has 3 spatial dimensions (at least to a good enough approximation for this problem). In fact, the distribution of velocities is given by the Maxwell-Boltzmann distribution, which is basically the Gaussian or Normal distribution that we saw earlier. Mathematically, this looks like

Maxwell-Boltzmann distribution

Where Vx, Vy and Vz are the x, y and z components of the particle's velocity, m is the mass of the particle, k is a constant called Boltzmann's constant and T is the temperature in Kelvin (Which is the same as centigrade but offset so that 0C is 273.15K. 0K is "absolute zero" which doesn't exist in nature, so it's fine that this equation breaks at T=0). If you've not seen it before the alpha symbol rather than = means that I'm hiding a constant multiplier in the equation. Using the properties of exponentials we can rewrite this as

Maxwell-Boltzmann distribution separated

Imagine now that you have chosen Vx. That first exponential then just becomes a constant and can be absorbed into the constant that I'm hiding in that alpha symbol so the distribution of Vy and Vz aren't changed by selecting a Vx. That means that I can simply choose Vx, Vy and Vz independently and since they are all Normally distributed I can use the Box-Muller transform that I mentioned in the previous randomness post so I can sample it very fast. Always look at the mathematical description of your distribution to see if you can break it down like this into simpler forms (in more technical language this is termed "seperability" of the equations). It's always faster to sample on a smaller number of variables if at all possible.

But there are some distributions that really can't be separated like this, for example consider

Made up distribution function

as far as I know this isn't a distribution that anyone is interested in, but it certainly could exist. You can't separate this into functions of Vx, Vy and Vz so you have to sample it in all three variables directly. In the previous post we mentioned inverse transform sampling. This involves forming the cumulative distribution function from the distribution function and sampling along that curve. You can in theory do this for multdimensional functions as well. You have to define a space-filling curve that covers the area of interest of the function and then integrate the PDF along this curve and then finally do inverse sampling along this curve. Doing inverse transform sampling analytically is already tricky and working along an arbitrary curve really doesn't help matters, so in general you have to do it numerically. This means that you have to define your function on a grid of some kind. The problem is that this grid has to have as many dimensions as you have variables, so in this case there's Vx, Vy and Vz so 3D. The more points you have in each dimension of your grid the better your numerical approximation is to the real distribution (you can do things like interpolating between grid points to make it better than the simplest implementation but this is still true overall). So if you want to have 1000 grid points in each direction for this 3D example you'd need between 4 and 8 GB of memory (depending on how you store your numbers, which also affects accuracy). That's possible on a pretty typical computer nowadays, but still a lot of memory. If you had an even harder 4 variable sampling function then this rises to 4-8 TB of memory which is the preserve of high end workstations costing tens of thousands of pounds, and by 6 variable sample functions it couldn't even be run on world leading supercomputers.

Fortunately there is an alternative called rejection sampling. It's much simpler to implement too, although the downside is that it can be much slower to run. You can comparatively easily change any distribution function so that it has a maximum value of 1. Simply find the maximum value in the range of interest of all variables (either analytically or numerically) and divide the distribution function by this value. What you have now is often called the "acceptance function". That is, if I randomly select a value for each variable (selecting this value as a uniform random number in some range of interest), what is the probability that I should accept this value and include it in my sample? This is why the you want a function that has a maximum value of 1 - you want there to be an "optimal" place where acceptance is guaranteed. Then you simply roll one more random value and if it is less than the random probability that you get from the acceptance function then you should take that value, add it to the list of values and continue. If it isn't then you just randomly select new values of all the variablesand keep going until the value is accepted. When you have accepted enough values for your purpose you just stop sampling.

This works beautifully and doesn't require that you allocate any arrays for your distribution function, so long as you can calculate it for any possible value of your variables. Since there's no array there's no problem with discretising the results and you get a nice smooth sample of your distributions function. The problem is in all of the times that you have to throw away sample points because they are rejected. This means that in general rejection sampling is quite slow. While there's no need to, pretend that you want to sample a Normal distribution in a single variable using rejection sampling. If you want to sample it out to one standard deviation from the mean then the average acceptance probability is 86% (ish), so it'll take about 16% longer to compute than it would to be able to simply specify the sample value directly (this isn't quite true because Box-Muller is quite complex and takes longer than a simple rejection sample, but I'm going to ignore this here). Unfortunately there are very few examples of problems where one standard deviation would be enough. In fact, a normal distribution to 1 standard deviation looks quite odd


Normal to 1 standard deviation

so you generally have to go further out. At 2 standard deviations you're now accepting about 60% of trials, so you're taking about 66% longer than simple sampling. At 3 standard deviations you're at 42% average acceptance and you're talking about taking 2.4 times longer than a simple sample. At 4 standard deviations you're now down at 31% average acceptance and you've having to do 3 trials for an average sample until you accept it. Unsurprisingly this takes about 3 times longer than simply being able to specify the value, but finally your Normal curve looks about right by eye. As you can see, rejection sampling really should be your last resort since Normal curves are about as benign as distribution functions get - most functions that you'd actually want to run acceptance sampling on have lower acceptance probabilities by the time that you are sampling everything that you want to.

There are more advanced algorithms that are better than rejection sampling (try looking at the Metropolis-Hastings algorithm if you want a place to start), so you're not completely out of luck if you've reached this point and rejection sampling is too slow, but things do start getting more complex from here on in.


June 13, 2018

Fix Me

Fancy Latex Customisation

Writing up notes and papers, I often find that I want to leave myself reminders, hints and other snippets that I'd rather didn't make it into the final draft. Or I want to collate notes from throughout my text into a file I can put in my appendix, such as a list of future work. I often think of Latex as just a typesetting language, but Tex itself is actually a fully Turing complete language and Latex isn't too far off. The hardest bit of Turing completeness is usually the ability to do conditional branches which Latex can do with a bit of work. Thankfully there's a common package that takes some of the load off us.

NOTE: many journals wont accept manuscripts containing custom commands, so don't forget to remove your commands from papers you submit in future.

The Latex snippets are available on our Github page and are described below in more detail.

Defining Latex Commands

Firstly, we use the Latex \newcommand to do some custom formatting we can easily vary, in this case to make our text bold. \newcommand is fairly simple. The syntax is

\newcommand{\<name>}[<n_args>]{<command>}

where the parts in <> should be replaced with appropriate text. Thist creates a command called "name", taking "n_args" arguments and applies "command". The arguments are referred to using a '#' within your command, as in the following example where we feed the text inside the \fbrtext braces into \textbf.

\newcommand{\fbrtext}[1]{\textbf{#1}}

If there are no arguments, you can omit the [<n_args>] part entirely.

We can define as many commands this way as we like, but they must not be already defined. To redefine something, use \renewcommand (but carefully, making sure not to clobber something important like \section).

Collating Snippets with Output Streams

The first snippet is to collate some things into a file so we can review things we'd like to fix or add to our document before we release it. Snippet 1 is a simple Latex document which does this. We put the commands in a .tex file named ReleaseFixes, and include this into our main document, so that we can reuse it easily.

The core part of this is a package called "newfile" which allows you to write to files as the Latex compilation runs. This is how things like index and glossary packages work at heart, creating a temporary file that can then be included. This package lets us create "streams" for output, that we can write to, which may be familiar from Java, C++ or other languages. We name our stream "releasefixes", and then open a file to which it should output.

Now we create the main command, which takes one argument, a text chunk. This chunk will appear in our document in bold red text, and will also appear in our output file. In the output file, it'll show the line number (in the input .tex file) where the text appears. We could also include the page number in the output pdf if we wanted or other such useful commands.

Including Snippets into our Document

Perhaps we want to include the snippets we've generated into an appendix, so that we can send the draft to somebody else and they have a handy list of all the changes we want to make, or things we need filling in. Or maybe we want to use a similar command to list the changes we've made to a draft for easy review.

In Snippet2 we adjust our command so that rather than creating a simple text file, we put a '\item' command at the start of every item. Then we add a section to our document, and include the file we generated, inside an \itemize environment. This should give a neat bullet-pointed list of things to fix.

This doesn't quite work yet: the stream seems to be empty at the point where we read it, even though the file is complete when we look afterwards. When you have an unexpectedly empty output file it is usually because the stream didn't get "flushed". Flushing an output stream means moving its contents from a buffer in memory to the actual output file. This could be done every time we write to the stream, but if we do lots of tiny writes it is much slower this way than waiting until we have "enough" and writing in a chunk.

Normally, output flushing happens when the compile ends and everything gets closed, but we want access to our file before this. So we insert an explicit \closeoutputstream in just before our new section to write the file immediately, and then we can open it for input. We include its contents, and they are then typeset just like a static included file.

Using ifthen to Typeset Variants

Suppose that rather than showing us the changes, we want a way to include or exclude them from the output file entirely. This is really easy to do with a quick edit, because we just redefine our \fbr command to omit the text instead, and compile again. Easy for one command, but what if we have several things we want to change for our "draft" mode? It's tedious and error prone to edit them all.

A perfectly good solution for this precise case, by the way, is to create two versions of our "ReleaseFixes.tex" file, one for draft mode, and one for release mode, and edit our file to swap between them

\include{ReleaseFixes_draft.tex} %Swap with next line for release

%\include{ReleaseFixes_release.tex}

However, this is a good demonstration of conditionals, so let's try that. In Snippet3 we use the package "ifthen" which provides an "ifthenelse" command, that looks like

\ifthenelse{<condition>}{<command if true>}{<command else>}

The package allows us to define boolean flags. We define one called usered which chooses whether our text should be output in bold or red-bold.

However we can only define flags after we've include the package, and we'd prefer the draft mode selection it in our main document. So we define a normal command, \draft, and check if this is defined instead. We adjust the \fbr command to do nothing in release mode (\draft not defined) or to give us bold text (red if the flag says so) and a file entry in draft mode. Then by un/commenting line 6 of Document.tex we control the final form.

Summary

Latex typesetting gets a lot easier once you have a few simple commands at hand. It gets a lot more powerful once you add the ability to do conditionals. "Clever" commands like this should be used with caution, since anybody who wants to compile the document needs the relevant packages, but they are very handy when you need them, and can save a lot of time and effort. And do remember that you'll probably have to remove anything like this if you submit Latex to journals.


May 31, 2018

More randomness and a bit of non–randomness

Normally distributed random numbers

The last blog post was about getting high quality, uniformly distributed random numbers. In this post, I'm going to address the question of what to do if you want to have non uniformly distributed random numbers, that is random numbers where some are more likely than others. One of the ways that we're going to be looking at this is by looking at Probability Distribution Functions(PDF). These graphs are very like histograms, but they instead say "I have a source of random data U. For the entire source U, what is the probability that any given value is in the range x to x + dx, where dx is a finite but usually small deviation from x". In practice, what you do is you split up the range of U into a series of bins of width "dx" and then just count the number of particles in each bin and divide by the total number of particles.


Probability distribution function for uniformly distributed random number

The above graph shows the probability distribution for a uniform random number when you have 1,000 bins between 0 and 1. This number of bins explains why the graph is showing a rough line centred around 0.1% probability. Since a uniform random number generator should produce values in all bins equally, you would expect (100/nbins)% (so 0.1% for 1,000 bins) to lie in each bin. The graph shows that the random number generator that I'm using is working reasonably well.

Probability distribution function for normally distributed random number

This graph shows the same thing for a normally distributed random number (also called Gaussian, or in some fields Maxwellian distributions and sometimes bell curves). It is strongly peaked at zero and extends out to ±6. Normal distributions are very important in many areas of study (see here for some examples), and is found in things as diverse as the distribution of speeds of particles in a gas to the distribution of heights of members of a sports club. As a consequence there is a lot of interest in generating normally distributed random numbers. This is often called "sampling" the distribution. The graphs above were created by getting random numbers and counting how many fell into each bin on the graph, but often you want to go the other way. You want to draw the graph and then get random numbers such that their distribution will match the graph.

Fortunately there's a very easy way of converting from a uniform random number generator to a normally distributed random number generator via an algorithm called the Box-Muller Transform. I'm not going to go into the maths of it which are detailed on the linked Wikipedia page, but if you want normally distributed random numbers the Box-Muller Transform is your friend (usually the specific variant called the Polar Box-Muller Transform). Lots of languages already have built in implementations of normally distributed random number generators, generally using that transform so it's worth looking for those first, especially in compiled languages like Python where the built in function will probably be much faster than yours. However, in languages like C or Fortran which don't have a built in one you probably want to find or write an implementation of the Box-Muller Transform if you need normally distributed random numbers.

Arbitrary distributions in one variable

Sometimes, however, you have a more complicated distribution function that you want to sample. Sometimes there are other transforms like Box-Muller, but quite often there aren't. At which point the most powerful tool you have is called Inverse Transform Sampling. This introduces a new idea the Cumulative Distribution Function (CDF). The CDF is very much like the PDF in many senses, but instead of saying "how much of U is between x and x+dx" it says "how much of U is between min(U) and x", hence the cumulative in the name. It is the accumulation of the probability of the value being at x or lower.


PDF and CDF of Gaussian distribution

The graph this time compares the PDF (left) to the CDF (right) for a normally distributed number. The CDF has a value of 100% at the right because everything has to be less than the largest value. You can create the CDF in a number of ways, but on a computer a very common way is to define a grid of values (x(i), say) and calculate the PDF (PDF(i)) for each of them. This does mean discretising your distribution but that is often sufficient and you can often interpolate the result. The CDF is then simply defined by


CDF(i) = SUM(PDF(0:i))


where i is the index into a zero based array and is evaluated from 0 to n, where n is the number of elements in the array.

Once you have the CDF sampling it is surprisingly easy. You simply pick a uniform random number between 0 and 100 and then find out which value of i is the cell in which the CDF exceeds that value. You then have sampled the value x(i). You keep repeating this until you have sufficient values. If you then check the distribution of the values it will match the PDF that you specified. Effectively this divides the PDF into bins of equal area under the curve, and selects one bin at a time with equal probability. On the CDF graph, equal area bins correspond to a uniform division of the y axis.

The actual implementation of this can be a bit messier because while the easiest way of finding which cell of the CDF exceeds your random value is simply to start at i=0 and count up, the fastestway is to use binary bisection on the index. This works because the CDF is guaranteed to increase from left to right, so if you pick any point you know if the point that you want is to the left of that point if the value is smaller and to the right of the point if the value is larger. So you can keep dividing the space into sections until you have found the value that you want. But that is an implementation issue rather than a conceptual one.

Note also that if you can integrate the PDF to obtain a CDF, which you can then invert, you can do all of this analytically, but even simple PDFs like a Normal distribution don't have an invertible CDF.

Distributions in more than one variable

Distributions in more than one variable can be tricky. But if the function is separable, that is it can be written as two functions multiplied together, each of which is in a single variable (f(x, y) = g(x) * h(y)) each part can be sampled as above. For each complete sample you take one random sample of the function g, and one of h, and multiply them together.

Functions in more than one variable which don't break down like this are generally treated using a technique called accept-reject sampling, or some more sophisticated technique in tricky cases. More on this next time.


May 16, 2018

Randomness for computer programmers

Random numbers appear in almost everything that involves computers. The cryptographic algorithms that underpin almost all online commerce are based on random number generators, in academic use there is the entire field of Monte-Carlo simulations that use random numbers to generate and test configurations, and many computer games use random number generators to generate the rich behaviour that is observed without having a human have to program everything manually.

Despite their ubuquity though, random number generators (RNGs) are a little regarded element of computational techniques in most areas of academic use. Before I start, I want to stress something if you have been working with random numbers and haven't worried about the things that I'm talking about you are probably fine. You have to be careful with RNGs, but mostly any ones that you will encounter in the wild are good enough.

The first thing to do is to distinguish between true (or hardware) random number generators and psuedorandom number generators. True random number generators are hardware devices that convert random physical properties into random electrical signals and then convert them into random numbers. Even these are split into ones that use truly random sources (like the decay of radioactive nucleii) or chaotic sources (like atmospheric turbulence), but in general they produce sequences of random numbers that are non-repeating and unpredictable. You can read more about them here. True RNGs produce a signal that has maximum entropy for a signal of that length.

You might remember the concept of entropy from physics where it is a measure of disorder, so a system that is disordered (like a gas) has higher entropy than an ordered system (like a solid). Information entropy is very similar and represents a measure of surprise. If you have a signal that reads "1, 1, 1, 1, 1, 1, 1, 1" then being told that the next number is also "1" isn't much of a surprise, so that signal has low information entropy. If you have the signal "1, 1, 0, 1, 0, 1, 0, 0" then you'd be much harder put to guess what the next symbol is going to be so that signal has a much higher entropy. Information entropy is actually a fairly complex concept that can be measured in different ways (you can see more here) and explaining it in detail is well beyond the scope of a blog post, but you can easily see some of the complexity by an analogy. You could say that my second signal ("1, 1, 0, 1, 0, 1, 0, 0") was higher entropy than my first ("1, 1, 1, 1, 1, 1, 1, 1") because it has as many "1"s as "0"s, but what about the signal "1, 0, 1, 0, 1, 0, 1, 0"? On the one hand it has as many "1"s as "0"s, but on the other hand it can be represented by a very simple pattern and so you'd be less surprised if the next digit was "1" than if it was "0".

While true RNGs are extremely important in the real world, particularly in areas such as cryptography, most academic users don't use true random number generators, or use them only in combination with pseudorandom number generators (PRNGs).

PRNGs are algorithms that take a simple seed value and then operate on it to generate a sequence of numbers that has as many of the properties of a true RNG as possible. They are, by definition, not random. If you give them the same seed then they will always generate the same sequence, so they cannot contain any more entropy than is contained in the combination of the seed value and the algorithm. This means that they always perform worse than true RNGs. They do however work well enough for almost all scientific and statistical purposes if you choose a good one. Unless you are working on a system that requires Cryptographically Secure PseudoRandom Numbers (and if you are I'd hope that you aren't getting your information from a blog post!) then almost any modern PRNG is a good one. The only problem is that language standards don't update very fast, so some programming languages don't insist on a modern one in the standard library. Java uses an old algorithm called a Linear Congruential Generator, the C and Fortran standards don't require anything specific about the algorithm to be implemented and the popular GFortran compiler uses a very modern algorithm if you use the modern Fortran RANDOM_NUMBER interface and a very old one if you use the older RAND interface. The main thing to take away is that you should be very careful using language supplied random number generator. You cannot rely on the quality.

The ultimate limit on the use of PRNGs is that they inevitably have a period of repetition. That is a finite number of random numbers that it will generate until the same sequence of numbers will repeat. If you use n bits for the seed then the longest possible number of iterations until it repeats is 2n. This means that if you use a single byte (8 bits) for your seed you will see a repeating sequence after 256 random numbers have been drawn. That is far too low for most practical purposes. On the other hand if you use a 64 bit value then you will have a bit less than 2 x 1019 numbers before you see the sequence repeat, which is large enough for almost any use that you can think of. Most real implementations of PRNGs use large enough seed values that this isn't a problem in the real world. It is important to remember that repetition of a particular sequence doesn't mean that the RNG has looped and is repeating. You can easily imagine that a PRNG will produce a lot of sequences that go "1,1,0,1,0" in 2 x 1019numbers.

There are a lot of tests for randomness (the popular TestU01 test suite has 160 tests in it's "Big Crush" test), but the most important ones for scientific use (in my opinion at least) are

  1. Chi-Square test - Make sure that all symbols that the PRNG can produce are produced at the expected rate
  2. Serial-correlation test - Make sure that the autocorrelation (degree of relatedness between pairs of numbers) is zero. This ensures that you can't predict that you will always get the number 8 in three iterations if you get 3 now
  3. Spectral test - How densely the numbers generated by the PRNG fill the space of all possible numbers (that is a massive simplification of the spectral test, see here and links for more details)

These three tests are well passed by modern PRNGs so for most non cryptographic (I stress this again!) purposes will be fine. If you are doing cryptography follow the best practices of the cryptographic community. Better yet do not write your own cryptographic code!

The final question is : how do you choose a good seed to start from? That is quite easy for most scientific or technical problems. If you want a random sequence but want to be able to reproduce it reliably you should choose a seed youself by any mechanism that you want, for a good PRNG there isn't a bad seed. If you want it to be random every time you run your code you need some source of entropy to bootstrap the PRNG. The classic bootstrapping entropy is to use the system clock, since the exact time at which you run your problem is probably random. The "better" way is to use one of the system entropy sources such as /dev/random and /dev/urandom on POSIX systems or CryptGenRandom on Windows. These will make use of any hardware RNGs that the system has (where appropriate) and also generates entropy based on other sources of randomness available to the system.

Final note: IT IS VERY IMPORTANT TO RUN YOUR PROBLEM WITH DIFFERENT SEEDS! Unless your result is robust to different random initial conditions it isn't a result, it is an artifact of a particular random sequence that shouldn't priviliged compared to other random sequences.

PRNGs are an efficient way of generating uniformly distributed random numbers, but what if you don't want uniformly distributed random numbers? That will be covered in a later blog post.


May 02, 2018

Hierarchies of Git(s)

Version Control and Git

Version control of some sort is essential in a software project. Suppose you make a change and your code no longer works - how can you go back? Suppose you find something you don't understand - who made that change? If you publish work using a code, can you recover the exact state at a later time? While carefully maintained records can go a long way, these exact problems led to the development of sophisticated version control systems which could do these things almost automtically. One of the most popular systems in use now is named 'git'.

For a basic rundown of version control, git, and why and how to use it, we have some notes here. This post is a (very brief) introduction to a useful but relatively uncommon feature, git submodules.

The name git, by the way, is no coincidence. It is a handy, pronounceable, 3-letter combination, that wasn't already in use for any Linux utilities, but also a convenient descriptor for the project creator, Linus Torvalds of Linux fame.

Great Fleas have Smaller Fleas

On one occasion we had a request for support in which a user was getting a cryptic error messages about `fatal: bad revision 'HEAD'` when attempting any git commands in any directories. In particular, he was unable to build our code because it used a `git describe` to obtain version information. After a bit of digging, we were able to work out that somewhere in a directory above our source code, he must have run a `git init` command.

This left our git repository in a invalid state and most git commands unusable. The user recalled typing the command, but not where he did it, so first we had to find the bad directory level. Since commands were failing, we could simply work upwards one directory at a time running `git describe` until we instead saw `fatal: Not a git repository (or any of the parent directories)`. A better solution is to use the command `git rev-parse --show-toplevel` which displays the current top-level git repository name.

Running this, it appears he had managed to do this in his `home` directory, leaving any and all git repositories among his files in this state. Thankfully the final remedy is simple: remove the invalid git directory by simply deleting the `.git/` folder at the wrongly initialised level. Much easier than feared!

Submodules

The proper way to have this sort of nested git project is using a relatively uncommon feature called submodules. These are separate git projects which can be imported and used, and are commonly used for deeply embedded library code. A submodule is a repository in its own right and can be checked out alone, and work on it should generally be done separately in this way.

For most people their only encounter with submodules will be the following two commands. When cloning a project that uses them, clone all submodules also using:

git clone --recursive [repo]

When making and removing small, temporary alterations within the submodule, some times when the submodule changes in the repository you cloned from, or any time you see a message like

-Subproject commit abc123xyz 
+Subproject commit def456uvw

in a git diff, the command:

git submodule update --recursive

will update all submodules to their current versions. The 'recursive' flag ensures all submodules, including any within other submodules, are updated.

But What Actually Are Submodules?

Basically, a submodule is a separate repository, connected to the main project. This connection is via a single commit, and a submodule within a main project is in a detached-head state. As of git 1.8.2 submodules included in a larger project can specify which branch of the submodule project to link to.

For all the gory details of working with submodules, see resources such as

https://www.git-scm.com/docs/gitsubmodules

https://git.wiki.kernel.org/index.php/GitSubmoduleTutorial

and https://blog.github.com/2016-02-01-working-with-submodules/

For now, you can think of them as a way to include 'library' type code which is under active development into several other projects and also allow it to stand alone.

Rescuing a Bad Chain

Sadly it is not particularly difficult to make a small alteration in a submodule (such as temporarily amending a compiler flag) and find yourself with a main repository insisting that there are uncommitted changes in the submodule, or worse, an unexpected detached head somewhere. Don't lose your head over this (sorry). Sometimes the submodule update command is all you need. Other times you will have to reset all the directories. First, commit or stash any changes you want to keep! Then the command

git submodule foreach --recursive git reset --hard

will remove all changes in all submodules, and a final

git submodule update --recursive

should put everything back to normal.

Summary

Git submodules are a powerful, sometimes tricky feature, which basically allow the inclusion of shared code into several other projects while also letting it to stand alone, to be released as a library and/or developed as a separate project. If you encounter submodules it will probably be as simple as somebody's repo needing to be cloned with the --recursive flag. To contribute to the submodule, clone it and only it instead. If you get stuff in your `git status` about changes in a submodule within a larger project try the update command.


April 20, 2018

Proper Planning and Preparation

There is an old, supposedly military adage, which reads

Proper Planning and Preparation Prevents Piss Poor Performance.

This applies to many things, including software development projects. There is also a popular idea known as Hofstadter's Law:

It always takes longer than you expect, even when you take into account Hofstadter's Law.

This is an amusing idea, but is sometimes allowed to apply to real life, which rapidly spirals out of control. There are many reasons a software project keeps taking longer the more you work on it, most of which are best avoided. This post is a brief introduction to how to plan, how to work, and how to keep things on track.

Software for Research

RSE is Research Software Engineering, which can mean many things, but ultimately means we are in the business of writing software used for doing research. Similarly, researchers-who-write-code should be aiming to for the least work which gets them reliable, verifiable, useable code. Note that usable includes being fast enough, and reliable includes doing enough testing.

This makes planning and scheduling essential, to make sure work gets done in time to produce research outputs, but also means it should be lightweight, simple and flexible.

Why Plans Fail

No plan survives contact with the enemy

(A well known paraphrase of a quote by H. von Moltke). Just like Hofstadter's law, one can add the corollary 'even if you include that in the plan'. Knowing the challenges ahead of time lets you plan around them, which gives the vast chance of something from the plan surviving. The following are the commonest sorts of issues to consider.

Shooting in the Dark

It's quite common to start a project not knowing exactly how to do it. This is fine, but has to be allowed for in the plan. For instance, allowing time to write a small test problem using a new library/tool before jumping into the main project, or allowing time at the end to rewrite/refactor (see below). Do not fall victim to the sunk costs fallacy and continue something unpromising just because you have already invested time and effort into it.

Scope Creep

Scope creep is probably the biggest problem many people face with research code. What starts as a simple script turns into a complex behemoth, and never seems to end. A vital part of planning, particularly if you are shooting in the dark, is to decide exactly what the code must do, and once that goal is reached, to stop. This is the perfect point to rejig things, debug well, document, optimise etc, before making a new plan to add all the things you now know you need. Endless rolling feature addition is risky.

Debug Deluge

It isn't uncommon to see a programmer writing pages of code before ever testing if it runs, or even compiles. This goes away with experience, but is often replaced by a related issue, of writing a bunch of stuff and then finding out that it doesn't quite work and has created a debugging nightmare. A better approach is to work in chunks, making each bit work, testing it, and only then integrating it into the whole. This way there's no deluge of debugging to be done, which can take a very long time; only small pieces at a time which are easier to plan for. This is related to the programming style called Test Driven Development (see below).

Premature Optimisation...

...is the root of all evil (Donald Knuth). Until your code works, there is nothing to optimise, as you'd just be doing the wrong thing faster. However, it is important to consider what the code needs to do in the end, and design and plan something that will be fast and efficient enough.

Refactor or Rewrite

Refactoring is rewriting pieces of a code so that the blocks (usually functions) give the same answers but are simpler, faster, more maintainable etc. Rewriting covers more substantial changes, such as changing core data structures or moving from hard-coded input values to a configuration file etc.

Getting Agile

If you have seen any of the Indiana Jones films, you may notice he applies something very like this. Rather than crafting a complex plan and trying to implement it in the face of changing circumstances, he adapts to things as they happen and is never more than a step, or a few steps ahead. This is basically the idea behind the 'Agile Methodology' of software development.

Since clients (or in our case, a research problem) change what they need over time, the idea is to plan only a few weeks ahead, adapting to changing needs as they arise. This is great for commercial development, which is all about happy customers, but does have a few pathologies for research software if done strictly. The main concerns are stopping yourself letting your scope creep out of control, and getting good enough time estimates to be sure you can complete a project, and that it will benefit you.

It's well worth reading a few articles about Agile to see what it's all about, and taking any bits you feel are useful for your own development.

Test Driven Developement (Lite)

'Test-driven-development' (TDD) means writing tests first, and considering a piece of code complete once it passes them, at which point it is done and work should stop. This has plenty of advantages, in particular that you avoid the pitfall of writing the test already knowing how the code works, which often leads to writing a test you know will pass.

The over-literal genie crops up all over the place and something very similar happens when you give clever people a set of rules. For a lot of scientific code, the natural test set is a handful of analytically soluble problems, for which exact answers are known. Using these as the test suite, the obvious, and strictly correct, TDD approach is just to hard-code their solutions. Every test passes, and the code is complete and usually completely useless too.

Rather than strict Test Driven Development, we'd recommend using writing tests for functions before or just after you finish them, making sure they work for all the parameter ranges etc that you need especially things like negative numbers, and then keeping the tests for when or if you change those functions. Rather than running everything every time you make changes you can just run the relevant tests and satisfy yourself that nothing is broken and then focus on integrating pieces together and making sure they all work in combination.

Planning for a Project

Everybody has their own way of planning, and really the only important thing is to do something so that you can estimate how long something will take and keep yourself honest along the way. For small things you only need to take an educated guess how many hours or weeks something will require. For more complex projects you may want to consider some kind of itemised planning process.

This file project_estimatorwexample.pdf is an example of such a planning worksheet, including an example based on my last project. This sheet tries to calculate a time requirement based on some common factors in research code. The multipliers and scalings are only a guess, and the last section adds an arbitrary buffer factor, but this would hopefully provide a starting point for making formal time estimates.

Last Word

This is quite an eclectic post, running through a few bits of useful knowledge that often get left out of programing introductions. The core point is simply

If you fail to plan, you are planning to fail!

Make a plan, stick to it as far as possible, but don't plan so far ahead that you're constantly wasting time re-forming your plans instead of getting work done!


April 04, 2018

Valgrind and Extended Precision

Valgrind is an invaluable tool for checking for memory errors in compiled code. Briefly, it runs your code inside its own runtime which includes checked versions of system functions like malloc, so can detect undefined variables, memory leaks and much more. Using this on one of my C++ codes, I found something rather odd. When I ran normally, I had no problems, but inside valgrind I got unexpected NaN results. The actual code, in Minimum Working Example form was

#include <boost/math/special_functions.hpp>
#include <iostream>
int main(){

  double arg = -5.58377179584844451;
  std::cout<<"Bessel call "<<boost::math::cyl_bessel_j(5+1, (float) arg)<<'\n';
  std::cout<<"Bessel call with double "<<boost::math::cyl_bessel_j(5+1, arg)<<'\n';

}

Running normally, I got the expected answer of 0.196642 for both calls. Run inside valgrind, the float version was unchanged, but the double version became NaN! This was very strange. Until I constructed the example above, I assumed the bug was elsewhere within my code, and not with the Bessel functions themselves.

After a lot of head-scratching and searching, I found a link to the valgrind manualand the answer:

Precision: There is no support for 80 bit arithmetic. Internally, Valgrind represents all such "long double" numbers in 64 bits, and so there may be some differences in results. Whether or not this is critical remains to be seen.

After a bit more searching and then checking of DBL_MIN and LDBL_MIN from limits.h I had an answer:

#include <limits>
#include <iostream>
int main(){
  std::cout<<__LDBL_MIN__<<'\n';
  std::cout<<__DBL_MIN__<<'\n';
  std::cout<<"Minimum value is "<<std::numeric_limits<long double>::min()<<'\n';
}

Normally one gets a result like:

3.3621e-4932
2.22507e-308
Minimum value is 3.3621e-4932

In valgrind once again:

0
2.22507e-308
Minimum value is 0

So the limited 80-bit support referred to was, on my platform at least, breaking __LDBL_MIN__ which should be the smallest representable non zero value. Since the real problem is happening off inside boost::bessel somewhere, there's no simple fix. Most likely, the problem is that boost uses __LDBL_MIN__ to soften a division, preventing a divide by zero which later results in a NaN. The easier workaround requires that you detect running inside valgrind and adjust calculation. This is easy using the supplied macro:

#include "valgrind.h"
#ifdef RUNNING_ON_VALGRIND
  if(RUNNING_ON_VALGRIND){
    //Alternative path here
  }
#endif

Note valgrind.h is specially licensed, differently to the rest of Valgrind, so that including it in your code does not require you open-source your code. Inside the special path, I used a float version of the bessel call, since inside valgrind I am interested in general output, but don't need the full precision of a double.

Another alternative overrides __LDBL_MIN__ before including the boost header. I chose to set it equal to __DBL_MIN__ Note this only works with header-only libraries and is a bit risky. The code becomes something like:

#undef __LDBL_MIN__
#define __LDBL_MIN__ __DBL_MIN__
#include <boost/math/special_functions.hpp>
#include <limits>
#include <iostream>
int main(){
  std::cout<<__LDBL_MIN__<<'\n';
  std::cout<<__DBL_MIN__<<'\n';
  std::cout<<"Minimum value is "<<std::numeric_limits<long double>::min()<<'\n';

  double arg = -5.58377179584844451;
  std::cout<<"Bessel call with double "<<boost::math::cyl_bessel_j(5+1, arg)<<'\n';
}

Sometimes, the bug really is in the compiler or tool!


March 21, 2018

Branch By Boolean

Once upon a time, all programming was for performance. With computer power at a premium, every instruction was a cost. Oddly, the cost of branching (acting differently based on a condition) has got both more and less expensive. Branches are evaluated more quickly now, but improvements in branch prediction (see below) mean branches can be more costly relative to the branchless option. Modern processors are in some ways more similar to the original supercomputers than ever, in particular, their use of vector operations. In this post we talk a bit about what branch prediction means and show a couple of tricks which can be quite useful in code in important loops.

What is Branch Prediction?

Branch prediction refers to the ability of the processor to predict (aha!) which branch is likely, and to get ready to take it. This can happen before the condition is actually evaluated. Once the condition has been checked, if the guess was correct, the operation stream continues, but if it was wrong the incorrect instructions in the queue must be dumped and the correct instructions for the actual branch assembled. This can sometimes be quite costly. Some systems instead execute both possible branches, and throw away the unneeded result, which can halve their possible throughput.

Generally branch predictors are dumb. The commonest option is to assume whatever track was taken previously will be taken again. This is great when you have one rare case (for example, an error condition) and a common one. Sometimes the predictor can 'understand' simple alternation (option 1, then option 2, then option 1...) or similar patterns if they are 'obvious enough'.

How much does it cost, really?

The costs of branch misprediction vary widely, depending on hardware, vectorisation, surrounding code etc. Most profiling tools will give you a miss rate, and reducing this in regularly repeated pieces of code can be worthwhile. A simple demo of branch misses is given here. I have tried to make all paths as similar as possible. The final print statement ensures that the total is used so the loop cannot be simply optimised away.

On my machine, the example compiled without optimisation takes about 0.8 seconds with compile-time option -DBR_3, i.e. the maximum possible number of addition operations, and the case which would take the longest if there was nothing more than simple looping and addition in play. Options 1 and 2 both take about 1 second, and the option with half true, half false, just over 0.8 seconds. This suggests something like the 'same as last time' prediction occuring.

Avoiding Branches

When it does matter, there are several ways to avoid true branches and greatly improve performance in a piece of code. Always profile first, though, to be sure this piece warrants improving.

A Simple Rejig

The first thing to consider is whether a simple rewrite of the code can eliminate a branch, or improve its prediction. Many branch predictors are quite dumb and assume whatever branch was taken the previous iteration will be taken again. If you can rewrite to ensure longer 'runs' of one branch, this may help. The previous section showed one example where this can be quite dramatic.

Do More to Do Less

Sometimes it may even be more effective to do more work, but without a condition, rather than check if the work needs to be done. A trivial example is something like

  DO i = 0, 10000
    IF (B[i] /= 0) A = A + B[i]
  ENDDO

where the check is redundant because if B[i] is zero, the addition does nothing.

Function Pointers

Another useful trick is to use function pointers (or named functions), as discussed in this previous Soup. Rather than putting an 'if' into a vital loop, first work out which function needs to be called and then call it. As given in the previous post, we can replace

  DO i = 0, 10000
    IF (condition) function_1()
    ELSE IF (condition) function_2()
    ELSE IF (condition) function_3()
    ...
  ENDDO

with

  IF (condition) ptr = function_1
  ELSE IF (condition) ptr = function_2
  ELSE IF (condition) ptr = function_3
  ...

  DO i = 0, 10000
    ptr()
  ENDDO

As usual, this may or may not help: in some languages function pointers are slow themselves (not C or Fortran though, but e.g. std::function in C++ can be), and sometimes the called function is already complex enough to break any vectorisation or similar optimisations.

Boolean Branch

One particularly sneaky one is the Boolean Branch, most useful in languages which equate 'True' with 1 and False with '0', but still viable with integers otherwise. The trick is to note that

IF condition THEN
  result = value1
ELSE
  result = value2
ENDIF

is the same as

flag = condition !Flag may be an integer; in C it can be Boolean
result = value1*flag + value2*(1 - flag)

The conditional assignment in the first case is replaced with two simple assignments. This may or may not be faster! Profiling is essential to find out. It is more likely to pay off in a case where the two options alternate, especially when this is irregular, so the branch prediction fails regularly.


March 09, 2018

Odd MPI IO bug in Open–MPI

Quite often working with research code leads you to find the unusual edge cases where even well tested libraries break down a bit. This example that we ran across with the Open-MPI parallel library is pretty typical. We wanted to use MPI-IO to write an array that was distributed across multiple processors, with each processor holding a fraction of the array. Because of how we were using the arrays, the array on each processor had a strip of "guard" cells along each edge that were used to exchange information with the neighbouring processor and these had to be clipped off before the array was written. MPI makes this very easy to achieve using MPI types. (This example is given in Fortran, because that was the language that we encountered the problem in. C would have the same problems)

First you create a type representing the total array using MPI_Type_create_subarray

  sizes = (/nx_global, ny_global/)
  subsizes = (/nx_local, ny_local/)
  starts = (/x_cell_min, y_cell_min/)
  CALL MPI_TYPE_CREATE_SUBARRAY(ndims, sizes, subsizes, starts, &
      MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, global_type, ierr)

This specifies an array that's globally 1:nx_global x 1:ny_global, and locally 1:nx_local x 1:ny_local. The starts array specifies where the local part of the global array starts in the global array, and depends on how you split your global array over processors. Then you pass this type as a fileview to MPI_File_set_viewto tell MPI that this is how data is arranged across your processors.

The actual data is in an array one bigger on each end (0:nx_local+1x 0:ny_local+1), so we need another type representing how to cut off those additional cells. That's MPI_Type_create_subarray again

  sizes = (/nx_local+2, ny_local+2/)
  subsizes = (/nx_local, ny_local/)
  starts = (/1, 1/)
  CALL MPI_TYPE_CREATE_SUBARRAY(ndims, sizes, subsizes, starts, &
      MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, core_type, ierr)

When you pass this as the datatype to a call to MPI_File_write or MPI_File_write_allyou pass MPI only the 1:nx_localx 1:ny_local array that you promised it when you called MPI_File_set_view. The final result will be an array 1:nx_globalx 1:ny_globalno matter how many processors you run the code on.

The problem was that it wasn't working. When we ran the code we found that everything worked as expected on files < 512MB/processor in size, but when we got beyond that the files were always systematically smaller than expected. They weren't a fixed size, but they were always smaller than expected. As we always advise other people to do we started from the assumption that we had made a mistake somewhere, so we went over our IO code and our MPI types. They all appeared normal, so we started taking parts out of our code. After removing a few bits, we found that the critical element was using the MPI derived type to clip out the guard cells from the local arrays. If we just wrote an entire array using primitive MPI types the problem didn't occur. This was about the point where it started to look like it might, just possibly, be an MPI error.

Next, we created the simplest possible test case in Fortran that replicated the problem and it turned out to be very simple indeed. Just create the types for the filetype to MPI_File_set_view and the datatype to MPI_File_write and write an array larger than 512MB/processor. It even failed if you just coded it up for a single processor! It was unlikely at this stage that we'd just made a mistake in our trivial example code. As a final check, we then replicated it in C and found the same problem happened there. Finally, with working examples and some evidence that the problem wasn't in our code, we contacted the Open-MPI mailing list. Within a few hours, it was confirmed that we'd managed to find an edge case in the Open-MPI library and they'd created patches for the new versions of Open-MPI.


There are a couple of take away messages from this

  1. If you are using MPI-IO and are finding problems when files get larger than 512MB/processor you might need to update your MPI installation
  2. Sometimes there really are bugs in well tested and widely used libraries
  3. It's a good idea to be as sure as possible that you aren't making a mistake before assuming that you've found one.

February 21, 2018

SOUP: Function Pointers

Today's snippet demos function pointers (objects in Python), in particular an array of function pointers. We use them to print a table of arithmetic operations on all of the IEEE special values.

IEEE 754 defines the behaviour of floating point numbers, in particular what happens when numbers get unrepresentable, whether that is too large, too small or plain not-numbers.

Infinity is a familiar enough concept and in floating-point it mostly means a number which is too large to be represented. There's a positive and a negative infinity and most arithmetic works as expected.

Negative zero (-0.0) seems quite odd at a first glance. The sign-bit is set, but in every other way, and in general, it is equal to positive zero. Comparisons like `0.0 == -0.0` are defined to be true and `-0.0 < 0.0` is false. Most languages have a function to copy the sign from one number to another though, and -0.0 works as expected.

NaN, Not a Number, mostly appears as a lament, "Why is it NaN!?" or worse "Why is it NaN again!?" Any operations involving NaN also give NaN as do several others.

It seems strange that `Inf * 0.0` or `Inf + -Inf` both give NaN where they could both reasonably give zero. Philosophically, both are numbers, but completely unknown ones. Inf isn't mathematical infinity, it is just too large to represent, and 0.0 stands in for any number too small to represent. Their product then can be absolutely anything, hence defining it as NaN.

Code snippets in C, Fortran and Python are in our SOUP repo under 001_IEEE. All three use named functions to "dry out" the code: in fact they use an array of them. Note in Fortran this needs us to use a custom type to hold the function pointer, as we can't have an array of simply pointers.

The core of all snippets is the loop over operations and inputs

In C:

  float (*op)(float, float);/*Holds operation function*/
  float(*allOps[4])(float, float);

  allOps[0] = &add; allOps[1] = &sub; allOps[2] = &mul; allOps[3] = &div;

  for(opNum=0; opNum< 4; opNum++){
    op = allOps[opNum];
    
    for(rowNum = 0; rowNum < 7; rowNum++){
      row = allRows[rowNum];
      /*print result of op(row, column)*/
    }
  }

In Fortran (where f_ptr is our type holding a pointer):

  TYPE(f_ptr), DIMENSION(4) :: allOps
  TYPE(f_ptr) :: currOp

  allOps(1)%ptr => add
  allOps(2)%ptr => sub
  allOps(3)%ptr => mult
  allOps(4)%ptr => div
  DO opNum = 1, 4
    currOp%ptr => allOps(opNum)%ptr

    DO rowNum = 1, 7
      row = allRows(rowNum)
      !Print results of currOp%ptr applied to row, column
    ENDDO
  ENDDO

And in Python (using range-based for loops on lists):

  allOps = [add, sub, mul, div]

  for opNum in range(0, 4):
    op = allOps[opNum]

    for rowNum in range(0, 7):
      row = allRows[rowNum]
      #Print result of op(row, column)

Note all three are subtly different in how one assigns the active operation. In C you just get a pointer to a function with the usual address-of operator, &. In Fortran you point your pointer to the proper target using the points-to operator, =>. In Python we just set the operator to equal the function, omitting the brackets () turning this into an assignment, not a call.

Function pointers also have a powerful use in High-Performance code. A long if-else chain inside a loop, which calls a different function in each branch, can be replaced with the same if-chain outside the loop setting a function pointer and then a simple call inside the loop, eliminating the branch. As pseudo-code:

  DO i = 0, 10000
    IF (condition) function_1()
    ELSE IF (condition) function_2()
    ELSE IF (condition) function_3()
    ...
  ENDDO

becomes

  IF (condition) ptr = function_1
  ELSE IF (condition) ptr = function_2
  ELSE IF (condition) ptr = function_3
  ...

  DO i = 0, 10000
    ptr()
  ENDDO

January 2025

Mo Tu We Th Fr Sa Su
Dec |  Today  |
      1 2 3 4 5
6 7 8 9 10 11 12
13 14 15 16 17 18 19
20 21 22 23 24 25 26
27 28 29 30 31      

Search this blog

Tags

Galleries

Blog archive

Loading…
RSS2.0 Atom
Not signed in
Sign in

Powered by BlogBuilder
© MMXXV