All 12 entries tagged Code

View all 34 entries tagged Code on Warwick Blogs | View entries tagged Code at Technorati | There are no images tagged Code on this blog

September 05, 2018

Data structures 2 – Arrays Part 2

Part 2 of arrays is about dynamic (run time) sized arrays. That is arrays where you don't know how large they're going to be until the code is running. First, I'll go through it in Fortran because it is really easy in Fortran, and powerful array operations are one of the major advantages of Fortran.

Dynamic arrays in Fortran are generally called ALLOCATABLE arrays (there are also POINTER arrays, but they are generally harder to use for only fairly specific benefits), and you declare them pretty much the same way that you declare any array in Fortran

INTEGER, DIMENSION(:), ALLOCATABLE :: myarray

You then allocate it using

ALLOCATE(myarray(0:100))

First, note that the DIMENSION(:) syntax in the variable declaration is the same as when you're passing an array to a function where you don't know how big the array is. This is the general way of telling Fortran "I don't know what size this is until runtime" (there are a few features where it's * instead for things like the length of strings, but in general it's :). Also note that I have explicitly allocated the array bounds, with the array running from 0 to 100 inclusive. If I'd just used

ALLOCATE(myarray(100))

then the array would have run from 1 to 100 inclusive. In Fortran you can have any upper and lower bounds for arrays that you want which is sometimes useful. You do have to be careful though because unless you are careful when you pass arrays into functions these bound markers are ignored inside the function and the array just runs from 1:n.

Moving to multidimensional arrays in Fortran is easy.

INTEGER, DIMENSION(:,:), ALLOCATABLE :: myarray
ALLOCATE(myarray(0:100, -1:101))
myarray(10,10) = 1

will create a 2D column major array with the array bounds that I specified in my allocate statement. Job done. Fortran ALLOCATABLE arrays also have the nice property that they are automatically deallocated when they go out of scope so it's much harder to have memory leaks with Fortran ALLOCATABLE arrays (you don't have this guarantee with POINTER arrays because this doesn't make use of reference counting and garbage collection. It relies on the fact that there can only be a single reference to a Fortran allocatable.). If you want more control over when memory is returned to you then you can also manually DEALLOCATE the array

DEALLOCATE(myarray)

Before I start on the C section, I should note one thing: C99 and later standards do define variable length arrays (usually referred to by the acronym VLA). They aren't really very common in scientific code and they have all sorts of oddities about how you use pointers to them and where you can and can't use them (you can't have them in structs for example). Given their general rarity and strangeness (and the fact that they aren't valid in C++ before C++14) I'm not going to talk any more about them, but they do exist and you might want to look them up if you're writing a new code in C. 1D run time arrays in C are traditionally created using the "malloc" memory allocation function. You tell malloc how many bytes of memory you want and it creates a chunk of memory that long and hands you a pointer to it. The syntax is easy enough

#include <stdlib.h>
int main(int argc, char **argv){
int * myarray;
myarray = malloc(sizeof(int) * 100);
myarray[0] = 10;
free(myarray);
}

Note the use of the "sizeof" operator to find out how many bytes are needed to store an integer. If you want to write code that works on multiple machines you'll have to use this because integers are not always the same size. In most senses you can consider a pointer and a 1D array to be very similar in C. When you use the square bracket operator you get the same element of your array in both cases, and in fact the layout of the memory behind the scenes is conceptually similar (although if you want to be precise a static array will generally be allocated on the stack, while malloc gives you memory from the heap). Also, note that I have used the function "free" to delete the memory when I am finished. If I don't do this then there will be a memory leak. C explicitly does not keep track of memory at all. Sadly it gets rather harder in multiple dimensions.

malloc always gives you a 1D strip of memory, and there is no equivalent function to give you a multidimensional array in standard C. My preferred solution is just to allocate a 1D array and then use an indexing function to access the element that you want. For example

#include <stdlib.h>
int main(int argc, char **argv){
int * myarray;
int nx, ny; /*Size of array to be filled somehow*/
int ix, iy; /*index of element to access to be filled somehow*/
myarray = malloc(sizeof(int) * nx * ny);
myarray[ix*ny + iy] = 10;
free(myarray);}

This will work perfectly and will give pretty good performance. Usually you'd write some kind of helper function or macro rather than having ix*ny+iy all over your code, but this shows the general technique. You do have to be careful writing your index function though because [ix*ny + iy] gives you a row major array and [iy*nx + ix] gives you column major. This can be useful but you need to be sure what you're doing. The other problem with this is that you can't just access your array as if it was a compile time multidimensional array. If you try to access this using myarray[ix][iy] you will get a compile error. This is because behind the scenes the [] operator dereferences your pointer with an offset (you can replace myarray[ix] with *(myarray+ix) if you like pointer arithmetic). Because of this when the first square bracket has happened you are just left with an integer, so the second square bracket operator is an invalid operation. So can you keep the multidimensional access? Yes, but there are always downsides.


#include <stdlib.h>

int main(int argc, char **argv){
int **myarray;
int nx, ny; /*Size of array to be filled somehow*/
int ix, iy; /*index of element to access to be filled somehow*/
int ind; /*Loop index*/
myarray = malloc(sizeof(int*) * nx);
for (ind = 0; ind<nx;++ind) myarray[ind] = malloc(sizeof(int) * ny);
myarray[ix][iy] = 10;
for (ind = 0; ind<nx;++ind) free(myarray[ind]);
free(myarray);}


This works by making myarray a pointer to a pointer to an int (int **myarray). You then allocate the outer pointer to be an array of nx pointers to int (note that my first sizeof is now sizeof(int*), not sizeof(int)!). You then go through all of these pointers to integers and allocate those pointers to themselves be ny element long arrays of integers (note that the second sizeof is sizeof(int)). This will work as expected, and you can now index your array with square bracket operators as normal, so what's the problem. The problem is that malloc doesn't give any guarantees about where the memory that you get back from it is located. In this simple test case the memory that you get will probably be layed out in a contiguous block, but in general this won't be true. By splitting your memory allocation up like this you are potentially creating a memory prefetch problem much like the one that you get by accessing an array in the wrong order, with similar effects on performance. There is a fringe benefit for this type of array : because the rows are allocated separately they don't have to be the same length as each other. It's tricky to make use of this property (because you have to keep track of how long each row is yourself) it can be very powerful for certain types of problem. On a related note you will often find that multidimensional arrays in C++ are created using std::vector<std::vector<int>> types. This in general has the same type of problem with memory not being contiguous, as you can clearly see from the fact that you can push back into each vector individually.

The final solution is to create a 1 dimensional chunk of memory and then rather than using an indexer function use a separate list of pointers to the start of the row. There are lots of ways of doing this, and this example isn't a terribly clever one but it does work

#include <stdlib.h>


int main(int argc, char **argv){

int nx = 10, ny=10;
int ind;
int **myarray;
int *buffer;
myarray = malloc(sizeof(int*) * nx);
buffer = malloc(sizeof(int) * nx * ny);
for (ind = 0; ind< nx; ++ind) myarray[ind] = &buffer[ind * ny];
myarray[5][5]=54;
free(buffer);
free(myarray);}


July 30, 2018

Committing Sins

A short one again, this time about commit messages. Some of this, particularly the message formats, is specific to Git, but the rest applies to any version control system, even the trivial one of saving a timestamped version of code files.

What I say here has been said many times, but it always bears repeating, because things like this don't matter... until they do, and you're rummaging through weeks of messages looking for some key word you don't recall and trawling through endless useless messages that only say "Fixed a bug" or "Cleaned up code". Having a code history is only useful when you come to use it, usually when something has gone a bit wrong, at which point you want concise, meaningful and absolutely correct data.

The Unwritten Rules

First, there are some simple rules on how a git commit message should be formatted, that are a bit hard to find written down, although some editors like Vim will highlight commit messages according to them. They are noted in the Discussion section of the man page for git-commit too and are discussed here. The expanded rules are:

  1. Follow any specific rules for a particular project first!
  2. The first line is the subject, and should be less than 50 characters.
  3. The second line must be blank. Any text here is ignored.
  4. The subsequent lines are the message body, and should generally be less than 72 characters.
  5. Use as many lines as you like, but be concise. If you need paragraphs, put blank lines between them
  6. If your code uses an issue tracker, include reference numbers for connected bugs etc
  7. Explain why, and what in the broad sense (see next section)

DRYness and Wit

Next, one of the overarching themes in programming is 'Don't repeat yourself' (DRY). As soon as information appears in two different places, you risk them becoming out of sync, and double the work of changing or understanding the thing. I talked about this a bit in the last post under the banner of "self documenting code". There, the idea is to let the code speak for itself, rather than trying to explain it with comments, saving your comments for the 'whys' and the information not available this way. Commit messages are the same. Anybody can look at the changes you made to the code, and know what precisely you did. The message is your chance to tell them why you did this, how this fixes a problem or what was wrong.

XKCD has a comic on this (he always does) and there are many humoroussites, some ruderthan others, as well as endless blog, forum and user guide entries with their own ideas about what is best. I like to mention once again the 'Principle of least surprise'. Make sure your code does nothing surprising in light of your message. For instance, NEVER commit something with message"Whitespace changes" if it does other things too!

Making History

Finally, there is the question of how large, how contained and how functional commits should be. As usual, there are as many theories about this are there are programmers. Some like to commit little and often, and before sharing that branch with anybody, squash togetherchunks into logical blocks. Some consider that to be a forbidden sort of rewriting of history. There are many, many ways to manage your work, and few rules. But:

  1. Alwaysfollow any guidelines for the project you're working on first and foremost!
  2. Anything you share should compile (if relevant) and run, although it needn't work in all respects. Otherwise, tracking when a bug was introduced can be a nightmare, particularly once your part-working code is merged with somebody else's and the result does what neither of you expected. See below for more.
  3. A large feature should probably be spread across several commits - building it up piece by piece.
  4. What you share should make sense. It's OK to try things out, find they don't work, and go back, but you probably only want that to be part of the history if its useful in showing why it didn't work.

Note: There is a crucial difference between code which is written to not work, and code which doesn't work. The former is things such as a function called 'is_thingy' always returning 'false'. The latter is that function returning something undefined (in fact, one release of the GCC C compiler had exactly the former for its regular-expression parsing code, always returning no-match). However it is usually acceptable for such a function to return an error, or throw an exception, because the calling code can handle that.

Summary

There's nothing in here that hasn't been said many times before, but the real take away is pretty simple - whatever strategy for messages, comments and code style, it should always be about increasing what you know about the code, not repeating it!


July 11, 2018

Maths in Code

Programming Equations

Short post this week, with a simple theme: how should you translate equations into code?

Self Documenting Code

Self-documenting code (aka self-describing code) is code that documents itself, rather like the term 'self-documenting code' already tells you what it means. The idea is to choose function names, variable names and types, and (in relevant languages) Class or Object hierarchies so that they demonstrate what they do. This can mean a lot of things, and tends to mean slightly different things to different people. In general, especially in cases of ambiguity, one aims for the "principle of least surprise" - what is the least unexpected meaning of something with this name? For instance, old fashioned C programmers, might consider "i" to be a perfectly self-documenting loop variable, as the usage is very familiar.

This leads us nicely into the spirit of self-documentation. Programming is hard. Choosing sensible names and structures reduces the effort needed to keep track of what things are, as well as everything else in your program. This means that the names you choose should help you, and whoever reads or uses your code, work out what things mean. Imagine coming back to your code after weeks or months not working on it. Can you read it through and guess, correctly, what things mean? If not, consider reworking your namings or structure.

Names shouldn't just repeat information already available though, as that wastes space which could be better used. For example, a boolean or Logical variable already tells you that this is probably some sort of flag, controlling whether or not something should happen or has happened. "write_to_file" is a useful name. "flag" is not. "is_open" is a useful name for a flag associated with a file, "open_flag" is rather less so.

On this note, there is a common naming scheme which uses so-called Hungarian Warts, where some sort of 'type' information is prepended using a few (< 3) letters. Originally these were used to add 'extra' information, such as 'us' for 'unsafe string', which can be very useful. In languages like Python, where definitions don't specify type info, even just tages like 's' for string can be useful, as long as you make sure this is actually true!

Don't go mad with descriptions though. Supposedly the average person can keep 7 things in mind at once so you definitely don't want to occupy all that space with your variable names. 2 or 3 pieces is about the limit for most names, so try and use things like "input_file" not "file_to_read_data_from" and avoid linking words as far as practical. Keep names short and sweet, to save typing and brain-space.

By the way, if you use an IDE to program, you might feel that none of this matters, but while it will fill in the names for you, it really doesn't solve everything. You still need to keep track of which variable you want and select the correct one in ambiguous cases, and keep track in your head of what is what. The self-documenting ideal means that when your IDE gives you a list of 5 similar looking variables, you barely have to think to select the one you need in context.

Self Documenting Equations

Take a nice simple equation, ax^2 + bx + c = 0 and imagine programming something to solve it. If you took the self-documenting advice at face value, you might convert "x" into "polynomial_variable", and "a, b, c" into "coefficient_of_square_term", "coefficient_of_linear_term" and "coefficient_of_constant_term". That leaves the quadratic equation looking pretty complicated, as

polynomial_variable = (- coefficient_of_linear_term +|- sqrt( coefficient_of_linear_term^2 - 4 * coefficient_of_square_term * coefficient_of_constant_term) )/ (2 * coefficient_of_square_term)

instead of the far more familiar

x = (- b +|- sqrt( b^2 - 4 * a * c)) /(2 * a)

Similarly, if you program equations from a paper, it can be much, much easier to use variable names that match the paper, especially when trying to verify them. Sometimes it is even a good idea to mimic the paper's equation format, at least until things are working. More on that in the next section.

Optimising Equations from Papers

So what about rearranging equations for optimisation purposes? If you ask ten different people, you'll get ten different answers, but in my opinion the following are the keys:

  1. Only simplify once the code works. Don't bother collecting, hoisting or other tweaks until you need to.
  2. Keep names generally the same as the paper, but ideally remove ambiguity, so don't use 'a' as well as 'A'. Consider 'a' and 'bigA' for more distinction. Absolutely avoid ambiguities like 'l' and '1'.
  3. When you do change to rearranged equations, consider leaving the original code present in a comment. This is one of the few places where commented code is a good thing.
  4. When you combine or collect terms, start by just combining their names or the functions use, so use 'A_squared', 'sin_theta' or 'A_over_B'
  5. Include a link to the paper or source you used in the code, so that your variable names remain useful. Somebody with a source that uses different namings will find this style really tricky. If you can, consider including a snippet in your docs detailling the original equations and what rearrangements you did.

Find your style, stick to it, and optimise only as much as necessary to get the performance you need!

Mathematica for Complicated Expressions

If your equations need substantial rearranging or simplifying, you might turn to a tool like Mathematica to do this for you, but did you know that it can even write C or Fortran code directly?

https://reference.wolfram.com/language/tutorial/GeneratingCAndFortranExpressions.html

You can save a lot of typing, and avoid some really easy mistakes by taking advantage of this. Don't forget FullSimplify and to apply suitable restrictions on variables to get a nicer expression before doing this though.


June 13, 2018

Fix Me

Fancy Latex Customisation

Writing up notes and papers, I often find that I want to leave myself reminders, hints and other snippets that I'd rather didn't make it into the final draft. Or I want to collate notes from throughout my text into a file I can put in my appendix, such as a list of future work. I often think of Latex as just a typesetting language, but Tex itself is actually a fully Turing complete language and Latex isn't too far off. The hardest bit of Turing completeness is usually the ability to do conditional branches which Latex can do with a bit of work. Thankfully there's a common package that takes some of the load off us.

NOTE: many journals wont accept manuscripts containing custom commands, so don't forget to remove your commands from papers you submit in future.

The Latex snippets are available on our Github page and are described below in more detail.

Defining Latex Commands

Firstly, we use the Latex \newcommand to do some custom formatting we can easily vary, in this case to make our text bold. \newcommand is fairly simple. The syntax is

\newcommand{\<name>}[<n_args>]{<command>}

where the parts in <> should be replaced with appropriate text. Thist creates a command called "name", taking "n_args" arguments and applies "command". The arguments are referred to using a '#' within your command, as in the following example where we feed the text inside the \fbrtext braces into \textbf.

\newcommand{\fbrtext}[1]{\textbf{#1}}

If there are no arguments, you can omit the [<n_args>] part entirely.

We can define as many commands this way as we like, but they must not be already defined. To redefine something, use \renewcommand (but carefully, making sure not to clobber something important like \section).

Collating Snippets with Output Streams

The first snippet is to collate some things into a file so we can review things we'd like to fix or add to our document before we release it. Snippet 1 is a simple Latex document which does this. We put the commands in a .tex file named ReleaseFixes, and include this into our main document, so that we can reuse it easily.

The core part of this is a package called "newfile" which allows you to write to files as the Latex compilation runs. This is how things like index and glossary packages work at heart, creating a temporary file that can then be included. This package lets us create "streams" for output, that we can write to, which may be familiar from Java, C++ or other languages. We name our stream "releasefixes", and then open a file to which it should output.

Now we create the main command, which takes one argument, a text chunk. This chunk will appear in our document in bold red text, and will also appear in our output file. In the output file, it'll show the line number (in the input .tex file) where the text appears. We could also include the page number in the output pdf if we wanted or other such useful commands.

Including Snippets into our Document

Perhaps we want to include the snippets we've generated into an appendix, so that we can send the draft to somebody else and they have a handy list of all the changes we want to make, or things we need filling in. Or maybe we want to use a similar command to list the changes we've made to a draft for easy review.

In Snippet2 we adjust our command so that rather than creating a simple text file, we put a '\item' command at the start of every item. Then we add a section to our document, and include the file we generated, inside an \itemize environment. This should give a neat bullet-pointed list of things to fix.

This doesn't quite work yet: the stream seems to be empty at the point where we read it, even though the file is complete when we look afterwards. When you have an unexpectedly empty output file it is usually because the stream didn't get "flushed". Flushing an output stream means moving its contents from a buffer in memory to the actual output file. This could be done every time we write to the stream, but if we do lots of tiny writes it is much slower this way than waiting until we have "enough" and writing in a chunk.

Normally, output flushing happens when the compile ends and everything gets closed, but we want access to our file before this. So we insert an explicit \closeoutputstream in just before our new section to write the file immediately, and then we can open it for input. We include its contents, and they are then typeset just like a static included file.

Using ifthen to Typeset Variants

Suppose that rather than showing us the changes, we want a way to include or exclude them from the output file entirely. This is really easy to do with a quick edit, because we just redefine our \fbr command to omit the text instead, and compile again. Easy for one command, but what if we have several things we want to change for our "draft" mode? It's tedious and error prone to edit them all.

A perfectly good solution for this precise case, by the way, is to create two versions of our "ReleaseFixes.tex" file, one for draft mode, and one for release mode, and edit our file to swap between them

\include{ReleaseFixes_draft.tex} %Swap with next line for release

%\include{ReleaseFixes_release.tex}

However, this is a good demonstration of conditionals, so let's try that. In Snippet3 we use the package "ifthen" which provides an "ifthenelse" command, that looks like

\ifthenelse{<condition>}{<command if true>}{<command else>}

The package allows us to define boolean flags. We define one called usered which chooses whether our text should be output in bold or red-bold.

However we can only define flags after we've include the package, and we'd prefer the draft mode selection it in our main document. So we define a normal command, \draft, and check if this is defined instead. We adjust the \fbr command to do nothing in release mode (\draft not defined) or to give us bold text (red if the flag says so) and a file entry in draft mode. Then by un/commenting line 6 of Document.tex we control the final form.

Summary

Latex typesetting gets a lot easier once you have a few simple commands at hand. It gets a lot more powerful once you add the ability to do conditionals. "Clever" commands like this should be used with caution, since anybody who wants to compile the document needs the relevant packages, but they are very handy when you need them, and can save a lot of time and effort. And do remember that you'll probably have to remove anything like this if you submit Latex to journals.


May 31, 2018

More randomness and a bit of non–randomness

Normally distributed random numbers

The last blog post was about getting high quality, uniformly distributed random numbers. In this post, I'm going to address the question of what to do if you want to have non uniformly distributed random numbers, that is random numbers where some are more likely than others. One of the ways that we're going to be looking at this is by looking at Probability Distribution Functions(PDF). These graphs are very like histograms, but they instead say "I have a source of random data U. For the entire source U, what is the probability that any given value is in the range x to x + dx, where dx is a finite but usually small deviation from x". In practice, what you do is you split up the range of U into a series of bins of width "dx" and then just count the number of particles in each bin and divide by the total number of particles.


Probability distribution function for uniformly distributed random number

The above graph shows the probability distribution for a uniform random number when you have 1,000 bins between 0 and 1. This number of bins explains why the graph is showing a rough line centred around 0.1% probability. Since a uniform random number generator should produce values in all bins equally, you would expect (100/nbins)% (so 0.1% for 1,000 bins) to lie in each bin. The graph shows that the random number generator that I'm using is working reasonably well.

Probability distribution function for normally distributed random number

This graph shows the same thing for a normally distributed random number (also called Gaussian, or in some fields Maxwellian distributions and sometimes bell curves). It is strongly peaked at zero and extends out to ±6. Normal distributions are very important in many areas of study (see here for some examples), and is found in things as diverse as the distribution of speeds of particles in a gas to the distribution of heights of members of a sports club. As a consequence there is a lot of interest in generating normally distributed random numbers. This is often called "sampling" the distribution. The graphs above were created by getting random numbers and counting how many fell into each bin on the graph, but often you want to go the other way. You want to draw the graph and then get random numbers such that their distribution will match the graph.

Fortunately there's a very easy way of converting from a uniform random number generator to a normally distributed random number generator via an algorithm called the Box-Muller Transform. I'm not going to go into the maths of it which are detailed on the linked Wikipedia page, but if you want normally distributed random numbers the Box-Muller Transform is your friend (usually the specific variant called the Polar Box-Muller Transform). Lots of languages already have built in implementations of normally distributed random number generators, generally using that transform so it's worth looking for those first, especially in compiled languages like Python where the built in function will probably be much faster than yours. However, in languages like C or Fortran which don't have a built in one you probably want to find or write an implementation of the Box-Muller Transform if you need normally distributed random numbers.

Arbitrary distributions in one variable

Sometimes, however, you have a more complicated distribution function that you want to sample. Sometimes there are other transforms like Box-Muller, but quite often there aren't. At which point the most powerful tool you have is called Inverse Transform Sampling. This introduces a new idea the Cumulative Distribution Function (CDF). The CDF is very much like the PDF in many senses, but instead of saying "how much of U is between x and x+dx" it says "how much of U is between min(U) and x", hence the cumulative in the name. It is the accumulation of the probability of the value being at x or lower.


PDF and CDF of Gaussian distribution

The graph this time compares the PDF (left) to the CDF (right) for a normally distributed number. The CDF has a value of 100% at the right because everything has to be less than the largest value. You can create the CDF in a number of ways, but on a computer a very common way is to define a grid of values (x(i), say) and calculate the PDF (PDF(i)) for each of them. This does mean discretising your distribution but that is often sufficient and you can often interpolate the result. The CDF is then simply defined by


CDF(i) = SUM(PDF(0:i))


where i is the index into a zero based array and is evaluated from 0 to n, where n is the number of elements in the array.

Once you have the CDF sampling it is surprisingly easy. You simply pick a uniform random number between 0 and 100 and then find out which value of i is the cell in which the CDF exceeds that value. You then have sampled the value x(i). You keep repeating this until you have sufficient values. If you then check the distribution of the values it will match the PDF that you specified. Effectively this divides the PDF into bins of equal area under the curve, and selects one bin at a time with equal probability. On the CDF graph, equal area bins correspond to a uniform division of the y axis.

The actual implementation of this can be a bit messier because while the easiest way of finding which cell of the CDF exceeds your random value is simply to start at i=0 and count up, the fastestway is to use binary bisection on the index. This works because the CDF is guaranteed to increase from left to right, so if you pick any point you know if the point that you want is to the left of that point if the value is smaller and to the right of the point if the value is larger. So you can keep dividing the space into sections until you have found the value that you want. But that is an implementation issue rather than a conceptual one.

Note also that if you can integrate the PDF to obtain a CDF, which you can then invert, you can do all of this analytically, but even simple PDFs like a Normal distribution don't have an invertible CDF.

Distributions in more than one variable

Distributions in more than one variable can be tricky. But if the function is separable, that is it can be written as two functions multiplied together, each of which is in a single variable (f(x, y) = g(x) * h(y)) each part can be sampled as above. For each complete sample you take one random sample of the function g, and one of h, and multiply them together.

Functions in more than one variable which don't break down like this are generally treated using a technique called accept-reject sampling, or some more sophisticated technique in tricky cases. More on this next time.


May 02, 2018

Hierarchies of Git(s)

Version Control and Git

Version control of some sort is essential in a software project. Suppose you make a change and your code no longer works - how can you go back? Suppose you find something you don't understand - who made that change? If you publish work using a code, can you recover the exact state at a later time? While carefully maintained records can go a long way, these exact problems led to the development of sophisticated version control systems which could do these things almost automtically. One of the most popular systems in use now is named 'git'.

For a basic rundown of version control, git, and why and how to use it, we have some notes here. This post is a (very brief) introduction to a useful but relatively uncommon feature, git submodules.

The name git, by the way, is no coincidence. It is a handy, pronounceable, 3-letter combination, that wasn't already in use for any Linux utilities, but also a convenient descriptor for the project creator, Linus Torvalds of Linux fame.

Great Fleas have Smaller Fleas

On one occasion we had a request for support in which a user was getting a cryptic error messages about `fatal: bad revision 'HEAD'` when attempting any git commands in any directories. In particular, he was unable to build our code because it used a `git describe` to obtain version information. After a bit of digging, we were able to work out that somewhere in a directory above our source code, he must have run a `git init` command.

This left our git repository in a invalid state and most git commands unusable. The user recalled typing the command, but not where he did it, so first we had to find the bad directory level. Since commands were failing, we could simply work upwards one directory at a time running `git describe` until we instead saw `fatal: Not a git repository (or any of the parent directories)`. A better solution is to use the command `git rev-parse --show-toplevel` which displays the current top-level git repository name.

Running this, it appears he had managed to do this in his `home` directory, leaving any and all git repositories among his files in this state. Thankfully the final remedy is simple: remove the invalid git directory by simply deleting the `.git/` folder at the wrongly initialised level. Much easier than feared!

Submodules

The proper way to have this sort of nested git project is using a relatively uncommon feature called submodules. These are separate git projects which can be imported and used, and are commonly used for deeply embedded library code. A submodule is a repository in its own right and can be checked out alone, and work on it should generally be done separately in this way.

For most people their only encounter with submodules will be the following two commands. When cloning a project that uses them, clone all submodules also using:

git clone --recursive [repo]

When making and removing small, temporary alterations within the submodule, some times when the submodule changes in the repository you cloned from, or any time you see a message like

-Subproject commit abc123xyz 
+Subproject commit def456uvw

in a git diff, the command:

git submodule update --recursive

will update all submodules to their current versions. The 'recursive' flag ensures all submodules, including any within other submodules, are updated.

But What Actually Are Submodules?

Basically, a submodule is a separate repository, connected to the main project. This connection is via a single commit, and a submodule within a main project is in a detached-head state. As of git 1.8.2 submodules included in a larger project can specify which branch of the submodule project to link to.

For all the gory details of working with submodules, see resources such as

https://www.git-scm.com/docs/gitsubmodules

https://git.wiki.kernel.org/index.php/GitSubmoduleTutorial

and https://blog.github.com/2016-02-01-working-with-submodules/

For now, you can think of them as a way to include 'library' type code which is under active development into several other projects and also allow it to stand alone.

Rescuing a Bad Chain

Sadly it is not particularly difficult to make a small alteration in a submodule (such as temporarily amending a compiler flag) and find yourself with a main repository insisting that there are uncommitted changes in the submodule, or worse, an unexpected detached head somewhere. Don't lose your head over this (sorry). Sometimes the submodule update command is all you need. Other times you will have to reset all the directories. First, commit or stash any changes you want to keep! Then the command

git submodule foreach --recursive git reset --hard

will remove all changes in all submodules, and a final

git submodule update --recursive

should put everything back to normal.

Summary

Git submodules are a powerful, sometimes tricky feature, which basically allow the inclusion of shared code into several other projects while also letting it to stand alone, to be released as a library and/or developed as a separate project. If you encounter submodules it will probably be as simple as somebody's repo needing to be cloned with the --recursive flag. To contribute to the submodule, clone it and only it instead. If you get stuff in your `git status` about changes in a submodule within a larger project try the update command.


April 20, 2018

Proper Planning and Preparation

There is an old, supposedly military adage, which reads

Proper Planning and Preparation Prevents Piss Poor Performance.

This applies to many things, including software development projects. There is also a popular idea known as Hofstadter's Law:

It always takes longer than you expect, even when you take into account Hofstadter's Law.

This is an amusing idea, but is sometimes allowed to apply to real life, which rapidly spirals out of control. There are many reasons a software project keeps taking longer the more you work on it, most of which are best avoided. This post is a brief introduction to how to plan, how to work, and how to keep things on track.

Software for Research

RSE is Research Software Engineering, which can mean many things, but ultimately means we are in the business of writing software used for doing research. Similarly, researchers-who-write-code should be aiming to for the least work which gets them reliable, verifiable, useable code. Note that usable includes being fast enough, and reliable includes doing enough testing.

This makes planning and scheduling essential, to make sure work gets done in time to produce research outputs, but also means it should be lightweight, simple and flexible.

Why Plans Fail

No plan survives contact with the enemy

(A well known paraphrase of a quote by H. von Moltke). Just like Hofstadter's law, one can add the corollary 'even if you include that in the plan'. Knowing the challenges ahead of time lets you plan around them, which gives the vast chance of something from the plan surviving. The following are the commonest sorts of issues to consider.

Shooting in the Dark

It's quite common to start a project not knowing exactly how to do it. This is fine, but has to be allowed for in the plan. For instance, allowing time to write a small test problem using a new library/tool before jumping into the main project, or allowing time at the end to rewrite/refactor (see below). Do not fall victim to the sunk costs fallacy and continue something unpromising just because you have already invested time and effort into it.

Scope Creep

Scope creep is probably the biggest problem many people face with research code. What starts as a simple script turns into a complex behemoth, and never seems to end. A vital part of planning, particularly if you are shooting in the dark, is to decide exactly what the code must do, and once that goal is reached, to stop. This is the perfect point to rejig things, debug well, document, optimise etc, before making a new plan to add all the things you now know you need. Endless rolling feature addition is risky.

Debug Deluge

It isn't uncommon to see a programmer writing pages of code before ever testing if it runs, or even compiles. This goes away with experience, but is often replaced by a related issue, of writing a bunch of stuff and then finding out that it doesn't quite work and has created a debugging nightmare. A better approach is to work in chunks, making each bit work, testing it, and only then integrating it into the whole. This way there's no deluge of debugging to be done, which can take a very long time; only small pieces at a time which are easier to plan for. This is related to the programming style called Test Driven Development (see below).

Premature Optimisation...

...is the root of all evil (Donald Knuth). Until your code works, there is nothing to optimise, as you'd just be doing the wrong thing faster. However, it is important to consider what the code needs to do in the end, and design and plan something that will be fast and efficient enough.

Refactor or Rewrite

Refactoring is rewriting pieces of a code so that the blocks (usually functions) give the same answers but are simpler, faster, more maintainable etc. Rewriting covers more substantial changes, such as changing core data structures or moving from hard-coded input values to a configuration file etc.

Getting Agile

If you have seen any of the Indiana Jones films, you may notice he applies something very like this. Rather than crafting a complex plan and trying to implement it in the face of changing circumstances, he adapts to things as they happen and is never more than a step, or a few steps ahead. This is basically the idea behind the 'Agile Methodology' of software development.

Since clients (or in our case, a research problem) change what they need over time, the idea is to plan only a few weeks ahead, adapting to changing needs as they arise. This is great for commercial development, which is all about happy customers, but does have a few pathologies for research software if done strictly. The main concerns are stopping yourself letting your scope creep out of control, and getting good enough time estimates to be sure you can complete a project, and that it will benefit you.

It's well worth reading a few articles about Agile to see what it's all about, and taking any bits you feel are useful for your own development.

Test Driven Developement (Lite)

'Test-driven-development' (TDD) means writing tests first, and considering a piece of code complete once it passes them, at which point it is done and work should stop. This has plenty of advantages, in particular that you avoid the pitfall of writing the test already knowing how the code works, which often leads to writing a test you know will pass.

The over-literal genie crops up all over the place and something very similar happens when you give clever people a set of rules. For a lot of scientific code, the natural test set is a handful of analytically soluble problems, for which exact answers are known. Using these as the test suite, the obvious, and strictly correct, TDD approach is just to hard-code their solutions. Every test passes, and the code is complete and usually completely useless too.

Rather than strict Test Driven Development, we'd recommend using writing tests for functions before or just after you finish them, making sure they work for all the parameter ranges etc that you need especially things like negative numbers, and then keeping the tests for when or if you change those functions. Rather than running everything every time you make changes you can just run the relevant tests and satisfy yourself that nothing is broken and then focus on integrating pieces together and making sure they all work in combination.

Planning for a Project

Everybody has their own way of planning, and really the only important thing is to do something so that you can estimate how long something will take and keep yourself honest along the way. For small things you only need to take an educated guess how many hours or weeks something will require. For more complex projects you may want to consider some kind of itemised planning process.

This file project_estimatorwexample.pdf is an example of such a planning worksheet, including an example based on my last project. This sheet tries to calculate a time requirement based on some common factors in research code. The multipliers and scalings are only a guess, and the last section adds an arbitrary buffer factor, but this would hopefully provide a starting point for making formal time estimates.

Last Word

This is quite an eclectic post, running through a few bits of useful knowledge that often get left out of programing introductions. The core point is simply

If you fail to plan, you are planning to fail!

Make a plan, stick to it as far as possible, but don't plan so far ahead that you're constantly wasting time re-forming your plans instead of getting work done!


April 04, 2018

Valgrind and Extended Precision

Valgrind is an invaluable tool for checking for memory errors in compiled code. Briefly, it runs your code inside its own runtime which includes checked versions of system functions like malloc, so can detect undefined variables, memory leaks and much more. Using this on one of my C++ codes, I found something rather odd. When I ran normally, I had no problems, but inside valgrind I got unexpected NaN results. The actual code, in Minimum Working Example form was

#include <boost/math/special_functions.hpp>
#include <iostream>
int main(){

  double arg = -5.58377179584844451;
  std::cout<<"Bessel call "<<boost::math::cyl_bessel_j(5+1, (float) arg)<<'\n';
  std::cout<<"Bessel call with double "<<boost::math::cyl_bessel_j(5+1, arg)<<'\n';

}

Running normally, I got the expected answer of 0.196642 for both calls. Run inside valgrind, the float version was unchanged, but the double version became NaN! This was very strange. Until I constructed the example above, I assumed the bug was elsewhere within my code, and not with the Bessel functions themselves.

After a lot of head-scratching and searching, I found a link to the valgrind manualand the answer:

Precision: There is no support for 80 bit arithmetic. Internally, Valgrind represents all such "long double" numbers in 64 bits, and so there may be some differences in results. Whether or not this is critical remains to be seen.

After a bit more searching and then checking of DBL_MIN and LDBL_MIN from limits.h I had an answer:

#include <limits>
#include <iostream>
int main(){
  std::cout<<__LDBL_MIN__<<'\n';
  std::cout<<__DBL_MIN__<<'\n';
  std::cout<<"Minimum value is "<<std::numeric_limits<long double>::min()<<'\n';
}

Normally one gets a result like:

3.3621e-4932
2.22507e-308
Minimum value is 3.3621e-4932

In valgrind once again:

0
2.22507e-308
Minimum value is 0

So the limited 80-bit support referred to was, on my platform at least, breaking __LDBL_MIN__ which should be the smallest representable non zero value. Since the real problem is happening off inside boost::bessel somewhere, there's no simple fix. Most likely, the problem is that boost uses __LDBL_MIN__ to soften a division, preventing a divide by zero which later results in a NaN. The easier workaround requires that you detect running inside valgrind and adjust calculation. This is easy using the supplied macro:

#include "valgrind.h"
#ifdef RUNNING_ON_VALGRIND
  if(RUNNING_ON_VALGRIND){
    //Alternative path here
  }
#endif

Note valgrind.h is specially licensed, differently to the rest of Valgrind, so that including it in your code does not require you open-source your code. Inside the special path, I used a float version of the bessel call, since inside valgrind I am interested in general output, but don't need the full precision of a double.

Another alternative overrides __LDBL_MIN__ before including the boost header. I chose to set it equal to __DBL_MIN__ Note this only works with header-only libraries and is a bit risky. The code becomes something like:

#undef __LDBL_MIN__
#define __LDBL_MIN__ __DBL_MIN__
#include <boost/math/special_functions.hpp>
#include <limits>
#include <iostream>
int main(){
  std::cout<<__LDBL_MIN__<<'\n';
  std::cout<<__DBL_MIN__<<'\n';
  std::cout<<"Minimum value is "<<std::numeric_limits<long double>::min()<<'\n';

  double arg = -5.58377179584844451;
  std::cout<<"Bessel call with double "<<boost::math::cyl_bessel_j(5+1, arg)<<'\n';
}

Sometimes, the bug really is in the compiler or tool!


March 21, 2018

Branch By Boolean

Once upon a time, all programming was for performance. With computer power at a premium, every instruction was a cost. Oddly, the cost of branching (acting differently based on a condition) has got both more and less expensive. Branches are evaluated more quickly now, but improvements in branch prediction (see below) mean branches can be more costly relative to the branchless option. Modern processors are in some ways more similar to the original supercomputers than ever, in particular, their use of vector operations. In this post we talk a bit about what branch prediction means and show a couple of tricks which can be quite useful in code in important loops.

What is Branch Prediction?

Branch prediction refers to the ability of the processor to predict (aha!) which branch is likely, and to get ready to take it. This can happen before the condition is actually evaluated. Once the condition has been checked, if the guess was correct, the operation stream continues, but if it was wrong the incorrect instructions in the queue must be dumped and the correct instructions for the actual branch assembled. This can sometimes be quite costly. Some systems instead execute both possible branches, and throw away the unneeded result, which can halve their possible throughput.

Generally branch predictors are dumb. The commonest option is to assume whatever track was taken previously will be taken again. This is great when you have one rare case (for example, an error condition) and a common one. Sometimes the predictor can 'understand' simple alternation (option 1, then option 2, then option 1...) or similar patterns if they are 'obvious enough'.

How much does it cost, really?

The costs of branch misprediction vary widely, depending on hardware, vectorisation, surrounding code etc. Most profiling tools will give you a miss rate, and reducing this in regularly repeated pieces of code can be worthwhile. A simple demo of branch misses is given here. I have tried to make all paths as similar as possible. The final print statement ensures that the total is used so the loop cannot be simply optimised away.

On my machine, the example compiled without optimisation takes about 0.8 seconds with compile-time option -DBR_3, i.e. the maximum possible number of addition operations, and the case which would take the longest if there was nothing more than simple looping and addition in play. Options 1 and 2 both take about 1 second, and the option with half true, half false, just over 0.8 seconds. This suggests something like the 'same as last time' prediction occuring.

Avoiding Branches

When it does matter, there are several ways to avoid true branches and greatly improve performance in a piece of code. Always profile first, though, to be sure this piece warrants improving.

A Simple Rejig

The first thing to consider is whether a simple rewrite of the code can eliminate a branch, or improve its prediction. Many branch predictors are quite dumb and assume whatever branch was taken the previous iteration will be taken again. If you can rewrite to ensure longer 'runs' of one branch, this may help. The previous section showed one example where this can be quite dramatic.

Do More to Do Less

Sometimes it may even be more effective to do more work, but without a condition, rather than check if the work needs to be done. A trivial example is something like

  DO i = 0, 10000
    IF (B[i] /= 0) A = A + B[i]
  ENDDO

where the check is redundant because if B[i] is zero, the addition does nothing.

Function Pointers

Another useful trick is to use function pointers (or named functions), as discussed in this previous Soup. Rather than putting an 'if' into a vital loop, first work out which function needs to be called and then call it. As given in the previous post, we can replace

  DO i = 0, 10000
    IF (condition) function_1()
    ELSE IF (condition) function_2()
    ELSE IF (condition) function_3()
    ...
  ENDDO

with

  IF (condition) ptr = function_1
  ELSE IF (condition) ptr = function_2
  ELSE IF (condition) ptr = function_3
  ...

  DO i = 0, 10000
    ptr()
  ENDDO

As usual, this may or may not help: in some languages function pointers are slow themselves (not C or Fortran though, but e.g. std::function in C++ can be), and sometimes the called function is already complex enough to break any vectorisation or similar optimisations.

Boolean Branch

One particularly sneaky one is the Boolean Branch, most useful in languages which equate 'True' with 1 and False with '0', but still viable with integers otherwise. The trick is to note that

IF condition THEN
  result = value1
ELSE
  result = value2
ENDIF

is the same as

flag = condition !Flag may be an integer; in C it can be Boolean
result = value1*flag + value2*(1 - flag)

The conditional assignment in the first case is replaced with two simple assignments. This may or may not be faster! Profiling is essential to find out. It is more likely to pay off in a case where the two options alternate, especially when this is irregular, so the branch prediction fails regularly.


March 09, 2018

Odd MPI IO bug in Open–MPI

Quite often working with research code leads you to find the unusual edge cases where even well tested libraries break down a bit. This example that we ran across with the Open-MPI parallel library is pretty typical. We wanted to use MPI-IO to write an array that was distributed across multiple processors, with each processor holding a fraction of the array. Because of how we were using the arrays, the array on each processor had a strip of "guard" cells along each edge that were used to exchange information with the neighbouring processor and these had to be clipped off before the array was written. MPI makes this very easy to achieve using MPI types. (This example is given in Fortran, because that was the language that we encountered the problem in. C would have the same problems)

First you create a type representing the total array using MPI_Type_create_subarray

  sizes = (/nx_global, ny_global/)
  subsizes = (/nx_local, ny_local/)
  starts = (/x_cell_min, y_cell_min/)
  CALL MPI_TYPE_CREATE_SUBARRAY(ndims, sizes, subsizes, starts, &
      MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, global_type, ierr)

This specifies an array that's globally 1:nx_global x 1:ny_global, and locally 1:nx_local x 1:ny_local. The starts array specifies where the local part of the global array starts in the global array, and depends on how you split your global array over processors. Then you pass this type as a fileview to MPI_File_set_viewto tell MPI that this is how data is arranged across your processors.

The actual data is in an array one bigger on each end (0:nx_local+1x 0:ny_local+1), so we need another type representing how to cut off those additional cells. That's MPI_Type_create_subarray again

  sizes = (/nx_local+2, ny_local+2/)
  subsizes = (/nx_local, ny_local/)
  starts = (/1, 1/)
  CALL MPI_TYPE_CREATE_SUBARRAY(ndims, sizes, subsizes, starts, &
      MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, core_type, ierr)

When you pass this as the datatype to a call to MPI_File_write or MPI_File_write_allyou pass MPI only the 1:nx_localx 1:ny_local array that you promised it when you called MPI_File_set_view. The final result will be an array 1:nx_globalx 1:ny_globalno matter how many processors you run the code on.

The problem was that it wasn't working. When we ran the code we found that everything worked as expected on files < 512MB/processor in size, but when we got beyond that the files were always systematically smaller than expected. They weren't a fixed size, but they were always smaller than expected. As we always advise other people to do we started from the assumption that we had made a mistake somewhere, so we went over our IO code and our MPI types. They all appeared normal, so we started taking parts out of our code. After removing a few bits, we found that the critical element was using the MPI derived type to clip out the guard cells from the local arrays. If we just wrote an entire array using primitive MPI types the problem didn't occur. This was about the point where it started to look like it might, just possibly, be an MPI error.

Next, we created the simplest possible test case in Fortran that replicated the problem and it turned out to be very simple indeed. Just create the types for the filetype to MPI_File_set_view and the datatype to MPI_File_write and write an array larger than 512MB/processor. It even failed if you just coded it up for a single processor! It was unlikely at this stage that we'd just made a mistake in our trivial example code. As a final check, we then replicated it in C and found the same problem happened there. Finally, with working examples and some evidence that the problem wasn't in our code, we contacted the Open-MPI mailing list. Within a few hours, it was confirmed that we'd managed to find an edge case in the Open-MPI library and they'd created patches for the new versions of Open-MPI.


There are a couple of take away messages from this

  1. If you are using MPI-IO and are finding problems when files get larger than 512MB/processor you might need to update your MPI installation
  2. Sometimes there really are bugs in well tested and widely used libraries
  3. It's a good idea to be as sure as possible that you aren't making a mistake before assuming that you've found one.

March 2024

Mo Tu We Th Fr Sa Su
Feb |  Today  |
            1 2 3
4 5 6 7 8 9 10
11 12 13 14 15 16 17
18 19 20 21 22 23 24
25 26 27 28 29 30 31

Search this blog

Tags

Galleries

Blog archive

Loading…
RSS2.0 Atom
Not signed in
Sign in

Powered by BlogBuilder
© MMXXIV