May 15, 2019

Datastructures – Linked lists part 1

Back to data structures this month with the linked list. Linked lists are a way of holding data that allows you to add and remove items quickly and easily.

Why not arrays?

First question: why is adding and removing items from an array not quick and/or easy? The problem with adding items is quite simple - arrays have a fixed size so eventually you will run out of spaces in your array to store items. When this happens you have to do something to allocate additional space. Many languages have a function called "realloc" or similar that tries to extend the length of your array but it can only do that if there is unused memory space "above" the location of your array because the array elements have to be arranged one after the other in memory. The concept of "space above" is a bit complex in general and depends on details of your OS etc. but as a general idea if you allocate two arrays then they are placed one after the other in the computer's underlying memory so if you try to realloc the first array then there won't be any space between it and the second array to grow it in. If you can't grow your array like this then you have to allocate new memory to store the bigger array and copy the existing elements in. If you keep adding items then this continual growing of your array can be quite expensive, although this can be mitigated by always growing your array by more elements than you immediately need.

Removing items has the opposite problem. Since arrays are required to be contiguous (can't have gaps in them) you can't just "remove" an item you have to either flag it as empty and ignore it when going through your array in future or take all of the items above the removed element and move them down to pack everything up. The first approach has three problems

  1. You have to use additional memory to flag items as being empty or not
  2. If you are both adding and removing items from your array then since you don't actually recover memory when you remove an item your total memory requirements will grow without bounds
  3. Depending on your algorithm you might have more difficulty getting optimal performance if you have to do fundamentally different things for empty and non-empty array elements

The second approach avoids those problems but on average involves copying half of the elements in your array every time you remove an item which can also be quite expensive.

It is quite possible to build a container based on arrays that you can add and remove items from that has good general performance (C++ std::vector is a good example of one) but they always have to make tradeoffs and if you are doing a lot of adding and removing of arbitrary elements it might be better to use a data structure other than an array.

Linked lists

The idea of a linked list is quite simple. Each element in a linked list is like a link in a chain - linked to the item after them, so you go through the linked list by taking the first item then going to the next item and the next etc. until you reach the end. This is generally implemented using pointers in what are often called "self referential structures", that is structures that contain pointers to themselves. These are easy enough to implement in either C/C++ or Fortran.

struct llitem{
  struct llitem *prev, *next;

TYPE :: llitem
  TYPE(llitem), POINTER :: next, prev
END TYPE llitem

These are more or less normal types but there is one more important rule: self referential structures can contain only pointers to their own type, not actual instances of their own type (try removing the *s in C or the POINTER attribute in Fortran and it will fail to compile). This is because types, much like arrays, are laid out contiguously in memory so they can only contain things that the compiler knows the length of and if you have a type that contains an instance of itself then there would be an infinite regression problem because you don't know how big it is until you have finished creating it and you can't create it until you know how big it is. Pointers are all of a fixed size so they work OK.

The structure as given is for what is technically called a doubly linked list because it contains links both to the next item and the previous item in the list. A singly linked list has each item linked only to the next item in the list. Doubly linked lists have some substantial advantages over singly linked lists, notably that you can go through it from either end, but also you can remove an item from the list needing only the item itself (and the list that it is held in if you have several).

Creating linked lists

Creating a linked list is quite easy. You hold a simple pointer to the first element in the list (generally called the head item) and then you simply create the list going down from that. The key thing is that you have to hook up the prev and next links as you go. This isn't too difficult and looks like


struct llitem{
  int value;
  struct llitem *next;
  struct llitem *prev;

void init_ll(struct llitem * l)
  l-> value = -1;
  l->next = NULL;
  l->prev = NULL;

int main(int argc, char** argv)
  struct llitem *head, *current;
  int i;

  head = malloc(sizeof(struct llitem));
  head->value = 1;
  current = head;
  for (i=0;i<10;++i){
    current->next = malloc(sizeof(struct llitem)); /*Create the next element*/
    init_ll(current->next); /*Initialise it to nullify pointers*/
    current->next->value = current->value + 1; /*Simple counter*/
    current->next->prev = current; /*It's previous pointer should be the current item*/
    current = current->next; /*Now move onwards so the newly created particle is now current*/

  current = head;
    printf("%i\n", current->value);
    current = current->next;


  TYPE :: llitem
    INTEGER :: value = -1
    TYPE(llitem), POINTER :: next => NULL()
    TYPE(llitem), POINTER :: prev => NULL()
  END TYPE llitem

  TYPE(llitem), POINTER :: head, current
  INTEGER :: i

  ALLOCATE(head) !Create the head
  head%value = 1
  current => head
  DO i = 1, 10
    ALLOCATE(current%next) !Create the next element
    current%next%value = current%value + 1
    current%next%prev => current !The next element's previous is the current element
    current => current%next !Now move onwards so the newly created particle is now current

  current => head
    PRINT *,current%value
    current => current%next


This example also shows how you how to step through the linked list from the head, simply by having a "current" pointer that starts at head and is then incremented by setting current = current->next (or current => current%next in Fortran). This can look a bit odd but it isn't that hard to understand. I start by manually creating the "head" element, using either ALLOCATE or malloc. Once I have a head element I then loop through, each time using the same ALLOCATE or malloc command on the "current->next" pointer, creating a new item every time. In C I then call the ll_init function to setup the values of the struct (in Fortran this is done for me since I gave the elements of my TYPE default values). After this the prev and next pointers are both NULL. This is correct for the next pointer becuase my new item is the last item in the list (it won't be next iteration but right now it is), but I have to set the prev pointer. If my new item is the next element in the chain from my current element then the previous element in the chain from my new element must be my current element so I set that up. After that I just have to repeat until I have added enough items.

Part 2 of this will be in a couple of weeks and will describe how you remove and item from a linked list and how to add new items to the middle of a linked list.

May 01, 2019

The Numba "stencil" directive

After a bit of a delay we're getting the blog posts going again with a mention of a slightly odd bit of the Python Numba compiler - the stencil directive. The purpose of Numba is to produce compiled code from Python source that should run at a reasonable fraction of the speed of classical C/Fortran etc. codes. In general it produces codes that is about 20-40% as fast as C or Fortran code, so typically only about 1-2% as fast as the computer can theoretically operate. In general, the more that you can tell Numba in advance about how you are going to use your data the more options it has to optimise the code as it compiles it. The "stencil" directive is used to indicate to Numba that you are going to operate on an array by moving a stencil across it and updating each point using data from neighbouring points.

This is a fairly common thing to want to do and crops up in algorithms from image smoothing to numerical solutions to differential equations so this is a useful bit of the library. As a simple example consider the simplest possible image smoothing algorithm. For each pixel in the image P(i,j) replace the value with the average of the surrounding pixels, so

P'(i,j) = 1/4 * (P(i+1,j) + P(i-1,j) + P(i,j+1) + P(i,j-1))

Note that the left hand side of this equation is P' not P, so when we calculate the average for each pixel using the original values surrounding it not the values that have already been through the averaging process. After you have finished you copy P' to P. There is then one last thing that you have to worry about: what do you do when you reach the edge of the array? Since you are using adjacent cells you have to do something or you will read outside your array. Numba's stencil operator at present only has two options: set the outer cells to be zero or some other constant value. In general, this will mean that you want to have an outer strip of cells added around your image otherwise your image will get smaller as it smooths. These outer cells are often called ghost or guard cells and are also common in numerical solution of differential equations for representing boundary conditions. The code for doing all this in Python is quite simple

def blur(A):
    s = A.shape
    B = np.zeros((s[0],s[1]))
    for i in range(1,s[0]-1): #Range only over the inner cells
        for j in range(1,s[1]-1): #Range only over the inner cells
             B[i, j] = 0.25 * (A[i-1,j] + A[i+1,j] + A[i,j-1] + A[i,j+1])
    return B

This code takes a numpy array and iterates over all but the outer strip of cells in every direction, averages and returns the value. Each call to this function smooths your image a bit (by a radius of about 1 pixel) so in general you'll want to call it several times to smooth your image as much as you want. Running this algorithm on a stock image of 1529 x 2250 pixels on a 3.4GHz processor takes about 3.3 seconds per iteration using the pure Python implementation and 0.006 seconds by using the Numba @njit decorator. For testing we ran 100 iterations of the pure Python code and 1000 iterations of the Numba code. If run to the same number of iterations, the results in both cases is the same

Circuit board image before smoothingCircuit board image after smoothing 1000 times

The simple equivalent using stencil is

def blur(A):
    return 0.25 * (A[-1,0] + A[1,0] + A[0,-1] + A[0,1])

You can see how i-1 becomes just -1 and similarly for other parameters in the stencil, and you can also see how this operation now becomes a one-liner so it's much easier to write. Unfortunately the performance is much worse than the @njit version too taking about 0.22 seconds per iteration, although that is still some 10x faster than the native python performance. Fortunately performance can be improved by calling the stencil from a Numba jit-ed function, so for example

def inner_blur(A):
    return 0.25 * (A[-1,0] + A[1,0] + A[0,-1] + A[0,1])

def blur(A):
    return inner_blur(A)

The result from this method gives performance that is indistinguishable from that of the direct Numba jit version and is still rather shorter. Since these stencils lay out the data dependency you can also set parallel=True in the @njit call and this can be quite succesful but tends to work better for more complex stencils. In this particular case despite showing the Python interpreter apparently using 6 cores solidly the execution speed slows down by a factor of 3.

March 14, 2019

Multiprocessor Profiling

Super quick this time - a bit of background and then something I didn't know worked as well as it does - namely gprof on multi-core programs.


Once code works (and NEVER before) you can start worrying about performance. Often, you can tell using a bit of intuition and a few test runs what takes the most time, especially in simple cases. But this is always a bit risky, because you can easily miss the real performance limiting steps. Sod's law (if it can go wrong, it will) usually means that if something is completely unexpectedly slow, it's either trivial to optimise, because it really oughn't to be slow, or it's almost impossible.

The answer to this is not to guess, but to use one of the many available profiling tools, which tell you, function by function, and line by line, where your code spends it time. You then know where to focus your efforts.

Our upcoming training course (Accelerating Python, 27/3/19, see our calendar) will discuss profiling in detail for Python programs, and there'll be a more specific post after that, so watch this space!

Profiling Compiled Code

For compiled codes, profiling is usually pretty simple, but does require you compile code specially. For C/Fortran etc, there is a very good profiler as part of the gcc toolchain, called gprof. Using gcc, you simple compile your code with `-pg -g`(the second g adds debugging symbols and is only required on older gcc) and run it. This produces a machine-readable file of all the timings for that specific run, usually gmon.out. You can then turn this data into nicer formats using the gprof tool. Something like

gprof <path to program executable> <path to profile file>

You have to provide the program file too so that gprof can report to you in terms of program source lines. Note that strange things happen if the program you give here isn't what you ran (wrong line numbers etc). This is terribly infuriating, so don't do it.

Old Dog New Trick

I have used gprof plenty in the past, but I recently found a really useful way to apply it to multi-core codes (using MPI). This doesn't tend to work very well, because the output from different processes gets all mixed up and the timings end up wonky. It turns out that there's a really, really, simple fix, decribed e.g. hereand reproduced below.

Firstly, you need to get the profiling data on a per-process basis. As the link above says, this isn't very well documented, but is easy to do. Before running, set the environment variable GMON_OUT_PREFIX. This has to be done where the code is running, so if on a cluster etc, you'll probably need to add an export or setenv to your submission script, and/or use the '-x' argument to mpiexec (see below). So we do something like

export GMON_OUT_PREFIX=gmon.out-

When the code runs, it will produce one file per process (so be very careful on a cluster not to overwhelm your home space) named 'gmon.out-<PID>' where PID is the process ID. In theory it is possible to have the files named by MPI rank, but apparently this doesn't work well.

Now you run the code as usual, although on some systems you may need to make sure MPI run is fed the above environment variable. You can do this using the '-x <env var>' flag to mpiexec. On a local machine, as long as you exported the variable, this shouldn't be needed.

Now you want to produce the function or line level profiles using the sum of all of them. You can supply all of the files to gprof, or you can have gprof sum them into a single file, by default gmon.sum, using

gprof myprog -s gmon.out-*

Make sure the wildcard matches the right set of files - when I first tried this I forgot to clean out the files from previous runs and ended up summing across runs, with very odd results. Of course, you can use this to sum across runs to very useful effect if you want to see average profiles across multiple sets of parameters etc.

Now you can run gprof as usual, using either

gprof <path to executable> gmon.out-*


gprof <path to executable> gmon.sum

and voila, properly added up profiling information for multiple cores. Super.

February 27, 2019

Data structures – Stacks

Another fairly simple data structure this week - the stack. The stack is basically the reverse queue. Rather than the first item to arrive being the first to be served as with a queue in a stack the last item to arrive is the first to be served. The name comes from the idea of a stack of paper where you add pages at the top and then take them off the top again when you want them. Formally this behaviour is called LIFO (Last In First Out) as opposed to the FIFO (First In First Out) queue.

As you might imagine this is quite a restrictive data structure and the classical stack actually only has two operations

  1. push - add an element to the top of the stack
  2. pop - take the top element off the stack and give it to you

You'll notice that pop and push are inverses since pop removes the top element from the stack and gives it back to you. This isn't always what you want since you might find that you have to simply push the element back onto the stack if you can't deal with it yet. To make this easier a lot of stack implementations also have a snoop operation which tells you what is on the top of the stack without actually taking it off. What you never have in a true stack is a way of accessing an arbitrary element in the stack. If you can do that then you have an array that you are accessing in a "stack-like" way. This is also something that people quite often want to do, so you find that lists in Python have a "pop" method to remove the rightmost element (although the equivalent of "push" is called "append" for a Python list, probably because it is used for a lot more than making stacks).

Implementations of stacks are generally very simple. The simplest implementation is just an array with an index to the first free element in the stack. When an element is pushed it is added at that location and the index is moved on by one element. When a pop happens the element behind the index is given to the user and the index is moved back by one.

Stack Example

This will all work perfectly well until you run out of space in your array when, much as with the queue, you have to do something else. You can reallocate your stack to give yourself more space, copying the existing elements as you go, you can implement a stack using a linked list (which we will cover in a later post) which means that you won't run out of stack until you run out of memory or you can refuse to accept the new item an return an error. If you do nothing and accept the new item you access memory that you are not supposed to be using at all which can cause all kinds of problems from crashes to possible security exploits and this is called a stack overflow, which might be familiar as both the name of a tech website and as one of the most common things that are listed as the cause of security problems in software.

At this point it is worth drawing a distinction between the stack as a data structure and all of the other things that are called stacks. There are call stacks that were mentioned in the last blog post which described how a program got to be in the place where it is and there is an associated CPU feature called a "stack register" that holds this information. There is "stack memory" which together with "heap memory" make up the most common way of dealing with memory management in compiled codes and there are plenty of other stacks that turn up all over the place. In general they get the name because they implement a stack data structure (sometimes they used to be a stack back when the name was given to them but no longer do) and otherwise are not really related. You have to be rather careful because lots of things are called stacks and that name itself doesn't alway tell you very much. Always read the documentation!

A question that might come to mind is "what would I ever do with a stack?" They do seem rather less use than a queue and in practice they are mostly fairly "low-level" data structures that most people don't encounter much, but you can easily enough think of cases where they are useful. Their most useful feature is as a "memory" of a sequence of events that can then be undone. Many "undo" systems in software use stacks behind the scenes for exactly this reason. For the same reason they are often used in "depth-first search" algorithms which were first investigated (and still used!) as ways of solving mazes. They also crop up in algorithms to parse mathematical and programming languages (a good example being the Shunting Yard Algorithm, named after train shunting yards. You can easily see the similarity between a stack as described here and a train siding because you can only get carriages off a siding in the reverse of the order you put them in on). But in general you probably won't need to implement your own stacks very often but it is worth knowing about them because they are so ubiquitous in the lower layers underpinning all of modern software development.

February 13, 2019

Again and again and again

"Flow control" means all the ways you can control what your program does, both now and next. Conditionals, loops, function calls all count. Exceptions (throw, raise etc) may or may not - whether its OK to use exceptions as flow control or whether they're meant for, well, exceptional occurences (not necessarily rare, but something that can't be handled by the current piece of code) is a seriously vexed question.


Loops are usually the first or second control option to be taught, and take two general forms, the 'for' type loop and the 'while' type loop. Different languages use different words, but the first one is meant to do something a certain number of times, and this number is known when the loop starts. The second is meant to do something 'until told to stop'. This is not a hard distinction though. A while-type loop can always mimic a for-type loop, and the reverse is also true (although sometimes considered to be inelegant and/or error prone).

Usually, these loops look like

For index = start to end
   {loop body}
End for


While condition
  {loop body}
End while

The condition can be as complicated as you like, it just has to evaluate to either True or False. It can use all the variables you might be changing in the loop, etc. So we create the other kind of loop something like this:

For index = start to max_iterations
  {loop body}
  If condition exit_loop
End for


index = start
While true //This means keep looping forever
  {loop body}
  If index > end exit_loop
   index = index + 1
End while

Note that whether we get (end-start) iterations of a for loop, or (end-start+1) or (end-start-1) can vary by language, but we can easily adjust to match.


The other way of doing something many times, is 'recursion'. This often gets classed as 'super advanced and difficult' for some reason, but is mostly quite simple. First though, we need to know a tiny bit about functions and how they're called.


Scope is very, very important: every variable, function etc within a program has a scope. For variables this means "parts of your code which can use this variable (bit of memory) with this name". For functions, it means "parts of your code which can use this function with this name". There's a few subtleties beyond that, but for now, this will do.

So, what scopes are there? A variable defined inside a function is only usable within that function: it is scoped to that function, or has 'local scope'. A variable defined globally (outside any functions, including main) has 'global scope' and is available everywhere. Do note that a variable defined in 'main' is only available in 'main' and NOT in any functions 'main' may call, as main is a function like any other.

Most languages also have an idea of 'block scope' where 'blocks' (in C anything inside curly braces {}) can contain variable declarations, which are only available inside the block. This can cause some particularly confusing errors, such as when you try and do the following:

while i < 10 
  int i
end while

which will not compile unless there is already a variable, called i, and the one you declare inside the loop then 'shadows' this - inside the loop i refers to one variable, outside the loop, including the loop condition line, i refers to something different. If this isn't completely clear, the following example should help:

string name = 'Nobody'
int i
for i = 1 to 3
  string name = get_string_from_user() 
  print i, name
end for
print 'You entered', name

which gets 3 names from the user an outputs something like:

1 Bill Bailey
2 Madonna
3 Engelbert Humperdinck
You entered Nobody

Another tempting thing is to try

if condition then
  int i = 0
  long int i = 1
end if
print i

which again has either an undefined 'i' or a shadowing problem. There is no way to get a different type for i using an if like this, and with good reason. i's type could only be determined in general when the program runs - so how much storage should be given for it, and can it be passed to any given function?

Function scope has one more really important thing though - each call to a function is a new scope. The variables you used last time do not keep their values.


In Fortran there is one really important idea called the 'SAVE' attribute. A variable in a function (or module) can be given this, as e.g. "INTEGER, SAVE :: a" and the value of 'a' will be kept from call to call. This is very useful. BUT there is a catch. Any variable declared and defined in a single line in a Fortran function is given the SAVE attribute. So, if you do something like `INTEGER :: alpha = 0`, declaring an Integer alpha and in the same line defining it, alpha is set to zero ONLY the first time the function is called. Subsequent calls will inherit whatever value alpha had last time. This is rarely what you intended. Be careful!

Call stack

When you write code to call a function, the computer has to stop what its currently doing, and enter a new scope containing only the variables available inside the function. It also has to remember where it should go back to after the function ends. This is done using the 'call stack'.

We haven't talked about 'stacks' as a data structure yet (coming soon) but we did mention here that they're a "last-in-first-out" structure where the last thing you add to the stack (think of a stack of papers or books) is the first one you take off. Each time you call a function, you add an entry to the stack, and when you return this is 'popped off' and the stack shrinks. Each entry is called a 'frame'.

The stack frame usually contains the location to return to, and also memory for all of the local variables in a function. It often also has space to hold all the parameters passed to the function and sometimes a few other bits of operational stuff. When a function is called, a frame is created with all this in, and when it returns this is destroyed.

We mentioned above that variables inside a function are available only inside it, but we didn't ask what happens if we call a function from within itself. We've seen that between calls to a function the values are 'reset' or lost, and having read the previous paragraph you probably guess that this is both because and why the stack frame gets destroyed.

Now, you might suggest that you could always make sure every call to a given function shares the same variables, but if you've ever used a function pointer you know that you can call a function without ever using its name at the place the call actually happens, so this isn't practical. So, each call to a function, any function, creates a stack frame containing all its local variables, and calling a function from within itself makes two, independent, sets of all the local variables, that know nothing about each other.

My First Recursive function is My First Recursive function is My First....

So what is recursion then? "Recursion occurs when a thing is defined in terms of itself or of its type." (Wikipedia, Recursion) For a function, recursion means having the function call itself. In maths, the factorial function of a number which is the product of all positive integers up to it. So we see immediately that factorial(n) is n*factorial(n-1), which we'd code up something like this:

function factorial(integer n)
  return factorial(n-1)*n
end function

We can pretty immediately see a problem there though - how does the chain ever end? We need something which is not recursive or we'll go on calling forever. This is called the 'base case' and for factorial its obvious from how we said 'positive numbers'. The function above won't stop when n = 1 and it should. So what we actually want is:

function factorial(integer n)
  if n > 1
    return factorial(n-1)*n
    return 1
  end if
end function

Follow this by hand, on paper, for a starting n of say 4. We enter factorial(4), which enters factorial(3), which enters... until we reach factorial(1), which immediately returns '1' to the layer above, factorial(2), which multiplies this by '2' to get '2' and returns this to factorial(3) and so on.

Each Call is Its Own Scope

Remember when looking at recursive functions that each layer of call is a separate scope, with a separate copy of any variables you may define. Anything which needs to go between the layers has to be passed as an argument.

Step by step by step

So we have these two ideas that both let us keep going until we reach some condition, namely recursion and a while loop: what's the difference. There isn't one. Anything you can do with a loop you can do with recursion and vice versa. There are differences in elegance, and often one is a better choice, but not more. In fact some functional languages don't have any concept of the loop, relying solely on recursion. Mostly, elegantly recursive problems are better written as 'while' type loops and rarely as 'for' type loops, because the base case is the same as the loop-stop condition. Some recursive problems, usually those involving trees, are very hard to do elegantly with a loop.


Most of programming is about working out the sequence of steps to get from A to B, so that what you actually code is just a series of things, one after the other. Sometimes these steps are completely independent, and sometimes they aren't but they always (in a single-threaded program) run one after the other. We're always having to think about things, not in terms of the big picture, but just in terms of getting from here to there, ignoring how to get here in the first place.

In maths, one of the simplest methods of proof is called 'induction' which is closely related to recursion. Rather than try and prove the 'whole of a thing' we say 'if it was true for a smaller thing, can we show it's true for the next larger thing?' and then we say 'can we prove its true for the smallest thing?'. If we can do both of these, we've shown its true. As used here a common example is climbing a ladder. We say 'can we climb onto the bottom rung?' and we say 'can we climb onto the next higher rung than we're on?' and if so, we can climb any ladder.

Sometimes, a slightly stronger assumption is used where we instead say 'if we have reached every rung below and including the one we're on, can we reach the next one'. This is actually equivalent, but is sometimes a more useful phrasing.

Proofs by induction only work if it doesn't matter which rung we're on to climb to the next one. We don't ever have to reach rung 53 to know we can reach rung 54. If you have a problem which is easy to think about this way, then it is a prime candidate for programming recursively. The step is always the part going from 'here' to the next 'there', and the base case is how you get to the first 'here'.

Problems with Recursion

Stack Overflow

The 'call stack' we've been talking about is the inspiration for the programming forum Stack Overflow which is probably the most encountered error when programming recursively. Each function call creates a call stack frame and there is a limit to how much memory is available for this. If you forget or mis-program the base case, your recursive function never stops calling itself, until it has filled the call stack and your program crashes horribly with a stack overflow.

The other common way to get a stack overflow is creating large temporaries inside functions since these are all part of the stack. Hopefully more on that soon.

Function Parameters

Secondly, recursion can be a bit tricky to actually set up. Our factorial function was nice and simple, with a single parameter, and a single return value. But what if we have more than one parameter? For instance, a binary search can be done nicely in recursive fashion. Each step is about deciding which half, upper or lower, our target is in, and passing only this half on to the next step, and the base case is when this has length 1. Here though, you want to pass at least two items - the segment of list, and the target value, and you want to return either true or false, or the index the target was found at. If returning the index as an offset into the passed segment, you then have to adjust this at each step so that you end up with the index in the original, complete list. This can get mucky.

Excess Work

The other common example used for a recursive operation is the Fibonacci sequence where the nth value is the sum of the (n-1)th and the (n-2)th. Usually, the first two values are 1, so the sequence goes 1, 1, 2, 3, 5, 8, 13 etc. It's not hard to write a recursive version of this, e.g.

function fibonacci(n)
  if n eq 1 or n eq 2 
    return 1
    return fibonacci(n-1) + fibonacci(n-2)
  end if

but if we work through this on paper for say n =5 we find that we calculate the n=4 case once, the n=3 case twice, the n=2 case three times and the n=1 case twice.

A Python version of this and my model answers to the challenges are here.

Challenge: what's the rule in general for how many times fib(m) is called for each m < n?

Challenge 2: rewrite this recursively with exactly n-1 calls to calculate fibonacci(n). Hint after the post, or solutions at the Github link above.

If we're not careful, we might never notice all the extra work we're doing, which we could avoid. In this case, there's a big hint that something is funny at the point where we put two values of n into our base case.

Challenge 3:

Some image sharing sites now try to use a few real words to create memorable random urls. If you're given n lists of words, can you write loop based and recursive variants of the code to create every combination of one word from each list? Your code should work for any value of 'n'.

For example [large, small] [radiant, lame] [picnic, bobcat] should give (order not important) large-radiant-picnic, small-lame-bobcat, small-radiant-bobcat etc etc.

Small hint below the post

Moral of this Post

The takeaway from this one is that recursion isn't scary if you just think about getting from here to there, pretending all the business to get 'here' has been dealt with. Never mind the rest of the ladder, just think about the next rung. This is one of the vital skills to develop as a programmer, on every scale. Break things down into manageable steps and then build them into a program.

Keep scrolling for the hints....

Hint 1: can you return both the (n-1) and the (n-2) values?

Hint 2: the list created by combining lists 1 and 2 is itself a list of words. 3 lists is just 2 lists, and then another list.

February 07, 2019

New Training

New RSE training opportunity - intermediate level MPI. Following on from a basic MPI course, such as our Intro (Dec. 2018) or Warwick's PX425, we have a 1-day workshop on some trickier topics, such as MPI types. See here for details.

January 30, 2019

Data structures – Queues

This week we're back on data structures with another fundamental one: the queue. Simple data queues are pretty much the same as queues in real life, new items arrive at the back of the queue and data is removed from the front of the queue. They have obvious applications in any kind of program that gets data in from an external source and then has to process it in the order in which it is received. Queues are what are called FIFO data structures, first-in-first-out, because the first data to arrive is also the first data to leave. The alternative LIFO (last-in-first-out) data structure is generally called a "stack" because it is much more like stacking up paper in that the last item that you put down is the first that you pick up again.

The easiest way to implement a queue is using an array combined with a front and a back marker (I'm going to stick with the neutral "marker" description because the general idea doesn't care if you use array indices or pointers or any other mechanism that you want to). The idea is quite simple. You start with an array and when a data item turns up you insert it at the back marker and then move the back marker on one element. Adding an element to a queue is often called "pushing" or "enqueueing" so you might encounter those terms. Schematically, this looks like

Moving back marker in queue

When a piece of data is requested from the queue you return the item under the front marker and then move the front marker on by one. This operation is also known by the terms "pop" and "dequeue" so those are also worth remembering.

Queue moving front marker

And you continue doing this, moving the front and back markers as data arrives and leaves.

So far, so good but there are two obvious problems. The simpler one is what happens if data is requested when no data is in the queue. You have to have some mechanism for dealing with this case by indicating that there is no data available but that isn't too hard. The more troublesome one is what happens when the back marker goes off the end of your array? In the case as show, you can't really do anything at all and one way or another the queue will fail. This type of queue implementation is only useful if you have a fixed known number of items that you need to store as they come in and then deal with. It's not much use for getting data until your program stops and dealing with it.

If you want to do that, there are a few general solutions

1) You can get rid of the front marker entirely. Every time you dequeue an item you take the first item in the array and then simply shuffle the other items down one so that the first item is always populated so long as there are any items in the queue. This works well but it does potentially involve a lot of copying or moving of data if you have a large array that is mostly filled since every element of the array has to be shuffled up by one. If you are dealing with objects where a move/copy operation is expensive or are dealing with threaded code where some threads have to wait while this happens then it isn't the best solution

2) You can move to a circular queue. Circular queues are a bit strange and break the analogy with real world queues since queueing in a circle doesn't really work in reality. The idea is pretty simple though. Since you knowthat everything to the left of the front marker is unused, why not put new elements into there as they arrive and move the back marker around with them? Then when the front marker reaches the end of the array it just goes back to the beginning too. The effect is to mean that the front marker chases the back marker round in a circle. Shown schematically, this looks like

Circular queue

This works so long as back never catches up with front. If it does then either it has to stop adding data or it will overwrite data that is already there. It is possible to generalise these two examples to an arbitrarily long array - simply create a new array that is longer and copy the extant data into it and start again - but in most of the situations where you want a queue it should be possible to avoid this since that is another expensive operation (that will also render any pointers that you have to your data invalid so might involve more work to sort out pointers).

Generally queues are used in producer/consumer systems. There is something that is producing data and something else that is operating on it in some way. On average you have to be operating on the data at the same rate as you are producing it or the data will build up without limit, so the queue is only there to buffer data for brief periods while either your producer produces data faster than normal or your consumer consumes it slower than normal. In this case, all that you need is a large enough queue to deal with the largest expected difference in production and consumption rates. Of course, in a lot of systems you can only predict that "this queue is large enough for 99% of the expected variation" so for the other 1% of the time where you run out of space in your queue you have to decide whether it's worse to stall your entire system while you make your queue bigger (this might be OK for a web server for example where a page might take slightly longer to load but otherwise would work as expected) or simply throw away data (for example in a data logging system where any halt in the process would cause data loss but it might be lower if you just throw it away until there's space to store it than waiting while memory is allocated to hold it). Unfortunately, which is the right answer depends very much on the problem that you are working with.

As a final note, there is also a very similar data structure called the double ended queue or deque (pronounced "deck") that is similar but allows you to add and remove data from either end of a queue. They behave very similarly in general but the implementation gets a bit fiddlier because your back and front markers have to do double duty as each other.

January 16, 2019


A super quick post, mainly to point at somebody else's post from a while ago, namely

In case the link disappears, which it looks like the original post already has, the idea is as follows. This is something between a paraphrase and a direct quote, and all credit goes to the Original Poster quoted in the link above, but not named.

The memory space in the computer is like an area of beach, marked off for the building of sandcastles. To strain the analogy to breaking point, other parts of said beach might be marked off for other purposes (e.g. sunbathing, which has no useful relation to program space, and there is far less import to leaving your towel around).

  • When you want to build a sandcastle (create any object or variable) you ask for an area of beach, and one is selected and given to you.
  • What this area contains is a load of lumpy sand, usually containing the dregs of other peoples castles, possibly cutting across multiple. It's a mess, basically.
  • You, usually, either flatten the sand out, or immediately build a castle there, in both cases ignoring what might be there already.
  • When you're done, you surrender the area, but usually don't remove your castle.
  • The organisers are then free to give that area to somebody else.

So, what does this mean? Well:

  1. If you go back to your area (keep around a pointer, reference etc to it), your castle may well still be there, in all its glory. But somebody else can come along at any time and kick it down, and beach security are going to support their right to do so. So, you can't trust the shape of any of it.
  2. You also can't TELL whether anything has happened to your castle by examining it.
  3. If you accidentally released the area (delete, dealloc etc) and you've forgotten that, you will, sooner or later, find a bit that's not how you left it, or get your castle kicked down at a crucial moment. Do NOT DO THIS.
  4. You can't even trust what seems to be a patch of flat sand, unless you flattened it, or requested it pre-flattened (e.g. calloc in c).
  5. If you inherit somebody else's area, there might be a castle there! Since you can never trust the castle, you can't use it. Somebody might have already been along and undermined all the turrets. If you try and use it as a castle, all kinds of strangeness can ensue.
  6. BUT suppose somebody has a secret technique for building castles. If they leave one un-destroyed, you might be able to find out all kinds of "secret" things. So, any secret data has to be carefully destroyed. It's not enough just to surrender control of the area. This is just like the need to carefully destroy data on an old hard drive - by actually over-writing it, not just deleting it.
  7. If you, or somebody else, spills out of their designated area, you generally end up mucking up bits of each-other's castles, usually in such a way as to ruin them. DO NOT DO THIS either.

It's a simple analogy, but it is perfect to explain why it's so easy to miss uninitialised variables, and premature free/delete/dealloc operations. Often things still look perfectly fine! Often all the areas close to yours are unallocated, so you can apparently spill out of your area without consequence, until management changes policy on picking areas and suddenly things go wrong.

November 28, 2018

Experience Errors to Excel

No, not the Microsoft spreadsheet program. Warwick RSE don't think Excel is evil, or that it makes errors more likely, but most of what we work on or talk about is beyond the point where Excel, Numbers or any other spreadsheet software is really efficient. Anyway, this post is going to be a little bit of programming and research philosophy, as well as some practical bits about how to make errors in your code and research less likely.

Tools of the Trade

First, a few easy rules to make serious errors a lot less likely.

Have some Standards

Most of the "Old Guard" programming languages, the ones you find in banking software or aeroplanes, have very strict standards dictating what is valid in the language. They are usually revised every 3-5 years, and controlled by some professional body, often ISO. A working compiler or interpreter must obey these standards or it is wrong. In practice, support for new versions of standards takes some time to develop, but this is usually well documented. Some major packages, such as MPI, also have standards. If your language (or library) has standards, follow them! If it doesn't have formal standards (e.g. Python), be as conservative as you can in what you assume. For instance, try the calculation 2/3 in Python 2 versus Python 3.

7 Ps

Proper Planning and Preparation Prevents Piss Poor Performance. Supposedly a British Army adage, and one of my Mum's favorites. Plan before starting! Work out what you're going to do. Prepare - read up on any new techniques, do some "Hello World" examples with any new packages etc. Check all the bits of your plan can work in practice - can they be fast enough? Is your chosen language/package a good choice? Can you handle the amounts of data/computation/other needed?

Borrow your Wheels

There is a crucial balance between reinventing the wheel everytime you write some code on the one hand, and importing a package for every non trivial task on the other. Neither of these are a good idea, and both tend to lead to more and worse errors. Borrow widely (with proper attribution), but don't code yourself into a morass of dependencies that you can never hope to test. Especially don't borrow from things that are no better tested or verified than your own code - this is a recipe for disaster.

Document Everything

It's very easy to assume some feature or limitation of a code snippet is "obvious" when you're writing it, only tocompletely forget about it when you come back. The only way to avoid this is to document. Do this a few ways.

Restrictions on a code snippet, that don't impact anything else, so don't fit either of the next two paragraphs, should be commented as near the line as possible. Try and make it so that the comment still makes sense even if the code changes a bit - e.g. don't use line numbers, don't use "the following line" etc.

Restrictions on a function (parameter x must be +ve etc) are best done using one of the documentation libraries, so the restrictions are in comments near the function code.

Restrictions that run beyond a single function (don't use this code for more than 1 million elements at a time!, NaNs must be removed from data before input) should be mentioned both on the functions where they arise and in your examples and any user docs. User docs are vital if you're sharing you code with other people.

Check yourself before you Wreck yourself

Sometimes, you have known preconditions (things which must be true when a function is called), invariants (things which should stay the same throughout), and postconditions (things which are true after a function ends). You should consider adding code to check (some of) these things. Sometimes a compiler can do some of this for you, such as Fortran's array bounds checking, or C++'s ".at()" to get a vector element, rather than using "[]". Since these can be costly in run time, they may be better as a debugging option rather than a normal part of the code.

Test your Limits

The last essential element is testing. This is a very broad topic, and really isn't easy. At its core though, it means working out what parts of the code should do and checking that they do it. Smaller parts are easier to handle, but (remember your time is precious) can mean writing lots of trivial tests where you could easily absorb them all into one test. If it is at all possible, have at least one test case of the entire code - something you can run and know the answer to, that isn't too trivial. We talk a little about testing in our workshopsand there are millions of books, blog posts, papers etc on how to do it. Just make sure you do something. The rest of this post should convince you why.

The Ideal World

In an ideal world, any code you write has a well thought out, complete specification. Its purpose is fixed and perfectly known; there's a wealth of known results to compare to; the computations are all nice and self-contained; nobody else is really working on it, so there's no rush; and yet you can still do something novel and interesting with it. In this case, avoiding errors is as easy as it ever gets. You split the code into self-contained modules, each doing one thing and doing it well. You write careful error checking into every function, making sure they are never called with invalid parameters. You test each function of every module carefully in isolation and as you put the modules together, you test at various stages. Finally, you run some problems with known results and verify that you get the correct answer. As you write, you document all the functions and the assumptions they make. Finally, you write some user documentation describing what the code is and does, what its limitations are.

That's all great in theory, but for it to be enough it really is essential that everythingof importance be in the specs. I am not sure that has ever happened, anywhere, ever. It gets close in things like Aerospace and Medical engineering, but even there mistakes are made.

Code or Computation

The second best situation, from a coding point of view, is the one where the Code is trivial, but the Computations hard. For instance, a simple program to count which words occur near to some target word is fairly easy to write, although far form trivial (capitalisation, punctuation, hyphenations etc), but text concordance analysis is widely used and is able to find new, novel results if correctly analysed. Or imagine coding up a very sophisticated algorithm, with minimal support code. From a code perspective this is also quite easy - you just make sure the code contains the right equations (and hope that you wrote them down correctly).

Of course, in these case you still do all the things from the ideal-world - you modularise the code, you test it carefully, you document the assumptions. I'm not going to discuss all the details of this here: we cover a lot of it in our training, especially here. You do all this, and you'll catch some errors, but it probably wont be all of them.

Sadly though, with many pieces of code you may write, you don't even have this level of "simplicity". The code is hard - it uses difficult techniques, or has many interacting pieces, or there aren't any good test cases you can use. Testing, verification and documentation as important now, if not more so. But you wont find all the potential errors.

What's the Point?

But at this point, it seems a bit pointless. You make all this effort, spend all this time on testing and verification, and its still probably got bugs in. Your code still probably has errors somewhere no matter what you do.

Well firstly, and most practically, the more you check and test, the smaller your errors are hopefully getting. You catch any really big glaring errors, you fix lots of things, and you absolutely have a better piece of code. The time you're spending is never wasted, although its important to try and direct it to the places you get the most return. Keep the 80-20 rule in mind. You get 80% of the gains from the first 20% of the effort in many endeavours, but getting the final 20% takes the remaining 80% of the time. This doesn't mean you should fix 20% of the bugs though! It means you finish all the easy 80s before moving on to the more finicky checks.

Secondly, more philosophically, you will always know that you did it. You can point to the time you spent and say "I tried. I did the things I was supposed to. I wont make this mistake again." That is a very nice thing to be able to say when you find you've made a mistake that's going to cost you time and/or effort.

Sh*t Happens

So, if you've published a paper, you've probably published a mistake. Hopefully, this isn't a important msitakes, just some minor grammar, a misprinted formula, or a miswritten number that doesn't affect results. It's pretty important to keep perspective. Everybody makes these mistakes. The important thing is not to make them because you didn't do things how you know you should.

I like to read Retraction Watchon occasion. Much of what they publish is serious misconduct from serial bad actors, but sometimes there's an interesting post about honest mistakes, and interesting commentary on how it's meant to work. For instance, when should a paper be retracted rather than amended? How do you distinguish stock phrases from plagiarism? What's so bad about salami slicing?

I particularly like things like this retraction, where an error was found that needs more work to address, so a correction wasn't suitable; or this one where the error meant the result wasn't interesting. The errors probably happen a lot more often than the retractions or corrections. That's not really in scope of a blog post, but it's good to see the process working well.

$600,000 Mistakes

To summarise briefly, everybody makes mistakes. If you're lucky yours will be minor, caught before they go "into the wild", and make funny anecdotes rather than tragic tales. The worst thing you can do, as a researcher, is try to hide your errors. You should fix them. Whether this can be done in a follow-on paper, or needs a correction or retraction, it should be done!

Errors in code, rather than papers, are a tiny bit different - they're probably more likely to be harmless. But don't be fooled - if the error invokes "undefined behaviour" (anything outside of, or not conforming to, the standard) you will never know whether the result was affected unless you redo everything. It's very tempting to run a few test cases, find they're OK and assume all the others are fine too. The definition of undefined behaviour means you can't really know - its perfectly valid to work 99 times and fail the 100th. If you find any, you have to fix it and redo all results. Hopefully they're unchanged. That's great. You fix the code, you upload the fixed version - admitting the error and the fix, and you add the error to your list of "things to be careful of". You don't need to do anything about your results - they're still solid.

If results are affected, talk to your supervisor, or PI, or trusted colleagues and work out what's an appropriate solution. Make sure you've learned from this and you're not going to make the same mistake again. The former CEO of IBM, Thomas John Watson Snr said it best:

Recently, I was asked if I was going to fire an employee who made a mistake that cost the company $600,000. “No”, I replied. “I just spent $600,000 training him – why would I want somebody to hire his experience?”

Don't be scared of bugs. They happen and they always will. But the more you find, the better you develop your "sixth sense" and can almost smell where they might be, and the better your code and the research you do with it, will become.

November 14, 2018

Now with Less NaN!

This week I'm at Supercomputing 2018 in Dallas, enjoying a wide variety of talks! Just a short post on a couple of the "Aha" moments from this talkon correctness and reproducibility of Floating point code by James Demmel (University of California, Berkeley). I had never noticed the perils of NaN and Inf propagation he mentioned and I think they're really worth knowing. I said a bit about IEEE 754which dictates what NaN and Inf are and how they behave in this post. This entry is mostly about unexpected ways the standard can be broken by accident, mainly due to optimisations that rely on multiplying by zero. Per the standard, anything except NaN, times by zero is zero, but any operation involving a NaNmust give NaN.

Max Value the Obvious Way

Imagine you're asked to write a routine to find the Max or Min value of an array. It's obvious. Start by assuming it's the first value, and then compare every other value to see if it's higher/lower, until you reach the end. Max/Min MUST look at every value in the array to get their answer (I am sure there's a mathematical word for that, but I don't know it), so this is the most efficient option there is in terms of comparisons. However, it has a major flaw once we account for IEEE behaviour: NaN compares as false to everything, even itself.

So, imagine running this algorithm on the arrays [NaN, 2, 1] and [2, NaN, 1]. We get NaN and 2 respectively, as we've made the first element "sticky"! Step-by-step:

Max of [NaN, 2, 1] :
Take first element : NaN
Compare to second element: 2 > NaN -> False : Max = NaN
Compare to third element: 1 > NaN -> False : Max = NaN
Max of [2, NaN, 1] :
Take first element: 2
Compare to second element: NaN > 2 -> False: Max = 2
Compare to third element: 1 > 2 -> False: Max = 2

This was an example given in the talk, and it's obvious when you look at it, but easy to not think of! There's two serious problems here: firstly the loss of the NaN value in the second case, which violates the standard and secondly the change of answer on array permutation, which is just generally bad for something like Max which should be invariant.

3x3 Determinant

Now imagine taking the determinant of a 3x3 Matrix. Generally, this is easy. You take each element of the first row, and multiply by the determinant of the 2x2 sub-matrix omitting the column at hand, with appropriate signs. Suppose a first row element is zero: obviously that term contributes 0 regardless of the 2x2 submatrix! So you can speed things up by testing for zero and omitting any such terms, right? Not once NaN gets involved. Now consider this matrix:

1 	0 	0
NaN	1	1
0	0	1		

We should get: 1 * 1 + 0 * NaN + 0 * NaN = NaN. But with our clever optimisation, we get 1 * 1 + 0 + 0 = 1. This is a pretty big problem! We're hiding away that NaN, by accident!

There's another problem too! Even if we checked for NaN in our array, what if we have an Inf instead? Inf * 0 should give us NaN. Effectively, Inf is "a number, but bigger than we can represent". Zero is "anything smaller than the smallest number we can represent". The product is then kind-of "any number, we don't know which". It might be zero, it might be Inf, it might be anything. So, the standard dictates it be NaN.

So, if we use the optimisation, we're not just hiding a NaN, we're stopping one ever being created! This is potentially an much more serious problem, because we might be happy using the Compiler flag to give us errors on NaN, but not willing to make under or overflows fatal. Then we'd completely miss this problem!

Optimisations in BLAS

The talk (at least it's first half) focused on the Linear Algebra library BLAS, which is heavily optimised, but as it turns out does have several bugs like this. For instance, a slightly complicated multiply op, alpha * A * B, might check for alpha = 0 in order to save the work of the matrix multiply when the end result should be zero anyway. Unless, of course, either A or B contains any NaN's when we must get back NaN.

But it's all gone wrong anyway...

One comment on the talk, which is worth a mention was, to paraphrase "If you're getting NaN's things have already gone wrong, so why should the Linear Algebra library bother to check? You should be checking." This is an interesting point and in some circumstances holds true. I tend to think of NaN or Inf as bugs and would prefer to write code so they can't arise, but I am not working with long linear algebra workflows where you want to run a series of operations in something like BLAS and don't examine the intermediate steps.

Reproducible BLAS

The problem above was largely the topic of the first half of the talk, which focused on how BLAS could manage the problem via errors/exceptions, and being careful with optimisations. For instance, use the optimisations to do the calculations, but also return an error code detailling any issues with the arguments.

Another large part of this talk was about reproducibility, in particular in parallel codes. Here, depending on the processor decomposition, you can get subtly different results for a sum over numbers of varying size. This is a really tricky problem, which is being addressed by things like ReproBLAS.


Dealing with Floating point numbers is hard. It gets harder once you demand reproducibility at a bitwise level. Both the Maxval issue above, and a simple sum with values of varying size may not give the same answers when terms/entries are reordered. In particular, parallel codes always effectively reorder terms whenever you chance the decomposition (e.g. run on more or less cores).

Hopefully, you never need to really worry about NaN (or the less tricky, but still tricky Inf), but do be careful! The problems I've discussed are mainly due to treating zero incorrectly per the IEEE standard, the problems with NaN or Inf following from this. Be careful with zero! It might not be as small as you think ;)

May 2019

Mo Tu We Th Fr Sa Su
Apr |  Today  |
      1 2 3 4 5
6 7 8 9 10 11 12
13 14 15 16 17 18 19
20 21 22 23 24 25 26
27 28 29 30 31      

Search this blog



Blog archive

RSS2.0 Atom
Not signed in
Sign in

Powered by BlogBuilder