April 20, 2018

Proper Planning and Preparation

There is an old, supposedly military adage, which reads

Proper Planning and Preparation Prevents Piss Poor Performance.

This applies to many things, including software development projects. There is also a popular idea known as Hofstadter's Law:

It always takes longer than you expect, even when you take into account Hofstadter's Law.

This is an amusing idea, but is sometimes allowed to apply to real life, which rapidly spirals out of control. There are many reasons a software project keeps taking longer the more you work on it, most of which are best avoided. This post is a brief introduction to how to plan, how to work, and how to keep things on track.

Software for Research

RSE is Research Software Engineering, which can mean many things, but ultimately means we are in the business of writing software used for doing research. Similarly, researchers-who-write-code should be aiming to for the least work which gets them reliable, verifiable, useable code. Note that usable includes being fast enough, and reliable includes doing enough testing.

This makes planning and scheduling essential, to make sure work gets done in time to produce research outputs, but also means it should be lightweight, simple and flexible.

Why Plans Fail

No plan survives contact with the enemy

(A well known paraphrase of a quote by H. von Moltke). Just like Hofstadter's law, one can add the corollary 'even if you include that in the plan'. Knowing the challenges ahead of time lets you plan around them, which gives the vast chance of something from the plan surviving. The following are the commonest sorts of issues to consider.

Shooting in the Dark

It's quite common to start a project not knowing exactly how to do it. This is fine, but has to be allowed for in the plan. For instance, allowing time to write a small test problem using a new library/tool before jumping into the main project, or allowing time at the end to rewrite/refactor (see below). Do not fall victim to the sunk costs fallacy and continue something unpromising just because you have already invested time and effort into it.

Scope Creep

Scope creep is probably the biggest problem many people face with research code. What starts as a simple script turns into a complex behemoth, and never seems to end. A vital part of planning, particularly if you are shooting in the dark, is to decide exactly what the code must do, and once that goal is reached, to stop. This is the perfect point to rejig things, debug well, document, optimise etc, before making a new plan to add all the things you now know you need. Endless rolling feature addition is risky.

Debug Deluge

It isn't uncommon to see a programmer writing pages of code before ever testing if it runs, or even compiles. This goes away with experience, but is often replaced by a related issue, of writing a bunch of stuff and then finding out that it doesn't quite work and has created a debugging nightmare. A better approach is to work in chunks, making each bit work, testing it, and only then integrating it into the whole. This way there's no deluge of debugging to be done, which can take a very long time; only small pieces at a time which are easier to plan for. This is related to the programming style called Test Driven Development (see below).

Premature Optimisation...

...is the root of all evil (Donald Knuth). Until your code works, there is nothing to optimise, as you'd just be doing the wrong thing faster. However, it is important to consider what the code needs to do in the end, and design and plan something that will be fast and efficient enough.

Refactor or Rewrite

Refactoring is rewriting pieces of a code so that the blocks (usually functions) give the same answers but are simpler, faster, more maintainable etc. Rewriting covers more substantial changes, such as changing core data structures or moving from hard-coded input values to a configuration file etc.

Getting Agile

If you have seen any of the Indiana Jones films, you may notice he applies something very like this. Rather than crafting a complex plan and trying to implement it in the face of changing circumstances, he adapts to things as they happen and is never more than a step, or a few steps ahead. This is basically the idea behind the 'Agile Methodology' of software development.

Since clients (or in our case, a research problem) change what they need over time, the idea is to plan only a few weeks ahead, adapting to changing needs as they arise. This is great for commercial development, which is all about happy customers, but does have a few pathologies for research software if done strictly. The main concerns are stopping yourself letting your scope creep out of control, and getting good enough time estimates to be sure you can complete a project, and that it will benefit you.

It's well worth reading a few articles about Agile to see what it's all about, and taking any bits you feel are useful for your own development.

Test Driven Developement (Lite)

'Test-driven-development' (TDD) means writing tests first, and considering a piece of code complete once it passes them, at which point it is done and work should stop. This has plenty of advantages, in particular that you avoid the pitfall of writing the test already knowing how the code works, which often leads to writing a test you know will pass.

The over-literal genie crops up all over the place and something very similar happens when you give clever people a set of rules. For a lot of scientific code, the natural test set is a handful of analytically soluble problems, for which exact answers are known. Using these as the test suite, the obvious, and strictly correct, TDD approach is just to hard-code their solutions. Every test passes, and the code is complete and usually completely useless too.

Rather than strict Test Driven Development, we'd recommend using writing tests for functions before or just after you finish them, making sure they work for all the parameter ranges etc that you need especially things like negative numbers, and then keeping the tests for when or if you change those functions. Rather than running everything every time you make changes you can just run the relevant tests and satisfy yourself that nothing is broken and then focus on integrating pieces together and making sure they all work in combination.

Planning for a Project

Everybody has their own way of planning, and really the only important thing is to do something so that you can estimate how long something will take and keep yourself honest along the way. For small things you only need to take an educated guess how many hours or weeks something will require. For more complex projects you may want to consider some kind of itemised planning process.

This file project_estimatorwexample.pdf is an example of such a planning worksheet, including an example based on my last project. This sheet tries to calculate a time requirement based on some common factors in research code. The multipliers and scalings are only a guess, and the last section adds an arbitrary buffer factor, but this would hopefully provide a starting point for making formal time estimates.

Last Word

This is quite an eclectic post, running through a few bits of useful knowledge that often get left out of programing introductions. The core point is simply

If you fail to plan, you are planning to fail!

Make a plan, stick to it as far as possible, but don't plan so far ahead that you're constantly wasting time re-forming your plans instead of getting work done!


April 04, 2018

Valgrind and Extended Precision

Valgrind is an invaluable tool for checking for memory errors in compiled code. Briefly, it runs your code inside its own runtime which includes checked versions of system functions like malloc, so can detect undefined variables, memory leaks and much more. Using this on one of my C++ codes, I found something rather odd. When I ran normally, I had no problems, but inside valgrind I got unexpected NaN results. The actual code, in Minimum Working Example form was

#include <boost/math/special_functions.hpp>
#include <iostream>
int main(){

  double arg = -5.58377179584844451;
  std::cout<<"Bessel call "<<boost::math::cyl_bessel_j(5+1, (float) arg)<<'\n';
  std::cout<<"Bessel call with double "<<boost::math::cyl_bessel_j(5+1, arg)<<'\n';

}

Running normally, I got the expected answer of 0.196642 for both calls. Run inside valgrind, the float version was unchanged, but the double version became NaN! This was very strange. Until I constructed the example above, I assumed the bug was elsewhere within my code, and not with the Bessel functions themselves.

After a lot of head-scratching and searching, I found a link to the valgrind manualand the answer:

Precision: There is no support for 80 bit arithmetic. Internally, Valgrind represents all such "long double" numbers in 64 bits, and so there may be some differences in results. Whether or not this is critical remains to be seen.

After a bit more searching and then checking of DBL_MIN and LDBL_MIN from limits.h I had an answer:

#include <limits>
#include <iostream>
int main(){
  std::cout<<__LDBL_MIN__<<'\n';
  std::cout<<__DBL_MIN__<<'\n';
  std::cout<<"Minimum value is "<<std::numeric_limits<long double>::min()<<'\n';
}

Normally one gets a result like:

3.3621e-4932
2.22507e-308
Minimum value is 3.3621e-4932

In valgrind once again:

0
2.22507e-308
Minimum value is 0

So the limited 80-bit support referred to was, on my platform at least, breaking __LDBL_MIN__ which should be the smallest representable non zero value. Since the real problem is happening off inside boost::bessel somewhere, there's no simple fix. Most likely, the problem is that boost uses __LDBL_MIN__ to soften a division, preventing a divide by zero which later results in a NaN. The easier workaround requires that you detect running inside valgrind and adjust calculation. This is easy using the supplied macro:

#include "valgrind.h"
#ifdef RUNNING_ON_VALGRIND
  if(RUNNING_ON_VALGRIND){
    //Alternative path here
  }
#endif

Note valgrind.h is specially licensed, differently to the rest of Valgrind, so that including it in your code does not require you open-source your code. Inside the special path, I used a float version of the bessel call, since inside valgrind I am interested in general output, but don't need the full precision of a double.

Another alternative overrides __LDBL_MIN__ before including the boost header. I chose to set it equal to __DBL_MIN__ Note this only works with header-only libraries and is a bit risky. The code becomes something like:

#undef __LDBL_MIN__
#define __LDBL_MIN__ __DBL_MIN__
#include <boost/math/special_functions.hpp>
#include <limits>
#include <iostream>
int main(){
  std::cout<<__LDBL_MIN__<<'\n';
  std::cout<<__DBL_MIN__<<'\n';
  std::cout<<"Minimum value is "<<std::numeric_limits<long double>::min()<<'\n';

  double arg = -5.58377179584844451;
  std::cout<<"Bessel call with double "<<boost::math::cyl_bessel_j(5+1, arg)<<'\n';
}

Sometimes, the bug really is in the compiler or tool!


March 21, 2018

Branch By Boolean

Once upon a time, all programming was for performance. With computer power at a premium, every instruction was a cost. Oddly, the cost of branching (acting differently based on a condition) has got both more and less expensive. Branches are evaluated more quickly now, but improvements in branch prediction (see below) mean branches can be more costly relative to the branchless option. Modern processors are in some ways more similar to the original supercomputers than ever, in particular, their use of vector operations. In this post we talk a bit about what branch prediction means and show a couple of tricks which can be quite useful in code in important loops.

What is Branch Prediction?

Branch prediction refers to the ability of the processor to predict (aha!) which branch is likely, and to get ready to take it. This can happen before the condition is actually evaluated. Once the condition has been checked, if the guess was correct, the operation stream continues, but if it was wrong the incorrect instructions in the queue must be dumped and the correct instructions for the actual branch assembled. This can sometimes be quite costly. Some systems instead execute both possible branches, and throw away the unneeded result, which can halve their possible throughput.

Generally branch predictors are dumb. The commonest option is to assume whatever track was taken previously will be taken again. This is great when you have one rare case (for example, an error condition) and a common one. Sometimes the predictor can 'understand' simple alternation (option 1, then option 2, then option 1...) or similar patterns if they are 'obvious enough'.

How much does it cost, really?

The costs of branch misprediction vary widely, depending on hardware, vectorisation, surrounding code etc. Most profiling tools will give you a miss rate, and reducing this in regularly repeated pieces of code can be worthwhile. A simple demo of branch misses is given here. I have tried to make all paths as similar as possible. The final print statement ensures that the total is used so the loop cannot be simply optimised away.

On my machine, the example compiled without optimisation takes about 0.8 seconds with compile-time option -DBR_3, i.e. the maximum possible number of addition operations, and the case which would take the longest if there was nothing more than simple looping and addition in play. Options 1 and 2 both take about 1 second, and the option with half true, half false, just over 0.8 seconds. This suggests something like the 'same as last time' prediction occuring.

Avoiding Branches

When it does matter, there are several ways to avoid true branches and greatly improve performance in a piece of code. Always profile first, though, to be sure this piece warrants improving.

A Simple Rejig

The first thing to consider is whether a simple rewrite of the code can eliminate a branch, or improve its prediction. Many branch predictors are quite dumb and assume whatever branch was taken the previous iteration will be taken again. If you can rewrite to ensure longer 'runs' of one branch, this may help. The previous section showed one example where this can be quite dramatic.

Do More to Do Less

Sometimes it may even be more effective to do more work, but without a condition, rather than check if the work needs to be done. A trivial example is something like

  DO i = 0, 10000
    IF (B[i] /= 0) A = A + B[i]
  ENDDO

where the check is redundant because if B[i] is zero, the addition does nothing.

Function Pointers

Another useful trick is to use function pointers (or named functions), as discussed in this previous Soup. Rather than putting an 'if' into a vital loop, first work out which function needs to be called and then call it. As given in the previous post, we can replace

  DO i = 0, 10000
    IF (condition) function_1()
    ELSE IF (condition) function_2()
    ELSE IF (condition) function_3()
    ...
  ENDDO

with

  IF (condition) ptr = function_1
  ELSE IF (condition) ptr = function_2
  ELSE IF (condition) ptr = function_3
  ...

  DO i = 0, 10000
    ptr()
  ENDDO

As usual, this may or may not help: in some languages function pointers are slow themselves (not C or Fortran though, but e.g. std::function in C++ can be), and sometimes the called function is already complex enough to break any vectorisation or similar optimisations.

Boolean Branch

One particularly sneaky one is the Boolean Branch, most useful in languages which equate 'True' with 1 and False with '0', but still viable with integers otherwise. The trick is to note that

IF condition THEN
  result = value1
ELSE
  result = value2
ENDIF

is the same as

flag = condition !Flag may be an integer; in C it can be Boolean
result = value1*flag + value2*(1 - flag)

The conditional assignment in the first case is replaced with two simple assignments. This may or may not be faster! Profiling is essential to find out. It is more likely to pay off in a case where the two options alternate, especially when this is irregular, so the branch prediction fails regularly.


March 09, 2018

Odd MPI IO bug in Open–MPI

Quite often working with research code leads you to find the unusual edge cases where even well tested libraries break down a bit. This example that we ran across with the Open-MPI parallel library is pretty typical. We wanted to use MPI-IO to write an array that was distributed across multiple processors, with each processor holding a fraction of the array. Because of how we were using the arrays, the array on each processor had a strip of "guard" cells along each edge that were used to exchange information with the neighbouring processor and these had to be clipped off before the array was written. MPI makes this very easy to achieve using MPI types. (This example is given in Fortran, because that was the language that we encountered the problem in. C would have the same problems)

First you create a type representing the total array using MPI_Type_create_subarray

  sizes = (/nx_global, ny_global/)
  subsizes = (/nx_local, ny_local/)
  starts = (/x_cell_min, y_cell_min/)
  CALL MPI_TYPE_CREATE_SUBARRAY(ndims, sizes, subsizes, starts, &
      MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, global_type, ierr)

This specifies an array that's globally 1:nx_global x 1:ny_global, and locally 1:nx_local x 1:ny_local. The starts array specifies where the local part of the global array starts in the global array, and depends on how you split your global array over processors. Then you pass this type as a fileview to MPI_File_set_viewto tell MPI that this is how data is arranged across your processors.

The actual data is in an array one bigger on each end (0:nx_local+1x 0:ny_local+1), so we need another type representing how to cut off those additional cells. That's MPI_Type_create_subarray again

  sizes = (/nx_local+2, ny_local+2/)
  subsizes = (/nx_local, ny_local/)
  starts = (/1, 1/)
  CALL MPI_TYPE_CREATE_SUBARRAY(ndims, sizes, subsizes, starts, &
      MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, core_type, ierr)

When you pass this as the datatype to a call to MPI_File_write or MPI_File_write_allyou pass MPI only the 1:nx_localx 1:ny_local array that you promised it when you called MPI_File_set_view. The final result will be an array 1:nx_globalx 1:ny_globalno matter how many processors you run the code on.

The problem was that it wasn't working. When we ran the code we found that everything worked as expected on files < 512MB/processor in size, but when we got beyond that the files were always systematically smaller than expected. They weren't a fixed size, but they were always smaller than expected. As we always advise other people to do we started from the assumption that we had made a mistake somewhere, so we went over our IO code and our MPI types. They all appeared normal, so we started taking parts out of our code. After removing a few bits, we found that the critical element was using the MPI derived type to clip out the guard cells from the local arrays. If we just wrote an entire array using primitive MPI types the problem didn't occur. This was about the point where it started to look like it might, just possibly, be an MPI error.

Next, we created the simplest possible test case in Fortran that replicated the problem and it turned out to be very simple indeed. Just create the types for the filetype to MPI_File_set_view and the datatype to MPI_File_write and write an array larger than 512MB/processor. It even failed if you just coded it up for a single processor! It was unlikely at this stage that we'd just made a mistake in our trivial example code. As a final check, we then replicated it in C and found the same problem happened there. Finally, with working examples and some evidence that the problem wasn't in our code, we contacted the Open-MPI mailing list. Within a few hours, it was confirmed that we'd managed to find an edge case in the Open-MPI library and they'd created patches for the new versions of Open-MPI.


There are a couple of take away messages from this

  1. If you are using MPI-IO and are finding problems when files get larger than 512MB/processor you might need to update your MPI installation
  2. Sometimes there really are bugs in well tested and widely used libraries
  3. It's a good idea to be as sure as possible that you aren't making a mistake before assuming that you've found one.

February 21, 2018

SOUP: Function Pointers

Today's snippet demos function pointers (objects in Python), in particular an array of function pointers. We use them to print a table of arithmetic operations on all of the IEEE special values.

IEEE 754 defines the behaviour of floating point numbers, in particular what happens when numbers get unrepresentable, whether that is too large, too small or plain not-numbers.

Infinity is a familiar enough concept and in floating-point it mostly means a number which is too large to be represented. There's a positive and a negative infinity and most arithmetic works as expected.

Negative zero (-0.0) seems quite odd at a first glance. The sign-bit is set, but in every other way, and in general, it is equal to positive zero. Comparisons like `0.0 == -0.0` are defined to be true and `-0.0 < 0.0` is false. Most languages have a function to copy the sign from one number to another though, and -0.0 works as expected.

NaN, Not a Number, mostly appears as a lament, "Why is it NaN!?" or worse "Why is it NaN again!?" Any operations involving NaN also give NaN as do several others.

It seems strange that `Inf * 0.0` or `Inf + -Inf` both give NaN where they could both reasonably give zero. Philosophically, both are numbers, but completely unknown ones. Inf isn't mathematical infinity, it is just too large to represent, and 0.0 stands in for any number too small to represent. Their product then can be absolutely anything, hence defining it as NaN.

Code snippets in C, Fortran and Python are in our SOUP repo under 001_IEEE. All three use named functions to "dry out" the code: in fact they use an array of them. Note in Fortran this needs us to use a custom type to hold the function pointer, as we can't have an array of simply pointers.

The core of all snippets is the loop over operations and inputs

In C:

  float (*op)(float, float);/*Holds operation function*/
  float(*allOps[4])(float, float);

  allOps[0] = &add; allOps[1] = &sub; allOps[2] = &mul; allOps[3] = &div;

  for(opNum=0; opNum< 4; opNum++){
    op = allOps[opNum];
    
    for(rowNum = 0; rowNum < 7; rowNum++){
      row = allRows[rowNum];
      /*print result of op(row, column)*/
    }
  }

In Fortran (where f_ptr is our type holding a pointer):

  TYPE(f_ptr), DIMENSION(4) :: allOps
  TYPE(f_ptr) :: currOp

  allOps(1)%ptr => add
  allOps(2)%ptr => sub
  allOps(3)%ptr => mult
  allOps(4)%ptr => div
  DO opNum = 1, 4
    currOp%ptr => allOps(opNum)%ptr

    DO rowNum = 1, 7
      row = allRows(rowNum)
      !Print results of currOp%ptr applied to row, column
    ENDDO
  ENDDO

And in Python (using range-based for loops on lists):

  allOps = [add, sub, mul, div]

  for opNum in range(0, 4):
    op = allOps[opNum]

    for rowNum in range(0, 7):
      row = allRows[rowNum]
      #Print result of op(row, column)

Note all three are subtly different in how one assigns the active operation. In C you just get a pointer to a function with the usual address-of operator, &. In Fortran you point your pointer to the proper target using the points-to operator, =>. In Python we just set the operator to equal the function, omitting the brackets () turning this into an assignment, not a call.

Function pointers also have a powerful use in High-Performance code. A long if-else chain inside a loop, which calls a different function in each branch, can be replaced with the same if-chain outside the loop setting a function pointer and then a simple call inside the loop, eliminating the branch. As pseudo-code:

  DO i = 0, 10000
    IF (condition) function_1()
    ELSE IF (condition) function_2()
    ELSE IF (condition) function_3()
    ...
  ENDDO

becomes

  IF (condition) ptr = function_1
  ELSE IF (condition) ptr = function_2
  ELSE IF (condition) ptr = function_3
  ...

  DO i = 0, 10000
    ptr()
  ENDDO

February 01, 2018

Upcoming training opportunity

February 26-27, University of Leicester. Software Carpentry. A 2-day rapid introduction to basic programming and tools, primarily focussed on programming in Python.


December 20, 2017

SOUP (Snippet of the Undefined Period)

Every-so-often we come across an interesting, imaginative, instructive or idiomatic code snippet. These can be amusing bugs, a task completed in several languages, or exemplars of how to do something well. Because we post them as and when we find them, they appear at unspecifed times, so we named them Snippet of the <Undefined Period> (SOUP). This gives us a nice acronym and is about as humorous as software engineering gets.

Often the interesting part of the code is buried in boilerplate etc, so we only post the interesting snippet directly. The full code is posted to the WarwickRSE github repository at WarwickRSE/SOUP. As a rule we try and produce a C, Fortran and a Python version of everything, with other languages on request. Comments and suggestions are welcome!


July 2024

Mo Tu We Th Fr Sa Su
Jun |  Today  |
1 2 3 4 5 6 7
8 9 10 11 12 13 14
15 16 17 18 19 20 21
22 23 24 25 26 27 28
29 30 31            

Search this blog

Tags

Galleries

Blog archive

Loading…
RSS2.0 Atom
Not signed in
Sign in

Powered by BlogBuilder
© MMXXIV