All entries for November 2018

November 28, 2018

Experience Errors to Excel

No, not the Microsoft spreadsheet program. Warwick RSE don't think Excel is evil, or that it makes errors more likely, but most of what we work on or talk about is beyond the point where Excel, Numbers or any other spreadsheet software is really efficient. Anyway, this post is going to be a little bit of programming and research philosophy, as well as some practical bits about how to make errors in your code and research less likely.

Tools of the Trade

First, a few easy rules to make serious errors a lot less likely.

Have some Standards

Most of the "Old Guard" programming languages, the ones you find in banking software or aeroplanes, have very strict standards dictating what is valid in the language. They are usually revised every 3-5 years, and controlled by some professional body, often ISO. A working compiler or interpreter must obey these standards or it is wrong. In practice, support for new versions of standards takes some time to develop, but this is usually well documented. Some major packages, such as MPI, also have standards. If your language (or library) has standards, follow them! If it doesn't have formal standards (e.g. Python), be as conservative as you can in what you assume. For instance, try the calculation 2/3 in Python 2 versus Python 3.

7 Ps

Proper Planning and Preparation Prevents Piss Poor Performance. Supposedly a British Army adage, and one of my Mum's favorites. Plan before starting! Work out what you're going to do. Prepare - read up on any new techniques, do some "Hello World" examples with any new packages etc. Check all the bits of your plan can work in practice - can they be fast enough? Is your chosen language/package a good choice? Can you handle the amounts of data/computation/other needed?

Borrow your Wheels

There is a crucial balance between reinventing the wheel everytime you write some code on the one hand, and importing a package for every non trivial task on the other. Neither of these are a good idea, and both tend to lead to more and worse errors. Borrow widely (with proper attribution), but don't code yourself into a morass of dependencies that you can never hope to test. Especially don't borrow from things that are no better tested or verified than your own code - this is a recipe for disaster.

Document Everything

It's very easy to assume some feature or limitation of a code snippet is "obvious" when you're writing it, only tocompletely forget about it when you come back. The only way to avoid this is to document. Do this a few ways.

Restrictions on a code snippet, that don't impact anything else, so don't fit either of the next two paragraphs, should be commented as near the line as possible. Try and make it so that the comment still makes sense even if the code changes a bit - e.g. don't use line numbers, don't use "the following line" etc.

Restrictions on a function (parameter x must be +ve etc) are best done using one of the documentation libraries, so the restrictions are in comments near the function code.

Restrictions that run beyond a single function (don't use this code for more than 1 million elements at a time!, NaNs must be removed from data before input) should be mentioned both on the functions where they arise and in your examples and any user docs. User docs are vital if you're sharing you code with other people.

Check yourself before you Wreck yourself

Sometimes, you have known preconditions (things which must be true when a function is called), invariants (things which should stay the same throughout), and postconditions (things which are true after a function ends). You should consider adding code to check (some of) these things. Sometimes a compiler can do some of this for you, such as Fortran's array bounds checking, or C++'s ".at()" to get a vector element, rather than using "[]". Since these can be costly in run time, they may be better as a debugging option rather than a normal part of the code.

Test your Limits

The last essential element is testing. This is a very broad topic, and really isn't easy. At its core though, it means working out what parts of the code should do and checking that they do it. Smaller parts are easier to handle, but (remember your time is precious) can mean writing lots of trivial tests where you could easily absorb them all into one test. If it is at all possible, have at least one test case of the entire code - something you can run and know the answer to, that isn't too trivial. We talk a little about testing in our workshopsand there are millions of books, blog posts, papers etc on how to do it. Just make sure you do something. The rest of this post should convince you why.

The Ideal World

In an ideal world, any code you write has a well thought out, complete specification. Its purpose is fixed and perfectly known; there's a wealth of known results to compare to; the computations are all nice and self-contained; nobody else is really working on it, so there's no rush; and yet you can still do something novel and interesting with it. In this case, avoiding errors is as easy as it ever gets. You split the code into self-contained modules, each doing one thing and doing it well. You write careful error checking into every function, making sure they are never called with invalid parameters. You test each function of every module carefully in isolation and as you put the modules together, you test at various stages. Finally, you run some problems with known results and verify that you get the correct answer. As you write, you document all the functions and the assumptions they make. Finally, you write some user documentation describing what the code is and does, what its limitations are.

That's all great in theory, but for it to be enough it really is essential that everythingof importance be in the specs. I am not sure that has ever happened, anywhere, ever. It gets close in things like Aerospace and Medical engineering, but even there mistakes are made.

Code or Computation

The second best situation, from a coding point of view, is the one where the Code is trivial, but the Computations hard. For instance, a simple program to count which words occur near to some target word is fairly easy to write, although far form trivial (capitalisation, punctuation, hyphenations etc), but text concordance analysis is widely used and is able to find new, novel results if correctly analysed. Or imagine coding up a very sophisticated algorithm, with minimal support code. From a code perspective this is also quite easy - you just make sure the code contains the right equations (and hope that you wrote them down correctly).

Of course, in these case you still do all the things from the ideal-world - you modularise the code, you test it carefully, you document the assumptions. I'm not going to discuss all the details of this here: we cover a lot of it in our training, especially here. You do all this, and you'll catch some errors, but it probably wont be all of them.

Sadly though, with many pieces of code you may write, you don't even have this level of "simplicity". The code is hard - it uses difficult techniques, or has many interacting pieces, or there aren't any good test cases you can use. Testing, verification and documentation as important now, if not more so. But you wont find all the potential errors.

What's the Point?

But at this point, it seems a bit pointless. You make all this effort, spend all this time on testing and verification, and its still probably got bugs in. Your code still probably has errors somewhere no matter what you do.

Well firstly, and most practically, the more you check and test, the smaller your errors are hopefully getting. You catch any really big glaring errors, you fix lots of things, and you absolutely have a better piece of code. The time you're spending is never wasted, although its important to try and direct it to the places you get the most return. Keep the 80-20 rule in mind. You get 80% of the gains from the first 20% of the effort in many endeavours, but getting the final 20% takes the remaining 80% of the time. This doesn't mean you should fix 20% of the bugs though! It means you finish all the easy 80s before moving on to the more finicky checks.

Secondly, more philosophically, you will always know that you did it. You can point to the time you spent and say "I tried. I did the things I was supposed to. I wont make this mistake again." That is a very nice thing to be able to say when you find you've made a mistake that's going to cost you time and/or effort.

Sh*t Happens

So, if you've published a paper, you've probably published a mistake. Hopefully, this isn't a important msitakes, just some minor grammar, a misprinted formula, or a miswritten number that doesn't affect results. It's pretty important to keep perspective. Everybody makes these mistakes. The important thing is not to make them because you didn't do things how you know you should.

I like to read Retraction Watchon occasion. Much of what they publish is serious misconduct from serial bad actors, but sometimes there's an interesting post about honest mistakes, and interesting commentary on how it's meant to work. For instance, when should a paper be retracted rather than amended? How do you distinguish stock phrases from plagiarism? What's so bad about salami slicing?

I particularly like things like this retraction, where an error was found that needs more work to address, so a correction wasn't suitable; or this one where the error meant the result wasn't interesting. The errors probably happen a lot more often than the retractions or corrections. That's not really in scope of a blog post, but it's good to see the process working well.

$600,000 Mistakes

To summarise briefly, everybody makes mistakes. If you're lucky yours will be minor, caught before they go "into the wild", and make funny anecdotes rather than tragic tales. The worst thing you can do, as a researcher, is try to hide your errors. You should fix them. Whether this can be done in a follow-on paper, or needs a correction or retraction, it should be done!

Errors in code, rather than papers, are a tiny bit different - they're probably more likely to be harmless. But don't be fooled - if the error invokes "undefined behaviour" (anything outside of, or not conforming to, the standard) you will never know whether the result was affected unless you redo everything. It's very tempting to run a few test cases, find they're OK and assume all the others are fine too. The definition of undefined behaviour means you can't really know - its perfectly valid to work 99 times and fail the 100th. If you find any, you have to fix it and redo all results. Hopefully they're unchanged. That's great. You fix the code, you upload the fixed version - admitting the error and the fix, and you add the error to your list of "things to be careful of". You don't need to do anything about your results - they're still solid.

If results are affected, talk to your supervisor, or PI, or trusted colleagues and work out what's an appropriate solution. Make sure you've learned from this and you're not going to make the same mistake again. The former CEO of IBM, Thomas John Watson Snr said it best:

Recently, I was asked if I was going to fire an employee who made a mistake that cost the company $600,000. “No”, I replied. “I just spent $600,000 training him – why would I want somebody to hire his experience?”

Don't be scared of bugs. They happen and they always will. But the more you find, the better you develop your "sixth sense" and can almost smell where they might be, and the better your code and the research you do with it, will become.

November 14, 2018

Now with Less NaN!

This week I'm at Supercomputing 2018 in Dallas, enjoying a wide variety of talks! Just a short post on a couple of the "Aha" moments from this talkon correctness and reproducibility of Floating point code by James Demmel (University of California, Berkeley). I had never noticed the perils of NaN and Inf propagation he mentioned and I think they're really worth knowing. I said a bit about IEEE 754which dictates what NaN and Inf are and how they behave in this post. This entry is mostly about unexpected ways the standard can be broken by accident, mainly due to optimisations that rely on multiplying by zero. Per the standard, anything except NaN, times by zero is zero, but any operation involving a NaNmust give NaN.

Max Value the Obvious Way

Imagine you're asked to write a routine to find the Max or Min value of an array. It's obvious. Start by assuming it's the first value, and then compare every other value to see if it's higher/lower, until you reach the end. Max/Min MUST look at every value in the array to get their answer (I am sure there's a mathematical word for that, but I don't know it), so this is the most efficient option there is in terms of comparisons. However, it has a major flaw once we account for IEEE behaviour: NaN compares as false to everything, even itself.

So, imagine running this algorithm on the arrays [NaN, 2, 1] and [2, NaN, 1]. We get NaN and 2 respectively, as we've made the first element "sticky"! Step-by-step:

Max of [NaN, 2, 1] :
Take first element : NaN
Compare to second element: 2 > NaN -> False : Max = NaN
Compare to third element: 1 > NaN -> False : Max = NaN
Max of [2, NaN, 1] :
Take first element: 2
Compare to second element: NaN > 2 -> False: Max = 2
Compare to third element: 1 > 2 -> False: Max = 2

This was an example given in the talk, and it's obvious when you look at it, but easy to not think of! There's two serious problems here: firstly the loss of the NaN value in the second case, which violates the standard and secondly the change of answer on array permutation, which is just generally bad for something like Max which should be invariant.

3x3 Determinant

Now imagine taking the determinant of a 3x3 Matrix. Generally, this is easy. You take each element of the first row, and multiply by the determinant of the 2x2 sub-matrix omitting the column at hand, with appropriate signs. Suppose a first row element is zero: obviously that term contributes 0 regardless of the 2x2 submatrix! So you can speed things up by testing for zero and omitting any such terms, right? Not once NaN gets involved. Now consider this matrix:

1 	0 	0
NaN	1	1
0	0	1		

We should get: 1 * 1 + 0 * NaN + 0 * NaN = NaN. But with our clever optimisation, we get 1 * 1 + 0 + 0 = 1. This is a pretty big problem! We're hiding away that NaN, by accident!

There's another problem too! Even if we checked for NaN in our array, what if we have an Inf instead? Inf * 0 should give us NaN. Effectively, Inf is "a number, but bigger than we can represent". Zero is "anything smaller than the smallest number we can represent". The product is then kind-of "any number, we don't know which". It might be zero, it might be Inf, it might be anything. So, the standard dictates it be NaN.

So, if we use the optimisation, we're not just hiding a NaN, we're stopping one ever being created! This is potentially an much more serious problem, because we might be happy using the Compiler flag to give us errors on NaN, but not willing to make under or overflows fatal. Then we'd completely miss this problem!

Optimisations in BLAS

The talk (at least it's first half) focused on the Linear Algebra library BLAS, which is heavily optimised, but as it turns out does have several bugs like this. For instance, a slightly complicated multiply op, alpha * A * B, might check for alpha = 0 in order to save the work of the matrix multiply when the end result should be zero anyway. Unless, of course, either A or B contains any NaN's when we must get back NaN.

But it's all gone wrong anyway...

One comment on the talk, which is worth a mention was, to paraphrase "If you're getting NaN's things have already gone wrong, so why should the Linear Algebra library bother to check? You should be checking." This is an interesting point and in some circumstances holds true. I tend to think of NaN or Inf as bugs and would prefer to write code so they can't arise, but I am not working with long linear algebra workflows where you want to run a series of operations in something like BLAS and don't examine the intermediate steps.

Reproducible BLAS

The problem above was largely the topic of the first half of the talk, which focused on how BLAS could manage the problem via errors/exceptions, and being careful with optimisations. For instance, use the optimisations to do the calculations, but also return an error code detailling any issues with the arguments.

Another large part of this talk was about reproducibility, in particular in parallel codes. Here, depending on the processor decomposition, you can get subtly different results for a sum over numbers of varying size. This is a really tricky problem, which is being addressed by things like ReproBLAS.


Dealing with Floating point numbers is hard. It gets harder once you demand reproducibility at a bitwise level. Both the Maxval issue above, and a simple sum with values of varying size may not give the same answers when terms/entries are reordered. In particular, parallel codes always effectively reorder terms whenever you chance the decomposition (e.g. run on more or less cores).

Hopefully, you never need to really worry about NaN (or the less tricky, but still tricky Inf), but do be careful! The problems I've discussed are mainly due to treating zero incorrectly per the IEEE standard, the problems with NaN or Inf following from this. Be careful with zero! It might not be as small as you think ;)

November 2018

Mo Tu We Th Fr Sa Su
Oct |  Today  | Dec
         1 2 3 4
5 6 7 8 9 10 11
12 13 14 15 16 17 18
19 20 21 22 23 24 25
26 27 28 29 30      

Search this blog



Blog archive

Not signed in
Sign in

Powered by BlogBuilder