All 8 entries tagged Bugs

July 24, 2024

Testing the test suite

Testing is good! Or at least it can be. Tests increase our confidence that code is correct. However, writing bad tests is incredibly easy, and assessing the quality of a test suite is a perfect example of the "crowbar inside the box" problem: if we can be sure we wrote the test correctly, why can't we make the same assertion about the code itself?

Easy Cases

There are times when the tests truly are "easier", "simpler" or "more nailed down" and in these cases, we might feel safe.

Sometimes the test is much simpler than the code, and sometimes the test really is externally prescribed. In these cases, writing tests is pretty safe - we're unlikely to make the same errors in both test and code.

Sometimes too, we can form extra, trivial-to-write, tests by not getting hung up on true "unit" testing and working with inverse pairs instead. For example we can verify our file writing and reading are the reverse of each other, and give us the correct object back. We can verify that our "plus" and "minus" are the reverse of each other, and that A + B - B == A. Making the test extremely simple means it has very little chance to be wrong, but it also limits how thoroughly we are testing things.
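For instance, a minimal sketch of a write/read round-trip test (Python purely for illustration; save_to and load_from are hypothetical stand-ins for your own I/O routines):

import json

def save_to(obj, path):
    with open(path, 'w') as f:
        json.dump(obj, f)

def load_from(path):
    with open(path) as f:
        return json.load(f)

def test_round_trip(tmp_path):  # pytest-style test; tmp_path is a built-in fixture
    original = {"name": "example", "values": [1, 2, 3]}
    save_to(original, tmp_path / "data.json")
    assert load_from(tmp_path / "data.json") == original

The test is almost too simple to get wrong, which is exactly the point - but it only checks that the pair agree, not that either half is individually correct.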

Code Coverage

You may have heard of coverage testing - adding confidence in your test suite by verifying that it actually "exercises" every line of your code. At least, that's what you hope. However, in general all coverage does is say that you ran a line - it can't say whether you verified the outcome.
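A sketch of the blind spot (Python, with a deliberately buggy made-up function):

def absolute(x):
    if x < 0:
        return x  # BUG: should be -x
    return x

def test_absolute():
    absolute(-5)             # executes the buggy line, so coverage reads 100%
    assert absolute(3) == 3  # but only the easy path is ever verified

Every line is "covered", yet the bug survives, because the result of the negative branch is never asserted.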

But the Tests Passed!

To come to the important point: the tests are testing the code, and the code is testing the tests. If both agree, great. What if they don't? Are you really sure that the error is in the code, not the test? And on the flip side, if the tests pass, does that _really_ make you confident in the code?

Mutant Code

Enter "mutation testing" - a way to test our *test suite* using our *code*.

Mutation testing belongs near the end of development, once code is written, tests pass, coverage is good (all for a given function or set of functions, not necessarily the whole code), and you are wondering if there may still be bugs, or you are trying to get something ready for "prime time" and want to be confident it will work when used in ways you haven't been using it.

It's based on a very simple premise - if the code is broken, the test ought to fail. Sounds obvious, but have a good hard look at some tests you've written or worked with. Are you sure that will happen? Are there cases or paths that don't get checked? Are you catching all the edge cases and transition values in your test?

Once again, if just looking over the test code let you guarantee those things, then you could do the same to your code, and not need the tests at all.

Instead, let's break the code and see what happens. Swap operators, add off-by-one errors, remove checks on bad input! Force the tests to prove that they can fail! Then we know that passing means something.

Practical Mutation Testing

Obviously we want something systematic to do this, and that is actually pretty hard - it remains a topic of active Computer Science research despite having been around for decades. But as always, "the perfect is the enemy of the good", or indeed of the "good enough". The more holes we can close, the fewer will be left, so even an imperfect tool is worth a try.

The idea is as follows:

  • generate systematic "mutations" of the source code
  • check that these compile or run
  • try to systematically eliminate "equivalents" - code that, although different, has the same effect
  • run the test suite against every "mutant"
  • if the tests fail, the mutant has been "killed"
  • if the tests pass, there is a hole - an error the tests are missing
  • try to fix the holes in the test suite

NOTE: mutation testing can turn up a few things that do indicate changes to the code are in order, but in general you should primarily be altering the test suite. If a new test you expected to pass actually fails, you need to alter the code. If the mutants reveal sections of "dead" code, you might want to alter the code. Try to keep these two processes separate.

NOTE 2: BE VERY VERY wary about altering your code just to allow it to be tested.

Tools

Recommending a tool is tricky. GitHub right now has ~300 repos tagged mutation-testing, covering a bunch of languages, and several repos just listing extant and abandoned tools. The UniversalMutator (https://github.com/agroce/universalmutator) deserves a mention as a language-agnostic and pretty usable option.

If you're not sure if this is worth doing, try flipping a bunch of operators in your code, adding some range errors, deleting a statement etc and running your tests. See if you have gaps to fill.
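As a hand-rolled sketch of what that looks like (Python, with a made-up function):

def can_vote(age):
    return age >= 18

def can_vote_mutant(age):  # hand-made mutant: >= flipped to >
    return age > 18

# A weak test suite: both versions pass, so the mutant "survives"
assert can_vote(30) and can_vote_mutant(30)
assert not can_vote(5) and not can_vote_mutant(5)

# The missing boundary test kills the mutant
assert can_vote(18)           # passes on the real code
# assert can_vote_mutant(18)  # would fail - mutant killed, hole closed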

Brief Aside:

Note for the interested: while that all might sound pretty convincing, there's a lot we are skipping over. For instance, we have to assume that big errors tend to arise from multiple small errors, and thus that squashing the small bugs will squash the big ones too. This is known as the "coupling effect" (see e.g. https://dl.acm.org/doi/10.1145/75309.75324, which also discusses the wonderfully descriptive "Competent Programmer Hypothesis" - the idea that we normally make only small mistakes).

Conclusions

Sometimes it is tempting to talk about testing as infallible. This has come up a lot recently with some high-profile failures of large products, where many people despair that "they obviously didn't test it". That's just not true - we only know they didn't test for the specific failure.

Actually implementing "quality testing" is about a lot more than just adding a few test cases, and for important software, library software etc., it is good to be aware of just how much more is needed to be truly confident in code quality.

If you take one point away from this, let it be this one: A test that cannot fail is no better than no test at all.


March 20, 2024

It's sort of pass by reference, ish

When teaching Fortran, it's often tempting to describe it as a "pass by reference" language. This is almost true. Unfortunately, as the phrase has it, "almost only counts in horseshoes and hand grenades", and "almost true" in programming terms isn't usually good enough. All too often, the sort of complicated technical detail that should only matter to an expert trips up plenty of beginners who, like the Inexperienced Swordsman, "don't do the thing they ought to do".

So what's the truth? The truth is that most of the time things work just like pass-by-reference because they are expected to look just as if they were. But actually, they are allowed to do a "copy-in copy-back" process, as long as the final result has the expected changes. Copies can be expensive, so compilers don't tend to actually do this without a good reason.

The place this usually comes to our notice is when we use slices of arrays. For example, we can take every second element of an array in Fortran very easily, or take half of a 2-dimensional array (in the row direction). Both of these are valid "array slices", but both consist of data that is not contiguous in memory - the items do not follow one after the other.

But in Fortran, we can pass these to some function that expects just "an array". Now imagine what would have to happen to do a true "pass by reference" here. The compiler would have to pass the array, and then enough information to "slice out" only the values we wanted. This would be pretty weird, and if we passed that value on to another function also expecting an array, it could easily get out of control. So instead, the compiler will tend to generate the code to copy the values in the slice into a temporary array, use them there, and copy them back when we are done. To us, everything will work seamlessly.

That is, as long as we do what we're supposed to, everything works. But if we start breaking rules, we can get some very odd behaviour from this. Compilers are like puppies - they trust us to keep our promises. In fact, they rely on this so much that they simply assume that we will! When we don't, funny things happen.

The following code is modified from this old post https://groups.google.com/g/comp.lang.fortran/c/z11RW0ezojE?pli=1 to use modern Fortran.

MODULE mymod

CONTAINS

SUBROUTINE set_to_two(B, C)
  REAL, DIMENSION(10) :: B
  REAL, DIMENSION(5) :: C
  INTEGER :: i
  DO i = 1, 10
    B(i) = 2.0
  ENDDO
END

END MODULE

PROGRAM main
USE mymod
REAL, DIMENSION(10) :: A

A = 1.0
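! Note: A appears twice here - once whole, and once as a non-contiguous slice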
CALL set_to_two(A, A(1:10:2))
PRINT *, A

END

Pretty simple code - it takes an array and sets every value to 2.0. For some reason it also takes a second, unused, irrelevant parameter, but we can ignore that one, surely?

Run that code, as written with current gfortran, and this is the result:

1.00000000  2.00000000  1.00000000  2.00000000  1.00000000  2.00000000  1.00000000  2.00000000  1.00000000  2.00000000  

The code is bad - passing A as two different parameters to a function violates a promise we make as programmers not to alias variables (call them by two different names). But even knowing this we're pretty surprised to get that result! In particular, on my machine, if I call the function with CALL set_to_two(A, A(2:6)) which violates the aliasing rules just as badly, nothing odd happens at all, and all I get is A = 2.0. In this case, the compiler is able to avoid a potentially costly copy as the data is contiguous even though it's a slice.

It's pretty obvious what is actually happening once we know about the copy-back idea. Because the compiler trusted us not to have two names for the same piece of data (the names here being B and C, and the data being A) it happily copies data for the C argument, copying it back at the end. This copy is never affected by the update to B so its content remains 1.0. That gets copied back into A, overwriting the changes we'd made.

This can happen even though we never use C in the function, so nothing actually changes - that second irrelevant argument is not so irrelevant after all.

Take Home Point

The real take-home point here is not to upset your compiler - don't do things wrong, and these sorts of details won't matter. But when things do go wrong, it can be pretty helpful to understand what is actually going on.

Honestly, a lot of the bugs we write as programmers are a case of miscommunication - what we thought we were writing is not what we actually said. This is especially true with modern optimising compilers, which very liberally agree to produce a program that works "as if" it were the one we wrote, and will happily omit things that do not affect the results or, as in this case, will assume that we are writing correct code and act accordingly.

Fortran is usually a lot nicer to us than C/C++ in terms of undefined behaviour, but as this example shows, it will still do strange things if we break the rules.


March 07, 2022

What's the time right now?

This is not one of my mistakes, this is one I read about second hand (you'll see the pun in a moment), but it is a great illustration of the importance of writing the code you actually mean to, thinking properly about what it is that you mean, and watching out for weasel words that might confuse you.

Consider these two lines of code and decide for yourself whether they have the same effect:

if (DateTime.Now.Second == 0 && DateTime.Now.Minute == 0)

if (DateTime.Now.Minute == 0 && DateTime.Now.Second == 0)

Decided yet? If you think they are the same, take a second now (that's a hint!) to think about the meaning of each part, and what its value will be. Suppose the original author intended to run some process once per hour, so put one of these into a loop. Can you see why one of these might succeed, and the other fail occasionally (assume the rest of the code takes some tiny fraction of a second)?

By the way, if you're worried about not knowing what language this is in, don't be. This is a perfect example of good "self documenting code", which we've written a bit about before. Based on the names, we can infer (correctly) that DateTime is some sort of thing which can supply the date-and-time Now through the "DateTime.Now" construct, and that we can then look at the Minute and Second values of the time Now. That's all we need to know.

So, back to the question. Are the lines the same? Well, the order is different. There's a double-& AND operation. We don't strictly know whether the left or right condition will be checked first (it might depend on language), and we don't actually know if both will be (if you're not sure why that is, we've talked about short circuiting in logical operations here). But that shouldn't matter unless the DateTime object is having some weird side-effects and I promise it isn't. The problem is much more fundamental than that.

If you haven't spotted it, it's time to write down very clearly what we wanted to check. We want to know if, right now, the value of both the minutes and the seconds entries are 0. It looks like that's what either line does, but there's a nasty weasly word sneaking in here - the word "now". What can "now" actually mean in programming terms, where we know operations ultimately are carried out one by one in sequence? For instance, what would you expect the following code could print?

if (DateTime.Now.Second == 0) print(DateTime.Now.Second)

If you said 0 only, hold on a second (that's a hint, too) and look again, and think about what this code will actually do in terms of basic operations. It will get a value for DateTime.Now.Second. It will compare this to 0. If the comparison is true it will get a value for DateTime.Now.Second, and it will print this value. And there is the problem! The value we check and the value we print are not always the same.

If it still isn't clear to you why the original two lines are not equivalent, reread that last paragraph a few times and apply the same idea to the original question. If it's been obvious to you all along, great. You're unlikely to make the mistake this original programmer did. But why did they make this mistake? There could be several reasons:

  • They might simply not have spotted there could be a problem - this is pretty likely, but not very interesting
  • They might have thought somehow the compiler/runtime would recognise the values as "the same" - this is a pretty fundamental mistake, perhaps due to thinking of "Now" as a property rather than an operation. In some very imprecise sense, we should think of "Now" as having a side-effect on itself because it takes time, so multiple calls will not have the same result as a single one
  • They might have blindly removed a temporary variable containing DateTime.Now, without thinking deeply about the consequences. This is quite likely because it looks so simple, and is why refactoring like this needs, if anything, more attention than writing the code in the first place
  • Lastly, they might have thought carefully and precisely, but ultimately wrongly about it. If they confused the idea of seconds as a time point with seconds as a time interval, they may have thought that both the Seconds and the Minutes property would surely be evaluated within the same second, and thus work as expected, where actually even a microsecond between them could be the difference between 1:59.999 and 2:00.000

So, that explains why the lines don't work as intended, but there's one weird thing left - how come one of them (the first one, if evaluation occurs left to right) works, but the other doesn't? Effectively, the minutes value depends on the seconds value - the minute only increments when the seconds wrap from 59 back to 0. We know the whole operation takes far, far less than 1 second (ignoring interrupts or anything else suspending operation), so once we have confirmed the seconds value as 0, we have the best part of a minute during which the minutes value cannot change. The check will certainly occur within that interval, and the code will work perfectly (again, ignoring external interrupts, which our code can never account for anyway).

So, does that mean the first option is actually safe, or good code? Well, not without a comment it certainly isn't! If we're relying on this sort of subtlety, somebody is going to be tripped up by it, and it might even be us. It's also a bit delicate - suppose we were to need to use DateTime.Now in another part of the condition, or in the body? We'd have to think very hard about whether we're still "safe" and we simply don't need to. A temporary variable is much clearer, and the possible optimisation here is a) tiny and b) best left to our compiler anyway. Oh and c) possibly a de-optimisation anyway, depending on the cost of the DateTime.Now lookup. Also, it'd be easy to forget that the left-to-right ordering is important, and we might try this in a situation or language where that is not guaranteed.
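For illustration, the temporary-variable version might look like this (written in Python rather than the original's language; do_hourly_thing is a made-up stand-in for the real work):

from datetime import datetime

def do_hourly_thing():  # hypothetical payload
    print("tick")

now = datetime.now()  # read the clock exactly once
if now.minute == 0 and now.second == 0:
    do_hourly_thing()

Both fields come from a single snapshot, so the order of the checks can never matter again.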

Ultimately, I simply don't like relying on such a subtlety. It's too clever and it doesn't need to be. It's not that I don't like clever code, because I do. I like clever code that makes one exclaim "oh yes, that's so simple, it's brilliant", and this ain't that. This is more "oh blimey, is that how it works". Keep it simple, and if you can't keep it simple, at least make it satisfying.

Postscript

Before we go, I hope something has been nagging at you throughout this post. I started off by saying we should think hard about what we mean and what we want to actually do. I introduced this idea of minute-0 and second-0 and then proposed it as a solution to doing something once per hour. Did it strike you as a pretty terrible solution to that problem? Because it is. Suppose we start running our code at 9:00:30 AM. Would we expect whatever this block controls to wait an entire hour to fire for the first time? What if the code is never running at the change of the hour? This block would never run. Suppose we start 1000 copies of this code. Do we really want all 1000 to execute this block at the same time? Suppose some other code took longer than expected and this block isn't reached once per second - some hours might be simply skipped. In nearly all cases we'd be better checking whether at least an hour has elapsed since last we did the thing, and we'd avoid all of these problems. We meant once per hour, not on the first second of every hour, and we should have coded that, not this.
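A sketch of that elapsed-time approach (Python again; the task argument stands in for whatever the block controls):

import time

HOUR = 3600.0

def run_hourly(task):
    """Run task roughly once per hour, starting immediately."""
    last_run = time.monotonic() - HOUR  # guarantee the first pass fires
    while True:
        if time.monotonic() - last_run >= HOUR:
            task()
            last_run = time.monotonic()
        time.sleep(1)  # stand-in for the rest of the program's work

A monotonic clock also sidesteps wall-clock adjustments, and each running copy keeps its own schedule rather than all firing on the stroke of the hour.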


This post is inspired by an article I read a while ago, namely https://thedailywtf.com/articles/an-hourly-rate - in particular, a comment there mentions the different behaviour of the two possible lines, which this post focuses on.


May 30, 2019

Black and white

Quick 'philosophy of programming' entry this time. General solutions to common programming problems are often called 'design patterns' (e.g. the wiki article or the book which started the name). The idea of these is to have language-independent (as far as possible) 'patterns', like clothing patterns, which can be tweaked to fit a specific situation. A lot of these patterns seem obvious, which is good, and since they're developed and tested by many people they can be very valuable in showing you questions you hadn't even thought of.

This week's topic is perhaps too simple to really call a pattern, but it is a very useful thing to keep in mind when restricting functionality which exists but should not always be allowed. For example, forms which take user information often disallow anything except numbers in a 'telephone' field. A code I work on has a lot of user-specifiable options, but as the programmer I know that some are incompatible in ways that might not be obvious to the user - and I want to either warn or abort if these are used together.

There are two general approaches to things like this, and which you choose depends on many things. You have to maintain some kind of list to check against, but you can choose to use either the "blacklist" or the "whitelist". The former, the "blacklist" is a list of the things which aren't allowed, and anything not in the list is OK. The "whitelist" approach means keeping a list of the things which are allowed, and anything not in the list is excluded.

Sometimes the choice is fairly easy, because one method gives a much, much shorter list. For instance, in the phone number example, it is far easier to use the whitelist, allowing only '1234567890' and nothing else. If you try the other way, you might think to exclude letters, but what about Greek or Cyrillic characters? On the other hand, the whitelist is a source of deep annoyance if you forget any needed character - in the example I just gave, one could not put in any spaces, nor brackets or the '+' symbol.
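As a sketch of the two lists for the phone field (Python; the exact allowed set is illustrative, not a recommendation):

import re

def phone_ok_whitelist(text):
    # allow only the characters we have explicitly thought about
    return re.fullmatch(r"[0-9 ()+-]+", text) is not None

def phone_ok_blacklist(text):
    # forbid ASCII letters - but Greek, Cyrillic, emoji etc. all slip through
    return re.search(r"[A-Za-z]", text) is None

Note the whitelist version already allows the spaces, brackets and '+' complained about above - and is doubtless still missing something.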

A classic example of the poorly thought-out whitelist is in name fields which exclude characters like the apostrophe, annoying the Scots and the Dutch for eternity. And what about accented letters, or the German eszett? With a whitelist, you need to be sure you've caught everything, or people will be, rightly, upset. For a user-name on a website, or for a password, it is probably fine to allow any ASCII or Unicode character and set up your systems to handle them, leaving far less upset without any real cost to you.

On the other hand, with a blacklist, anything not forbidden is permitted. These are generally used in cases where certain characters have a function and so must be excluded, even if this annoys. So, for instance, in most programming languages variable names may not start with a number, nor contain a comment character.

Apart from the length of the lists, the two methods trade off this annoyance to your user, who must wait until you fix the omission (with a whitelist) against potential unknown failures and security risk (with a blacklist). Imagine the 'incompatible features' problem with both methods. If I use a whitelist, and forget to allow some pairing, my worst case is that I will likely be asked (somewhat irately) why X and Y can't be used together. I realise they can be, I update the code and I make a new release version to fix the omission and everybody is happy. If I use a blacklist and forget that some X and Y don't work together, my worst case is that one day I have to tell somebody that their last n years of research is all invalid, because the simulation they ran didn't work as expected, and since nothing actually went wrong they didn't know. Worse still, would be having to tell them that their fascinating effect is just a code error, and it's my fault.

In some cases, only one or other list type is really viable. For instance, virus scanners keep a list of 'tells' for malicious code, because even though they let things slip through until their lists update, they could never describe all of the 'allowed' code. App permissions (on better, more granular systems) are a whitelist - you give an app the permissions you choose, and only those.

So, as general rules of thumb:

  • If only one method is viable, obviously use that
  • If one or other list is going to be much much shorter, you have better chances of getting it right, so use that method
  • If it is really important not to let things slip through, use a carefully managed, kept up to date, whitelist. If possible, put it into a file or something, so that updates just require sending out new definitions, not modifying the entire code
  • If it's really important not to get accidental exclusions (false positives) use a, similarly carefully managed, kept up to date etc, blacklist
  • In some cases, combine the two. Programming languages generally have a set of allowed characters (a whitelist) and small blacklists for specific contexts such as the first character of a name.

As well as the literal 'blacklist' and 'whitelist' there is a more general principle here - do I selectively forbid, or selectively allow? Do I stop somebody doing this thing here, here and perhaps here, or do I only permit them to do it there and there. If you find the 'here's' or 'there's' proliferating, re-examine whether you're doing it the right way around. In safety or security critical situations, you almost always must allow only what is permitted. If you find yourself trying to plug up security holes with ever growing blacklists, you should probably change tack and think about what should be allowed instead.


March 14, 2019

Multiprocessor Profiling

Super quick this time - a bit of background and then something I didn't know worked as well as it does - namely gprof on multi-core programs.

Profiling

Once code works (and NEVER before) you can start worrying about performance. Often, you can tell using a bit of intuition and a few test runs what takes the most time, especially in simple cases. But this is always a bit risky, because you can easily miss the real performance-limiting steps. Sod's law (if it can go wrong, it will) usually means that if something is unexpectedly slow, it's either trivial to optimise, because it really oughtn't to be slow, or it's almost impossible.

The answer is not to guess but to use one of the many available profiling tools, which tell you, function by function and line by line, where your code spends its time. You then know where to focus your efforts.

Our upcoming training course (Accelerating Python, 27/3/19, see our calendar) will discuss profiling in detail for Python programs, and there'll be a more specific post after that, so watch this space!

Profiling Compiled Code

For compiled code, profiling is usually pretty simple, but does require you to compile the code specially. For C/Fortran etc. there is a very good profiler as part of the gcc toolchain, called gprof. Using gcc, you simply compile your code with `-pg -g` (the second flag, -g, adds debugging symbols and is only required on older gcc) and run it. This produces a machine-readable file of all the timings for that specific run, usually gmon.out. You can then turn this data into nicer formats using the gprof tool. Something like

gprof <path to program executable> <path to profile file>

You have to provide the program file too so that gprof can report to you in terms of program source lines. Note that strange things happen if the program you give here isn't what you ran (wrong line numbers etc). This is terribly infuriating, so don't do it.

Old Dog New Trick

I have used gprof plenty in the past, but I recently found a really useful way to apply it to multi-core codes (using MPI). Ordinarily this doesn't work very well, because the output from different processes gets all mixed up and the timings end up wonky. It turns out that there's a really, really simple fix, described e.g. here and reproduced below.

Firstly, you need to get the profiling data on a per-process basis. As the link above says, this isn't very well documented, but is easy to do. Before running, set the environment variable GMON_OUT_PREFIX. This has to be done where the code is running, so if on a cluster etc, you'll probably need to add an export or setenv to your submission script, and/or use the '-x' argument to mpiexec (see below). So we do something like

export GMON_OUT_PREFIX=gmon.out-

When the code runs, it will produce one file per process (so be very careful on a cluster not to overwhelm your home space) named 'gmon.out-<PID>' where PID is the process ID. In theory it is possible to have the files named by MPI rank, but apparently this doesn't work well.

Now you run the code as usual, although on some systems you may need to make sure the MPI runtime is fed the above environment variable. You can do this using the '-x <env var>' flag to mpiexec. On a local machine, as long as you exported the variable, this shouldn't be needed.
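With Open MPI's mpiexec, that looks something like the following (the program name and core count are placeholders):

mpiexec -x GMON_OUT_PREFIX -n 4 ./myprog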

Now you want to produce the function or line level profiles using the sum of all of them. You can supply all of the files to gprof, or you can have gprof sum them into a single file, by default gmon.sum, using

gprof myprog -s gmon.out-*

Make sure the wildcard matches the right set of files - when I first tried this I forgot to clean out the files from previous runs and ended up summing across runs, with very odd results. Of course, you can use this to sum across runs to very useful effect if you want to see average profiles across multiple sets of parameters etc.

Now you can run gprof as usual, using either

gprof <path to executable> gmon.out-*

or

gprof <path to executable> gmon.sum

and voila, properly added up profiling information for multiple cores. Super.


November 28, 2018

Experience Errors to Excel

No, not the Microsoft spreadsheet program. Warwick RSE don't think Excel is evil, or that it makes errors more likely, but most of what we work on or talk about is beyond the point where Excel, Numbers or any other spreadsheet software is really efficient. Anyway, this post is going to be a little bit of programming and research philosophy, as well as some practical bits about how to make errors in your code and research less likely.

Tools of the Trade

First, a few easy rules to make serious errors a lot less likely.

Have some Standards

Most of the "Old Guard" programming languages, the ones you find in banking software or aeroplanes, have very strict standards dictating what is valid in the language. They are usually revised every 3-5 years, and controlled by some professional body, often ISO. A working compiler or interpreter must obey these standards or it is wrong. In practice, support for new versions of standards takes some time to develop, but this is usually well documented. Some major packages, such as MPI, also have standards. If your language (or library) has standards, follow them! If it doesn't have formal standards (e.g. Python), be as conservative as you can in what you assume. For instance, try the calculation 2/3 in Python 2 versus Python 3.

7 Ps

Proper Planning and Preparation Prevents Piss Poor Performance. Supposedly a British Army adage, and one of my Mum's favorites. Plan before starting! Work out what you're going to do. Prepare - read up on any new techniques, do some "Hello World" examples with any new packages etc. Check all the bits of your plan can work in practice - can they be fast enough? Is your chosen language/package a good choice? Can you handle the amounts of data/computation/other needed?

Borrow your Wheels

There is a crucial balance between reinventing the wheel every time you write some code on the one hand, and importing a package for every non-trivial task on the other. Neither of these is a good idea, and both tend to lead to more and worse errors. Borrow widely (with proper attribution), but don't code yourself into a morass of dependencies that you can never hope to test. Especially don't borrow from things that are no better tested or verified than your own code - this is a recipe for disaster.

Document Everything

It's very easy to assume some feature or limitation of a code snippet is "obvious" when you're writing it, only to completely forget about it when you come back. The only way to avoid this is to document. Do this in a few ways.

Restrictions on a code snippet that don't impact anything else (so don't fit either of the next two paragraphs) should be commented as near the line as possible. Try to make it so that the comment still makes sense even if the code changes a bit - e.g. don't use line numbers, don't say "the following line" etc.

Restrictions on a function (parameter x must be +ve etc) are best done using one of the documentation libraries, so the restrictions are in comments near the function code.

Restrictions that run beyond a single function (don't use this code for more than 1 million elements at a time; NaNs must be removed from data before input) should be mentioned both on the functions where they arise and in your examples and any user docs. User docs are vital if you're sharing your code with other people.

Check yourself before you Wreck yourself

Sometimes, you have known preconditions (things which must be true when a function is called), invariants (things which should stay the same throughout), and postconditions (things which are true after a function ends). You should consider adding code to check (some of) these things. Sometimes a compiler can do some of this for you, such as Fortran's array bounds checking, or C++'s ".at()" for getting a vector element rather than using "[]". Since these checks can be costly at run time, they may be better as a debugging option rather than a normal part of the code.
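A minimal sketch of such checks (Python, with a made-up function; Python's asserts are stripped when run with the -O flag, which gives exactly that 'debugging option' behaviour):

import math

def normalise(values):
    """Scale values so they sum to 1."""
    assert len(values) > 0, "precondition: need at least one value"
    assert all(v >= 0 for v in values), "precondition: values must be non-negative"
    total = sum(values)
    assert total > 0, "precondition: at least one value must be positive"
    result = [v / total for v in values]
    assert math.isclose(sum(result), 1.0), "postcondition: result must sum to 1"
    return result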

Test your Limits

The last essential element is testing. This is a very broad topic, and really isn't easy. At its core though, it means working out what parts of the code should do, and checking that they do it. Smaller parts are easier to handle, but (remember your time is precious) this can mean writing lots of trivial tests where you could easily absorb them all into one. If it is at all possible, have at least one test case of the entire code - something you can run and know the answer to, that isn't too trivial. We talk a little about testing in our workshops, and there are millions of books, blog posts, papers etc. on how to do it. Just make sure you do something. The rest of this post should convince you why.

The Ideal World

In an ideal world, any code you write has a well thought out, complete specification. Its purpose is fixed and perfectly known; there's a wealth of known results to compare to; the computations are all nice and self-contained; nobody else is really working on it, so there's no rush; and yet you can still do something novel and interesting with it. In this case, avoiding errors is as easy as it ever gets. You split the code into self-contained modules, each doing one thing and doing it well. You write careful error checking into every function, making sure they are never called with invalid parameters. You test each function of every module carefully in isolation and as you put the modules together, you test at various stages. Finally, you run some problems with known results and verify that you get the correct answer. As you write, you document all the functions and the assumptions they make. Finally, you write some user documentation describing what the code is and does, what its limitations are.

That's all great in theory, but for it to be enough it really is essential that everything of importance be in the specs. I am not sure that has ever happened, anywhere, ever. It gets close in things like aerospace and medical engineering, but even there mistakes are made.

Code or Computation

The second best situation, from a coding point of view, is the one where the Code is trivial, but the Computations hard. For instance, a simple program to count which words occur near some target word is fairly easy to write, although far from trivial (capitalisation, punctuation, hyphenation etc.), but text concordance analysis is widely used and able to find new, novel results if correctly analysed. Or imagine coding up a very sophisticated algorithm, with minimal support code. From a code perspective this is also quite easy - you just make sure the code contains the right equations (and hope that you wrote them down correctly).

Of course, in these cases you still do all the things from the ideal world - you modularise the code, you test it carefully, you document the assumptions. I'm not going to discuss all the details here: we cover a lot of it in our training, especially here. You do all this, and you'll catch some errors, but it probably won't be all of them.

Sadly though, with many pieces of code you may write, you don't even have this level of "simplicity". The code is hard - it uses difficult techniques, or has many interacting pieces, or there aren't any good test cases you can use. Testing, verification and documentation are as important now, if not more so. But you won't find all the potential errors.

What's the Point?

But at this point, it seems a bit pointless. You make all this effort, spend all this time on testing and verification, and it's still probably got bugs in. Your code still probably has errors somewhere, no matter what you do.

Well firstly, and most practically, the more you check and test, the smaller your errors are hopefully getting. You catch any really big glaring errors, you fix lots of things, and you absolutely have a better piece of code. The time you're spending is never wasted, although it's important to try to direct it to the places you get the most return. Keep the 80-20 rule in mind: you get 80% of the gains from the first 20% of the effort in many endeavours, but getting the final 20% takes the remaining 80% of the time. This doesn't mean you should fix only 20% of the bugs though! It means you finish all the easy 80s before moving on to the more finicky checks.

Secondly, more philosophically, you will always know that you did it. You can point to the time you spent and say "I tried. I did the things I was supposed to. I won't make this mistake again." That is a very nice thing to be able to say when you find you've made a mistake that's going to cost you time and/or effort.

Sh*t Happens

So, if you've published a paper, you've probably published a mistake. Hopefully it isn't an important mistake - just some minor grammar, a misprinted formula, or a miswritten number that doesn't affect results. It's pretty important to keep perspective. Everybody makes these mistakes. The important thing is not to make them because you didn't do things how you know you should.

I like to read Retraction Watch on occasion. Much of what they publish is serious misconduct from serial bad actors, but sometimes there's an interesting post about honest mistakes, and interesting commentary on how it's all meant to work. For instance, when should a paper be retracted rather than amended? How do you distinguish stock phrases from plagiarism? What's so bad about salami slicing?

I particularly like things like this retraction, where an error was found that needs more work to address, so a correction wasn't suitable; or this one where the error meant the result wasn't interesting. The errors probably happen a lot more often than the retractions or corrections. That's not really in scope of a blog post, but it's good to see the process working well.

$600,000 Mistakes

To summarise briefly, everybody makes mistakes. If you're lucky yours will be minor, caught before they go "into the wild", and make funny anecdotes rather than tragic tales. The worst thing you can do, as a researcher, is try to hide your errors. You should fix them. Whether this can be done in a follow-on paper, or needs a correction or retraction, it should be done!

Errors in code, rather than papers, are a tiny bit different - they're probably more likely to be harmless. But don't be fooled - if the error invokes "undefined behaviour" (anything outside of, or not conforming to, the standard) you will never know whether the result was affected unless you redo everything. It's very tempting to run a few test cases, find they're OK and assume all the others are fine too. The definition of undefined behaviour means you can't really know - it's perfectly valid for code to work 99 times and fail the 100th. If you find any, you have to fix it and redo all the results. Hopefully they're unchanged. That's great. You fix the code, you upload the fixed version - admitting the error and the fix - and you add the error to your list of "things to be careful of". You don't need to do anything about your results - they're still solid.

If results are affected, talk to your supervisor, or PI, or trusted colleagues and work out what's an appropriate solution. Make sure you've learned from this and you're not going to make the same mistake again. The former CEO of IBM, Thomas John Watson Snr said it best:

Recently, I was asked if I was going to fire an employee who made a mistake that cost the company $600,000. "No", I replied. "I just spent $600,000 training him - why would I want somebody else to hire his experience?"

Don't be scared of bugs. They happen and they always will. But the more you find, the better you develop your "sixth sense" and can almost smell where they might be, and the better your code, and the research you do with it, will become.


November 14, 2018

Now with Less NaN!

This week I'm at Supercomputing 2018 in Dallas, enjoying a wide variety of talks! Just a short post on a couple of the "Aha" moments from this talk on correctness and reproducibility of floating point code by James Demmel (University of California, Berkeley). I had never noticed the perils of NaN and Inf propagation he mentioned, and I think they're really worth knowing. I said a bit about IEEE 754, which dictates what NaN and Inf are and how they behave, in this post. This entry is mostly about unexpected ways the standard can be broken by accident, mainly due to optimisations that rely on multiplying by zero. Per the standard, any finite number multiplied by zero is zero, but any operation involving a NaN must give NaN.

Max Value the Obvious Way

Imagine you're asked to write a routine to find the Max or Min value of an array. It's obvious. Start by assuming it's the first value, and then compare every other value to see if it's higher/lower, until you reach the end. Max/Min MUST look at every value in the array to get their answer (I am sure there's a mathematical word for that, but I don't know it), so this is the most efficient option there is in terms of comparisons. However, it has a major flaw once we account for IEEE behaviour: NaN compares as false to everything, even itself.

So, imagine running this algorithm on the arrays [NaN, 2, 1] and [2, NaN, 1]. We get NaN and 2 respectively, as we've made the first element "sticky"! Step-by-step:

Max of [NaN, 2, 1] :
Take first element : NaN
Compare to second element: 2 > NaN -> False : Max = NaN
Compare to third element: 1 > NaN -> False : Max = NaN
RESULT: NaN
Max of [2, NaN, 1] :
Take first element: 2
Compare to second element: NaN > 2 -> False: Max = 2
Compare to third element: 1 > 2 -> False: Max = 2
RESULT: 2

This was an example given in the talk, and it's obvious when you look at it, but easy not to think of! There are two serious problems here: firstly the loss of the NaN value in the second case, which violates the standard; and secondly the change of answer on array permutation, which is just generally bad for something like Max that should be invariant.
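For completeness, a sketch of a Max that propagates NaN instead of losing it (Python; the key is testing for NaN explicitly, since comparing against it is always false):

import math

def nan_max(values):
    best = values[0]
    for v in values[1:]:
        if math.isnan(v):
            return math.nan  # propagate, whatever its position
        if v > best:
            best = v
    return best  # a NaN first element is never replaced, so it propagates too

print(nan_max([2.0, math.nan, 1.0]))  # nan, not 2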

3x3 Determinant

Now imagine taking the determinant of a 3x3 Matrix. Generally, this is easy. You take each element of the first row, and multiply by the determinant of the 2x2 sub-matrix omitting the column at hand, with appropriate signs. Suppose a first row element is zero: obviously that term contributes 0 regardless of the 2x2 submatrix! So you can speed things up by testing for zero and omitting any such terms, right? Not once NaN gets involved. Now consider this matrix:

1     0     0
NaN   1     1
0     0     1

We should get: 1 * 1 + 0 * NaN + 0 * NaN = NaN. But with our clever optimisation, we get 1 * 1 + 0 + 0 = 1. This is a pretty big problem! We're hiding away that NaN, by accident!
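The trap is easy to reproduce (a Python sketch of the 3x3 determinant with the tempting zero-skip; my example, not taken from the talk):

import math

def det3_with_shortcut(m):
    total = 0.0
    for j in range(3):
        if m[0][j] == 0.0:
            continue  # the "optimisation": skip zero cofactors - and any NaN they multiply
        sub = [[m[r][c] for c in range(3) if c != j] for r in (1, 2)]
        total += (-1) ** j * m[0][j] * (sub[0][0] * sub[1][1] - sub[0][1] * sub[1][0])
    return total

m = [[1.0, 0.0, 0.0],
     [math.nan, 1.0, 1.0],
     [0.0, 0.0, 1.0]]
print(det3_with_shortcut(m))  # prints 1.0 - the NaN has silently vanished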

There's another problem too! Even if we checked for NaN in our array, what if we have an Inf instead? Inf * 0 should give us NaN. Effectively, Inf is "a number, but bigger than we can represent". Zero is "anything smaller than the smallest number we can represent". The product is then kind-of "any number, we don't know which". It might be zero, it might be Inf, it might be anything. So, the standard dictates it be NaN.
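Two lines at a Python prompt confirm the rule:

>>> 2.0 * 0.0
0.0
>>> float("inf") * 0.0
nan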

So, if we use the optimisation, we're not just hiding a NaN, we're stopping one ever being created! This is potentially a much more serious problem, because we might be happy using the compiler flag to give us errors on NaN, but not willing to make under- or overflows fatal. Then we'd completely miss this problem!

Optimisations in BLAS

The talk (at least its first half) focused on the linear algebra library BLAS, which is heavily optimised, but as it turns out does have several bugs like this. For instance, a slightly complicated multiply op, alpha * A * B, might check for alpha = 0 in order to save the work of the matrix multiply when the end result should be zero anyway - unless, of course, either A or B contains any NaNs, when we must get back NaN.

But it's all gone wrong anyway...

One comment on the talk, which is worth a mention, was, to paraphrase: "If you're getting NaNs, things have already gone wrong, so why should the linear algebra library bother to check? You should be checking." This is an interesting point and in some circumstances holds true. I tend to think of NaN or Inf as bugs and would prefer to write code so they can't arise, but I am not working with long linear algebra workflows where you run a series of operations in something like BLAS and don't examine the intermediate steps.

Reproducible BLAS

The problem above was largely the topic of the first half of the talk, which focused on how BLAS could manage the problem via errors/exceptions, and by being careful with optimisations - for instance, using the optimisations to do the calculations, but also returning an error code detailing any issues with the arguments.

Another large part of this talk was about reproducibility, in particular in parallel codes. Here, depending on the processor decomposition, you can get subtly different results for a sum over numbers of varying size. This is a really tricky problem, which is being addressed by things like ReproBLAS.

Punchline

Dealing with floating point numbers is hard. It gets harder once you demand reproducibility at a bitwise level. Both the Maxval issue above, and a simple sum with values of varying size, may not give the same answers when terms/entries are reordered. In particular, parallel codes always effectively reorder terms whenever you change the decomposition (e.g. run on more or fewer cores).

Hopefully, you never need to really worry about NaN (or the less tricky, but still tricky Inf), but do be careful! The problems I've discussed are mainly due to treating zero incorrectly per the IEEE standard, the problems with NaN or Inf following from this. Be careful with zero! It might not be as small as you think ;)


April 04, 2018

Valgrind and Extended Precision

Valgrind is an invaluable tool for checking for memory errors in compiled code. Briefly, it runs your code inside its own runtime, which includes checked versions of system functions like malloc, so it can detect undefined variables, memory leaks and much more. Using this on one of my C++ codes, I found something rather odd. When I ran normally, I had no problems, but inside valgrind I got unexpected NaN results. The actual code, in Minimum Working Example form, was

#include <boost/math/special_functions.hpp>
#include <iostream>
int main(){

  double arg = -5.58377179584844451;
  std::cout<<"Bessel call "<<boost::math::cyl_bessel_j(5+1, (float) arg)<<'\n';
  std::cout<<"Bessel call with double "<<boost::math::cyl_bessel_j(5+1, arg)<<'\n';

}

Running normally, I got the expected answer of 0.196642 for both calls. Run inside valgrind, the float version was unchanged, but the double version became NaN! This was very strange. Until I constructed the example above, I assumed the bug was elsewhere within my code, and not with the Bessel functions themselves.

After a lot of head-scratching and searching, I found a link to the valgrind manual and the answer:

Precision: There is no support for 80 bit arithmetic. Internally, Valgrind represents all such "long double" numbers in 64 bits, and so there may be some differences in results. Whether or not this is critical remains to be seen.

After a bit more searching, and then checking of DBL_MIN and LDBL_MIN from limits.h, I had an answer:

#include <limits>
#include <iostream>
int main(){
  std::cout<<__LDBL_MIN__<<'\n';
  std::cout<<__DBL_MIN__<<'\n';
  std::cout<<"Minimum value is "<<std::numeric_limits<long double>::min()<<'\n';
}

Normally one gets a result like:

3.3621e-4932
2.22507e-308
Minimum value is 3.3621e-4932

In valgrind once again:

0
2.22507e-308
Minimum value is 0

So the limited 80-bit support referred to was, on my platform at least, breaking __LDBL_MIN__, which should be the smallest representable non-zero value. Since the real problem is happening off inside boost::bessel somewhere, there's no simple fix. Most likely, the problem is that boost uses __LDBL_MIN__ to soften a division, preventing a divide-by-zero which would later result in a NaN. The easier workaround requires that you detect running inside valgrind and adjust the calculation. This is easy using the supplied macro:

#include "valgrind.h"
#ifdef RUNNING_ON_VALGRIND
  if(RUNNING_ON_VALGRIND){
    //Alternative path here
  }
#endif

Note that valgrind.h is specially licensed, differently to the rest of Valgrind, so that including it in your code does not require you to open-source your code. Inside the special path, I used a float version of the Bessel call, since inside valgrind I am interested in general output, but don't need the full precision of a double.

Another alternative overrides __LDBL_MIN__ before including the boost header; I chose to set it equal to __DBL_MIN__. Note this only works with header-only libraries and is a bit risky. The code becomes something like:

#undef __LDBL_MIN__
#define __LDBL_MIN__ __DBL_MIN__
#include <boost/math/special_functions.hpp>
#include <limits>
#include <iostream>
int main(){
  std::cout<<__LDBL_MIN__<<'\n';
  std::cout<<__DBL_MIN__<<'\n';
  std::cout<<"Minimum value is "<<std::numeric_limits<long double>::min()<<'\n';

  double arg = -5.58377179584844451;
  std::cout<<"Bessel call with double "<<boost::math::cyl_bessel_j(5+1, arg)<<'\n';
}

Sometimes, the bug really is in the compiler or tool!

