February 07, 2019

New Training

New RSE training opportunity - intermediate level MPI. Following on from a basic MPI course, such as our Intro (Dec. 2018) or Warwick's PX425, we have a 1-day workshop on some trickier topics, such as MPI types. See here for details.


January 30, 2019

Data structures – Queues

This week we're back on data structures with another fundamental one: the queue. Simple data queues are pretty much the same as queues in real life, new items arrive at the back of the queue and data is removed from the front of the queue. They have obvious applications in any kind of program that gets data in from an external source and then has to process it in the order in which it is received. Queues are what are called FIFO data structures, first-in-first-out, because the first data to arrive is also the first data to leave. The alternative LIFO (last-in-first-out) data structure is generally called a "stack" because it is much more like stacking up paper in that the last item that you put down is the first that you pick up again.

The easiest way to implement a queue is using an array combined with a front and a back marker (I'm going to stick with the neutral "marker" description because the general idea doesn't care if you use array indices or pointers or any other mechanism that you want to). The idea is quite simple. You start with an array and when a data item turns up you insert it at the back marker and then move the back marker on one element. Adding an element to a queue is often called "pushing" or "enqueueing" so you might encounter those terms. Schematically, this looks like

Moving back marker in queue

When a piece of data is requested from the queue you return the item under the front marker and then move the front marker on by one. This operation is also known by the terms "pop" and "dequeue" so those are also worth remembering.

Queue moving front marker

And you continue doing this, moving the front and back markers as data arrives and leaves.

So far, so good but there are two obvious problems. The simpler one is what happens if data is requested when no data is in the queue. You have to have some mechanism for dealing with this case by indicating that there is no data available but that isn't too hard. The more troublesome one is what happens when the back marker goes off the end of your array? In the case as show, you can't really do anything at all and one way or another the queue will fail. This type of queue implementation is only useful if you have a fixed known number of items that you need to store as they come in and then deal with. It's not much use for getting data until your program stops and dealing with it.

If you want to do that, there are a few general solutions

1) You can get rid of the front marker entirely. Every time you dequeue an item you take the first item in the array and then simply shuffle the other items down one so that the first item is always populated so long as there are any items in the queue. This works well but it does potentially involve a lot of copying or moving of data if you have a large array that is mostly filled since every element of the array has to be shuffled up by one. If you are dealing with objects where a move/copy operation is expensive or are dealing with threaded code where some threads have to wait while this happens then it isn't the best solution

2) You can move to a circular queue. Circular queues are a bit strange and break the analogy with real world queues since queueing in a circle doesn't really work in reality. The idea is pretty simple though. Since you knowthat everything to the left of the front marker is unused, why not put new elements into there as they arrive and move the back marker around with them? Then when the front marker reaches the end of the array it just goes back to the beginning too. The effect is to mean that the front marker chases the back marker round in a circle. Shown schematically, this looks like

Circular queue

This works so long as back never catches up with front. If it does then either it has to stop adding data or it will overwrite data that is already there. It is possible to generalise these two examples to an arbitrarily long array - simply create a new array that is longer and copy the extant data into it and start again - but in most of the situations where you want a queue it should be possible to avoid this since that is another expensive operation (that will also render any pointers that you have to your data invalid so might involve more work to sort out pointers).

Generally queues are used in producer/consumer systems. There is something that is producing data and something else that is operating on it in some way. On average you have to be operating on the data at the same rate as you are producing it or the data will build up without limit, so the queue is only there to buffer data for brief periods while either your producer produces data faster than normal or your consumer consumes it slower than normal. In this case, all that you need is a large enough queue to deal with the largest expected difference in production and consumption rates. Of course, in a lot of systems you can only predict that "this queue is large enough for 99% of the expected variation" so for the other 1% of the time where you run out of space in your queue you have to decide whether it's worse to stall your entire system while you make your queue bigger (this might be OK for a web server for example where a page might take slightly longer to load but otherwise would work as expected) or simply throw away data (for example in a data logging system where any halt in the process would cause data loss but it might be lower if you just throw it away until there's space to store it than waiting while memory is allocated to hold it). Unfortunately, which is the right answer depends very much on the problem that you are working with.

As a final note, there is also a very similar data structure called the double ended queue or deque (pronounced "deck") that is similar but allows you to add and remove data from either end of a queue. They behave very similarly in general but the implementation gets a bit fiddlier because your back and front markers have to do double duty as each other.


January 16, 2019

Sandcastles

A super quick post, mainly to point at somebody else's post from a while ago, namely http://kera.name/articles/2013/01/the-dynamic-allocation-of-sandcastles/

In case the link disappears, which it looks like the original post already has, the idea is as follows. This is something between a paraphrase and a direct quote, and all credit goes to the Original Poster quoted in the link above, but not named.

The memory space in the computer is like an area of beach, marked off for the building of sandcastles. To strain the analogy to breaking point, other parts of said beach might be marked off for other purposes (e.g. sunbathing, which has no useful relation to program space, and there is far less import to leaving your towel around).

  • When you want to build a sandcastle (create any object or variable) you ask for an area of beach, and one is selected and given to you.
  • What this area contains is a load of lumpy sand, usually containing the dregs of other peoples castles, possibly cutting across multiple. It's a mess, basically.
  • You, usually, either flatten the sand out, or immediately build a castle there, in both cases ignoring what might be there already.
  • When you're done, you surrender the area, but usually don't remove your castle.
  • The organisers are then free to give that area to somebody else.

So, what does this mean? Well:

  1. If you go back to your area (keep around a pointer, reference etc to it), your castle may well still be there, in all its glory. But somebody else can come along at any time and kick it down, and beach security are going to support their right to do so. So, you can't trust the shape of any of it.
  2. You also can't TELL whether anything has happened to your castle by examining it.
  3. If you accidentally released the area (delete, dealloc etc) and you've forgotten that, you will, sooner or later, find a bit that's not how you left it, or get your castle kicked down at a crucial moment. Do NOT DO THIS.
  4. You can't even trust what seems to be a patch of flat sand, unless you flattened it, or requested it pre-flattened (e.g. calloc in c).
  5. If you inherit somebody else's area, there might be a castle there! Since you can never trust the castle, you can't use it. Somebody might have already been along and undermined all the turrets. If you try and use it as a castle, all kinds of strangeness can ensue.
  6. BUT suppose somebody has a secret technique for building castles. If they leave one un-destroyed, you might be able to find out all kinds of "secret" things. So, any secret data has to be carefully destroyed. It's not enough just to surrender control of the area. This is just like the need to carefully destroy data on an old hard drive - by actually over-writing it, not just deleting it.
  7. If you, or somebody else, spills out of their designated area, you generally end up mucking up bits of each-other's castles, usually in such a way as to ruin them. DO NOT DO THIS either.


It's a simple analogy, but it is perfect to explain why it's so easy to miss uninitialised variables, and premature free/delete/dealloc operations. Often things still look perfectly fine! Often all the areas close to yours are unallocated, so you can apparently spill out of your area without consequence, until management changes policy on picking areas and suddenly things go wrong.


November 28, 2018

Experience Errors to Excel

No, not the Microsoft spreadsheet program. Warwick RSE don't think Excel is evil, or that it makes errors more likely, but most of what we work on or talk about is beyond the point where Excel, Numbers or any other spreadsheet software is really efficient. Anyway, this post is going to be a little bit of programming and research philosophy, as well as some practical bits about how to make errors in your code and research less likely.

Tools of the Trade

First, a few easy rules to make serious errors a lot less likely.

Have some Standards

Most of the "Old Guard" programming languages, the ones you find in banking software or aeroplanes, have very strict standards dictating what is valid in the language. They are usually revised every 3-5 years, and controlled by some professional body, often ISO. A working compiler or interpreter must obey these standards or it is wrong. In practice, support for new versions of standards takes some time to develop, but this is usually well documented. Some major packages, such as MPI, also have standards. If your language (or library) has standards, follow them! If it doesn't have formal standards (e.g. Python), be as conservative as you can in what you assume. For instance, try the calculation 2/3 in Python 2 versus Python 3.

7 Ps

Proper Planning and Preparation Prevents Piss Poor Performance. Supposedly a British Army adage, and one of my Mum's favorites. Plan before starting! Work out what you're going to do. Prepare - read up on any new techniques, do some "Hello World" examples with any new packages etc. Check all the bits of your plan can work in practice - can they be fast enough? Is your chosen language/package a good choice? Can you handle the amounts of data/computation/other needed?

Borrow your Wheels

There is a crucial balance between reinventing the wheel everytime you write some code on the one hand, and importing a package for every non trivial task on the other. Neither of these are a good idea, and both tend to lead to more and worse errors. Borrow widely (with proper attribution), but don't code yourself into a morass of dependencies that you can never hope to test. Especially don't borrow from things that are no better tested or verified than your own code - this is a recipe for disaster.

Document Everything

It's very easy to assume some feature or limitation of a code snippet is "obvious" when you're writing it, only tocompletely forget about it when you come back. The only way to avoid this is to document. Do this a few ways.

Restrictions on a code snippet, that don't impact anything else, so don't fit either of the next two paragraphs, should be commented as near the line as possible. Try and make it so that the comment still makes sense even if the code changes a bit - e.g. don't use line numbers, don't use "the following line" etc.

Restrictions on a function (parameter x must be +ve etc) are best done using one of the documentation libraries, so the restrictions are in comments near the function code.

Restrictions that run beyond a single function (don't use this code for more than 1 million elements at a time!, NaNs must be removed from data before input) should be mentioned both on the functions where they arise and in your examples and any user docs. User docs are vital if you're sharing you code with other people.

Check yourself before you Wreck yourself

Sometimes, you have known preconditions (things which must be true when a function is called), invariants (things which should stay the same throughout), and postconditions (things which are true after a function ends). You should consider adding code to check (some of) these things. Sometimes a compiler can do some of this for you, such as Fortran's array bounds checking, or C++'s ".at()" to get a vector element, rather than using "[]". Since these can be costly in run time, they may be better as a debugging option rather than a normal part of the code.

Test your Limits

The last essential element is testing. This is a very broad topic, and really isn't easy. At its core though, it means working out what parts of the code should do and checking that they do it. Smaller parts are easier to handle, but (remember your time is precious) can mean writing lots of trivial tests where you could easily absorb them all into one test. If it is at all possible, have at least one test case of the entire code - something you can run and know the answer to, that isn't too trivial. We talk a little about testing in our workshopsand there are millions of books, blog posts, papers etc on how to do it. Just make sure you do something. The rest of this post should convince you why.

The Ideal World

In an ideal world, any code you write has a well thought out, complete specification. Its purpose is fixed and perfectly known; there's a wealth of known results to compare to; the computations are all nice and self-contained; nobody else is really working on it, so there's no rush; and yet you can still do something novel and interesting with it. In this case, avoiding errors is as easy as it ever gets. You split the code into self-contained modules, each doing one thing and doing it well. You write careful error checking into every function, making sure they are never called with invalid parameters. You test each function of every module carefully in isolation and as you put the modules together, you test at various stages. Finally, you run some problems with known results and verify that you get the correct answer. As you write, you document all the functions and the assumptions they make. Finally, you write some user documentation describing what the code is and does, what its limitations are.

That's all great in theory, but for it to be enough it really is essential that everythingof importance be in the specs. I am not sure that has ever happened, anywhere, ever. It gets close in things like Aerospace and Medical engineering, but even there mistakes are made.

Code or Computation

The second best situation, from a coding point of view, is the one where the Code is trivial, but the Computations hard. For instance, a simple program to count which words occur near to some target word is fairly easy to write, although far form trivial (capitalisation, punctuation, hyphenations etc), but text concordance analysis is widely used and is able to find new, novel results if correctly analysed. Or imagine coding up a very sophisticated algorithm, with minimal support code. From a code perspective this is also quite easy - you just make sure the code contains the right equations (and hope that you wrote them down correctly).

Of course, in these case you still do all the things from the ideal-world - you modularise the code, you test it carefully, you document the assumptions. I'm not going to discuss all the details of this here: we cover a lot of it in our training, especially here. You do all this, and you'll catch some errors, but it probably wont be all of them.

Sadly though, with many pieces of code you may write, you don't even have this level of "simplicity". The code is hard - it uses difficult techniques, or has many interacting pieces, or there aren't any good test cases you can use. Testing, verification and documentation as important now, if not more so. But you wont find all the potential errors.

What's the Point?

But at this point, it seems a bit pointless. You make all this effort, spend all this time on testing and verification, and its still probably got bugs in. Your code still probably has errors somewhere no matter what you do.

Well firstly, and most practically, the more you check and test, the smaller your errors are hopefully getting. You catch any really big glaring errors, you fix lots of things, and you absolutely have a better piece of code. The time you're spending is never wasted, although its important to try and direct it to the places you get the most return. Keep the 80-20 rule in mind. You get 80% of the gains from the first 20% of the effort in many endeavours, but getting the final 20% takes the remaining 80% of the time. This doesn't mean you should fix 20% of the bugs though! It means you finish all the easy 80s before moving on to the more finicky checks.

Secondly, more philosophically, you will always know that you did it. You can point to the time you spent and say "I tried. I did the things I was supposed to. I wont make this mistake again." That is a very nice thing to be able to say when you find you've made a mistake that's going to cost you time and/or effort.

Sh*t Happens

So, if you've published a paper, you've probably published a mistake. Hopefully, this isn't a important msitakes, just some minor grammar, a misprinted formula, or a miswritten number that doesn't affect results. It's pretty important to keep perspective. Everybody makes these mistakes. The important thing is not to make them because you didn't do things how you know you should.

I like to read Retraction Watchon occasion. Much of what they publish is serious misconduct from serial bad actors, but sometimes there's an interesting post about honest mistakes, and interesting commentary on how it's meant to work. For instance, when should a paper be retracted rather than amended? How do you distinguish stock phrases from plagiarism? What's so bad about salami slicing?

I particularly like things like this retraction, where an error was found that needs more work to address, so a correction wasn't suitable; or this one where the error meant the result wasn't interesting. The errors probably happen a lot more often than the retractions or corrections. That's not really in scope of a blog post, but it's good to see the process working well.

$600,000 Mistakes

To summarise briefly, everybody makes mistakes. If you're lucky yours will be minor, caught before they go "into the wild", and make funny anecdotes rather than tragic tales. The worst thing you can do, as a researcher, is try to hide your errors. You should fix them. Whether this can be done in a follow-on paper, or needs a correction or retraction, it should be done!

Errors in code, rather than papers, are a tiny bit different - they're probably more likely to be harmless. But don't be fooled - if the error invokes "undefined behaviour" (anything outside of, or not conforming to, the standard) you will never know whether the result was affected unless you redo everything. It's very tempting to run a few test cases, find they're OK and assume all the others are fine too. The definition of undefined behaviour means you can't really know - its perfectly valid to work 99 times and fail the 100th. If you find any, you have to fix it and redo all results. Hopefully they're unchanged. That's great. You fix the code, you upload the fixed version - admitting the error and the fix, and you add the error to your list of "things to be careful of". You don't need to do anything about your results - they're still solid.

If results are affected, talk to your supervisor, or PI, or trusted colleagues and work out what's an appropriate solution. Make sure you've learned from this and you're not going to make the same mistake again. The former CEO of IBM, Thomas John Watson Snr said it best:

Recently, I was asked if I was going to fire an employee who made a mistake that cost the company $600,000. “No”, I replied. “I just spent $600,000 training him – why would I want somebody to hire his experience?”

Don't be scared of bugs. They happen and they always will. But the more you find, the better you develop your "sixth sense" and can almost smell where they might be, and the better your code and the research you do with it, will become.


November 14, 2018

Now with Less NaN!

This week I'm at Supercomputing 2018 in Dallas, enjoying a wide variety of talks! Just a short post on a couple of the "Aha" moments from this talkon correctness and reproducibility of Floating point code by James Demmel (University of California, Berkeley). I had never noticed the perils of NaN and Inf propagation he mentioned and I think they're really worth knowing. I said a bit about IEEE 754which dictates what NaN and Inf are and how they behave in this post. This entry is mostly about unexpected ways the standard can be broken by accident, mainly due to optimisations that rely on multiplying by zero. Per the standard, anything except NaN, times by zero is zero, but any operation involving a NaNmust give NaN.

Max Value the Obvious Way

Imagine you're asked to write a routine to find the Max or Min value of an array. It's obvious. Start by assuming it's the first value, and then compare every other value to see if it's higher/lower, until you reach the end. Max/Min MUST look at every value in the array to get their answer (I am sure there's a mathematical word for that, but I don't know it), so this is the most efficient option there is in terms of comparisons. However, it has a major flaw once we account for IEEE behaviour: NaN compares as false to everything, even itself.

So, imagine running this algorithm on the arrays [NaN, 2, 1] and [2, NaN, 1]. We get NaN and 2 respectively, as we've made the first element "sticky"! Step-by-step:

Max of [NaN, 2, 1] :
Take first element : NaN
Compare to second element: 2 > NaN -> False : Max = NaN
Compare to third element: 1 > NaN -> False : Max = NaN
RESULT: NaN
Max of [2, NaN, 1] :
Take first element: 2
Compare to second element: NaN > 2 -> False: Max = 2
Compare to third element: 1 > 2 -> False: Max = 2
RESULT: 2

This was an example given in the talk, and it's obvious when you look at it, but easy to not think of! There's two serious problems here: firstly the loss of the NaN value in the second case, which violates the standard and secondly the change of answer on array permutation, which is just generally bad for something like Max which should be invariant.

3x3 Determinant

Now imagine taking the determinant of a 3x3 Matrix. Generally, this is easy. You take each element of the first row, and multiply by the determinant of the 2x2 sub-matrix omitting the column at hand, with appropriate signs. Suppose a first row element is zero: obviously that term contributes 0 regardless of the 2x2 submatrix! So you can speed things up by testing for zero and omitting any such terms, right? Not once NaN gets involved. Now consider this matrix:

1 	0 	0
NaN	1	1
0	0	1		

We should get: 1 * 1 + 0 * NaN + 0 * NaN = NaN. But with our clever optimisation, we get 1 * 1 + 0 + 0 = 1. This is a pretty big problem! We're hiding away that NaN, by accident!

There's another problem too! Even if we checked for NaN in our array, what if we have an Inf instead? Inf * 0 should give us NaN. Effectively, Inf is "a number, but bigger than we can represent". Zero is "anything smaller than the smallest number we can represent". The product is then kind-of "any number, we don't know which". It might be zero, it might be Inf, it might be anything. So, the standard dictates it be NaN.

So, if we use the optimisation, we're not just hiding a NaN, we're stopping one ever being created! This is potentially an much more serious problem, because we might be happy using the Compiler flag to give us errors on NaN, but not willing to make under or overflows fatal. Then we'd completely miss this problem!

Optimisations in BLAS

The talk (at least it's first half) focused on the Linear Algebra library BLAS, which is heavily optimised, but as it turns out does have several bugs like this. For instance, a slightly complicated multiply op, alpha * A * B, might check for alpha = 0 in order to save the work of the matrix multiply when the end result should be zero anyway. Unless, of course, either A or B contains any NaN's when we must get back NaN.

But it's all gone wrong anyway...

One comment on the talk, which is worth a mention was, to paraphrase "If you're getting NaN's things have already gone wrong, so why should the Linear Algebra library bother to check? You should be checking." This is an interesting point and in some circumstances holds true. I tend to think of NaN or Inf as bugs and would prefer to write code so they can't arise, but I am not working with long linear algebra workflows where you want to run a series of operations in something like BLAS and don't examine the intermediate steps.

Reproducible BLAS

The problem above was largely the topic of the first half of the talk, which focused on how BLAS could manage the problem via errors/exceptions, and being careful with optimisations. For instance, use the optimisations to do the calculations, but also return an error code detailling any issues with the arguments.

Another large part of this talk was about reproducibility, in particular in parallel codes. Here, depending on the processor decomposition, you can get subtly different results for a sum over numbers of varying size. This is a really tricky problem, which is being addressed by things like ReproBLAS.

Punchline

Dealing with Floating point numbers is hard. It gets harder once you demand reproducibility at a bitwise level. Both the Maxval issue above, and a simple sum with values of varying size may not give the same answers when terms/entries are reordered. In particular, parallel codes always effectively reorder terms whenever you chance the decomposition (e.g. run on more or less cores).

Hopefully, you never need to really worry about NaN (or the less tricky, but still tricky Inf), but do be careful! The problems I've discussed are mainly due to treating zero incorrectly per the IEEE standard, the problems with NaN or Inf following from this. Be careful with zero! It might not be as small as you think ;)


October 31, 2018

Changing your Identity

A pretty quick entry for Halloween, that comes up pretty often - if you're accessing several remote machines with `ssh`, how can you have custom "identities", i.e. custom key pairs? There's a few reasons to do this, including history (already set up some keys for different remotes on two different machines) and security (perhaps one set of keys needs to be longer (more bits), or have a better passphrase).

SSH key pairs

First up, what are SSH keys? SSH (secure shell) is a way of getting a command prompt or terminal on a remote machine, such that all communication between the machine you're sat at and the remote is encrypted, and can't be spied on. This uses Public Keycryptography. What this means, is that on your local machine, you have some private key that you keep safe. You give out the public key to any remotes you want to access. With the public half, they cannot pretend to be you, but can verify that you are you. If they use the public half to encrypt some data, it can only be decrypted using the private key you hold.

You generate these key pairs with a command like `ssh-keygen -t rsa -b 4096` which then gives the option of using a custom name for the pair, and adding a passphrase. In general you should always use a secure but memorable passphrase. This is an extra layer of security - even if the private key somehow escapes you, it still cannot be used with out obtaining or cracking this phrase.

Custom Pairs in SSH

Lets assume, for whatever reasons, you have created several named key pairs. Perhaps you need a longer key pair for an extra-secure system, but don't want to have to update every single remote machine you may use, or perhaps some system has asked you to change your keys. Alternately, you may have generated new keys on some other machine, but want to add them as an extra set without overwriting what you already have. Regardless of how or why, you have a key pair called something that is NOT id_rsa, and want to use this to access some system.

The simplest way to do this, and it is really simple, is to use the `-i` or "identity" flag to ssh. You run something like `ssh -i ~/.ssh/custom_rsa username@hostname` where custom_rsa is the private key you wish to use. You can use exactly the same idea to transfer files using scp, which just passed the given filename on to ssh.

Custom keys with ssh-agent

A simpler, but less efficient option is to simply add all the keys you ever use to the ssh-agent on a linux/osx machine, and ssh will try each in turn. This is great if you have one or two, but gets to be a pain if you have many. ssh-agent can be a bit fiddly, and is perhaps overkill for this simpletask, but for more details see here.

Custom Identities with config

If, as well as custom keys you also have a lot of different usernames to deal with (for instance, an SCRTP username, a Github username etc), it can be helpful to set up an ssh config file (~/.ssh/config). This lets you define usernames, Authentication type (password, key etc) and identities for multiple remote machines. It's also handy to create shortcut names, if for instance, you have a very long or complex named remote you access often.

Doing this is really easy in general. Just create a file, "config" in the (hidden) ".ssh" directory. For each machine you want to use, add an entry that looks like

Host shortcutName
  HostName actualNameOrIP
  User userName
  PreferredAuthentications {"publickey" and/or "password"}
  IdentityFile pathToPrivateKey

The Host line is essential, all the rest can be omitted if you don't want them. For instance, I have a section like

Host homeserver
  HostName myname.ddns.com
  User me
  IdentityFile ~/.ssh/id_rsa_server

to use a special key when I use my home server. This uses a web service to link some name to the actual IP address, hence the ddns hostname. I could then ssh to that by doing just `ssh homeserver` and it'll use the given hostname, username, and identity file. The Hostname line can also contain an IP address which is also very useful and saves you having to remember it. E.g.

Host docserver
  Hostname 192.0.0.5
  User me

so that I don't have to always remember the IP.

Custom Identities with Programs wrapping SSH

Finally, several programs use SSH behind the scenes, and you might want these to use a custom identity. For example, git can access a server using SSH, and I have a custom identity which I use for github. I use a config entry where the "host" is github.com: if I push to a github remote repo it will identify that it is using this host and use the corresponding config entry. Or I could use ssh-agent.

There is one more option with things like git, which is to override their actual SSH command. This needs a tiny bit of caution, but is very useful. For git, you simply set the environment variable `GIT_SSH_COMMAND` to be `ssh -i path/to/private/key`. This is useful for custom identities, although perhaps not as good as the other options.

But while we're here, I'll mention when this is very very useful - when you're having key trouble. You can set the GIT_SSH_COMMAND to `ssh -vv` to get verbose ssh output, which is very useful for debugging.

Other programs may have their own analogue - usually an enviroment variable or config option. See their docs for details.


October 17, 2018

Searching in data

One of the most common things that you'll want to do in programming is to look in a list of items to find out if a given item is in there. The obvious way of doing it is by going through every item in the list of items and comparing it to see if the items are the same. On average, searching for a random item in a random list of items you'll have to look through half of them before you find the item you want (obviously sometimes you'll find your item quicker than this, sometimes you'll have to look through all of them before you find the one that you want. But on average over a very large number of searches you'll compare half of the list on each search) This algorithm is generally called linear searchboth because you go through your items in a line, one after the other, and because the time that it takes is linearin the number of elements. By this I mean that if you double the number of elements in your list it will take twice as long (on average) to find a given element. In the normal notation of algorithms this is called O(n) "order n" scaling (this is one use of so called "Big O notation"). If doubling the number of elements would take four times as long (quadratic scaling) you'd say that your algorithm would scale as O(n2) "order n squared".

In general, there is no faster way of finding a random element in an unordered list than using linear search. On the other hand, if you have an ordered list then you can speed things up by using bisection. The idea is quite simple. Imagine that you have the first seven entries in the International Radiotelephony Spelling Alphabet (IRSA, don't worry it's probably more familiar than it's name!) in order

Alfa
Bravo
Charlie
Delta
Echo
Foxtrot
Golf

and you want to find the item "Bravo". Obviously, by eye this is easy but you want an algorithm that a computer can follow. You know that you have 7 items, so you compare the item that you want "Bravo" with the one at the half way point "Delta". You know that "Bravo" is before "Delta" in the IRSA and because you know that your list is ordered you can then throw away everything from and after "Delta".You now have

Alfa
Bravo
Charlie

,three items. Once again, check the middle one, which this time is "Bravo" so yay! You've found your target in your list, and it only took two tests despite there being seven items. This is because every time you make a comparison you can throw away half of your list, so doubling the length of your list doesn't double the number of operations that you need, it actually only (on average over all possible lists) adds a single additional operation. Technically this algorithm is O(log(n)) "order log n", which makes it very, very useful if you have a large number of items. Imagine that you have 256 items, you'd have to compare all 128 of them (on average) in linear search compared to 8 (on average) in a bisection search. If you go to 4096 items then it gets even more extreme with 2048 comparisions for linear search compared to 12 for bisection. Similarly if your target item was "Foxtrot" and after "Delta" then you could throw away everything before and including "Delta". By alternatively throwing away either the top or the bottom of your list you always get to the item that you want.

As always though, there are problems. First you have to be able to access any element of your list freely. I quite happily said 'compare the item that you want "Bravo" with the one at the half way point "Delta"' without asking how you go from "half way point" to "Delta". While we haven't covered them yet there are some forms of data structure (notably linked lists which we'll cover soon) where you can't easily do this kind of hopping around inside your list and bisection is usually much less efficient or impractical in these kinds of data structure.

Second, this really does onlywork on ordered lists as you can clearly see by the way in which I just threw away half of my list based on the property of the middle item. You might be tempted to say "ah! I can just sort my list" and indeed you can, but even the best algorithm for sorting a list is O(n log(n)) "order n log n", which in general means that it is just a bit slower than a linear search through your data. What this means is that it's only worth sorting your data and then using bisection to search in it if you're searching in your data much more than you are adding to it, so you can sort your list once and then do many searches on it before it becomes unsorted again when you put new data in. This general idea (although not always the details) is the idea behind indexesin databases. You create ordered lists of data that you want to search on a lot so that you can find the data that you want quickly.

Bisection is a very general approach that you find in all sorts of problems, not just finding items in lists. The general idea is the same : throw away half of you items because you know that the item that you want cannot be in that half.


October 03, 2018

Pretty please print

The possible impossible

A pretty short entry this week, on a likely familiar theme. You have some buggy code. You try and trace the bug by inserting 'print' statements. Eventually you have prints wrapped around the troublesome piece and you may find yourself saying something like:

Aha! It reaches line A, but never line B! But thats IMPOSSIBLE! There's nothing there but simple code that can't just not execute! In fact there's only blank lines!

As usual, once you find yourself uttering a line like that, it's time to step back. Computers only rarely do the impossible.

Many things can have happened here, but a common, very confusing one, is that the second print has run, but you're not seeing the result. This is common because most languages default to bufferredoutput to stdout. We mentioned this a little bit in context of output streams in a previous entry, here we give a bit more detail.

Aside - stdout

If stdout and stderr aren't already familiar, they are the short names for standard out and standard error. There's a little bit about those herebut basically they are the designated way of writing output (information) and errors to the terminal. They're separate because sometimes you only want to see one or the other. You can direct them to separate files if you want (see the link for how). It is very good practice to respect the designation, and use the correct stream when you write information. It's acceptable to write everything to stdout in your code, but you really shouldn't be writing anything except actual errors to stderr. Warnings are acceptable, but only if they're "important". Save it for things that a user really needs to know.

Back to the Print Problem

The conversation goes something like this:

"So, what's going on with my prints? If it's printing but I don't see it, where is it?"

"It's in an output buffer, waiting until there's enough stuff to be worth printing. "

"Hang on, worth printing? I think it's worth printing! Print it!"

However, there is a good reason. The easiest way to show this is with some code. The following Python3 code uses both buffered output and unbuffered output where we force each print statement to flush the buffer so each print happens exactly as it is reached:

import time
_iters = 100000

def many_prints(flush):

  for i in range(0, _iters):
    print(i, flush=flush)

start_time = time.time()
many_prints(False)
timef = time.time() - start_time

start_time = time.time()
many_prints(True)
timet = time.time() - start_time

print("No flushing --- %s seconds ---" % timef)
print("Flushing --- %s seconds ---" % timet)

Run this a few times and you should see that flushing is slower. It's not always by much, but it can be 20%, or more, depending on system, which is sometimes significant.

Why so Slow?

So why is unbuffered output slow? In this case, mainly because interacting with the terminal and redrawing the screen takes time. Writing to a file can be even slower, because now we have to interact with the file system and disks, and maybe even wait for a success signal to come back. It's much better if these events can be collected into larger bursts, so we only have to do that rigamarole once.

Unbuffered Output and Flushing Options

We showed Python3 above, where we can simply use the flush Keyword. In Python 2 we have to do a little more. This linkshows the main options. We can force the Python interpreter to not use buffering at all, or we can override it for given output streams. Or we can write custom objects that flush exactly when we want.

In Fortran, we just call the FLUSH subroutine either on all outputs or the unit of our choice.

In C, we can flush stdout with fflush, set stdout to be unbuffered, or use stderr which is unbuffered by default.

In C++ there is a slight wrinkle as we can use either C-style IO via stdio, or C++ streamio. std::endl is a command to flush the output stream, not just a new-line, so it's usually recommended to use `'\n' to get just a newline, and std::endl only when required.

C++ IO

While we're here, we should mention on thing about mixed IO in C++. By default, a program can mix stdio via printf and stream-io (using <<), and similarly for input. BUT these two methods have to be force to share any buffering, or else the would not work together. This means they are forced to stay in sync, so inputs and outputs work as expected. In some circumstances, especially if a lot is written to stdout, it may be useful to only use one or other of these, and disable the syncing. See here for details, but be very very careful.

Bottom Line?

The bottom line of all of this is pretty simple. When doing print-style debugging, or printing error messages, either use stderr (making sure that's not buffered in your language or system), or flush after every essential stage of your prints, or disable buffering completely while debugging. You don't need all three. And finally, remember the golden rule. If it looks impossible, you're probably assuming something that isn't true. Not terribly snappy, but a useful thing to remember.


October 02, 2018

Upcoming Training Opportunities

Warwick RSE's autummn term training is now available for signup for any University of Warwick Staff or students, and anybody from the HPC-Midlands-Plus consortium.

This time we have two options.

The first is aimed mainly at Warwick Researchers who wish to use HPC facilities. We'll go through getting access and some essential info you'll want to know, as well as briefly mention where else you can get computing resources.

Secondly, we have a short 3-hour seminar going over all the bits of Software Development researchers should know about. This will be a pretty rapid spin through a lot of tools and words you'll need to know. Hopefully, you'll then spot when you should go and learn more about these things as they come up in your research etc.

For dates, signup etc, see our calendar


September 19, 2018

Scheduling and processor affinity

We're taking a brief break from data structures to talk about something a bit different : scheduling and processor affinity. This is dealing with a problem that a user or developer doesn't usually think about, but is crucial to the performance of modern computers. Describing the problem is quite simple : if you have a computer that is doing multiple tasks what order should they happen in and what processor (if you have more than one) should take on each task? Working this out is the preserve of the very core part of the operating system (OS), generalled called the kernel (after the part of the nut). While the term is most familiar from the Linux/Unix world (technically Linux is the name for just the kernel and not the rest of the OS), macOS has the Mach microkernel at it's core and ntoskrnl.exe (the Windows New Technology Kernel! It was new in the mid 90s at least) is at the core of modern versions of Windows.

The simplest solution is for the kernel to just create a queue of programs that have to be done and then run them one after the other, with each new program being handed to the next free processor and programs running until they are finished. This is typically called First In, First Out scheduling (FIFO, which is a very common acronym in computing. Do not confuse this with GIGO (garbage in, garbage out) which is also a very common acronym in computing). FIFO scheduling works very well for computers that are working on lists of tasks having equal priority. So batch processing of data for example works very well with FIFO scheduling. But you can immediately see that it won't work well for all computers that humans actually work with interactively. If all of your processors are working on long, slow processes then your computer would stop working completely and you wouldn't even be able to stop the processes because the computer wouldn't be processing your keyboard or mouse input.

Normal computers tend to work using a system of preemptive scheduling. This works by taking a program and letting it run for a while on a processor and then saving its state, stopping it and then giving the processor another program to run on that processor for a while until stopping it etc. etc. Eventually you get back to a process that has been running before and you restore it's state and start it running again. Because the state of the program is stored completely it is entirely unaware that this has happened to it, so it just runs on regardless. (as a historical note, this is called preemptive scheduling because the OS kernel preempts the programs running on it. It basically says "you're done, get out of the way" and lets other programs run. Older systems used cooperative scheduling where a program would have code in it where it said to the kernel "I'm ready to be switched away from". This worked fine until a badly behaved program didn't reach this point whereupon every other program on the computer stopped working.)

There are disadvantages to preemptive scheduling. The saving and restoring of the state of programs is generally called "context switching" and it does take time for this to happen. (FIFO schedulers don't have this problem because programs run until they complete, but even now you can't really use FIFO schedulers because there tend to be dozens of programs running on a computer at once and some of the effectively nevercomplete until the computer shuts down). If you only have one processor then context switching is the inevitable price of allowing multiple programs to run at once, but if you have two processors and two programs then they can happily run alongside each other so long as they are each running on their own processor.

But what if you have four programs and two processors? As you'd guess, in general you'll put two programs on each processor. But what happens if the processors aren't running through their jobs at quite the same rate? This can happen because real schedulers are rather more complex than I suggested here. In that case, you can come to a situation where the program that was running on processor 1 reaches the point where it should run again but processor 1 is still busy. Should it then be run on processor 2? The answer, is "maybe". Sometimes programs benefit a lot from data being held in the fast CPU cache memory and if it now runs on a different processor then that benefit is lost. Under that circumstance it would be better to keep that program running on the same processor even if it has to wait a bit longer until it runs again.

All common operating systems let you do this and it is called "processor" affinity. You tell the kernel which processor (or processors) you want a given program to run on and it guarantees that it will respect that rather than giving that program to the next available processor.

I've glossed over a lot of things about scheduling here (in particular Intel's Hyperthreading technology is a nightmare for scheduling because it creates "virtual" processors to try and take advantage of the fact that most programs don't use all of the parts of a processor at once. The problem is that these virtual processors don't behave the same way as real processors so the scheduler has to be much more careful about what programs it should run on them.), but this is a decent overview of how scheduling and process affinity works on modern computers and generally it works quite well. General purpose programs don't set CPU affinity and just run as quickly as possible when a processor is free, and programs that do heavy numerical computation set CPU affinity to try and maximise cache performance. But there are interesting cases where it can fail.

Recently we had an interesting discovery using the OpenMPI parallel programming library. This is a distributed programming library intended to write programs to run on large "cluster" computers that consist of lots of single compute nodes connected by a network. But it does also work to run programs on a single computer. We were testing a computer with 16 cores and found that running a single 16 processor job was about 16 times faster than the same problem on a single core. When we tried the same problem with 4 simultaneous 4 processor jobs things were very different. The 4 processor jobs were all slower than they were when running on a single core. After a couple of hours we found out why: OpenMPI sets processor affinity for all of the programs that it runs, but it always sets them for the lowest numbered processors. So running a 16 core job ran them on cores 0 to 15, but running 4 sets of 4 core jobs ran them allon processors 0 to 3, so they all basically got a quarter of 4 cores. Add in some overhead from context switching and you get that it's slower than running on a single core. One quick parameter to the library to tell it not to use processor affinity and we were back to where we expected to be, but the lesson to take away is that you can get far to used to things like schedulers just working. If you're finding that things are running slower than you expected, check whether or not there's something odd going on with how your processors are being used.


May 2024

Mo Tu We Th Fr Sa Su
Apr |  Today  |
      1 2 3 4 5
6 7 8 9 10 11 12
13 14 15 16 17 18 19
20 21 22 23 24 25 26
27 28 29 30 31      

Search this blog

Tags

Galleries

Blog archive

Loading…
RSS2.0 Atom
Not signed in
Sign in

Powered by BlogBuilder
© MMXXIV