All 20 entries tagged Programming


June 27, 2019

"And then it just clicked

A bit more of the general "philosophy of programming" today, based on a quote I found on the brilliant "C FAQ", currently here and hopefully there to stay. The quote is from Q 18.9b, on learning resources and says:

A word of warning: there is some excellent code out there to learn from, but there is plenty of truly bletcherous code, too. If you find yourself perusing some code which is scintillatingly clear and which accomplishes its task as easily as it ought to (if not more so), do learn everything you can from that code. But if you come across some code that is unmanageably confusing, that seems to be operating with ten bandaged thumbs and boxing gloves on, please do not imagine that that's the way it has to be; if nothing else, walk away from such code having learned only that you're not going to commit any such atrocities yourself.

It's a very good idea to read other people's code when learning - either completely in the wild, or in the form of snippets on sites like Stack Overflow. But always keep in mind that there is some truly terrible code out there, even in commercial packages. It's easy to fall into the trap of thinking that your problem is just "soooo hard" that it just can't be done elegantly, or readably, or even conforming to standards, at all, and mostly that just isn't the case.

Obviously, different people will find different things to be readable, clear, elegant etc. I have an Undergrad degree including Maths, and I rather like the shorthand notation of sums and sets and implication. To some people though, the symbols are simply "all Greek" (pun thoroughly intended). Writing things in a way everybody will find clear is thus often a losing battle. But there is one very important thing to avoid at all costs: the anti-click.

The Click

The 'click' is that feeling of clarity when you jump from grasping the parts of the thing to understanding the whole. I get it a lot with things like Anagram word puzzles, or even song intro quizzes (the kind where you're meant to guess the title). One moment you're seeing a jumble of letters, trying to hold all dozen in your head, and the next you can 'see' the word it must be. Similarly, one moment you're following along the notes and words and the next the key lyric pops into your head and the song is obvious. Optical illusions do it too - and once you've seen it you struggle to "un-see" it again.

This is the 'click' and it's a real asset that you can develop with practice, by reading plenty of code, and writing plenty of code, until you can 'get the gist' of the thing easily.

Symptoms and Diagnosis

It also comes up a lot in debugging - you just 'get the hang' of the way an error pops up, and know what must have happened. That's why reading issue trackers or mailing lists can feel so confusing - the people answering the questions seem to have some mysterious knack for guessing what's wrong from the most un-intuitive errors.

For example, writing in C you may use a function like 'sqrt', and get an 'undefined symbol' error from the linker, which is saying that it can't find the compiled code for that function. So, being logical, you go back and check that you have absolutely definitely included the "math.h" header, which absolutely definitely declares a sqrt function. What did you do wrong? Well, probably you forgot to link using the '-lm' flag, which means "link against the maths library"; that library contains the compiled code (the code isn't in the math.h file, only the declarations are). This error can seem very weird but, once you know the cause, it's "obvious".

We had a go at writing a catalogue of programming bugs a while ago, which tried to sort of encapsulate the 'symptoms' of a bug (available here, PDF download). Any bugs, errors or omissions please let us know (rse {@} warwick.ac.uk). Once you see enough bugs, you'll probably start to 'click' and 'just know' what to look for.

The Anti-Click

I mentioned optical illusions up there, and how you can't seem to "unsee" them. Sometimes code has this same sort of thing - an illusory apparent action or cause which is actually not real. This can be awkward and dangerous, because you, and others, will struggle to "unsee" it. It's hard to give an example that isn't really contrived, so I shall fall back onto a classic bit of C: pointer declarations.

In C, you declare pointers using a '*', such as `int * p;`. This is fine. But what about `int * p, q;`? Is q a pointer? No, it's not: only p is a pointer, and q is a plain int. Some people argue that the "proper style" will prevent this anti-click, that is, writing that as `int *p, q;` and associating closely the * and the p. This is surprisingly hotly debated (e.g. here and here) with some really tempting anti-clicks (such as here - do read the answers, as the OP there is wrong) but ultimately does need careful thought. You're probably best using whatever you personally find clearest, and changing that only after consideration.

The golden rule - DO NOT write misleading code, docs etc etc. Make it wordy if you have to, but it's not worth the hassle of an obvious, yet wrong, interpretation. As a last resort, just put in a comment explaining what this does not mean, or does not do.

And the flip side - NEVER trust the click - always verify. Always step through line by line and be sure you're right!

Click Immunity

One last thing - there are people for whom some things just never do 'click'. It's very hard to describe, but easy to spot. It's sort of like trying to explain computers to that one relative or quadratic equations to somebody who can't do maths. They're clearly not stupid, but they just don't seem to 'get it'. You explain one thing, like how to print, and it works, but now they need to print from a different program and isn't it just obvious that you do the same thing?

Some people seem to have a knack even beyond that - they don't just "not get it", it's worse! Bug reports display this sometimes - somebody who always gives a lot of information but somehow never includes the vital pieces. They'll post a 100MB log file, but forget to mention which OS. They'll explain exactly what commands they ran but omit some vital parameter.

There are two sides to this - dealing with people who don't click, and dealing with it when you can't click.

Hell is Other People

Usually you only have one real option to deal with other people who don't get it - guide them through it. Mostly, it just takes longer rather than never happening. You probably want to try explaining things in different ways, because what feels neat and obvious to you isn't to them. The Socratic style is handy - ask them what they're trying to do, or what they feel they should do, rather than telling them.

I just don't GET IT

I expect everybody, sooner or later, finds something that just doesn't click right. You don't get it - it's unintuitive, your gut feeling is always wrong etc. So what do you do? Treat yourself the same as you would somebody else - read different explanations until you find one that fits you. Ask yourself "what am I trying to achieve". Ask yourself "when I ask this question, what might somebody need to answer it" - and if in doubt, ask them! Stop assuming anything, and walk through every step of the problem.

Walk away and do something else - give your brain time to work things through. Never stop asking questions and always learn from the answers. You don't need to 'click' to do something, you just need to watch for wrong intuition. Keep notes - add comments in your code for you to come back to.

Remember - intuition is a skill like any other. It can be wrong. And while you might not have a talent for this thing, with effort you can make it work.


June 12, 2019

Datastructures – Linked lists part 2

Follow-up to Datastructures – Linked lists part 1 from Research Software Engineering at Warwick

This entry is back to the subject of linked lists. In the previous post on linked lists we described the idea of a linked list and how you create one. In this post I'm going to talk about how you remove items. The idea is pretty simple: to remove item "n" you want item "n-1" to believe that the item after it is now item "n+1" and you want item "n+1" to believe that the item before it is item "n-1". There are a few wrinkles to deal with if the item is at the start of the list (recall that this is usually called the `head` element) but mostly the idea is this simple. This is most easily shown by a code example recalling the definition of linked list items from part 1.

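/*Note: head is passed as a pointer to the caller's head pointer (call as
  remove_item(item, &head)) so that a change of head is seen by the caller*/
void remove_item(struct llitem *item, struct llitem **head)
{
  if (item->prev) {
    /*Item has an item before it in the list*/
    item->prev->next = item->next;
  } else {
    /*Item does not have an item before it. It is the head*/
    *head = item->next;
  }

  if (item->next) {
    /*If item was the head this sets the new head's prev to NULL*/
    item->next->prev = item->prev;
  }

  free(item);
}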


SUBROUTINE remove_item(item, head)
  TYPE(llitem), POINTER, INTENT(INOUT) :: item !Item to be removed
  TYPE(llitem), POINTER, INTENT(INOUT) :: head !Head of linked list

  IF (ASSOCIATED(item%prev)) THEN
    !Item has an item before it in the list
    item%prev%next => item%next
  ELSE
    !Item does not have an item before it. It is the head
    !The new head's prev pointer is nullified by the block below
    head => item%next
  END IF

  IF (ASSOCIATED(item%next)) THEN
    item%next%prev => item%prev
  END IF

  DEALLOCATE(item)

END SUBROUTINE remove_item

This routine removes an item from a linked list and deals with changing the head of the list to match if needed. After the item is removed from the list it is deallocated. You might not want to do this automatically when an item is removed from a list (you can have multiple lists and freely move items between them) but you do have to do it when you are finished with the item or you will have a memory leak. One of the downsides of linked lists is that since they involve direct pointer operations in normal use memory leaks are a greater risk than in many data structures.
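The same bookkeeping can be sketched in Python, which is handy for experimenting because there is no manual memory management to get wrong. This is only a rough sketch: the `Item` class and the `build_list`, `remove_item` and `to_values` helpers are names made up for this illustration, and since Python functions can't rebind the caller's variable, this version returns the (possibly new) head instead of updating it in place.

```python
class Item:
    """One element of a doubly linked list (hypothetical stand-in for llitem)."""

    def __init__(self, value):
        self.value = value
        self.prev = None
        self.next = None


def build_list(values):
    """Build a list from values and return its head (None if values is empty)."""
    head = None
    current = None
    for value in values:
        item = Item(value)
        if head is None:
            head = item
        else:
            current.next = item
            item.prev = current
        current = item
    return head


def remove_item(item, head):
    """Unlink item from its list and return the (possibly new) head."""
    if item.prev is not None:
        # Item has an item before it in the list
        item.prev.next = item.next
    else:
        # Item is the head, so the item after it becomes the new head
        head = item.next
    if item.next is not None:
        item.next.prev = item.prev
    # Nullify the removed item's own links so it can be reused safely
    item.prev = None
    item.next = None
    return head


def to_values(head):
    """Walk the list from head and collect the values in order."""
    values = []
    current = head
    while current is not None:
        values.append(current.value)
        current = current.next
    return values
```

A call looks like `head = remove_item(some_item, head)`; forgetting to store the returned head is the Python equivalent of forgetting to update the head pointer.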

You can also see what's going on in a diagram showing how you remove both a "normal" and a "head" item from the linked list. In both cases links shown in green are new links and links shown in orange are links that are being deleted or changed.

Normal Linked List delete

Removing head element of a linked list

Hopefully the idea is quite clear between the code and the diagrams. If "item" is item "n" in the list then "n-1" is item->prev and "n+1" is item->next. Either or both of these items may not exist if "item" is either the start or the end of the list so you have to cope with these cases. So to update the next element of "n-1" you have to set "item->prev->next" which looks a bit odd but makes sense if you unpack it. Similarly the "prev" element of "n+1" is "item->next->prev".

You will notice that the links on the item being removed aren't touched and the code simply relies on that item being deleted immediately. In a system where you want to retain an item (perhaps to add it to another linked list) you'll probably want to nullify its prev and next pointers after you remove it from the list.

That code will remove any item from the linked list that it is in. There is however a problem if you have multiple lists, each with its own head element. Imagine that I tried to use that "remove_item" function where I passed it an element from one list but the head from a different list. So long as "item" isn't the first item of the list that it is in then nothing would go wrong, but as soon as it is, it would replace the head of the second list. This would cause real problems:

  1. The list that "item" was in will become invalid. Because its "head" element is never updated it will still point to the now deleted "item" and will sooner or later fail when you try to use it. It might, for a while, look as though "item" hasn't been deleted because of this.
  2. The original list that you specified the "head" from will now be lost completely. The only reference that you have to a linked list is the head element so if anything happens to this the memory is lost forever. You have both lost data and caused a memory leak.
  3. You have also moved data. The list that you specified the head from will now be a valid list, but it will be the list that "item" was originally part of, not the list that is meant to be there.
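The mix-up described above can be reproduced in a few lines of Python. This is a sketch under assumptions: `Item`, `build_list`, `to_values` and this `remove_item` are hypothetical helpers written for the illustration, and `remove_item` deliberately leaves the removed item's links alone, as the C and Fortran versions do. Python's garbage collector hides the dangling pointer and the leak, so only the bookkeeping error is visible here.

```python
class Item:
    """One element of a doubly linked list (hypothetical stand-in for llitem)."""

    def __init__(self, value):
        self.value = value
        self.prev = None
        self.next = None


def build_list(values):
    """Build a list from a non-empty sequence and return its head."""
    head = Item(values[0])
    current = head
    for value in values[1:]:
        current.next = Item(value)
        current.next.prev = current
        current = current.next
    return head


def to_values(head):
    """Walk the list from head and collect the values in order."""
    values = []
    while head is not None:
        values.append(head.value)
        head = head.next
    return values


def remove_item(item, head):
    """Unlink item and return the new head; item's own links are left alone."""
    if item.prev is not None:
        item.prev.next = item.next
    else:
        head = item.next
    if item.next is not None:
        item.next.prev = item.prev
    return head


list_a = build_list(["a1", "a2", "a3"])
list_b = build_list(["b1", "b2"])

# The mistake: remove the first item of list A but pass the head of list B
list_b = remove_item(list_a, list_b)

print(to_values(list_a))  # A's head still refers to the "removed" item
print(to_values(list_b))  # B's head now points into A; b1 and b2 are unreachable
```

In C or Fortran the first walk would be reading freed memory, which is why it only looks fine "for a while".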

This type of error in a linked list thus causes memory leaks, segmentation faults and data corruption all in one nice package so you have to be careful to avoid it. You can program in safeguards to prevent doing things like this but they do slow down the operation of the linked list and require extra memory (as well as making rather harder some nice tricks that you can do with linked lists that I'm not going to talk about here) so many working linked list implementations don't bother. You can see graphically what's happening here too.

Error deleting an item in a linked list

Mostly linked lists in operation are stable and reliable but it is possible to get yourself into a bit of a tangle. But you can immediately see that this makes it a lot easier to remove an element from the middle of the linked list than if you had an array. If you had an array then you'd either have to flag that item as "invalid" somehow to know that you shouldn't use it anymore or, failing that, actually copy all of the elements in your array above the item being removed down to fill the gap. The first one adds complexity and means that you can't actually reduce your memory footprint when removing items (not quite true if you use an array of pointers to items but that's even more complicated) while the second is quite often very slow. This is especially true if you are deleting multiple items and use a library that guarantees contiguity of the elements of your array after every delete operation (e.g. std::vector in C++; this behaviour is generally an advantage but here it is a cost. You can get around it in other ways in C++ but other libraries are less flexible). The cost is manageable if you delete many items and then pack the array up but packing after every deletion can be very costly.

The next entry in this section will describe adding items to a linked list which, as you might imagine, is quite similar to deletion but in reverse.


May 30, 2019

Black and white

Quick 'philosophy of programming' entry this time. General solutions to common programming problems are often called 'design patterns' (e.g. the wiki article or the book which started the name). The idea of these is to have language independent (as far as possible) 'patterns', like clothing patterns, which can be tweaked to fit a specific situation. A lot of these patterns seem obvious, which is good, and since they're developed and tested by many people they can be very valuable in showing you questions you hadn't even thought of.

This week's topic is perhaps too simple to really call a pattern, but it is a very useful thing to keep in mind when doing anything that deals with restricting functionality which exists, but should not be allowed. For example, forms which take user information often disallow anything except numbers in a 'telephone' field. A code I work on has a lot of user-specifiable options, but as the programmer I know that some are incompatible where it might not be obvious to the user - and I want to either warn or abort if these are used together.

There are two general approaches to things like this, and which you choose depends on many things. You have to maintain some kind of list to check against, but you can choose to use either the "blacklist" or the "whitelist". The former, the "blacklist" is a list of the things which aren't allowed, and anything not in the list is OK. The "whitelist" approach means keeping a list of the things which are allowed, and anything not in the list is excluded.

Sometimes the choice is fairly easy, because one method is a much, much simpler list. For instance, in the phone number example, it is far easier to use the whitelist, allowing only '1234567890', but nothing else. If you try the other way, you might think to exclude letters, but what about Greek or Cyrillic characters? On the other hand, this is a source of deep annoyance if you forget any needed character - in the example I just gave, one could not put any spaces in, which is annoying, nor brackets or the '+' symbol.
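As a rough sketch of the phone number example in Python, here are both approaches side by side. The function names and the exact character sets are assumptions made for illustration (the whitelist here already includes the space, brackets, '+' and '-' that the digits-only list forgot), not a recommendation for real phone validation.

```python
import string

# Whitelist: the only characters that are allowed (an illustrative choice)
ALLOWED_PHONE_CHARS = set("0123456789 +()-")

# Blacklist: the characters we thought to forbid, here only ASCII letters
FORBIDDEN_PHONE_CHARS = set(string.ascii_letters)


def valid_by_whitelist(text):
    """Accept only if every character is on the allowed list."""
    return len(text) > 0 and all(ch in ALLOWED_PHONE_CHARS for ch in text)


def valid_by_blacklist(text):
    """Accept unless some character is on the forbidden list."""
    return len(text) > 0 and not any(ch in FORBIDDEN_PHONE_CHARS for ch in text)


for candidate in ["+44 (0)20 7946 0000", "020 7946 abcd", "020 7946 ΑΒΓΔ"]:
    print(candidate, valid_by_whitelist(candidate), valid_by_blacklist(candidate))
```

The last candidate uses Greek capitals: the whitelist rejects it without anybody having thought about Greek, while the blacklist lets it through because nobody did.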

A classic example of the poorly thought out whitelist is in name fields which often exclude characters like the apostrophe, annoying the Scots and the Dutch for eternity. And what about accented letters, or the German Eszett (ß)? With a whitelist, you need to be sure you've caught everything, or people will be, rightly, upset. For a user-name on a website, and for a password, it is probably fine to allow any ASCII or Unicode character and set up your systems to handle them, leaving far less upset without any real cost to you.

On the other hand, with a blacklist, anything not forbidden is permitted. These are generally used in cases where certain characters have a function and so must be excluded, even if this annoys. So, for instance, in most programming languages variable names may not start with a number, nor contain a comment character.

Apart from the length of the lists, the two methods trade off annoyance to your user, who must wait until you fix the omission (with a whitelist), against potential unknown failures and security risk (with a blacklist). Imagine the 'incompatible features' problem with both methods. If I use a whitelist, and forget to allow some pairing, my worst case is that I will likely be asked (somewhat irately) why X and Y can't be used together. I realise they can be, I update the code and I make a new release version to fix the omission and everybody is happy. If I use a blacklist and forget that some X and Y don't work together, my worst case is that one day I have to tell somebody that their last n years of research is all invalid, because the simulation they ran didn't work as expected, and since nothing actually went wrong they didn't know. Worse still would be having to tell them that their fascinating effect is just a code error, and it's my fault.
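To make the 'incompatible features' case concrete, here is a minimal Python sketch of both checks. Everything named in it (the option names, the two sets of pairs and the two functions) is invented for the illustration; a real code would have its own option names and probably a richer notion of compatibility than simple pairs.

```python
from itertools import combinations

# Whitelist: pairs of options that have been checked and are known to work
KNOWN_GOOD_PAIRS = {frozenset(("solver_a", "output_b"))}

# Blacklist: pairs of options that are known NOT to work together
KNOWN_BAD_PAIRS = {frozenset(("solver_a", "boundary_c"))}


def rejected_by_whitelist(options):
    """Return every pair of requested options that is not known to be good."""
    pairs = combinations(sorted(options), 2)
    return [pair for pair in pairs if frozenset(pair) not in KNOWN_GOOD_PAIRS]


def rejected_by_blacklist(options):
    """Return every pair of requested options that is known to be bad."""
    pairs = combinations(sorted(options), 2)
    return [pair for pair in pairs if frozenset(pair) in KNOWN_BAD_PAIRS]


requested = ["solver_a", "output_b", "boundary_c"]
print(rejected_by_whitelist(requested))
print(rejected_by_blacklist(requested))
```

The pair ("boundary_c", "output_b") is in neither list, because nobody has looked at it yet. The whitelist refuses it (annoying, but visible) while the blacklist silently lets it run.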

In some cases, only one or other list type is really viable. For instance, virus scanners keep a list of 'tells' for malicious code, because even though they let things slip through until their lists update, they could never describe all of the 'allowed' code. App permissions (on better, more granular systems) are a whitelist - you give an app the permissions you choose, and only those.

So as general rules of thumb:

  • If only one method is viable, obviously use that
  • If one or other list is going to be much much shorter, you have better chances of getting it right, so use that method
  • If it is really important not to let things slip through, use a carefully managed, kept up to date, whitelist. If possible, put it into a file or something, so that updates just require sending out new definitions, not modifying the entire code
  • If it's really important not to get accidental exclusions (false positives) use a, similarly carefully managed, kept up to date etc, blacklist
  • In some cases, combine the two. Programming languages generally have a set of allowed characters (a whitelist) and small blacklists for specific contexts such as the first character of a name.

As well as the literal 'blacklist' and 'whitelist' there is a more general principle here - do I selectively forbid, or selectively allow? Do I stop somebody doing this thing here, here and perhaps here, or do I only permit them to do it there and there. If you find the 'here's' or 'there's' proliferating, re-examine whether you're doing it the right way around. In safety or security critical situations, you almost always must allow only what is permitted. If you find yourself trying to plug up security holes with ever growing blacklists, you should probably change tack and think about what should be allowed instead.


May 15, 2019

Datastructures – Linked lists part 1

Back to data structures this month with the linked list. Linked lists are a way of holding data that allows you to add and remove items quickly and easily.

Why not arrays?

First question: why is adding and removing items from an array not quick and/or easy? The problem with adding items is quite simple - arrays have a fixed size so eventually you will run out of spaces in your array to store items. When this happens you have to do something to allocate additional space. Many languages have a function called "realloc" or similar that tries to extend the length of your array but it can only do that if there is unused memory space "above" the location of your array because the array elements have to be arranged one after the other in memory. The concept of "space above" is a bit complex in general and depends on details of your OS etc. but as a general idea if you allocate two arrays then they are placed one after the other in the computer's underlying memory so if you try to realloc the first array then there won't be any space between it and the second array to grow it in. If you can't grow your array like this then you have to allocate new memory to store the bigger array and copy the existing elements in. If you keep adding items then this continual growing of your array can be quite expensive, although this can be mitigated by always growing your array by more elements than you immediately need.

Removing items has the opposite problem. Since arrays are required to be contiguous (can't have gaps in them) you can't just "remove" an item; you have to either flag it as empty and ignore it when going through your array in future, or take all of the items above the removed element and move them down to pack everything up. The first approach has three problems:

  1. You have to use additional memory to flag items as being empty or not
  2. If you are both adding and removing items from your array then since you don't actually recover memory when you remove an item your total memory requirements will grow without bounds
  3. Depending on your algorithm you might have more difficulty getting optimal performance if you have to do fundamentally different things for empty and non-empty array elements

The second approach avoids those problems but on average involves copying half of the elements in your array every time you remove an item which can also be quite expensive.
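The two removal strategies can be sketched in Python, using a plain list as a stand-in for a fixed array. The function names are made up for this illustration, and the explicit loop in `remove_and_pack` is written out by hand only so that the copies can be counted (Python's own `del items[index]` does the same shifting internally).

```python
def remove_by_flag(valid, index):
    """First approach: mark the slot as empty; nothing moves, nothing is freed."""
    valid[index] = False


def live_items(items, valid):
    """Every later pass over the array has to skip the flagged slots."""
    return [item for item, ok in zip(items, valid) if ok]


def remove_and_pack(items, index):
    """Second approach: shift everything above index down by one slot.

    Returns the number of element copies that were needed.
    """
    copies = 0
    for i in range(index, len(items) - 1):
        items[i] = items[i + 1]
        copies += 1
    items.pop()  # drop the last slot, which is now a duplicate
    return copies


flagged = [10, 20, 30, 40, 50]
valid = [True] * len(flagged)
remove_by_flag(valid, 1)
print(live_items(flagged, valid), len(flagged))  # storage is still 5 slots

packed = [10, 20, 30, 40, 50]
print(remove_and_pack(packed, 1), packed)  # copies needed, then the packed list
```

Removing the first element costs the most copies and removing the last costs none, which is where the "half of the elements on average" figure comes from.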

It is quite possible to build a container based on arrays that you can add and remove items from that has good general performance (C++ std::vector is a good example of one) but they always have to make tradeoffs and if you are doing a lot of adding and removing of arbitrary elements it might be better to use a data structure other than an array.

Linked lists

The idea of a linked list is quite simple. Each element in a linked list is like a link in a chain - linked to the item after them, so you go through the linked list by taking the first item then going to the next item and the next etc. until you reach the end. This is generally implemented using pointers in what are often called "self referential structures", that is structures that contain pointers to themselves. These are easy enough to implement in either C/C++ or Fortran.

struct llitem{
  struct llitem *prev, *next;
};

TYPE :: llitem
  TYPE(llitem), POINTER :: next, prev
END TYPE llitem

These are more or less normal types but there is one more important rule: self referential structures can contain only pointers to their own type, not actual instances of their own type (try removing the *s in C or the POINTER attribute in Fortran and it will fail to compile). This is because types, much like arrays, are laid out contiguously in memory so they can only contain things that the compiler knows the length of and if you have a type that contains an instance of itself then there would be an infinite regression problem because you don't know how big it is until you have finished creating it and you can't create it until you know how big it is. Pointers are all of a fixed size so they work OK.

The structure as given is for what is technically called a doubly linked list because it contains links both to the next item and the previous item in the list. A singly linked list has each item linked only to the next item in the list. Doubly linked lists have some substantial advantages over singly linked lists, notably that you can go through it from either end, but also you can remove an item from the list needing only the item itself (and the list that it is held in if you have several).

Creating linked lists

Creating a linked list is quite easy. You hold a simple pointer to the first element in the list (generally called the head item) and then you simply create the list going down from that. The key thing is that you have to hook up the prev and next links as you go. This isn't too difficult and looks like


#include <stdio.h>
#include <stdlib.h>

struct llitem{
  int value;
  struct llitem *next;
  struct llitem *prev;
};

void init_ll(struct llitem * l)
{
  l->value = -1;
  l->next = NULL;
  l->prev = NULL;
}

int main(int argc, char** argv)
{
  struct llitem *head, *current;
  int i;

  head = malloc(sizeof(struct llitem));
  init_ll(head);
  head->value = 1;
  current = head;
  for (i=0;i<10;++i){
    current->next = malloc(sizeof(struct llitem)); /*Create the next element*/
    init_ll(current->next); /*Initialise it to nullify pointers*/
    current->next->value = current->value + 1; /*Simple counter*/
    current->next->prev = current; /*Its previous pointer should be the current item*/
    current = current->next; /*Now move onwards so the newly created item is now current*/
  }

  current = head;
  while(current){
    printf("%i\n", current->value);
    current = current->next;
  }

}

PROGRAM test

  IMPLICIT NONE
  TYPE :: llitem
    INTEGER :: value = -1
    TYPE(llitem), POINTER :: next => NULL()
    TYPE(llitem), POINTER :: prev => NULL()
  END TYPE llitem

  TYPE(llitem), POINTER :: head, current
  INTEGER :: i

  ALLOCATE(head) !Create the head
  head%value = 1
  current => head
  DO i = 1, 10
    ALLOCATE(current%next) !Create the next element
    current%next%value = current%value + 1
    current%next%prev => current !The next element's previous is the current element
    current => current%next !Now move onwards so the newly created item is now current
  END DO

  current => head
  DO WHILE (ASSOCIATED(current))
    PRINT *,current%value
    current => current%next
  END DO

END PROGRAM test

This example also shows you how to step through the linked list from the head, simply by having a "current" pointer that starts at head and is then advanced by setting current = current->next (or current => current%next in Fortran). This can look a bit odd but it isn't that hard to understand. I start by manually creating the "head" element, using either ALLOCATE or malloc. Once I have a head element I then loop through, each time using the same ALLOCATE or malloc command on the "current->next" pointer, creating a new item every time. In C I then call the init_ll function to set up the values of the struct (in Fortran this is done for me since I gave the elements of my TYPE default values). After this the prev and next pointers are both NULL. This is correct for the next pointer because my new item is the last item in the list (it won't be next iteration but right now it is), but I have to set the prev pointer. If my new item is the next element in the chain from my current element then the previous element in the chain from my new element must be my current element so I set that up. After that I just have to repeat until I have added enough items.
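For comparison, here is a rough Python sketch of the same construction. The `Item` class and the three helper functions are names invented for this illustration; Python object references play the role of the pointers, and creating an `Item` plays the role of malloc or ALLOCATE followed by the initialisation.

```python
class Item:
    """One element of a doubly linked list (hypothetical stand-in for llitem)."""

    def __init__(self, value=-1):
        self.value = value
        self.next = None
        self.prev = None


def build_counter_list(extra_items=10):
    """Create a head holding 1, then append extra_items counting upwards."""
    head = Item(1)
    current = head
    for _ in range(extra_items):
        new_item = Item(current.value + 1)  # Create the next element
        new_item.prev = current  # Its previous item is the current item
        current.next = new_item  # Hook it on after the current item
        current = new_item  # Move onwards so the new item is now current
    return head, current  # current is now the last item (the tail)


def walk_forwards(head):
    """Collect values by following next links from the head."""
    values = []
    current = head
    while current is not None:
        values.append(current.value)
        current = current.next
    return values


def walk_backwards(tail):
    """Collect values by following prev links from the tail."""
    values = []
    current = tail
    while current is not None:
        values.append(current.value)
        current = current.prev
    return values


head, tail = build_counter_list()
print(walk_forwards(head))
print(walk_backwards(tail))
```

Walking backwards from the tail is a quick way to check that the prev links were hooked up correctly, since a forward walk never touches them.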

Part 2 of this will be in a couple of weeks and will describe how you remove an item from a linked list and how to add new items to the middle of a linked list.


May 01, 2019

The Numba "stencil" directive

After a bit of a delay we're getting the blog posts going again with a mention of a slightly odd bit of the Python Numba compiler - the stencil directive. The purpose of Numba is to produce compiled code from Python source that should run at a reasonable fraction of the speed of classical C/Fortran etc. codes. In general it produces code that is about 20-40% as fast as C or Fortran code, so typically only about 1-2% as fast as the computer can theoretically operate. In general, the more that you can tell Numba in advance about how you are going to use your data the more options it has to optimise the code as it compiles it. The "stencil" directive is used to indicate to Numba that you are going to operate on an array by moving a stencil across it and updating each point using data from neighbouring points.

This is a fairly common thing to want to do and crops up in algorithms from image smoothing to numerical solutions to differential equations so this is a useful bit of the library. As a simple example consider the simplest possible image smoothing algorithm. For each pixel in the image P(i,j) replace the value with the average of the surrounding pixels, so

P'(i,j) = 1/4 * (P(i+1,j) + P(i-1,j) + P(i,j+1) + P(i,j-1))

Note that the left hand side of this equation is P' not P, so we calculate the average for each pixel using the original values surrounding it, not the values that have already been through the averaging process. After you have finished you copy P' to P. There is then one last thing that you have to worry about: what do you do when you reach the edge of the array? Since you are using adjacent cells you have to do something or you will read outside your array. Numba's stencil operator at present only has two options: set the outer cells to be zero or some other constant value. In general, this will mean that you want to have an outer strip of cells added around your image otherwise your image will get smaller as it smooths. These outer cells are often called ghost or guard cells and are also common in numerical solution of differential equations for representing boundary conditions. The code for doing all this in Python is quite simple

import numpy as np

def blur(A):
    s = A.shape
    B = np.zeros((s[0],s[1]))
    for i in range(1,s[0]-1): #Range only over the inner cells
        for j in range(1,s[1]-1): #Range only over the inner cells
            B[i, j] = 0.25 * (A[i-1,j] + A[i+1,j] + A[i,j-1] + A[i,j+1])
    return B

This code takes a numpy array and iterates over all but the outer strip of cells in every direction, averages and returns the value. Each call to this function smooths your image a bit (by a radius of about 1 pixel) so in general you'll want to call it several times to smooth your image as much as you want. Running this algorithm on a stock image of 1529 x 2250 pixels on a 3.4GHz processor takes about 3.3 seconds per iteration using the pure Python implementation and 0.006 seconds by using the Numba @njit decorator. For testing we ran 100 iterations of the pure Python code and 1000 iterations of the Numba code. If run to the same number of iterations, the results in both cases are the same.

Circuit board image before smoothing

Circuit board image after smoothing 1000 times
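The guard cell idea and the separate P' array can be seen in a tiny example that uses plain Python lists, so that it runs without NumPy or Numba. This is only a sketch: `add_guard_cells` and `blur_once` are names invented here, and unlike the `blur` function above (which leaves the outer strip of its result at zero) this version carries the guard cell values through unchanged.

```python
def add_guard_cells(image, fill=0.0):
    """Return a copy of image with a one cell wide border set to fill."""
    width = len(image[0])
    padded = [[fill] * (width + 2)]
    for row in image:
        padded.append([fill] + list(row) + [fill])
    padded.append([fill] * (width + 2))
    return padded


def blur_once(p):
    """One smoothing pass: build P' from the original P, never in place."""
    rows, cols = len(p), len(p[0])
    new = [row[:] for row in p]  # Guard cells keep their constant values
    for i in range(1, rows - 1):  # Range only over the inner cells
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (p[i + 1][j] + p[i - 1][j]
                                + p[i][j + 1] + p[i][j - 1])
    return new


# A 3 x 3 image with a single bright pixel in the middle
image = [[0.0, 0.0, 0.0],
         [0.0, 4.0, 0.0],
         [0.0, 0.0, 0.0]]

padded = add_guard_cells(image, fill=0.0)
smoothed = blur_once(padded)
for row in smoothed[1:-1]:
    print(row[1:-1])  # Strip the guard cells off again for display
```

Because every new value is computed from the untouched `p`, the order in which the pixels are visited doesn't matter, which is also what makes this kind of update safe to run in parallel.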

The simple equivalent using stencil is

import numba

@numba.stencil
def blur(A):
    return 0.25 * (A[-1,0] + A[1,0] + A[0,-1] + A[0,1])

You can see how i-1 becomes just -1 and similarly for other parameters in the stencil, and you can also see how this operation now becomes a one-liner so it's much easier to write. Unfortunately the performance is much worse than the @njit version too, taking about 0.22 seconds per iteration, although that is still some 10x faster than the pure Python performance. Fortunately performance can be improved by calling the stencil from a Numba jit-ed function, so for example

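from numba import njit, stencil

@stencil
def inner_blur(A):
    return 0.25 * (A[-1,0] + A[1,0] + A[0,-1] + A[0,1])

@njit(parallel=False)
def blur(A):
    # Calling the stencil from inside a jit-ed function lets Numba compile both together
    return inner_blur(A)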


The result from this method gives performance that is indistinguishable from that of the direct Numba jit version and is still rather shorter. Since these stencils lay out the data dependency you can also set parallel=True in the @njit call. This can be quite successful, but tends to work better for more complex stencils. In this particular case, despite the Python interpreter apparently using 6 cores solidly, the execution speed slows down by a factor of 3.


February 27, 2019

Data structures – Stacks

Another fairly simple data structure this week - the stack. The stack is basically the reverse of the queue. Rather than the first item to arrive being the first to be served, as with a queue, in a stack the last item to arrive is the first to be served. The name comes from the idea of a stack of paper where you add pages at the top and then take them off the top again when you want them. Formally this behaviour is called LIFO (Last In First Out) as opposed to the FIFO (First In First Out) queue.

As you might imagine this is quite a restrictive data structure and the classical stack actually only has two operations

  1. push - add an element to the top of the stack
  2. pop - take the top element off the stack and give it to you

You'll notice that pop and push are inverses since pop removes the top element from the stack and gives it back to you. This isn't always what you want since you might find that you have to simply push the element back onto the stack if you can't deal with it yet. To make this easier a lot of stack implementations also have a snoop operation which tells you what is on the top of the stack without actually taking it off. What you never have in a true stack is a way of accessing an arbitrary element in the stack. If you can do that then you have an array that you are accessing in a "stack-like" way. This is also something that people quite often want to do, so you find that lists in Python have a "pop" method to remove the rightmost element (although the equivalent of "push" is called "append" for a Python list, probably because it is used for a lot more than making stacks).

Implementations of stacks are generally very simple. The simplest implementation is just an array with an index to the first free element in the stack. When an element is pushed it is added at that location and the index is moved on by one element. When a pop happens the element behind the index is given to the user and the index is moved back by one.

Stack Example

This will all work perfectly well until you run out of space in your array when, much as with the queue, you have to do something else. You can reallocate your stack to give yourself more space, copying the existing elements as you go; you can implement a stack using a linked list (which we will cover in a later post), which means that you won't run out of stack until you run out of memory; or you can refuse to accept the new item and return an error. If you do nothing and accept the new item anyway, you access memory that you are not supposed to be using at all, which can cause all kinds of problems from crashes to possible security exploits. This is called a buffer overrun, which you might be familiar with as one of the most common things listed as the cause of security problems in software.
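As a rough sketch of the array-plus-index implementation described above, here is a fixed-size stack in Python that takes the "refuse the new item and return an error" option when it is full. The class name ArrayStack and the choice of exceptions are my own assumptions for illustration; in real Python code you would normally just use a list with append and pop.

```python
class ArrayStack:
    """Fixed-capacity stack: an array plus an index to the first free slot."""

    def __init__(self, capacity):
        self._items = [None] * capacity
        self._top = 0  # index of the first free element

    def push(self, item):
        if self._top == len(self._items):
            # Refuse the item rather than writing past the end of the array
            raise OverflowError("stack is full")
        self._items[self._top] = item
        self._top += 1

    def pop(self):
        if self._top == 0:
            raise IndexError("pop from empty stack")
        self._top -= 1
        item = self._items[self._top]
        self._items[self._top] = None  # clear the slot we just vacated
        return item

    def snoop(self):
        """Look at the top element without removing it."""
        if self._top == 0:
            raise IndexError("snoop on empty stack")
        return self._items[self._top - 1]


stack = ArrayStack(3)
for page in ("first", "second", "third"):
    stack.push(page)
print(stack.snoop())  # third
print(stack.pop())    # third
print(stack.pop())    # second
```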

At this point it is worth drawing a distinction between the stack as a data structure and all of the other things that are called stacks. There are call stacks, mentioned in the last blog post, which describe how a program got to be in the place where it is, and there is an associated CPU feature called a "stack register" that holds this information. There is "stack memory", which together with "heap memory" makes up the most common way of dealing with memory management in compiled codes, and there are plenty of other stacks that turn up all over the place. In general they get the name because they implement a stack data structure (sometimes they were a stack back when the name was given to them but no longer are) and otherwise are not really related. You have to be rather careful because lots of things are called stacks and the name itself doesn't always tell you very much. Always read the documentation!

A question that might come to mind is "what would I ever do with a stack?" They do seem rather less use than a queue and in practice they are mostly fairly "low-level" data structures that most people don't encounter much, but you can easily enough think of cases where they are useful. Their most useful feature is as a "memory" of a sequence of events that can then be undone. Many "undo" systems in software use stacks behind the scenes for exactly this reason. For the same reason they are often used in "depth-first search" algorithms, which were first investigated (and are still used!) as ways of solving mazes. They also crop up in algorithms to parse mathematical and programming languages. A good example is the Shunting Yard Algorithm, named after train shunting yards: you can easily see the similarity between a stack as described here and a train siding, because you can only get carriages off a siding in the reverse of the order you put them in. In general you probably won't need to implement your own stacks very often, but it is worth knowing about them because they are so ubiquitous in the lower layers underpinning all of modern software development.
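To give a flavour of the maze-solving use, here is a small sketch of a depth-first search that uses a Python list as its stack. The maze format (a list of strings with '#' for walls) and the function name can_reach are assumptions I've made up for the example. The stack remembers the places we still have to try, and popping from it is what takes us back to the most recent junction when we hit a dead end.

```python
def can_reach(maze, start, goal):
    """Depth-first search over a grid maze, using a list as the stack."""
    stack = [start]      # positions we still need to explore
    visited = {start}
    while stack:
        row, col = stack.pop()  # most recently added position first
        if (row, col) == goal:
            return True
        for d_row, d_col in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            r, c = row + d_row, col + d_col
            inside = 0 <= r < len(maze) and 0 <= c < len(maze[0])
            if inside and maze[r][c] != "#" and (r, c) not in visited:
                visited.add((r, c))
                stack.append((r, c))
    return False


maze = ["..#",
        ".##",
        "..."]
print(can_reach(maze, (0, 0), (2, 2)))  # True
```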


February 13, 2019

Again and again and again

"Flow control" means all the ways you can control what your program does, both now and next. Conditionals, loops, function calls all count. Exceptions (throw, raise etc) may or may not - whether it's OK to use exceptions as flow control or whether they're meant for, well, exceptional occurrences (not necessarily rare, but something that can't be handled by the current piece of code) is a seriously vexed question.

Loops

Loops are usually the first or second control option to be taught, and take two general forms, the 'for' type loop and the 'while' type loop. Different languages use different words, but the first one is meant to do something a certain number of times, and this number is known when the loop starts. The second is meant to do something 'until told to stop'. This is not a hard distinction though. A while-type loop can always mimic a for-type loop, and the reverse is also true (although sometimes considered to be inelegant and/or error prone).

Usually, these loops look like

For index = start to end
   {loop body}
End for

and

While condition
  {loop body}
End while

The condition can be as complicated as you like, it just has to evaluate to either True or False. It can use all the variables you might be changing in the loop, etc. So we can mimic each kind of loop using the other, something like this:

For index = start to max_iterations
  {loop body}
  If condition exit_loop
End for

and

index = start
While true //This means keep looping forever
  If index > end exit_loop
  {loop body}
  index = index + 1
End while

Note that whether we get (end-start) iterations of a for loop, or (end-start+1) or (end-start-1) can vary by language, but we can easily adjust to match.
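Here is roughly what those two conversions look like in Python, as a sketch rather than a recommendation. The function names are made up for the example. Note that Python's range stops one short of its second argument, which is exactly the sort of per-language adjustment just mentioned.

```python
def squares_with_for(start, end):
    """A 'for' type loop: the number of iterations is known up front."""
    result = []
    for index in range(start, end + 1):  # range excludes its upper limit
        result.append(index * index)
    return result


def squares_with_while(start, end):
    """The same loop written as a 'while' type loop with a manual counter."""
    result = []
    index = start
    while True:  # keep looping forever, until we break
        if index > end:
            break
        result.append(index * index)
        index = index + 1
    return result


def first_power_of_two_above(limit, max_iterations=64):
    """A 'while' type loop written as a 'for' loop with an early exit."""
    value = 1
    for _ in range(max_iterations):
        if value > limit:
            break
        value = value * 2
    return value  # if max_iterations ran out, this may still be <= limit


print(squares_with_for(1, 4))         # [1, 4, 9, 16]
print(squares_with_while(1, 4))       # [1, 4, 9, 16]
print(first_power_of_two_above(100))  # 128
```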

Recursion

The other way of doing something many times, is 'recursion'. This often gets classed as 'super advanced and difficult' for some reason, but is mostly quite simple. First though, we need to know a tiny bit about functions and how they're called.

Scope

Scope is very, very important: every variable, function etc within a program has a scope. For variables this means "parts of your code which can use this variable (bit of memory) with this name". For functions, it means "parts of your code which can use this function with this name". There's a few subtleties beyond that, but for now, this will do.

So, what scopes are there? A variable defined inside a function is only usable within that function: it is scoped to that function, or has 'local scope'. A variable defined globally (outside any functions, including main) has 'global scope' and is available everywhere. Do note that a variable defined in 'main' is only available in 'main' and NOT in any functions 'main' may call, as main is a function like any other.
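A quick Python sketch of local and global scope follows. Python has no required 'main' function, so here the top level of the file plays the part of global scope, and all the names are made up for the example.

```python
counter = 10  # defined outside any function: global scope


def show_local():
    counter = 99                 # a new local variable that hides the global one
    local_note = "only in here"  # local scope: gone once the function returns
    return counter


def read_global():
    return counter  # no local 'counter' here, so the global one is used


print(show_local())   # 99
print(read_global())  # 10
print(counter)        # 10, the global was never changed
```

Trying to use local_note outside show_local gives a NameError, because that name only exists inside the function.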

Most languages also have an idea of 'block scope' where 'blocks' (in C anything inside curly braces {}) can contain variable declarations, which are only available inside the block. This can cause some particularly confusing errors, such as when you try and do the following:

while i < 10 
  int i
end while

which will not compile unless there is already a variable called i. If there is, the one you declare inside the loop then 'shadows' it: inside the loop i refers to one variable, while outside the loop, including on the loop condition line, i refers to something different. If this isn't completely clear, the following example should help:

string name = 'Nobody'
int i
for i = 1 to 3
  string name = get_string_from_user() 
  print i, name
end for
print 'You entered', name

which gets 3 names from the user and outputs something like:

1 Bill Bailey
2 Madonna
3 Engelbert Humperdinck
You entered Nobody

Another tempting thing is to try

if condition then
  int i = 0
else
  long int i = 1
end if
print i

which again has either an undefined 'i' or a shadowing problem. There is no way to get a different type for i using an if like this, and with good reason. i's type could only be determined in general when the program runs - so how much storage should be given for it, and can it be passed to any given function?

Function scope has one more really important thing though - each call to a function is a new scope. The variables you used last time do not keep their values.

FORTRAN PROGRAMMERS - READ THIS!!!

In Fortran there is one really important idea called the 'SAVE' attribute. A variable in a function (or module) can be given this, as e.g. "INTEGER, SAVE :: a" and the value of 'a' will be kept from call to call. This is very useful. BUT there is a catch. Any variable declared and defined in a single line in a Fortran function is given the SAVE attribute. So, if you do something like `INTEGER :: alpha = 0`, declaring an Integer alpha and in the same line defining it, alpha is set to zero ONLY the first time the function is called. Subsequent calls will inherit whatever value alpha had last time. This is rarely what you intended. Be careful!

Call stack

When you write code to call a function, the computer has to stop what it's currently doing, and enter a new scope containing only the variables available inside the function. It also has to remember where it should go back to after the function ends. This is done using the 'call stack'.

We haven't talked about 'stacks' as a data structure yet (coming soon) but we did mention here that they're a "last-in-first-out" structure where the last thing you add to the stack (think of a stack of papers or books) is the first one you take off. Each time you call a function, you add an entry to the stack, and when you return this is 'popped off' and the stack shrinks. Each entry is called a 'frame'.

The stack frame usually contains the location to return to, and also memory for all of the local variables in a function. It often also has space to hold all the parameters passed to the function and sometimes a few other bits of operational stuff. When a function is called, a frame is created with all this in, and when it returns this is destroyed.

We mentioned above that variables inside a function are available only inside it, but we didn't ask what happens if we call a function from within itself. We've seen that between calls to a function the values are 'reset' or lost, and having read the previous paragraph you probably guess that this is both because and why the stack frame gets destroyed.

Now, you might suggest that you could always make sure every call to a given function shares the same variables, but if you've ever used a function pointer you know that you can call a function without ever using its name at the place the call actually happens, so this isn't practical. So, each call to a function, any function, creates a stack frame containing all its local variables, and calling a function from within itself makes two, independent, sets of all the local variables, that know nothing about each other.

My First Recursive function is My First Recursive function is My First....

So what is recursion then? "Recursion occurs when a thing is defined in terms of itself or of its type." (Wikipedia, Recursion) For a function, recursion means having the function call itself. In maths, the factorial of a number is the product of all positive integers up to and including it. So we see immediately that factorial(n) is n*factorial(n-1), which we'd code up something like this:

function factorial(integer n)
  return factorial(n-1)*n
end function

We can pretty immediately see a problem there though - how does the chain ever end? We need something which is not recursive or we'll go on calling forever. This is called the 'base case' and for factorial it's obvious from how we said 'positive integers'. The function above won't stop when n = 1 and it should. So what we actually want is:

function factorial(integer n)
  if n > 1
    return factorial(n-1)*n
  else
    return 1
  end if
end function

Follow this by hand, on paper, for a starting n of say 4. We enter factorial(4), which enters factorial(3), which enters... until we reach factorial(1), which immediately returns '1' to the layer above, factorial(2), which multiplies this by '2' to get '2' and returns this to factorial(3) and so on.
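If you'd rather let the computer do the paper exercise, here is a sketch of the same function in Python with some print statements added. The depth parameter is my own addition, used purely to indent the output so you can see the layers of calls.

```python
def factorial(n, depth=0):
    indent = "  " * depth
    print(f"{indent}enter factorial({n})")
    if n > 1:
        # Recursive step: this layer waits here until the layer below returns
        result = factorial(n - 1, depth + 1) * n
    else:
        # Base case: nothing left to recurse on
        result = 1
    print(f"{indent}factorial({n}) returns {result}")
    return result


print(factorial(4))  # final line printed is 24
```

The 'enter' lines appear in the order 4, 3, 2, 1 and the 'returns' lines in the reverse order, which is the call stack growing and then shrinking again.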

Each Call is Its Own Scope

Remember when looking at recursive functions that each layer of call is a separate scope, with a separate copy of any variables you may define. Anything which needs to go between the layers has to be passed as an argument.

Step by step by step

So we have these two ideas that both let us keep going until we reach some condition, namely recursion and a while loop: what's the difference? In terms of what they can do, there isn't one. Anything you can do with a loop you can do with recursion and vice versa. There are differences in elegance, and often one is a better choice, but no more than that. In fact some functional languages don't have any concept of the loop, relying solely on recursion. When you do turn an elegantly recursive problem into a loop, it is mostly better written as a 'while' type loop and rarely as a 'for' type loop, because the base case is the same as the loop-stop condition. Some recursive problems, usually those involving trees, are very hard to do elegantly with a loop.
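As a small illustration, here is the factorial written both ways in Python. This is just a sketch with names of my own choosing. Notice that the recursive base case test (n > 1) turns up again, unchanged, as the condition of the while loop.

```python
def factorial_recursive(n):
    if n > 1:
        return factorial_recursive(n - 1) * n
    return 1  # base case


def factorial_loop(n):
    result = 1
    while n > 1:  # same test as the recursive version's base case
        result = result * n
        n = n - 1
    return result


print(factorial_recursive(5), factorial_loop(5))  # 120 120
```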

Induction

Most of programming is about working out the sequence of steps to get from A to B, so that what you actually code is just a series of things, one after the other. Sometimes these steps are completely independent, and sometimes they aren't but they always (in a single-threaded program) run one after the other. We're always having to think about things, not in terms of the big picture, but just in terms of getting from here to there, ignoring how to get here in the first place.

In maths, one of the simplest methods of proof is called 'induction', which is closely related to recursion. Rather than try and prove the 'whole of a thing' we say 'if it was true for a smaller thing, can we show it's true for the next larger thing?' and then we say 'can we prove it's true for the smallest thing?'. If we can do both of these, we've shown it's true. As used here a common example is climbing a ladder. We say 'can we climb onto the bottom rung?' and we say 'can we climb onto the next higher rung than we're on?' and if so, we can climb any ladder.

Sometimes, a slightly stronger assumption is used where we instead say 'if we have reached every rung below and including the one we're on, can we reach the next one'. This is actually equivalent, but is sometimes a more useful phrasing.

Proofs by induction only work if it doesn't matter which rung we're on to climb to the next one. We don't ever have to reach rung 53 to know we can reach rung 54. If you have a problem which is easy to think about this way, then it is a prime candidate for programming recursively. The step is always the part going from 'here' to the next 'there', and the base case is how you get to the first 'here'.

Problems with Recursion

Stack Overflow

The 'call stack' we've been talking about is the inspiration for the name of the programming forum Stack Overflow, and a stack overflow is probably the most commonly encountered error when programming recursively. Each function call creates a call stack frame and there is a limit to how much memory is available for this. If you forget or mis-program the base case, your recursive function never stops calling itself, until it has filled the call stack and your program crashes horribly with a stack overflow.

The other common way to get a stack overflow is creating large temporaries inside functions since these are all part of the stack. Hopefully more on that soon.
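You can see a tamer version of this in Python, which keeps its own count of how deep the calls have gone and raises a RecursionError once a limit is passed, rather than letting the real stack overflow. The sketch below deliberately leaves out the base case; the function names are made up for the example.

```python
def countdown_without_base_case(n):
    # Oops: nothing ever stops this from calling itself again
    return countdown_without_base_case(n - 1)


def hits_recursion_limit():
    try:
        countdown_without_base_case(10)
    except RecursionError:
        # Python gave up before the real call stack ran out of memory
        return True
    return False


print(hits_recursion_limit())  # True
```

In a compiled language like C or Fortran there is usually no such safety net and the program simply crashes.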

Function Parameters

Secondly, recursion can be a bit tricky to actually set up. Our factorial function was nice and simple, with a single parameter, and a single return value. But what if we have more than one parameter? For instance, a binary search can be done nicely in recursive fashion. Each step is about deciding which half, upper or lower, our target is in, and passing only this half on to the next step, and the base case is when this has length 1. Here though, you want to pass at least two items - the segment of list, and the target value, and you want to return either true or false, or the index the target was found at. If returning the index as an offset into the passed segment, you then have to adjust this at each step so that you end up with the index in the original, complete list. This can get mucky.
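One way to dodge the index adjustment, sketched below in Python, is to pass the whole list every time along with the limits of the segment still being considered, rather than passing a copy of the segment itself. This is a design choice of mine for the example, not the only way to do it, and as a consequence the base case here is an empty segment rather than one of length 1. The name binary_search and the use of -1 for 'not found' are also just assumptions.

```python
def binary_search(items, target, low=0, high=None):
    """Return the index of target in the sorted list items, or -1 if absent.

    The segment still under consideration is items[low:high].
    """
    if high is None:
        high = len(items)
    if low >= high:
        return -1  # base case: empty segment, so the target isn't there
    middle = (low + high) // 2
    if items[middle] == target:
        return middle  # already an index into the complete list
    if target < items[middle]:
        return binary_search(items, target, low, middle)   # lower half
    return binary_search(items, target, middle + 1, high)  # upper half


words = ["Alfa", "Bravo", "Charlie", "Delta", "Echo", "Foxtrot", "Golf"]
print(binary_search(words, "Bravo"))  # 1
print(binary_search(words, "Hotel"))  # -1
```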

Excess Work

The other common example used for a recursive operation is the Fibonacci sequence where the nth value is the sum of the (n-1)th and the (n-2)th. Usually, the first two values are 1, so the sequence goes 1, 1, 2, 3, 5, 8, 13 etc. It's not hard to write a recursive version of this, e.g.

function fibonacci(n)
  if n eq 1 or n eq 2 
    return 1
  else
    return fibonacci(n-1) + fibonacci(n-2)
  end if
end

but if we work through this on paper for, say, n = 5 we find that we calculate the n=4 case once, the n=3 case twice, the n=2 case three times and the n=1 case twice.
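If you don't fancy doing that on paper, the following sketch counts the calls for you. The extra calls argument is something I've added purely for bookkeeping; it isn't part of the algorithm.

```python
from collections import Counter


def fibonacci(n, calls):
    calls[n] += 1  # record that we were asked for this value of n
    if n == 1 or n == 2:
        return 1
    return fibonacci(n - 1, calls) + fibonacci(n - 2, calls)


calls = Counter()
print(fibonacci(5, calls))  # 5
for n in sorted(calls, reverse=True):
    print(f"n={n} calculated {calls[n]} time(s)")
```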

A Python version of this and my model answers to the challenges are here.

Challenge: what's the rule in general for how many times fib(m) is called for each m < n?

Challenge 2: rewrite this recursively with exactly n-1 calls to calculate fibonacci(n). Hint after the post, or solutions at the Github link above.

If we're not careful, we might never notice all the extra work we're doing, which we could avoid. In this case, there's a big hint that something is funny at the point where we put two values of n into our base case.

Challenge 3:

Some image sharing sites now try to use a few real words to create memorable random urls. If you're given n lists of words, can you write loop based and recursive variants of the code to create every combination of one word from each list? Your code should work for any value of 'n'.

For example [large, small] [radiant, lame] [picnic, bobcat] should give (order not important) large-radiant-picnic, small-lame-bobcat, small-radiant-bobcat etc etc.

Small hint below the post

Moral of this Post

The takeaway from this one is that recursion isn't scary if you just think about getting from here to there, pretending all the business to get 'here' has been dealt with. Never mind the rest of the ladder, just think about the next rung. This is one of the vital skills to develop as a programmer, on every scale. Break things down into manageable steps and then build them into a program.

Keep scrolling for the hints....








Hint 1: can you return both the (n-1) and the (n-2) values?

Hint 2: the list created by combining lists 1 and 2 is itself a list of words. 3 lists is just 2 lists, and then another list.


January 30, 2019

Data structures – Queues

This week we're back on data structures with another fundamental one: the queue. Simple data queues are pretty much the same as queues in real life: new items arrive at the back of the queue and data is removed from the front of the queue. They have obvious applications in any kind of program that gets data in from an external source and then has to process it in the order in which it is received. Queues are what are called FIFO data structures, first-in-first-out, because the first data to arrive is also the first data to leave. The alternative LIFO (last-in-first-out) data structure is generally called a "stack" because it is much more like stacking up paper in that the last item that you put down is the first that you pick up again.

The easiest way to implement a queue is using an array combined with a front and a back marker (I'm going to stick with the neutral "marker" description because the general idea doesn't care if you use array indices or pointers or any other mechanism that you want to). The idea is quite simple. You start with an array and when a data item turns up you insert it at the back marker and then move the back marker on one element. Adding an element to a queue is often called "pushing" or "enqueueing" so you might encounter those terms. Schematically, this looks like

Moving back marker in queue

When a piece of data is requested from the queue you return the item under the front marker and then move the front marker on by one. This operation is also known by the terms "pop" and "dequeue" so those are also worth remembering.

Queue moving front marker

And you continue doing this, moving the front and back markers as data arrives and leaves.

So far, so good, but there are two obvious problems. The simpler one is what happens if data is requested when no data is in the queue. You have to have some mechanism for dealing with this case by indicating that there is no data available, but that isn't too hard. The more troublesome one is what happens when the back marker goes off the end of your array. In the case as shown, you can't really do anything at all and one way or another the queue will fail. This type of queue implementation is only useful if you have a fixed, known number of items that you need to store as they come in and then deal with. It's not much use for getting data until your program stops and dealing with it.

If you want to do that, there are a few general solutions

1) You can get rid of the front marker entirely. Every time you dequeue an item you take the first item in the array and then simply shuffle the other items down by one, so that the first element is always populated so long as there are any items in the queue. This works well but it does potentially involve a lot of copying or moving of data if you have a large array that is mostly filled, since every remaining element of the array has to be shuffled down by one. If you are dealing with objects where a move/copy operation is expensive, or with threaded code where some threads have to wait while this happens, then it isn't the best solution.

2) You can move to a circular queue. Circular queues are a bit strange and break the analogy with real world queues since queueing in a circle doesn't really work in reality. The idea is pretty simple though. Since you know that everything to the left of the front marker is unused, why not put new elements into there as they arrive and move the back marker around with them? Then when the front marker reaches the end of the array it just goes back to the beginning too. The effect is that the front marker chases the back marker round in a circle. Shown schematically, this looks like

Circular queue

This works so long as back never catches up with front. If it does then either it has to stop adding data or it will overwrite data that is already there. It is possible to generalise these two examples to an arbitrarily long array - simply create a new array that is longer and copy the extant data into it and start again - but in most of the situations where you want a queue it should be possible to avoid this since that is another expensive operation (that will also render any pointers that you have to your data invalid so might involve more work to sort out pointers).
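Here is a minimal sketch of a circular queue in Python, using array indices as the markers and taking the "stop adding data" option when the queue is full. The class name and the exceptions raised are my own choices for the example. I've also kept an explicit count of items, which is one simple way of telling a completely full queue from a completely empty one (in both cases the two markers sit on the same element).

```python
class CircularQueue:
    """Fixed-size queue whose front and back markers wrap around the array."""

    def __init__(self, capacity):
        self._items = [None] * capacity
        self._front = 0  # next item to be dequeued
        self._back = 0   # next free slot
        self._count = 0  # number of items currently stored

    def enqueue(self, item):
        if self._count == len(self._items):
            # back has caught up with front: refuse rather than overwrite
            raise OverflowError("queue is full")
        self._items[self._back] = item
        self._back = (self._back + 1) % len(self._items)  # wrap around
        self._count += 1

    def dequeue(self):
        if self._count == 0:
            raise IndexError("dequeue from empty queue")
        item = self._items[self._front]
        self._items[self._front] = None
        self._front = (self._front + 1) % len(self._items)  # wrap around
        self._count -= 1
        return item


queue = CircularQueue(3)
for letter in "abc":
    queue.enqueue(letter)
print(queue.dequeue())  # a
queue.enqueue("d")      # reuses the slot that 'a' has just left
print(queue.dequeue(), queue.dequeue(), queue.dequeue())  # b c d
```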

Generally queues are used in producer/consumer systems. There is something that is producing data and something else that is operating on it in some way. On average you have to be operating on the data at the same rate as you are producing it or the data will build up without limit, so the queue is only there to buffer data for brief periods while either your producer produces data faster than normal or your consumer consumes it slower than normal. In this case, all that you need is a large enough queue to deal with the largest expected difference in production and consumption rates. Of course, in a lot of systems you can only predict that "this queue is large enough for 99% of the expected variation" so for the other 1% of the time where you run out of space in your queue you have to decide whether it's worse to stall your entire system while you make your queue bigger (this might be OK for a web server for example where a page might take slightly longer to load but otherwise would work as expected) or simply throw away data (for example in a data logging system where any halt in the process would cause data loss but it might be lower if you just throw it away until there's space to store it than waiting while memory is allocated to hold it). Unfortunately, which is the right answer depends very much on the problem that you are working with.

As a final note, there is also a very similar data structure called the double ended queue or deque (pronounced "deck") that is similar but allows you to add and remove data from either end of a queue. They behave very similarly in general but the implementation gets a bit fiddlier because your back and front markers have to do double duty as each other.
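If you're working in Python you don't need to write your own: the standard library's collections module has a deque class. A very short sketch of using it both ways round:

```python
from collections import deque

line = deque()
line.append("b")      # add at the back
line.append("c")
line.appendleft("a")  # add at the front
print(list(line))      # ['a', 'b', 'c']
print(line.popleft())  # a  (remove from the front, queue-style)
print(line.pop())      # c  (remove from the back, stack-style)
```

Using only append and popleft gives you an ordinary FIFO queue, and using only append and pop gives you a stack.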


October 17, 2018

Searching in data

One of the most common things that you'll want to do in programming is to look in a list of items to find out if a given item is in there. The obvious way of doing it is by going through every item in the list and comparing it to see if the items are the same. On average, searching for a random item in a random list of items you'll have to look through half of them before you find the item you want (obviously sometimes you'll find your item quicker than this, sometimes you'll have to look through all of them before you find the one that you want, but on average over a very large number of searches you'll compare half of the list on each search). This algorithm is generally called linear search, both because you go through your items in a line, one after the other, and because the time that it takes is linear in the number of elements. By this I mean that if you double the number of elements in your list it will take twice as long (on average) to find a given element. In the normal notation of algorithms this is called O(n) "order n" scaling (this is one use of so called "Big O notation"). If doubling the number of elements would take four times as long (quadratic scaling) you'd say that your algorithm would scale as O(n^2) "order n squared".
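A linear search is only a few lines in Python. In the sketch below I've also made it return the number of comparisons it did, purely so that you can check the "half the list on average" claim for yourself; the function names are invented for the example.

```python
def linear_search(items, target):
    """Return (index of target or -1, number of comparisons made)."""
    comparisons = 0
    for index, item in enumerate(items):
        comparisons += 1
        if item == target:
            return index, comparisons
    return -1, comparisons


def average_comparisons(items):
    """Average comparison count when searching for each item in turn."""
    total = sum(linear_search(items, item)[1] for item in items)
    return total / len(items)


print(average_comparisons(list(range(256))))  # 128.5
print(average_comparisons(list(range(512))))  # 256.5
```

Doubling the length of the list doubles the average amount of work, which is the O(n) scaling in action.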

In general, there is no faster way of finding a random element in an unordered list than using linear search. On the other hand, if you have an ordered list then you can speed things up by using bisection. The idea is quite simple. Imagine that you have the first seven entries in the International Radiotelephony Spelling Alphabet (IRSA, don't worry, it's probably more familiar than its name!) in order

Alfa
Bravo
Charlie
Delta
Echo
Foxtrot
Golf

and you want to find the item "Bravo". Obviously, by eye this is easy but you want an algorithm that a computer can follow. You know that you have 7 items, so you compare the item that you want, "Bravo", with the one at the half way point, "Delta". You know that "Bravo" is before "Delta" in the IRSA and because you know that your list is ordered you can then throw away everything from and after "Delta". You now have

Alfa
Bravo
Charlie

that is, three items. Once again, check the middle one, which this time is "Bravo" so yay! You've found your target in your list, and it only took two tests despite there being seven items. This is because every time you make a comparison you can throw away half of your list, so doubling the length of your list doesn't double the number of operations that you need, it actually only (on average over all possible lists) adds a single additional operation. Technically this algorithm is O(log(n)) "order log n", which makes it very, very useful if you have a large number of items. Imagine that you have 256 items: you'd have to compare 128 of them (on average) in linear search compared to 8 (on average) in a bisection search. If you go to 4096 items then it gets even more extreme, with 2048 comparisons for linear search compared to 12 for bisection. Similarly, if your target item was "Foxtrot", which comes after "Delta", then you could throw away everything before and including "Delta". By alternately throwing away either the top or the bottom of your list you always get to the item that you want (if it is in the list at all).
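The walk-through above translates into Python fairly directly. This is a sketch using a loop and two index markers rather than actually throwing parts of the list away, and as before it returns the number of tests alongside the answer so that you can compare it with the description. The function name is my own.

```python
def bisection_search(items, target):
    """Return (index of target or -1, number of tests) for a sorted list."""
    low, high = 0, len(items)  # the target, if present, is in items[low:high]
    tests = 0
    while low < high:
        middle = (low + high) // 2
        tests += 1
        if items[middle] == target:
            return middle, tests
        if target < items[middle]:
            high = middle      # throw away the middle and everything after it
        else:
            low = middle + 1   # throw away the middle and everything before it
    return -1, tests


words = ["Alfa", "Bravo", "Charlie", "Delta", "Echo", "Foxtrot", "Golf"]
print(bisection_search(words, "Bravo"))    # (1, 2)
print(bisection_search(words, "Foxtrot"))  # (5, 2)
```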

As always though, there are problems. First you have to be able to access any element of your list freely. I quite happily said 'compare the item that you want "Bravo" with the one at the half way point "Delta"' without asking how you go from "half way point" to "Delta". While we haven't covered them yet there are some forms of data structure (notably linked lists which we'll cover soon) where you can't easily do this kind of hopping around inside your list and bisection is usually much less efficient or impractical in these kinds of data structure.

Second, this really does only work on ordered lists, as you can clearly see by the way in which I just threw away half of my list based on the property of the middle item. You might be tempted to say "ah! I can just sort my list" and indeed you can, but even the best algorithm for sorting a list is O(n log(n)) "order n log n", which in general means that it is just a bit slower than a linear search through your data. What this means is that it's only worth sorting your data and then using bisection to search in it if you're searching in your data much more than you are adding to it, so you can sort your list once and then do many searches on it before it becomes unsorted again when you put new data in. This general idea (although not always the details) is the idea behind indexes in databases. You create ordered lists of data that you want to search on a lot so that you can find the data that you want quickly.
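In Python the "sort once, search many times" pattern can be sketched with the standard library's bisect module, which does the bisection for you on an already sorted list. The helper name contains is my own; bisect_left and insort are the real library functions.

```python
import bisect


def contains(sorted_items, target):
    """Bisection lookup in a list that is already sorted."""
    position = bisect.bisect_left(sorted_items, target)
    return position < len(sorted_items) and sorted_items[position] == target


# Pay the O(n log(n)) sorting cost once...
words = sorted(["Golf", "Alfa", "Echo", "Charlie", "Bravo", "Foxtrot", "Delta"])

# ...then do as many cheap searches as you like
print(contains(words, "Bravo"))  # True
print(contains(words, "Zulu"))   # False

# insort adds a new item in the right place, so the list stays sorted
bisect.insort(words, "Hotel")
print(words[-1])  # Hotel
```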

Bisection is a very general approach that you find in all sorts of problems, not just finding items in lists. The general idea is the same: throw away half of your items because you know that the item that you want cannot be in that half.


October 02, 2018

Upcoming Training Opportunities

Warwick RSE's autumn term training is now available for signup for any University of Warwick staff or students, and anybody from the HPC-Midlands-Plus consortium.

This time we have two options.

The first is aimed mainly at Warwick Researchers who wish to use HPC facilities. We'll go through getting access and some essential info you'll want to know, as well as briefly mention where else you can get computing resources.

Secondly, we have a short 3-hour seminar going over all the bits of Software Development researchers should know about. This will be a pretty rapid spin through a lot of tools and words you'll need to know. Hopefully, you'll then spot when you should go and learn more about these things as they come up in your research etc.

For dates, signup etc, see our calendar

