All 30 entries tagged Programming

View all 127 entries tagged Programming on Warwick Blogs | View entries tagged Programming at Technorati | There are no images tagged Programming on this blog

January 22, 2020

Magic numbers

We've mentioned magic number before but they are one of those things that are always worth mentioning again. Magic numbers are literal numbers that you write into your code that impact the clarity of the code. The alternative to magic numbers is to store the values into variables or preprocessor directives or other ways of associating a number with a name. Exactly what impacts the clarity of your code is often a bit subjective and is certainly research field specific but there are a few rules of thumb

  1. Is the number only an approximation? You don't want to risk using different approximations in different places in your code. Imagine what would happen if you defined Pi to have different values when using it in different contexts?
  2. Is the number immediately recognizable? As a physicist I'd recognize "0.5*m*v**2" when I see it in code as being the kinetic energy of an object despite the 0.5 but I'd have more trouble with "4.0e-7 * pi". It's probably the pre-2018 definition of the vacuum permeability but it isn't entirely clear.
  3. Is the number arbitrary? It's quite common to use a number to specify things like which problem to run or which optional package to use. You might remember what problem 3 is right now but you'll have to look if you ever come back to this code. Even worse, if you lose discipline then you might change what the numbers do if one of the test cases becomes redundant. Replacing your arbitrary numbers with named constants severely reduces these problems. If you give your problem number a sensible name then you'll have a much better chance of remembering what it does, and similarly if you ever remove a problem then if you remember to remove the named constant as well then you won't be able to compile the code when trying to run the old test. A lot of languages provide "enumerated types" or "enums" that allow you to automatically map numbers to names and can also prevent a user from chosing to supply a simple integer instead.

It can feel like this is unnecessary and slows you down when you are just writing a quick code, but one of the major problems with academic software is that it tends to grow. You write a code to solve a problem that you encounter during your research and you aren't terribly careful because you aren't going to keep it. But it is quite likely that you won't put it to one side and never touch it again. You might encounter a similar problem later and modify the code. At that point elements like magic numbers are annoying but you can usually work out what your own code is doing. The major problem comes if you move on and your code is inherited by someone else. In this case they might work everything out perfectly (which is good), they might find that they can't understand it and have to start again (which is annoying but not too troublesome) but worst of all is that they might misunderstand what your code is doing and make changes that are incorrect.


January 09, 2020

Finding The Solution

New Year, New blog post. Just a short one this time, following on my my post on FizzBuzza few months ago. Even a problem as simple as that can be solved in myriad ways, and as I program more and in more languages, I find myself less often wondering how I can solve a problem, and more often how I should.

Most of the languages I work with let me solve problems using the basic command structures, and, as I wrote about last time in The X-Y problemit can be hard, but is vitally important, not to get confused by your partial solution and miss a better one.

Recently I've been learning Perl to do some complex text-processing, and find it to be a drastically different way of thinking to my C/Fortran background. It's tricky to think in terms of text matches and substitutions when I am so used to thinking of the position of each character in a string and working in terms of "index-of-character-X plus 1" (similar to working in terms of for-each loops when one has only used for). For the processing being done, the proper Perl solution is much shorter, easier to understand, etc, although it takes me a bit longer to produce initially.

A recent Stack Exchange post I saw had somebody asking why his boss didn't appreciate his brilliant coding techniques, because design patternswere second nature to him, and his boss wanted to use far simpler solutions. He probably came back to Earth with a bit of a bump when it was firmly pointed out that "patterns being second nature" was actually a bad thing, because it rather sounded like he trotted out the first "pattern" he could think of, instead of actually thinking about the problem he was solving. Nothing wrong with the patterns themselves, but critical thinking is required to decide whether they are suitable, optimal etc.

The other common mistake people make is demonstrated in Terry Pratchett's description of "Death's Swing" (e.g. https://en.wikipedia.org/wiki/Death_(Discworld)#Home), which mirrors the Sunk Costs fallacy. Trying to build a swing for his Grandaughter, the character of Death plows forward inspite of all problems. He hangs the swing from the two strongest branches. These being on opposite sides of the tree, he cuts away the trunk, shores it up and ... This can easily happen when programming and the trick is never to be afraid to throw away (or file away for future use) a solution, even a good one which took a long time, if it stops fitting the problem. In Death's case, it is less because he is unwilling to throw away the work already done, and more an issue of very linear thinking, but the effect is the same.

Hopefully this is already obvious, and you always think before you code, happily refactor or rework your own code, and have an ever-growing solution bank to call on. I suspect very few people are willing or able to throw away all the false starts they probably should though. Just keep in mind that there is a crucial difference between "what solves the immediate problem" and "the code I should probably actually write", and strive for the latter.


December 05, 2019

The "XY Problem" or how to ask so somebody can answer

There's a lot of things that I feel ought to have pithy names or words, and this is one of them. The classic "XY problem" goes: Person wants to solve problem X. They don't know how. They think about it, and conclude that they will need to do "Y" as part of the solution. They don't know how to do this either. So they ask for help, but omit, or forget, to mention the larger problem. This can lead to bad or not-applicable answers, and an inability of the (presumably) skilled responders to explain how to do X. Since the original asker doesn't know how to do X, there's a good chance "Y" was a poor approach, or irrelevant, or harder than it needs to be etc.

I managed to commit a classic XY swap today while knocking up a very simple bash script. I wanted to take a filename, and strip its extension off. I knew this was going to be ".f90" in context, so I searched for how to strip the last 4 characters of a string in bash. I found this questionand used the substring solution given, before reading a bit more of the page and realising that I (and the original asker) were asking the wrong question. We both actually wanted this solutionto remove the dot and the extension. Funnily enough, this is also the archetype XY problem, as described at e.g. hereor here.

In this case, it isn't a big problem. Both solutions work, and for my actual use I don't need to gracefully handle a string without the ".f90" extension at all. But in more "interesting" (i.e potentially serious and causing of harm) cases on some of the forums I read (e.g. DIY, powerlifting, cybersecurity) the asked question hides a serious misunderstanding and goes off down an unhelpful path. For example, thankfully this guyasked his question openly (why are is breakers tripping after he removed a cooker fan and tied the live to the neutral wire) even if it was rather too late! Somewhere in the hundreds of questions about tripping breakers on that forum, there is probably somebody who's done something just as bad, but hidden it.

Another amusing type is the one where the question as asked can be answered effectively, but the asker would be surprised by the extra information. For instance (summarised from a real exchange):

A: I need a workout program that takes 3 hours every day!
B: Here are some! They're awesome!
C: Hold on, why do you want this?
A: Because I'm bored and I need to occupy 3 hours.
C: Jeepers! Find a hobby buddy! Forcibly spending a fixed amount of time is not going to lead to productive training...


But it Works, Doesn't it?

Even when the question isn't the right one, it is possible to get a solution which "works". But assuming you're trying to improve your programming skills, there's a lot wrong with "good enough" and the "wrong" solutions often have a whole host of problems:

  • They're too specific. For example, the substring solution for my file-extension problem works, but it's less general than it should be. It only works for 3 character extensions (plus one for the '.') and gets the wrong answer if there is no extension, wheras the actual solution works in both of those cases.
  • They're misleading when read back. Again, with the file-extension problem it's not clear what I'm doing by taking 4 characters off - the purpose is much clearer if I split on the dot.
  • They're not actually any quicker/simpler than the "proper" solution. The file-extension thing is a good illustration again - bash substrings are kinda inelegant, while pattern substitutions aren't. The "proper" solution is shorter and much clearer in general.
  • You still don't know how to solve the "actual problem" where you could have learned a much broader-applicable thing and improved your problem solving ability. In a lot of cases, you end up writing "Fortran in every language" rather than actually learning the approch of the one you're in. Again, the file-extension example is a good one. Doing it with a string slice is something familiar to me from other languages, but not a good move in bash, and it makes the problem more fiddly (I need to find the string length because of how substringing works).
  • Last and perhaps most egregiously, it hides from YOU how well you understand what you're doing, and how hard it actually is. My hacky solutions can end up being far more complex than they need to be, or they can end up making a genuinely hard problem look simple, because they don't actually solve it. See, for example, the difference between a genuine AI chatbot and the ELIZA model.

So what's the solution?

Mostly, when asking a question, try to ask the actual real one. This is easy when asking a question (in person or in text), but gets a bit tricky when using a search engine, since you're having to break things down into just a few key words. But the general idea is to look for the overall problem, as well as the partial solution you've thought up, to think very hard about what that actual problem is, taking a step back if required, and to not get too attached to your approach.

Always bear in mind that a complete change of tactic might be required to actually achieve your goal, and that you might not have quite nailed down what that goal is yet. And keep in mind the quote from Henry Ford:

Failure is simply the opportunity to begin again, this time more intelligently.

or as I've heard it paraphrased but can't find a source for:

Finding out you're wrong is great, because you get the chance to become more right.


November 20, 2019

So you've Nobbled your Git

Git protects you from some sorts of errors, by letting you create a history which you can then roll back through. On the other hand, it gives you some very powerful commands with which an incautious person can wreak havok. In fact, everybody who uses git has probably done something catastrophic at least once. It happens. So what can you do about it?

Luckily, Git gives you some powerful tools for fixing things that it lets you do, and there are some sneaky tricks for things it doesn't deliberately help with, but does by coincidence. For some things, you just have to hope your backups have been keeping up. But how do you tell which is which, and what do you do?

Things Wot I have done to my Repos

  • Checked out a file and lost local changes (git checkout <filename>)
  • Done `git pull` and got a merge commit I didn't want
  • Done `git reset HEAD~<N>` to go back before the merge commit, and gone too far
  • Run `git reset --hard` instead of `git reset`
  • Checked out a feature branch under the name of master and then pushed it

If you're thinking I might just be a bit of an idiot, well maybe, but not because of these things. They happen. Even the last one, although that was a bit of a perfect storm of errors combining into horror. They happen to everybody, even the experienced. Sites like DangitGit exist for a reason! The trick is to do them less and fix them better.

Why these things are Good but also Bad

In order:

Checked out a file and lost local changes.

`Git checkout` for a single file means "restore this file to the state in the index" which, in simpler terms means "put this file back to what git thinks it is" (in most circumstances). For a file with staged changes, these are kept, i.e. you get the state from the last commit AND any staged changes.

This is handy when you're making prospective changes and want a way to undo parts that didn't work. You stage the bits that did work, and you checkout, and you try again.

I used it a lot recently because I was writing a script to find-replace things in code. I wanted changes to the script to be kept, but it had gone wrong, so I wanted all the code to be put back to how it had been. `git add <script>` and `git checkout ./`and I could easily undo the horrors my script had wrought.

Why you should BEWARE: you are asking git to do something with things it doesn't know about (your unstaged, uncomitted changes). Git happily does this and it doesn't care. Unstaged changes are none of its business and it throws them away. Git can't help you get them back. It never had them.

Done `git pull` and got a merge commit I didn't want

You try to `git push` and are told your branch is behind the remote. The instructions say to do git pull` first. Git pull is nice and simple, and mostly you don't have to do much more. You pull, you push, job done. This is nice when you really did make distinct changes and are happy with a merge commit.

Why you should BEWARE: merge commits are ugly. Sometimes they are a tolerable evil, but they make the history more complicated and can be tricky when you're trying to go back in time. Don't just blindly `git pull`. Sometimes you need to take the extra time and rebase your local changes onto the remote.

Luckily, this is all doing "gitty" things, so git will help you! You can simply go back (git reset) to before the merge, and fix things properly.

Done `git reset HEAD~<n>` to go back before the merge commit, and gone too far

Suppose you accidentally did that blind git pull, and realised you now have an ugly merge commit you don't want. Ah, you think, I can just reset back to before I did that. This is really handy - I can go back to any of the old states and I can branch from it, or I can go back "undoing the commits" but keeping the changes, so I can refactor what went into what commit etc.

Why you should BEWARE: Counting is a real downfall of mine, (how many days ago was Sunday again?), and I got N wrong and went back too far. Now I've "lost my commits". I have the changes, but I don't know which of them went into what commit and I don't want to have to recreate that! If I was even dumbed and did a hard reset, I don't even have the changes anymore, and I really don't want to redo all of those. But, the changes had been committed, so maybe there's a fix?

Again, this was a "gitty" thing, and I can fix it. I didn't delete the commit entirely (yet), I just took it out of local history. At worst, it should still exist on my remote (hopefully I pushed it before I messed up, but in this case, I had not). See below for the solution.

Run `git reset --hard` instead of `git reset`

I've made some changes, I've staged some things, and I realise actually it was all bunk. Maybe it was a failed experiment, and became clear it wasn't going to work only partway. Maybe I was just messing about. Regardless, there are circumstances where I want to go back to a clean slate, and put everything back into the state of the last commit. This is the task of `git reset --hard`. It means "wipe it all clean. Put me back as though I had just cloned this/hadn't done anything since my last commit.

Why you should BEWARE: Well... read again carefully what this does. All your changes, staged or not, trash them. Reset it all!. Now compare to what `git reset` (without --hard) does. That's a bit different, isn't it. Hard resets are very brutal. They have their place, but must be handled with care. Recently I did a hard reset I didn't mean to. I had a few hours of work, staged but not committed, and it had vanished. Bugger.

So can I fix it? Well... kinda. While I did a "gitty" thing, git obeyed me, and threw my work away. I asked it to. However, all is not lost. I can't just ask git to undo my mistake, but some of it might remain, if I can work out how to find it. See below!

Checked out a feature branch under the name of master and then pushed it

Being able to have a local branch with one name map to a remote branch of another name is handy. For instance, say my remote repo like to separate hotfix from feature branches, or use a developers name as part of the branch. I don't need to remember this when I make local branches, and I don't have to fiddle around renaming branches locally. I just push like `git push origin local_name:remote_name` This is handy.

I can also add more than one remote ("origin" is not special, it's just a default name) and push, pull etc from the one I want.

Why you should BEWARE: Mostly these features are handy, but I managed to make a real mess for myself. I added a new remote and tried to checkout a specific branch. I accidentally pulled a branch, and didn't notice that locally it was now called master. Later, I pushed the branch. I forgot to specify that I wanted to push to the remote, and I forgot to type the branch name. Had I remembered the latter, I would at least have seen an error that that name didn't seem to exist. Had I remembered the former, at least I'd only have nobbled the copy on my personal remote. The remote configuration let me push because I had high priviledge. Oops!

Can I fix this? Luckily, yes. I didn't FORCE push, so all I have to do is take a deep breath, checkout back to what should be the tip of master, and forcibly push that. I should be very, very cautious here though. If I get this wrong, I can do a lot of damage. Force pushing is not something to take lightly. If this is not your project, buy the maintainer a stiff drink (or a cookie) and ask them politely to fix it for you.

HELP, What do I do NOW?

So, assume you've done something similar to one of the above. You've "lost" some work that was staged, or committed so that git knew about it. Can it help me? First, some background.

Every Repository is Equal

Introductions to Git often talk about its character as a "distributed" system. Rather than the older style where there was one "canonical" repo, and people could locally have a subordinate copy, in git "all repositories are equal". This is true for most stuff. Commits, history, objects etc are stored in every copy of the repo, and none of them are special.

However, it is not true of absolutely everything. If you have worked with "git tags" you may recall these aren't pushed by default with the rest of the content. You have to push them specially. Also, (obviously), none of the untracked files in your repo are part of anybody else's repo. You can also have git configured on your machine to e.g. use a global .gitignore file.

Local Specialities

There are a few other things which exist in your local git repo, but aren't part of your remotes, or anybody else's copy. This is one of the ways in which git is not a backup - pushing to a remote is about sharing stuff, not preserving it. A git remote is a copy of some of your stuff, but not uncomitted changes, or the local parts of your repo.

For us, here, the things we're interested in are the git "reflog" which is sort of a local history of some of the "gitty" things you've done, and the git "object directory". Git turns everything into these "objects" which is why you see messages like "Counting objects"" when you push or pull. Things that you undo, or never confirm (like staged-but-never-comitted changes) might still exist in the objects database, but in an "orphaned" state - not part of the repo, but not yet gone forever.

Taking out the Trash

Git is strictly a garbage collected system. "Garbage collected" languages are a class where, when something is freed or deleted, instead of it being immediately wiped, it is merely flagged somehow as "done with". At some point, often on schedule, or when the program isn't doing much else, the garbage collector comes and cleans things up, returning the memory to the program.

If you've watched things like "CSI" or other tech-jargon TV, you have probably seen somebody restoring apparently deleted files on a computer harddrive - this is a bit like garbage collection. When you delete a file from disk in the normal fashion, it would take time to overwrite every byte with a 0-byte, so instead the space is merely flagged as empty. For some time the file may be effectively still there, just "lost" and can be restored. Proper drive wiping involves overwriting (more than once, because magnets are complicated) with 0s. The recycle bin won't cut it.

The reflog and the object database for git are garbage collected. This means that objects created because git thought you were going to do something, even if you didn't, might still be there days, years, commits etc later.

Don't Push your Luck

In most cases, any attempts to use these features to save you are a last ditch, and the better approach is not to make the mistake in the first place. Since the object database is garbage collected, files hiding in it are living on borrowed time, and can be irreparably lost at any time.

So, the Solutions

I am only going to talk about the solutions to two of the messes in the list above, namely the accidental hard reset to an older commit, and the accidental hard reset of staged changes. The things where you might need to do something not entirely "gitty" to fix them. You might notice both things involve resets...

The first "solution" for FUTURE occurences of these problems, is never to run a reset without first stashing everything, or backing it up, or similar, and to be certain the reset is the correct command first. But if you've already messed up, you want to fix it now. There is hope!

Fixing an "undone" commit

The first error, resetting back to HEAD~<N> and going to far, isn't that bad. For a start, if you've been pushing regularly, the work in the "lost" commits isn't gone, as you could always clone afresh. Luckily though, even if you never pushed, you can fix this. Git knew about those changes, and those commits. This can be reconstructed. Note that if you did a hard reset, the changes are apparently "gone" wheras a soft one leaves them there, but your commit details are gone (such as any picking of lines, or separation into multiple commits).

But the gist of the solution is as follows. Whenever I commit, checkout, reset, rebase or pull, the Git reflog stores a "commit" going between the states. These commits are "real" but they aren't part of my core repo. You can see a "history" using the command "git reflog". You should see things like "commit: <commit msg>" and possibly "checkout: moving from <branch> to <branch>".

These states don't persist forever but they are real long enough for this! I can simply backup my work (just in case), take a deep breath, and broach the "reset" command again, but more carefully this time. I carefullyidentify the point I made the error, and I ask git politely but firmly to undo it. In this case, I am looking for the line which says " <id> HEAD@{N}: reset: moving to <blah>" where "blah" might be "HEAD~2" or might be a commit-id or similar. I want to go back to just before this, so I pick the previous line and make a note of the part "HEAD@{N}".

If I have any unstaged changes, or I have anything else I am not sure about I double check that I backed it up! We are about to run another reset. Getting this wrong can make things worse. Backup. Now.

Now, again carefully, I run `git reset --hard HEAD@{N}` where N is the thing I just worked out. Hopefully I am back where I wanted to be! Since the commits I had orphaned are no longer dangling, as they're now part of my branch again, they wont be cleaned up, and I have got them back. I breathe a sigh of relief and vow never to hard-reset again.

Fixing nobbled staged changes

The previous error wasn't actually as bad as all that. All I had done was remove some entire commits from my branch history. It seems logical that git remembers them for a bit and can put them back. But in the case of staged changes, they were never part of a commit at all, so they're not in the reflog in any form.

Luckily, when we stage the changes git "gets things ready" and creates the object detailling the changes in the object database. But there is no longer any reference to them in the repository itself. They are just orphaned objects. If we can convince git to spit them out, at least we have our changes (or files) back, even if we have to do a bit of work to restore things completely.

The following solution comes from this link. I was able to save my situation using the simple method, because I didn't have many changed files, but the link also details a longer solution for really bad mess-ups. Basically, we have to work out which are the orphaned objects (not part of any commit, branch etc), which of these are files (commits etc can also be in here), and then get git to tell us the content. We can then either hand-pick what we want back, or write a script to spit it all into files and use our shell-scripting prowess to restore things.

The direct is using `git fsck --cache --unreachable $(git for-each-ref --format="%(objectname)")` which asks git to spit out any unreachable object ids. We can then show the content with `git show <id>`. If there's not much stuff, that should work.

If we get loads, AND the reset is the last thing we staged or committed, we can list objects in order using `find .git/objects/ -type f -printf '%TY-%Tm-%Td %TT %p\n' | sort`.This gives us ALL the objects, but we can cross reference the lists to find our lost stuff, or just go by hand. We can again use `git show` to view the objects, although in this case we have to strip out the extra '/' characters from the ids.

Luckily for me, I only needed to restore one or two files, so I used the find command and did `git show` manually, but I was glad not to have to redo all my work!

HELP, this Keeps Happening to Me!

If you keep ending up with these sorts of problems, there are a few handy tips.

  1. DO NOT PANIC. It is never helpful.
  2. STOP. Once you've messed things up, you're already thinking "If only I hadn't done that!" You might notice that a lot of the "solutions" to git problems can make things much worse. Once you're in a mess, stop. Backup what's left of your work. Make a new repository somewhere else to test fixes. Don't just plow on!
  3. Work out what you actually did wrong. The internet abounds with solutions to git problems, but unless you can detail exactly what went wrong, you wont find the right one.
  4. Backup and stash carefully in future. I said that git only "knows" about staged or committed files - it also knows about those you tell it about using the "stash" command. Carefully stashing changes before doing "risky" things can save you.
  5. Slow down, think hard, and never commit on a Friday afternoon. Git is a powerful tool, and should not be operated while tired, distracted, or under the influence. Don't try and do complicated gitty things late or short of time.

Hopefully, this helps with the worst muck-ups you can do, git wise. But remember, nothing beats not doing it in the first place!


October 10, 2019

The Basics aren't so Basic

Sorry for the long break! We've been busy with the start of term and busy expanding our training material (link). This week I am just going to talk about something that you should always keep in mind, not just with programming and computers but with a whole bunch of things, and that is, what does it mean to say something is "basic".

There is a quote often attributed to Einstein, although not directly traceable to himwhich goes

Common sense is the collection of prejudices acquired by age eighteen.

Whether the origin is real or not, it's often true that what people think is simple, or "just common sense" is only so because of their background. To somebody who cut teeth on a BBC Micro, programming might seem super BASIC.

Jokes aside, you will probably keep coming across things that are super-basic and feeling a bit awkward that they somehow escaped you until now. Especially if you have learned things in the usual manner, i.e. by necessity, it is very easy to miss some of the basics. You can find yourself doing really quite advanced things, while not knowing something that "everybody else" seems to. This is normal. It is not beneficial, but it is perfectly normal. Frankly, in computing and programming there is a vast, vast sea of "basics", and no matter how much you learn there always seems to be more.

When I were a Lad

When I was a PhD student, I was happily using 'ssh' to login to remote machines, but I would always type out the whole host spec, such as "username@machinename.blah". I remember feeling a bit dumb when my supervisor pointed out that I didn't need the "username", and he thought this was somehow basic and obvious. I was frankly a little bit irritated because nobody told me! How was I meant to know?

"Simplicity is the final achievement."

(Quote from Frederic Chopin)

Moreover, just because something is "basic" doesn't mean it is simple. In fact, Merriam-Webster's definitionof the adjective "basic", while perhaps a bit unhelpfully recursive does not say simple anywhere. That thing with the username isn't so simple. It's fundamental, sure, but it's not simple!

Years later, I am still regularly coming across things that are "basic" that I have never encountered before. The whole "learning how to program" thing is far more of a helix than a road. You come across fundamental things all the time, some for the first time, some repeatedly, and often you can understand them better every time. Eventually, you find them simple. Sometimes they feel even elegant, because they arise so smoothly from the things you do know, or perhaps even seem so obviously "the only way it could be".

This is most of the motivation for our "WINKT" blog post series. These are fundamental, mostly "basic" things, but they're mostly not things you could usefully be told about the first go-around. Mostly, they are the basics of how the complicated things work. For example:

  • On the command line: if you use the '*' wildcard, when does this get expanded into the list of matches? Specifically, if you accidentally create a file called "-rf" in your home, and ran the command `rm *` to remove files, how much trouble would you be in? The answer is, _a lot_. * is processed first, by the shell, and unfortunately '-' comes first in the alphabet. You just ran the equivalent of `rm -rf *`. Ooops.


  • Any C/C++ programmers: if you use a variable which is undefined, what is it's value? If you said "whatever is in the associated memory beforehand", you're close, but wrong. An undefined variable is undefined behaviour - it can be given any value, including a different one each time it is accessed. Why? Because the standard says so. But who needs to know that? It is enough to know that its value is unreliable. Using your "basic" knowledge of the C memory model, you would likely guess the above, and it would never matter. [Disclaimer: this is one that I personally only learned a few weeks ago. It's absolutely fundamental, but not at all simple.]


  • For Fortran 2003 people: if you have a function-scoped ALLOCATABLE array, allocate it inside the function and forget to free it before the function exits, what happens? A memory leak? Nope! Fortran will helpfully deallocate the array on exit. If you didn't know this and freed everything yourself, there would never be a problem, but this one often surprises people.


  • For Python people: suppose you give a function a default argument, like `def func(arg, list_arg = []): ...` and suppose inside the function, list_arg gets filled with stuff. If you call the function twice without supplying list_arg, what do you get the second time? If you said "the combined contents from the first and second calls" you would be correct! The default arg. is an empty list, but it is the SAME empty list each time!

Take Aways

What's the point of all this rambling? Just that there is so much often classed as "the basics" that nobodycan know it all, and there is nothing so basic that you wouldn't do well to re-examine it anyway. It gets said all the time, but with computers there really are no stupid questions. Well, OK, there are some pretty stupid questions. But I have never seen one yet that wasn't worth thinking about.


Postscript: any suggestions for things that make you go "Well I Never Knew That"!, email us or comment! We can always use more


September 10, 2019

New snippets series: WINKT

Super short blog post to start a new snippets series. Along with our SOUP series (Snippet of the <Undefined Period>) we're trying out a new way of posting. WINKT (Well I Never Knew That) is a place for all those things that we stumble across that don't warrant a complete post, but are interesting (if you're a giant geek, anyway).


Fortran Variable Declarations

To start the series off, something I never thought about in Fortran: when you declare a variable with multiple attributes, what is the Left-Hand side comma-separated list actually doing? I was reading a Doctor Fortran post about the potential pitfalls of blindly applying every attribute you might wantand saw his final example of applying (here erroneously) the DIMENSION keyword to a variable in a separate line. "Wait, " I said, "can you do that?" I wrote a quick example, compiled it (with strict standards adherence) and proclaimed "Well I Never Knew That!"

What am I talking about here? Well, consider creating an array of integers. You'll usually write something like:

INTEGER, DIMENSION(:, :), ALLOCATABLE :: I

for a 2-D array. As it turns out, you could equivalently write this as

INTEGER :: I
DIMENSION :: I(:, :)
 ALLOCATABLE :: I

adding the attributes across several lines.

The comma-separated lists of attributes on the left of the double-colon are effectively "stacking" the attributes, but each attribute can also be applied separately.

In that example, it's clearly a bit silly and unwieldy, but it's something you might see "in the wild" and perhaps be confused by, so worth knowing. In some examples, it might actually help make things clear, with, for example, attributes such as SAVE or INTENTs, which can sometimes get "lost in the noise" of declarations. So rather than

INTEGER, SAVE :: I
 INTEGER :: j, k

I could write

INTEGER :: i, j, k
 SAVE :: i

This might look clearer, especially if I have more stacked attributes.

This is probably not something I will ever use, and I am not sure I would recommend it, since it looks a bit unexpected, and generally code should avoid unexpectedness. But it did show up to me just how the attribute lists must be working, and was an interesting ten minutes.


August 21, 2019

Fortran Memory Management II

Follow-up to Fortran Memory Management from Research Software Engineering at Warwick

This month we're going to cover the question of what to watch out for with using ALLOCATABLEs in Fortran.

Array bounds in functions

Fortran arrays have the very nice property that their indices don't have to run from any specific value to any specific value. So if you want an array that runs from -3 to 103 that's fine. You can allocate it as

  INTEGER, DIMENSION(:), ALLOCATABLE :: array
  ALLOCATE(array(-3:103))

This maps quite neatly maps onto lots of scientific use cases so you quite often see arrays with explicit upper and lower bounds in real world Fortran codes. You can check the upper and lower bounds easily enough using the UBOUND and LBOUND functions

  INTEGER, DIMENSION(:), ALLOCATABLE :: array
  ALLOCATE(array(-3:103))
  PRINT *, LBOUND(array), UBOUND(array)

Produces the output "-3 103" as you'd expect. But there is a wrinkle. What if you move those calls to LBOUND and UBOUND into a function?

MODULE mdl
  IMPLICIT NONE
  CONTAINS
  SUBROUTINE print_array(array)
    INTEGER, DIMENSION(:), INTENT(IN) :: array
    PRINT *, LBOUND(array), UBOUND(array)
  END SUBROUTINE print_array
END MODULE mdl

PROGRAM p1
  USE mdl
  IMPLICIT NONE

  INTEGER, DIMENSION(:), ALLOCATABLE :: array
  ALLOCATE(array(-3:103))
  CALL print_array(array)
END PROGRAM p1

The result now is "1 107". The same number of elements but the lower bound has been moved back to the Fortran default of 1. This is the default behaviour of Fortran when you pass an array into a function; the lower bound is reset to 1. You can override this behaviour to specify the lower bound of the array in the function (INTEGER, DIMENSION(-3:), INTENT(IN) :: array will specify that the lower bound is -3), you can even pass in a parameter to the function to specify what the lower bound is (simply pass in an integer parameter and use it in the DIMENSION option in the same way that I used -3 before) but you can't just use the lower bound that the array was given when it was created. However, if you flag the array argument to the function as either ALLOCATABLE or POINTER things are different.

MODULE mdl
  IMPLICIT NONE
  CONTAINS
  SUBROUTINE print_array(array)
    INTEGER, DIMENSION(:), ALLOCATABLE, INTENT(IN) :: array
    PRINT *, LBOUND(array), UBOUND(array)
  END SUBROUTINE print_array
END MODULE mdl

PROGRAM p1
  USE mdl
  IMPLICIT NONE

  INTEGER, DIMENSION(:), ALLOCATABLE :: array
  ALLOCATE(array(-3:103))
  CALL print_array(array)
END PROGRAM p1

This version of the code looks almost identical but now reports the lower and upper bounds as -3 and 103. In fact, you now can't override the bounds of your array even if you wanted to (if you try putting the lower bound in the DIMENSION part of the parameter definition in the function the code will fail to compile). The main downside is that you now can only pass ALLOCATABLE arrays to the function because the main purpose of applying the ALLOCATABLE option to the parameter is to allow you to allocate and deallocate the array inside the function. The POINTER attribute works in much the same way but only for POINTER arrays

Mostly this sort of thing isn't too much of a problem but you do have to be careful. If you have a function that takes a normal non-ALLOCATABLE parameter then it's tempting to simply recognize that the array lower bound will start from 1 inside the function and write the function accordingly. The problem is that if you ever then have cause to add the ALLOCATABLE or POINTER attribute to the function then you'll have to completely rewrite the function because suddenly the lower bound is no longer under your control. It's generally a good idea to always specify the lower bound of an array argument to a function, either through a fixed value if it's always the same for all arrays that the function will be used on or by passing in the lower bound as a parameter to the function.


Automatic Reallocation

The fact that Fortran allows you to do whole array operations is another feature that makes it well suited to scientific programming but there are some features that can be mixed blessings. One feature that was added to Fortran 2003 is that if you do a whole array assignment to an allocatable array that is either not allocated or is allocated to be a different size than the source array then the array will be reallocated to match the size of the source. To give an example

PROGRAM p1
  IMPLICIT NONE

  INTEGER, DIMENSION(:), ALLOCATABLE :: array
  INTEGER, DIMENSION(-3:103) :: src
  array = src
  PRINT *, LBOUND(array), UBOUND(array)
END PROGRAM p1

This program always feels like it's invalid but from Fortran 2003 onwards it is completely valid and will give you lower bound of -3 and upper bound of 103 because "array" is automatically allocated when "src" is assigned to it. In this example "array" is not allocated before I used it but if it was then it would have been silently deallocated and reallocated to the new size. This is quite useful for many purposes, it allows you to return an array from a function and just store it in an allocatable array and have everything magically work, and sometimes you do want to do literally what I'm doing in this example and make a copy of an array. Why would this ever be a disadvantage? Because it can make debugging much harder by moving the place where the bug occurs. Imagine the following situation

PROGRAM p1
  IMPLICIT NONE

  INTEGER, DIMENSION(:), ALLOCATABLE :: array1, array2
  INTEGER :: i

  !The incorrect allocation of array2 is a typo
  ALLOCATE(array1(-3:103), array2(-2:103))
  array2 = 1

  array1 = 5 * array2
  DO i = -2, 103
    array1(i) = array(i) - array(i-1)
  END DO

END PROGRAM p1

This program does nothing even remotely useful but it has the same structure as a real program that I had a problem with. There was an error in the size of an array that was then used in an array assignment. The assignment caused a reallocation of "array1" then then meant that the later loop was iterating over more items than the array now had so it crashed during the operation of that loop. Specifically the first iteration of the loop is trying to access the -3 element that now no longer exists. The loop in fact was perfectly well written for how the code should have been working but due to the error in the allocation of array2 the code was now crashing there. Without the implicit reallocation the error could have much more easily been traced to the array assignment (that in the real code was much further away from the crash site than in this simple example). There are a surprising number of ways of tripping this behaviour and causing errors in unrelated parts of your code because of this behaviour so if you get very unexpected array behaviour you should watch out for this one.

A question that then quite often comes up is if you can suppress this behaviour if you don't want it, and you definitely can. Most compilers have an option to disable the behaviour entirely (-fno-realloc-lhs in gfortran for example) and equally assigning to an array sectiondoesn't trigger the behaviour so

PROGRAM p1
  IMPLICIT NONE

  INTEGER, DIMENSION(:), ALLOCATABLE :: array
  INTEGER, DIMENSION(-3:103) :: src
  array(:) = src
  PRINT *, LBOUND(array), UBOUND(array)
END PROGRAM p1

will crash because you are assigning to an unallocated array. Some people say that from F2003 onwards you should probably always do array assignment to an array section if you don't want to invoke the automatic reallocation behaviour. I wouldn't go quite that far but you should definitely consider the question of if any odd bugs that you have are related to the automatic reallocation behaviour


August 07, 2019

Fortran Memory Management

Memory Managment

Manually Managing

Memory management is one of the banes of the programmer's life in almost all programming languages. In many languages, such as C, you have to manually pair up all allocations of memory with associated deallocations or you will "leak" memory as your program runs. (Strictly a "memory leak" is when you create memory that you then lose track of in some way, so that you can't then deallocate it even if you want to. From the perspective of your program crashing when you run out of memory there's no difference between a true leak and just piling up unused but technically available memory somewhere so I tend to use the term a bit loosely).

Collecting the Garbage

Other languages, like Java, use mechanisms for counting how many times you use an item and when the last reference to an item is removed then the item will be deleted by something that is generally called a "garbage collector". The problem with these languages is that the garbage collector runs only periodically, not every time an item has its last reference released, so memory use can increase due to items that will be deleted when the garbage collector next runs but haven't been deleted yet.

Destructive Magic

Still other languages like C++ use objects that have systems of constructor and destructor functions that create memory when an object is created and release the memory when the object is deleted. This doesn't immediately sound like it's very helpful because surely that's still just a different way of saying that you have to have matched allocation and deallocation logic? The advantage comes from the fact that simple variables (like integers, floats etc.) don't have these memory management problems, the compiler automatically knows the lifetimes of these variables: they're either global variables the exist for the entire time that the code runs or they are local to a function and exist only when you are in that function (or in other functions called from that function etc.).

So long as you are dealing with a single instance of a C++ class and not a pointer to one or more of them then the compiler has the same level of lifetime guarantee that it has with the simple variables, and so long as the class knows how to allocate its memory when it is created (constructor function) and deallocate its memory when the compiler decides that it is finished with (destructor function) you don't have to worry about matching every single creation of the object with a matching destruction, you simply create them when you want them and let the compiler get rid of them when you are finished with them. To be strictly correct, modern C++ actually recommends against a developer doing memory management at all and recommends using STL containers to hold your data (which do the memory management themselves internally and have correctly implemented constructors and destructors). It really is a good idea to do this but scientific and research codes quite often find the odd edge cases where manually allocating and deallocating memory is the easiest solution.

Getting it for Free (Fortran Rules, OK?)

Since mostly in academic programming we tend to be working with arrays, C++ style objects are a fairly heavyweight way of dealing with problems of allocating and deallocating memory. It feels like it should be possible to have the same advantage of simply allocating an array when you need one but keep the advantage of allowing the compiler to automatically infer the lifetime of the variable and deallocate it when the lifetime is over without needing to go to a fully garbage collected model. In C that isn't really possible because arrays are mostly just pointers and the compiler can't be sure that a pointer is the only pointer to a block of memory. Programming would be impossible if your memory was deallocated as soon as any pointer to that memory went out of scope! But in more restrictive languages it is possible and fortunately Fortran is one of the languages that does have an option to do exactly that, and it's probably a feature of the language that you are already familiar with : the humble ALLOCATABLE array.


Allocatable Arrays in Fortran

Fortran allocatable arrays are very easy to use and anyone working in modern Fortran is probably familiar with them, but their properties are often not so well understood. For example to someone with a C background it feels as though this function should leak memory badly.

  SUBROUTINE alloc_fn(els)
    INTEGER, INTENT(IN) :: els
    INTEGER, DIMENSION(:), ALLOCATABLE :: array

    ALLOCATE(array(els))
  END SUBROUTINE alloc_fn

but it actually doesn't leak any memory at all (although it equally doesn't do anything useful here). It also doesn't crash because you are trying to reallocate an already allocated array. The Fortran standard requires that when an allocatable array goes out of scope it should be deallocated, so that function will run allocate the array as requested and then deallocate it again as soon as the function is over. One question that you might then ask is "What about if I did want the array to stick around?", and as usual in Fortran that is possible by using the SAVE attribute to the array

  SUBROUTINE alloc_fn(els)
    INTEGER, INTENT(IN) :: els
    INTEGER, DIMENSION(:), ALLOCATABLE, SAVE :: array

    IF (.NOT. ALLOCATED(array)) ALLOCATE(array(els))
  END SUBROUTINE alloc_fn

You'll notice that this time I've had to test if the array is allocated because otherwise I would wind up trying to allocate it twice and that will cause a runtime error. That is actually another nice feature of Fortran allocatable arrays that protects you from getting a common way of getting a memory leak in C - you cannot allocate an already allocated allocatable array. You can deallocate memory using a DEALLOCATE(array) statement and this can be useful if you want to explicitly manage the lifetime of your memory, for example if you have large intermediate arrays in a calculation that you don't want to have hanging around while you allocate other intermediate arrays. Many style guides do recommend manually deallocating memory on leaving a function, but that's mainly just a combination of caution and working round (mostly very old) broken compilers that don't comply with the standard.

Simple enough so far, but are there any pitfalls? Yes, but they tend to be a bit subtle.

Show your Intentions

Since Fortran 2003 an array argument to a function can have the "ALLOCATABLE" property, and that means that you can allocate and deallocate the array inside the function, for example

SUBROUTINE alloc_fn(els, array)
    INTEGER, INTENT(IN) :: els
    INTEGER, DIMENSION(:), ALLOCATABLE, INTENT(INOUT) :: array

    IF (ALLOCATED(array)) THEN
      PRINT *, "Deallocating"
      DEALLOCATE(array)
    END IF
    ALLOCATE(array(els))
  END SUBROUTINE alloc_fn

Nothing terribly surprising there. I call my function the first time on an unallocated array and it silently allocates, and if I call it again then it prints "Deallocating" and reallocates the array to the new size.

But what if I switch the INTENT argument of the array from "INOUT" (meaning that I can both read and write data to the array) to "IN" (meaning that I can read data from the array but not make changes). Happily as you might expect the compiler refuses to compile this code because it involves making changes to the array and I've specified that I can only read data from it.

But what about if I switch the intent to "OUT" (meaning that I can put data in the array but not read data from it)? You would probably expect this to work because I'm not using data from the array, but on second thoughts you might expect it to fail because I am testing the allocation status of the array. Well, if you try it it compiles, it runs and it allocates the array as expected. The strange thing is that the "Deallocating" print statement never triggers, and this is exactly how Fortran reads the "INTENT(OUT)" statement for allocatable arrays. Since INTENT(OUT) is supposed to mean that you don't take any information from this variable you must be assuming that it is not allocated when it enters the function SO IT DEALLOCATES IT IF IT IS ALLOCATED! This is useful but you have to be very careful! The same behaviour happens for types that contain allocatable components so watch out!

Coming Next - Pitfalls of Allocatables

There are more things to watch out for with Fortran allocatables, including the behaviour of array bounds when they are passed into functions in different ways, automatic reallocation of variables during array assignment (that sounds good but can cause absolute chaos!), the behaviour of types containing allocatable components and a few similar bits, but those will have to wait for part 2 of this post.


July 24, 2019

SOUP: The FizzBuzz Variations

I like challenges - sometimes I do maths and programming puzzles for fun. It's good practice, and if you go into industry as a programmer you may well get hit with this sort of thing in an interview. Interviewers want to verify that somebody can vaguely code before moving on to more detailled questions, so often set something simple, but originally, novel. The novelty has rather been eroded now, so you can find entire sites dedicated to compiling these questions and drilling them - which we definitely don't recommend here - but they are a good source of challenges.

One classic "challenge" is FizzBuzz. You may have encountered this as a game - I remember it from Secondary School French lessons, to help learn the names of numbers. The basic idea is very simple - you go through reading out the numbers, usually from 1 to 100, and for every multiple of three, substitute the word "fizz", for every multiple of five, the word "buzz" and for multiples of three and five (i.e. fifteen) "fizzbuzz".

For anybody who has done maths past GCSE or so, you've probably met Modulo division in some form. But the best thing about FizzBuzz is that you don't need to have. In fact, there is a complete solution without using anything except addition and equality-checking. Moreover, the core problem is nice and simple, so anybody should be able to create the pseudo-code version even if they have no idea how to check for being a multiple of three or five.

I once read an article from somebody who thought that FizzBuzz was a terribly difficult problem that needed high-level maths to solve. In the comments was somebody claiming that anybody who could just write out the code must have spent far too long drilling interview questions. That seemed a little absurd, but it did make me want to produce a FizzBuzz solution which did indicate spending far too much time thinking about it. Just to see if it was possible.

More seriously though, simple problems like this can be made into a great little challenge. There are many solutions, and you can make the problem almost arbitrarily hard by imposing artificial restrictions. You can always verify the answer, and you can get surprisingly creative. I've compiled a bunch of variations on FizzBuzz, from the outline, to the "cleverest" solution I could create, to illustrate the learning opportunity of this sort of thing. They are all on Github here.

FizzBuzz outline code

Anybody who can program should be able to turn the description above into psuedo-code, which is itself a handy skill. Here, all you have to do is expand the words into something slightly more precise. For example:

loop variable i from 1 to 100
  if i divides by 3, but not 5, print "fizz"
  else if i divides by 5, but not 3, print "buzz"
  else if i divides by 3 and 5, print "fizzbuzz"
  else print i
end loop

So, here is the first place where knowing some maths will help us: I wrote those three branches very carefully to be independent. But they don't need to be! I could just do this:

if i divides by 3: 
  print "fizz"
  set flag indicating not to print i 
if i divides by 5:
  print "buzz"
  set flag indicating not to print i 
if not flag:
  print i

Note that this is not necessarily better! It's different, sure, but to get this to work I need to add an extra flag variable, and have three independent "if"s rather than one "if-else" chain. We can work around this pretty easily, by constructing a string rather than printing it directly, such as (I have this in the repo in Fortran as StringConstructionFizzBuzz.f90)

string = "" 
if i divides by 3: 
  string = "fizz"
if i divides by 5:
  string += "buzz"
if string =="":
  string = str(i)

Before going on to real code, note the other possible approach, if we didn't have a handy way of checking for divisibility:

 
loop variable i from 1 to 100 
  increment counter_3s 
  increment counter_5s 
  if counter_3s equals 3
    print "fizz"
    set flag_not_print_i
    counter_3s = 1
  if counter_5s equals 5
    print "buzz"
    set flag_not_print_i
    counter_5s = 1
  if not flag_not_print_i
    print i

Both of these "stock solutions" are in the repo in C, Fortran and Python as FizzBuzzModulo and FizzBuzzCounter.

FizzBuzz as a challenge part 1 - pseudo codes

So, what might we come up with as a deliberate variation on this? Well first off, lets fall back onto the "You Aren't Going to Need It" and assume that 3, 5, and 100 are absolutely fixed. Then we have a super simple solution:

print 1
print 2 
print "fizz" 
print 4 
print "buzz"
...

Simple, and it works. But it's seriously inflexible and we have to type a lot!

But wait a moment! There's a pattern here, isn't there, and it repeats every 15 lines. So what about

i = 1
while i is less than 100
  print i
  increment i
  print i
  increment i
  print "fizz"
  increment i
  print i
  increment i
  print "buzz"
  ...
  print i
  increment i
  print "fizzbuzz"
  increment i
end while

Where the ellipsis (...) is, we have enough lines that i ends up as 15 on iteration 1. The nice thing about this, is that we've relaxed the requirement that the end be at exactly 100. Except we have a rather large bug here: we get a multiple of 15 lines, every time. We'd need to find another way of stopping on the last round. In the GitHub repo for this post, I have a silly example using Bash taking this approach (UnrolledFizzBuzzGenerator.sh). It is undoubtedly silly, but this sort of approach is a pretty long way from the obvious first idea, which was after all the intention.

Note the biggest downsides of this code: it is not obvious what it's doing because it is unpacked a little too much, and the 3 and 5 are baked right into the code.

Fizz Buzz Without If

That last example was my first attempt at writing FizzBuzz without using "if", which technically it did achieve. However, "while" can trivially be used to replace if (assuming a language or variant that runs 0 times if the condition is untrue). We can do better than that, surely?

Of course we can, but first we need to think hard about what "if" is doing for us here.

Understanding What If Does

First, we need to find a way of describing the action of "if" without using the word "if". (Note that modern processors actually have inbuilt instructions called "jump-if" so this isn't strictly true, but helps us find a way to code the problem without using "if"). Note that "on" counts as a synonym here.

Consider a simple one line if chain such as

1 if (condition)
2  do thing one
3 else
4  do thing two
5 end if

If the condition is true, we end up on line 2, otherwise we end up on line 4. Then, we end up on line 5 and carry on. So, putting on our geeky hats and trying to be clever, we might think "aha! We end up on line (2 + 0 (true)|2 (false)).

The next step is far more intuitive if you come from C, or one of the other languages where true is just a number. In C, we can write something like

index = 2 + 2 * (condition != true) and get 2 if the condition is true, and 4 if its not, because "true" has value of 1, and false has 0.

Using an actual "goto" instruction is, of course, evil (not entirely true, but good enough), and in this case would be the essence of solving the letter of the problem, rather than the spirit. We need to think more generally.

Function Pointers

I have written before about using function pointers to replace "if" conditions. In this case we can't set a pointer beforehand, but we have just worked out how to replicate the if chains in our original pseudocode. So, what if we made an array, and used something like the expression above to calculate the offset into it, and then called the function we find there?

This variant works pretty well in C, but in Fortran, where Logicals are a distinct kind of variable, it is very clumsy to calculate the offsets, but can be done. This example is in the repo as FunctionPointerFizzBuzz.f90, .c and .py.

Once again, this a fairly silly way to solve FizzBuzz itself, but does use a technique that can be very very handy! Practice with this sort of thing on a silly example is a great way to learn, and if you keep these sorts of snippets around, you have something to refer to when you need to use the technique in anger later.

FizzBuzz for C programmers

The last example, which is the cleverest I could manage to be about this, is pretty deeply based on C semantics. It relies on both how C handles it's strings, and how its pointers work. I'm not going to explain what is going on, but I thought of it after trying to simplify the function pointers example because of how similar the print commands were. This example is in the repo as SillyStringFizzBuzz.c

This example really goes overboard on the cleverness and is not an approach I would ever use in "real code", but it was fun to write and a good challenge. I think it meets my original challenge pretty well - it is FizzBuzz being entirely too clever. If this sort of things appeals, look up the idea of Code Golf and/or Obfuscated C - coding answers to puzzles which are very very short and hard to understand.

That might sound exceptionally silly - why would anybody want to program in a way that's hard to understand? But of course, knowing what's hard to understand really helps know what makes things easy, and brings to your attention all of the sneaky things that might not do what you first think.

Wrap Up

To conclude - there can be a lot to learn even from super simple seeming challenges. In particular, you can lay the challenge of not using some code feature, especially one you first turn to to solve the problem normally. Developing the flair for thinking "outside the box" is very handy when you have such challenges imposed externally.

Cleverness that makes you say "oh that's clever" is often not a good thing - your first reaction should probably include "oh OK, I see how it works" before "that's clever". But developing clever solutions is a good way to learn. Think of it like something like keepy-uppies for a football (soccer) player. They're not a skill that's useful in a game (the Wikipedia anecdote about Scotland not-withstanding) but they develop control skills that absolutely are. It's also a handy display piece to demonstrate those skills, should you need one.


July 10, 2019

Datastructures – Linked lists part 3

Follow-up to Datastructures – Linked lists part 2 from Research Software Engineering at Warwick

This time we're presenting the final piece of the puzzle for linked lists - adding items. Adding items is very, very similar to removing them. To add an item "I" between two other items (lets call them "A" and "B", with A <-> B initially) you just have to set it up so that A's next pointer points to I, B's prev pointer also points to I and that I's prev pointer goes to A and I's next pointer goes to B. You do then have to take care if either A or B don't exist (i.e. you're adding at the start or the end of the list) but basically the idea is always the same. When you don't have a standard that you have to follow interface design in programming is often as much an art as a science and there are several ways of implementing a useful interface to do this. You can write a routine "insert_between(I,A,B)" that adds an item between two specified items (although this the has to deal with what should happen if you try to insert between two items that are not actually linked to each other and also requires the developer to provide redundant information. If this is going to work then A must be linked to B so if you know A you know B and vice versa). To add at the start of the list you specify "A" to be NULL, and to add at the end of the list you can specify "B" to be NULL. This isn't seen very often in the wild but it works perfectly well. You could also write "insert_after(I,A)" and "insert_before(I,B)" routines that would allow you to insert item I either "after A" or "before B", again specifying appropriate NULL items for A or B to add to the start or the end of the list. The most common approach in the wild at the moment is that used by the C++ Standard Type Library containers. Have a routine "insert(I,B)" that inserts the item before a specified item B, a routine "insert_after(I,A) to insert an item after the specified item, a routine "push_back(I)" that adds a new element to the end of your list and a "push_front(I)" routine to add the new item at the start of the list. While I'm not sure that I'd pick them from scratch I'll stick to the STL

void insert(llitem* head, llitem* new, llitem *before)
{
  new->next = before;
  if (before->prev) {
    before->prev->next = new;
    new->prev = before->prev;
  } else {
    head = new;
  }
  before->prev = new;
}

void insert_after(llitem* new, llitem *after)
{
  /*head is not used here because inserting after an item cannot touch the head*/
  new->prev = after;
  if (after->next) {
    after->next->prev = new;
    new->next = after->prev;
  }
  after->next = new;
}

void push_front(llitem* head, llitem* new)
{
  if (head){
    head->prev = new;
  }
  head = new;
}

void push_back(llitem* head, llitem* new)
{
  llitem * current = head;
  if (!current) {
    head = new;
  } else {
    /*spin on until you find an item that has no next element. That element is the end of the list*/
    while(current->next) current = current->next;
    current->next = new;
    new->prev = current;
  }
}
SUBROUTINE insert(head, new, before)
  TYPE(llitem), POINTER, INTENT(INOUT) :: head !Head of linked list
  TYPE(llitem), POINTER, INTENT(INOUT) :: new !Item to be inserted
  TYPE(llitem), POINTER, INTENT(INOUT) :: before !Item to insert before

  new%next => before
  IF (ASSOCIATED(before%prev)) THEN
    before%prev%next => new
    new%prev => before%prev
  ELSE
    head => new
  END IF
  before%prev => new
END SUBROUTINE insert

SUBROUTINE insert_after(new, after)
  !No head pointer because inserting after an item can't touch the head
  TYPE(llitem), POINTER, INTENT(INOUT) :: new !Item to be inserted
  TYPE(llitem), POINTER, INTENT(INOUT) :: after !Item to insert after

  new%prev => after
  IF (ASSOCIATED(after%next)) THEN
    after%next%prev => new
    new%next => after%next
  END IF
  after%next => new
END SUBROUTINE insert_after

SUBROUTINE push_front(head, new)
  !No head pointer because inserting after an item can't touch the head
  TYPE(llitem), POINTER, INTENT(INOUT) :: head !Head of the linked list
  TYPE(llitem), POINTER, INTENT(INOUT) :: new  !Item to be added to the list

  IF (ASSOCIATED(head)) THEN
    head%prev => new
  END IF
  head => new
END SUBROUTINE push_front

SUBROUTINE push_front(head, new)
  !No head pointer because inserting after an item can't touch the head
  TYPE(llitem), POINTER, INTENT(INOUT) :: head !Head of the linked list
  TYPE(llitem), POINTER, INTENT(INOUT) :: new  !Item to be added to the list
  TYPE(llitem), POINTER :: current

  current => head

  IF (.NOT. ASSOCIATED(current)) THEN
    head => new
  ELSE
    DO WHILE(ASSOCIATED(current%next))
      current=>current%next
    END DO
    current%next=>new
    new%prev => current
  END IF
END SUBROUTINE push_front

Once again a diagram will probably make things clearer.

add item to linked list

There is one thing that you can spot easily from the implementations : in order to add to the end of the list you have to cycle through all of the intermediate items to get to the end. This isn't actually a major problem since for a lot of the uses of linked lists you can freely choose to add to the start of the list which happens immediately by just updating the head. However if you want to be able to add to the end of a linked list then it is common to actually store an equivalent of the "head" item that marks where the end of the list is. This item is generally called the "tail" item. I haven't described anything about using tails because it doesn't add anything to the fundamental nature of the linked list while making all of the implementation a little bit messier. Holding the tail also introduces a debugging risk because it is (technically) redundant and can be inconsistent with the actual last element that you would reach by iterating through the list (usually only because of errors in your implementation). The addition of multiple sources of information about something always complicates matters and you have to test very carefully to ensure that they can't become inconsistent.

This series has basically described how to implement a very simple linked list (if you wanted to). There are many limitations and weaknesses of this implementation (not being certain to nullify pointers on items when adding them, not having a tail item and also the fact that each item can only be in a single linked list at any point (usually you relax that requirement by having each llitem have a pointer to your real data rather than holding the data within the llitem structure itself. You do have to be careful to make sure that items are fully deleted when the last reference to them is released though!)) but it does, mostly, work. The final part of this series on linked lists is going to go through why you might not want to use them. Not just this limited implementation but when linked lists are not suitable datastructures and how you can create hybrids of linked lists and arrays that can perform much better.


April 2024

Mo Tu We Th Fr Sa Su
Mar |  Today  |
1 2 3 4 5 6 7
8 9 10 11 12 13 14
15 16 17 18 19 20 21
22 23 24 25 26 27 28
29 30               

Search this blog

Tags

Galleries

Blog archive

Loading…
RSS2.0 Atom
Not signed in
Sign in

Powered by BlogBuilder
© MMXXIV