December 08, 2021

Advent of Code 2021: First days with R

The Advent of Code is a series of daily programming puzzles running up to Christmas. On 3 December, the Warwick R User Group met jointly with the Manchester R-thritis Statistical Computing Group to informally discuss our solutions to the puzzles from the first two days. Some of the participants shared their solutions in advance as shared in this slide deck.

In this post, Heather Turner (RSE Fellow, Statistics) shares her solutions and how they can be improved based on the ideas put forward at the meetup by David Selby (Manchester R-thritis organizer) and others, while James Tripp (Senior Research Software Engineer, Information and Digital Group) reflects on some issues raised in the meetup discussion.

R Solutions for Days 1 and 2

Heather Turner

Day 1: Sonar Sweep

For a full description of the problem for Day 1, see https://adventofcode.com/2021/day/1.

Day 1 - Part 1

How many measurements are larger than the previous measurement?

199  (N/A - no previous measurement)
200  (increased)
208  (increased)
210  (increased)
200  (decreased)
207  (increased)
240  (increased)
269  (increased)
260  (decreased)
263  (increased)

First create a vector with the example data:

x <- c(199, 200, 208, 210, 200, 207, 240, 269, 260, 263)

Then the puzzle can be solved with the following R function, that takes the vector x as input, uses diff() to compute differences between consecutive values of x, then sums the differences that are positive:

f01a <- function(x) {
  dx <- diff(x)
  sum(sign(dx) == 1)
}
f01a(x)
## [1] 7

Inspired by David Selby’s solution, this could be made slightly simpler by finding the positive differences with dx > 0, rather than using the sign() function.

Day 1 - Part 2

How many sliding three-measurement sums are larger than the previous sum?

199  A           607  (N/A - no previous sum)
200  A B         618  (increased)
208  A B C       618  (no change)
210    B C D     617  (decreased)
200  E   C D     647  (increased)
207  E F   D     716  (increased)
240  E F G       769  (increased)
269    F G H     792  (increased)
260      G H
263        H

This can be solved by the following function of x. First, the rolling sums of three consecutive values are computed in a vectorized computation, i.e. creating three vectors containing the first, second and third value in the sum, then adding the vectors together. Then, the function from Part 1 is used to sum the positive differences between these values.

f01b <- function(x) {
  n <- length(x)
  sum3 <- x[1:(n - 2)] + x[2:(n - 1)] + x[3:n]
  f01a(sum3)
}
f01b(x)
## [1] 5

David Schoch put forward a solution that takes advantage of the fact that the difference between consecutive rolling sums of three values is just the difference between values three places apart (the second and third values in the first sum cancel out the first and second values in the second sum). Putting what we’ve learnt together gives this much neater solution for Day 1 Part 2:

f01b_revised <- function(x) {
  dx3 <- diff(x, lag = 3)
  sum(dx3 > 0)
}
f01b_revised(x)
## [1] 5

Day 2: Dive!

For a full description of the problem see https://adventofcode.com/2021/day/2.

Day 2 - Part 1

  • forward X increases the horizontal position by X units.
  • down X increases the depth by X units.
  • up X decreases the depth by X units.
                  horizontal  depth
forward 5   -->            5      -
down 5      -->                   5
forward 8   -->           13
up 3        -->                   2
down 8      -->                  10
forward 2   -->           15

==> horizontal = 15, depth = 10

First create a data frame with the example data

x <- data.frame(direction = c("forward", "down", "forward",
                              "up", "down", "forward"),
                amount = c(5, 5, 8, 3, 8, 2))

Then the puzzle can be solved with the following function, which takes the variables direction and amount as input. The horizontal position is the sum of the amounts where the direction is “forward”. The depth is the sum of the amounts where direction is “down” minus the sum of the amounts where direction is “up”.

f02a <- function(direction, amount) {
  horizontal <- sum(amount[direction == "forward"])
  depth <- sum(amount[direction == "down"]) - sum(amount[direction == "up"])
  c(horizontal = horizontal, depth = depth)
}
f02a(x$direction, x$amount)
## horizontal      depth 
##         15         10

The code above uses logical indexing to select the amounts that contribute to each sum. An alternative approach from David Selby is to coerce the logical indices to numeric (coercing TRUE to 1 and FALSE to 0) and multiply the amount by the resulting vectors as required:

f02a_selby <- function(direction, amount) {
  horizontal_move <- amount * (direction == 'forward')
  depth_move <- amount * ((direction == 'down') - (direction == 'up'))
  c(horizontal = sum(horizontal_move), depth = sum(depth_move))
}

Benchmarking on 1000 datasets of 1000 rows this alternative solution is only marginally faster (an average run-time of 31 μs vs 37 μs), but it has an advantage in Part 2!

Day 2 - Part 2

  • down X increases your aim by X units.
  • up X decreases your aim by X units.
  • forward X does two things:
    • It increases your horizontal position by X units.
    • It increases your depth by your aim multiplied by X.
                  horizontal  aim  depth
forward 5   -->            5    -      - 
down 5      -->                 5      
forward 8   -->           13          40
up 3        -->                 2
down 8      -->                10
forward 2   -->           15          60

==> horizontal = 15, depth = 60

The following function solves this problem by first computing the sign of the change to aim, which is negative if the direction is “up” and positive otherwise. Then for each change in position, if the direction is “forward” the function adds the amount to the horizontal position and the amount multiplied by aim to the depth, otherwise it adds the sign multiplied by the amount to the aim.

f02b <- function(direction, amount) {
  horizontal <- depth <- aim <- 0
  sign <- ifelse(direction == "up", -1, 1)
  for (i in seq_along(direction)){
    if (direction[i] == "forward"){
      horizontal <- horizontal + amount[i]
      depth <- depth + aim * amount[i]
      next
    }
    aim <- aim + sign[i]*amount[i]
  }
  c(horizontal = horizontal, depth = depth)
}
f02b(x$direction, x$amount)
## horizontal      depth 
##         15         60

As an interpreted language, for loops can be slow in R and vectorized solutions are often preferable if memory is not an issue. David Selby showed that his solution from Part 1 can be extended to solve the problem in Part 2, by using cumulative sums of the value that represented depth in Part 1 to compute the aim value in Part 2.

f02b_revised <- function(direction, amount) {
  horizontal_move <- amount * (direction == "forward")
  aim <- cumsum(amount * (direction == "down") - amount * (direction == "up"))
  depth_move <- aim * horizontal_move
  c(horizontal = sum(horizontal_move), depth = sum(depth_move))
}
f02b_revised(x$direction, x$amount)
## horizontal      depth 
##         15         60

Benchmarking these two solutions on 1000 data sets of 1000 rows, the vectorized solution is ten times faster (on average 58 μs vs 514 μs).

Reflections

James Tripp

How do we solve a problem with code? Writing an answer requires what some educators call computational thinking. We systematically conceptualise the solution to a problem and then work through a series of steps, drawing on coding conventions, to formulate an answer. Each answer is different and, often, a reflection of our priorities, experience, and domains of work. In our meeting, it was wonderful to see people with a wide range of experience and differing interests.

Our discussion considered the criteria of a ‘good solution’.

  • Speed is one criteria of success - a solution which takes 100 μs (microseconds) is better than a solution taking 150 μs.
  • Readability for both sharing with others (as done above) and to help future you, lest you forget the intricacies of your own solution.
  • Good practice such as variable naming and, perhaps, avoiding for loops where possible. Loops are slower and somewhat discouraged in the R community. However, some would argue they are more explicit and helpful for those coming from other languages, such as Python.
  • Debugging friendly.Some participants, including Heather Turner and David Selby, checked their solutions with tests comparing known inputs and outputs. I drew on my Psychology experience and opted for an explicit DataFrame where I can see each operation. Testing is almost certainly a better solution which I adopt in my packages.
  • Generalisability. A solution tailored for the Part 1 task on a given day may not be easily generalisable for the Part 2 task. It seemed desirable to refactor one’s code to create a solution which encompasses both tasks. However, the effort and benefits of doing so are certainly debatable.

We also discussed levels of abstraction. The tidyverse family of R packages is powerful, high-level and quite opinionated. Using tidyverse functions returned some intuitive, but slower solutions where we were unsure of the bottlenecks. Solutions built on R base (the functions which come with R) were somewhat faster, though others using libraries such as data.table were also rather quick. These reflections are certainly generalisations and prompted some discussion.

How does one produce fast, readable, debuggable, generalisable code which follows good practice and operates at a suitable level of abstraction? Our discussions did not produce a definitive answer. Instead, our discussions and sharing solutions helped us understand the pros and cons of different approaches and I certainly learned a few useful tricks.


May 06, 2020

Shebang shebang

The "shebang" is the combination of characters #!, called "hash bang", and sometimes shortened to "sh-bang" or commonly "shebang". You might see this in the first line of things like Bash scripts, Python scripts etc, and wonder what its for, and how to use it. if so, keep reading.

Where does it come from?

The hash-bang combined character was introduced in the world of computers around 1979, although it took a while for the use to standardise. It's a nice, human readable and memorable way to encode two Hex numbers which have a special meaning to the underlying operating system. The hash name is pretty standard, the bang is a bit unusual. According to Wikipedia, the name "bang" for the character "!" was common in the 50s, possibly from comic book gun sounds (Bang!). If the title reminds you of something else about the 50's, you're probably thinking of Sh-boom sh-boom (Life could be a dream). That seems to show up anywhere a TV show wants to indicate the time period.

Where does it go?

In normal bash scripts (or Python), the hash character means a comment. It might then look like the "special" character is the exclamation mark after it, but that's not quite it. The shebang goes far beyond just bash scripts, right into the depths of the operating system. This means it is vitally important that this be the first characters in the file. There can be NO space before them (unlike a simple comment), NO space between them, and NO other characters before or between.

What does it do?

So what does it actually do? If a script with a shebang is run (e.g. typing `./scriptname` in the shell), the hash bang tells the operating system that whatever follows on that line is the interpreter to use to run the script. That is, it should name a program (in full - no relative paths here) that can run the script. Usually, this will be something like `/bin/python` or `/bin/bash` for Python or bash respectively. You can also specify a particular Python version e.g. `/bin/python3` and can do some things with passing flags (see e.g. Wikipedia for some details).

The handy thing about this is you no longer have to specify how to invoke the script, you just run it. If you're a mousey-clicky sort of person, once you've put in the shebang, if you set your OS to run the file (on Linuxes, choose Run not Display in prefs, on OSX choose to open with Terminal, on Windows this depends how you installed Python, see e.g. here) you can double-click to run, which can be handy.

What are the problems?

So that sounds great. But there are a few problems. The way we described using this above, specifying `/bin/python`only works on computers where that specifies the Python the user wants invoked, and where that path is correct for the system at hand. Lots of people might have multiple Pythons, and prefer one be used, or they might have installs in custom places. On Unix-like systems (including OSX) there is a handy utility to deal with this, namely `/usr/bin/env`. This program "knows" what the user has configured to be invoked when they type a program name, such as "python". You can check it by typing e.g. `/usr/bin/env python --version` into a terminal (shell). I get 2.7.10, or if I try python3 I get 3.7.7. The "normal terminal" is probably "sh" or "bash" (commonly those give the same), and you may have other shells such as csh, tcsh or zsh.

You can use this to get the right program with the shebang method too. Instead of `/bin/python` in the first line of your script, use `#! /usr/bin/env python` or bash or python3 and your script will be run with that. Remember to `chmod u+x` (change access mode) to make the file executable too!

Excellent as this is, it has its own minor issue worth noting. On some systems, bin/env will be handed everything that follows as a single command, such as "python --version" in the example we did above. So rather than looking up the "python" program and handing it the --version argument, it tries to look up some non-existent program with a space in the name. It is best to avoid passing any arguments in the shebang line!

Got it?

Hopefully that all makes sense. The shebang is slightly mysterious at first, but very handy once you know what it does.


March 04, 2020

Scheduling of OpenMP

OpenMP is one of the most popular methods in academic programming to write parallel code. It's major advantage is that you can achieve performance improvements in a lot of cases by just putting "directives" into your code to tell the compiler "You can take this loop and split it up into sections for each processor". Other parallelism schemes like MPI or Intel Threaded Building Blocks or Coarray Fortran all involve designing your algorithm around splitting the work up, OpenMP makes it easy to simply add parallelism to bits where you want it. (There are also lots of bits of OpenMP programming that require you to make changes to your code but you can get further than in pretty much any other modern paradigm without having to alter your actual code).

So what does this look like in practice?

 
MODULE prime_finder
  USE ISO_FORTRAN_ENV
  IMPLICIT NONE

  INTEGER(INT64), PARAMETER ::  small_primes_len = 20
  INTEGER(INT64), DIMENSION(small_primes_len), PARAMETER :: &
      small_primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, &
      53, 59, 61, 67, 71]
  INTEGER, PARAMETER :: max_small_prime = 71

  CONTAINS

  FUNCTION check_prime(num)
    INTEGER(INT64), INTENT(IN) :: num
    LOGICAL :: check_prime

    INTEGER(INT64) :: index, end

    end = CEILING(SQRT(REAL(num, REAL64)))

    check_prime = .FALSE.
    !First check against the small primes
    DO index = 1, small_primes_len
      IF (small_primes(index) == num) THEN
        check_prime = .TRUE.
        RETURN
      END IF
      IF (MOD(num, small_primes(index)) == 0) THEN
        RETURN
      END IF
    END DO

    !Test higher numbers, skipping all the evens
    DO index = max_small_prime + 2, end, 2
      IF (MOD(num, index) == 0) RETURN
    END DO
    check_prime = .TRUE.
  END FUNCTION check_prime

END MODULE prime_finder

PROGRAM primes

  USE prime_finder
  IMPLICIT NONE
  INTEGER(INT64) :: ct, i

  ct = 0_INT64
!$OMP PARALLEL DO REDUCTION(+:ct)
  DO i = 2_INT64, 20000000_INT64
    IF (check_prime(i)) ct = ct + 1
  END DO
!$OMP END PARALLEL DO

  PRINT *, "Number of primes = ", ct

END PROGRAM primes

This is a Fortran 2008 program (although OpenMP works on Fortran, C and C++) that uses a very simple algorithm to count the number of prime numbers between 2 and 20,000,000. There are much better algorithms for this but this algorithm correctly counts this number of primes and is very suitable for parallelising since each number is checked for primality separately. This code is already OpenMP parallelised and as you can see the parallelism is very simple. The only lines of OpenMP code are !$OMP PARALLEL DO REDUCTION(+:ct)and$OMP END PARALLEL DO . The first line says that I want this loop to be parallel and that the variable ct should be calculated separately on each processor and then summed over all processors at the end. The second line just says that we have finished with parallelism and should switch back to running the code in serial after this point. Otherwise that is exactly the same program that I would have written if I was testing the problem on a single processor and I get the result that there are 1,270,607 primes less than 20,000,000 regardless of how many processors I run on. So far so good but look what happens when I look at the timings for different numbers of processors

Number of Processors Time(s)
1 11.3
2 7.2
4 3.9
8 2.2
16 1.13

It certainly speeds up! But not as much as it should since every single prime is being tested separately (the number for 16 processors would be 0.71 seconds if it was scaling perfectly). There are lots of reasons why parallel codes don't speed up as much as they should but in this case it isn't any underlying limitation of the hardware but is to do with how OpenMP chooses to split up my problem and a simple change gets the runtime down to 0.81s. The difference between 1.1 seconds and 0.8 seconds isn't much but if your code takes a day rather than a second then 26.5 hours vs 19.2 hours can be significant in terms of electricity costs and cost of buying computer time.

So what is the problem? The problem is in how OpenMP chooses to split the work up over the processors. It isn't trivial to work out how long it will take to check all numbers in a given range for primality (in fact that is a harder problem than just counting the primes in that range!) so if you just do the obvious way of splitting that loop (processor 1 gets 3->10,000,000 processor 2 gets 10,000,001 to 20,000,000) then one of those processors will be getting more work than the other one will and will have to wait until that second processor has finished before it can give you the total number of primes. By default OpenMP does exactly that simple way of splitting the loop up so you don't get all of the benefit that you should from the extra processors. The solution is to specify a SCHEDULE command on the !$OMP PARALLEL DO line. There are 3 main options currently for scheduling : STATIC (the default), DYNAMIC (each iteration of the loop is considered separately and handed off in turn to a processor that has finished it's previous work) and GUIDED (the iterations are split up into work blocks that are "proportional to" the number of currently undone iterations divided by the number of processors. When each processor has finished it's block it requests another one.). (There are also two others, AUTO that has OpenMP try to work out the best strategy based on your code and RUNTIME that allows you to specify one of the 3 main options when the code is running rather than when you compile it). You can also optionally specify a chunk size for each of these that for STATIC and DYNAMIC tries to gang together blocks of chunksizeand then split them off to processors in sequence and for GUIDED makes sure that the work blocks never get smaller than the specified chunk size. For this problem GUIDED gives the best results and switching !$OMP PARALLEL DO REDUCTION(+:ct) SCHEDULE(GUIDED)for !$OMP PARALLEL DO REDUCTION(+:ct)in that code gives you the final runtime of 0.8 seconds on 16 processors which is a good overall performance. But the message here is much more about what OpenMP is doing behind the scenes than the details of this toy problem. The OpenMP standard document (https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5.0.pdf) is quite explicit on some of what happens but other bits are left up to the implementer.

So what do these do in practice? We can only talk in detail about a single implementation so here we're going to talk about the implementation in gcc(gfortran) and I'm using version 9.2.1. You can write a very simple piece of code that fills an array with a symbol representing which processor worked on it to see what's happening. For a simple 20 element array with 2 processors you get the following results (* = processor 1, # = processor 2)

 PROGRAM test
  
  USE omp_lib
  IMPLICIT NONE
  INTEGER, PARAMETER :: nels = 20
  INTEGER, DIMENSION(:), ALLOCATABLE :: ct, proc_used
  CHARACTER(LEN=*), PARAMETER :: symbols &
      = "*#$%ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890"
  INTEGER :: nproc, proc, ix

  nproc = omp_get_max_threads()
  ALLOCATE(ct(nproc))
  ct = 0
  ALLOCATE(proc_used(nels))

!$OMP PARALLEL PRIVATE(proc, ix)
  proc = omp_get_thread_num() + 1
!$OMP DO SCHEDULE(GUIDED)
  DO ix = 1, nels
    ct(proc) = ct(proc) + 1
    proc_used(ix) = proc
  END DO
!$OMP END DO
!$OMP END PARALLEL

 DO ix = 1, nproc
   PRINT "(A, I0, A, I0, A)", "Processor ", ix, " handled ", ct(ix), " elements"
 END DO

 DO ix = 1, nels
   IF (proc_used(ix) <= LEN(symbols)) THEN
     WRITE(*,'(A)', ADVANCE = 'NO') symbols(proc_used(ix):proc_used(ix))
   ELSE
     WRITE(*,'(A,I0,A)', ADVANCE = 'NO') "!", proc_used(ix),"!"
   END IF
 END DO

 PRINT *,""
END PROGRAM test 
SCHEDULE command Pattern
OMP DO SCHEDULE(STATIC) **********##########
OMP DO SCHEDULE(STATIC,4) ****####****####****
OMP DO SCHEDULE(DYNAMIC) *#********#*#*#*##*#
OMP DO SCHEDULE(DYNAMIC) *#*#*************#*#
OMP DO SCHEDULE(DYNAMIC,4) ****####****####****
OMP DO SCHEDULE(GUIDED) ##########*****#####

You can see immediately that the scheduling behaviour is very different for the different commands. The simple STATIC scheduler simply splits the array into two with each processor getting half of the domain. STATIC,4 specifies a chunk size of 4 and does the same type of splitting but with processors getting adjacent chunks of 4 items. DYNAMIC produces quite complex interleaved patterns of processor responsibility but as you can see two runs with the same SCHEDULE command produce different patterns so this really is a dynamic system - you can't predict what processor is going to run what index of a loop (mostly this doesn't matter but there are plenty of real-world cases where efficiency dicatates that you always want the same processor working on the same data). Finally, GUIDED produced a rather strange looking pattern where processor 2 does most of the work. This pattern is not guaranteed but did drop out of multiple runs quite regularly. It appears to be that processor 1 gets a lot of housekeeping work associated with running in parallel (i.e. actually running the OpenMP library code itself) so the system gives more of the work in my code to processor 2.

An interesting thing that you can immediately see from these results is that I should be able to do something other than GUIDED for my prime number code. Since the problem is that certain chunks of the space of numbers to test are harder to process than other chunks I should be able to fix the STATIC schedule just by using a small-ish chunk size rather than giving every processor 1/16 of the entire list. This would mean that all of the processors would get bits of the easy numbers and bits of the hard numbers. Since there is always a cost for starting a new chunk of data I'm guessing that with 20,000,000 primes to test a chunk size of 1000 would seem suitable, so I go back to my prime code and put in SCHEDULE(STATIC, 1000). If I now run on 16 processors I get a time of 0.80 seconds, slightly faster than my GUIDED case indicating that it was the fact that some processors have an easier time of it than others is the problem.

Take away message for this is that when running parallel code it is vitally important that you understand the question of how your problem is being split up and whether that means that all of your processors will have the same amount of work. You have a decent array of tools currently to control how work is distributed and future releases of OpenMP should also have the option of specifying your own scheduling systems!


January 22, 2020

Magic numbers

We've mentioned magic number before but they are one of those things that are always worth mentioning again. Magic numbers are literal numbers that you write into your code that impact the clarity of the code. The alternative to magic numbers is to store the values into variables or preprocessor directives or other ways of associating a number with a name. Exactly what impacts the clarity of your code is often a bit subjective and is certainly research field specific but there are a few rules of thumb

  1. Is the number only an approximation? You don't want to risk using different approximations in different places in your code. Imagine what would happen if you defined Pi to have different values when using it in different contexts?
  2. Is the number immediately recognizable? As a physicist I'd recognize "0.5*m*v**2" when I see it in code as being the kinetic energy of an object despite the 0.5 but I'd have more trouble with "4.0e-7 * pi". It's probably the pre-2018 definition of the vacuum permeability but it isn't entirely clear.
  3. Is the number arbitrary? It's quite common to use a number to specify things like which problem to run or which optional package to use. You might remember what problem 3 is right now but you'll have to look if you ever come back to this code. Even worse, if you lose discipline then you might change what the numbers do if one of the test cases becomes redundant. Replacing your arbitrary numbers with named constants severely reduces these problems. If you give your problem number a sensible name then you'll have a much better chance of remembering what it does, and similarly if you ever remove a problem then if you remember to remove the named constant as well then you won't be able to compile the code when trying to run the old test. A lot of languages provide "enumerated types" or "enums" that allow you to automatically map numbers to names and can also prevent a user from chosing to supply a simple integer instead.

It can feel like this is unnecessary and slows you down when you are just writing a quick code, but one of the major problems with academic software is that it tends to grow. You write a code to solve a problem that you encounter during your research and you aren't terribly careful because you aren't going to keep it. But it is quite likely that you won't put it to one side and never touch it again. You might encounter a similar problem later and modify the code. At that point elements like magic numbers are annoying but you can usually work out what your own code is doing. The major problem comes if you move on and your code is inherited by someone else. In this case they might work everything out perfectly (which is good), they might find that they can't understand it and have to start again (which is annoying but not too troublesome) but worst of all is that they might misunderstand what your code is doing and make changes that are incorrect.


January 09, 2020

Finding The Solution

New Year, New blog post. Just a short one this time, following on my my post on FizzBuzza few months ago. Even a problem as simple as that can be solved in myriad ways, and as I program more and in more languages, I find myself less often wondering how I can solve a problem, and more often how I should.

Most of the languages I work with let me solve problems using the basic command structures, and, as I wrote about last time in The X-Y problemit can be hard, but is vitally important, not to get confused by your partial solution and miss a better one.

Recently I've been learning Perl to do some complex text-processing, and find it to be a drastically different way of thinking to my C/Fortran background. It's tricky to think in terms of text matches and substitutions when I am so used to thinking of the position of each character in a string and working in terms of "index-of-character-X plus 1" (similar to working in terms of for-each loops when one has only used for). For the processing being done, the proper Perl solution is much shorter, easier to understand, etc, although it takes me a bit longer to produce initially.

A recent Stack Exchange post I saw had somebody asking why his boss didn't appreciate his brilliant coding techniques, because design patternswere second nature to him, and his boss wanted to use far simpler solutions. He probably came back to Earth with a bit of a bump when it was firmly pointed out that "patterns being second nature" was actually a bad thing, because it rather sounded like he trotted out the first "pattern" he could think of, instead of actually thinking about the problem he was solving. Nothing wrong with the patterns themselves, but critical thinking is required to decide whether they are suitable, optimal etc.

The other common mistake people make is demonstrated in Terry Pratchett's description of "Death's Swing" (e.g. https://en.wikipedia.org/wiki/Death_(Discworld)#Home), which mirrors the Sunk Costs fallacy. Trying to build a swing for his Grandaughter, the character of Death plows forward inspite of all problems. He hangs the swing from the two strongest branches. These being on opposite sides of the tree, he cuts away the trunk, shores it up and ... This can easily happen when programming and the trick is never to be afraid to throw away (or file away for future use) a solution, even a good one which took a long time, if it stops fitting the problem. In Death's case, it is less because he is unwilling to throw away the work already done, and more an issue of very linear thinking, but the effect is the same.

Hopefully this is already obvious, and you always think before you code, happily refactor or rework your own code, and have an ever-growing solution bank to call on. I suspect very few people are willing or able to throw away all the false starts they probably should though. Just keep in mind that there is a crucial difference between "what solves the immediate problem" and "the code I should probably actually write", and strive for the latter.


December 05, 2019

The "XY Problem" or how to ask so somebody can answer

There's a lot of things that I feel ought to have pithy names or words, and this is one of them. The classic "XY problem" goes: Person wants to solve problem X. They don't know how. They think about it, and conclude that they will need to do "Y" as part of the solution. They don't know how to do this either. So they ask for help, but omit, or forget, to mention the larger problem. This can lead to bad or not-applicable answers, and an inability of the (presumably) skilled responders to explain how to do X. Since the original asker doesn't know how to do X, there's a good chance "Y" was a poor approach, or irrelevant, or harder than it needs to be etc.

I managed to commit a classic XY swap today while knocking up a very simple bash script. I wanted to take a filename, and strip its extension off. I knew this was going to be ".f90" in context, so I searched for how to strip the last 4 characters of a string in bash. I found this questionand used the substring solution given, before reading a bit more of the page and realising that I (and the original asker) were asking the wrong question. We both actually wanted this solutionto remove the dot and the extension. Funnily enough, this is also the archetype XY problem, as described at e.g. hereor here.

In this case, it isn't a big problem. Both solutions work, and for my actual use I don't need to gracefully handle a string without the ".f90" extension at all. But in more "interesting" (i.e potentially serious and causing of harm) cases on some of the forums I read (e.g. DIY, powerlifting, cybersecurity) the asked question hides a serious misunderstanding and goes off down an unhelpful path. For example, thankfully this guyasked his question openly (why are is breakers tripping after he removed a cooker fan and tied the live to the neutral wire) even if it was rather too late! Somewhere in the hundreds of questions about tripping breakers on that forum, there is probably somebody who's done something just as bad, but hidden it.

Another amusing type is the one where the question as asked can be answered effectively, but the asker would be surprised by the extra information. For instance (summarised from a real exchange):

A: I need a workout program that takes 3 hours every day!
B: Here are some! They're awesome!
C: Hold on, why do you want this?
A: Because I'm bored and I need to occupy 3 hours.
C: Jeepers! Find a hobby buddy! Forcibly spending a fixed amount of time is not going to lead to productive training...


But it Works, Doesn't it?

Even when the question isn't the right one, it is possible to get a solution which "works". But assuming you're trying to improve your programming skills, there's a lot wrong with "good enough" and the "wrong" solutions often have a whole host of problems:

  • They're too specific. For example, the substring solution for my file-extension problem works, but it's less general than it should be. It only works for 3 character extensions (plus one for the '.') and gets the wrong answer if there is no extension, wheras the actual solution works in both of those cases.
  • They're misleading when read back. Again, with the file-extension problem it's not clear what I'm doing by taking 4 characters off - the purpose is much clearer if I split on the dot.
  • They're not actually any quicker/simpler than the "proper" solution. The file-extension thing is a good illustration again - bash substrings are kinda inelegant, while pattern substitutions aren't. The "proper" solution is shorter and much clearer in general.
  • You still don't know how to solve the "actual problem" where you could have learned a much broader-applicable thing and improved your problem solving ability. In a lot of cases, you end up writing "Fortran in every language" rather than actually learning the approch of the one you're in. Again, the file-extension example is a good one. Doing it with a string slice is something familiar to me from other languages, but not a good move in bash, and it makes the problem more fiddly (I need to find the string length because of how substringing works).
  • Last and perhaps most egregiously, it hides from YOU how well you understand what you're doing, and how hard it actually is. My hacky solutions can end up being far more complex than they need to be, or they can end up making a genuinely hard problem look simple, because they don't actually solve it. See, for example, the difference between a genuine AI chatbot and the ELIZA model.

So what's the solution?

Mostly, when asking a question, try to ask the actual real one. This is easy when asking a question (in person or in text), but gets a bit tricky when using a search engine, since you're having to break things down into just a few key words. But the general idea is to look for the overall problem, as well as the partial solution you've thought up, to think very hard about what that actual problem is, taking a step back if required, and to not get too attached to your approach.

Always bear in mind that a complete change of tactic might be required to actually achieve your goal, and that you might not have quite nailed down what that goal is yet. And keep in mind the quote from Henry Ford:

Failure is simply the opportunity to begin again, this time more intelligently.

or as I've heard it paraphrased but can't find a source for:

Finding out you're wrong is great, because you get the chance to become more right.


November 20, 2019

So you've Nobbled your Git

Git protects you from some sorts of errors, by letting you create a history which you can then roll back through. On the other hand, it gives you some very powerful commands with which an incautious person can wreak havok. In fact, everybody who uses git has probably done something catastrophic at least once. It happens. So what can you do about it?

Luckily, Git gives you some powerful tools for fixing things that it lets you do, and there are some sneaky tricks for things it doesn't deliberately help with, but does by coincidence. For some things, you just have to hope your backups have been keeping up. But how do you tell which is which, and what do you do?

Things Wot I have done to my Repos

  • Checked out a file and lost local changes (git checkout <filename>)
  • Done `git pull` and got a merge commit I didn't want
  • Done `git reset HEAD~<N>` to go back before the merge commit, and gone too far
  • Run `git reset --hard` instead of `git reset`
  • Checked out a feature branch under the name of master and then pushed it

If you're thinking I might just be a bit of an idiot, well maybe, but not because of these things. They happen. Even the last one, although that was a bit of a perfect storm of errors combining into horror. They happen to everybody, even the experienced. Sites like DangitGit exist for a reason! The trick is to do them less and fix them better.

Why these things are Good but also Bad

In order:

Checked out a file and lost local changes.

`Git checkout` for a single file means "restore this file to the state in the index" which, in simpler terms means "put this file back to what git thinks it is" (in most circumstances). For a file with staged changes, these are kept, i.e. you get the state from the last commit AND any staged changes.

This is handy when you're making prospective changes and want a way to undo parts that didn't work. You stage the bits that did work, and you checkout, and you try again.

I used it a lot recently because I was writing a script to find-replace things in code. I wanted changes to the script to be kept, but it had gone wrong, so I wanted all the code to be put back to how it had been. `git add <script>` and `git checkout ./`and I could easily undo the horrors my script had wrought.

Why you should BEWARE: you are asking git to do something with things it doesn't know about (your unstaged, uncomitted changes). Git happily does this and it doesn't care. Unstaged changes are none of its business and it throws them away. Git can't help you get them back. It never had them.

Done `git pull` and got a merge commit I didn't want

You try to `git push` and are told your branch is behind the remote. The instructions say to do git pull` first. Git pull is nice and simple, and mostly you don't have to do much more. You pull, you push, job done. This is nice when you really did make distinct changes and are happy with a merge commit.

Why you should BEWARE: merge commits are ugly. Sometimes they are a tolerable evil, but they make the history more complicated and can be tricky when you're trying to go back in time. Don't just blindly `git pull`. Sometimes you need to take the extra time and rebase your local changes onto the remote.

Luckily, this is all doing "gitty" things, so git will help you! You can simply go back (git reset) to before the merge, and fix things properly.

Done `git reset HEAD~<n>` to go back before the merge commit, and gone too far

Suppose you accidentally did that blind git pull, and realised you now have an ugly merge commit you don't want. Ah, you think, I can just reset back to before I did that. This is really handy - I can go back to any of the old states and I can branch from it, or I can go back "undoing the commits" but keeping the changes, so I can refactor what went into what commit etc.

Why you should BEWARE: Counting is a real downfall of mine, (how many days ago was Sunday again?), and I got N wrong and went back too far. Now I've "lost my commits". I have the changes, but I don't know which of them went into what commit and I don't want to have to recreate that! If I was even dumbed and did a hard reset, I don't even have the changes anymore, and I really don't want to redo all of those. But, the changes had been committed, so maybe there's a fix?

Again, this was a "gitty" thing, and I can fix it. I didn't delete the commit entirely (yet), I just took it out of local history. At worst, it should still exist on my remote (hopefully I pushed it before I messed up, but in this case, I had not). See below for the solution.

Run `git reset --hard` instead of `git reset`

I've made some changes, I've staged some things, and I realise actually it was all bunk. Maybe it was a failed experiment, and became clear it wasn't going to work only partway. Maybe I was just messing about. Regardless, there are circumstances where I want to go back to a clean slate, and put everything back into the state of the last commit. This is the task of `git reset --hard`. It means "wipe it all clean. Put me back as though I had just cloned this/hadn't done anything since my last commit.

Why you should BEWARE: Well... read again carefully what this does. All your changes, staged or not, trash them. Reset it all!. Now compare to what `git reset` (without --hard) does. That's a bit different, isn't it. Hard resets are very brutal. They have their place, but must be handled with care. Recently I did a hard reset I didn't mean to. I had a few hours of work, staged but not committed, and it had vanished. Bugger.

So can I fix it? Well... kinda. While I did a "gitty" thing, git obeyed me, and threw my work away. I asked it to. However, all is not lost. I can't just ask git to undo my mistake, but some of it might remain, if I can work out how to find it. See below!

Checked out a feature branch under the name of master and then pushed it

Being able to have a local branch with one name map to a remote branch of another name is handy. For instance, say my remote repo like to separate hotfix from feature branches, or use a developers name as part of the branch. I don't need to remember this when I make local branches, and I don't have to fiddle around renaming branches locally. I just push like `git push origin local_name:remote_name` This is handy.

I can also add more than one remote ("origin" is not special, it's just a default name) and push, pull etc from the one I want.

Why you should BEWARE: Mostly these features are handy, but I managed to make a real mess for myself. I added a new remote and tried to checkout a specific branch. I accidentally pulled a branch, and didn't notice that locally it was now called master. Later, I pushed the branch. I forgot to specify that I wanted to push to the remote, and I forgot to type the branch name. Had I remembered the latter, I would at least have seen an error that that name didn't seem to exist. Had I remembered the former, at least I'd only have nobbled the copy on my personal remote. The remote configuration let me push because I had high priviledge. Oops!

Can I fix this? Luckily, yes. I didn't FORCE push, so all I have to do is take a deep breath, checkout back to what should be the tip of master, and forcibly push that. I should be very, very cautious here though. If I get this wrong, I can do a lot of damage. Force pushing is not something to take lightly. If this is not your project, buy the maintainer a stiff drink (or a cookie) and ask them politely to fix it for you.

HELP, What do I do NOW?

So, assume you've done something similar to one of the above. You've "lost" some work that was staged, or committed so that git knew about it. Can it help me? First, some background.

Every Repository is Equal

Introductions to Git often talk about its character as a "distributed" system. Rather than the older style where there was one "canonical" repo, and people could locally have a subordinate copy, in git "all repositories are equal". This is true for most stuff. Commits, history, objects etc are stored in every copy of the repo, and none of them are special.

However, it is not true of absolutely everything. If you have worked with "git tags" you may recall these aren't pushed by default with the rest of the content. You have to push them specially. Also, (obviously), none of the untracked files in your repo are part of anybody else's repo. You can also have git configured on your machine to e.g. use a global .gitignore file.

Local Specialities

There are a few other things which exist in your local git repo, but aren't part of your remotes, or anybody else's copy. This is one of the ways in which git is not a backup - pushing to a remote is about sharing stuff, not preserving it. A git remote is a copy of some of your stuff, but not uncomitted changes, or the local parts of your repo.

For us, here, the things we're interested in are the git "reflog" which is sort of a local history of some of the "gitty" things you've done, and the git "object directory". Git turns everything into these "objects" which is why you see messages like "Counting objects"" when you push or pull. Things that you undo, or never confirm (like staged-but-never-comitted changes) might still exist in the objects database, but in an "orphaned" state - not part of the repo, but not yet gone forever.

Taking out the Trash

Git is strictly a garbage collected system. "Garbage collected" languages are a class where, when something is freed or deleted, instead of it being immediately wiped, it is merely flagged somehow as "done with". At some point, often on schedule, or when the program isn't doing much else, the garbage collector comes and cleans things up, returning the memory to the program.

If you've watched things like "CSI" or other tech-jargon TV, you have probably seen somebody restoring apparently deleted files on a computer harddrive - this is a bit like garbage collection. When you delete a file from disk in the normal fashion, it would take time to overwrite every byte with a 0-byte, so instead the space is merely flagged as empty. For some time the file may be effectively still there, just "lost" and can be restored. Proper drive wiping involves overwriting (more than once, because magnets are complicated) with 0s. The recycle bin won't cut it.

The reflog and the object database for git are garbage collected. This means that objects created because git thought you were going to do something, even if you didn't, might still be there days, years, commits etc later.

Don't Push your Luck

In most cases, any attempts to use these features to save you are a last ditch, and the better approach is not to make the mistake in the first place. Since the object database is garbage collected, files hiding in it are living on borrowed time, and can be irreparably lost at any time.

So, the Solutions

I am only going to talk about the solutions to two of the messes in the list above, namely the accidental hard reset to an older commit, and the accidental hard reset of staged changes. The things where you might need to do something not entirely "gitty" to fix them. You might notice both things involve resets...

The first "solution" for FUTURE occurences of these problems, is never to run a reset without first stashing everything, or backing it up, or similar, and to be certain the reset is the correct command first. But if you've already messed up, you want to fix it now. There is hope!

Fixing an "undone" commit

The first error, resetting back to HEAD~<N> and going to far, isn't that bad. For a start, if you've been pushing regularly, the work in the "lost" commits isn't gone, as you could always clone afresh. Luckily though, even if you never pushed, you can fix this. Git knew about those changes, and those commits. This can be reconstructed. Note that if you did a hard reset, the changes are apparently "gone" wheras a soft one leaves them there, but your commit details are gone (such as any picking of lines, or separation into multiple commits).

But the gist of the solution is as follows. Whenever I commit, checkout, reset, rebase or pull, the Git reflog stores a "commit" going between the states. These commits are "real" but they aren't part of my core repo. You can see a "history" using the command "git reflog". You should see things like "commit: <commit msg>" and possibly "checkout: moving from <branch> to <branch>".

These states don't persist forever but they are real long enough for this! I can simply backup my work (just in case), take a deep breath, and broach the "reset" command again, but more carefully this time. I carefullyidentify the point I made the error, and I ask git politely but firmly to undo it. In this case, I am looking for the line which says " <id> HEAD@{N}: reset: moving to <blah>" where "blah" might be "HEAD~2" or might be a commit-id or similar. I want to go back to just before this, so I pick the previous line and make a note of the part "HEAD@{N}".

If I have any unstaged changes, or I have anything else I am not sure about I double check that I backed it up! We are about to run another reset. Getting this wrong can make things worse. Backup. Now.

Now, again carefully, I run `git reset --hard HEAD@{N}` where N is the thing I just worked out. Hopefully I am back where I wanted to be! Since the commits I had orphaned are no longer dangling, as they're now part of my branch again, they wont be cleaned up, and I have got them back. I breathe a sigh of relief and vow never to hard-reset again.

Fixing nobbled staged changes

The previous error wasn't actually as bad as all that. All I had done was remove some entire commits from my branch history. It seems logical that git remembers them for a bit and can put them back. But in the case of staged changes, they were never part of a commit at all, so they're not in the reflog in any form.

Luckily, when we stage the changes git "gets things ready" and creates the object detailling the changes in the object database. But there is no longer any reference to them in the repository itself. They are just orphaned objects. If we can convince git to spit them out, at least we have our changes (or files) back, even if we have to do a bit of work to restore things completely.

The following solution comes from this link. I was able to save my situation using the simple method, because I didn't have many changed files, but the link also details a longer solution for really bad mess-ups. Basically, we have to work out which are the orphaned objects (not part of any commit, branch etc), which of these are files (commits etc can also be in here), and then get git to tell us the content. We can then either hand-pick what we want back, or write a script to spit it all into files and use our shell-scripting prowess to restore things.

The direct is using `git fsck --cache --unreachable $(git for-each-ref --format="%(objectname)")` which asks git to spit out any unreachable object ids. We can then show the content with `git show <id>`. If there's not much stuff, that should work.

If we get loads, AND the reset is the last thing we staged or committed, we can list objects in order using `find .git/objects/ -type f -printf '%TY-%Tm-%Td %TT %p\n' | sort`.This gives us ALL the objects, but we can cross reference the lists to find our lost stuff, or just go by hand. We can again use `git show` to view the objects, although in this case we have to strip out the extra '/' characters from the ids.

Luckily for me, I only needed to restore one or two files, so I used the find command and did `git show` manually, but I was glad not to have to redo all my work!

HELP, this Keeps Happening to Me!

If you keep ending up with these sorts of problems, there are a few handy tips.

  1. DO NOT PANIC. It is never helpful.
  2. STOP. Once you've messed things up, you're already thinking "If only I hadn't done that!" You might notice that a lot of the "solutions" to git problems can make things much worse. Once you're in a mess, stop. Backup what's left of your work. Make a new repository somewhere else to test fixes. Don't just plow on!
  3. Work out what you actually did wrong. The internet abounds with solutions to git problems, but unless you can detail exactly what went wrong, you wont find the right one.
  4. Backup and stash carefully in future. I said that git only "knows" about staged or committed files - it also knows about those you tell it about using the "stash" command. Carefully stashing changes before doing "risky" things can save you.
  5. Slow down, think hard, and never commit on a Friday afternoon. Git is a powerful tool, and should not be operated while tired, distracted, or under the influence. Don't try and do complicated gitty things late or short of time.

Hopefully, this helps with the worst muck-ups you can do, git wise. But remember, nothing beats not doing it in the first place!


October 10, 2019

The Basics aren't so Basic

Sorry for the long break! We've been busy with the start of term and busy expanding our training material (link). This week I am just going to talk about something that you should always keep in mind, not just with programming and computers but with a whole bunch of things, and that is, what does it mean to say something is "basic".

There is a quote often attributed to Einstein, although not directly traceable to himwhich goes

Common sense is the collection of prejudices acquired by age eighteen.

Whether the origin is real or not, it's often true that what people think is simple, or "just common sense" is only so because of their background. To somebody who cut teeth on a BBC Micro, programming might seem super BASIC.

Jokes aside, you will probably keep coming across things that are super-basic and feeling a bit awkward that they somehow escaped you until now. Especially if you have learned things in the usual manner, i.e. by necessity, it is very easy to miss some of the basics. You can find yourself doing really quite advanced things, while not knowing something that "everybody else" seems to. This is normal. It is not beneficial, but it is perfectly normal. Frankly, in computing and programming there is a vast, vast sea of "basics", and no matter how much you learn there always seems to be more.

When I were a Lad

When I was a PhD student, I was happily using 'ssh' to login to remote machines, but I would always type out the whole host spec, such as "username@machinename.blah". I remember feeling a bit dumb when my supervisor pointed out that I didn't need the "username", and he thought this was somehow basic and obvious. I was frankly a little bit irritated because nobody told me! How was I meant to know?

"Simplicity is the final achievement."

(Quote from Frederic Chopin)

Moreover, just because something is "basic" doesn't mean it is simple. In fact, Merriam-Webster's definitionof the adjective "basic", while perhaps a bit unhelpfully recursive does not say simple anywhere. That thing with the username isn't so simple. It's fundamental, sure, but it's not simple!

Years later, I am still regularly coming across things that are "basic" that I have never encountered before. The whole "learning how to program" thing is far more of a helix than a road. You come across fundamental things all the time, some for the first time, some repeatedly, and often you can understand them better every time. Eventually, you find them simple. Sometimes they feel even elegant, because they arise so smoothly from the things you do know, or perhaps even seem so obviously "the only way it could be".

This is most of the motivation for our "WINKT" blog post series. These are fundamental, mostly "basic" things, but they're mostly not things you could usefully be told about the first go-around. Mostly, they are the basics of how the complicated things work. For example:

  • On the command line: if you use the '*' wildcard, when does this get expanded into the list of matches? Specifically, if you accidentally create a file called "-rf" in your home, and ran the command `rm *` to remove files, how much trouble would you be in? The answer is, _a lot_. * is processed first, by the shell, and unfortunately '-' comes first in the alphabet. You just ran the equivalent of `rm -rf *`. Ooops.


  • Any C/C++ programmers: if you use a variable which is undefined, what is it's value? If you said "whatever is in the associated memory beforehand", you're close, but wrong. An undefined variable is undefined behaviour - it can be given any value, including a different one each time it is accessed. Why? Because the standard says so. But who needs to know that? It is enough to know that its value is unreliable. Using your "basic" knowledge of the C memory model, you would likely guess the above, and it would never matter. [Disclaimer: this is one that I personally only learned a few weeks ago. It's absolutely fundamental, but not at all simple.]


  • For Fortran 2003 people: if you have a function-scoped ALLOCATABLE array, allocate it inside the function and forget to free it before the function exits, what happens? A memory leak? Nope! Fortran will helpfully deallocate the array on exit. If you didn't know this and freed everything yourself, there would never be a problem, but this one often surprises people.


  • For Python people: suppose you give a function a default argument, like `def func(arg, list_arg = []): ...` and suppose inside the function, list_arg gets filled with stuff. If you call the function twice without supplying list_arg, what do you get the second time? If you said "the combined contents from the first and second calls" you would be correct! The default arg. is an empty list, but it is the SAME empty list each time!

Take Aways

What's the point of all this rambling? Just that there is so much often classed as "the basics" that nobodycan know it all, and there is nothing so basic that you wouldn't do well to re-examine it anyway. It gets said all the time, but with computers there really are no stupid questions. Well, OK, there are some pretty stupid questions. But I have never seen one yet that wasn't worth thinking about.


Postscript: any suggestions for things that make you go "Well I Never Knew That"!, email us or comment! We can always use more


September 10, 2019

New snippets series: WINKT

Super short blog post to start a new snippets series. Along with our SOUP series (Snippet of the <Undefined Period>) we're trying out a new way of posting. WINKT (Well I Never Knew That) is a place for all those things that we stumble across that don't warrant a complete post, but are interesting (if you're a giant geek, anyway).


Fortran Variable Declarations

To start the series off, something I never thought about in Fortran: when you declare a variable with multiple attributes, what is the Left-Hand side comma-separated list actually doing? I was reading a Doctor Fortran post about the potential pitfalls of blindly applying every attribute you might wantand saw his final example of applying (here erroneously) the DIMENSION keyword to a variable in a separate line. "Wait, " I said, "can you do that?" I wrote a quick example, compiled it (with strict standards adherence) and proclaimed "Well I Never Knew That!"

What am I talking about here? Well, consider creating an array of integers. You'll usually write something like:

INTEGER, DIMENSION(:, :), ALLOCATABLE :: I

for a 2-D array. As it turns out, you could equivalently write this as

INTEGER :: I
DIMENSION :: I(:, :)
 ALLOCATABLE :: I

adding the attributes across several lines.

The comma-separated lists of attributes on the left of the double-colon are effectively "stacking" the attributes, but each attribute can also be applied separately.

In that example, it's clearly a bit silly and unwieldy, but it's something you might see "in the wild" and perhaps be confused by, so worth knowing. In some examples, it might actually help make things clear, with, for example, attributes such as SAVE or INTENTs, which can sometimes get "lost in the noise" of declarations. So rather than

INTEGER, SAVE :: I
 INTEGER :: j, k

I could write

INTEGER :: i, j, k
 SAVE :: i

This might look clearer, especially if I have more stacked attributes.

This is probably not something I will ever use, and I am not sure I would recommend it, since it looks a bit unexpected, and generally code should avoid unexpectedness. But it did show up to me just how the attribute lists must be working, and was an interesting ten minutes.


August 21, 2019

Fortran Memory Management II

Follow-up to Fortran Memory Management from Research Software Engineering at Warwick

This month we're going to cover the question of what to watch out for with using ALLOCATABLEs in Fortran.

Array bounds in functions

Fortran arrays have the very nice property that their indices don't have to run from any specific value to any specific value. So if you want an array that runs from -3 to 103 that's fine. You can allocate it as

  INTEGER, DIMENSION(:), ALLOCATABLE :: array
  ALLOCATE(array(-3:103))

This maps quite neatly maps onto lots of scientific use cases so you quite often see arrays with explicit upper and lower bounds in real world Fortran codes. You can check the upper and lower bounds easily enough using the UBOUND and LBOUND functions

  INTEGER, DIMENSION(:), ALLOCATABLE :: array
  ALLOCATE(array(-3:103))
  PRINT *, LBOUND(array), UBOUND(array)

Produces the output "-3 103" as you'd expect. But there is a wrinkle. What if you move those calls to LBOUND and UBOUND into a function?

MODULE mdl
  IMPLICIT NONE
  CONTAINS
  SUBROUTINE print_array(array)
    INTEGER, DIMENSION(:), INTENT(IN) :: array
    PRINT *, LBOUND(array), UBOUND(array)
  END SUBROUTINE print_array
END MODULE mdl

PROGRAM p1
  USE mdl
  IMPLICIT NONE

  INTEGER, DIMENSION(:), ALLOCATABLE :: array
  ALLOCATE(array(-3:103))
  CALL print_array(array)
END PROGRAM p1

The result now is "1 107". The same number of elements but the lower bound has been moved back to the Fortran default of 1. This is the default behaviour of Fortran when you pass an array into a function; the lower bound is reset to 1. You can override this behaviour to specify the lower bound of the array in the function (INTEGER, DIMENSION(-3:), INTENT(IN) :: array will specify that the lower bound is -3), you can even pass in a parameter to the function to specify what the lower bound is (simply pass in an integer parameter and use it in the DIMENSION option in the same way that I used -3 before) but you can't just use the lower bound that the array was given when it was created. However, if you flag the array argument to the function as either ALLOCATABLE or POINTER things are different.

MODULE mdl
  IMPLICIT NONE
  CONTAINS
  SUBROUTINE print_array(array)
    INTEGER, DIMENSION(:), ALLOCATABLE, INTENT(IN) :: array
    PRINT *, LBOUND(array), UBOUND(array)
  END SUBROUTINE print_array
END MODULE mdl

PROGRAM p1
  USE mdl
  IMPLICIT NONE

  INTEGER, DIMENSION(:), ALLOCATABLE :: array
  ALLOCATE(array(-3:103))
  CALL print_array(array)
END PROGRAM p1

This version of the code looks almost identical but now reports the lower and upper bounds as -3 and 103. In fact, you now can't override the bounds of your array even if you wanted to (if you try putting the lower bound in the DIMENSION part of the parameter definition in the function the code will fail to compile). The main downside is that you now can only pass ALLOCATABLE arrays to the function because the main purpose of applying the ALLOCATABLE option to the parameter is to allow you to allocate and deallocate the array inside the function. The POINTER attribute works in much the same way but only for POINTER arrays

Mostly this sort of thing isn't too much of a problem but you do have to be careful. If you have a function that takes a normal non-ALLOCATABLE parameter then it's tempting to simply recognize that the array lower bound will start from 1 inside the function and write the function accordingly. The problem is that if you ever then have cause to add the ALLOCATABLE or POINTER attribute to the function then you'll have to completely rewrite the function because suddenly the lower bound is no longer under your control. It's generally a good idea to always specify the lower bound of an array argument to a function, either through a fixed value if it's always the same for all arrays that the function will be used on or by passing in the lower bound as a parameter to the function.


Automatic Reallocation

The fact that Fortran allows you to do whole array operations is another feature that makes it well suited to scientific programming but there are some features that can be mixed blessings. One feature that was added to Fortran 2003 is that if you do a whole array assignment to an allocatable array that is either not allocated or is allocated to be a different size than the source array then the array will be reallocated to match the size of the source. To give an example

PROGRAM p1
  IMPLICIT NONE

  INTEGER, DIMENSION(:), ALLOCATABLE :: array
  INTEGER, DIMENSION(-3:103) :: src
  array = src
  PRINT *, LBOUND(array), UBOUND(array)
END PROGRAM p1

This program always feels like it's invalid but from Fortran 2003 onwards it is completely valid and will give you lower bound of -3 and upper bound of 103 because "array" is automatically allocated when "src" is assigned to it. In this example "array" is not allocated before I used it but if it was then it would have been silently deallocated and reallocated to the new size. This is quite useful for many purposes, it allows you to return an array from a function and just store it in an allocatable array and have everything magically work, and sometimes you do want to do literally what I'm doing in this example and make a copy of an array. Why would this ever be a disadvantage? Because it can make debugging much harder by moving the place where the bug occurs. Imagine the following situation

PROGRAM p1
  IMPLICIT NONE

  INTEGER, DIMENSION(:), ALLOCATABLE :: array1, array2
  INTEGER :: i

  !The incorrect allocation of array2 is a typo
  ALLOCATE(array1(-3:103), array2(-2:103))
  array2 = 1

  array1 = 5 * array2
  DO i = -2, 103
    array1(i) = array(i) - array(i-1)
  END DO

END PROGRAM p1

This program does nothing even remotely useful but it has the same structure as a real program that I had a problem with. There was an error in the size of an array that was then used in an array assignment. The assignment caused a reallocation of "array1" then then meant that the later loop was iterating over more items than the array now had so it crashed during the operation of that loop. Specifically the first iteration of the loop is trying to access the -3 element that now no longer exists. The loop in fact was perfectly well written for how the code should have been working but due to the error in the allocation of array2 the code was now crashing there. Without the implicit reallocation the error could have much more easily been traced to the array assignment (that in the real code was much further away from the crash site than in this simple example). There are a surprising number of ways of tripping this behaviour and causing errors in unrelated parts of your code because of this behaviour so if you get very unexpected array behaviour you should watch out for this one.

A question that then quite often comes up is if you can suppress this behaviour if you don't want it, and you definitely can. Most compilers have an option to disable the behaviour entirely (-fno-realloc-lhs in gfortran for example) and equally assigning to an array sectiondoesn't trigger the behaviour so

PROGRAM p1
  IMPLICIT NONE

  INTEGER, DIMENSION(:), ALLOCATABLE :: array
  INTEGER, DIMENSION(-3:103) :: src
  array(:) = src
  PRINT *, LBOUND(array), UBOUND(array)
END PROGRAM p1

will crash because you are assigning to an unallocated array. Some people say that from F2003 onwards you should probably always do array assignment to an array section if you don't want to invoke the automatic reallocation behaviour. I wouldn't go quite that far but you should definitely consider the question of if any odd bugs that you have are related to the automatic reallocation behaviour


October 2024

Mo Tu We Th Fr Sa Su
Sep |  Today  |
   1 2 3 4 5 6
7 8 9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30 31         

Search this blog

Tags

Galleries

Blog archive

Loading…
RSS2.0 Atom
Not signed in
Sign in

Powered by BlogBuilder
© MMXXIV