Research Software Engineering at Warwick, University of Warwick

April 08, 2022

License choice in the R community

This week a post on the RSE Slacksparked a lot of discussion on how to choose a license for your research software. The website https://choosealicense.com/is a helpful resource and starts with an important point raised by Hugo Grusonthat a good place to start is to consider the license(s) commonly used in your community. But how do you find out this information? This blog post explores the licenses used in the R and Bioconductor communities, by demonstrating how to obtain licencing information on CRAN and Bioconductor packages.

Licenses on CRAN

The Comprehensive R Archive Network (CRAN) repository is the main repository for R packages and the default repository used when installing add-on packages in R. The tools package that comes with the base distribution of R provides the CRAN_package_db() function to download a data frame of metadata on CRAN packages. Using this function, we can see that there are currently 19051 packages on CRAN.

library(tools)
pkgs <- CRAN_package_db()
nrow(pkgs)

## [1] 19051

The license information is in the License column of the data frame. We'll use the dplyr package to help summarise this variable. With n_distinct() we find that there are 164 unique licenses!

library(dplyr)
n_distinct(pkgs$License)

## [1] 164

However, many of these are different versions of a license, e.g.

pkgs |>
  filter(grepl("^MIT", License)) |>
  distinct(License)


##                                   License
## 1                      MIT + file LICENSE
## 2              MIT License + file LICENSE
## 3                      MIT + file LICENCE
## 4 MIT + file LICENSE | Apache License 2.0
## 5                       MIT +file LICENSE
## 6          MIT + file LICENSE | Unlimited

The above output also illustrates that

An additional LICENSE (or LICENCE) file can be used to add additional terms to the license (the year and copyright holder in the case of MIT).
Packages can have more than one license (the user can choose any of the alternatives).
Authors do not always provide the license in a standard form!

A LICENSE file can also be used to on its own to specify a non-standard license. Given this variation in license specification, we will use transmute() to create a new set of variables, counting the number of times each type of license appears in the specification. We create a helper function n_match() to count the number of matches for a regular expression, which helps to deal with variations in the form provided. Finally we check against the expected number of licenses for each package to check we have covered all the options.

n_match <- function(s, x) lengths(regmatches(x, gregexpr(s, x)))
licenses <- pkgs |>
  transmute(
    ACM = n_match("ACM", License),
    AGPL = n_match("(Affero General Public License)|(AGPL)", License),
    Apache = n_match("Apache", License),
    Artistic = n_match("Artistic", License),
    BSD = n_match("(^|[^e])BSD", License),
    BSL = n_match("BSL", License),
    CC0 = n_match("CC0", License),
    `CC BY` = n_match("(Creative Commons)|(CC BY)", License),
    CeCILL = n_match("CeCILL", License),
    CPL = n_match("(Common Public License)|(CPL)", License),
    EPL = n_match("EPL", License),
    EUPL = n_match("EUPL", License),
    FreeBSD = n_match("FreeBSD", License),
    GPL = n_match("((^|[^ro] )General Public License|(^|[^LA])GPL)", License),
    LGPL = n_match("(Lesser General Public License)|(LGPL)", License),
    LICENSE = n_match("(^|[|] *)file LICEN[SC]E", License),
    LPL = n_match("(Lucent Public License)", License),
    MIT = n_match("MIT", License),
    MPL = n_match("(Mozilla Public License)|(MPL)", License),
    Unlimited = n_match("Unlimited", License))
n_license <- n_match("[|]", pkgs$License) + 1
all(rowSums(licenses) == n_license)

## TRUE

Now we can tally the counts for each license, discounting version differences (i.e., GPL-2 | GPL-3 would only count once for GPL). We convert the license variable into a factor so that we can order by descending frequency in a plot.

tally <- colSums(licenses > 0)
tally_data <- 
  tibble(license = names(tally),
         count = tally) |>
  arrange(desc(count)) |>
  mutate(license = factor(license, levels = license))

Bar chart of license frequencies on CRAN as a percentage of the number of packages. The vast majority are GPL (73%), followed by MIT (18%). All other licenses are represented in less than 3% of packages.

tally_data.csv

The vast majority are GPL (73%), followed by MIT (18%). All other licenses are represented in less than 3% of packages. This is consistent with R itself being licensed under GPL-2 | GPL-3. The only licenses in the top 10 that are not mentioned as "in use" on https://www.r-project.org/Licenses/, are the Apache and CC0 licenses, used by 1.7% and 1.1% of packages, respectively. The Apache license is a modern permissive license similar to MIT or the older BSD license, while CC0 is often use for data packages where attribution is not required. A separate LICENSE file is the 3rd most common license among CRAN packages; without exploring further it is unclear if this is always a stand-alone alternative license (as the specification implies) or if it might sometimes be adding further terms to another license.

Licenses on Bioconductor

Bioconductor is the second largest repository of R packages (excluding GitHub, which acts as a more informal repository). Bioconductor specialises in packages for bioinformatics. We can conduct a similar analysis to that for CRAN using the BiocPkgToolspackage. The function to obtain metadata on Bioconductor packages is biocPkgList(). With this we find there are currently 2041 packages on Bioconductor:

library(BiocPkgTools)
pkgs <- biocPkgList()
nrow(pkgs)

## [1] 2041

Still, there are 89 distinct licenses among these pckages:

n_distinct(pkgs$License)

## [1] 89

We can use the same code as before to tally each license and create a plot - the only change made to create the plot below was to exclude licenses that were not represented on Bioconductor.

Bar chart of license frequencies on Bioconductor as a percentage of the number of packages. GPL is still a popular license, represented by 55% of packages. However the Artistic license is also popular in this community (23%). Third to fifth place are taken by MIT (9%), LGPL (7%) and LICENSE (4%), respectively, with the remaining licenses represented in less than 2% of packages.

tally_data_bioc.csv

GPL is still a popular license, represented by 55% of packages. However the Artistic license is also popular in this community (23%). This reflects the fact that the Bioconductor core packages are typically licensed under Artistic-2.0 and community members may follow the lead of the core team. Third to fifth place are taken by MIT (9%), LGPL (7%) and LICENSE (4%), respectively, with the remaining licenses represented in less than 2% of packages. The ACM, BSL, CC0, EUPL, FreeBSD and LPL licenses are unrepresented here.

Summary

Although the Biconductor community is a subset of the R community, it has different norms regarding package licenses. In both communities though, the GPL is a common choice for licensing R packages, consistent with the license choice for R itself.

Heather Turner : 08 Apr 2022 10:32 | Tags: Code Licenses R | Comments (0) | Close comments | Report a problem

March 18, 2022

Getting a little testy

Testing code is essential to knowing if it works. But how do you know what to test? How do you know you've done enough?

Let's be clear to start here, "testing" as we think of it is some form of comparing a code answer to a predicted one, making sure the code "gets it right" or meets expectations. If there is a canonical "right answer" then it's fairly easy to do - if not then things get difficult, if not impossible, but correctness is the goal.

What to test? Well, all of it, of course! We want to know the entire program works, from each function, to the entire chain. We want to know it all hangs together correctly and does what it should.

So when have you done enough testing to be sure? Almost certainly never. In a non trivial program there is almost always more you could test. Anything which takes a user input has almost infinite possibilities. Anything which can be asked to repeat a task for as long as you like has literally infinite possibilities. Can we test them all? Of course not!

Aside - Testing Everything

Why did we say user input is only almost infinite? Well, all numbers in the computer have a fixed number of bits for storage, which means there's a strictly fixed number of numbers. Technically, for a lot of problems, we really can test absolutely every input. In practice, we can't afford the time and anyway, how do we know what the right answer is without solving the problem completely.

Picking random things to check is an idea that's used sometimes, as are "fuzzers", which aim to try a wide range of correct and incorrect inputs to find errors. But these are also costly to perform, and still require us to know the right answer to be really useful. They can find crashes and other "always wrong" behaviour though.

Lastly, in lots of codes, we aren't expected to protect users from themselves, so entering an obviously silly value (like a person's height of 15 ft) needn't give a sensible answer. We can fall back to "Garbage In Garbage Out" to excuse our errors. But our users might be a lot happier if we don't, and in important circumstances we wont be allowed to either.

Back to the Grind

That all sounds rather dreary - writing good tests is hard, and we're saying they're never enough, so why bother? Well, lets back off for a minute and think about this. How do we turn an "infinite" space into a tractable one? Well, we have to make it smaller. We have to impose restrictions, and we have to break links and make more things independent. The smaller the space we need to test, the more completely we can cover it.

Unit testing

Most people have probably heard of unit testing by now - testing individual functions in isolation. It seems like if you do this then you have tested everything, but this is not true. What happens if you call this function followed by that one (interdependency)? What happens if you call this function before you did this other thing (violation of preconditions)?

Unit testing is not the solution! Unit testing is the GOAL!

If we could reliably say that all of the units working means the program works, then we can completely test our program, which is amazing! But in reality, we can't decouple all the bits to that extent, because our program is a chain of actions - they will always be coupled because the next action occurs in the arena set up by the preceeding ones. But we have to try!

Meeting the Goal

So we have to design, architect and write our programs to decouple as much as possible, otherwise we can't understand them, struggle to reason about them, and are forced to try and test so much we will certainly fail. A lot of advice is given with this sort of "decoupling" in mind - making sure the ways in which one part of a program affects another are:

as few as possible
as obvious as possible
as thin/minimal as possible

What sorts of things do we do?

Avoid global variables as much as possible as anything which touches them is implicitly coupled

C++ statics are a form of global, as are public Fortran module variables (in most contexts)

Similarly, restrict the scope of everything to be as small as we can to reduce coupling of things between and inside our functions
- C++ namespaces, Fortran private module variables, Python "private" variables (starting with an underscore, although not enforced in any way by the language)
Avoid side-effects from functions where possible, as these produce coupling. Certainly avoid unexpected side effects (a "getter" function should not make changes)
- Fortran offers PURE functions explicitly. C++ constexpr is a more complicated, but related idea
Use language features to limit/indicate where things can change. Flag fixed things as constant.
- Use PARAMETER, const etc freely
- In Fortran, use INTENT, in C/C++ make function arguments const etc
- In C++ try for const correctness, and make member functions const where possible. Use const references for things you only want to read
Reduce the "branchiness" (cyclomatic complexity) of code. More paths means more to test.
Keep "re-entrancy" in mind - a function which depends only on its arguments, doesn't change it's arguments, and has no side effects can be called over and over and always give the same answer. It can be interrupted/stopped and started afresh and still give the same answer. This is the ultimate unit-test friendly function!

Overall, we are trying to break our code into separate parts, making the "seams"between parts small or narrow. If we drew a graph of the things in our code and denote links between them with lines, we want blobs connected with few lines wherever possible.

Every link is something we can't check via unit testing, and unit testing is king because it is possible to see that we have checked every function, although we can never see if we have checked every possible behaviour within the function, even with code coverage tools.

Aside - Writing useful unit tests

Even if everything is unit-testing amenable, it doesn't make it actually easy. There's a few things to keep in mind when writing the tests too.

Poke at the edge-cases. If an argument of '3' works, then '6' probably will too, but will '0'? What about a very large number? Even (a+b)/2 for an average will break down if a + b overflows! A negative number? Look for the edges where behaviour might change, and test extra hard there
If your function is known only to work for certain inputs, make them preconditions (things which must be true). This can be done either in documentation, or using things like assert
Check for consistency. If you mutate an object, check it is still a valid one at the end
Try to come up with things you didn't think about when you wrote the function, or that no "sensible caller" would do
Check that things fail! If you have error states or exceptions, make sure these do occur under error conditions

Dealing with the Rest

If we could actually produce a piece of code with no globals, no side-effects and every function fully re-entrant, unit tests would be sufficient to check everything. Sadly, this is impossible. Printing to screen is a side-effect. All useful code has some side effects somewhere. Mutating an argument makes a function non-reentrant (we'd have to make a copy and return the new one instead, and that has costs). So we seem doomed to fail.

But that is OK. We said unit-testing is a goal, something we're trying to make as useful as possible. We can do other things to make up for our shortfalls, and we definitely should, even if we think we got everything. Remember, all of those bullets are about minimising the places we violate them, minimising the chances of emergent things happening when we plug functions together.

We need to check actual paths through our entire code (integration testing). We need to check that things we fixed previously still work (regression testing). We probably want to run some "real" scenarios, or get a colleague to do so (sorta beta testing). These are hard, and we might mostly do them by just running our program with our sceptical hats on and watching for any suspicious results. This is not a bad start, as long as we keep in mind its limitations.

Takeaway

What's the takeaway point here? Unit Testing works well on code that is written with the above points in mind. Those points also make our code easier to understand and reason about, meaning we're much less likely to make mistakes. Honestly, sometimes writing the code to be testable has already gained us so much that the tests themselves are only proving what we already know. Code where we're likely to have written bugs is unlikely to fail our unit tests - the errors will run deeper than that.

So don't get caught up in trying to jam tests into your existing code if that proves difficult. You won't gain nearly as much as by rewriting it to be test friendly, and then you almost get your tests for free. And if you can't do that, sadly unit tests might not get you much more than a false sense of security anyway.

Heather Ratcliffe : 18 Mar 2022 13:26 | Tags: Philosophy Programming Testing | Comments (0) | Close comments | Report a problem

March 07, 2022

What's the time right now?

This is not one of my mistakes, this is one I read about second hand (you'll see the pun in a moment), but it is a great illustration of the importance of writing the code you actually mean to, thinking properly about what it is that you mean, and watching out for weasel words that might confuse you.

Consider these two lines of code and decide for youself whether they have the same effect:

if (DateTime.Now.Second == 0 && DateTime.Now.Minute == 0)

if (DateTime.Now.Minute == 0 && DateTime.Now.Second == 0)

Decided yet? If you think they are the same, take a second now (that's a hint!) to think about the meaning of each part, and what its value will be. Suppose the original author intended to run some process once per hour, so put one of these into a loop. Can you see why one of these might succeed, and the other fail occasionally (assume the rest of the code takes some tiny fraction of a second)?

By the way, if you're worried about not knowing what language this is in, don't be. This is a perfect example of good "self documenting code" which we've written a bit about before. Based on the names, we can infer (correctly) that DateTime is some sort of thing, which can supply the date-and-time Now through the "DateTime.Now" construct, and that we can then look at the Minutes and Seconds value of the time Now. That's all we need to know.

So, back to the question. Are the lines the same? Well, the order is different. There's a double-& AND operation. We don't strictly know whether the left or right condition will be checked first (it might depend on language), and we don't actually know if both will be (if you're not sure why that is, we've talked about short circuiting in logical operations here). But that shouldn't matter unless the DateTime object is having some weird side-effects and I promise it isn't. The problem is much more fundamental than that.

If you haven't spotted it, it's time to write down very clearly what we wanted to check. We want to know if, right now, the value of both the minutes and the seconds entries are 0. It looks like that's what either line does, but there's a nasty weasly word sneaking in here - the word "now". What can "now" actually mean in programming terms, where we know operations ultimately are carried out one by one in sequence? For instance, what would you expect the following code could print?

if (DateTime.Now.Second == 0) print(DateTime.Now.Second)

If you said 0 only, hold on a second (that's a hint, too) and look again, and think about what this code will actually do in terms of basic operations. It will get a value for DateTime.Now.Second. It will compare this to 0. If the comparison is true it will get a value for DateTime.Now.Second, and it will print this value. And there is the problem! The value we check and the value we print are not always the same.

If it still isn't clear to you why the original two lines are not equivalent, reread that last paragraph a few times and apply the same idea to the original question. If it's been obvious to you all along, great. You're unlikely to make the mistake this original programmer did. But why did they make this mistake? There could be several reasons:

They might simply not have spotted there could be a problem - this is pretty likely, but not very interesting
They might have thought somehow the compiler/runtime would recognise the values as "the same" - this is a pretty fundamental mistake, perhaps due to thinking of "Now" as a property rather than an operation. In some very imprecise sense, we should think of "Now" as having a side-effect on itself because it takes time, so multiple calls will not have the same result as a single one
They might have blindly removed a temporary variable containing DateTime.Now, without thinking deeply about the consequences. This is quite likely because it looks so simple, and is why refactoring like this needs, if anything, more attention than writing the code in the first place
Lastly, they might have thought carefully and precisely, but ultimately wrongly about it. If they confused the idea of seconds as a time point with seconds as a time interval, they may have thought that both the Seconds and the Minutes property would surely be evaluated within the same second, and thus work as expected, where actually even a microsecond between them could be the difference between 1:59.999 and 2:00.000

So, that explains why the lines don't work as intended, but there's one weird thing left - how come one of them (the first one, if evaluation occurs left to right) works, but the other doesn't? Effectively, the minutes value depends on the seconds value - the minute is incremented only when the seconds reads 59. We know the whole operation takes far far less than 1 second (ignoring interrupts or anything else suspending operation), so once we have confirmed the seconds value as 0, we have at least 59-60 seconds to check the minutes value, during which it cannot change. Certainly this will occur within that interval, and the code will work perfectly (again, ignoring external iterrupts, which our code can never account for anyway).

So, does that mean the first option is actually safe, or good code? Well, not without a comment it certainly isn't! If we're relying on this sort of subtlety, somebody is going to be tripped up by it, and it might even be us. It's also a bit delicate - suppose we were to need to use DateTime.Now in another part of the condition, or in the body? We'd have to think very hard about whether we're still "safe" and we simply don't need to. A temporary variable is much clearer, and the possible optimisation here is a) tiny and b) best left to our compiler anyway. Oh and c) possibly a de-optimisation anyway, depending on the cost of the DateTime.Now lookup. Also, it'd be easy to forget that the left-to-right ordering is important, and we might try this in a situation or language where that is not guaranteed.

Ultimately, I simply don't like relying on such a subtlety. It's too clever and it doesn't need to be. It's not that I don't like clever code, because I do. I like clever code that makes one exclaim "oh yes, that's so simple, it's brilliant", and this aint that. This is more "oh blimey, is that how it works". Keep it simple, and if you can't keep it simple, at least make it satisfying.

Postscript

Before we go, I hope something has been nagging at you throughout this post. I started off by saying we should think hard about what we mean and what we want to actually do. I introduced this idea of minute-0 and second-0 and then proposed it as a solution to doing something once per hour. Did it strike you as a pretty terrible solution to that problem? Because it is. Suppose we start running our code at 9:00:30 AM. Would we expect whatever this block controls to wait an entire hour to fire for the first time? What if the code is never running at the change of the hour? This block would never run. Suppose we start 1000 copies of this code. Do we really want all 1000 to execute this block at the same time? Suppose some other code took longer than expected and this block isn't reached once per second - some hours might be simply skipped. In nearly all cases we'd be better checking whether at least an hour has elapsed since last we did the thing, and we'd avoid all of these problems. We meant once per hour, not on the first second of every hour, and we should have coded that, not this.

This post is inspired by an article I read a while ago, namely https://thedailywtf.com/articles/an-hourly-rate In particular, a comment mentions the different behaviour of the two possible lines, which this post focuses on.

Heather Ratcliffe : 07 Mar 2022 16:11 | Tags: Bugs Code Programming | Comments (0) | Close comments | Report a problem

March 02, 2022

Learning to Program

We (Warwick RSE) love quotes, and we love analogies. We do always caution against letting your analogies leak, by supposing properties of the model will necessarily apply to reality, but nontheless, a carefully chosen story can illustrate a concept like nothing else. This post discusses some of my favorite analogies for the process of learning to program - which lasts one's entire programming life, by the way. We're never done learning.

Jigsaw Puzzles

Learning to program is a lot like trying to complete a jigsaw puzzle without the picture on the box. At first, you have a blank table and a heap of pieces. Perhaps some of the pieces have snippets you can understand already - writing perhaps, or small pictures. You know what these depict, but not where they might fit overall. You start by gathering pieces that somehow fit together - edges perhaps, or writing, or small amounts of distinctive colours. You build up little blocks (build understanding of isolated concepts) and start to fit them together. As the picture builds up, new pieces fit in far more easily - you use the existing picture to help you understand what they mean. Of course, the programming puzzle is never finished, which is where this analogy starts to leak.

Jigsaw image

I particularly like this one for two reasons. Firstly, one really does tend to build little isolated mini-pictures, and move them around the table until they start to connect together, both in jigsaws and in programming. Secondly, occasionally, you have these little moments - a piece that seemed to be just some weird lines fits into place and you say "oh is THAT what that is!". One gets the same thing when programming - "oh, is that how that works!" or "is that what that means".

Personally, a lot of these moments then become "Why didn't anybody SAY that!". This was the motivation behind our blog series of "Well I Never Knew That" - all those little things that make so much sense once you know.

The other motivation for the WINKT series is better illustrated using a different analogy - hence why we like to have plenty on hand.

A Walk in the Woods

Another way to look at the learning process, is as a process of exploring some new terrain, building your mental map of it. At first, you mostly stick to obvious paths, slowly venturing further and further. You find landmarks and add them to your map. Some paths turn out to link up to other paths, and you start to assemble a true connected picture of how things fit together. Yet still there is terrain just off the paths you haven't gone into. There could be anything out there, just outside your view. Even as you venture further and deeper into the woods, you must also make sure to look around you, and truly understand the areas you've already walked through, and how they connect together.

My badly done keynote picture is below. The paths are thin lines, the red fuzzy marks show those we have explored and the grey shading shows the areas we've actually been into and now "understand". Eventually we notice that the two paths in the middle are joined - a nice "Aha" moment of clarity ensues. But much of our map remains a mystery.
Schematic map with coverage marked

And that is one of the reasons I really like this analogy - all of the things you still don't know - all that unexplored terrain. This too is very like the learning process - you are able to use things once you have walked their paths, even though there may be details about them you don't understand. On the one hand, this is very powerful. On the other, it can lead to "magic incantations" - things you can repeat without actually understanding. This is inevitable in a task this complex, but it is important to go back and understand eventually. Don't let your map get too thin and don't always stick to the paths, or when you do step off them, you'll be lost.

This was the other main motivation behind our WINKT series - the moments when things you do without thinking suddenly make sense, and a new chunk of terrain gets filled in. Like approaching the same bit of terrain from a new angle, or discovering that two paths join, you gain a new perspective, and ideally, this new perspective is simpler than before. Rather than two paths, you see there is only one. Rather than two mysterious houses deep in the woods, you see there is one, seen from two angles.

Take Away Point

If you take one thing away from this post, make it this: the more you work on fitting the puzzle pieces together, the clearer things become. Rather than brown and green and blue and a pile of mysterious shapes, eventually you just have a horse.

And one final thing, if we may stretch the jigsaw analogy a little further: just because a piece seems to fit, doesn't mean it's always the right piece...

Heather Ratcliffe : 02 Mar 2022 12:52 | Tags: Code Philosophy Programming Winkt | Comments (0) | Close comments | Report a problem

December 08, 2021

Advent of Code 2021: First days with R

The Advent of Code is a series of daily programming puzzles running up to Christmas. On 3 December, the Warwick R User Group met jointly with the Manchester R-thritis Statistical Computing Group to informally discuss our solutions to the puzzles from the first two days. Some of the participants shared their solutions in advance as shared in this slide deck.

In this post, Heather Turner (RSE Fellow, Statistics) shares her solutions and how they can be improved based on the ideas put forward at the meetup by David Selby (Manchester R-thritis organizer) and others, while James Tripp (Senior Research Software Engineer, Information and Digital Group) reflects on some issues raised in the meetup discussion.

R Solutions for Days 1 and 2

Heather Turner

Day 1: Sonar Sweep

For a full description of the problem for Day 1, see https://adventofcode.com/2021/day/1.

Day 1 - Part 1

How many measurements are larger than the previous measurement?

199  (N/A - no previous measurement)
200  (increased)
208  (increased)
210  (increased)
200  (decreased)
207  (increased)
240  (increased)
269  (increased)
260  (decreased)
263  (increased)

First create a vector with the example data:

x <- c(199, 200, 208, 210, 200, 207, 240, 269, 260, 263)

Then the puzzle can be solved with the following R function, that takes the vector x as input, uses diff() to compute differences between consecutive values of x, then sums the differences that are positive:

f01a <- function(x) {
  dx <- diff(x)
  sum(sign(dx) == 1)
}
f01a(x)

## [1] 7

Inspired by David Selby’s solution, this could be made slightly simpler by finding the positive differences with dx > 0, rather than using the sign() function.

Day 1 - Part 2

How many sliding three-measurement sums are larger than the previous sum?

199  A           607  (N/A - no previous sum)
200  A B         618  (increased)
208  A B C       618  (no change)
210    B C D     617  (decreased)
200  E   C D     647  (increased)
207  E F   D     716  (increased)
240  E F G       769  (increased)
269    F G H     792  (increased)
260      G H
263        H

This can be solved by the following function of x. First, the rolling sums of three consecutive values are computed in a vectorized computation, i.e. creating three vectors containing the first, second and third value in the sum, then adding the vectors together. Then, the function from Part 1 is used to sum the positive differences between these values.

f01b <- function(x) {
  n <- length(x)
  sum3 <- x[1:(n - 2)] + x[2:(n - 1)] + x[3:n]
  f01a(sum3)
}
f01b(x)

## [1] 5

David Schoch put forward a solution that takes advantage of the fact that the difference between consecutive rolling sums of three values is just the difference between values three places apart (the second and third values in the first sum cancel out the first and second values in the second sum). Putting what we’ve learnt together gives this much neater solution for Day 1 Part 2:

f01b_revised <- function(x) {
  dx3 <- diff(x, lag = 3)
  sum(dx3 > 0)
}
f01b_revised(x)

## [1] 5

Day 2: Dive!

For a full description of the problem see https://adventofcode.com/2021/day/2.

Day 2 - Part 1

forward X increases the horizontal position by X units.
down X increases the depth by X units.
up X decreases the depth by X units.

                  horizontal  depth
forward 5   -->            5      -
down 5      -->                   5
forward 8   -->           13
up 3        -->                   2
down 8      -->                  10
forward 2   -->           15

==> horizontal = 15, depth = 10

First create a data frame with the example data

x <- data.frame(direction = c("forward", "down", "forward",
                              "up", "down", "forward"),
                amount = c(5, 5, 8, 3, 8, 2))

Then the puzzle can be solved with the following function, which takes the variables direction and amount as input. The horizontal position is the sum of the amounts where the direction is “forward”. The depth is the sum of the amounts where direction is “down” minus the sum of the amounts where direction is “up”.

f02a <- function(direction, amount) {
  horizontal <- sum(amount[direction == "forward"])
  depth <- sum(amount[direction == "down"]) - sum(amount[direction == "up"])
  c(horizontal = horizontal, depth = depth)
}
f02a(x$direction, x$amount)

## horizontal      depth 
##         15         10

The code above uses logical indexing to select the amounts that contribute to each sum. An alternative approach from David Selby is to coerce the logical indices to numeric (coercing TRUE to 1 and FALSE to 0) and multiply the amount by the resulting vectors as required:

f02a_selby <- function(direction, amount) {
  horizontal_move <- amount * (direction == 'forward')
  depth_move <- amount * ((direction == 'down') - (direction == 'up'))
  c(horizontal = sum(horizontal_move), depth = sum(depth_move))
}

Benchmarking on 1000 datasets of 1000 rows this alternative solution is only marginally faster (an average run-time of 31 μs vs 37 μs), but it has an advantage in Part 2!

Day 2 - Part 2

down X increases your aim by X units.
up X decreases your aim by X units.
forward X does two things:
- It increases your horizontal position by X units.
- It increases your depth by your aim multiplied by X.

                  horizontal  aim  depth
forward 5   -->            5    -      - 
down 5      -->                 5      
forward 8   -->           13          40
up 3        -->                 2
down 8      -->                10
forward 2   -->           15          60

==> horizontal = 15, depth = 60

The following function solves this problem by first computing the sign of the change to aim, which is negative if the direction is “up” and positive otherwise. Then for each change in position, if the direction is “forward” the function adds the amount to the horizontal position and the amount multiplied by aim to the depth, otherwise it adds the sign multiplied by the amount to the aim.

f02b <- function(direction, amount) {
  horizontal <- depth <- aim <- 0
  sign <- ifelse(direction == "up", -1, 1)
  for (i in seq_along(direction)){
    if (direction[i] == "forward"){
      horizontal <- horizontal + amount[i]
      depth <- depth + aim * amount[i]
      next
    }
    aim <- aim + sign[i]*amount[i]
  }
  c(horizontal = horizontal, depth = depth)
}
f02b(x$direction, x$amount)

## horizontal      depth 
##         15         60

As an interpreted language, for loops can be slow in R and vectorized solutions are often preferable if memory is not an issue. David Selby showed that his solution from Part 1 can be extended to solve the problem in Part 2, by using cumulative sums of the value that represented depth in Part 1 to compute the aim value in Part 2.

f02b_revised <- function(direction, amount) {
  horizontal_move <- amount * (direction == "forward")
  aim <- cumsum(amount * (direction == "down") - amount * (direction == "up"))
  depth_move <- aim * horizontal_move
  c(horizontal = sum(horizontal_move), depth = sum(depth_move))
}
f02b_revised(x$direction, x$amount)

## horizontal      depth 
##         15         60

Benchmarking these two solutions on 1000 data sets of 1000 rows, the vectorized solution is ten times faster (on average 58 μs vs 514 μs).

Reflections

James Tripp

How do we solve a problem with code? Writing an answer requires what some educators call computational thinking. We systematically conceptualise the solution to a problem and then work through a series of steps, drawing on coding conventions, to formulate an answer. Each answer is different and, often, a reflection of our priorities, experience, and domains of work. In our meeting, it was wonderful to see people with a wide range of experience and differing interests.

Our discussion considered the criteria of a ‘good solution’.

Speed is one criteria of success - a solution which takes 100 μs (microseconds) is better than a solution taking 150 μs.
Readability for both sharing with others (as done above) and to help future you, lest you forget the intricacies of your own solution.
Good practice such as variable naming and, perhaps, avoiding for loops where possible. Loops are slower and somewhat discouraged in the R community. However, some would argue they are more explicit and helpful for those coming from other languages, such as Python.
Debugging friendly.Some participants, including Heather Turner and David Selby, checked their solutions with tests comparing known inputs and outputs. I drew on my Psychology experience and opted for an explicit DataFrame where I can see each operation. Testing is almost certainly a better solution which I adopt in my packages.
Generalisability. A solution tailored for the Part 1 task on a given day may not be easily generalisable for the Part 2 task. It seemed desirable to refactor one’s code to create a solution which encompasses both tasks. However, the effort and benefits of doing so are certainly debatable.

We also discussed levels of abstraction. The tidyverse family of R packages is powerful, high-level and quite opinionated. Using tidyverse functions returned some intuitive, but slower solutions where we were unsure of the bottlenecks. Solutions built on R base (the functions which come with R) were somewhat faster, though others using libraries such as data.table were also rather quick. These reflections are certainly generalisations and prompted some discussion.

How does one produce fast, readable, debuggable, generalisable code which follows good practice and operates at a suitable level of abstraction? Our discussions did not produce a definitive answer. Instead, our discussions and sharing solutions helped us understand the pros and cons of different approaches and I certainly learned a few useful tricks.

Heather Turner : 08 Dec 2021 10:00 | Tags: Code Programming R | Comments (0) | Close comments | Report a problem

May 06, 2020

Shebang shebang

The "shebang" is the combination of characters #!, called "hash bang", and sometimes shortened to "sh-bang" or commonly "shebang". You might see this in the first line of things like Bash scripts, Python scripts etc, and wonder what its for, and how to use it. if so, keep reading.

Where does it come from?

The hash-bang combined character was introduced in the world of computers around 1979, although it took a while for the use to standardise. It's a nice, human readable and memorable way to encode two Hex numbers which have a special meaning to the underlying operating system. The hash name is pretty standard, the bang is a bit unusual. According to Wikipedia, the name "bang" for the character "!" was common in the 50s, possibly from comic book gun sounds (Bang!). If the title reminds you of something else about the 50's, you're probably thinking of Sh-boom sh-boom (Life could be a dream). That seems to show up anywhere a TV show wants to indicate the time period.

Where does it go?

In normal bash scripts (or Python), the hash character means a comment. It might then look like the "special" character is the exclamation mark after it, but that's not quite it. The shebang goes far beyond just bash scripts, right into the depths of the operating system. This means it is vitally important that this be the first characters in the file. There can be NO space before them (unlike a simple comment), NO space between them, and NO other characters before or between.

What does it do?

So what does it actually do? If a script with a shebang is run (e.g. typing `./scriptname` in the shell), the hash bang tells the operating system that whatever follows on that line is the interpreter to use to run the script. That is, it should name a program (in full - no relative paths here) that can run the script. Usually, this will be something like `/bin/python` or `/bin/bash` for Python or bash respectively. You can also specify a particular Python version e.g. `/bin/python3` and can do some things with passing flags (see e.g. Wikipedia for some details).

The handy thing about this is you no longer have to specify how to invoke the script, you just run it. If you're a mousey-clicky sort of person, once you've put in the shebang, if you set your OS to run the file (on Linuxes, choose Run not Display in prefs, on OSX choose to open with Terminal, on Windows this depends how you installed Python, see e.g. here) you can double-click to run, which can be handy.

What are the problems?

So that sounds great. But there are a few problems. The way we described using this above, specifying `/bin/python`only works on computers where that specifies the Python the user wants invoked, and where that path is correct for the system at hand. Lots of people might have multiple Pythons, and prefer one be used, or they might have installs in custom places. On Unix-like systems (including OSX) there is a handy utility to deal with this, namely `/usr/bin/env`. This program "knows" what the user has configured to be invoked when they type a program name, such as "python". You can check it by typing e.g. `/usr/bin/env python --version` into a terminal (shell). I get 2.7.10, or if I try python3 I get 3.7.7. The "normal terminal" is probably "sh" or "bash" (commonly those give the same), and you may have other shells such as csh, tcsh or zsh.

You can use this to get the right program with the shebang method too. Instead of `/bin/python` in the first line of your script, use `#! /usr/bin/env python` or bash or python3 and your script will be run with that. Remember to `chmod u+x` (change access mode) to make the file executable too!

Excellent as this is, it has its own minor issue worth noting. On some systems, bin/env will be handed everything that follows as a single command, such as "python --version" in the example we did above. So rather than looking up the "python" program and handing it the --version argument, it tries to look up some non-existent program with a space in the name. It is best to avoid passing any arguments in the shebang line!

Got it?

Hopefully that all makes sense. The shebang is slightly mysterious at first, but very handy once you know what it does.

Heather Ratcliffe : 06 May 2020 11:43 | Tags: Programming Scripting | Comments (0) | Close comments | Report a problem

March 04, 2020

Scheduling of OpenMP

OpenMP is one of the most popular methods in academic programming to write parallel code. It's major advantage is that you can achieve performance improvements in a lot of cases by just putting "directives" into your code to tell the compiler "You can take this loop and split it up into sections for each processor". Other parallelism schemes like MPI or Intel Threaded Building Blocks or Coarray Fortran all involve designing your algorithm around splitting the work up, OpenMP makes it easy to simply add parallelism to bits where you want it. (There are also lots of bits of OpenMP programming that require you to make changes to your code but you can get further than in pretty much any other modern paradigm without having to alter your actual code).

So what does this look like in practice?

 
MODULE prime_finder
  USE ISO_FORTRAN_ENV
  IMPLICIT NONE

  INTEGER(INT64), PARAMETER ::  small_primes_len = 20
  INTEGER(INT64), DIMENSION(small_primes_len), PARAMETER :: &
      small_primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, &
      53, 59, 61, 67, 71]
  INTEGER, PARAMETER :: max_small_prime = 71

  CONTAINS

  FUNCTION check_prime(num)
    INTEGER(INT64), INTENT(IN) :: num
    LOGICAL :: check_prime

    INTEGER(INT64) :: index, end

    end = CEILING(SQRT(REAL(num, REAL64)))

    check_prime = .FALSE.
    !First check against the small primes
    DO index = 1, small_primes_len
      IF (small_primes(index) == num) THEN
        check_prime = .TRUE.
        RETURN
      END IF
      IF (MOD(num, small_primes(index)) == 0) THEN
        RETURN
      END IF
    END DO

    !Test higher numbers, skipping all the evens
    DO index = max_small_prime + 2, end, 2
      IF (MOD(num, index) == 0) RETURN
    END DO
    check_prime = .TRUE.
  END FUNCTION check_prime

END MODULE prime_finder

PROGRAM primes

  USE prime_finder
  IMPLICIT NONE
  INTEGER(INT64) :: ct, i

  ct = 0_INT64
!$OMP PARALLEL DO REDUCTION(+:ct)
  DO i = 2_INT64, 20000000_INT64
    IF (check_prime(i)) ct = ct + 1
  END DO
!$OMP END PARALLEL DO

  PRINT *, "Number of primes = ", ct

END PROGRAM primes

This is a Fortran 2008 program (although OpenMP works on Fortran, C and C++) that uses a very simple algorithm to count the number of prime numbers between 2 and 20,000,000. There are much better algorithms for this but this algorithm correctly counts this number of primes and is very suitable for parallelising since each number is checked for primality separately. This code is already OpenMP parallelised and as you can see the parallelism is very simple. The only lines of OpenMP code are !$OMP PARALLEL DO REDUCTION(+:ct)and$OMP END PARALLEL DO . The first line says that I want this loop to be parallel and that the variable ct should be calculated separately on each processor and then summed over all processors at the end. The second line just says that we have finished with parallelism and should switch back to running the code in serial after this point. Otherwise that is exactly the same program that I would have written if I was testing the problem on a single processor and I get the result that there are 1,270,607 primes less than 20,000,000 regardless of how many processors I run on. So far so good but look what happens when I look at the timings for different numbers of processors

Number of Processors	Time(s)
1	11.3
2	7.2
4	3.9
8	2.2
16	1.13

It certainly speeds up! But not as much as it should since every single prime is being tested separately (the number for 16 processors would be 0.71 seconds if it was scaling perfectly). There are lots of reasons why parallel codes don't speed up as much as they should but in this case it isn't any underlying limitation of the hardware but is to do with how OpenMP chooses to split up my problem and a simple change gets the runtime down to 0.81s. The difference between 1.1 seconds and 0.8 seconds isn't much but if your code takes a day rather than a second then 26.5 hours vs 19.2 hours can be significant in terms of electricity costs and cost of buying computer time.

So what is the problem? The problem is in how OpenMP chooses to split the work up over the processors. It isn't trivial to work out how long it will take to check all numbers in a given range for primality (in fact that is a harder problem than just counting the primes in that range!) so if you just do the obvious way of splitting that loop (processor 1 gets 3->10,000,000 processor 2 gets 10,000,001 to 20,000,000) then one of those processors will be getting more work than the other one will and will have to wait until that second processor has finished before it can give you the total number of primes. By default OpenMP does exactly that simple way of splitting the loop up so you don't get all of the benefit that you should from the extra processors. The solution is to specify a SCHEDULE command on the !$OMP PARALLEL DO line. There are 3 main options currently for scheduling : STATIC (the default), DYNAMIC (each iteration of the loop is considered separately and handed off in turn to a processor that has finished it's previous work) and GUIDED (the iterations are split up into work blocks that are "proportional to" the number of currently undone iterations divided by the number of processors. When each processor has finished it's block it requests another one.). (There are also two others, AUTO that has OpenMP try to work out the best strategy based on your code and RUNTIME that allows you to specify one of the 3 main options when the code is running rather than when you compile it). You can also optionally specify a chunk size for each of these that for STATIC and DYNAMIC tries to gang together blocks of chunksizeand then split them off to processors in sequence and for GUIDED makes sure that the work blocks never get smaller than the specified chunk size. For this problem GUIDED gives the best results and switching !$OMP PARALLEL DO REDUCTION(+:ct) SCHEDULE(GUIDED)for !$OMP PARALLEL DO REDUCTION(+:ct)in that code gives you the final runtime of 0.8 seconds on 16 processors which is a good overall performance. But the message here is much more about what OpenMP is doing behind the scenes than the details of this toy problem. The OpenMP standard document (https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5.0.pdf) is quite explicit on some of what happens but other bits are left up to the implementer.

So what do these do in practice? We can only talk in detail about a single implementation so here we're going to talk about the implementation in gcc(gfortran) and I'm using version 9.2.1. You can write a very simple piece of code that fills an array with a symbol representing which processor worked on it to see what's happening. For a simple 20 element array with 2 processors you get the following results (* = processor 1, # = processor 2)

 PROGRAM test
  
  USE omp_lib
  IMPLICIT NONE
  INTEGER, PARAMETER :: nels = 20
  INTEGER, DIMENSION(:), ALLOCATABLE :: ct, proc_used
  CHARACTER(LEN=*), PARAMETER :: symbols &
      = "*#$%ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890"
  INTEGER :: nproc, proc, ix

  nproc = omp_get_max_threads()
  ALLOCATE(ct(nproc))
  ct = 0
  ALLOCATE(proc_used(nels))

!$OMP PARALLEL PRIVATE(proc, ix)
  proc = omp_get_thread_num() + 1
!$OMP DO SCHEDULE(GUIDED)
  DO ix = 1, nels
    ct(proc) = ct(proc) + 1
    proc_used(ix) = proc
  END DO
!$OMP END DO
!$OMP END PARALLEL

 DO ix = 1, nproc
   PRINT "(A, I0, A, I0, A)", "Processor ", ix, " handled ", ct(ix), " elements"
 END DO

 DO ix = 1, nels
   IF (proc_used(ix) <= LEN(symbols)) THEN
     WRITE(*,'(A)', ADVANCE = 'NO') symbols(proc_used(ix):proc_used(ix))
   ELSE
     WRITE(*,'(A,I0,A)', ADVANCE = 'NO') "!", proc_used(ix),"!"
   END IF
 END DO

 PRINT *,""
END PROGRAM test

SCHEDULE command	Pattern
OMP DO SCHEDULE(STATIC)	**********##########
OMP DO SCHEDULE(STATIC,4)	**########**
OMP DO SCHEDULE(DYNAMIC)	#******#####*#
OMP DO SCHEDULE(DYNAMIC)	##************##
OMP DO SCHEDULE(DYNAMIC,4)	**########**
OMP DO SCHEDULE(GUIDED)	##########*****#####

You can see immediately that the scheduling behaviour is very different for the different commands. The simple STATIC scheduler simply splits the array into two with each processor getting half of the domain. STATIC,4 specifies a chunk size of 4 and does the same type of splitting but with processors getting adjacent chunks of 4 items. DYNAMIC produces quite complex interleaved patterns of processor responsibility but as you can see two runs with the same SCHEDULE command produce different patterns so this really is a dynamic system - you can't predict what processor is going to run what index of a loop (mostly this doesn't matter but there are plenty of real-world cases where efficiency dicatates that you always want the same processor working on the same data). Finally, GUIDED produced a rather strange looking pattern where processor 2 does most of the work. This pattern is not guaranteed but did drop out of multiple runs quite regularly. It appears to be that processor 1 gets a lot of housekeeping work associated with running in parallel (i.e. actually running the OpenMP library code itself) so the system gives more of the work in my code to processor 2.

An interesting thing that you can immediately see from these results is that I should be able to do something other than GUIDED for my prime number code. Since the problem is that certain chunks of the space of numbers to test are harder to process than other chunks I should be able to fix the STATIC schedule just by using a small-ish chunk size rather than giving every processor 1/16 of the entire list. This would mean that all of the processors would get bits of the easy numbers and bits of the hard numbers. Since there is always a cost for starting a new chunk of data I'm guessing that with 20,000,000 primes to test a chunk size of 1000 would seem suitable, so I go back to my prime code and put in SCHEDULE(STATIC, 1000). If I now run on 16 processors I get a time of 0.80 seconds, slightly faster than my GUIDED case indicating that it was the fact that some processors have an easier time of it than others is the problem.

Take away message for this is that when running parallel code it is vitally important that you understand the question of how your problem is being split up and whether that means that all of your processors will have the same amount of work. You have a decent array of tools currently to control how work is distributed and future releases of OpenMP should also have the option of specifying your own scheduling systems!

Christopher Brady : 04 Mar 2020 15:43 | Tags: Code Hpc Programming | Comments (0) | Close comments | Report a problem

January 22, 2020

Magic numbers

We've mentioned magic number before but they are one of those things that are always worth mentioning again. Magic numbers are literal numbers that you write into your code that impact the clarity of the code. The alternative to magic numbers is to store the values into variables or preprocessor directives or other ways of associating a number with a name. Exactly what impacts the clarity of your code is often a bit subjective and is certainly research field specific but there are a few rules of thumb

Is the number only an approximation? You don't want to risk using different approximations in different places in your code. Imagine what would happen if you defined Pi to have different values when using it in different contexts?
Is the number immediately recognizable? As a physicist I'd recognize "0.5*m*v**2" when I see it in code as being the kinetic energy of an object despite the 0.5 but I'd have more trouble with "4.0e-7 * pi". It's probably the pre-2018 definition of the vacuum permeability but it isn't entirely clear.
Is the number arbitrary? It's quite common to use a number to specify things like which problem to run or which optional package to use. You might remember what problem 3 is right now but you'll have to look if you ever come back to this code. Even worse, if you lose discipline then you might change what the numbers do if one of the test cases becomes redundant. Replacing your arbitrary numbers with named constants severely reduces these problems. If you give your problem number a sensible name then you'll have a much better chance of remembering what it does, and similarly if you ever remove a problem then if you remember to remove the named constant as well then you won't be able to compile the code when trying to run the old test. A lot of languages provide "enumerated types" or "enums" that allow you to automatically map numbers to names and can also prevent a user from chosing to supply a simple integer instead.

It can feel like this is unnecessary and slows you down when you are just writing a quick code, but one of the major problems with academic software is that it tends to grow. You write a code to solve a problem that you encounter during your research and you aren't terribly careful because you aren't going to keep it. But it is quite likely that you won't put it to one side and never touch it again. You might encounter a similar problem later and modify the code. At that point elements like magic numbers are annoying but you can usually work out what your own code is doing. The major problem comes if you move on and your code is inherited by someone else. In this case they might work everything out perfectly (which is good), they might find that they can't understand it and have to start again (which is annoying but not too troublesome) but worst of all is that they might misunderstand what your code is doing and make changes that are incorrect.

Christopher Brady : 22 Jan 2020 16:35 | Tags: Programming | Comments (0) | Close comments | Report a problem

January 09, 2020

Finding The Solution

New Year, New blog post. Just a short one this time, following on my my post on FizzBuzza few months ago. Even a problem as simple as that can be solved in myriad ways, and as I program more and in more languages, I find myself less often wondering how I can solve a problem, and more often how I should.

Most of the languages I work with let me solve problems using the basic command structures, and, as I wrote about last time in The X-Y problemit can be hard, but is vitally important, not to get confused by your partial solution and miss a better one.

Recently I've been learning Perl to do some complex text-processing, and find it to be a drastically different way of thinking to my C/Fortran background. It's tricky to think in terms of text matches and substitutions when I am so used to thinking of the position of each character in a string and working in terms of "index-of-character-X plus 1" (similar to working in terms of for-each loops when one has only used for). For the processing being done, the proper Perl solution is much shorter, easier to understand, etc, although it takes me a bit longer to produce initially.

A recent Stack Exchange post I saw had somebody asking why his boss didn't appreciate his brilliant coding techniques, because design patternswere second nature to him, and his boss wanted to use far simpler solutions. He probably came back to Earth with a bit of a bump when it was firmly pointed out that "patterns being second nature" was actually a bad thing, because it rather sounded like he trotted out the first "pattern" he could think of, instead of actually thinking about the problem he was solving. Nothing wrong with the patterns themselves, but critical thinking is required to decide whether they are suitable, optimal etc.

The other common mistake people make is demonstrated in Terry Pratchett's description of "Death's Swing" (e.g. https://en.wikipedia.org/wiki/Death_(Discworld)#Home), which mirrors the Sunk Costs fallacy. Trying to build a swing for his Grandaughter, the character of Death plows forward inspite of all problems. He hangs the swing from the two strongest branches. These being on opposite sides of the tree, he cuts away the trunk, shores it up and ... This can easily happen when programming and the trick is never to be afraid to throw away (or file away for future use) a solution, even a good one which took a long time, if it stops fitting the problem. In Death's case, it is less because he is unwilling to throw away the work already done, and more an issue of very linear thinking, but the effect is the same.

Hopefully this is already obvious, and you always think before you code, happily refactor or rework your own code, and have an ever-growing solution bank to call on. I suspect very few people are willing or able to throw away all the false starts they probably should though. Just keep in mind that there is a crucial difference between "what solves the immediate problem" and "the code I should probably actually write", and strive for the latter.

Heather Ratcliffe : 09 Jan 2020 15:47 | Tags: Philosophy Programming | Comments (0) | Close comments | Report a problem

December 05, 2019

The "XY Problem" or how to ask so somebody can answer

There's a lot of things that I feel ought to have pithy names or words, and this is one of them. The classic "XY problem" goes: Person wants to solve problem X. They don't know how. They think about it, and conclude that they will need to do "Y" as part of the solution. They don't know how to do this either. So they ask for help, but omit, or forget, to mention the larger problem. This can lead to bad or not-applicable answers, and an inability of the (presumably) skilled responders to explain how to do X. Since the original asker doesn't know how to do X, there's a good chance "Y" was a poor approach, or irrelevant, or harder than it needs to be etc.

I managed to commit a classic XY swap today while knocking up a very simple bash script. I wanted to take a filename, and strip its extension off. I knew this was going to be ".f90" in context, so I searched for how to strip the last 4 characters of a string in bash. I found this questionand used the substring solution given, before reading a bit more of the page and realising that I (and the original asker) were asking the wrong question. We both actually wanted this solutionto remove the dot and the extension. Funnily enough, this is also the archetype XY problem, as described at e.g. hereor here.

In this case, it isn't a big problem. Both solutions work, and for my actual use I don't need to gracefully handle a string without the ".f90" extension at all. But in more "interesting" (i.e potentially serious and causing of harm) cases on some of the forums I read (e.g. DIY, powerlifting, cybersecurity) the asked question hides a serious misunderstanding and goes off down an unhelpful path. For example, thankfully this guyasked his question openly (why are is breakers tripping after he removed a cooker fan and tied the live to the neutral wire) even if it was rather too late! Somewhere in the hundreds of questions about tripping breakers on that forum, there is probably somebody who's done something just as bad, but hidden it.

Another amusing type is the one where the question as asked can be answered effectively, but the asker would be surprised by the extra information. For instance (summarised from a real exchange):

A: I need a workout program that takes 3 hours every day!
B: Here are some! They're awesome!
C: Hold on, why do you want this?
A: Because I'm bored and I need to occupy 3 hours.
C: Jeepers! Find a hobby buddy! Forcibly spending a fixed amount of time is not going to lead to productive training...

But it Works, Doesn't it?

Even when the question isn't the right one, it is possible to get a solution which "works". But assuming you're trying to improve your programming skills, there's a lot wrong with "good enough" and the "wrong" solutions often have a whole host of problems:

They're too specific. For example, the substring solution for my file-extension problem works, but it's less general than it should be. It only works for 3 character extensions (plus one for the '.') and gets the wrong answer if there is no extension, wheras the actual solution works in both of those cases.
They're misleading when read back. Again, with the file-extension problem it's not clear what I'm doing by taking 4 characters off - the purpose is much clearer if I split on the dot.
They're not actually any quicker/simpler than the "proper" solution. The file-extension thing is a good illustration again - bash substrings are kinda inelegant, while pattern substitutions aren't. The "proper" solution is shorter and much clearer in general.
You still don't know how to solve the "actual problem" where you could have learned a much broader-applicable thing and improved your problem solving ability. In a lot of cases, you end up writing "Fortran in every language" rather than actually learning the approch of the one you're in. Again, the file-extension example is a good one. Doing it with a string slice is something familiar to me from other languages, but not a good move in bash, and it makes the problem more fiddly (I need to find the string length because of how substringing works).
Last and perhaps most egregiously, it hides from YOU how well you understand what you're doing, and how hard it actually is. My hacky solutions can end up being far more complex than they need to be, or they can end up making a genuinely hard problem look simple, because they don't actually solve it. See, for example, the difference between a genuine AI chatbot and the ELIZA model.

So what's the solution?

Mostly, when asking a question, try to ask the actual real one. This is easy when asking a question (in person or in text), but gets a bit tricky when using a search engine, since you're having to break things down into just a few key words. But the general idea is to look for the overall problem, as well as the partial solution you've thought up, to think very hard about what that actual problem is, taking a step back if required, and to not get too attached to your approach.

Always bear in mind that a complete change of tactic might be required to actually achieve your goal, and that you might not have quite nailed down what that goal is yet. And keep in mind the quote from Henry Ford:

Failure is simply the opportunity to begin again, this time more intelligently.

or as I've heard it paraphrased but can't find a source for:

Finding out you're wrong is great, because you get the chance to become more right.

Heather Ratcliffe : 05 Dec 2019 12:13 | Tags: Philosophy Programming | Comments (0) | Close comments | Report a problem

Newer entries - Older entries

Mo	Tu	We	Th	Fr	Sa	Su
Nov		\| Today \|
						1
2	3	4	5	6	7	8
9	10	11	12	13	14	15
16	17	18	19	20	21	22
23	24	25	26	27	28	29
30	31