December 15, 2023

Gnu Parallel without root

A quick post for future reference: GNU parallel, the command line app to allow parallel running of multiple copies of a program etc, is designed to work just about anywhere. This means we can install it without needing any root access.

DISCLAIMER: This is intended for something like a managed laptop with Linux on. If you are working on shared systems, anything run through parallel would count as computationally intensive - please pay attention to where you are allowed to run such work! Also, check the package or module system for an existing parallel install! It's likely to exist already!

Here's the recipe:

# Do all this in our home directory. If you change to somewhere else, replace $HOME with that path where it occurs

cd $HOME

# Grab that latest source

wget https://ftp.gnu.org/gnu/parallel/parallel-latest.tar.bz2

# We could check the signature of this now, but since we'd grab the sig from where we grabbed the file, there is no point...

# Create an install directory called pllel - change this if you like

mkdir pllel

# Unpack and change to that folder

tar -xvf parallel-latest.tar.bz2

cd parallel-latest

# Configure to install into the directory we just created

./configure --prefix=$HOME/pllel

make

make install

# Now we add the location of the parallel program to our path

# Putting it first (before we repeat $PATH) means it will take priority over any other program called parallel

export PATH=$HOME/pllel/bin:$PATH

# Test everything

parallel echo ::: A B C D


Now, if you want this to be available in all your shells or sessions, you can add the export PATH line to your .bashrc, .bash_profile etc and then you don't have to do that every time.


July 25, 2023

A tricksy bit of C++ templating

Basic C++ Templates

My First Templates

Most people's first encounter with C++ templates is when they want to write a function which can take several types and do the same thing regardless of which. That is, something where the function body ends up the same, but the signatures differ. The templating engine lets you do this really easily, and better still, only the versions of the function you actually use will be created by the compiler which reduces the size of your compiled binary.

Let's make a couple of dummy functions to discuss:

template <typename T>
T myfunc(T);
template <typename T>
void myfunc(const & T);

The first of these is probably the simplest possible template function - we define a name for our 'type' parameter, calling it T, and use T where we would normally use a type such as int, double etc. The line-break is purely convention, but is the common way to write these. We can use T in the signature and/or the body, or we can not use it at all in certain circumstances. This is the 'template' for how the function should look, and if we call it from our code we'll get a version where T is substituted for an actual type, based on, in this case, the type of the argument we pass.

The second block defines a slightly more complex function, where we add some qualifiers to the T type argument, in this case making it a const reference. Note we can overload between T, const T, ref T etc just like we can with explicit types, int and int & being different functions.

Overload resolution between templated functions and between templated functions, non-templated functions and what are called template specialisations are complicated, but mostly can be thought of as going from more to less restrictive - int myfunc(int) would take priority over the templated version for example.

Templates go in Header Files

The fact that T is a placeholder for an actual type is a chunk of why template _definitions_ have to go in header files (excepting so called 'explicit instantiation' which wont discuss here) - the compiler needs the definition to compile the function when it encounters a call to it, and make sure that this function can compile for that type - for instance if the function body passes T to other functions, or calls member functions on it (is T is meant to be a class), are these available? If not, the function cannot be compiled. Since a header file is included (directly or indirectly) in every C++ file which needs it, the full definition is always available during compilation.

Or putting it another way, we can't compile a C++ file containing a templated definition on it's own because we don't know which types we're compiling for, and we certainly don't want to compile for every possible type, even if we could, as this would produce a ton of useless code.

By the way, because templates are not completely 'fleshed out' functions, they are implicitly exempted from the 'one-definition' rule, although properly using include guards etc is still a good idea!

Some More Specific Template Stuff

Template Specialisation

So what if we want to use a different version of a function for, say, integers and class types? Well, we can provide an integer _specialisation_ of the template, which will take priority over the general version, so like

template <typename T>
T myfunc(T in);
template <>
int myfunc(int in);

(Note we have to indicate this is a specialisation using the template<> bit. If we leave that out we get something slightly different and have to consider those overload resolution rules properly - lets not go into the details because they start to hurt).

Suppose in the generic version we use something like 'in.size()' - this wont compile for an integer type. But the compiler doesn't want to do extra work, and is programmed to let an explicit specialisation take priority over a general version, so, having found the specialisation, the compiler stops looking.

Thinking about this for a few moments, we see that a) if this wasn't the case these specialisations would be a bit useless and b) this is actually very powerful - we have potentially uncompilable code, but as long as we never _use_ it we're safe.

Templates For Specific Types

OK, so suppose we have a method that makes sense for int, double, float, long etc, all of the numeric types - we kinda want a template that only allows one of those. We can do it, although we wont show the details, because exactly how is messy and varies between C++11, 14, 17 and newer as this ability has got refined. The trick though, is creating a function template that only works for numeric types, by making something that tries to create an invalid type otherwise. Non-numeric types will fail to substitute, so trying to call our function with these fails - as we desire since this method doesn't make sense.

In some cases we might want to select between methods depending on something like 'is this type numeric' or is it an integer or other stuff. Assuming this technique does something similar to what we just described, we're going to get some options where certain types 'don't make sense' with potential function overloads, but we want to skip those and look at others instead.

This last bit is important - it is a fundamental principle in C++ that templates that "don't work out" don't count as compiler failure. This is "Substitution Failure Is Not An Error" or Sfinae (see other vid, link in description).

This is very powerful, pretty complicated and a big old mess, so lets leave the details alone! Instead, lets look at something neat but clever which we can do in the next section.

Decorator or Wrapper Classes

A Basic Wrapper Class - Function Pass-thru

Suppose we're writing a class which is a bit like a Python decorator - we have some class and we want to encase it in another which adds a bit of function. This is the "Decorator" pattern in design pattern terms too. A common idea is to decorate a single function with e.g. code to time it's execution. The crux is to have two things which are independent, and the ability to combine them - timing code for example would simply execute the function, timing as it goes, and return whatever it returns. It would not and should not care about what the function is, what it does, or how it does it.

A wrapper class is a similar idea, where we want to encase one class in a larger one, and in some sense 'pass through' it's functions. Perhaps we want to change the interface (function names, signatures etc) of a class to make it fit better with calling code (the Adapter pattern), or add defaults or something, and we do not want to edit the wrapped class to do so. Or again, suppose we want to decorate the class with some extra function.

As a pretty dumb but decently useful example, suppose we have some logger class for writing information to file or screen, and we wanted to add a timestamp (imagine our favorite system doesn't have this already). Naively, we might imagine just stringifying everything, then passing those to the logger, having prepended our extra info. But this is clumsy - why restrict ourselves to only taking strings? Better to write wrapper functions that take any type using a template. For instance (semi pseudocode)

class logger; // This comes from some library

class timestampedLogger{
logger myLogger;
string getTimeNow(){...}

template <typename T>
void write(T in){
logger.write(getTimeNow());
logger.write(in);
}
};

We insert a timestamp, but otherwise, we just funnel everything through. In some senses, we're making timestampedLogger substitutable for plain logger, BUT AT COMPILE TIME. Instead of creating a 'logger' we create our wrapped timestampedLogger, and implement all of the same functions, meaning we can simply swap it out. We're not inheriting from logger, which would make us substitutable at runtime, we're replacing it completely at compile time.

Compile time fiddling like this is what templates allow us to do, and do (fairly) cleanly.

Unknown Return Types

Now imagine a slightly more fiddly thing, which is something we've been playing with doing recently - suppose we have some data-storage classes, for example vector or array and we want to wrap them up in something which is aware of physical units (kg, m, s etc). This is very much a decorator pattern, with a slight twist which is that not only do we want to pass-through (certain) member functions of the storage type unchanged, we also add functions dealing with say multiplying one thing with another.

The motivation for this is a kind of enhanced type safety - a length and a time cannot be added together, so lets exploit the type system to enforce this. Ideally this happens at compile time, as our simulations are time-critical and we don't want needless runtime checks for something that, once programmed correctly, wont change. But we might be adding or changing the equations we're solving, so just manually checking things once is insufficient.

The details of doing this are actually pretty mucky and horrible (see the Github Repo for the details, but beware!), but it showed up an interesting tricky bit with wrappers like this - can we write a function which has a return type which is not fixed, nor deducible from its arguments directly, but which depends on the function we are wrapping, in particular what its return type is?

C++14 introduced the idea of an auto function return type. Why should the programmer have to specify something that the compiler can work out? Consider the following:

returnType getLogFilename(){return myLogger.getLogFilename();}

returnType might be a few things, but let's say it's a string. Or could it be a wstring (string supporting wide characters, such as accented or non-Latin)? Why should we have to go rummage in the logger class? And why should our intermediate, our wrapper, have to care if this changed in an update (leaving aside the myriad problems with changing interfaces...).

Obviously the return type of that can't be deduced (worked out) from the arguments - there are none! But it can be worked out from the call and the compiler can and will do this, so we don't have to, if instead of returnType, we put 'auto'.

A Templated Class

So one more twist - suppose we want to wrap one of multiple possible logger classes. Imagine one of these implements a method 'setToRed()' to make output red (e.g. in a terminal). If we write the following function

void setToRed(){myLogger.setToRed();};

this compiles fine for the class which implements that function, but if the logger we are wrapping does not, we get a compile error, whether or not we ever call this function. This is obvious - that code can't compile. So how can we make this available if and only if the backing class offers it?

We might hope that this:

template <typename T>
void setToRed(){myLogger.setToRed();};

would somehow do it - a templated function "isn't compiled until it's used" we think.

But it doesn't. I honestly don't understand the details of what is, and isn't allowed here, because compilers are complicated things and a lot of work goes into making template code compile as fast as possible, since it must be done every time.

But we canuse templating to do this and actually if we write our wrapper "properly" we'll get this ability for free! Before, I assumed we'd just swap out the 'logger' entity rarely, and want only one possible option at any time, so I assumed 'logger' was provided by a library, and I'd maybe use a typedef or using to make sure whatever logging class I was using had that name.

A proper wrapper class would be a templated class, where we can provide information on the logger type when we create it. So we'd do something like this:

template <typename L>
class timestampedLogger{
public:
L myLogger;
string getTimeNow(){...}

template <typename T>
void write(T in){
logger.write(getTimeNow());
logger.write(in);
}

void setToRed(){myLogger.setToRed();};

auto getLogFilename(){return myLogger.getLogFilename();}
};

We've included that 'auto' return type we mentioned, as well as an explicitly templated function - notice we have to name the parameters differently - T and L are both template types, but are unrelated.

Now to actually create one of these we have to do:

timestampedLogger<logger> myTLogger;

or we could offer a shortcut with a 'using' statement:

using tLogger = timestampedLogger<logger>;

Only if we use a function will the full body be handled, so as long as we don't use it, we can leave the setToRed here. We are now relying on the code which calls (or does not call) this function to know if it can or not - but if we're thinking of our wrapper as substitutable for a plain logger class, that was already the case, so there is no real disadvantage there.

With those two layers of templating we can write everything we need to wrap timestamping around our logger. We have to wrap every function we want to expose, but using templates and auto we can minimise how much information we're repeating about those functions - which we should always try to do. The compiler will work all that out once we substitute a real logger class type for that L parameter

Like That Type, But Different

For a brief final foray into the power, but also the attendant horror, of the templating system, lets imagine a more complicated decorator class, where our wrapping functions might change the type returned by functions they wrap.

The units checking project we're working on adds one more twist to this idea - how can we wrap our units wrapper onto something we're returning from a function whose return type we don't know? For instance, suppose we have scalars (single numbers), vectors (physics style so triplets of numbers) and tensors (ditto, so a 3x3 array). Suppose some function goes "down the hierarchy" so returns a lower rank for each, except scalars where it returns the same type.

That is how would we implement the following?

class scalar{
public:
int dat{0};
scalar()=default;
scalar(int val):dat(val){};
scalar exampleFunction(){return scalar(1);}
};

class vector{
public:
int dat[3];
vector()=default;
vector(int a, int b, int c){dat[0]=a; dat[1]=b; dat[2]=c;};
scalar exampleFunction(){return scalar(2);}
};

class tensor{
public:
int dat[9];
vector exampleFunction(){return vector(1, 2, 3);}
};
template <typename ST>
class unitsType{
public:
ST myData;
unitsType<???> exampleFunction(){unitsType<???> val; val.myData = myData.exampleFunction(); return val;}
};

What replaces those '???' ? We can't write anything in there ourselves because it depends on the type of the parameter ST, and crucially is NOT ST itself which would be simple. We can use auto in place of the return type explicitly, but not for the declaration of val. But the compiler knows on some level what type myData.exampleFunction() will return, and we can ask it to help us by spitting that out:

using ReturnTypeExample = decltype((std::declval<ST>()).exampleFunction());

Here decltype is saying 'give me just the type information and declval is taking a type, ST, and creating a sort of compile-time instance of that type so that we can access a member function by name. So we have what we need: unitsType<ReturnTypeExample> is the return value and the type we need to give to val in the body. Perfect.

But I Want to do it Myself

But there's one last twist - what if we have the same desire as before - to allow this for functions that ST may or may not implement? This is actually using the function at compile time to get that return value, AND there's no template substitution going on which might defer that until the wrapper function is actually used. The solution to this is a pretty horrible trick, and I DO NOT KNOW for sure it will work in all circumstances, but it solves this problem:

We have to add another layer of template, so that only when a specific instantiation uses the function will everything unroll and the function actually be needed. Lets name this possibly-nonexistent function nopeFunction() since nope, I didn't implement it...

So we do this:

template <typename Q>
using ReturnTypeNope = decltype((std::declval<Q>()).nopeFunction());
auto nopeFunction(){
unitsType<ReturnTypeNope<ST> > val;
val.myData = myData.nopeFunction();
return val;
}

This works - only when the nopeFunction() here is called does anything attempt to unpack its return type. The auto return type deduction can only happen when the body is compiled too.

As a final note, suppose we wanted things to be a bit clearer, or auto would not work for us for some reason. The following DOES NOT work:

unitsType<ReturnTypeNope<ST> > nopeFunction(){
unitsType<ReturnTypeNope<ST> > val;
val.myData = myData.nopeFunction();
return val;
}

because the return type tries to substitute when this line is reached. We need to keep that deferred layer, so we have to do:

template <typename Q>
unitsType<ReturnTypeNope<Q> > nopeFunction(){

unitsType<ReturnTypeNope<Q> > val; val.myData = myData.nopeFunction(); return val;}

(NOTE we don't have to name this Q, it has no relation to the other Q, it's just handy to re-use the letters for similar purposes where they don't overlap, to avoid running out).

This leaves one FINAL (promise) problem: when calling this wrapper, the type for Q cannot be deduced so we have to do ugly things in the calling code like this:

unitsType<scalar> myTyp;
myTyp.nopeFunction<scalar>();

We fix this with a final twist - a default value for the template parameter Q of ST. Note carefully: this is not defining them to be equal, it is setting a default. We get:

template <typename Q>
using ReturnTypeNope = decltype((std::declval<Q>()).nopeFunction());

template <typename Q=ST>
unitsType<ReturnTypeNope<Q> > nopeFunction(){
unitsType<ReturnTypeNope<Q> > val;
val.myData = myData.nopeFunction();
return val;
}

And now as long as the calling code doesn't call nopeFunction, nothing cares if our ST type implements it or not.

This is, frankly, a bit too deep for sanity, but it does work and is not exactly an edge use-case when dealing with decorators and wrappers.


August 03, 2022

Global data idioms in C++


We've spoken a fair bit about global variables and why they are risky, but sometimes they are the best solution available. For instance, a recent video discussed the idea of "tramp" data, which is passed through certain objects and functions simply to get to another place. This, as we talked about there, has a lot of problems, several of which are close to or identical to, the equivalent problems with globals. So... sometimes the simple fact is, you have to choose your evil, and sticking to broad design principles is not a good motivation for locally bad design. As I quoted in that video:

"But passing global data into every function, whether it needs it or not, imposes a great deal of overhead. This is known as "tramp data", and often reflects a design error. If these things are truly global, then they're global, and pretending that they aren't, in order to make your code "OOP", is a design error."

(Pete Becker, this StackOverflow answer)

So let us agree that there are cases where you have global data and/or objects and that representing them as global state is, in fact, the correct approach. This post is going to discuss some of the options for making this "as least-bad as we can".

Static - more slippery than stable

Static as a C++ keyword is another in the unfortunately long list of C++ keywords with subtly different meanings in different contexts. I am not going to try and explain what it technically means, since that's been done before (e.g. here) and has to be quite technical. In a moment, we'll go through a few uses of it and how they work, and after that the definition might make more sense. Let's just use one small bit of the implications - static on a variable or class member is a way to get a "storage location" (i.e. a variable) which lasts for the entire program. In other words, it's lifetime is the program duration, and as we discussed in another recent videothat is in many ways equivalent to being a global variable.

Because it's "basically a global" we have to caution against static (unless also const) in any code which is, or might in future be, multithreaded. Global mutable (able to be changed, aka not const) state is the anathema of threading. In nearly every case, you do not want to try and handle the combination!

Definition, Declaration and the One Definition Rule

C++ requires you to do two things so that a variable or function "works". Firstly, you need to have declared what it looks like, so that all parts of the code know how to handle it. This means giving a variable a type, or a function a prototype - so that a variable access or function call has the right "pattern" - knows how big something is, how many parameters it has, etc. But you also have to define, or "fill in" what something actually is - create the storage space (memory) for a variable, or define the body of the function.

For variables, expecially if you have been avoiding globals, you may have never encountered this because in most cases the declaration and definition are the same - int i;does both. Do note that this is nothing to do with initialising or giving a value to a variable. The only time they become separate is in certain cases where we want multiple parts of the code to be able to use the same variable, such as for globals. Method 1 below shows how we can do this.

For functions, it is a bit more familiar - to call a function we need to have access to its prototype, and for compilation to complete and the program to run, it has to be given a body that can actually execute. For simple functions, this is why we normally declare prototypes in a header, and function bodies in a cpp file. If we put the body in a header, and include that header several times, we get errors telling us the function is multiply defined. For classes though, we can happily define function bodies in a class definition without any problem. What's happening here?

First, lets just clarify what including a header does - it pretty literally dumps the code from the included file into the including file. This is old stuff - the compiler has nothing to do with it, as it happens at the "pre-processing" step while the code is still just text. By the time the compiler sees it, the definition you wrote might appear several times, and there is no way to know that they came from the same place.

Now, the problem arises because C++ disallows certain forms of ambiguity. Suppose a function could be defined in several places - and suppose these were different! Imagine the chaos if one form were used in some places and another in others. Or might the compiler to expected to pick one? Disallowing this is mostly a Good Thing, but sometimes it has issues. For instance, being unable to define any functions in a header would rule out "header-only" libraries, where you simply include them and it works. It would also severely limit templating.

Not being able to define something multiple times is called the One Definition Rule. Because of the drawbacks just mentioned though, there are several places where it is allowed to be violated with the strict requirement that the programmer takes care of the risks. That is, you may be allowed to have a repeated definition, but if it is not the same everywhere, that is your problem (and mostly undefined behaviour, so a Very Bad Thing). The details aren't important here, and it's enough for us to suppose that where the benefits were sufficient, the rule was allowed to be broken.

Inline

Inline is a special keyword, which, yet again, has a complicated history. Originally, "inline" was a hint to the compiler that a function should be inlined (think - pasted into place in source code, rather than executed with a jump). In order to do this, the compiler would need to have access to the function body in all the places in the code it might be used. This meant the function had to be fully defined in an included header (strictly, not true, but good enough for us), which violated the One Definition Rule. Inline was considered useful, thus the rule had to be bent - C++ generally tries to allow useful things where it can (where the compiler writers think they can make it work). Once inline was allowed to bend the rule, people used it for that purpose, such that now it pretty much only is used for this case.

Since a class member function defined within the body of a class can only really be meant to be inline, this was done by default too, hence why you may never have seen this keyword. Note that it applies to functions only - and see Example 1b below for the newer extension to inline - inline variables.

Global state example in C++

Let's assume we really do have some data which is accessed in many parts of our program, both for read and write, and that we have already decided this is the best design. Perhaps it is used so widely that we would end up passing it endlessly, usually as that "tramp data". Once it's passed, it's accessible, and only some very horrible tricks (const_cast...) can allow us to restrict where it can be changed. It's already global in implication, why not admit this and allow it to be global in design.

What options do we have to do this, and what are their pros and cons?

Method 1 - A "simple" global variable

For simple data, such as a single integer, we can use the simplest "global variable" idiom - that is we want a variable whose declaration ("pattern") is available to all parts of our program, which is defined (actually created) once and only once in our code. This means we can't do what we might first think of and just create it in a header file because that would define it several times and our compiler will complain about an ambiguous name. Nothing clever we can do with include guards or trickery can avoid this. What we can do though, is to declare it in a header, and define it in a cpp file, like this:

file.h:

#ifndef file_h
#define file_h
extern int my_var;
#endif

and in file.cpp

#include "file.h"
int my_var = 10;

So what does all that mean? Our header file uses standard include guards (we'll leave these out in future) which stop the header being included repeatedly. We have our my_var variable, an int, and we declare it "extern". This tells the compiler that there will be a definition for my_var by the time the code is linked (if you're not too familiar with linking, think: compiled files combined into an executable). This means all of the parts of the code which use this header will happily compile using what they know my_var has to look like, and not worry about where it might be actually created. Then, the C++ file does the actual creation. This can happen only once, or we would risk having two separate variables with the same name.

This is the simplest idiom to get what we want, but has several problems. It's a bit confusing: extern is an old keyword that many people will never encounter. We have to go looking for the actual initialisation step, and verify that we set a value. In a lot of cases, we might have no other reason to have the file.cpp file except to instantiate one variable - and our only alternative there is to demand it be defined in some other of our cpp files, which is confusing, risky and all around a Bad Thing.

Lastly, a name as generic as my_var has now been "claimed" throughout our entire program and we re-use it at our peril. Shadowing, where a local variable "covers up" a global one, is always confusing. What would this snippet do, for instance?

int main(){
std::cout<<my_var<<std::endl;
{ int my_var = 11;
std::cout<<my_var<<std::endl;
}
std::cout<<my_var<<std::endl;
}

Actually, the problem is a bit worse than this, because only files including our header will see my_var, and you can probably work out why this can be a maintenance nightmare if suddenly code changes mean two names which had been separate begin to collide!

Method 1a - A namespaced global

To avoid the shadowing issue, and improve this solution from "pretty awful" to "alright", we can at least restrict our variable name to a namespace. If we are careful with our namings and dividing things into coherent sets, we can quite usefully indicate more about our variable, and make it easier to find places that related variables are, or should be, changed. For example

namespace display_config{
extern int width;
extern int height;
}

and

int display_config::width = 720;
int display_config::height=1080;

Generally, if we're changing our "display height" we'd also expect the width to be changed, and we now have some ability to spot this.

Method 1b - [C++ 17 or newer] This, but better

From C++ 17, this kinda obviously common and useful idiom is supported much more, by the addition of "inline" variables. We discussed inline above for functions, and this is very similar. The definitions must match, but in our case of there being only a single actual line, which is included several times, this is fine. We end up with the much more elegant looking:

namespace display_config{
inline int width=1080;
inline int height=720;}

This has several advantages - not needing a cpp file definition, being able to see at a glance that our variables are initialised (instead of having to look in two places), and having one less keyword/idiom to remember. However, C++17 is perhaps a little too new to use without at least considering where you will be compiling/running your code, as you might have to put in a request for a compiler update for a few years yet.

Method 2 - A class with static members

At the point where we're talking about linking variables to each other, we are into the territory where a class becomes a good solution. This would let us, for example, relate the setting of height and width explicitly. The obvious step from our namespace example, to a basic class is this:

file.h

class display_data{
static int height;
static int width;
};

and file.cpp

int display_data::height=720;
int display_data::width=1080;

Now this is rather different looking and the keywords have changed. We no longer have any extern, but we have introduced static, and we have these slightly unexpected definitions without which we get compiler errors telling us the variable does not exist.

A static class member variable is shared by all instances of the class. If we read or write to it, this must have the same effect (in terms of how the bits in memory are changed) whichever class instance we would use - thus we don't need to specify. In fact, we don't need to have any instance, we can access those variables as display_data::height from anywhere. This sort of explains why the second bit is needed - if they're not associated with any class instance, the variables have to be "created" and given storage space somewhere. As before, we need those to be in some cpp file and often they are the only thing there, which is ugly.

So this also has problems. Firstly - if our class has only static members, that's weird, and often considered an "anti-pattern". It's a heavy solution and goes against some of the motivation for objects. However, if we want to control the setting of height and width (forcing them to be non-negative for instance), we can get the compilers help to enforce this by making them private and providing setter and getter functions.

On the whole though, entirely static class members is an oddball solution, and I'm not sure where it's really useful.

Method 3 - A static class instance

Going back a bit, our display has sort of taken on an "object like" existence, with several items of data and methods that act on them. This object really could have several instances, it's just that for our specific program, we want to have a global one referring to "the display". This is far more naturally represented by a global instance of a regular class, so we can use the first approach with a user-defined class rather than a plain int, and get some benefits.

However, we still have most of the drawbacks of global-ness, so we do need some strong motivation. And here we additionally have all of the drawbacks of method 1 like needing the C++ file. Moreover, there is a thing called the Static Initialisation Order Fiasco if we have interaction between static objects, and a lot of other stuff gets really messy.

We mention this option for completeness, but would probably never recommend it.

Method 4 - Function Local Statics

Possibly the best approach to allow us to have a single global class like the previous method, but more safely, is to exploit function-local static variables. These are a lot like global statics, but their scope is restricted to the function where they "live". This clears up a lot of the issues from Method 1 and is much, much better. We do this (showing the function body only, there would also be a header containing prototype):

int get_display_height(){
static int the_display_height=720;
return the_display_height;}

That lets us have this global variable, but gives us no way to set it. It's easy in this case to find a sentinel value to let us do this to fix that:

int get_display_height(int new_val=-1) //In header
int get_display_height(int new_val){
static int the_display_height=720;
if(new_val != -1){the_display_height = new_val;}
return the_display_height;}

but not all things have a sentinel.

In those cases, and some others where a sentinel is not appropriate, we could instead return a reference or pointer to the variable, such as this:

int * get_display_height(){
static int the_display_height=720;
return &the_display_height;}

This is perfectly valid for a static function local, in a way which it is absolutely not valid for a regular function local variable. Yet again, we're running into confusing idioms and must tread carefully! Worse still, we've now opened up setting of our variable to the entire code and can no longer restrict values to be positive, or anything else!

Worse still, a casual reading of those snippets might miss that the initialisation to 720 is done only the first time the function is encountered. Any subsequent calls refer to the same variable, the_display_height, but the setting is not redone. In this case, we absolutely rely on that behaviour, but if it is something more complicated like a heap allocation it can really confuse you.

Method 5 - A Singleton Class

The final method we'll discuss here is the most heavy weight, but can be really useful. Suppose we really do have a class containing data and operations on it, and we want there to exist precisely one of them. This will be our global object. Much like the previous idiom, we'll have the actual storage location be a function-local static variable. So we will do something like this:

static theClass * theClass::get_instance(){
static theClass * the_instance = new theClass();
return the_instance;}

Anywhere that we want to use the class, we simply get it using theClass::get_instance()->blah.

There's one more trick we need, because we said that we want only one of these to exist - which is to privatise the constructor for theClass so that only the one in this function can be created. Voila, a single entity! This is why the function above is a class-member function, so that it and only it is able to call the constructor.

This is generally know as the Singleton Pattern and is very handy. However, be aware that it does involve statics and shared entities and so must be handled carefully. In particular, you must make sure either all operations on the class leave it in a valid state, or two parts of your code might both use it and confuse each other - obviously this gets a lot more pressing in multithreaded code, but even serial code can have the problem.

Conclusion

Global data, if it truly is global, is global however you code it. You have to use caution, but whether you pass it about, or use one of these tricks, your data is subject to change in multiple places. Be wary. Use every tool the compiler gives you, such as const, and block scoping, and only globalise things that are worth the headache!


June 06, 2022

Wherfore art thou – variable names matter!

Sometimes, the simplest questions are the hardest to answer. For instance - what is the meaning of the word "the"? If you've never thought about this, have a go. If you'd prefer a non-grammar, or non-English specific example, try to describe the number '1'. Trickier than you'd think, isn't it?

But that is hardly relevant to programming, so lets look at today's deceptively simple question instead. Namely, how long should variable and function names be?

Some people might seem to have an answer to this question, such as "between one and five words, usually two or more". They are wrong. Any answer containing numbers is unhelpful and either sometimes wrong, or too broad to mean anything. For instance, "Supercalifragilisticexpialidocious" [typed from memory... excuse spelling] is one word, and is terrible. And "SolveWaveEquationSecondOrderWithLimiter" is seven and (in the right context) is quite good. "FlagSettingWhetherWeShouldPrintTheAnswerToScreenOrNot" is clearly terrible - but it is thoroughly descriptive.

So why do we struggle so much to answer this question? Because we are in a sort of "tug-of-war" between two competing interests, and which one pulls a little harder depends on a lot of things. Both ends of the rope are anchored in clarity: on the one hand we want maximum clarity for the function - a very descriptive name. This tends to favour longer names, with more detail. On the other hand, we want maximum clarity for the code as a whole - a name that doesn't take up too much mental space in a block of code. This tends to favour shorter names, with fewer, simpler words. To find our optimum, we must somehow balance these two.

A brief aside here into our absolute best weapon in this - make the two ends stop pulling. Find a way to make "descriptive" and "compact" the same thing. Specialised definitions of terms specific to a field of interest, i.e. jargon, really works in our favour here. We must use it with caution, because new jargon can be very jarring and difficult to master, but in general jargon terms are addressing exactly the same problem - descriptive yet compact terms. For instance, a word you may barely think of as jargon - "laptop". Let us expand this definition - perhaps we get "a portable, all-in-one computing device". But we still have some specialised terms here - what is "all-in-one"? What is "computing"? We could continue the expansion almost indefinitely. Or, we can simply use the term "laptop" and in most cases be perfectly understood.

It is important to flag here that word, "most". Is a tablet a laptop? If I strap a monitor to a tower PC and plug in a keyboard, do I have a "laptop"? It's going to depend on context whether those are reasonable (OK, the second one is extremely unlikely to be). But consider similar questions - "do you have a car?" "No, I have a van" - sensible answer, or irritating pedant? "We need to seat 5 people, who has a car?" "I have an MX5" - as it turns out, "car" doesn't always mean 5 seats.

Back from our aside now, new weapon in hand. Careful, selective use of jargon can make our names very descriptive, while remaining compact. If the jargon we use is not universal, we can provide a glossary or add context in our docs. We should be very very careful about making up our own jargon here, because unfamiliar terms can lead to misunderstanding, but if your field provides words, exploit them.

One last important point - sometimes an otherwise optimal name is not useful due to ambiguity. This can be typographical - we should never have names that differ only in characters like '1' and 'l' or differ only in the case (small or CAPITAL) of the letters. Ambiguity can be similarity with some other function name - imagine we somehow tried to have "GetDiscreteName" and "GetDiscreetName" - could you even tell those apart? Ambiguity is an enemy of clarity - avoid it.

OK, actually there is one final point. Clearly redundancy can only harm our compactness of names. But it does have some applications. We might want two functions with very similar names, but different parameters - "SolveTypeAEquation(a, b, c)" and "SolveTypeAEquationNormalised(a, b, param)" for instance. If we can't design the code to make these less confusing, we might deliberately make one name less individually clear, if it makes its usage clearer. So perhaps the second one becomes "SolveTypeAEquationNormalisedWithParam(a, b, param)" which repeats information already in the signature (redundancy), but helps us keep in mind which function we want.

Short posts like this rarely have conclusions, because everything has already been said, usually repeatedly. Since repetition is another form of redundancy that actually works though, lets restate our key point:

We must balance the demands of clarity between our functions and our wider code and try and find a name which is BOTH descriptive and also short and easy to handle. Sometimes one of these will be a bit more important, sometimes the other.


April 08, 2022

License choice in the R community

This week a post on the RSE Slacksparked a lot of discussion on how to choose a license for your research software. The website https://choosealicense.com/is a helpful resource and starts with an important point raised by Hugo Grusonthat a good place to start is to consider the license(s) commonly used in your community. But how do you find out this information? This blog post explores the licenses used in the R and Bioconductor communities, by demonstrating how to obtain licencing information on CRAN and Bioconductor packages.

Licenses on CRAN

The Comprehensive R Archive Network (CRAN) repository is the main repository for R packages and the default repository used when installing add-on packages in R. The tools package that comes with the base distribution of R provides the CRAN_package_db() function to download a data frame of metadata on CRAN packages. Using this function, we can see that there are currently 19051 packages on CRAN.

library(tools)
pkgs <- CRAN_package_db()
nrow(pkgs)
## [1] 19051

The license information is in the License column of the data frame. We'll use the dplyr package to help summarise this variable. With n_distinct() we find that there are 164 unique licenses!

library(dplyr)
n_distinct(pkgs$License)
## [1] 164

However, many of these are different versions of a license, e.g.

pkgs |>
  filter(grepl("^MIT", License)) |>
  distinct(License)

##                                   License
## 1                      MIT + file LICENSE
## 2              MIT License + file LICENSE
## 3                      MIT + file LICENCE
## 4 MIT + file LICENSE | Apache License 2.0
## 5                       MIT +file LICENSE
## 6          MIT + file LICENSE | Unlimited

The above output also illustrates that

  • An additional LICENSE (or LICENCE) file can be used to add additional terms to the license (the year and copyright holder in the case of MIT).
  • Packages can have more than one license (the user can choose any of the alternatives).
  • Authors do not always provide the license in a standard form!

A LICENSE file can also be used to on its own to specify a non-standard license. Given this variation in license specification, we will use transmute() to create a new set of variables, counting the number of times each type of license appears in the specification. We create a helper function n_match() to count the number of matches for a regular expression, which helps to deal with variations in the form provided. Finally we check against the expected number of licenses for each package to check we have covered all the options.

n_match <- function(s, x) lengths(regmatches(x, gregexpr(s, x)))
licenses <- pkgs |>
  transmute(
    ACM = n_match("ACM", License),
    AGPL = n_match("(Affero General Public License)|(AGPL)", License),
    Apache = n_match("Apache", License),
    Artistic = n_match("Artistic", License),
    BSD = n_match("(^|[^e])BSD", License),
    BSL = n_match("BSL", License),
    CC0 = n_match("CC0", License),
    `CC BY` = n_match("(Creative Commons)|(CC BY)", License),
    CeCILL = n_match("CeCILL", License),
    CPL = n_match("(Common Public License)|(CPL)", License),
    EPL = n_match("EPL", License),
    EUPL = n_match("EUPL", License),
    FreeBSD = n_match("FreeBSD", License),
    GPL = n_match("((^|[^ro] )General Public License|(^|[^LA])GPL)", License),
    LGPL = n_match("(Lesser General Public License)|(LGPL)", License),
    LICENSE = n_match("(^|[|] *)file LICEN[SC]E", License),
    LPL = n_match("(Lucent Public License)", License),
    MIT = n_match("MIT", License),
    MPL = n_match("(Mozilla Public License)|(MPL)", License),
    Unlimited = n_match("Unlimited", License))
n_license <- n_match("[|]", pkgs$License) + 1
all(rowSums(licenses) == n_license)
## TRUE

Now we can tally the counts for each license, discounting version differences (i.e., GPL-2 | GPL-3 would only count once for GPL). We convert the license variable into a factor so that we can order by descending frequency in a plot.

tally <- colSums(licenses > 0)
tally_data <- 
  tibble(license = names(tally),
         count = tally) |>
  arrange(desc(count)) |>
  mutate(license = factor(license, levels = license))


Bar chart of license frequencies on CRAN as a percentage of the number of packages. The vast majority are GPL (73%), followed by MIT (18%). All other licenses are represented in less than 3% of packages.

tally_data.csv

The vast majority are GPL (73%), followed by MIT (18%). All other licenses are represented in less than 3% of packages. This is consistent with R itself being licensed under GPL-2 | GPL-3. The only licenses in the top 10 that are not mentioned as "in use" on https://www.r-project.org/Licenses/, are the Apache and CC0 licenses, used by 1.7% and 1.1% of packages, respectively. The Apache license is a modern permissive license similar to MIT or the older BSD license, while CC0 is often use for data packages where attribution is not required. A separate LICENSE file is the 3rd most common license among CRAN packages; without exploring further it is unclear if this is always a stand-alone alternative license (as the specification implies) or if it might sometimes be adding further terms to another license.

Licenses on Bioconductor

Bioconductor is the second largest repository of R packages (excluding GitHub, which acts as a more informal repository). Bioconductor specialises in packages for bioinformatics. We can conduct a similar analysis to that for CRAN using the BiocPkgToolspackage. The function to obtain metadata on Bioconductor packages is biocPkgList(). With this we find there are currently 2041 packages on Bioconductor:

library(BiocPkgTools)
pkgs <- biocPkgList()
nrow(pkgs)
## [1] 2041

Still, there are 89 distinct licenses among these pckages:

n_distinct(pkgs$License)
## [1] 89

We can use the same code as before to tally each license and create a plot - the only change made to create the plot below was to exclude licenses that were not represented on Bioconductor.

Bar chart of license frequencies on Bioconductor as a percentage of the number of packages. GPL is still a popular license, represented by 55% of packages. However the Artistic license is also popular in this community (23%). Third to fifth place are taken by MIT (9%), LGPL (7%) and LICENSE (4%), respectively, with the remaining licenses represented in less than 2% of packages.

tally_data_bioc.csv

GPL is still a popular license, represented by 55% of packages. However the Artistic license is also popular in this community (23%). This reflects the fact that the Bioconductor core packages are typically licensed under Artistic-2.0 and community members may follow the lead of the core team. Third to fifth place are taken by MIT (9%), LGPL (7%) and LICENSE (4%), respectively, with the remaining licenses represented in less than 2% of packages. The ACM, BSL, CC0, EUPL, FreeBSD and LPL licenses are unrepresented here.

Summary

Although the Biconductor community is a subset of the R community, it has different norms regarding package licenses. In both communities though, the GPL is a common choice for licensing R packages, consistent with the license choice for R itself.


March 18, 2022

Getting a little testy

Testing code is essential to knowing if it works. But how do you know what to test? How do you know you've done enough?

Let's be clear to start here, "testing" as we think of it is some form of comparing a code answer to a predicted one, making sure the code "gets it right" or meets expectations. If there is a canonical "right answer" then it's fairly easy to do - if not then things get difficult, if not impossible, but correctness is the goal.

What to test? Well, all of it, of course! We want to know the entire program works, from each function, to the entire chain. We want to know it all hangs together correctly and does what it should.

So when have you done enough testing to be sure? Almost certainly never. In a non trivial program there is almost always more you could test. Anything which takes a user input has almost infinite possibilities. Anything which can be asked to repeat a task for as long as you like has literally infinite possibilities. Can we test them all? Of course not!

Aside - Testing Everything

Why did we say user input is only almost infinite? Well, all numbers in the computer have a fixed number of bits for storage, which means there's a strictly fixed number of numbers. Technically, for a lot of problems, we really can test absolutely every input. In practice, we can't afford the time and anyway, how do we know what the right answer is without solving the problem completely.

Picking random things to check is an idea that's used sometimes, as are "fuzzers", which aim to try a wide range of correct and incorrect inputs to find errors. But these are also costly to perform, and still require us to know the right answer to be really useful. They can find crashes and other "always wrong" behaviour though.

Lastly, in lots of codes, we aren't expected to protect users from themselves, so entering an obviously silly value (like a person's height of 15 ft) needn't give a sensible answer. We can fall back to "Garbage In Garbage Out" to excuse our errors. But our users might be a lot happier if we don't, and in important circumstances we wont be allowed to either.

Back to the Grind

That all sounds rather dreary - writing good tests is hard, and we're saying they're never enough, so why bother? Well, lets back off for a minute and think about this. How do we turn an "infinite" space into a tractable one? Well, we have to make it smaller. We have to impose restrictions, and we have to break links and make more things independent. The smaller the space we need to test, the more completely we can cover it.

Unit testing

Most people have probably heard of unit testing by now - testing individual functions in isolation. It seems like if you do this then you have tested everything, but this is not true. What happens if you call this function followed by that one (interdependency)? What happens if you call this function before you did this other thing (violation of preconditions)?

Unit testing is not the solution! Unit testing is the GOAL!

If we could reliably say that all of the units working means the program works, then we can completely test our program, which is amazing! But in reality, we can't decouple all the bits to that extent, because our program is a chain of actions - they will always be coupled because the next action occurs in the arena set up by the preceeding ones. But we have to try!

Meeting the Goal

So we have to design, architect and write our programs to decouple as much as possible, otherwise we can't understand them, struggle to reason about them, and are forced to try and test so much we will certainly fail. A lot of advice is given with this sort of "decoupling" in mind - making sure the ways in which one part of a program affects another are:

  1. as few as possible
  2. as obvious as possible
  3. as thin/minimal as possible

What sorts of things do we do?

  • Avoid global variables as much as possible as anything which touches them is implicitly coupled
  • C++ statics are a form of global, as are public Fortran module variables (in most contexts)
  • Similarly, restrict the scope of everything to be as small as we can to reduce coupling of things between and inside our functions
    • C++ namespaces, Fortran private module variables, Python "private" variables (starting with an underscore, although not enforced in any way by the language)
  • Avoid side-effects from functions where possible, as these produce coupling. Certainly avoid unexpected side effects (a "getter" function should not make changes)
    • Fortran offers PURE functions explicitly. C++ constexpr is a more complicated, but related idea
  • Use language features to limit/indicate where things can change. Flag fixed things as constant.
    • Use PARAMETER, const etc freely
    • In Fortran, use INTENT, in C/C++ make function arguments const etc
    • In C++ try for const correctness, and make member functions const where possible. Use const references for things you only want to read
  • Reduce the "branchiness" (cyclomatic complexity) of code. More paths means more to test.
  • Keep "re-entrancy" in mind - a function which depends only on its arguments, doesn't change it's arguments, and has no side effects can be called over and over and always give the same answer. It can be interrupted/stopped and started afresh and still give the same answer. This is the ultimate unit-test friendly function!

Overall, we are trying to break our code into separate parts, making the "seams"between parts small or narrow. If we drew a graph of the things in our code and denote links between them with lines, we want blobs connected with few lines wherever possible.

Every link is something we can't check via unit testing, and unit testing is king because it is possible to see that we have checked every function, although we can never see if we have checked every possible behaviour within the function, even with code coverage tools.

Aside - Writing useful unit tests

Even if everything is unit-testing amenable, it doesn't make it actually easy. There's a few things to keep in mind when writing the tests too.

  • Poke at the edge-cases. If an argument of '3' works, then '6' probably will too, but will '0'? What about a very large number? Even (a+b)/2 for an average will break down if a + b overflows! A negative number? Look for the edges where behaviour might change, and test extra hard there
  • If your function is known only to work for certain inputs, make them preconditions (things which must be true). This can be done either in documentation, or using things like assert
  • Check for consistency. If you mutate an object, check it is still a valid one at the end
  • Try to come up with things you didn't think about when you wrote the function, or that no "sensible caller" would do
  • Check that things fail! If you have error states or exceptions, make sure these do occur under error conditions

Dealing with the Rest

If we could actually produce a piece of code with no globals, no side-effects and every function fully re-entrant, unit tests would be sufficient to check everything. Sadly, this is impossible. Printing to screen is a side-effect. All useful code has some side effects somewhere. Mutating an argument makes a function non-reentrant (we'd have to make a copy and return the new one instead, and that has costs). So we seem doomed to fail.

But that is OK. We said unit-testing is a goal, something we're trying to make as useful as possible. We can do other things to make up for our shortfalls, and we definitely should, even if we think we got everything. Remember, all of those bullets are about minimising the places we violate them, minimising the chances of emergent things happening when we plug functions together.

We need to check actual paths through our entire code (integration testing). We need to check that things we fixed previously still work (regression testing). We probably want to run some "real" scenarios, or get a colleague to do so (sorta beta testing). These are hard, and we might mostly do them by just running our program with our sceptical hats on and watching for any suspicious results. This is not a bad start, as long as we keep in mind its limitations.

Takeaway

What's the takeaway point here? Unit Testing works well on code that is written with the above points in mind. Those points also make our code easier to understand and reason about, meaning we're much less likely to make mistakes. Honestly, sometimes writing the code to be testable has already gained us so much that the tests themselves are only proving what we already know. Code where we're likely to have written bugs is unlikely to fail our unit tests - the errors will run deeper than that.

So don't get caught up in trying to jam tests into your existing code if that proves difficult. You won't gain nearly as much as by rewriting it to be test friendly, and then you almost get your tests for free. And if you can't do that, sadly unit tests might not get you much more than a false sense of security anyway.


March 07, 2022

What's the time right now?

This is not one of my mistakes, this is one I read about second hand (you'll see the pun in a moment), but it is a great illustration of the importance of writing the code you actually mean to, thinking properly about what it is that you mean, and watching out for weasel words that might confuse you.

Consider these two lines of code and decide for youself whether they have the same effect:

if (DateTime.Now.Second == 0 && DateTime.Now.Minute == 0)

if (DateTime.Now.Minute == 0 && DateTime.Now.Second == 0)

Decided yet? If you think they are the same, take a second now (that's a hint!) to think about the meaning of each part, and what its value will be. Suppose the original author intended to run some process once per hour, so put one of these into a loop. Can you see why one of these might succeed, and the other fail occasionally (assume the rest of the code takes some tiny fraction of a second)?

By the way, if you're worried about not knowing what language this is in, don't be. This is a perfect example of good "self documenting code" which we've written a bit about before. Based on the names, we can infer (correctly) that DateTime is some sort of thing, which can supply the date-and-time Now through the "DateTime.Now" construct, and that we can then look at the Minutes and Seconds value of the time Now. That's all we need to know.

So, back to the question. Are the lines the same? Well, the order is different. There's a double-& AND operation. We don't strictly know whether the left or right condition will be checked first (it might depend on language), and we don't actually know if both will be (if you're not sure why that is, we've talked about short circuiting in logical operations here). But that shouldn't matter unless the DateTime object is having some weird side-effects and I promise it isn't. The problem is much more fundamental than that.

If you haven't spotted it, it's time to write down very clearly what we wanted to check. We want to know if, right now, the value of both the minutes and the seconds entries are 0. It looks like that's what either line does, but there's a nasty weasly word sneaking in here - the word "now". What can "now" actually mean in programming terms, where we know operations ultimately are carried out one by one in sequence? For instance, what would you expect the following code could print?

if (DateTime.Now.Second == 0) print(DateTime.Now.Second)

If you said 0 only, hold on a second (that's a hint, too) and look again, and think about what this code will actually do in terms of basic operations. It will get a value for DateTime.Now.Second. It will compare this to 0. If the comparison is true it will get a value for DateTime.Now.Second, and it will print this value. And there is the problem! The value we check and the value we print are not always the same.

If it still isn't clear to you why the original two lines are not equivalent, reread that last paragraph a few times and apply the same idea to the original question. If it's been obvious to you all along, great. You're unlikely to make the mistake this original programmer did. But why did they make this mistake? There could be several reasons:

  • They might simply not have spotted there could be a problem - this is pretty likely, but not very interesting
  • They might have thought somehow the compiler/runtime would recognise the values as "the same" - this is a pretty fundamental mistake, perhaps due to thinking of "Now" as a property rather than an operation. In some very imprecise sense, we should think of "Now" as having a side-effect on itself because it takes time, so multiple calls will not have the same result as a single one
  • They might have blindly removed a temporary variable containing DateTime.Now, without thinking deeply about the consequences. This is quite likely because it looks so simple, and is why refactoring like this needs, if anything, more attention than writing the code in the first place
  • Lastly, they might have thought carefully and precisely, but ultimately wrongly about it. If they confused the idea of seconds as a time point with seconds as a time interval, they may have thought that both the Seconds and the Minutes property would surely be evaluated within the same second, and thus work as expected, where actually even a microsecond between them could be the difference between 1:59.999 and 2:00.000

So, that explains why the lines don't work as intended, but there's one weird thing left - how come one of them (the first one, if evaluation occurs left to right) works, but the other doesn't? Effectively, the minutes value depends on the seconds value - the minute is incremented only when the seconds reads 59. We know the whole operation takes far far less than 1 second (ignoring interrupts or anything else suspending operation), so once we have confirmed the seconds value as 0, we have at least 59-60 seconds to check the minutes value, during which it cannot change. Certainly this will occur within that interval, and the code will work perfectly (again, ignoring external iterrupts, which our code can never account for anyway).

So, does that mean the first option is actually safe, or good code? Well, not without a comment it certainly isn't! If we're relying on this sort of subtlety, somebody is going to be tripped up by it, and it might even be us. It's also a bit delicate - suppose we were to need to use DateTime.Now in another part of the condition, or in the body? We'd have to think very hard about whether we're still "safe" and we simply don't need to. A temporary variable is much clearer, and the possible optimisation here is a) tiny and b) best left to our compiler anyway. Oh and c) possibly a de-optimisation anyway, depending on the cost of the DateTime.Now lookup. Also, it'd be easy to forget that the left-to-right ordering is important, and we might try this in a situation or language where that is not guaranteed.

Ultimately, I simply don't like relying on such a subtlety. It's too clever and it doesn't need to be. It's not that I don't like clever code, because I do. I like clever code that makes one exclaim "oh yes, that's so simple, it's brilliant", and this aint that. This is more "oh blimey, is that how it works". Keep it simple, and if you can't keep it simple, at least make it satisfying.

Postscript

Before we go, I hope something has been nagging at you throughout this post. I started off by saying we should think hard about what we mean and what we want to actually do. I introduced this idea of minute-0 and second-0 and then proposed it as a solution to doing something once per hour. Did it strike you as a pretty terrible solution to that problem? Because it is. Suppose we start running our code at 9:00:30 AM. Would we expect whatever this block controls to wait an entire hour to fire for the first time? What if the code is never running at the change of the hour? This block would never run. Suppose we start 1000 copies of this code. Do we really want all 1000 to execute this block at the same time? Suppose some other code took longer than expected and this block isn't reached once per second - some hours might be simply skipped. In nearly all cases we'd be better checking whether at least an hour has elapsed since last we did the thing, and we'd avoid all of these problems. We meant once per hour, not on the first second of every hour, and we should have coded that, not this.


This post is inspired by an article I read a while ago, namely https://thedailywtf.com/articles/an-hourly-rate In particular, a comment mentions the different behaviour of the two possible lines, which this post focuses on.


March 02, 2022

Learning to Program

We (Warwick RSE) love quotes, and we love analogies. We do always caution against letting your analogies leak, by supposing properties of the model will necessarily apply to reality, but nontheless, a carefully chosen story can illustrate a concept like nothing else. This post discusses some of my favorite analogies for the process of learning to program - which lasts one's entire programming life, by the way. We're never done learning.

Jigsaw Puzzles

Learning to program is a lot like trying to complete a jigsaw puzzle without the picture on the box. At first, you have a blank table and a heap of pieces. Perhaps some of the pieces have snippets you can understand already - writing perhaps, or small pictures. You know what these depict, but not where they might fit overall. You start by gathering pieces that somehow fit together - edges perhaps, or writing, or small amounts of distinctive colours. You build up little blocks (build understanding of isolated concepts) and start to fit them together. As the picture builds up, new pieces fit in far more easily - you use the existing picture to help you understand what they mean. Of course, the programming puzzle is never finished, which is where this analogy starts to leak.

Jigsaw image

I particularly like this one for two reasons. Firstly, one really does tend to build little isolated mini-pictures, and move them around the table until they start to connect together, both in jigsaws and in programming. Secondly, occasionally, you have these little moments - a piece that seemed to be just some weird lines fits into place and you say "oh is THAT what that is!". One gets the same thing when programming - "oh, is that how that works!" or "is that what that means".

Personally, a lot of these moments then become "Why didn't anybody SAY that!". This was the motivation behind our blog series of "Well I Never Knew That" - all those little things that make so much sense once you know.

The other motivation for the WINKT series is better illustrated using a different analogy - hence why we like to have plenty on hand.

A Walk in the Woods

Another way to look at the learning process, is as a process of exploring some new terrain, building your mental map of it. At first, you mostly stick to obvious paths, slowly venturing further and further. You find landmarks and add them to your map. Some paths turn out to link up to other paths, and you start to assemble a true connected picture of how things fit together. Yet still there is terrain just off the paths you haven't gone into. There could be anything out there, just outside your view. Even as you venture further and deeper into the woods, you must also make sure to look around you, and truly understand the areas you've already walked through, and how they connect together.

My badly done keynote picture is below. The paths are thin lines, the red fuzzy marks show those we have explored and the grey shading shows the areas we've actually been into and now "understand". Eventually we notice that the two paths in the middle are joined - a nice "Aha" moment of clarity ensues. But much of our map remains a mystery.
Schematic map with coverage marked

And that is one of the reasons I really like this analogy - all of the things you still don't know - all that unexplored terrain. This too is very like the learning process - you are able to use things once you have walked their paths, even though there may be details about them you don't understand. On the one hand, this is very powerful. On the other, it can lead to "magic incantations" - things you can repeat without actually understanding. This is inevitable in a task this complex, but it is important to go back and understand eventually. Don't let your map get too thin and don't always stick to the paths, or when you do step off them, you'll be lost.

This was the other main motivation behind our WINKT series - the moments when things you do without thinking suddenly make sense, and a new chunk of terrain gets filled in. Like approaching the same bit of terrain from a new angle, or discovering that two paths join, you gain a new perspective, and ideally, this new perspective is simpler than before. Rather than two paths, you see there is only one. Rather than two mysterious houses deep in the woods, you see there is one, seen from two angles.


Take Away Point

If you take one thing away from this post, make it this: the more you work on fitting the puzzle pieces together, the clearer things become. Rather than brown and green and blue and a pile of mysterious shapes, eventually you just have a horse.

And one final thing, if we may stretch the jigsaw analogy a little further: just because a piece seems to fit, doesn't mean it's always the right piece...


December 08, 2021

Advent of Code 2021: First days with R

The Advent of Code is a series of daily programming puzzles running up to Christmas. On 3 December, the Warwick R User Group met jointly with the Manchester R-thritis Statistical Computing Group to informally discuss our solutions to the puzzles from the first two days. Some of the participants shared their solutions in advance as shared in this slide deck.

In this post, Heather Turner (RSE Fellow, Statistics) shares her solutions and how they can be improved based on the ideas put forward at the meetup by David Selby (Manchester R-thritis organizer) and others, while James Tripp (Senior Research Software Engineer, Information and Digital Group) reflects on some issues raised in the meetup discussion.

R Solutions for Days 1 and 2

Heather Turner

Day 1: Sonar Sweep

For a full description of the problem for Day 1, see https://adventofcode.com/2021/day/1.

Day 1 - Part 1

How many measurements are larger than the previous measurement?

199  (N/A - no previous measurement)
200  (increased)
208  (increased)
210  (increased)
200  (decreased)
207  (increased)
240  (increased)
269  (increased)
260  (decreased)
263  (increased)

First create a vector with the example data:

x <- c(199, 200, 208, 210, 200, 207, 240, 269, 260, 263)

Then the puzzle can be solved with the following R function, that takes the vector x as input, uses diff() to compute differences between consecutive values of x, then sums the differences that are positive:

f01a <- function(x) {
  dx <- diff(x)
  sum(sign(dx) == 1)
}
f01a(x)
## [1] 7

Inspired by David Selby’s solution, this could be made slightly simpler by finding the positive differences with dx > 0, rather than using the sign() function.

Day 1 - Part 2

How many sliding three-measurement sums are larger than the previous sum?

199  A           607  (N/A - no previous sum)
200  A B         618  (increased)
208  A B C       618  (no change)
210    B C D     617  (decreased)
200  E   C D     647  (increased)
207  E F   D     716  (increased)
240  E F G       769  (increased)
269    F G H     792  (increased)
260      G H
263        H

This can be solved by the following function of x. First, the rolling sums of three consecutive values are computed in a vectorized computation, i.e. creating three vectors containing the first, second and third value in the sum, then adding the vectors together. Then, the function from Part 1 is used to sum the positive differences between these values.

f01b <- function(x) {
  n <- length(x)
  sum3 <- x[1:(n - 2)] + x[2:(n - 1)] + x[3:n]
  f01a(sum3)
}
f01b(x)
## [1] 5

David Schoch put forward a solution that takes advantage of the fact that the difference between consecutive rolling sums of three values is just the difference between values three places apart (the second and third values in the first sum cancel out the first and second values in the second sum). Putting what we’ve learnt together gives this much neater solution for Day 1 Part 2:

f01b_revised <- function(x) {
  dx3 <- diff(x, lag = 3)
  sum(dx3 > 0)
}
f01b_revised(x)
## [1] 5

Day 2: Dive!

For a full description of the problem see https://adventofcode.com/2021/day/2.

Day 2 - Part 1

  • forward X increases the horizontal position by X units.
  • down X increases the depth by X units.
  • up X decreases the depth by X units.
                  horizontal  depth
forward 5   -->            5      -
down 5      -->                   5
forward 8   -->           13
up 3        -->                   2
down 8      -->                  10
forward 2   -->           15

==> horizontal = 15, depth = 10

First create a data frame with the example data

x <- data.frame(direction = c("forward", "down", "forward",
                              "up", "down", "forward"),
                amount = c(5, 5, 8, 3, 8, 2))

Then the puzzle can be solved with the following function, which takes the variables direction and amount as input. The horizontal position is the sum of the amounts where the direction is “forward”. The depth is the sum of the amounts where direction is “down” minus the sum of the amounts where direction is “up”.

f02a <- function(direction, amount) {
  horizontal <- sum(amount[direction == "forward"])
  depth <- sum(amount[direction == "down"]) - sum(amount[direction == "up"])
  c(horizontal = horizontal, depth = depth)
}
f02a(x$direction, x$amount)
## horizontal      depth 
##         15         10

The code above uses logical indexing to select the amounts that contribute to each sum. An alternative approach from David Selby is to coerce the logical indices to numeric (coercing TRUE to 1 and FALSE to 0) and multiply the amount by the resulting vectors as required:

f02a_selby <- function(direction, amount) {
  horizontal_move <- amount * (direction == 'forward')
  depth_move <- amount * ((direction == 'down') - (direction == 'up'))
  c(horizontal = sum(horizontal_move), depth = sum(depth_move))
}

Benchmarking on 1000 datasets of 1000 rows this alternative solution is only marginally faster (an average run-time of 31 μs vs 37 μs), but it has an advantage in Part 2!

Day 2 - Part 2

  • down X increases your aim by X units.
  • up X decreases your aim by X units.
  • forward X does two things:
    • It increases your horizontal position by X units.
    • It increases your depth by your aim multiplied by X.
                  horizontal  aim  depth
forward 5   -->            5    -      - 
down 5      -->                 5      
forward 8   -->           13          40
up 3        -->                 2
down 8      -->                10
forward 2   -->           15          60

==> horizontal = 15, depth = 60

The following function solves this problem by first computing the sign of the change to aim, which is negative if the direction is “up” and positive otherwise. Then for each change in position, if the direction is “forward” the function adds the amount to the horizontal position and the amount multiplied by aim to the depth, otherwise it adds the sign multiplied by the amount to the aim.

f02b <- function(direction, amount) {
  horizontal <- depth <- aim <- 0
  sign <- ifelse(direction == "up", -1, 1)
  for (i in seq_along(direction)){
    if (direction[i] == "forward"){
      horizontal <- horizontal + amount[i]
      depth <- depth + aim * amount[i]
      next
    }
    aim <- aim + sign[i]*amount[i]
  }
  c(horizontal = horizontal, depth = depth)
}
f02b(x$direction, x$amount)
## horizontal      depth 
##         15         60

As an interpreted language, for loops can be slow in R and vectorized solutions are often preferable if memory is not an issue. David Selby showed that his solution from Part 1 can be extended to solve the problem in Part 2, by using cumulative sums of the value that represented depth in Part 1 to compute the aim value in Part 2.

f02b_revised <- function(direction, amount) {
  horizontal_move <- amount * (direction == "forward")
  aim <- cumsum(amount * (direction == "down") - amount * (direction == "up"))
  depth_move <- aim * horizontal_move
  c(horizontal = sum(horizontal_move), depth = sum(depth_move))
}
f02b_revised(x$direction, x$amount)
## horizontal      depth 
##         15         60

Benchmarking these two solutions on 1000 data sets of 1000 rows, the vectorized solution is ten times faster (on average 58 μs vs 514 μs).

Reflections

James Tripp

How do we solve a problem with code? Writing an answer requires what some educators call computational thinking. We systematically conceptualise the solution to a problem and then work through a series of steps, drawing on coding conventions, to formulate an answer. Each answer is different and, often, a reflection of our priorities, experience, and domains of work. In our meeting, it was wonderful to see people with a wide range of experience and differing interests.

Our discussion considered the criteria of a ‘good solution’.

  • Speed is one criteria of success - a solution which takes 100 μs (microseconds) is better than a solution taking 150 μs.
  • Readability for both sharing with others (as done above) and to help future you, lest you forget the intricacies of your own solution.
  • Good practice such as variable naming and, perhaps, avoiding for loops where possible. Loops are slower and somewhat discouraged in the R community. However, some would argue they are more explicit and helpful for those coming from other languages, such as Python.
  • Debugging friendly.Some participants, including Heather Turner and David Selby, checked their solutions with tests comparing known inputs and outputs. I drew on my Psychology experience and opted for an explicit DataFrame where I can see each operation. Testing is almost certainly a better solution which I adopt in my packages.
  • Generalisability. A solution tailored for the Part 1 task on a given day may not be easily generalisable for the Part 2 task. It seemed desirable to refactor one’s code to create a solution which encompasses both tasks. However, the effort and benefits of doing so are certainly debatable.

We also discussed levels of abstraction. The tidyverse family of R packages is powerful, high-level and quite opinionated. Using tidyverse functions returned some intuitive, but slower solutions where we were unsure of the bottlenecks. Solutions built on R base (the functions which come with R) were somewhat faster, though others using libraries such as data.table were also rather quick. These reflections are certainly generalisations and prompted some discussion.

How does one produce fast, readable, debuggable, generalisable code which follows good practice and operates at a suitable level of abstraction? Our discussions did not produce a definitive answer. Instead, our discussions and sharing solutions helped us understand the pros and cons of different approaches and I certainly learned a few useful tricks.


May 06, 2020

Shebang shebang

The "shebang" is the combination of characters #!, called "hash bang", and sometimes shortened to "sh-bang" or commonly "shebang". You might see this in the first line of things like Bash scripts, Python scripts etc, and wonder what its for, and how to use it. if so, keep reading.

Where does it come from?

The hash-bang combined character was introduced in the world of computers around 1979, although it took a while for the use to standardise. It's a nice, human readable and memorable way to encode two Hex numbers which have a special meaning to the underlying operating system. The hash name is pretty standard, the bang is a bit unusual. According to Wikipedia, the name "bang" for the character "!" was common in the 50s, possibly from comic book gun sounds (Bang!). If the title reminds you of something else about the 50's, you're probably thinking of Sh-boom sh-boom (Life could be a dream). That seems to show up anywhere a TV show wants to indicate the time period.

Where does it go?

In normal bash scripts (or Python), the hash character means a comment. It might then look like the "special" character is the exclamation mark after it, but that's not quite it. The shebang goes far beyond just bash scripts, right into the depths of the operating system. This means it is vitally important that this be the first characters in the file. There can be NO space before them (unlike a simple comment), NO space between them, and NO other characters before or between.

What does it do?

So what does it actually do? If a script with a shebang is run (e.g. typing `./scriptname` in the shell), the hash bang tells the operating system that whatever follows on that line is the interpreter to use to run the script. That is, it should name a program (in full - no relative paths here) that can run the script. Usually, this will be something like `/bin/python` or `/bin/bash` for Python or bash respectively. You can also specify a particular Python version e.g. `/bin/python3` and can do some things with passing flags (see e.g. Wikipedia for some details).

The handy thing about this is you no longer have to specify how to invoke the script, you just run it. If you're a mousey-clicky sort of person, once you've put in the shebang, if you set your OS to run the file (on Linuxes, choose Run not Display in prefs, on OSX choose to open with Terminal, on Windows this depends how you installed Python, see e.g. here) you can double-click to run, which can be handy.

What are the problems?

So that sounds great. But there are a few problems. The way we described using this above, specifying `/bin/python`only works on computers where that specifies the Python the user wants invoked, and where that path is correct for the system at hand. Lots of people might have multiple Pythons, and prefer one be used, or they might have installs in custom places. On Unix-like systems (including OSX) there is a handy utility to deal with this, namely `/usr/bin/env`. This program "knows" what the user has configured to be invoked when they type a program name, such as "python". You can check it by typing e.g. `/usr/bin/env python --version` into a terminal (shell). I get 2.7.10, or if I try python3 I get 3.7.7. The "normal terminal" is probably "sh" or "bash" (commonly those give the same), and you may have other shells such as csh, tcsh or zsh.

You can use this to get the right program with the shebang method too. Instead of `/bin/python` in the first line of your script, use `#! /usr/bin/env python` or bash or python3 and your script will be run with that. Remember to `chmod u+x` (change access mode) to make the file executable too!

Excellent as this is, it has its own minor issue worth noting. On some systems, bin/env will be handed everything that follows as a single command, such as "python --version" in the example we did above. So rather than looking up the "python" program and handing it the --version argument, it tries to look up some non-existent program with a space in the name. It is best to avoid passing any arguments in the shebang line!

Got it?

Hopefully that all makes sense. The shebang is slightly mysterious at first, but very handy once you know what it does.


March 2024

Mo Tu We Th Fr Sa Su
Feb |  Today  |
            1 2 3
4 5 6 7 8 9 10
11 12 13 14 15 16 17
18 19 20 21 22 23 24
25 26 27 28 29 30 31

Search this blog

Tags

Galleries

Blog archive

Loading…
RSS2.0 Atom
Not signed in
Sign in

Powered by BlogBuilder
© MMXXIV