All entries for August 2022
August 03, 2022
Global data idioms in C++
We've spoken a fair bit about global variables and why they are risky, but sometimes they are the best solution available. For instance, a recent video discussed the idea of "tramp" data, which is passed through certain objects and functions simply to get to another place. This, as we talked about there, has a lot of problems, several of which are close to or identical to, the equivalent problems with globals. So... sometimes the simple fact is, you have to choose your evil, and sticking to broad design principles is not a good motivation for locally bad design. As I quoted in that video:
"But passing global data into every function, whether it needs it or not, imposes a great deal of overhead. This is known as "tramp data", and often reflects a design error. If these things are truly global, then they're global, and pretending that they aren't, in order to make your code "OOP", is a design error."
(Pete Becker, this StackOverflow answer)
So let us agree that there are cases where you have global data and/or objects and that representing them as global state is, in fact, the correct approach. This post is going to discuss some of the options for making this "as least-bad as we can".
Static - more slippery than stable
Static as a C++ keyword is another in the unfortunately long list of C++ keywords with subtly different meanings in different contexts. I am not going to try and explain what it technically means, since that's been done before (e.g. here) and has to be quite technical. In a moment, we'll go through a few uses of it and how they work, and after that the definition might make more sense. Let's just use one small bit of the implications - static on a variable or class member is a way to get a "storage location" (i.e. a variable) which lasts for the entire program. In other words, it's lifetime is the program duration, and as we discussed in another recent videothat is in many ways equivalent to being a global variable.
Because it's "basically a global" we have to caution against static (unless also const) in any code which is, or might in future be, multithreaded. Global mutable (able to be changed, aka not const) state is the anathema of threading. In nearly every case, you do not want to try and handle the combination!
Definition, Declaration and the One Definition Rule
C++ requires you to do two things so that a variable or function "works". Firstly, you need to have declared what it looks like, so that all parts of the code know how to handle it. This means giving a variable a type, or a function a prototype - so that a variable access or function call has the right "pattern" - knows how big something is, how many parameters it has, etc. But you also have to define, or "fill in" what something actually is - create the storage space (memory) for a variable, or define the body of the function.
For variables, expecially if you have been avoiding globals, you may have never encountered this because in most cases the declaration and definition are the same - int i;does both. Do note that this is nothing to do with initialising or giving a value to a variable. The only time they become separate is in certain cases where we want multiple parts of the code to be able to use the same variable, such as for globals. Method 1 below shows how we can do this.
For functions, it is a bit more familiar - to call a function we need to have access to its prototype, and for compilation to complete and the program to run, it has to be given a body that can actually execute. For simple functions, this is why we normally declare prototypes in a header, and function bodies in a cpp file. If we put the body in a header, and include that header several times, we get errors telling us the function is multiply defined. For classes though, we can happily define function bodies in a class definition without any problem. What's happening here?
First, lets just clarify what including a header does - it pretty literally dumps the code from the included file into the including file. This is old stuff - the compiler has nothing to do with it, as it happens at the "pre-processing" step while the code is still just text. By the time the compiler sees it, the definition you wrote might appear several times, and there is no way to know that they came from the same place.
Now, the problem arises because C++ disallows certain forms of ambiguity. Suppose a function could be defined in several places - and suppose these were different! Imagine the chaos if one form were used in some places and another in others. Or might the compiler to expected to pick one? Disallowing this is mostly a Good Thing, but sometimes it has issues. For instance, being unable to define any functions in a header would rule out "header-only" libraries, where you simply include them and it works. It would also severely limit templating.
Not being able to define something multiple times is called the One Definition Rule. Because of the drawbacks just mentioned though, there are several places where it is allowed to be violated with the strict requirement that the programmer takes care of the risks. That is, you may be allowed to have a repeated definition, but if it is not the same everywhere, that is your problem (and mostly undefined behaviour, so a Very Bad Thing). The details aren't important here, and it's enough for us to suppose that where the benefits were sufficient, the rule was allowed to be broken.
Inline
Inline is a special keyword, which, yet again, has a complicated history. Originally, "inline" was a hint to the compiler that a function should be inlined (think - pasted into place in source code, rather than executed with a jump). In order to do this, the compiler would need to have access to the function body in all the places in the code it might be used. This meant the function had to be fully defined in an included header (strictly, not true, but good enough for us), which violated the One Definition Rule. Inline was considered useful, thus the rule had to be bent - C++ generally tries to allow useful things where it can (where the compiler writers think they can make it work). Once inline was allowed to bend the rule, people used it for that purpose, such that now it pretty much only is used for this case.
Since a class member function defined within the body of a class can only really be meant to be inline, this was done by default too, hence why you may never have seen this keyword. Note that it applies to functions only - and see Example 1b below for the newer extension to inline - inline variables.
Global state example in C++
Let's assume we really do have some data which is accessed in many parts of our program, both for read and write, and that we have already decided this is the best design. Perhaps it is used so widely that we would end up passing it endlessly, usually as that "tramp data". Once it's passed, it's accessible, and only some very horrible tricks (const_cast...) can allow us to restrict where it can be changed. It's already global in implication, why not admit this and allow it to be global in design.
What options do we have to do this, and what are their pros and cons?
Method 1 - A "simple" global variable
For simple data, such as a single integer, we can use the simplest "global variable" idiom - that is we want a variable whose declaration ("pattern") is available to all parts of our program, which is defined (actually created) once and only once in our code. This means we can't do what we might first think of and just create it in a header file because that would define it several times and our compiler will complain about an ambiguous name. Nothing clever we can do with include guards or trickery can avoid this. What we can do though, is to declare it in a header, and define it in a cpp file, like this:
file.h:
#ifndef file_h
#define file_h
extern int my_var;
#endif
and in file.cpp
#include "file.h"
int my_var = 10;
So what does all that mean? Our header file uses standard include guards (we'll leave these out in future) which stop the header being included repeatedly. We have our my_var variable, an int, and we declare it "extern". This tells the compiler that there will be a definition for my_var by the time the code is linked (if you're not too familiar with linking, think: compiled files combined into an executable). This means all of the parts of the code which use this header will happily compile using what they know my_var has to look like, and not worry about where it might be actually created. Then, the C++ file does the actual creation. This can happen only once, or we would risk having two separate variables with the same name.
This is the simplest idiom to get what we want, but has several problems. It's a bit confusing: extern is an old keyword that many people will never encounter. We have to go looking for the actual initialisation step, and verify that we set a value. In a lot of cases, we might have no other reason to have the file.cpp file except to instantiate one variable - and our only alternative there is to demand it be defined in some other of our cpp files, which is confusing, risky and all around a Bad Thing.
Lastly, a name as generic as my_var has now been "claimed" throughout our entire program and we re-use it at our peril. Shadowing, where a local variable "covers up" a global one, is always confusing. What would this snippet do, for instance?
int main(){
std::cout<<my_var<<std::endl;
{ int my_var = 11;
std::cout<<my_var<<std::endl;
}
std::cout<<my_var<<std::endl;
}
Actually, the problem is a bit worse than this, because only files including our header will see my_var, and you can probably work out why this can be a maintenance nightmare if suddenly code changes mean two names which had been separate begin to collide!
Method 1a - A namespaced global
To avoid the shadowing issue, and improve this solution from "pretty awful" to "alright", we can at least restrict our variable name to a namespace. If we are careful with our namings and dividing things into coherent sets, we can quite usefully indicate more about our variable, and make it easier to find places that related variables are, or should be, changed. For example
namespace display_config{
extern int width;
extern int height;
}
and
int display_config::width = 720;
int display_config::height=1080;
Generally, if we're changing our "display height" we'd also expect the width to be changed, and we now have some ability to spot this.
Method 1b - [C++ 17 or newer] This, but better
From C++ 17, this kinda obviously common and useful idiom is supported much more, by the addition of "inline" variables. We discussed inline above for functions, and this is very similar. The definitions must match, but in our case of there being only a single actual line, which is included several times, this is fine. We end up with the much more elegant looking:
namespace display_config{
inline int width=1080;
inline int height=720;}
This has several advantages - not needing a cpp file definition, being able to see at a glance that our variables are initialised (instead of having to look in two places), and having one less keyword/idiom to remember. However, C++17 is perhaps a little too new to use without at least considering where you will be compiling/running your code, as you might have to put in a request for a compiler update for a few years yet.
Method 2 - A class with static members
At the point where we're talking about linking variables to each other, we are into the territory where a class becomes a good solution. This would let us, for example, relate the setting of height and width explicitly. The obvious step from our namespace example, to a basic class is this:
file.h
class display_data{
static int height;
static int width;
};
and file.cpp
int display_data::height=720;
int display_data::width=1080;
Now this is rather different looking and the keywords have changed. We no longer have any extern, but we have introduced static, and we have these slightly unexpected definitions without which we get compiler errors telling us the variable does not exist.
A static class member variable is shared by all instances of the class. If we read or write to it, this must have the same effect (in terms of how the bits in memory are changed) whichever class instance we would use - thus we don't need to specify. In fact, we don't need to have any instance, we can access those variables as display_data::height from anywhere. This sort of explains why the second bit is needed - if they're not associated with any class instance, the variables have to be "created" and given storage space somewhere. As before, we need those to be in some cpp file and often they are the only thing there, which is ugly.
So this also has problems. Firstly - if our class has only static members, that's weird, and often considered an "anti-pattern". It's a heavy solution and goes against some of the motivation for objects. However, if we want to control the setting of height and width (forcing them to be non-negative for instance), we can get the compilers help to enforce this by making them private and providing setter and getter functions.
On the whole though, entirely static class members is an oddball solution, and I'm not sure where it's really useful.
Method 3 - A static class instance
Going back a bit, our display has sort of taken on an "object like" existence, with several items of data and methods that act on them. This object really could have several instances, it's just that for our specific program, we want to have a global one referring to "the display". This is far more naturally represented by a global instance of a regular class, so we can use the first approach with a user-defined class rather than a plain int, and get some benefits.
However, we still have most of the drawbacks of global-ness, so we do need some strong motivation. And here we additionally have all of the drawbacks of method 1 like needing the C++ file. Moreover, there is a thing called the Static Initialisation Order Fiasco if we have interaction between static objects, and a lot of other stuff gets really messy.
We mention this option for completeness, but would probably never recommend it.
Method 4 - Function Local Statics
Possibly the best approach to allow us to have a single global class like the previous method, but more safely, is to exploit function-local static variables. These are a lot like global statics, but their scope is restricted to the function where they "live". This clears up a lot of the issues from Method 1 and is much, much better. We do this (showing the function body only, there would also be a header containing prototype):
int get_display_height(){
static int the_display_height=720;
return the_display_height;}
That lets us have this global variable, but gives us no way to set it. It's easy in this case to find a sentinel value to let us do this to fix that:
int get_display_height(int new_val=-1) //In header
int get_display_height(int new_val){
static int the_display_height=720;
if(new_val != -1){the_display_height = new_val;}
return the_display_height;}
but not all things have a sentinel.
In those cases, and some others where a sentinel is not appropriate, we could instead return a reference or pointer to the variable, such as this:
int * get_display_height(){
static int the_display_height=720;
return &the_display_height;}
This is perfectly valid for a static function local, in a way which it is absolutely not valid for a regular function local variable. Yet again, we're running into confusing idioms and must tread carefully! Worse still, we've now opened up setting of our variable to the entire code and can no longer restrict values to be positive, or anything else!
Worse still, a casual reading of those snippets might miss that the initialisation to 720 is done only the first time the function is encountered. Any subsequent calls refer to the same variable, the_display_height, but the setting is not redone. In this case, we absolutely rely on that behaviour, but if it is something more complicated like a heap allocation it can really confuse you.
Method 5 - A Singleton Class
The final method we'll discuss here is the most heavy weight, but can be really useful. Suppose we really do have a class containing data and operations on it, and we want there to exist precisely one of them. This will be our global object. Much like the previous idiom, we'll have the actual storage location be a function-local static variable. So we will do something like this:
static theClass * theClass::get_instance(){
static theClass * the_instance = new theClass();
return the_instance;}
Anywhere that we want to use the class, we simply get it using theClass::get_instance()->blah.
There's one more trick we need, because we said that we want only one of these to exist - which is to privatise the constructor for theClass so that only the one in this function can be created. Voila, a single entity! This is why the function above is a class-member function, so that it and only it is able to call the constructor.
This is generally know as the Singleton Pattern and is very handy. However, be aware that it does involve statics and shared entities and so must be handled carefully. In particular, you must make sure either all operations on the class leave it in a valid state, or two parts of your code might both use it and confuse each other - obviously this gets a lot more pressing in multithreaded code, but even serial code can have the problem.
Conclusion
Global data, if it truly is global, is global however you code it. You have to use caution, but whether you pass it about, or use one of these tricks, your data is subject to change in multiple places. Be wary. Use every tool the compiler gives you, such as const, and block scoping, and only globalise things that are worth the headache!