All entries for Wednesday 22 August 2018
August 22, 2018
Data structures 2 – Arrays Part 1
This time we're having a little bit on arrays. They're pretty much the prototypical data structure and there's a good chance that you're already familiar with them, but if you're not, arrays are collections of a datatype (like int or float) and you can access an individual element by it's index (a number (usually starting from either 0 or 1) that describes and individual element). They're pretty easy to use and there's lots of resources on arrays so I won't introduce them too much here, but there are some subtleties that you should worry about when using them. Most of them are to do with an element of modern computers called prefetching
In the early days of computers (and to a degree up until the 1980s) memory and processors ran at pretty much the same speed. The main job of a programmer was simply making sure that the processor could keep up with the data that was fed to it from the memory by simplifying the computations to be done as much as possible. But processors kept getting faster and faster and memory wasn't able to keep up with them, at least not while also increasing the amount of memory in the computer. To try and smooth out this effect processors were given caches, small areas of memory that are very close to the processor and that can run as fast as the processor. If your program and data is small enough to fit in this cache memory then it will run very fast indeed, but this isn't a very common situation, so the question is : what should be in the cache memory? There's a lot of work goes into deciding this, but in general the rule is : data that you've used recently and data that is "near" data that you've used recently. There are two reasons for this nearby data being used. Firstly data is loaded into the cache memory in fixed chunks (called cache lines) so if your data is smaller than a cache line then adjacent memory is always loaded anyway, but in more modern machines there is also a prefetch unit in the CPU that actively pulls in additional cache lines if it "thinks" that the processor might need them.
The net effect of this is that not all memory access is equal. Memory access patterns that make your processors prefetching strategy a success are likely to give better performing code than access patterns that fail to do so and this has implications for how you should access your arrays.
Behind the scenes all that arrays are is contiguous strips of memory of a fixed length. Since they are contiguous the most common prefetching patterns will be a success for arrays so long as you go through the array accessing element i+1 immediately after you've accessed element i. That at least is easy, but what happens when you have multidimensional arrays? (int a[100][100]; INTEGER, DIMENSION(100,100) :: a etc.)
Even multidimensional arrays are 1D strips of contiguous memory (in C you can build ones that aren't but I'm ignoring those here). The mutidimensionality is created by the fact that behind the scenes the two dimensional index is converted into a 1 dimensional index. For example, if you have an array nx x ny elements and you are accessing element (i,j) then you could convert this to a 1D index by calculating element = i * ny + j (assuming that the index goes from (0:nx-1, 0:ny-1)). As you increment j you move between adjacent elements and then when you reach j=ny-1, you increment i and set j=0 and you still go smoothly to the next index in the underlying 1D memory strip. The problem is that this is a completely arbitrary choice, I could just as easily have written element = i + j * nx and then to move element by element through the array you have to increment i first and then j when i=nx-1. In fact, different programming languages make different choices for this, and the difference is called either column major or row major order.
Fortran is column major. In order to move through a Fortran array in the right order you should change the last index in the array specifier first. C is row major, so you should change the first index in the array specifier first. As a consequence, despite being completely different both of these programs is operating correctly.
INTEGER, DIMENSION(100,100) :: a
DO iy = 1, 100
DO ix = 1, 100
a(ix,iy) = a(ix,iy) + 1
END DO
END DO
int a[100][100];
for (ix = 0; ix<100; ++ix){
for (iy =0; iy<100;++iy) {
a[ix][iy] = a[ix][iy] + 1;
}
}
Other languages make one or other of these choices, usually with an eye to the language that their "typical" original user was familiar with, so Matlab is column major, just like Fortran because it's so common in scientific use where Fortran is also very popular.
It's very important that you get this the right way round as it can lead to significant (order 2x) speedups compared to getting it wrong. So remember FORTRAN LAST INDEX FIRST, C/C++ FIRST INDEX FIRST!
That's pretty much everything for statically sized arrays (where the size is hard coded into the source code), but there are a few remaining wrinkles to do with dynamically sized arrays (arrays where the size is specified at runtime), so that'll be the topic of part 2 of this entry.