All entries for Thursday 14 March 2019

March 14, 2019

Multiprocessor Profiling

Super quick this time - a bit of background and then something I didn't know worked as well as it does - namely gprof on multi-core programs.

Profiling

Once code works (and NEVER before) you can start worrying about performance. Often, you can tell using a bit of intuition and a few test runs what takes the most time, especially in simple cases. But this is always a bit risky, because you can easily miss the real performance limiting steps. Sod's law (if it can go wrong, it will) usually means that if something is completely unexpectedly slow, it's either trivial to optimise, because it really oughn't to be slow, or it's almost impossible.

The answer to this is not to guess, but to use one of the many available profiling tools, which tell you, function by function, and line by line, where your code spends it time. You then know where to focus your efforts.

Our upcoming training course (Accelerating Python, 27/3/19, see our calendar) will discuss profiling in detail for Python programs, and there'll be a more specific post after that, so watch this space!

Profiling Compiled Code

For compiled codes, profiling is usually pretty simple, but does require you compile code specially. For C/Fortran etc, there is a very good profiler as part of the gcc toolchain, called gprof. Using gcc, you simple compile your code with `-pg -g`(the second g adds debugging symbols and is only required on older gcc) and run it. This produces a machine-readable file of all the timings for that specific run, usually gmon.out. You can then turn this data into nicer formats using the gprof tool. Something like

gprof <path to program executable> <path to profile file>

You have to provide the program file too so that gprof can report to you in terms of program source lines. Note that strange things happen if the program you give here isn't what you ran (wrong line numbers etc). This is terribly infuriating, so don't do it.

Old Dog New Trick

I have used gprof plenty in the past, but I recently found a really useful way to apply it to multi-core codes (using MPI). This doesn't tend to work very well, because the output from different processes gets all mixed up and the timings end up wonky. It turns out that there's a really, really, simple fix, decribed e.g. hereand reproduced below.

Firstly, you need to get the profiling data on a per-process basis. As the link above says, this isn't very well documented, but is easy to do. Before running, set the environment variable GMON_OUT_PREFIX. This has to be done where the code is running, so if on a cluster etc, you'll probably need to add an export or setenv to your submission script, and/or use the '-x' argument to mpiexec (see below). So we do something like

export GMON_OUT_PREFIX=gmon.out-

When the code runs, it will produce one file per process (so be very careful on a cluster not to overwhelm your home space) named 'gmon.out-<PID>' where PID is the process ID. In theory it is possible to have the files named by MPI rank, but apparently this doesn't work well.

Now you run the code as usual, although on some systems you may need to make sure MPI run is fed the above environment variable. You can do this using the '-x <env var>' flag to mpiexec. On a local machine, as long as you exported the variable, this shouldn't be needed.

Now you want to produce the function or line level profiles using the sum of all of them. You can supply all of the files to gprof, or you can have gprof sum them into a single file, by default gmon.sum, using

gprof myprog -s gmon.out-*

Make sure the wildcard matches the right set of files - when I first tried this I forgot to clean out the files from previous runs and ended up summing across runs, with very odd results. Of course, you can use this to sum across runs to very useful effect if you want to see average profiles across multiple sets of parameters etc.

Now you can run gprof as usual, using either