All 6 entries tagged Hpc


March 04, 2020

Scheduling of OpenMP

OpenMP is one of the most popular ways of writing parallel code in academic programming. Its major advantage is that in many cases you can get performance improvements just by putting "directives" into your code that tell the compiler "you can take this loop and split it up into sections for each processor". Other parallelism schemes, like MPI, Intel Threading Building Blocks or Coarray Fortran, all involve designing your algorithm around splitting the work up; OpenMP makes it easy to simply add parallelism to the bits where you want it. (There are plenty of parts of OpenMP programming that do require you to change your code, but you can get further without altering your actual code than in pretty much any other modern paradigm.)

So what does this look like in practice?

 
MODULE prime_finder
  USE ISO_FORTRAN_ENV
  IMPLICIT NONE

  INTEGER(INT64), PARAMETER ::  small_primes_len = 20
  INTEGER(INT64), DIMENSION(small_primes_len), PARAMETER :: &
      small_primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, &
      53, 59, 61, 67, 71]
  INTEGER, PARAMETER :: max_small_prime = 71

  CONTAINS

  FUNCTION check_prime(num)
    INTEGER(INT64), INTENT(IN) :: num
    LOGICAL :: check_prime

    INTEGER(INT64) :: index, end

    end = CEILING(SQRT(REAL(num, REAL64)))

    check_prime = .FALSE.
    !First check against the small primes
    DO index = 1, small_primes_len
      IF (small_primes(index) == num) THEN
        check_prime = .TRUE.
        RETURN
      END IF
      IF (MOD(num, small_primes(index)) == 0) THEN
        RETURN
      END IF
    END DO

    !Test higher numbers, skipping all the evens
    DO index = max_small_prime + 2, end, 2
      IF (MOD(num, index) == 0) RETURN
    END DO
    check_prime = .TRUE.
  END FUNCTION check_prime

END MODULE prime_finder

PROGRAM primes

  USE prime_finder
  IMPLICIT NONE
  INTEGER(INT64) :: ct, i

  ct = 0_INT64
!$OMP PARALLEL DO REDUCTION(+:ct)
  DO i = 2_INT64, 20000000_INT64
    IF (check_prime(i)) ct = ct + 1
  END DO
!$OMP END PARALLEL DO

  PRINT *, "Number of primes = ", ct

END PROGRAM primes

This is a Fortran 2008 program (although OpenMP works with Fortran, C and C++) that uses a very simple algorithm to count the number of prime numbers between 2 and 20,000,000. There are much better algorithms for this, but this one correctly counts the primes and is very suitable for parallelising since each number is checked for primality separately. The code is already OpenMP parallelised and, as you can see, the parallelism is very simple: the only OpenMP lines are !$OMP PARALLEL DO REDUCTION(+:ct) and !$OMP END PARALLEL DO. The first line says that I want this loop to be parallel and that the variable ct should be calculated separately on each processor and then summed over all processors at the end. The second line just says that we have finished with parallelism and should switch back to running the code in serial after this point. Otherwise this is exactly the same program that I would have written for a single processor, and I get the result that there are 1,270,607 primes less than 20,000,000 regardless of how many processors I run on. So far so good, but look what happens to the timings for different numbers of processors:

Number of Processors   Time (s)
1                      11.3
2                       7.2
4                       3.9
8                       2.2
16                      1.13

It certainly speeds up! But not as much as it should, given that every single number is being tested independently (with perfect scaling the 16-processor time would be about 0.71 seconds). There are lots of reasons why parallel codes don't speed up as much as they should, but in this case it isn't any underlying limitation of the hardware; it is down to how OpenMP chooses to split up my problem, and a simple change gets the runtime down to 0.81s. The difference between 1.1 seconds and 0.8 seconds isn't much, but if your code takes a day rather than a second then the same ratio (roughly 26.5 hours vs 19.2 hours) can be significant in terms of electricity costs and the cost of buying computer time.

So what is the problem? The problem is in how OpenMP chooses to split the work up over the processors. It isn't trivial to work out how long it will take to check all numbers in a given range for primality (in fact that is a harder problem than just counting the primes in that range!), so if you split the loop in the obvious way (processor 1 gets 2 to 10,000,000, processor 2 gets 10,000,001 to 20,000,000) then one of those processors will get more work than the other, and the other will have to wait for it to finish before you can total up the number of primes. By default OpenMP does exactly that simple splitting of the loop, so you don't get all of the benefit that you should from the extra processors.

The solution is to specify a SCHEDULE clause on the !$OMP PARALLEL DO line. There are currently three main options for scheduling: STATIC (the default), DYNAMIC (each iteration of the loop is considered separately and handed off in turn to a processor that has finished its previous work) and GUIDED (the iterations are split into work blocks that are "proportional to" the number of remaining iterations divided by the number of processors; when a processor finishes its block it requests another one). There are also two others: AUTO, which has OpenMP try to work out the best strategy based on your code, and RUNTIME, which lets you choose one of the three main options when the code runs rather than when you compile it. You can also optionally specify a chunk size for each of these; for STATIC and DYNAMIC this gangs iterations together into blocks of chunksize and hands them off to processors in sequence, and for GUIDED it makes sure that the work blocks never get smaller than the specified chunk size.

For this problem GUIDED gives the best results, and swapping !$OMP PARALLEL DO REDUCTION(+:ct) SCHEDULE(GUIDED) for !$OMP PARALLEL DO REDUCTION(+:ct) in that code gives a final runtime of 0.8 seconds on 16 processors, which is good overall performance. But the message here is much more about what OpenMP is doing behind the scenes than about the details of this toy problem. The OpenMP standard document (https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5.0.pdf) is quite explicit about some of what happens, but other bits are left up to the implementer.
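
For reference, the amended parallel region from the prime counter above is just the same loop with the extra clause on the directive:

!$OMP PARALLEL DO REDUCTION(+:ct) SCHEDULE(GUIDED)
  DO i = 2_INT64, 20000000_INT64
    IF (check_prime(i)) ct = ct + 1
  END DO
!$OMP END PARALLEL DO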

So what do these do in practice? We can only talk in detail about a single implementation, so here we're going to talk about the implementation in gcc (gfortran); I'm using version 9.2.1. You can write a very simple piece of code that fills an array with a symbol representing which processor worked on each element to see what's happening. For a simple 20 element array on 2 processors you get the following results (* = processor 1, # = processor 2):

 PROGRAM test
  
  USE omp_lib
  IMPLICIT NONE
  INTEGER, PARAMETER :: nels = 20
  INTEGER, DIMENSION(:), ALLOCATABLE :: ct, proc_used
  CHARACTER(LEN=*), PARAMETER :: symbols &
      = "*#$%ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890"
  INTEGER :: nproc, proc, ix

  nproc = omp_get_max_threads()
  ALLOCATE(ct(nproc))
  ct = 0
  ALLOCATE(proc_used(nels))

!$OMP PARALLEL PRIVATE(proc, ix)
  proc = omp_get_thread_num() + 1
!$OMP DO SCHEDULE(GUIDED)
  DO ix = 1, nels
    ct(proc) = ct(proc) + 1
    proc_used(ix) = proc
  END DO
!$OMP END DO
!$OMP END PARALLEL

 DO ix = 1, nproc
   PRINT "(A, I0, A, I0, A)", "Processor ", ix, " handled ", ct(ix), " elements"
 END DO

 DO ix = 1, nels
   IF (proc_used(ix) <= LEN(symbols)) THEN
     WRITE(*,'(A)', ADVANCE = 'NO') symbols(proc_used(ix):proc_used(ix))
   ELSE
     WRITE(*,'(A,I0,A)', ADVANCE = 'NO') "!", proc_used(ix),"!"
   END IF
 END DO

 PRINT *,""
END PROGRAM test 

SCHEDULE clause             Pattern
OMP DO SCHEDULE(STATIC)     **********##########
OMP DO SCHEDULE(STATIC,4)   ****####****####****
OMP DO SCHEDULE(DYNAMIC)    *#********#*#*#*##*#
OMP DO SCHEDULE(DYNAMIC)    *#*#*************#*#
OMP DO SCHEDULE(DYNAMIC,4)  ****####****####****
OMP DO SCHEDULE(GUIDED)     ##########*****#####

You can see immediately that the scheduling behaviour is very different for the different clauses. The simple STATIC schedule splits the array in two, with each processor getting half of the domain. STATIC,4 specifies a chunk size of 4 and does the same type of splitting, but with processors getting adjacent chunks of 4 items. DYNAMIC produces quite complex interleaved patterns of processor responsibility, and as you can see two runs with the same SCHEDULE clause produce different patterns, so this really is a dynamic system: you can't predict which processor is going to run which iteration of a loop (mostly this doesn't matter, but there are plenty of real-world cases where efficiency dictates that you always want the same processor working on the same data). Finally, GUIDED produced a rather strange looking pattern where processor 2 does most of the work. This pattern is not guaranteed, but it did drop out of multiple runs quite regularly. It appears that processor 1 gets a lot of the housekeeping work associated with running in parallel (i.e. actually running the OpenMP library code itself), so the system gives more of the work in my code to processor 2.

An interesting thing that you can see immediately from these results is that I should be able to use something other than GUIDED for my prime number code. Since the problem is that certain chunks of the range of numbers to test are harder to process than others, I should be able to fix the STATIC schedule just by using a smallish chunk size rather than giving every processor 1/16 of the entire range. That way every processor gets some of the easy numbers and some of the hard numbers. Since there is always a cost to starting a new chunk of work, with 20,000,000 numbers to test a chunk size of 1000 seems reasonable, so I go back to my prime code and put in SCHEDULE(STATIC, 1000). If I now run on 16 processors I get a time of 0.80 seconds, slightly faster than my GUIDED case, confirming that the problem really was that some processors were getting an easier time of it than others.
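
That is, the only change from the GUIDED version above is the directive itself:

!$OMP PARALLEL DO REDUCTION(+:ct) SCHEDULE(STATIC, 1000)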

The take-away message here is that when running parallel code it is vitally important to understand how your problem is being split up, and whether that means all of your processors get the same amount of work. You currently have a decent array of tools to control how work is distributed, and future releases of OpenMP should also have the option of specifying your own scheduling systems!


March 14, 2019

Multiprocessor Profiling

Super quick this time - a bit of background and then something I didn't know worked as well as it does - namely gprof on multi-core programs.

Profiling

Once code works (and NEVER before) you can start worrying about performance. Often you can tell, using a bit of intuition and a few test runs, what takes the most time, especially in simple cases. But this is always a bit risky, because you can easily miss the real performance-limiting steps. Sod's law (if it can go wrong, it will) usually means that if something is unexpectedly slow, it's either trivial to optimise, because it really oughtn't to be slow, or it's almost impossible.

The answer is not to guess, but to use one of the many available profiling tools, which tell you, function by function and line by line, where your code spends its time. You then know where to focus your efforts.

Our upcoming training course (Accelerating Python, 27/3/19, see our calendar) will discuss profiling in detail for Python programs, and there'll be a more specific post after that, so watch this space!

Profiling Compiled Code

For compiled code, profiling is usually pretty simple, but does require you to compile your code specially. For C/Fortran etc. there is a very good profiler as part of the gcc toolchain, called gprof. Using gcc, you simply compile your code with `-pg -g` (the `-g` adds debugging symbols and is only required on older gcc) and run it. This produces a machine-readable file of all the timings for that specific run, usually gmon.out. You can then turn this data into nicer formats using the gprof tool. Something like

gprof <path to program executable> <path to profile file>

You have to provide the program file too so that gprof can report to you in terms of program source lines. Note that strange things happen if the program you give here isn't what you ran (wrong line numbers etc). This is terribly infuriating, so don't do it.

Old Dog New Trick

I have used gprof plenty in the past, but I recently found a really useful way to apply it to multi-core codes (using MPI). Ordinarily this doesn't tend to work very well, because the output from different processes gets all mixed up and the timings end up wonky. It turns out that there's a really, really simple fix, described e.g. here and reproduced below.

Firstly, you need to get the profiling data on a per-process basis. As the link above says, this isn't very well documented, but is easy to do. Before running, set the environment variable GMON_OUT_PREFIX. This has to be done where the code is running, so if on a cluster etc, you'll probably need to add an export or setenv to your submission script, and/or use the '-x' argument to mpiexec (see below). So we do something like

export GMON_OUT_PREFIX=gmon.out-

When the code runs, it will produce one file per process (so be very careful on a cluster not to overwhelm your home space) named 'gmon.out-<PID>' where PID is the process ID. In theory it is possible to have the files named by MPI rank, but apparently this doesn't work well.

Now you run the code as usual, although on some systems you may need to make sure mpirun is fed the above environment variable. You can do this using the '-x <env var>' flag to mpiexec. On a local machine, as long as you exported the variable, this shouldn't be needed.

Now you want to produce the function or line level profiles using the sum of all of them. You can supply all of the files to gprof, or you can have gprof sum them into a single file, by default gmon.sum, using

gprof myprog -s gmon.out-*

Make sure the wildcard matches the right set of files - when I first tried this I forgot to clean out the files from previous runs and ended up summing across runs, with very odd results. Of course, you can use this to sum across runs to very useful effect if you want to see average profiles across multiple sets of parameters etc.

Now you can run gprof as usual, using either

gprof <path to executable> gmon.out-*

or

gprof <path to executable> gmon.sum

and voila, properly added up profiling information for multiple cores. Super.


February 07, 2019

New Training

New RSE training opportunity - intermediate level MPI. Following on from a basic MPI course, such as our Intro (Dec. 2018) or Warwick's PX425, we have a 1-day workshop on some trickier topics, such as MPI types. See here for details.


October 02, 2018

Upcoming Training Opportunities

Warwick RSE's autumn term training is now available for signup for any University of Warwick staff or students, and anybody from the HPC-Midlands-Plus consortium.

This time we have two options.

The first is aimed mainly at Warwick Researchers who wish to use HPC facilities. We'll go through getting access and some essential info you'll want to know, as well as briefly mention where else you can get computing resources.

Secondly, we have a short 3-hour seminar going over all the bits of Software Development researchers should know about. This will be a pretty rapid spin through a lot of tools and words you'll need to know. Hopefully, you'll then spot when you should go and learn more about these things as they come up in your research etc.

For dates, signup etc, see our calendar


March 09, 2018

Odd MPI IO bug in Open-MPI

Quite often working with research code leads you to find the unusual edge cases where even well tested libraries break down a bit. This example that we ran across with the Open-MPI parallel library is pretty typical. We wanted to use MPI-IO to write an array that was distributed across multiple processors, with each processor holding a fraction of the array. Because of how we were using the arrays, the array on each processor had a strip of "guard" cells along each edge that were used to exchange information with the neighbouring processor and these had to be clipped off before the array was written. MPI makes this very easy to achieve using MPI types. (This example is given in Fortran, because that was the language that we encountered the problem in. C would have the same problems)

First you create a type representing the total array using MPI_Type_create_subarray

  sizes = (/nx_global, ny_global/)
  subsizes = (/nx_local, ny_local/)
  starts = (/x_cell_min, y_cell_min/)
  CALL MPI_TYPE_CREATE_SUBARRAY(ndims, sizes, subsizes, starts, &
      MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, global_type, ierr)

This specifies an array that's globally 1:nx_global x 1:ny_global, of which the local part is 1:nx_local x 1:ny_local. The starts array specifies where the local part begins in the global array, and depends on how you have split your global array over processors. You then pass this type as the fileview to MPI_File_set_view to tell MPI that this is how the data is arranged across your processors.
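
A minimal sketch of that call might look something like the following (fh is assumed to be a file handle already opened with MPI_File_open, and ierr an integer error variable; neither appears in the snippets above):

  !Make the type usable, then tell MPI how the file maps onto the processors
  CALL MPI_TYPE_COMMIT(global_type, ierr)
  CALL MPI_FILE_SET_VIEW(fh, 0_MPI_OFFSET_KIND, MPI_DOUBLE_PRECISION, &
      global_type, 'native', MPI_INFO_NULL, ierr)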

The actual data is in an array one bigger on each end (0:nx_local+1 x 0:ny_local+1), so we need another type representing how to cut off those additional cells. That's MPI_Type_create_subarray again:

  sizes = (/nx_local+2, ny_local+2/)
  subsizes = (/nx_local, ny_local/)
  starts = (/1, 1/)
  CALL MPI_TYPE_CREATE_SUBARRAY(ndims, sizes, subsizes, starts, &
      MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, core_type, ierr)

When you pass this as the datatype to a call to MPI_File_write or MPI_File_write_all, MPI picks out only the 1:nx_local x 1:ny_local section that you promised it when you called MPI_File_set_view. The final result will be an array 1:nx_global x 1:ny_global no matter how many processors you run the code on.
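
As a sketch, the write itself then looks something like this (local_array is assumed to be the 0:nx_local+1 x 0:ny_local+1 array with guard cells; the name is illustrative):

  !Commit the clipping type, then hand MPI the whole local array;
  !core_type means only the interior cells actually reach the file
  CALL MPI_TYPE_COMMIT(core_type, ierr)
  CALL MPI_FILE_WRITE_ALL(fh, local_array, 1, core_type, &
      MPI_STATUS_IGNORE, ierr)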

The problem was that it wasn't working. When we ran the code we found that everything worked as expected for files < 512MB/processor, but beyond that the files were always systematically smaller than expected. The shortfall wasn't a fixed size, but the files were always smaller than they should have been. As we always advise other people to do, we started from the assumption that we had made a mistake somewhere, so we went over our IO code and our MPI types. They all appeared normal, so we started taking parts out of our code. After removing a few bits, we found that the critical element was using the MPI derived type to clip out the guard cells from the local arrays: if we just wrote an entire array using primitive MPI types the problem didn't occur. This was about the point where it started to look like it might, just possibly, be an MPI error.

Next, we created the simplest possible test case in Fortran that replicated the problem and it turned out to be very simple indeed. Just create the types for the filetype to MPI_File_set_view and the datatype to MPI_File_write and write an array larger than 512MB/processor. It even failed if you just coded it up for a single processor! It was unlikely at this stage that we'd just made a mistake in our trivial example code. As a final check, we then replicated it in C and found the same problem happened there. Finally, with working examples and some evidence that the problem wasn't in our code, we contacted the Open-MPI mailing list. Within a few hours, it was confirmed that we'd managed to find an edge case in the Open-MPI library and they'd created patches for the new versions of Open-MPI.


There are a few take-away messages from this:

  1. If you are using MPI-IO and are finding problems when files get larger than 512MB/processor you might need to update your MPI installation
  2. Sometimes there really are bugs in well tested and widely used libraries
  3. It's a good idea to be as sure as possible that you aren't making a mistake before assuming that you've found a bug in someone else's library.

February 21, 2018

SOUP: Function Pointers

Today's snippet demos function pointers (in Python, plain function objects), in particular an array of them. We use them to print a table of arithmetic operations applied to all of the IEEE special values.

IEEE 754 defines the behaviour of floating point numbers, in particular what happens when numbers become unrepresentable, whether because they are too large, too small or plain not numbers.

Infinity is a familiar enough concept and in floating-point it mostly means a number which is too large to be represented. There's a positive and a negative infinity and most arithmetic works as expected.

Negative zero (-0.0) seems quite odd at first glance. The sign bit is set, but in almost every other way it behaves as equal to positive zero: comparisons like `0.0 == -0.0` are defined to be true and `-0.0 < 0.0` is false. Most languages do have a function to copy the sign from one number to another, though, and with those -0.0 works as expected.

NaN, Not a Number, mostly appears as a lament, "Why is it NaN!?" or worse "Why is it NaN again!?" Any operation involving NaN also gives NaN, as do several operations that don't obviously involve it.

It seems strange that `Inf * 0.0` or `Inf + -Inf` both give NaN when they could both reasonably give zero. Philosophically, though, both operands are numbers, just completely unknown ones: Inf isn't mathematical infinity, it is just a number too large to represent, and 0.0 stands in for any number too small to represent. Their product can therefore be absolutely anything, hence defining it as NaN.
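
As a quick standalone illustration of these rules (separate from the snippets discussed below), a minimal Fortran sketch using the standard IEEE_ARITHMETIC module to build the special values might look like this:

PROGRAM ieee_values_demo
  USE, INTRINSIC :: IEEE_ARITHMETIC
  IMPLICIT NONE
  REAL :: inf, zero

  zero = 0.0
  inf = IEEE_VALUE(inf, IEEE_POSITIVE_INF)

  PRINT *, "Inf * 0.0    = ", inf * zero    !NaN
  PRINT *, "Inf + (-Inf) = ", inf + (-inf)  !NaN
  PRINT *, "0.0 == -0.0 is ", 0.0 == -0.0   !T
  PRINT *, "-0.0 < 0.0  is ", -0.0 < 0.0    !F
END PROGRAM ieee_values_demo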

Code snippets in C, Fortran and Python are in our SOUP repo under 001_IEEE. All three use named functions to "dry out" the code: in fact they use an array of them. Note that in Fortran this needs us to use a custom type to hold the function pointer, as we can't declare an array of procedure pointers directly.
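
For reference, that wrapper type might look something like the sketch below, sitting in the declaration section of a module or program (the interface name binary_op and the REAL signature are illustrative assumptions rather than copied from the repo):

  !Explicit interface shared by all the operations (add, sub, ...)
  ABSTRACT INTERFACE
    FUNCTION binary_op(a, b) RESULT(res)
      REAL, INTENT(IN) :: a, b
      REAL :: res
    END FUNCTION binary_op
  END INTERFACE

  !Wrapper type so that we can build an array of procedure pointers
  TYPE f_ptr
    PROCEDURE(binary_op), POINTER, NOPASS :: ptr => NULL()
  END TYPE f_ptr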

The core of all snippets is the loop over operations and inputs

In C:

  float (*op)(float, float);/*Holds operation function*/
  float(*allOps[4])(float, float);

  allOps[0] = &add; allOps[1] = &sub; allOps[2] = &mul; allOps[3] = &div;

  for(opNum=0; opNum< 4; opNum++){
    op = allOps[opNum];
    
    for(rowNum = 0; rowNum < 7; rowNum++){
      row = allRows[rowNum];
      /*print result of op(row, column)*/
    }
  }

In Fortran (where f_ptr is our type holding a pointer):

  TYPE(f_ptr), DIMENSION(4) :: allOps
  TYPE(f_ptr) :: currOp

  allOps(1)%ptr => add
  allOps(2)%ptr => sub
  allOps(3)%ptr => mult
  allOps(4)%ptr => div
  DO opNum = 1, 4
    currOp%ptr => allOps(opNum)%ptr

    DO rowNum = 1, 7
      row = allRows(rowNum)
      !Print results of currOp%ptr applied to row, column
    ENDDO
  ENDDO

And in Python (looping over list indices with range()):

  allOps = [add, sub, mul, div]

  for opNum in range(0, 4):
    op = allOps[opNum]

    for rowNum in range(0, 7):
      row = allRows[rowNum]
      #Print result of op(row, column)

Note that all three are subtly different in how you assign the active operation. In C you just take a pointer to the function with the usual address-of operator, &. In Fortran you point your pointer at the proper target using the pointer-assignment (points-to) operator, =>. In Python you just set the name equal to the function, omitting the brackets (), which makes this an assignment of the function object rather than a call.

Function pointers also have a powerful use in high-performance code. A long if-else chain inside a loop, which calls a different function in each branch, can be replaced with the same if-chain outside the loop setting a function pointer, followed by a simple call through the pointer inside the loop, eliminating the branch (a runnable Fortran sketch follows the pseudo-code). As pseudo-code:

  DO i = 0, 10000
    IF (condition) function_1()
    ELSE IF (condition) function_2()
    ELSE IF (condition) function_3()
    ...
  ENDDO

becomes

  IF (condition) ptr = function_1
  ELSE IF (condition) ptr = function_2
  ELSE IF (condition) ptr = function_3
  ...

  DO i = 0, 10000
    ptr()
  ENDDO
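
A minimal runnable Fortran version of the same idea might look like this (the routine names and the condition are purely illustrative):

MODULE work_routines
  IMPLICIT NONE
CONTAINS
  SUBROUTINE work_a()
    !Stand-in for function_1 in the pseudo-code above
  END SUBROUTINE work_a

  SUBROUTINE work_b()
    !Stand-in for function_2
  END SUBROUTINE work_b
END MODULE work_routines

PROGRAM hoisted_branch
  USE work_routines
  IMPLICIT NONE
  PROCEDURE(work_a), POINTER :: ptr => NULL()
  LOGICAL :: condition
  INTEGER :: i

  condition = .TRUE.  !Whatever would have been tested inside the loop

  !Decide once, outside the loop, which routine to use
  IF (condition) THEN
    ptr => work_a
  ELSE
    ptr => work_b
  END IF

  !Branch-free loop: just call through the pointer
  DO i = 1, 10000
    CALL ptr()
  END DO
END PROGRAM hoisted_branch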
