### Scheduling in OpenMP

OpenMP is one of the most popular methods in academic programming for writing parallel code. Its major advantage is that in many cases you can achieve performance improvements just by putting "directives" into your code to tell the compiler "you can take this loop and split it up into sections for each processor". Other parallelism schemes, such as MPI, Intel Threading Building Blocks or Coarray Fortran, all involve designing your algorithm around splitting the work up; OpenMP makes it easy to simply add parallelism to the bits where you want it. (There are plenty of parts of OpenMP programming that do require you to make changes to your code, but you can get further than in pretty much any other modern paradigm without having to alter your actual code.)

So what does this look like in practice?

```fortran
MODULE prime_finder

  USE ISO_FORTRAN_ENV
  IMPLICIT NONE

  INTEGER(INT64), PARAMETER :: small_primes_len = 20
  INTEGER(INT64), DIMENSION(small_primes_len), PARAMETER :: &
      small_primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, &
      53, 59, 61, 67, 71]
  INTEGER, PARAMETER :: max_small_prime = 71

CONTAINS

  FUNCTION check_prime(num)
    INTEGER(INT64), INTENT(IN) :: num
    LOGICAL :: check_prime
    INTEGER(INT64) :: index, end

    end = CEILING(SQRT(REAL(num, REAL64)))
    check_prime = .FALSE.

    !First check against the small primes
    DO index = 1, small_primes_len
      IF (small_primes(index) == num) THEN
        check_prime = .TRUE.
        RETURN
      END IF
      IF (MOD(num, small_primes(index)) == 0) THEN
        RETURN
      END IF
    END DO

    !Test higher numbers, skipping all the evens
    DO index = max_small_prime + 2, end, 2
      IF (MOD(num, index) == 0) RETURN
    END DO
    check_prime = .TRUE.
  END FUNCTION check_prime

END MODULE prime_finder

PROGRAM primes

  USE prime_finder
  IMPLICIT NONE

  INTEGER(INT64) :: ct, i

  ct = 0_INT64
  !$OMP PARALLEL DO REDUCTION(+:ct)
  DO i = 2_INT64, 20000000_INT64
    IF (check_prime(i)) ct = ct + 1
  END DO
  !$OMP END PARALLEL DO

  PRINT *, "Number of primes = ", ct

END PROGRAM primes
```

This is a Fortran 2008 program (although OpenMP also works with C and C++) that uses a very simple algorithm to count the number of prime numbers between 2 and 20,000,000. There are much better algorithms for this, but this one gives the correct count and is very suitable for parallelising since each number is checked for primality independently. The code is already OpenMP parallelised, and as you can see the parallelism is very simple: the only OpenMP lines are `!$OMP PARALLEL DO REDUCTION(+:ct)` and `!$OMP END PARALLEL DO`. The first line says that I want this loop to be parallel and that the variable `ct` should be calculated separately on each processor and then summed over all processors at the end. The second line just says that we have finished with parallelism and should switch back to running the code in serial after this point. Otherwise this is exactly the same program that I would have written if I was testing the problem on a single processor, and I get the result that there are 1,270,607 primes less than 20,000,000 regardless of how many processors I run on. So far so good, but look what happens when I look at the timings for different numbers of processors:

| Number of Processors | Time (s) |
|----------------------|----------|
| 1                    | 11.3     |
| 2                    | 7.2      |
| 4                    | 3.9      |
| 8                    | 2.2      |
| 16                   | 1.13     |

It certainly speeds up! But not as much as it should, given that every single prime is tested independently (the 16-processor time would be 0.71 seconds if it was scaling perfectly). There are lots of reasons why parallel codes don't speed up as much as they should, but in this case it isn't any underlying limitation of the hardware: it is down to how OpenMP chooses to split up my problem, and a simple change gets the runtime down to 0.81 s. The difference between 1.1 seconds and 0.8 seconds isn't much, but if your code takes a day rather than a second then 26.5 hours vs 19.2 hours can be significant in terms of electricity costs and the cost of buying computer time.

So what is the problem? The problem is in how OpenMP chooses to split the work up over the processors. It isn't trivial to work out how long it will take to check all numbers in a given range for primality (in fact that is a harder problem than just counting the primes in that range!), so if you just split the loop the obvious way (processor 1 gets 2 to 10,000,000, processor 2 gets 10,000,001 to 20,000,000) then one of those processors will get more work than the other, and the finished processor has to wait for the slower one before you can total up the number of primes. By default OpenMP does exactly that simple splitting of the loop, so you don't get all of the benefit that you should from the extra processors. The solution is to specify a SCHEDULE clause on the `!$OMP PARALLEL DO` line. There are currently three main scheduling options:

* **STATIC** (the default): the iterations are divided into blocks and assigned to processors up front.
* **DYNAMIC**: each iteration of the loop is considered separately and handed off in turn to a processor that has finished its previous work.
* **GUIDED**: the iterations are split into work blocks that are "proportional to" the number of currently undone iterations divided by the number of processors; when a processor has finished its block it requests another one.

(There are also two others: AUTO, which has OpenMP try to work out the best strategy based on your code, and RUNTIME, which allows you to specify one of the three main options when the code is running rather than when you compile it.) You can also optionally specify a chunk size for each of these: for STATIC and DYNAMIC it gangs iterations together into blocks of **chunksize** and splits them off to processors in sequence, and for GUIDED it makes sure that the work blocks never get smaller than the specified chunk size.
For this problem GUIDED gives the best results, and swapping `!$OMP PARALLEL DO REDUCTION(+:ct)` for `!$OMP PARALLEL DO REDUCTION(+:ct) SCHEDULE(GUIDED)` in that code gives the final runtime of 0.81 seconds on 16 processors, which is a good overall performance. But the message here is much more about what OpenMP is doing behind the scenes than about the details of this toy problem. The OpenMP standard document (https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5.0.pdf) is quite explicit about some of what happens, but other bits are left up to the implementer.

So what do these do in practice? We can only talk in detail about a single implementation, so here we're going to look at the one in gcc (gfortran); I'm using version 9.2.1. You can write a very simple piece of code that fills an array with a symbol representing which processor worked on each element to see what's happening. For a simple 20-element array on 2 processors you get the following results (* = processor 1, # = processor 2):

```fortran
PROGRAM test

  USE omp_lib
  IMPLICIT NONE

  INTEGER, PARAMETER :: nels = 20
  INTEGER, DIMENSION(:), ALLOCATABLE :: ct, proc_used
  CHARACTER(LEN=*), PARAMETER :: symbols &
      = "*#$%ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890"
  INTEGER :: nproc, proc, ix

  nproc = omp_get_max_threads()
  ALLOCATE(ct(nproc))
  ct = 0
  ALLOCATE(proc_used(nels))

  !$OMP PARALLEL PRIVATE(proc, ix)
  proc = omp_get_thread_num() + 1
  !$OMP DO SCHEDULE(GUIDED)
  DO ix = 1, nels
    ct(proc) = ct(proc) + 1
    proc_used(ix) = proc
  END DO
  !$OMP END DO
  !$OMP END PARALLEL

  DO ix = 1, nproc
    PRINT "(A, I0, A, I0, A)", "Processor ", ix, " handled ", ct(ix), " elements"
  END DO

  DO ix = 1, nels
    IF (proc_used(ix) <= LEN(symbols)) THEN
      WRITE(*, '(A)', ADVANCE = 'NO') symbols(proc_used(ix):proc_used(ix))
    ELSE
      WRITE(*, '(A,I0,A)', ADVANCE = 'NO') "!", proc_used(ix), "!"
    END IF
  END DO
  PRINT *, ""

END PROGRAM test
```

| SCHEDULE clause             | Pattern                |
|-----------------------------|------------------------|
| OMP DO SCHEDULE(STATIC)     | `**********##########` |
| OMP DO SCHEDULE(STATIC,4)   | `****####****####****` |
| OMP DO SCHEDULE(DYNAMIC)    | `*#********#*#*#*##*#` |
| OMP DO SCHEDULE(DYNAMIC)    | `*#*#*************#*#` |
| OMP DO SCHEDULE(DYNAMIC,4)  | `****####****####****` |
| OMP DO SCHEDULE(GUIDED)     | `##########*****#####` |

You can see immediately that the scheduling behaviour is very different for the different clauses. The simple STATIC scheduler splits the array into two, with each processor getting half of the domain. STATIC,4 specifies a chunk size of 4 and does the same type of splitting but with processors getting alternating chunks of 4 elements. DYNAMIC produces quite complex interleaved patterns of processor responsibility, but as you can see two runs with the same SCHEDULE clause produce different patterns, so this really is a dynamic system: you can't predict which processor is going to run which iteration of a loop (mostly this doesn't matter, but there are plenty of real-world cases where efficiency dictates that you always want the same processor working on the same data). Finally, GUIDED produced a rather strange-looking pattern where processor 2 does most of the work. This pattern is not guaranteed but did drop out of multiple runs quite regularly. It appears that processor 1 gets a lot of the housekeeping work associated with running in parallel (i.e. actually running the OpenMP library code itself), so the system gives more of the work in my code to processor 2.

An interesting thing that you can see immediately from these results is that I should be able to do something other than GUIDED for my prime number code. Since the problem is that certain chunks of the range of numbers are harder to process than others, I should be able to fix the STATIC schedule just by using a small-ish chunk size rather than giving every processor 1/16 of the entire range. This way every processor gets some of the easy numbers and some of the hard numbers. Since there is always a cost to starting a new chunk of work, with 20,000,000 numbers to test a chunk size of 1000 seems suitable, so I go back to my prime code and put in SCHEDULE(STATIC, 1000). Running on 16 processors I now get a time of 0.80 seconds, slightly faster than my GUIDED case, confirming that the problem was some processors having an easier time of it than others.

The take-away message is that when running parallel code it is vitally important to understand how your problem is being split up and whether that means all of your processors will get the same amount of work. You currently have a decent array of tools to control how work is distributed, and future releases of OpenMP should also have the option of specifying your own scheduling systems!