September 19, 2018

Scheduling and processor affinity

We're taking a brief break from data structures to talk about something a bit different: scheduling and processor affinity. This deals with a problem that users and developers don't usually think about, but one that is crucial to the performance of modern computers. Describing the problem is quite simple: if you have a computer that is doing multiple tasks, what order should they happen in, and which processor (if you have more than one) should take on each task? Working this out is the preserve of the very core part of the operating system (OS), generally called the kernel (after the part of the nut). While the term is most familiar from the Linux/Unix world (technically Linux is the name for just the kernel and not the rest of the OS), macOS has a kernel built around the Mach microkernel at its core, and ntoskrnl.exe (the Windows New Technology kernel! It was new in the mid 90s, at least) is at the core of modern versions of Windows.

The simplest solution is for the kernel to create a queue of programs that have to be run and work through them one after the other, with each new program being handed to the next free processor and each program running until it is finished. This is typically called First In, First Out (FIFO) scheduling (FIFO is a very common acronym in computing; do not confuse it with GIGO (garbage in, garbage out), which is also a very common acronym in computing). FIFO scheduling works very well for computers that are working through lists of tasks of equal priority, so batch processing of data, for example, suits it nicely. But you can immediately see that it won't work well for computers that humans work with interactively. If all of your processors are busy with long, slow processes then your computer would stop responding completely, and you wouldn't even be able to stop those processes because the computer wouldn't be processing your keyboard or mouse input.
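To make that concrete, here is a minimal toy sketch in C. It is not how a real kernel is written, and the task names and work units are invented for illustration: tasks sit in a queue and each one runs to completion before the next starts.

#include <stdio.h>

/* Toy model of a task: a name and the amount of work it needs. */
struct task {
    const char *name;
    int work_units;
};

int main(void)
{
    /* The queue: tasks run strictly in the order they arrived. */
    struct task queue[] = {
        {"batch job A", 5},
        {"batch job B", 2},
        {"batch job C", 8},
    };
    int ntasks = sizeof(queue) / sizeof(queue[0]);

    for (int i = 0; i < ntasks; ++i) {
        /* Each task runs to completion; nothing else gets a look in
         * until it finishes, which is fine for batch work but
         * hopeless for anything interactive. */
        printf("running %s for %d units\n", queue[i].name, queue[i].work_units);
    }
    return 0;
}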

Normal computers instead tend to use a system of preemptive scheduling. This works by letting a program run on a processor for a while, then saving its state and stopping it, then giving that processor another program to run for a while, and so on. Eventually you get back to a process that was running before, restore its state and start it running again. Because the state of the program is stored completely, it is entirely unaware that this has happened to it, so it just runs on regardless. (As a historical note, this is called preemptive scheduling because the OS kernel preempts the programs running on it. It basically says "you're done, get out of the way" and lets other programs run. Older systems used cooperative scheduling, where a program contained code that told the kernel "I'm ready to be switched away from". This worked fine until a badly behaved program never reached that point, whereupon every other program on the computer stopped working.)
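Continuing the toy model, here is a hedged sketch of round-robin time slicing, one simple form of preemptive scheduling: each task runs for a fixed quantum, its remaining work stands in for the saved state, and the scheduler moves on to the next task. Again, the task names and numbers are made up for illustration.

#include <stdio.h>

/* Toy model again: the remaining work stands in for the saved program state. */
struct task {
    const char *name;
    int remaining;
};

int main(void)
{
    struct task tasks[] = {
        {"editor",       3},
        {"compiler",     7},
        {"music player", 5},
    };
    int ntasks = sizeof(tasks) / sizeof(tasks[0]);
    const int quantum = 2;        /* time slice before a task is preempted */
    int still_running = ntasks;

    while (still_running > 0) {
        for (int i = 0; i < ntasks; ++i) {
            if (tasks[i].remaining <= 0)
                continue;
            /* "Restore" the task, let it run for at most one quantum,
             * then preempt it and move on to the next one. */
            int slice = tasks[i].remaining < quantum ? tasks[i].remaining : quantum;
            tasks[i].remaining -= slice;
            printf("ran %s for %d units, %d left\n",
                   tasks[i].name, slice, tasks[i].remaining);
            if (tasks[i].remaining == 0)
                --still_running;
        }
    }
    return 0;
}

Notice that every task makes steady progress, which is exactly what an interactive computer needs, at the price of repeatedly switching between tasks.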

There are disadvantages to preemptive scheduling. The saving and restoring of program state is generally called "context switching", and it takes time to do. (FIFO schedulers don't have this problem because programs run until they complete, but these days you can't really use FIFO schedulers because there tend to be dozens of programs running on a computer at once and some of them effectively never complete until the computer shuts down.) If you only have one processor then context switching is the inevitable price of allowing multiple programs to run at once, but if you have two processors and two programs then they can happily run alongside each other so long as each is running on its own processor.

But what if you have four programs and two processors? As you'd guess, in general you'll put two programs on each processor. But what happens if the processors aren't getting through their jobs at quite the same rate? This can happen because real schedulers are rather more complex than I have suggested here. In that case you can reach a situation where the program that was running on processor 1 is ready to run again but processor 1 is still busy. Should it then be run on processor 2? The answer is "maybe". Sometimes a program benefits a lot from data being held in the fast CPU cache memory, and if it now runs on a different processor that benefit is lost. Under those circumstances it would be better to keep the program on the same processor, even if it has to wait a bit longer before it runs again.

All common operating systems let you do this, and it is called "processor affinity". You tell the kernel which processor (or processors) you want a given program to run on, and it guarantees to respect that rather than handing the program to the next available processor.
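On Linux, for example, a process can pin itself to a particular CPU with the sched_setaffinity() system call (Windows and macOS have their own mechanisms, and the taskset utility does the same job from the command line). A minimal sketch, assuming a Linux system with glibc; the choice of CPU 2 is arbitrary:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;

    /* Ask to run only on CPU 2 (an arbitrary choice for illustration). */
    CPU_ZERO(&mask);
    CPU_SET(2, &mask);

    /* A pid of 0 means "the calling process". */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* From now on the kernel will only schedule this process on CPU 2. */
    printf("now running on CPU %d\n", sched_getcpu());
    return 0;
}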

I've glossed over a lot of things about scheduling here. (In particular, Intel's Hyperthreading technology is a nightmare for scheduling because it creates "virtual" processors to take advantage of the fact that most programs don't use all of the parts of a processor at once. The problem is that these virtual processors don't behave the same way as real processors, so the scheduler has to be much more careful about which programs it runs on them.) But this is a decent overview of how scheduling and processor affinity work on modern computers, and generally they work quite well. General purpose programs don't set CPU affinity and just run as soon as possible when a processor is free, while programs that do heavy numerical computation often set CPU affinity to try to maximise cache performance. But there are interesting cases where it can fail.

Recently we had an interesting discovery using the OpenMPI parallel programming library. This is a distributed programming library intended for writing programs that run on large "cluster" computers consisting of lots of single compute nodes connected by a network, but it also works for running programs on a single computer. We were testing a computer with 16 cores and found that running a single 16 processor job was about 16 times faster than the same problem on a single core. When we tried the same problem with 4 simultaneous 4 processor jobs, things were very different: the 4 processor jobs were all slower than they had been when running on a single core.

After a couple of hours we found out why: OpenMPI sets processor affinity for all of the programs that it runs, but it always sets them for the lowest numbered processors. So the 16 core job ran on cores 0 to 15, but the 4 sets of 4 core jobs all ran on processors 0 to 3, so each job effectively got a quarter of 4 cores. Add in some overhead from context switching and you find it's slower than running on a single core. One quick parameter to the library to tell it not to use processor affinity and we were back to where we expected to be, but the lesson to take away is that you can get far too used to things like schedulers just working. If you're finding that things are running slower than you expected, check whether or not there's something odd going on with how your processors are being used.
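As a footnote to that lesson: on Linux, one quick way to check is to read the affinity mask your processes have actually been given, either with taskset -p and the process id, or from inside the program itself as in the hedged sketch below. (In recent OpenMPI releases the binding behaviour is controlled through mpirun's binding options such as --bind-to, though the exact option has changed between versions, so check the documentation for the release you have.)

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    cpu_set_t mask;

    /* Read back the set of CPUs the kernel is allowed to run us on. */
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_getaffinity");
        return 1;
    }

    printf("pid %ld may run on CPUs:", (long)getpid());
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &mask))
            printf(" %d", cpu);
    }
    printf("\n");
    return 0;
}

If that prints only a small subset of your cores for every process in a supposedly parallel job, you have probably found your problem.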

