All entries for March 2018
March 21, 2018
Branch By Boolean
Once upon a time, all programming was for performance. With computer power at a premium, every instruction was a cost. Oddly, the cost of branching (acting differently based on a condition) has got both more and less expensive. Branches are evaluated more quickly now, but improvements in branch prediction (see below) mean branches can be more costly relative to the branchless option. Modern processors are in some ways more similar to the original supercomputers than ever, in particular, their use of vector operations. In this post we talk a bit about what branch prediction means and show a couple of tricks which can be quite useful in code in important loops.
What is Branch Prediction?
Branch prediction refers to the ability of the processor to predict (aha!) which branch is likely, and to get ready to take it. This can happen before the condition is actually evaluated. Once the condition has been checked, if the guess was correct, the operation stream continues, but if it was wrong the incorrect instructions in the queue must be dumped and the correct instructions for the actual branch assembled. This can sometimes be quite costly. Some systems instead execute both possible branches, and throw away the unneeded result, which can halve their possible throughput.
Generally branch predictors are dumb. The commonest option is to assume whatever track was taken previously will be taken again. This is great when you have one rare case (for example, an error condition) and a common one. Sometimes the predictor can 'understand' simple alternation (option 1, then option 2, then option 1...) or similar patterns if they are 'obvious enough'.
How much does it cost, really?
The costs of branch misprediction vary widely, depending on hardware, vectorisation, surrounding code etc. Most profiling tools will give you a miss rate, and reducing this in regularly repeated pieces of code can be worthwhile. A simple demo of branch misses is given here. I have tried to make all paths as similar as possible. The final print statement ensures that the total is used so the loop cannot be simply optimised away.
On my machine, the example compiled without optimisation takes about 0.8 seconds with compile-time option -DBR_3, i.e. the maximum possible number of addition operations, and the case which would take the longest if there was nothing more than simple looping and addition in play. Options 1 and 2 both take about 1 second, and the option with half true, half false, just over 0.8 seconds. This suggests something like the 'same as last time' prediction occuring.
Avoiding Branches
When it does matter, there are several ways to avoid true branches and greatly improve performance in a piece of code. Always profile first, though, to be sure this piece warrants improving.
A Simple Rejig
The first thing to consider is whether a simple rewrite of the code can eliminate a branch, or improve its prediction. Many branch predictors are quite dumb and assume whatever branch was taken the previous iteration will be taken again. If you can rewrite to ensure longer 'runs' of one branch, this may help. The previous section showed one example where this can be quite dramatic.
Do More to Do Less
Sometimes it may even be more effective to do more work, but without a condition, rather than check if the work needs to be done. A trivial example is something like
DO i = 0, 10000 IF (B[i] /= 0) A = A + B[i] ENDDO
where the check is redundant because if B[i] is zero, the addition does nothing.
Function Pointers
Another useful trick is to use function pointers (or named functions), as discussed in this previous Soup. Rather than putting an 'if' into a vital loop, first work out which function needs to be called and then call it. As given in the previous post, we can replace
DO i = 0, 10000 IF (condition) function_1() ELSE IF (condition) function_2() ELSE IF (condition) function_3() ... ENDDO
with
IF (condition) ptr = function_1 ELSE IF (condition) ptr = function_2 ELSE IF (condition) ptr = function_3 ... DO i = 0, 10000 ptr() ENDDO
As usual, this may or may not help: in some languages function pointers are slow themselves (not C or Fortran though, but e.g. std::function in C++ can be), and sometimes the called function is already complex enough to break any vectorisation or similar optimisations.
Boolean Branch
One particularly sneaky one is the Boolean Branch, most useful in languages which equate 'True' with 1 and False with '0', but still viable with integers otherwise. The trick is to note that
IF condition THEN result = value1 ELSE result = value2 ENDIF
is the same as
flag = condition !Flag may be an integer; in C it can be Boolean result = value1*flag + value2*(1 - flag)
The conditional assignment in the first case is replaced with two simple assignments. This may or may not be faster! Profiling is essential to find out. It is more likely to pay off in a case where the two options alternate, especially when this is irregular, so the branch prediction fails regularly.
March 09, 2018
Odd MPI IO bug in Open–MPI
Quite often working with research code leads you to find the unusual edge cases where even well tested libraries break down a bit. This example that we ran across with the Open-MPI parallel library is pretty typical. We wanted to use MPI-IO to write an array that was distributed across multiple processors, with each processor holding a fraction of the array. Because of how we were using the arrays, the array on each processor had a strip of "guard" cells along each edge that were used to exchange information with the neighbouring processor and these had to be clipped off before the array was written. MPI makes this very easy to achieve using MPI types. (This example is given in Fortran, because that was the language that we encountered the problem in. C would have the same problems)
First you create a type representing the total array using MPI_Type_create_subarray
sizes = (/nx_global, ny_global/) subsizes = (/nx_local, ny_local/) starts = (/x_cell_min, y_cell_min/) CALL MPI_TYPE_CREATE_SUBARRAY(ndims, sizes, subsizes, starts, & MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, global_type, ierr)
This specifies an array that's globally 1:nx_global x 1:ny_global, and locally 1:nx_local x 1:ny_local. The starts array specifies where the local part of the global array starts in the global array, and depends on how you split your global array over processors. Then you pass this type as a fileview to MPI_File_set_viewto tell MPI that this is how data is arranged across your processors.
The actual data is in an array one bigger on each end (0:nx_local+1x 0:ny_local+1), so we need another type representing how to cut off those additional cells. That's MPI_Type_create_subarray again
sizes = (/nx_local+2, ny_local+2/) subsizes = (/nx_local, ny_local/) starts = (/1, 1/) CALL MPI_TYPE_CREATE_SUBARRAY(ndims, sizes, subsizes, starts, & MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, core_type, ierr)
When you pass this as the datatype to a call to MPI_File_write or MPI_File_write_allyou pass MPI only the 1:nx_localx 1:ny_local array that you promised it when you called MPI_File_set_view. The final result will be an array 1:nx_globalx 1:ny_globalno matter how many processors you run the code on.
The problem was that it wasn't working. When we ran the code we found that everything worked as expected on files < 512MB/processor in size, but when we got beyond that the files were always systematically smaller than expected. They weren't a fixed size, but they were always smaller than expected. As we always advise other people to do we started from the assumption that we had made a mistake somewhere, so we went over our IO code and our MPI types. They all appeared normal, so we started taking parts out of our code. After removing a few bits, we found that the critical element was using the MPI derived type to clip out the guard cells from the local arrays. If we just wrote an entire array using primitive MPI types the problem didn't occur. This was about the point where it started to look like it might, just possibly, be an MPI error.
Next, we created the simplest possible test case in Fortran that replicated the problem and it turned out to be very simple indeed. Just create the types for the filetype to MPI_File_set_view and the datatype to MPI_File_write and write an array larger than 512MB/processor. It even failed if you just coded it up for a single processor! It was unlikely at this stage that we'd just made a mistake in our trivial example code. As a final check, we then replicated it in C and found the same problem happened there. Finally, with working examples and some evidence that the problem wasn't in our code, we contacted the Open-MPI mailing list. Within a few hours, it was confirmed that we'd managed to find an edge case in the Open-MPI library and they'd created patches for the new versions of Open-MPI.
There are a couple of take away messages from this
- If you are using MPI-IO and are finding problems when files get larger than 512MB/processor you might need to update your MPI installation
- Sometimes there really are bugs in well tested and widely used libraries
- It's a good idea to be as sure as possible that you aren't making a mistake before assuming that you've found one.