All 3 entries tagged Code

View all 35 entries tagged Code on Warwick Blogs | View entries tagged Code at Technorati | There are no images tagged Code on this blog

March 09, 2018

Odd MPI IO bug in Open–MPI

Quite often working with research code leads you to find the unusual edge cases where even well tested libraries break down a bit. This example that we ran across with the Open-MPI parallel library is pretty typical. We wanted to use MPI-IO to write an array that was distributed across multiple processors, with each processor holding a fraction of the array. Because of how we were using the arrays, the array on each processor had a strip of "guard" cells along each edge that were used to exchange information with the neighbouring processor and these had to be clipped off before the array was written. MPI makes this very easy to achieve using MPI types. (This example is given in Fortran, because that was the language that we encountered the problem in. C would have the same problems)

First you create a type representing the total array using MPI_Type_create_subarray

  sizes = (/nx_global, ny_global/)
  subsizes = (/nx_local, ny_local/)
  starts = (/x_cell_min, y_cell_min/)
  CALL MPI_TYPE_CREATE_SUBARRAY(ndims, sizes, subsizes, starts, &
      MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, global_type, ierr)

This specifies an array that's globally 1:nx_global x 1:ny_global, and locally 1:nx_local x 1:ny_local. The starts array specifies where the local part of the global array starts in the global array, and depends on how you split your global array over processors. Then you pass this type as a fileview to MPI_File_set_viewto tell MPI that this is how data is arranged across your processors.

The actual data is in an array one bigger on each end (0:nx_local+1x 0:ny_local+1), so we need another type representing how to cut off those additional cells. That's MPI_Type_create_subarray again

  sizes = (/nx_local+2, ny_local+2/)
  subsizes = (/nx_local, ny_local/)
  starts = (/1, 1/)
  CALL MPI_TYPE_CREATE_SUBARRAY(ndims, sizes, subsizes, starts, &
      MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, core_type, ierr)

When you pass this as the datatype to a call to MPI_File_write or MPI_File_write_allyou pass MPI only the 1:nx_localx 1:ny_local array that you promised it when you called MPI_File_set_view. The final result will be an array 1:nx_globalx 1:ny_globalno matter how many processors you run the code on.

The problem was that it wasn't working. When we ran the code we found that everything worked as expected on files < 512MB/processor in size, but when we got beyond that the files were always systematically smaller than expected. They weren't a fixed size, but they were always smaller than expected. As we always advise other people to do we started from the assumption that we had made a mistake somewhere, so we went over our IO code and our MPI types. They all appeared normal, so we started taking parts out of our code. After removing a few bits, we found that the critical element was using the MPI derived type to clip out the guard cells from the local arrays. If we just wrote an entire array using primitive MPI types the problem didn't occur. This was about the point where it started to look like it might, just possibly, be an MPI error.

Next, we created the simplest possible test case in Fortran that replicated the problem and it turned out to be very simple indeed. Just create the types for the filetype to MPI_File_set_view and the datatype to MPI_File_write and write an array larger than 512MB/processor. It even failed if you just coded it up for a single processor! It was unlikely at this stage that we'd just made a mistake in our trivial example code. As a final check, we then replicated it in C and found the same problem happened there. Finally, with working examples and some evidence that the problem wasn't in our code, we contacted the Open-MPI mailing list. Within a few hours, it was confirmed that we'd managed to find an edge case in the Open-MPI library and they'd created patches for the new versions of Open-MPI.


There are a couple of take away messages from this

  1. If you are using MPI-IO and are finding problems when files get larger than 512MB/processor you might need to update your MPI installation
  2. Sometimes there really are bugs in well tested and widely used libraries
  3. It's a good idea to be as sure as possible that you aren't making a mistake before assuming that you've found one.

February 21, 2018

SOUP: Function Pointers

Today's snippet demos function pointers (objects in Python), in particular an array of function pointers. We use them to print a table of arithmetic operations on all of the IEEE special values.

IEEE 754 defines the behaviour of floating point numbers, in particular what happens when numbers get unrepresentable, whether that is too large, too small or plain not-numbers.

Infinity is a familiar enough concept and in floating-point it mostly means a number which is too large to be represented. There's a positive and a negative infinity and most arithmetic works as expected.

Negative zero (-0.0) seems quite odd at a first glance. The sign-bit is set, but in every other way, and in general, it is equal to positive zero. Comparisons like `0.0 == -0.0` are defined to be true and `-0.0 < 0.0` is false. Most languages have a function to copy the sign from one number to another though, and -0.0 works as expected.

NaN, Not a Number, mostly appears as a lament, "Why is it NaN!?" or worse "Why is it NaN again!?" Any operations involving NaN also give NaN as do several others.

It seems strange that `Inf * 0.0` or `Inf + -Inf` both give NaN where they could both reasonably give zero. Philosophically, both are numbers, but completely unknown ones. Inf isn't mathematical infinity, it is just too large to represent, and 0.0 stands in for any number too small to represent. Their product then can be absolutely anything, hence defining it as NaN.

Code snippets in C, Fortran and Python are in our SOUP repo under 001_IEEE. All three use named functions to "dry out" the code: in fact they use an array of them. Note in Fortran this needs us to use a custom type to hold the function pointer, as we can't have an array of simply pointers.

The core of all snippets is the loop over operations and inputs

In C:

  float (*op)(float, float);/*Holds operation function*/
  float(*allOps[4])(float, float);

  allOps[0] = &add; allOps[1] = &sub; allOps[2] = &mul; allOps[3] = &div;

  for(opNum=0; opNum< 4; opNum++){
    op = allOps[opNum];
    
    for(rowNum = 0; rowNum < 7; rowNum++){
      row = allRows[rowNum];
      /*print result of op(row, column)*/
    }
  }

In Fortran (where f_ptr is our type holding a pointer):

  TYPE(f_ptr), DIMENSION(4) :: allOps
  TYPE(f_ptr) :: currOp

  allOps(1)%ptr => add
  allOps(2)%ptr => sub
  allOps(3)%ptr => mult
  allOps(4)%ptr => div
  DO opNum = 1, 4
    currOp%ptr => allOps(opNum)%ptr

    DO rowNum = 1, 7
      row = allRows(rowNum)
      !Print results of currOp%ptr applied to row, column
    ENDDO
  ENDDO

And in Python (using range-based for loops on lists):

  allOps = [add, sub, mul, div]

  for opNum in range(0, 4):
    op = allOps[opNum]

    for rowNum in range(0, 7):
      row = allRows[rowNum]
      #Print result of op(row, column)

Note all three are subtly different in how one assigns the active operation. In C you just get a pointer to a function with the usual address-of operator, &. In Fortran you point your pointer to the proper target using the points-to operator, =>. In Python we just set the operator to equal the function, omitting the brackets () turning this into an assignment, not a call.

Function pointers also have a powerful use in High-Performance code. A long if-else chain inside a loop, which calls a different function in each branch, can be replaced with the same if-chain outside the loop setting a function pointer and then a simple call inside the loop, eliminating the branch. As pseudo-code:

  DO i = 0, 10000
    IF (condition) function_1()
    ELSE IF (condition) function_2()
    ELSE IF (condition) function_3()
    ...
  ENDDO

becomes

  IF (condition) ptr = function_1
  ELSE IF (condition) ptr = function_2
  ELSE IF (condition) ptr = function_3
  ...

  DO i = 0, 10000
    ptr()
  ENDDO

December 20, 2017

SOUP (Snippet of the Undefined Period)

Every-so-often we come across an interesting, imaginative, instructive or idiomatic code snippet. These can be amusing bugs, a task completed in several languages, or exemplars of how to do something well. Because we post them as and when we find them, they appear at unspecifed times, so we named them Snippet of the <Undefined Period> (SOUP). This gives us a nice acronym and is about as humorous as software engineering gets.

Often the interesting part of the code is buried in boilerplate etc, so we only post the interesting snippet directly. The full code is posted to the WarwickRSE github repository at WarwickRSE/SOUP. As a rule we try and produce a C, Fortran and a Python version of everything, with other languages on request. Comments and suggestions are welcome!


October 2024

Mo Tu We Th Fr Sa Su
Sep |  Today  |
   1 2 3 4 5 6
7 8 9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30 31         

Search this blog

Tags

Galleries

Blog archive

Loading…
RSS2.0 Atom
Not signed in
Sign in

Powered by BlogBuilder
© MMXXIV