All 3 entries tagged Mpi
No other Warwick Blogs use the tag Mpi on entries | View entries tagged Mpi at Technorati | There are no images tagged Mpi on this blog
November 27, 2024
How does MPI parallel code actually run?
One of those things I wondered about for a long time, before just getting on and finding out is this:
when I run an MPI parallel code, what actually happens?
And, before forging ahead to answer my question I should clarify why I want to know - well basically so I can understand what happens in error cases and edge cases, such as starting a program without mpiexec, or mpiexec-ing a non MPI program.
I know that MPI lets me run multiple, interacting, instances of my program, and lets those communicate with each other. But I also know that I can start my program without using mpiexec (or mpirun, srun or any other invocation) and it might work as a serial code - but it doesn't always. I know that MPI_Init is really important, but I don't know what a program can and can't do before that line, or how a completely empty program would behave. I don't understand what an MPI program without any comms in would actually do. I am not certain whether any bits of my program are somehow shared - data or state or communicators.
As usual, I could answer all of these questions individually, but there's a good chance I can answer them all if I can just work out what's missing in my mental model. It turns out this is what I hadn't realised:
mpiexec starts N independent copies of my program. When my code reaches the MPI_Init function, communication is established (using some info provided by the launcher) - the copies are made aware of each other and assigned their ranks. MPI_Finalize is where the comms is shut down. *
Obvious in retrospect, but it answers all of my questions.
- Starting a single instance of my code (a serial version) will work as long as my algorithm _can_ work on a single processor, without deadlocks etc. But starting N independent copies won't make for a parallel run, because the information (or daemons etc) MPI_Init needs won't be present - it wont know about the other copies, or even how many copies there are.
- Before the MPI_Init line, my code can do anything that doesn't use communication - no MPI calls, no use of communicators etc. That means I can't know how many processors I am running on (we don't know ranks), or if I am the root (proc 0) or anything like that.
- A completely empty program, or one where I never call MPI_Init, will run N independent copies, but they will never know about each other. Just like the parts of my program before the Init. This also tells me what happens if I mpiexec a completely non-MPI program.
- A program that calls MPI_Init but has no actual comms can still be a parallel program - if I can split up my work with nothing other than MPI_Comm_size or MPI_Comm_rank, for instance dividing my work into N blocks, I can do that work in parallel (as long as I am careful about outputting the final product of my work blocks).
- The one thing I can't definitively answer using this is whether I can mpiexec a program that wasn't compiled as an MPI program. But I can guess, based off the fact that a program without MPI_Init can be valid, that I would probably get N independent programs, and I'd be right, as it happens.
- Finally, I can easily see that no program state can possibly be shared, because my program copies are independent, with their own memory spaces. Things like communicators must contain information sufficient for the message passing "layer" to pass information between copies of my program.
Note that I put a '*' on my statement of what actually happens - this is "correct from the perspective of my program", but a little incomplete in general. The mpiexec launcher can, and generally does, do some elements of setting up comms, but this is lower level than my program and doesn't affect how it behaves or what it can do. I also omitted anything about the compiler step - since I know MPI uses compiler wrapper scripts, something could happen at this stage, which is why I can't completely answer that penultimate question without using some more information.
February 07, 2019
New Training
New RSE training opportunity - intermediate level MPI. Following on from a basic MPI course, such as our Intro (Dec. 2018) or Warwick's PX425, we have a 1-day workshop on some trickier topics, such as MPI types. See here for details.
March 09, 2018
Odd MPI IO bug in Open–MPI
Quite often working with research code leads you to find the unusual edge cases where even well tested libraries break down a bit. This example that we ran across with the Open-MPI parallel library is pretty typical. We wanted to use MPI-IO to write an array that was distributed across multiple processors, with each processor holding a fraction of the array. Because of how we were using the arrays, the array on each processor had a strip of "guard" cells along each edge that were used to exchange information with the neighbouring processor and these had to be clipped off before the array was written. MPI makes this very easy to achieve using MPI types. (This example is given in Fortran, because that was the language that we encountered the problem in. C would have the same problems)
First you create a type representing the total array using MPI_Type_create_subarray
sizes = (/nx_global, ny_global/) subsizes = (/nx_local, ny_local/) starts = (/x_cell_min, y_cell_min/) CALL MPI_TYPE_CREATE_SUBARRAY(ndims, sizes, subsizes, starts, & MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, global_type, ierr)
This specifies an array that's globally 1:nx_global x 1:ny_global, and locally 1:nx_local x 1:ny_local. The starts array specifies where the local part of the global array starts in the global array, and depends on how you split your global array over processors. Then you pass this type as a fileview to MPI_File_set_viewto tell MPI that this is how data is arranged across your processors.
The actual data is in an array one bigger on each end (0:nx_local+1x 0:ny_local+1), so we need another type representing how to cut off those additional cells. That's MPI_Type_create_subarray again
sizes = (/nx_local+2, ny_local+2/) subsizes = (/nx_local, ny_local/) starts = (/1, 1/) CALL MPI_TYPE_CREATE_SUBARRAY(ndims, sizes, subsizes, starts, & MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, core_type, ierr)
When you pass this as the datatype to a call to MPI_File_write or MPI_File_write_allyou pass MPI only the 1:nx_localx 1:ny_local array that you promised it when you called MPI_File_set_view. The final result will be an array 1:nx_globalx 1:ny_globalno matter how many processors you run the code on.
The problem was that it wasn't working. When we ran the code we found that everything worked as expected on files < 512MB/processor in size, but when we got beyond that the files were always systematically smaller than expected. They weren't a fixed size, but they were always smaller than expected. As we always advise other people to do we started from the assumption that we had made a mistake somewhere, so we went over our IO code and our MPI types. They all appeared normal, so we started taking parts out of our code. After removing a few bits, we found that the critical element was using the MPI derived type to clip out the guard cells from the local arrays. If we just wrote an entire array using primitive MPI types the problem didn't occur. This was about the point where it started to look like it might, just possibly, be an MPI error.
Next, we created the simplest possible test case in Fortran that replicated the problem and it turned out to be very simple indeed. Just create the types for the filetype to MPI_File_set_view and the datatype to MPI_File_write and write an array larger than 512MB/processor. It even failed if you just coded it up for a single processor! It was unlikely at this stage that we'd just made a mistake in our trivial example code. As a final check, we then replicated it in C and found the same problem happened there. Finally, with working examples and some evidence that the problem wasn't in our code, we contacted the Open-MPI mailing list. Within a few hours, it was confirmed that we'd managed to find an edge case in the Open-MPI library and they'd created patches for the new versions of Open-MPI.
There are a couple of take away messages from this
- If you are using MPI-IO and are finding problems when files get larger than 512MB/processor you might need to update your MPI installation
- Sometimes there really are bugs in well tested and widely used libraries
- It's a good idea to be as sure as possible that you aren't making a mistake before assuming that you've found one.