Table of contents:
- What is the sm BTL?
- How do I specify use of sm for MPI messages?
- How does the sm BTL work?
- Why does my MPI job no longer start when there are too many processes on one node?
- How do I know what MCA parameters are available for tuning MPI performance?
- How can I tune these parameters to improve performance?
- Where is the file that sm will mmap in?
- Can I use SysV instead of mmap?
- How much shared memory will my job use?
- How much shared memory do I need?
- How can I decrease my shared-memory usage?
| 1. What is the sm BTL? |
The sm BTL (shared-memory Byte Transfer Layer) is a low-latency, high-bandwidth
mechanism for transferring data between two processes via shared memory.
This BTL can only be used between processes executing on the same node.
The sm BTL has high exclusivity. That is, if one process can reach another
process via sm, then no other BTL will be considered for that connection.
Note that in OMPI 1.3.2, the sm "FIFOs" were reimplemented and the sizing of
the shared-memory area was changed. Much of this FAQ therefore distinguishes
between releases up to OMPI 1.3.1 and releases starting with OMPI 1.3.2.
| 2. How do I specify use of sm for MPI messages? |
Typically, it is unnecessary to do so; OMPI will use the best BTL available
for each communication.
Nevertheless, you can select BTLs explicitly with the MCA parameter btl. If you
do, also specify the self BTL for communications between a process and itself.
Further, if your job's processes will not all run on a single node, you also
need to specify a BTL for internode communications. For example:
shell$ mpirun --mca btl self,sm,tcp -np 16 ./a.out
| 3. How does the sm BTL work? |
A point-to-point user message is broken up by the PML into fragments.
The sm BTL only has to transfer individual fragments. The steps are:
- The sender pulls a shared-memory fragment out of one of its free lists.
Each process has one free list for smaller (e.g., 4Kbyte) eager
fragments and another free list for larger (e.g., 32Kbyte) max fragments.
- The sender packs the user-message fragment into this shared-memory
fragment, including any header information.
- The sender posts a pointer to this shared fragment into the
appropriate FIFO (first-in-first-out) queue of the receiver.
- The receiver polls its FIFO(s). When it finds a new fragment
pointer, it unpacks data out of the shared-memory fragment and notifies
the sender that the shared fragment is ready for reuse (to be
returned to the sender's free list).
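The four steps above can be pictured with a small Python model. This is only an illustrative sketch, not OMPI code: the `Process` class, its `send` and `poll` methods, and the fragment count and size are hypothetical; ordinary Python objects stand in for the mmapped shared-memory area; and header packing is left out.

```python
from collections import deque


class Process:
    """Toy stand-in for one MPI process on a node (hypothetical, not OMPI code)."""

    def __init__(self, rank, num_fragments=4, fragment_size=4096):
        self.rank = rank
        # Free list of reusable fragments owned by this process.
        self.free_list = deque(bytearray(fragment_size) for _ in range(num_fragments))
        # FIFO into which other processes post fragment references.
        self.fifo = deque()
        self.received = []

    def send(self, receiver, payload):
        """Steps 1-3: take a fragment, pack the payload, post it to the receiver."""
        if not self.free_list:
            return False  # no free fragment: the caller must retry later
        if len(payload) > len(self.free_list[0]):
            raise ValueError("payload does not fit in one fragment")
        fragment = self.free_list.popleft()
        fragment[:len(payload)] = payload
        receiver.fifo.append((self, fragment, len(payload)))
        return True

    def poll(self):
        """Step 4: drain the FIFO, unpack each fragment, return it to its sender."""
        while self.fifo:
            sender, fragment, length = self.fifo.popleft()
            self.received.append(bytes(fragment[:length]))
            sender.free_list.append(fragment)


sender, receiver = Process(0), Process(1)
sender.send(receiver, b"hello")
receiver.poll()
print(receiver.received)
```

In this model a sender that runs out of free fragments simply gets `False` back; the real BTL handles that case differently, as the later items on free-list growth describe.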
On each node where an MPI job has two or more processes running, the job creates
a file that each process mmaps into its address space. Shared-memory
resources that the job needs -- such as FIFOs and fragment free lists -- are
allocated from this shared-memory area.
| 4. Why does my MPI job no longer start when there are too many processes on one node? |
If you are using OMPI 1.3.1 or earlier, it is possible that the shared-memory
area set aside for your job was not created large enough. Make sure you're running
in 64-bit mode (compiled with -m64) and set the MCA parameter mpool_sm_max_size
to be very large -- even several Gbytes. Exactly how large is discussed further
below.
Regardless of which OMPI release you're using, make sure that there is sufficient
space for a large file to back the shared memory -- typically in /tmp.
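One way to check for sufficient space before launching is to compare the free space in the backing directory with the size you expect to need. The sketch below relies on Python's standard `shutil.disk_usage`; the helper name `has_room_for_backing_file` is hypothetical, and the 128-Mbyte figure in the usage line is only a placeholder for whatever size your job requires.

```python
import shutil


def has_room_for_backing_file(directory, needed_bytes):
    """Return True if the filesystem holding `directory` has enough free space."""
    return shutil.disk_usage(directory).free >= needed_bytes


# Example: check the current directory; on a cluster node you would
# typically pass the directory that holds the session directory (e.g., /tmp).
print(has_room_for_backing_file(".", 128 * 1024 * 1024))
```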
| 5. How do I know what MCA parameters are available for tuning MPI performance? |
The ompi_info command can display all the parameters available for the
sm BTL and sm mpool:
shell$ ompi_info --param btl sm
shell$ ompi_info --param mpool sm
| 6. How can I tune these parameters to improve performance? |
The default values of the MCA parameters have mostly been chosen to give good
performance. Improving performance further is something of an art, and
sometimes a matter of trading off performance for memory.
btl_sm_eager_limit:
Below this size, messages are sent "eagerly" -- that is, a sender attempts
to write its entire message to shared buffers without waiting for a receiver
to be ready. Above this size, a sender will only write the first part of a
message, then wait for the receiver to acknowledge that it is ready before continuing.
Eager sends can improve performance by decoupling senders from receivers.
btl_sm_max_send_size:
Large messages are sent in fragments of this size. Larger fragments can
lead to greater efficiencies, though they could perhaps also inhibit
pipelining between sender and receiver.
btl_sm_num_fifos:
Starting in OMPI 1.3.2, this is the number of FIFOs per receiving process.
By default, there is only one FIFO per process. Conceivably, if many senders
are all sending to the same process and contending for a single FIFO, there
will be congestion. If there are many FIFOs, however, the receiver must
poll more FIFOs to find incoming messages. Therefore, you might try
increasing this parameter slightly if you have many (at least dozens of)
processes all sending to the same process. For example, if 100 senders are
all contending for a single FIFO for a particular receiver, it may suffice
to increase btl_sm_num_fifos from 1 to 2.
btl_sm_fifo_size:
Starting in OMPI 1.3.2, FIFOs can no longer grow. If you believe the
FIFO is getting congested because a process falls far behind in reading
incoming message fragments, increase this size manually.
btl_sm_free_list_num:
This is the initial number of fragments on each (eager and max) free list.
The free lists can grow in response to resource congestion, but you can
increase this parameter to pre-reserve space for more fragments.
mpool_sm_min_size:
You can reserve headroom in the shared-memory area for later allocations
(e.g., of more eager or max fragments) by increasing this parameter.
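When experimenting with several of these parameters, it can help to generate the mpirun command line from a table of settings. The following is a minimal sketch that uses only the `--mca name value` and `-np` forms shown earlier in this FAQ; the helper `mpirun_command` and the parameter values passed to it are illustrative, not recommendations.

```python
import shlex


def mpirun_command(executable, nprocs, mca=None):
    """Build an mpirun argument list with one `--mca name value` pair per setting."""
    argv = ["mpirun"]
    for name, value in (mca or {}).items():
        argv += ["--mca", name, str(value)]
    argv += ["-np", str(nprocs), executable]
    return argv


# Illustrative settings only: select BTLs explicitly and try two FIFOs per receiver.
argv = mpirun_command("./a.out", 16, {"btl": "self,sm,tcp", "btl_sm_num_fifos": 2})
print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run` as is, which avoids shell quoting issues.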
| 7. Where is the file that sm will mmap in? |
The file will be in the OMPI session directory, which is typically
something like /tmp/openmpi-sessions-myusername@mynodename*.
The file itself will have the name
shared_mem_pool.mynodename. For example, the full path could be
/tmp/openmpi-sessions-myusername@node0_0/1543/1/shared_mem_pool.node0.
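To locate these files on a node while a job is running, you can walk the session directories and look for the file name prefix. This sketch assumes the naming shown above (directories beginning with `openmpi-sessions-` and files beginning with `shared_mem_pool.`); the helper name `find_backing_files` is hypothetical, and the layout may differ in other releases.

```python
import os


def find_backing_files(tmp_root="/tmp"):
    """Return (path, size_in_bytes) for each sm backing file under tmp_root."""
    matches = []
    if not os.path.isdir(tmp_root):
        return matches
    for entry in sorted(os.listdir(tmp_root)):
        if not entry.startswith("openmpi-sessions-"):
            continue
        # Session directories nest job and process directories below them.
        for dirpath, _dirnames, filenames in os.walk(os.path.join(tmp_root, entry)):
            for name in sorted(filenames):
                if name.startswith("shared_mem_pool."):
                    path = os.path.join(dirpath, name)
                    matches.append((path, os.path.getsize(path)))
    return matches


for path, size in find_backing_files():
    print(f"{size:>12d}  {path}")
```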
| 8. Can I use SysV instead of mmap? |
Currently (through OMPI 1.3.2), shared memory is established
via mmap. In future releases, there may be an option for using SysV
shared memory.
| 9. How much shared memory will my job use? |
Your job will create a shared-memory area on each node where it has
two or more processes. This area will be fixed during the lifetime of your
job. Shared-memory allocations (for FIFOs and fragment free lists) will be
made in this area. Here, we look at the size of that shared-memory area.
If you want just one, hard number, then go with approximately 128 Mbytes per
node per job, shared by all the job's processes on that node. That is, an OMPI
job will need more than a few Mbytes per node, but typically less than a few Gbytes.
Better yet, read on.
Up through OMPI 1.3.1, the shared-memory file would basically be sized:
nbytes = n * mpool_sm_per_peer_size
if ( nbytes < mpool_sm_min_size ) nbytes = mpool_sm_min_size
if ( nbytes > mpool_sm_max_size ) nbytes = mpool_sm_max_size
where n is the number of processes in the job running on that particular node
and the mpool_sm_* are MCA parameters.
For small n, this size is typically excessive. For large n (e.g.,
128 MPI processes on the same node), this size may not be sufficient for the job
to start.
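The clamping rule for releases up through OMPI 1.3.1 can be worked through numerically. The Python sketch below mirrors the pseudocode above; the function name is hypothetical, and the parameter values (32 Mbytes per peer, a 128-Mbyte minimum, a 512-Mbyte maximum) are illustrative assumptions, so check `ompi_info --param mpool sm` for the values in your installation.

```python
MIB = 1024 * 1024


def legacy_pool_size(n, per_peer_size, min_size, max_size):
    """Size of the shared-memory file up through OMPI 1.3.1, per the rule above."""
    nbytes = n * per_peer_size
    if nbytes < min_size:
        nbytes = min_size
    if nbytes > max_size:
        nbytes = max_size
    return nbytes


# Illustrative values only; they are assumptions, not confirmed defaults.
for n in (2, 8, 128):
    size = legacy_pool_size(n, per_peer_size=32 * MIB, min_size=128 * MIB, max_size=512 * MIB)
    print(n, size // MIB, "Mbytes")
```

With these assumed values, 2 processes are rounded up to the minimum and 128 processes are capped at the maximum, which is the situation where the area may be too small for the job to start.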
Starting in OMPI 1.3.2, a more sophisticated formula was introduced to model more
closely how much memory was actually needed. That formula is somewhat complicated
and subject to change. It guarantees that there will be at least enough shared
memory for the program to start up and run. See the FAQ item "How much shared
memory do I need?" for how much is needed. Alternatively, the motivated user can
examine the OMPI source code to see the formula used -- for example, in OMPI SVN revision r20906.
OMPI 1.3.2 also uses the MCA parameter mpool_sm_min_size to set a minimum size
-- e.g., so that there is not only enough shared memory for the job to start, but
additionally headroom for further shared-memory allocations (e.g., of more eager
or max fragments).
Once the shared-memory area is established, it will not grow further during the
course of the MPI job's run.
| 10. How much shared memory do I need? |
In most cases, OMPI will start your job with sufficient shared memory.
Nevertheless, if OMPI doesn't get you enough shared memory (e.g., you're using OMPI 1.3.1
or earlier with roughly 128 processes or more on a single node) or you want to
trim shared-memory consumption, you may want to know how much shared memory is really needed.
As we saw earlier, the shared memory area contains:
- FIFOs
- eager fragments
- max fragments
In general, you need only enough shared memory for the FIFOs and fragments
that are allocated during MPI_Init().
Beyond that, you might want additional shared memory for performance reasons,
so that FIFOs and fragment lists can grow if your program's message traffic encounters
resource congestion. Even if there is no room to grow, however, your correctly
written MPI program should still run to completion in the face of congestion;
performance simply degrades somewhat. Note that while shared-memory resources
can grow after MPI_Init(), they cannot shrink.
So, how much shared memory is needed during MPI_Init()?
You need approximately the total of:
- FIFOs:
  - (≤ OMPI 1.3.1): 3 × n × n × pagesize
  - (≥ OMPI 1.3.2): n × btl_sm_num_fifos × btl_sm_fifo_size × sizeof(void *)
- eager fragments: n × ( 2 × n + btl_sm_free_list_inc ) × btl_sm_eager_limit
- max fragments: n × btl_sm_free_list_num × btl_sm_max_send_size
where
- n is the number of MPI processes in your job on the node
- pagesize is the OS page size (4K for Linux and 8K for Solaris)
- btl_sm_* are MCA parameters
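The estimate above is easy to evaluate with a short script. The following Python sketch transcribes the three terms directly; the function name and the parameter values in the usage lines are illustrative assumptions (an 8-byte pointer for `sizeof(void *)`, a 4K page, and made-up MCA values), so substitute the values that `ompi_info` reports for your installation.

```python
def init_shared_memory_bytes(n, eager_limit, max_send_size, free_list_num,
                             free_list_inc, num_fifos=1, fifo_size=4096,
                             pagesize=4096, pointer_size=8, legacy_fifos=False):
    """Approximate shared memory needed during MPI_Init() for n processes on a node."""
    if legacy_fifos:
        # OMPI 1.3.1 and earlier.
        fifos = 3 * n * n * pagesize
    else:
        # OMPI 1.3.2 and later.
        fifos = n * num_fifos * fifo_size * pointer_size
    eager = n * (2 * n + free_list_inc) * eager_limit
    max_frags = n * free_list_num * max_send_size
    return {"fifos": fifos, "eager": eager, "max": max_frags,
            "total": fifos + eager + max_frags}


# Illustrative parameter values only; they are assumptions, not confirmed defaults.
params = dict(eager_limit=4096, max_send_size=32768, free_list_num=8, free_list_inc=64)
print(init_shared_memory_bytes(16, **params))
print(init_shared_memory_bytes(16, legacy_fifos=True, **params))
```

Returning the individual terms alongside the total makes it easier to see which resource to tune first.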
| 11. How can I decrease my shared-memory usage? |
There are two parts to this question.
First, how does one reduce how big the mmap file is? The answer is:
- up to OMPI 1.3.1: reduce mpool_sm_per_peer_size, mpool_sm_min_size,
  and mpool_sm_max_size
- starting with OMPI 1.3.2: reduce mpool_sm_min_size
Second, how does one reduce how much shared memory is needed? (Just making
the mmap file smaller doesn't help if then your job won't start up.) The
answers are:
- For small values of n -- that is, for few processes per node --
  shared-memory usage during MPI_Init() is predominantly for max free lists.
  So, you can reduce the MCA parameter btl_sm_max_send_size. Alternatively,
  you could reduce btl_sm_free_list_num, but it is already pretty small by
  default.
- For large values of n -- that is, for many processes per node -- there
  are two cases:
  - up to OMPI 1.3.1: shared-memory usage is dominated by the
    FIFOs, which consume a certain number of pages. Usage is
    high and cannot be reduced much via MCA parameter tuning.
  - starting with OMPI 1.3.2: shared-memory usage is dominated
    by the eager free lists. So, you can reduce the MCA parameter
    btl_sm_eager_limit.
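For the OMPI 1.3.2 case with many processes per node, the eager-fragment term from the FAQ item "How much shared memory do I need?" scales linearly with btl_sm_eager_limit, so halving the limit halves that term. A minimal Python sketch, with a hypothetical helper name and assumed parameter values (128 processes, btl_sm_free_list_inc of 64):

```python
def eager_fragment_bytes(n, eager_limit, free_list_inc):
    """Eager-fragment term of the MPI_Init() estimate for n processes on a node."""
    return n * (2 * n + free_list_inc) * eager_limit


# Assumed values for illustration: 128 processes, free_list_inc of 64.
before = eager_fragment_bytes(128, eager_limit=4096, free_list_inc=64)
after = eager_fragment_bytes(128, eager_limit=2048, free_list_inc=64)
print(before, after, before - after)
```

Keep in mind that a smaller eager limit also means more messages wait for the receiver to acknowledge that it is ready, so this trades performance for memory.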