Table of contents:
- What is MPI? What is Open MPI?
- Where can I learn about MPI? Are there tutorials available?
- What are the goals of the Open MPI Project?
- Will you allow external involvement?
- How is this software licensed?
- I want to redistribute Open MPI. Can I?
- Preventing forking is a goal; how will you enforce that?
- How are 3rd party contributions handled?
- Is this just YAMPI (yet another MPI implementation)?
- But I love [FT-MPI | LA-MPI | LAM/MPI | PACX-MPI]!
Why should I use Open MPI?
- What will happen to the prior projects?
- What operating systems does Open MPI support?
- What hardware platforms does Open MPI support?
- What network interconnects does Open MPI support?
- What run-time environments does Open MPI support?
- Does Open MPI support LSF?
- How much MPI does Open MPI support?
- Is Open MPI thread safe?
- Does Open MPI support 64 bit environments?
- Does Open MPI support execution in heterogeneous environments?
- Does Open MPI support parallel debuggers?
- Can I contribute to Open MPI?
- I found a bug! How do I report it?
- What license is Open MPI distributed under?
- How do I contribute code to Open MPI?
- I can't submit an Open MPI Third Party Contribution Agreement;
how can I contribute to Open MPI?
- What if I don't want my contribution to be free / open source?
- I want to fork the Open MPI code base. Can I?
- Rats! My contribution was not accepted into the main Open MPI
code base. What now?
- Open MPI terminology
- How do I get a copy of the most recent source code?
- Ok, I got a Subversion checkout. Now how do I build it?
- What is the main tree layout of the Open MPI source tree? Are
there directory name conventions?
- Is there more information available?
- More coming...
- I'm a sysadmin; what do I care about Open MPI?
- What hardware / software / run-time environments / networks
does Open MPI support?
- Do I need multiple Open MPI installations?
- What are MCA Parameters? Why would I set them?
- Do my users need to have their own installation of Open MPI?
- I have power users who will want to override my global MCA
parameters; is this possible?
- What MCA parameters should I, the system administrator, set?
- I just added a new plugin to my Open MPI installation; do I need to recompile all my MPI apps?
- I just upgraded my Myrinet|Infiniband network; do I need to
recompile all my MPI apps?
- We just upgraded our version of Open MPI; do I need to
recompile all my MPI apps?
- I have an MPI application compiled for another MPI; will it
work with Open MPI?
- What is "fault tolerance"?
- What fault tolerance techniques does Open MPI plan on supporting?
- Does Open MPI support checkpoint and restart of parallel jobs (similar
to LAM/MPI)?
- Where can I find the fault tolerance development work?
- Does Open MPI support end-to-end data reliability in MPI
message passing?
- How do I build Open MPI?
- Wow -- I see a lot of errors during configure. Is that normal?
- What are the default build options for Open MPI?
- Open MPI was pre-installed on my machine; should I overwrite it with a new version?
- Where should I install Open MPI?
- Should I install a new version of Open MPI over an old version?
- Can I disable Open MPI's use of plugins?
- How do I build an optimized version of Open MPI?
- Are VPATH and/or parallel builds supported?
- Do I need any special tools to build Open MPI?
- How do I build Open MPI as a static library?
- When I run 'make', it looks very much like the build system is going into a loop.
- Configure issues warnings about sed and unterminated
commands
- Open MPI configured ok, but I get "Makefile:602: *** missing separator" kinds of errors when building
- Open MPI seems to default to building with the GNU compiler set. Can I use other compilers?
- Can I pass specific flags to the compilers / linker used to build Open MPI?
- I'm trying to build with the Intel compilers, but Open MPI
eventually fails to compile with really long error messages. What do
I do?
- When I build with the Intel compiler suite, linking user MPI
applications with the wrapper compilers results in warning messages.
What do I do?
- I'm trying to build with the IBM compilers, but Open MPI
eventually fails to compile. What do I do?
- I'm trying to build with the Oracle Solaris Studio (Sun) compilers on Linux, but Open MPI
eventually fails to compile. What do I do?
- What configure options should I use when building with the Oracle Solaris Studio (Sun) compilers?
- When building with the Oracle Solaris Studio 12 Update 1 (Sun) compilers on x86 Linux, the compiler loops on btl_sm.c. Is there a workaround?
- How do I build OpenMPI on IBM QS22 cell blade machines with GCC and XLC/XLF compilers?
- I'm trying to build with the PathScale 3.0 and 3.1 compilers on Linux, but all Open MPI commands seg fault. What do I do?
- All MPI C++ API functions return errors (or otherwise fail)
when Open MPI is compiled with the PathScale compilers. What do I do?
- How do I build Open MPI with support for Open IB (Infiniband),
mVAPI (Infiniband), GM (Myrinet), and/or MX (Myrinet)?
- How do I build Open MPI with support for SLURM / XGrid?
- How do I build Open MPI with support for SGE?
- How do I build Open MPI with support for PBS Pro / Open PBS / Torque?
- How do I build Open MPI with support for LoadLeveler?
- How do I build Open MPI with support for Platform LSF?
- How do I build Open MPI with processor affinity support?
- How do I build Open MPI with memory affinity / NUMA support (e.g., libnuma)?
- How do I build Open MPI with support for sending CUDA device memory?
- How do I not build a specific plugin / component for Open MPI?
- What other options to [configure] exist?
- Why does compiling the Fortran 90 bindings take soooo long?
- Does Open MPI support MPI_REAL16 and MPI_COMPLEX32?
- Can I re-locate my Open MPI installation without re-configuring/re-compiling/re-installing from source?
- I'm still having problems / my problem is not listed here. What do I do?
- In general, how do I build MPI applications with Open MPI?
- Wait -- what is mpifort? Shouldn't I use mpif77 and mpif90?
- I can't / don't want to use Open MPI's wrapper compilers.
What do I do?
- How do I override the flags specified by Open MPI's wrapper
compilers? (v1.0 series)
- How do I override the flags specified by Open MPI's wrapper
compilers? (v1.1 series and beyond)
- How can I tell what the wrapper compiler default flags are?
- Why does "mpicc --showme <some flags>" not show any
MPI-relevant flags?
- Are there ways to just add flags to the wrapper compilers?
- Why don't the wrapper compilers add "-rpath" (or similar)
flags by default?
- Can I build 100% static MPI applications?
- Can I build 100% static OpenFabrics / OpenIB / OFED MPI
applications on Linux?
- Why does it take soooo long to compile F90 MPI applications?
- How do I build BLACS with Open MPI?
- How do I build ScaLAPACK with Open MPI?
- How do I build PETSc with Open MPI?
- How do I build VASP with Open MPI?
- Are other language / application bindings available for Open MPI?
- What pre-requisites are necessary for running an Open MPI job?
- What ABI guarantees does Open MPI provide?
- Do I need a common filesystem on all my nodes?
- How do I add Open MPI to my PATH and LD_LIBRARY_PATH?
- What if I can't modify my PATH and/or LD_LIBRARY_PATH?
- How do I launch Open MPI parallel jobs?
- How do I run a simple SPMD MPI job?
- How do I run an MPMD MPI job?
- I can run ompi_info and launch MPI jobs on a single host, but not across multiple hosts. Why?
- When I build Open MPI with the Intel compilers, I get warnings
about "orted" or my MPI application not finding libimf.so. What do I do?
- When I build Open MPI with the PGI compilers, I get warnings
about "orted" or my MPI application not finding libpgc.so. What do I do?
- When I build Open MPI with the Pathscale compilers, I get warnings
about "orted" or my MPI application not finding libmv.so. What do I do?
- Can I run non-MPI programs with mpirun / mpiexec?
- Can I run GUI applications with Open MPI?
- Can I run ncurses-based / curses-based / applications with
funky input schemes with Open MPI?
- What other options are available to mpirun?
- How do I use the --host option to mpirun?
- How do I control how my processes are scheduled across nodes?
- I'm not using a hostfile. How are slots calculated?
- Can I run multiple parallel processes on a uniprocessor machine?
- Can I oversubscribe nodes (run more processes than processors)?
- Can I force Aggressive or Degraded performance modes?
- How do I run with the TotalView parallel debugger?
- How do I run with the DDT parallel debugger?
- What launchers are available?
- How do I specify to the rsh launcher to use rsh or ssh?
- How do I run with the SLURM and PBS/Torque launchers?
- How do I run with the SGE launcher?
- Can I suspend and resume my job?
- Does the SGE tight integration support the -notify flag to qsub?
- How do I run with LoadLeveler?
- How do I load libmpi at runtime?
- What MPI environmental variables exist?
- How do I get my MPI job to wireup its MPI connections right away?
- What kind of CUDA support exists in Open MPI?
- Open MPI tells me that it fails to load components with a "file not found" error -- but the file is there! Why does it say this?
- I see strange messages about missing symbols in my application; what do these mean?
- What is mca_pml_teg.so? Why am I getting warnings about not finding the mca_ptl_base_modules_initialized symbol from it?
- Can I build shared libraries on AIX with the IBM XL compilers?
- Why am I getting a seg fault in libopal?
- Why am I getting seg faults / MPI parameter errors when compiling C++ applications with the Intel 9.1 C++ compiler?
- All my MPI applications segv! Why? (Intel Linux 12.1 compiler)
- Why can't I attach my parallel debugger (TotalView, DDT, fx2,
etc.) to parallel jobs?
- When launching large MPI jobs, I see messages like:
mca_oob_tcp_peer_complete_connect: connection failed: Connection timed out (110) - retrying
- How do I find out what MCA parameters are being seen/used by my job?
- How do I debug Open MPI processes in parallel?
- What tools are available for debugging in parallel?
- How do I run with parallel debuggers?
- What controls does Open MPI have that aid in debugging?
- Do I need to build Open MPI with compiler/linker debugging
flags (such as -g) to be able to debug MPI applications?
- Can I use serial debuggers (such as gdb) to debug MPI applications?
- My process dies without any output. Why?
- What is Memchecker?
- What kind of errors can Memchecker find?
- How can I use Memchecker?
- How to run my MPI application with Memchecker?
- Does Memchecker cause performance degradation to my application?
- Is Open MPI 'Valgrind-clean' or how can I identify real errors?
- Can I make Open MPI use rsh instead of ssh?
- What pre-requisites are necessary for running an Open MPI job
under rsh/ssh?
- How can I make ssh not ask me for a password?
- What is a .rhosts file? Do I need it?
- Should I use + in my .rhosts file?
- What versions of BProc does Open MPI work with?
- What pre-requisites are necessary for running an Open MPI job under BProc?
- How do I run jobs under SLURM?
- Does Open MPI support "srun -n X my_mpi_application"?
- I use SLURM on a cluster with the OpenFabrics network stack. Do I need to do anything special?
- How do I reduce startup time for jobs on large clusters?
- Where should I put my libraries: Network vs. local filesystems?
- Static vs shared libraries?
- How do I reduce the time to wireup OMPI's out-of-band communication system?
- Why is my job failing because of file descriptor limits?
- I know my cluster's configuration - how can I take advantage of that knowledge?
- What is the Modular Component Architecture (MCA)?
- What are MCA parameters?
- What frameworks are in Open MPI?
- What frameworks are in Open MPI v1.2 (and prior)?
- What frameworks are in Open MPI v1.3?
- How do I know what components are in my Open MPI installation?
- How do I install my own components into an Open MPI installation?
- How do I know what MCA parameters are available?
- How do I set the value of MCA parameters?
- What are Aggregate MCA (AMCA) parameter files?
- How do I select which components are used?
- What is processor affinity? Does Open MPI support it?
- What is memory affinity? Does Open MPI support it?
- How do I tell Open MPI to use processor and/or memory affinity?
- How do I tell Open MPI to use processor and/or memory affinity
in Open MPI v1.2.x? (What is mpi_paffinity_alone?)
- How do I tell Open MPI to use processor and/or memory affinity
in Open MPI v1.3.x? (What are rank files?)
- How do I tell Open MPI to use processor and/or memory affinity
in Open MPI v1.4.x? (How do I use the --by* and --bind-to-* options?)
- How do I tell Open MPI to use processor and/or memory affinity
in Open MPI v1.5.x?
- Does Open MPI support calling fork(), system(), or popen() in MPI processes?
- I want to run some performance benchmarks with Open MPI. How do I do that?
- I am getting a MPI_Win_free error from IMB-EXT -- what do I do?
- What is the sm BTL?
- How do I specify use of sm for MPI messages?
- How does the sm BTL work?
- Why does my MPI job no longer start when there are too many processes on
one node?
- How do I know what MCA parameters are available for tuning MPI performance?
- How can I tune these parameters to improve performance?
- Where is the file that sm will mmap in?
- Why am I seeing incredibly poor performance with the sm BTL?
- Can I use SysV instead of mmap?
- How much shared memory will my job use?
- How much shared memory do I need?
- How can I decrease my shared-memory usage?
- How do I specify to use the TCP network for MPI messages?
- But wait -- I'm using a high-speed network. Do I have to
disable the TCP BTL?
- How do I know what MCA parameters are available for tuning MPI performance?
- Does Open MPI use the TCP loopback interface?
- I have multiple TCP networks on some/all of my cluster nodes. Which ones will Open MPI use?
- I'm getting TCP-related errors. What do they mean?
- How do I tell Open MPI which TCP networks to use?
- Does Open MPI open a bunch of sockets during MPI_INIT?
- How does Open MPI know which TCP addresses are routable to each other in Open MPI 1.2?
- How does Open MPI know which TCP addresses are routable to each other in Open MPI 1.3 (and beyond)?
- Does Open MPI ever close TCP sockets?
- Does Open MPI support IP interfaces that have more than one IP address?
- Does Open MPI support virtual IP interfaces?
- What Myrinet-based components does Open MPI have?
- How do I specify to use the Myrinet GM network for MPI messages?
- How do I specify to use the Myrinet MX network for MPI messages?
- But wait -- I also have a TCP network. Do I need to explicitly
disable the TCP BTL?
- How do I know what MCA parameters are available for tuning MPI performance?
- I'm experiencing a problem with Open MPI on my Myrinet-based network; how do I troubleshoot and get help?
- How do I adjust the MX first fragment size? Are there constraints?
- What is different between Sun Microsystems ClusterTools 7 and Open
MPI in regards to the uDAPL BTL?
- What values are expected to be used by the btl_udapl_if_include and btl_udapl_if_exclude mca parameter?
- Where is the static uDAPL Registry found?
- How come the value reported by "ifconfig" is not accepted by the btl_udapl_if_include/btl_udapl_if_exclude MCA parameter?
- I get a warning message about not being able to register memory and possibly out of privileged memory while running on Solaris, what can I do?
- What is special about MPI performance analysis?
- What are "profiling" and "tracing"?
- How do I sort out busy wait time from idle wait, user time from system
time, and so on?
- What is PMPI?
- Should I use those switches --enable-mpi-profile and --enable-trace when
I configure OMPI?
- What support does OMPI have for performance analysis?
- How do I view VampirTrace output?
- Are there MPI performance analysis tools for OMPI that I can download for free?
- Any other kinds of tools I should know about?
- How does Open MPI handle HFS+ / UFS filesystems?
- How do I use the Open MPI wrapper compilers in XCode?
- How do I run jobs under XGrid?
- Where do I get more information about running under XGrid?
- Is Open MPI included in OS X?
- How do I not use the OS X-bundled Open MPI?
- Is AIX a supported operating system for Open MPI?
- Does Open MPI work on AIX?
- What is VampirTrace?
- Where can I find the complete documentation of VampirTrace?
- How to instrument my MPI application with VampirTrace?
- Does VampirTrace cause overhead to my application?
- How can I change the underlying compiler of the mpi*-vt wrappers?
- How can I pass VampirTrace related configure options through the
Open MPI configure?
- How to disable the integrated VampirTrace, completely?
| 1. What is MPI? What is Open MPI? |
MPI stands for the Message Passing Interface. Written by the
MPI Forum (a large committee comprising a cross-section of
industry and research representatives), MPI is a standardized API
typically used for parallel and/or distributed computing. The MPI
standard comprises two documents: MPI-1 (published in 1994) and
MPI-2 (published in 1997). MPI-2 is, for the most part, additions and
extensions to the original MPI-1 specification.
The MPI-1 and MPI-2 documents can be downloaded from the official MPI
Forum web site: http://www.mpi-forum.org/.
Open MPI is an open source, freely available implementation of both
the MPI-1 and MPI-2 documents. The Open MPI software achieves high
performance; the Open MPI project is quite receptive to community
input.
| 2. Where can I learn about MPI? Are there tutorials available? |
There are many resources available on the internet for
learning MPI.
- The definitive reference for MPI is the MPI Forum Web site. It has
copies of the MPI standards documents and all of the errata. This is
not recommended for beginners, but is an invaluable reference.
- Several books on MPI are available (search your favorite book
sellers for availability):
- MPI: The Complete Reference, Marc Snir et al. (an annotated
version of the MPI-1 and MPI-2 standards; a 2 volume set,
also known as "The orange book" and "The yellow book")
- Using MPI, William Gropp et al. (2nd edition, also known as
"The purple book")
- Parallel Programming With MPI, Peter Pacheco
- ...and others. This is not a definitive list!
- The "Introduction to MPI" and "Intermediate MPI" tutorials are
excellent web-based MPI instruction offered by the NCSA. This is a
great place for beginners.
- The LAM/MPI web site has links to a few tutorials.
- Last but not least, searching for "MPI tutorial" on Google turns up a wealth of
information (some good, some bad).
| 3. What are the goals of the Open MPI Project? |
We have several top-level goals:
- Create a free, open source, peer-reviewed, production-quality
complete MPI-2 implementation.
- Provide extremely high, competitive performance (latency,
bandwidth, ...pick your favorite metric).
- Directly involve the HPC community with external development
and feedback (vendors, 3rd party researchers, users, etc.).
- Provide a stable platform for 3rd party research and commercial
development.
- Help prevent the "forking problem" common to other MPI
projects.
- Support a wide variety of HPC platforms and environments.
In short, we want to work with and for the HPC community to make a
world-class MPI-2 implementation that can be used on a wide range of
systems.
| 4. Will you allow external involvement? |
ABSOLUTELY.
Bringing together smart researchers and developers to work on a common
product is not only a good idea, it's the open source model. Merging
the multiple MPI implementation teams has worked extremely well for us
over the past year -- extending this concept to the HPC open source
community is the next logical step.
The component architecture that Open MPI is founded upon (see the
"Publications" link for papers about this) is designed to foster 3rd
party collaboration by enabling independent developers to use Open MPI
as a production quality research platform. Although Open MPI is a
relatively large code base, it is rarely necessary to learn much more
than the interfaces for the component type which you are
implementing. Specifically, the component architecture was designed
to allow small, discrete implementations of major portions of MPI
functionality (e.g., point-to-point messaging, collective
communications, run-time environment support, etc.).
We envision at least the following forms of collaboration:
- Peer review of the Open MPI code base
- Discussion with Open MPI developers on public mailing lists
- Direct involvement from HPC software and hardware vendors
- 3rd parties writing and providing their own Open MPI
components
| 5. How is this software licensed? |
The Open MPI code base is licensed under the new BSD license.
That being said, although we are an open source project, we recognize
that not everyone provides free, open source software. Our
collaboration models allow (and encourage!) 3rd parties to write and
distribute their own components -- perhaps with a different license,
and perhaps even as closed source. This is all perfectly acceptable
(and desirable!).
| 6. I want to redistribute Open MPI. Can I? |
Absolutely.
NOTE: We are not lawyers and this is not legal advice.
Please read the Open MPI
license (the BSD license). It contains extremely liberal
provisions for redistribution.
| 7. Preventing forking is a goal; how will you enforce that? |
By definition, we can't. If someone really wants to fork the Open MPI code base, they can.
By virtue of our extremely liberal license, it is possible for
anyone to fork at any time.
However, we hope that no one does.
We intend to distinguish ourselves from other projects by:
- Working with the HPC community to accept best-in-breed
improvements and functionality enhancements.
- Providing a flexible framework and set of APIs that accommodate a
wide variety of different goals within the same code base through the
combinatorial effect of mixing and matching different components.
Hence, we hope that no one ever has a reason to fork the main code
base. We intend to work with the community to accept the best
improvements back into the main code base. And if some developers
want to do things to the main code base that are different than the
goals of the main Open MPI Project, it is our hope that they can do
what they need in components that can be distributed without forking
the main Open MPI code base.
Only time will tell if this ambitious plan is feasible, but we're going
to work hard to make it a reality!
| 8. How are 3rd party contributions handled? |
Before accepting any code from 3rd parties, we require an original
signed contribution agreement from the contributor.
These agreements assert that the contributor has the right to donate
the code and allow the Open MPI Project to perpetually distribute it
under the project's
licensing terms.
This prevents a situation where intellectual property gets into the
Open MPI code base and then someone later claims that we owe them
money for it. Open MPI is a free, open source code base. And we
intend it to remain that way.
The Contributing to Open MPI FAQ
topic contains more information on this issue.
| 9. Is this just YAMPI (yet another MPI implementation)? |
No!
Open MPI initially represented the merger between three well-known MPI
implementations:
- FT-MPI from the University of Tennessee
- LA-MPI from Los Alamos National Laboratory
- LAM/MPI from Indiana University
with contributions from the PACX-MPI team at the University of Stuttgart.
Each of these MPI implementations excelled in one or more areas. The
driving motivation behind Open MPI is to bring the best ideas and
technologies from the individual projects and create one world-class
open source MPI implementation that excels in all areas.
We started with the best ideas from these four MPI
implementations and ported them to an entirely new code base: Open
MPI. This also had the simultaneous effect of enabling us to jettison
old, crufty code that was only maintained for historical reasons from
each project. We started with a clean slate and decided to "do it
Right this time." As such, Open MPI also contains many new designs
and methodologies based on (literally) years of MPI implementation
experience.
After version 1.0 was released, the Open MPI Project grew to include
many other
members who have each brought their knowledge, expertise, and
resources to Open MPI. Open MPI is now far more than just
the best ideas of the four founding MPI implementation projects.
| 10. But I love [FT-MPI | LA-MPI | LAM/MPI | PACX-MPI]!
Why should I use Open MPI? |
Here are a few reasons:
- Open MPI represents the next generation of each of these
implementations.
- Open MPI effectively contains the union of features from each of
the previous MPI projects. If you find a feature in one of the prior
projects that is not in Open MPI, chances are that it will be
soon.
- The vast majority of our future research and development work will
be in Open MPI.
- All the same developers from your favorite project are working on
Open MPI.
Not to worry -- each of the respective teams has a vested interest in
bringing over the "best" parts of their prior implementation to Open
MPI. Indeed, we would love to migrate each of our current user bases
to Open MPI as their time, resources, and constraints allow.
In short: we believe that Open MPI -- its code, methodology, and open
source philosophy -- is the future.
| 11. What will happen to the prior projects? |
Only time will tell (we cannot predict the future), but it is
likely that each project will eventually either end when funding stops
or be used exclusively as a research vehicle. Indeed, some of the
projects must continue to exist at least until their existing
funding expires.
| 12. What operating systems does Open MPI support? |
We primarily develop Open MPI on Linux,
OS X, Solaris (both 32 and 64 bit on all platforms), and
Windows (Windows XP, Windows HPC Server 2003/2008, and Windows 7 RC).
Open MPI is fairly POSIX-neutral, so it will run without too many
modifications on most POSIX-like systems. Hence, if we haven't listed
your favorite operating system here, it should not be difficult to get
Open MPI to compile and run properly. The biggest obstacle is
typically the assembly language, but that's fairly modular and we're
happy to provide information about how to port it to new platforms.
It should be noted that we are quite open to accepting patches for
operating systems that we do not currently support. If we do not have
systems to test these on, we probably will only claim to
"unofficially" support those systems.
Microsoft Windows support was added in v1.3.3; please see the file
README.WINDOWS.
| 13. What hardware platforms does Open MPI support? |
Essentially all the common platforms that the operating
systems listed in the previous question support.
For example, Linux runs on a wide variety of platforms, and we
certainly can't claim to support all of them (e.g., Open MPI does not
run in an embedded environment), but we include assembly support for
Intel, AMD, and PowerPC chips.
| 14. What network interconnects does Open MPI support? |
Open MPI is based upon a component architecture; its MPI
point-to-point functionality utilizes only a small number of components
at run-time. The architecture was specifically designed to make adding
native support for a new network interconnect easy.
Here's the list of networks that we natively support for
point-to-point communication:
- TCP / ethernet
- Shared memory
- Loopback (send-to-self)
- Myrinet / GM
- Myrinet / MX
- Infiniband / OpenIB
- Infiniband / mVAPI
- Portals
Is there a network that you'd like to see supported that is not shown
above? Contributions are
welcome!
| 15. What run-time environments does Open MPI support? |
Open MPI is layered on top of the Open Run-Time Environment (ORTE),
which originally started as a small portion of the Open MPI code base.
However, ORTE has effectively spun off into its own sub-project.
ORTE is a modular system that was specifically architected to abstract
away the back-end run-time environment (RTE) system, providing a
neutral API to the upper-level Open MPI layer. Components can be
written for ORTE that allow it to natively utilize a wide variety of
back-end RTEs.
ORTE currently natively supports the following run-time environments:
- Recent versions of BProc (e.g., Clustermatic)
- Sun Grid Engine
- PBS Pro, Torque, and Open PBS (the TM system)
- LoadLeveler
- LSF
- POE
- rsh / ssh
- SLURM
- XGrid
- Yod (Red Storm)
Is there a run-time system that you'd like to use Open MPI with that
is not listed above? Component
contributions are welcome!
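The idea of a neutral API in front of interchangeable back-end components can be sketched roughly as follows. This is only an illustration in Python, not ORTE's actual interface (ORTE is not written in Python): the `Launcher` class, the `LAUNCHERS` registry, the component names, and the `launch_plan` function are all hypothetical names invented for this sketch, and the command lines are simplified.

```python
from abc import ABC, abstractmethod


class Launcher(ABC):
    """Hypothetical neutral interface that the upper layer programs against."""

    @abstractmethod
    def build_commands(self, hosts, argv):
        """Return the command lines needed to start argv on the given hosts."""


class SshLauncher(Launcher):
    """Hypothetical rsh/ssh-style back end: one remote command per host."""

    def build_commands(self, hosts, argv):
        return [["ssh", host, *argv] for host in hosts]


class SlurmLauncher(Launcher):
    """Hypothetical SLURM-style back end: a single srun call for the whole job."""

    def build_commands(self, hosts, argv):
        return [["srun", "-n", str(len(hosts)), *argv]]


# Registry of available back ends, keyed by a made-up component name.
LAUNCHERS = {"ssh": SshLauncher, "slurm": SlurmLauncher}


def launch_plan(backend, hosts, argv):
    """Upper-layer code: selects a back end by name and never sees its details."""
    return LAUNCHERS[backend]().build_commands(hosts, argv)


if __name__ == "__main__":
    for name in LAUNCHERS:
        print(name, launch_plan(name, ["node1", "node2"], ["./my_mpi_application"]))
```

In this sketch, supporting another run-time environment would mean adding one more `Launcher` subclass and registry entry, while `launch_plan` stays unchanged.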
| 16. Does Open MPI support LSF? |
Starting with Open MPI v1.3, yes!
Prior to Open MPI v1.3, Platform released a script-based integration
in the LSF 6.1 and 6.2 maintenance packs around November of 2006. If
you want this integration, please contact your normal Platform support
channels.
| 17. How much MPI does Open MPI support? |
Open MPI 1.2 supports all of MPI-2.0.
Open MPI 1.3 supports all of MPI-2.1.
| 18. Is Open MPI thread safe? |
Support for MPI_THREAD_MULTIPLE (i.e., multiple threads
executing within the MPI library) and asynchronous message passing
progress (i.e., continuing message passing operations even while no
user threads are in the MPI library) has been designed into Open MPI
from its first planning meetings.
Support for MPI_THREAD_MULTIPLE is included in the first version of
Open MPI, but it is only lightly tested and likely still has some
bugs. Support for asynchronous progress is included in the TCP
point-to-point device, but it, too, has only had light testing and
likely still has bugs.
Completing the testing for full support of MPI_THREAD_MULTIPLE and
asynchronous progress is planned in the near future.
| 19. Does Open MPI support 64 bit environments? |
Yes, Open MPI is 64 bit clean. You should be able to use Open
MPI on 64 bit architectures and operating systems with no
difficulty.
| 20. Does Open MPI support execution in heterogeneous environments? |
As of v1.1, Open MPI requires that the size of C, C++, and
Fortran datatypes be the same on all platforms within a single
parallel application with the exception of types represented by
MPI_BOOL and MPI_LOGICAL -- size differences in these types
between processes are properly handled. Endian differences between
processes in a single MPI job are properly and automatically handled.
Prior to v1.1, Open MPI did not include any support for data size or
endian heterogeneity.
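To make the two kinds of heterogeneity concrete, here is a small sketch using Python's standard `struct` module. It only demonstrates the underlying problem (byte order and platform-dependent type sizes); it does not show how Open MPI performs its conversions internally, and the variable names are made up for the example.

```python
import struct
import sys

value = 1

# The same 32-bit integer as laid out by a little-endian and a big-endian sender.
little = struct.pack("<i", value)  # b'\x01\x00\x00\x00'
big = struct.pack(">i", value)     # b'\x00\x00\x00\x01'

# Reading big-endian bytes as if they were little-endian yields a wrong value.
misread = struct.unpack("<i", big)[0]    # 16777216 (0x01000000)

# Decoding with the sender's byte order recovers the original value.
recovered = struct.unpack(">i", big)[0]  # 1

# Native type sizes are a separate issue: a C "long" is 4 bytes on some
# platforms and 8 on others, so the two sides may disagree on the layout.
native_long_size = struct.calcsize("l")

print("host byte order:", sys.byteorder)
print("misread:", misread, "recovered:", recovered)
print("native long size in bytes:", native_long_size)
```

Byte order can be corrected by a simple swap once the sender's order is known, whereas a size mismatch changes how many bytes make up each element, which is consistent with the FAQ's statement that datatype sizes must match across the job.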
| 21. Does Open MPI support parallel debuggers? |
Yes. Open MPI supports the TotalView API for parallel process
attaching, which several parallel debuggers support (e.g., DDT, fx2).
As part of v1.2.4 (released in September 2007), Open MPI also supports the
TotalView API for viewing message queues in running MPI processes.
See this FAQ entry for
details on how to run Open MPI jobs under TotalView, and this FAQ entry for
details on how to run Open MPI jobs under DDT.
NOTE: The integration of Open
MPI message queue support is problematic with 64 bit versions of
TotalView prior to v8.3:
- The message queue views will be truncated
- Both the communicator and request lists will be incomplete
- Both the communicator and request lists may be filled with wrong
values (such as an MPI_Send to the destination ANY_SOURCE)
There are two workarounds:
- Use a 32 bit version of TotalView
- Upgrade to TotalView v8.3
| 22. Can I contribute to Open MPI? |
YES!
One of the main goals of the Open MPI project is to involve the
greater HPC community.
There are many ways to contribute to Open MPI. Here are a few:
- Subscribe to the mailing
lists and become active in the discussions
- Obtain a source code checkout of Open
MPI's code base and start looking through the code (be sure to see the Developers category for
technical details about the code base)
- Write your own components and distribute them yourself (i.e.,
outside of the main Open MPI distribution)
- Write your own components and contribute them back to the main
code base
- Contribute bug fixes and feature enhancements to the main code
base
| 23. I found a bug! How do I report it? |
First check that this is not already a known issue by checking
the FAQ and the
mailing list archives. If you
can't find your problem mentioned anywhere, it is most helpful if you
can create a "recipe" to replicate the bug.
Please see the Getting
Help page for more details on submitting bug reports.
| 24. What license is Open MPI distributed under? |
Open MPI is distributed under the BSD license.
| 25. How do I contribute code to Open MPI? |
Similar to the Apache projects, before you contribute any code
to the Open MPI code base, you must first print out, sign, and submit
an Open MPI Third Party Contribution Agreement.
NOTE: We are not lawyers and this is not legal advice.
We need to have an established intellectual property pedigree of the
code in Open MPI. This means being able to ensure that all code
included in Open MPI is free, open source, and able to be distributed
under the BSD license.
This prevents a situation where intellectual property gets into the
Open MPI code base and then someone later claims that we owe them
money for it. Open MPI is a free, open source code base. And we
intend it to remain that way.
We enforce this policy by requiring all code contributors to submit a
signed Open MPI Third Party Contribution Agreement before we can
accept any code from them. These agreements assert that the
contributor has the right to donate the code and allow the Open MPI
Project to perpetually distribute it under the project's licensing
terms.
There are two versions of this agreement:
one for
individuals, and one for
organizations. Ensure that you use the correct form; for example,
some companies own all the code produced by their employees, so even
if you write code in your spare time, it may still be the intellectual
property of your employer.
Send an original, signed copy to the address on the form.
We must have a copy of this agreement on file before we can accept
code into the Open MPI code base.
| 26. I can't submit an Open MPI Third Party Contribution Agreement;
how can I contribute to Open MPI? |
Fear not.
Although we cannot accept code from you, there are still plenty of
other ways to contribute to Open MPI. Here are some examples:
- Become an active participant in the mailing lists
- Write and distribute your own components (remember: Open MPI
components can be distributed completely separately from the main Open
MPI distribution -- they can be added to existing Open MPI
installations, and don't even need to be open source)
- Report bugs
- Do a good deed daily
| 27. What if I don't want my contribution to be free / open source? |
No problem.
While we are creating free / open-source software, and we would prefer
if everyone's contributions to Open MPI were also free / open-source,
we certainly recognize that other organizations have different goals
than we do. Such is the reality of software development in today's
global economy.
As such, it is perfectly acceptable to make non-free / non-open-source
contributions to Open MPI.
We obviously cannot accept such contributions into the main code base,
but you are free to distribute plugins, enhancements, etc. as you see
fit. Indeed, the BSD
license is extremely liberal in its redistribution provisions.
Please also see this FAQ entry about forking
the Open MPI code base.
| 28. I want to fork the Open MPI code base. Can I? |
Yes... but we'd prefer if you didn't.
Although Open MPI's
license allows third parties to fork the code base, we would
strongly prefer if you did not. Forking is not necessarily a Bad
Thing, but history has shown that creating too many forks in MPI
implementations leads to massive user and system administrator
confusion. We have personally seen parallel environments loaded with
tens of MPI implementations, each only slightly different from the
others. The users then become responsible for figuring out which MPI
they want / need to use, which can be a daunting and confusing task.
We do periodically have "short" forks. Specifically, sometimes an
organization needs to release a version of Open MPI with a specific
feature.
If you're thinking of forking the Open MPI code base, please let us
know -- let's see if we can work something out so that it is not
necessary.
| 29. Rats! My contribution was not accepted into the main Open MPI
code base. What now? |
If your contribution was not accepted into the main Open MPI
code base, there are likely to be good reasons for it (perhaps
technical, perhaps due to licensing restrictions, etc.).
If you wrote a standalone component, you can still distribute this
component independent of the main Open MPI distribution. Open MPI
components can be installed into existing Open MPI installations. As
such, you can distribute your component -- even if it is closed source
(e.g., distributed as binary-only) -- via any mechanism you choose,
such as on a web site, FTP site, etc.
| 30. Open MPI terminology |
Open MPI is a large project containing many different
sub-systems and a relatively large code base. Let's first cover some
fundamental terminology in order to make the rest of the discussion
easier.
Open MPI has three sections of code:
- OMPI: The MPI API and supporting logic
- ORTE: The Open Run-Time Environment (support for different
back-end run-time systems)
- OPAL: The Open Portable Access Layer (utility and "glue" code
used by OMPI and ORTE)
There are strict abstraction barriers in the code between these
sections. That is, they are compiled into three separate libraries:
libmpi, liborte, and libopal with a strict dependency order:
OMPI depends on ORTE and OPAL, and ORTE depends on OPAL. More
specifically, OMPI executables are linked with:
shell$ mpicc myapp.c -o myapp
# This actually turns into:
shell$ cc myapp.c -o myapp -lmpi -lopen-rte -lopen-pal ...
|
More system-level libraries may be listed after -lopen-pal, but you get the
idea.
Strictly speaking, these are not "layers" in the classic software
engineering sense (even though it is convenient to refer to them as
such). They are listed above in dependency order, but that does not
mean that, for example, the OMPI code must go through the ORTE and
OPAL code in order to reach the operating system or a network
interface.
As such, this code organization reflects abstractions and software
engineering more than a strict hierarchy of functions that must be
traversed in order to reach a lower layer. For example, OMPI can call
OPAL functions directly -- it does not have to go through ORTE.
Indeed, OPAL has a different set of purposes than ORTE, so it wouldn't
even make sense to channel all OPAL access through ORTE. OMPI can
also directly call the operating system as necessary. For example,
many top-level MPI API functions are quite performance sensitive; it
would not make sense to force them to traverse an arbitrarily deep
call stack just to move some bytes across a network.
Here's a list of terms that are frequently used in discussions about
the Open MPI code base:
- MCA: The Modular Component Architecture (MCA) is the foundation
upon which the entire Open MPI project is built. It provides all the
component architecture services that the rest of the system uses.
Although it is the fundamental heart of the system, its
implementation is actually quite small and lightweight -- it is
nothing like CORBA, COM, JINI, or many other well-known component
architectures. It was designed for HPC -- meaning that it is small,
fast, and reasonably efficient -- and therefore offers few services
other than finding, loading, and unloading components.
- Framework: An MCA framework is a construct that is created
for a single, targeted purpose. It provides a public interface that
is used by external code, but it also has its own internal services. A
list of Open MPI frameworks is available here. An MCA
framework uses the MCA's services to find and load components at run
time -- implementations of the framework's interface. An easy example
framework to discuss is the MPI framework named "btl", or the Byte
Transfer Layer. It is used to send and receive data on different
kinds of networks. Hence, Open MPI has btl components for shared
memory, TCP, Infiniband, Myrinet, etc.
- Component: An MCA component is an implementation of a
framework's interface. Another common word for component is
"plugin." It is a standalone collection of code that can be bundled
into a plugin that can be inserted into the Open MPI code base, either
at run-time and/or compile-time.
- Module: An MCA module is an instance of a component (in the
C++ sense of the word "instance"; an MCA component is analogous to a
C++ class). For example, if a node running an Open MPI application has
multiple ethernet NICs, the Open MPI application will contain one TCP
btl component, but two TCP btl modules. This difference between
components and modules is important because modules have private state;
components do not.
Frameworks, components, and modules can be dynamic or static. That is,
they can be available as plugins or they may be compiled statically
into libraries (e.g., libmpi).
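The component / module distinction can be sketched with the class / instance analogy used above. The following Python is only an illustration: the class name, the interface names, and the byte counter are hypothetical and do not mirror the real btl interface. The class plays the role of the TCP btl component, and each instance plays the role of a module with its own private state.

```python
class TcpBtl:
    """The class stands in for a component: one body of code, no per-NIC state."""

    framework = "btl"
    name = "tcp"

    def __init__(self, ifname):
        # Each instance stands in for a module and owns private state.
        self.ifname = ifname
        self.bytes_sent = 0

    def send(self, payload):
        # Only this module's counter changes; other modules are unaffected.
        self.bytes_sent += len(payload)
        return len(payload)


# A node with two ethernet NICs: one component, two modules.
modules = [TcpBtl(ifname) for ifname in ("eth0", "eth1")]
modules[0].send(b"hello")

for module in modules:
    print(module.name, module.ifname, module.bytes_sent)
```

Sending on the first module leaves the second module's counter untouched, which is the sense in which modules, not components, carry private state.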
| 31. How do I get a copy of the most recent source code? |
See the instructions here.
| 32. Ok, I got a Subversion checkout. Now how do I build it? |
See the instructions here.
| 33. What is the main tree layout of the Open MPI source tree? Are
there directory name conventions? |
There are a few notable top-level directories in the source
tree:
- config/: M4 scripts supporting the top-level
configure script
- etc/: Some miscellaneous text files
- include/: Top-level include files that will be installed (e.g.,
mpi.h)
- ompi/: The Open MPI code base
- orte/: The Open RTE code base
- opal/: The OPAL code base
Each of the three main source directories (ompi/, orte/, and
opal/) generates a top-level library named libmpi, liborte, and
libopal, respectively. They can be built as either static or shared
libraries. Executables are also produced in subdirectories of some of
the trees.
Each of the sub-project source directories has a similar (but not
identical) directory structure under it:
- class/: C++-like "classes" (using the OPAL class system)
specific to this project
- include/: Top-level include files specific to this project
- mca/: MCA frameworks and components specific to this project
- runtime/: Startup and shutdown of this project at runtime
- tools/: Executables specific to this project (currently none in
OPAL)
- util/: Random utility code
There are other top-level directories in each of the three
sub-projects, each having to do with specific logic and code for that
project. For example, the MPI API implementations can be found under
ompi/mpi/LANGUAGE, where
LANGUAGE is c, cxx, f77, or f90.
The layout of the mca/ trees is strictly defined. They are of the
form:
<project>/mca/<framework name>/<component name>/
|
To be explicit: it is forbidden to have a directory under the mca
trees that does not meet this template (with the exception of base
directories, explained below). Hence, only framework and component
code can be in the mca/ trees.
That is, framework and component names must be valid directory names
(and C variables; more on that later). For example, the TCP BTL
component is located in the following directory:
ompi/mca/btl/tcp/
|
The name base is reserved; there cannot be a framework or component
named "base." Directories named base are reserved for the
implementation of the MCA and frameworks. Here are a few examples:
# Main implementation of the MCA
opal/mca/base
# Implementation of the paffinity framework
opal/mca/paffinity/base
# Implementation of the pls framework
orte/mca/pls/base
# Implementation of the pml framework
ompi/mca/pml/base
|
Under these mandated directories, frameworks and/or components may have
arbitrary directory structures, however.
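The layout rules above can be expressed as a small checker. This is a sketch only: the function name, the returned labels, and the identifier pattern are assumptions made for illustration, and nothing like it is claimed to exist in the Open MPI build system.

```python
import re

PROJECTS = ("opal", "orte", "ompi")
# Names must also be usable as C identifiers (assumed pattern).
NAME_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def classify_mca_dir(path):
    """Classify a directory given relative to the top of the source tree."""
    parts = [p for p in path.split("/") if p]
    if len(parts) < 3 or parts[0] not in PROJECTS or parts[1] != "mca":
        raise ValueError(f"not of the form <project>/mca/...: {path!r}")
    framework, rest = parts[2], parts[3:]
    if framework == "base":
        return "mca base"            # e.g., opal/mca/base
    if not NAME_RE.match(framework):
        raise ValueError(f"invalid framework name: {framework!r}")
    if not rest:
        return "framework"
    if rest[0] == "base":
        return "framework base"      # e.g., ompi/mca/pml/base
    if not NAME_RE.match(rest[0]):
        raise ValueError(f"invalid component name: {rest[0]!r}")
    # Anything deeper is up to the component itself.
    return "component"


for example in ("opal/mca/base", "orte/mca/pls/base", "ompi/mca/btl/tcp"):
    print(example, "->", classify_mca_dir(example))
```

Directories below a component are not inspected, reflecting the rule that components may organize their own subdirectories freely.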
| 34. Is there more information available? |
Yes. In early 2006, Cisco hosted an Open MPI workshop where
the Open MPI Team provided several days of intensive
dive-into-the-code tutorials. The slides from these tutorials are available here.
Additionally, an OpenRTE (ORTE) workshop was held for similar purposes
in late 2006. The slides from the ORTE workshop are available here.
| 35. More coming... |
There are more questions / answers coming... stay tuned...
| 36. I'm a sysadmin; what do I care about Open MPI? |
Several members of the Open MPI team have strong system
administrator backgrounds; we recognize the value of having software
that is friendly to system administrators. Here are some of the reasons
that Open MPI is attractive for system administrators:
- Simple, standards-based installation
- Help reduce the number of MPI installations
- Ability to set system-level and user-level parameters
- Scriptable information sources about the Open MPI installation
See the rest of the questions in the FAQ section for more details.
| 37. What hardware / software / run-time environments / networks
does Open MPI support? |
See this FAQ
category for more information
| 38. Do I need multiple Open MPI installations? |
Yes and no.
Open MPI can handle a variety of different run-time environments
(e.g., rsh/ssh, SLURM, PBS, etc.) and a variety of different
interconnection networks (e.g., ethernet, Myrinet, Infiniband, etc.)
in a single installation. Specifically: because Open MPI is
fundamentally powered by a component architecture, plug-ins for all
these different run-time systems and interconnect networks can be
installed in a single installation tree. The relevant plug-ins will
only be used in the environments where they make sense.
Hence, there is no need to have one MPI installation for Myrinet, one
MPI installation for Ethernet, one MPI installation for PBS, one MPI
installation for rsh, etc. Open MPI can handle all of these in a
single installation.
However, there are some issues that Open MPI cannot solve. Binary
compatibility between different compilers is such an issue. Let's
examine this on a per-language basis (be sure to see the big caveat at
the end):
The big caveat to all of this is that Open MPI will only work with
different compilers if all the datatype sizes are the same. For
example, even though Open MPI supports all 4 name mangling schemes,
the size of the Fortran LOGICAL type may be 1 byte in some compilers
and 4 bytes in others. This will likely cause Open MPI to perform
unpredictably.
The bottom line is that Open MPI can support all manner of run-time
systems and interconnects in a single installation, but supporting
multiple compilers "sort of" works (i.e., is subject to trial and
error) in some cases, and definitely does not work in other cases.
There's unfortunately little that we can do about this -- it's a
compiler compatibility issue, and one that compiler authors have
little incentive to resolve.
| 39. What are MCA Parameters? Why would I set them? |
MCA parameters are a way to tweak Open MPI's behavior at
run-time. For example, MCA parameters can specify:
- Which interconnect networks to use
- Which interconnect networks not to use
- The size difference between eager sends and rendezvous protocol
sends
- How many registered buffers to pre-pin (e.g., for GM or mVAPI)
- The size of the pre-pinned registered buffers
- ...etc.
It can be quite valuable for a system administrator to play with such
values a bit and find an "optimal" setting for a particular
operating environment. These values can then be set in a global text
file that all users will, by default, inherit when they run Open MPI
jobs.
For example, say that you have a cluster with 2 ethernet networks --
one for NFS and other system-level operations, and one for MPI jobs.
The system administrator can tell Open MPI to not use the NFS TCP
network at a system level, such that when users invoke mpirun or
mpiexec to launch their jobs, they will automatically only be using
the network meant for MPI jobs.
See the run-time tuning FAQ
category for information on how to set global MCA parameters.
| 40. Do my users need to have their own installation of Open MPI? |
Usually not. It is typically sufficient for a single Open MPI
installation (or perhaps a small number of Open MPI installations,
depending on compiler interoperability) to serve an entire parallel
operating environment.
Indeed, a system-wide Open MPI installation can be customized on a
per-user basis in two important ways:
- Per-user MCA parameters: Each user can set their own set of MCA
parameters, potentially overriding system-wide defaults.
- Per-user plug-ins: Users can install their own Open MPI
plug-ins under $HOME/.openmpi/components. Hence, developers can
experiment with new components without de-stabilizing the rest of the
users on the system. Or power users can download 3rd party components
(perhaps even research-quality components) without affecting other users.
| 41. I have power users who will want to override my global MCA
parameters; is this possible? |
Absolutely.
See the run-time tuning FAQ
category for information on how to set MCA parameters, both at the
system level and on a per-user (or per-MPI-job) basis.
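The idea of layered overrides can be sketched as a simple merge. The Python below is a conceptual model, not Open MPI's parameter-handling code. The function name is hypothetical, the parameter values are made up, and the precedence order (system defaults, then per-user settings, then per-job settings) is an assumption of this sketch; consult the run-time tuning category for the actual rules.

```python
def effective_params(system, user=None, job=None):
    """Merge parameter layers; later layers override earlier ones (assumed order)."""
    merged = dict(system)        # system-wide defaults set by the administrator
    merged.update(user or {})    # per-user overrides
    merged.update(job or {})     # per-job overrides
    return merged


system = {"btl": "^tcp", "btl_tcp_if_exclude": "lo,eth0"}
user = {"btl": "tcp,self"}       # a power user re-enables TCP for their own jobs

print(effective_params(system, user))
```

Settings the user does not touch, such as btl_tcp_if_exclude here, keep the administrator's value.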
| 42. What MCA parameters should I, the system administrator, set? |
This is a difficult question and depends on both your specific
parallel setup and the applications that typically run there.
The best thing to do is to use the ompi_info command to see what
parameters are available and relevant to you. Specifically,
ompi_info can be used to show all the parameters that are available
for each plug-in. Two common places that system administrators like
to tweak are:
- Only allow specific networks: Say you have a cluster with a
high-speed interconnect (such as Myrinet or Infiniband) and an
ethernet network. The high-speed network is intended for MPI jobs;
the ethernet network is intended for NFS and other
administrative-level jobs. In this case, you can simply turn off Open
MPI's TCP support. The "btl" framework contains Open MPI's network
support; in this case, you want to disable the tcp plug-in. You can
do this by adding the following line in the file
$prefix/etc/openmpi-mca-params.conf:
btl = ^tcp
|
This tells Open MPI to load all BTL components except tcp.
Consider another example: your cluster has two TCP networks, one for
NFS and administration-level jobs, and another for MPI jobs. You can
tell Open MPI to ignore the TCP network used by NFS by adding the
following line in the file $prefix/etc/openmpi-mca-params.conf:
btl_tcp_if_exclude = lo,eth0
|
The value of this parameter is the device names to exclude. In this
case, we're excluding lo (localhost, because Open MPI has its own
internal loopback device) and eth0.
- Tune the parameters for specific networks: Each network plug-in
has a variety of different tunable parameters. Use the ompi_info
command to see what is available. You can show all available parameters
with:
shell$ ompi_info --param all all
|
Beware: there are many variables available. You can limit the
output by showing all the parameters in a specific framework or in a
specific plug-in with the command line parameters:
shell$ ompi_info --param btl all
|
Shows all the parameters of all BTL components, and:
shell$ ompi_info --param btl mvapi
|
Shows all the parameters of just the mvapi BTL component.
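The include / exclude notation used for component selection, where a leading ^ means "all except", can be modeled in a few lines. This Python is a simplified sketch, not Open MPI's selection logic. The function name is hypothetical and the list of available components is only an example.

```python
def select_components(available, spec):
    """Apply a selection value such as "tcp,self" or "^tcp".

    A leading "^" turns the whole comma-separated list into an exclusion
    list; otherwise only the named components are kept.
    """
    spec = spec.strip()
    exclude = spec.startswith("^")
    names = {n.strip() for n in spec.lstrip("^").split(",") if n.strip()}
    if exclude:
        return [c for c in available if c not in names]
    return [c for c in available if c in names]


available = ["self", "sm", "tcp", "openib"]   # example component names
print(select_components(available, "^tcp"))
print(select_components(available, "tcp,self"))
```

Under this model, the value ^tcp keeps every available component except tcp, which is the behavior described for the first example above.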
| 43. I just added a new plugin to my Open MPI installation; do I need to recompile all my MPI apps? |
If your installation of Open MPI uses shared libraries and
components are standalone plug-in files, then no. If you add a new
component (such as support for a new network), Open MPI will simply
open the new plugin at run-time -- your applications do not need to be
recompiled or re-linked.
| 44. I just upgraded my Myrinet|Infiniband network; do I need to
recompile all my MPI apps? |
If your installation of Open MPI uses shared libraries and
components are standalone plug-in files, then no. You simply need to
recompile the Open MPI components that support that network and
re-install them.
More specifically, Open MPI shifts the dependency on the underlying
network away from the MPI applications and to the Open MPI plug-ins.
This is a major advantage over many other MPI implementations.
MPI applications will simply open the new plugin when they run.
| 45. We just upgraded our version of Open MPI; do I need to
recompile all my MPI apps? |
It is unlikely. Most MPI applications solely interact with
Open MPI through the standardized MPI API and the constant values it
publishes in mpi.h. The MPI-2 API will not change until the MPI
Forum changes it.
We will try hard to make Open MPI's mpi.h stable such that the
values will not change from release-to-release. While we cannot
guarantee that they will stay the same forever, we'll try hard to make
it so.
| 46. I have an MPI application compiled for another MPI; will it
work with Open MPI? |
It is highly unlikely. Open MPI does not attempt to
interface to other MPI implementations, nor executables that were
compiled for them. Sorry!
MPI applications need to be compiled and linked with Open MPI in order
to run under Open MPI.
| 47. What is "fault tolerance"? |
The phrase "fault tolerance" means many things to many
people. Typical definitions range from user processes dumping vital
state to disk periodically to checkpoint/restart of running processes
to elaborate recreate-process-state-from-incremental-pieces schemes to
... (you get the idea).
In the scope of Open MPI, we typically define "fault tolerance" to
mean the ability to recover from one or more component failures in a
well defined manner with either a transparent or application-directed
mechanism. Component failures may exhibit themselves as a corrupted
transmission over a faulty network interface or the failure of one or
more serial or parallel processes due to a processor or node failure.
Open MPI strives to provide the application with a consistent system
view while still providing a production quality, high performance
implementation.
Yes, that's pretty much as all-inclusive as possible -- intentionally
so! Remember that in addition to being a production-quality MPI
implementation, Open MPI is also a vehicle for research. So while
some forms of "fault tolerance" are more widely accepted and used,
others are certainly of valid academic interest.
| 48. What fault tolerance techniques does Open MPI plan on supporting? |
Open MPI plans on supporting the following fault tolerance
techniques:
- Coordinated and uncoordinated process checkpoint and
restart. Similar to those implemented in LAM/MPI and MPICH-V,
respectively.
- Message logging techniques. Similar to those implemented in
MPICH-V
- Data Reliability and network fault tolerance. Similar to those
implemented in LA-MPI
- User directed, and communicator driven fault tolerance. Similar to
those implemented in FT-MPI.
The Open MPI team will not limit its fault tolerance techniques to
those mentioned above, but intends to extend beyond them in the
future.
| 49. Does Open MPI support checkpoint and restart of parallel jobs (similar
to LAM/MPI)? |
Yes. The v1.3 series was the first release series of Open MPI to include
support for the transparent, coordinated checkpointing and restarting of MPI
processes (similar to LAM/MPI).
Open MPI supports both the BLCR
checkpoint/restart system and a "self" checkpointer that allows
applications to perform their own checkpoint/restart functionality while taking
advantage of the Open MPI checkpoint/restart infrastructure.
For both of these, Open MPI provides a coordinated checkpoint/restart protocol
and integration with a variety of network interconnects including shared memory,
Ethernet, InfiniBand, and Myrinet.
The implementation introduces a series of new frameworks and
components designed to support a variety of checkpoint and restart
techniques. This allows us to support the methods described above
(application-directed, BLCR, etc.) as well as other kinds of
checkpoint/restart systems (e.g., Condor, libckpt) and protocols
(e.g., uncoordinated, message induced).
Note:
The checkpoint/restart support was last released as part of the v1.6 series.
The v1.7 series and the Open MPI trunk do not support this functionality (most of
the code is present in the repository, but it is known to be non-functional in
most cases).
This feature is looking for a maintainer. Interested parties should inquire
on the developers mailing list.
| 50. Where can I find the fault tolerance development work? |
The end-to-end MPI message data reliability work is being actively
developed on the subversion trunk (i.e., reliable message passing over
unreliable networks). See this FAQ entry
for more details.
The coordinated checkpoint and restart process fault tolerance work is
currently available on the Open MPI development trunk and in the v1.3
release series. For more information about how to use this feature see
the following websites:
For information on the Fault Tolerant MPI prototype in Open MPI see the
links below:
| 51. Does Open MPI support end-to-end data reliability in MPI
message passing? |
The current release of Open MPI does not support end-to-end
data reliability in message passing any more than the underlying
network already guarantees. Future releases of Open MPI will include
explicit data reliability support (i.e., more functionality than is
provided by the underlying network).
Specifically, the data reliability ("dr") PML component (available
on the trunk, but not yet in a stable release) assumes that the
underlying network is unreliable. It can drop / restart connections,
retransmit corrupted or lost data, etc. The end effect is that data
sent through MPI API functions will be guaranteed to be reliable.
For example, if you're using TCP as a message transport, chances of
data corruption are fairly low. However, other interconnects do not
guarantee that data will be uncorrupted when traveling across the
network. Additionally, there are nonzero possibilities that data can
be corrupted while traversing PCI buses, etc. (some corruption errors
at this level can be caught/fixed, others cannot). Such errors are
not uncommon at high altitudes (!).
Note that such added reliability does incur a performance cost --
latency and bandwidth suffer when Open MPI performs the consistency
checks that are necessary to provide such guarantees.
Many clusters/networks will not need data reliability. But some do
(e.g., those operating at high altitudes). The dr PML is intended for
environments where reliability is an issue and users are willing to
tolerate slightly slower applications in order to guarantee that their
job does not crash (or worse, produce wrong answers).
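The general technique of detecting corruption and retransmitting can be sketched as follows. This Python is not the dr PML's protocol. The framing format, the function names, and the simulated flaky channel are all assumptions chosen to keep the example self-contained.

```python
import zlib


def frame(payload):
    """Prefix the payload with its CRC32 so the receiver can verify it."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def unframe(data):
    """Return the payload if its checksum matches, else None."""
    expected = int.from_bytes(data[:4], "big")
    payload = data[4:]
    return payload if zlib.crc32(payload) == expected else None


def send_reliably(payload, channel, max_attempts=5):
    """Retransmit until an uncorrupted copy arrives or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        received = unframe(channel(frame(payload)))
        if received is not None:
            return received, attempt
    raise RuntimeError("giving up after repeated corruption")


def flaky_channel(corrupt_first_n):
    """Simulated network that damages the first N transmissions."""
    state = {"calls": 0}

    def channel(data):
        state["calls"] += 1
        if state["calls"] <= corrupt_first_n:
            damaged = bytearray(data)
            damaged[-1] ^= 0xFF      # flip the bits of the last byte
            return bytes(damaged)
        return data

    return channel


received, attempts = send_reliably(b"hello", flaky_channel(2))
print(received, attempts)
```

Each retransmission costs an extra round trip and every message pays for a checksum, which is the latency and bandwidth cost mentioned above.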
| 52. How do I build Open MPI? |
If you have obtained a developer's checkout from Subversion, skip this
FAQ question and consult these
directions.
For everyone else, in general, all you need to do is expand the
tarball, run the provided configure script, and then run "make all
install". For example:
shell$ gunzip -c openmpi-1.6.4.tar.gz | tar xf -
shell$ cd openmpi-1.6.4
shell$ ./configure --prefix=/usr/local
<...lots of output...>
shell$ make all install
|
Note that the configure script supports a lot of different command
line options. For example, the --prefix option in the above example
tells Open MPI to install under the directory /usr/local/.
Other notable configure options are required to support specific
network interconnects and back-end run-time environments. More
generally, Open MPI supports a wide variety of hardware and
environments, but it sometimes needs to be told where support
libraries and header files are located.
Consult the README file in the Open MPI tarball and the output of
"configure --help" for specific instructions regarding Open MPI's
configure command line options.
| 53. Wow -- I see a lot of errors during configure.
Is that normal? |
If configure finishes successfully -- meaning that it
generates a bunch of Makefiles at the end -- then yes, it is
completely normal.
The Open MPI configure script tests for a lot of things, not all of
which are expected to succeed. For example, if you do not have
Myrinet's GM library installed, you'll see failures about trying to
find the GM library. You'll also see errors and warnings about
various operating-system specific tests that are not aimed at the
operating system you are running.
These are all normal, expected, and nothing to be concerned about. It
just means, for example, that Open MPI will not build Myrinet GM
support.
| 54. What are the default build options for Open MPI? |
If you have obtained a developer's checkout from Subversion,
you must consult these directions.
The default options for building an Open MPI tarball are:
- Compile Open MPI with all optimizations enabled
- Build shared libraries
- Build components as standalone dynamic shared object (DSO) files
(i.e., run-time plugins)
- Try to find support for all hardware and environments by looking
for support libraries and header files in standard locations; skip them if not found
Open MPI's configure script has a large number of options, several of
which are of the form --with-<FOO>(=DIR), usually with a
corresponding --with-<FOO>-libdir=DIR option. The (=DIR)
part means that specifying the directory is optional. Here are some
examples (explained in more detail below):
- --with-openib(=DIR) and --with-openib-libdir=DIR
- --with-mx(=DIR) and --with-mx-libdir=DIR
- --with-psm(=DIR) and --with-psm-libdir=DIR
- ...etc.
As mentioned above, by default, Open MPI will try to build support for
every feature that it can find on your system. If support for a given
feature is not found, Open MPI will simply skip building support for
it (this usually means not building a specific plugin).
"Support" for a given feature usually means finding both the
relevant header and library files for that feature. As such, the
command-line switches listed above are used to override default
behavior and allow specifying whether you want support for a given
feature or not, and if you do want support, where the header files
and/or library files are located (which is useful if they are not
located in compiler/linker default search paths). Specifically:
- If --without-<FOO> is specified, Open MPI will not even
look for support for feature FOO. It will be treated as if support
for that feature was not found (i.e., it will be skipped).
- If --with-<FOO> is specified with no optional directory,
Open MPI's configure script will abort if it cannot find support for
the FOO feature. More specifically, only compiler/linker default
search paths will be searched while looking for the relevant header
and library files. This option essentially tells Open MPI, "Yes, I
want support for FOO -- it is an error if you don't find support for
it."
- If --with-<FOO>=/some/path is specified, it is
essentially the same as specifying --with-<FOO> but also
tells Open MPI to add -I/some/path/include to compiler search paths,
and try (in order) adding -L/some/path/lib and -L/some/path/lib64
to linker search paths when searching for FOO support. If found,
the relevant compiler/linker paths are added to Open MPI's general
build flags. This option is helpful when support for feature FOO is
not found in default search paths.
- If --with-<FOO>-libdir=/some/path/lib is specified, it
only specifies that if Open MPI searches for FOO support, it
should use /some/path/lib for the linker search path.
In general, it is usually sufficient to run Open MPI's configure
script with no --with-<FOO> options if all the features you
need supported are in default compiler/linker search paths. If the
features you need are not in default compiler/linker search paths,
you'll likely need to specify --with-<FOO> kinds of flags.
However, note that it is safest to add --with-<FOO> types of
flags if you want to guarantee that Open MPI builds support for
feature FOO, regardless of whether support for FOO can be found in
default compiler/linker paths or not -- configure will abort if it
can't find the appropriate support for FOO. This may be preferable
to unexpectedly discovering at run-time that Open MPI is missing
support for a critical feature.
Be sure to note the difference in the directory specification between
--with-<FOO> and --with-<FOO>-libdir. The former
takes a top-level directory (such that "/include", "/lib", and
"/lib64" are appended to it) while the latter takes a single
directory where the library is assumed to exist (i.e., nothing is
suffixed to it).
Finally, note that starting with Open MPI v1.3, configure will
sanity check to ensure that any directory given to
--with-<FOO> or --with-<FOO>-libdir actually exists
and will error if it does not. This prevents typos and mistakes in
directory names, and prevents Open MPI from accidentally using a
compiler/linker-default path to satisfy FOO's header and library
files.
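As a rough illustration of the directory rules above, the following sketch prints the search paths that each form implies. The helper names and the /opt/foo prefix are made up for this example; they are not part of Open MPI's configure script.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Open MPI) that mimic the rules above.

# --with-<FOO>=DIR: "/include", "/lib", and "/lib64" are appended to DIR.
with_foo_paths() {
    printf '%s\n' "-I$1/include" "-L$1/lib" "-L$1/lib64"
}

# --with-<FOO>-libdir=DIR: DIR is used as-is; nothing is appended.
with_foo_libdir_paths() {
    printf '%s\n' "-L$1"
}

with_foo_paths /opt/foo
with_foo_libdir_paths /opt/foo/lib64
```

The first call prints three candidate paths derived from the top-level directory, while the second prints only the single library directory it was given.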
| 55. Open MPI was pre-installed on my machine; should I overwrite it with a new version? |
Probably not.
Many systems come with some version of Open MPI pre-installed (e.g.,
many Linuxes, BSD variants, and OS X). If you download a newer version
of Open MPI from this web site (or one of the Open MPI mirrors), you
probably do not want to overwrite the system-installed Open MPI.
This is because the system-installed Open MPI is typically under the
control of some software package management system (rpm, yum, etc.).
Instead, you probably want to install your new version of Open MPI to
another path, such as /opt/openmpi- (or whatever is
appropriate for your system).
This FAQ
entry also has much more information about strategies for where to
install Open MPI.
| 56. Where should I install Open MPI? |
A common environment to run Open MPI is in a "Beowulf"-class
or similar cluster (e.g., a bunch of 1U servers in a bunch of racks).
Simply stated, Open MPI can run on a group of servers or workstations
connected by a network. As mentioned above, there are several
prerequisites, however (for example, you typically must have an
account on all the machines, you can rsh or ssh between the nodes
without using a password, etc.).
This raises the question for Open MPI system administrators: where to
install the Open MPI binaries, header files, etc.? This discussion
mainly addresses this question for homogeneous clusters (i.e., where
all nodes and operating systems are the same), although elements of
this discussion apply to heterogeneous clusters as well.
Heterogeneous admins are encouraged to read this discussion and then
see the heterogeneous section of this FAQ.
There are two common approaches:
- Have a common filesystem, such as NFS, between all the machines
to be used. Install Open MPI such that the installation directory is
the same value on each node. This will greatly simplify users'
shell startup scripts (e.g.,
.bashrc, .cshrc, .profile, etc.)
-- the PATH can be set without checking which machine the user is
on. It also simplifies the system administrator's job; when the time
comes to patch or otherwise upgrade OMPI, only one copy needs to be
modified.
For example, consider a cluster of four machines: inky, blinky,
pinky, and clyde.
There is a bit of a disadvantage in this approach; each of the remote
nodes has to incur NFS (or whatever filesystem is used) delays to
access the Open MPI directory tree. However, both the administration
ease and low cost (relatively speaking) of using a networked file
system usually greatly outweigh the cost. Indeed, once an MPI
application is past MPI_INIT, it doesn't use the Open MPI binaries
very much.
NOTE: Open MPI, by default, uses a plugin
system for loading functionality at run-time. Most of Open MPI's
plugins are opened during the call to MPI_INIT. This can cause a lot
of filesystem traffic, which, if Open MPI is installed on a networked
filesystem, may be noticeable. Two common options to avoid this extra
filesystem traffic are to build Open MPI to not use plugins (see this FAQ entry for details) or to install
Open MPI locally (see below).
- If you are concerned with networked filesystem costs of accessing
the Open MPI binaries, you can install Open MPI on the local hard
drive of each node in your system. Again, it is highly advisable to
install Open MPI in the same directory on each node so that each
user's
PATH can be set to the same value, regardless of the node
that a user has logged on to.
This approach will save some network latency of accessing the Open MPI
binaries, but is typically only used where users are very concerned
about squeezing every spare cycle out of their machines, or are
running at extreme scale where a networked filesystem may get
overwhelmed by filesystem requests for Open MPI binaries when running
very large parallel jobs.
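Because the installation directory is the same on every node in both approaches, a single startup-file fragment can serve all machines. Here is a minimal sketch for Bourne-style shells, assuming a hypothetical installation prefix of /opt/openmpi (substitute your own):

```shell
# Fragment for .profile / .bashrc (Bourne-style shells).
# OMPI_PREFIX is an illustrative variable name and /opt/openmpi is an
# assumed prefix; neither is mandated by Open MPI.
OMPI_PREFIX=/opt/openmpi

# Put the Open MPI executables (mpicc, mpirun, ...) first in the PATH.
PATH="$OMPI_PREFIX/bin:$PATH"

# Let the run-time linker find the Open MPI shared libraries; only add
# a ":" separator when LD_LIBRARY_PATH already has a value.
LD_LIBRARY_PATH="$OMPI_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

export PATH LD_LIBRARY_PATH
```

Note that the fragment prints nothing, which keeps non-interactive logins on remote nodes free of stray output.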
| 57. Should I install a new version of Open MPI over an old version? |
We do not recommend this.
Before discussing specifics, here are some definitions that are
necessary to understand:
- Source tree: The tree where the Open MPI source
code is located. It is typically the result of expanding an Open MPI
distribution source code bundle, such as a tarball.
- Build tree: The tree where Open MPI was built.
It is always related to a specific source tree, but may actually be a
different tree (since Open MPI supports VPATH builds). Specifically,
this is the tree where you invoked
configure, make, etc. to build
and install Open MPI.
- Installation tree: The tree where Open MPI was
installed. It is typically the "prefix" argument given to Open MPI's
configure script; it is the directory from which you run installed Open
MPI executables.
In its default configuration, an Open MPI installation consists of
several shared libraries, header files, executables, and plugins
(dynamic shared objects -- DSOs). These installation files act
together as a single entity. The specific filenames and
contents of these files are subject to change between different
versions of Open MPI.
KEY POINT: Installing one
version of Open MPI does not uninstall another version.
If you install a new version of Open MPI over an older version, this
may not remove or overwrite all the files from the older version.
Hence, you may end up with an incompatible muddle of files from two
different installations -- which can cause problems.
The Open MPI team recommends one of the following methods for
upgrading your Open MPI installation:
- Install newer versions of Open MPI into a different
directory. For example, install into
/opt/openmpi-a.b.c and
/opt/openmpi-x.y.z for versions a.b.c and x.y.z, respectively.
- Completely uninstall the old version of Open MPI before
installing the new version. The
make uninstall process from the Open
MPI a.b.c build tree should completely uninstall that version from
the installation tree, making it safe to install a new version (e.g.,
version x.y.z) into the same installation tree.
- Remove the old installation directory entirely and then install
the new version. For example "
rm -rf /opt/openmpi" (assuming
that there is nothing else of value in this tree!). The installation
of Open MPI x.y.z will safely re-create the /opt/openmpi tree. This
method is preferable if you no longer have the source and build trees
to Open MPI a.b.c available from which to "make uninstall".
- Go into the Open MPI a.b.c installation directory and manually
remove all old Open MPI files. Then install Open MPI x.y.z into the
same installation directory. This can be a somewhat painful,
annoying, and error-prone process. We do not recommend it. Indeed,
if you no longer have access to the original Open MPI a.b.c source and
build trees, it may be far simpler to download Open MPI version a.b.c
again from the Open MPI web site, configure it with the same
installation prefix, and then run "
make uninstall". Or use one of
the other methods, above.
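Before installing into an existing directory, it can help to check whether an older Open MPI is already there. The sketch below is one way to do that; the function name is made up for this example, and it looks only for a few well-known files (ompi_info, mpirun, and the plugin directory), so treat a negative answer as a hint rather than a guarantee.

```shell
#!/bin/sh
# prefix_has_ompi DIR: succeed if DIR already appears to hold an Open MPI
# installation. Hypothetical helper; it checks only a few well-known
# files from a default installation.
prefix_has_ompi() {
    [ -e "$1/bin/ompi_info" ] || [ -e "$1/bin/mpirun" ] || [ -d "$1/lib/openmpi" ]
}
```

For example, "prefix_has_ompi /opt/openmpi && echo 'uninstall the old version first'" prints the reminder only when an earlier installation is detected.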
| 58. Can I disable Open MPI's use of plugins? |
Yes.
Open MPI uses plugins for much of its functionality. Specifically,
Open MPI looks for and loads plugins as dynamic shared objects
(DSOs) during the call to MPI_INIT. However, these plugins can be
compiled and installed in several different ways:
- As DSOs: In this mode (the default), each of Open MPI's plugins
is compiled as a separate DSO that is dynamically loaded at run
time.
- Advantage: this approach is highly flexible -- it gives system
developers and administrators a fine-grained way to install new
plugins into an existing Open MPI installation, and also allows the
removal of old plugins (i.e., forcibly disallowing the use of specific
plugins) simply by removing the corresponding DSO(s).
- Disadvantage: this approach causes additional filesystem
traffic (mostly during MPI_INIT). If Open MPI is installed on a
networked filesystem, this can cause noticeable network traffic when a
large parallel job starts, for example.
- As part of a larger library: In this mode, Open MPI "slurps
up" the plugins and includes them in libmpi (and other libraries).
Hence, all plugins are included in the main Open MPI libraries
that are loaded by the system linker before an MPI process even
starts.
- Advantage: Significantly less filesystem traffic than the DSO
approach. This model can be much more performant on network
installations of Open MPI.
- Disadvantage: Much less flexible than the DSO approach; system
administrators and developers have significantly less ability to
add/remove plugins from the Open MPI installation at run-time. Note
that you still have some ability to add/remove plugins (see below),
but there are limitations to what can be done.
To be clear: Open MPI's plugins can be built either as standalone DSOs
or included in Open MPI's main libraries (e.g., libmpi).
Additionally, Open MPI's main libraries can be built either as static
or shared libraries.
You can therefore choose to build Open MPI in one of several different
ways:
- --disable-mca-dso: Using the
--disable-mca-dso switch to Open
MPI's configure script will cause all plugins to be built as part of
Open MPI's main libraries -- they will not be built as standalone
DSOs. However, Open MPI will still look for DSOs in the filesystem at
run-time. Specifically: this option significantly decreases (but
does not eliminate) filesystem traffic during MPI_INIT, but does allow
the flexibility of adding new plugins to an existing Open MPI
installation.
Note that the --disable-mca-dso option does not affect whether Open
MPI's main libraries are built as static or shared.
- --enable-static: Using this option to Open MPI's
configure
script will cause the building of static libraries (e.g., libmpi.a).
This option automatically implies --disable-mca-dso.
Note that --enable-shared is also the default; so if you use
--enable-static, Open MPI will build both static and shared
libraries that contain all of Open MPI's plugins (i.e., libmpi.so and
libmpi.a). If you want only static libraries (that contain all of
Open MPI's plugins), be sure to also use --disable-shared.
- --disable-dlopen: Using this option to Open MPI's
configure
script will do two things:
- Imply
--disable-mca-dso, meaning that all plugins will be
slurped into Open MPI's libraries.
- Cause Open MPI to not look for / open any DSOs at run time.
Specifically: this option makes Open MPI not incur any additional
filesystem traffic during MPI_INIT. Note that the --disable-dlopen
option does not affect whether Open MPI's main libraries are built as
static or shared.
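One way to see which mode an existing installation was built in is to count the standalone plugin files. The sketch below assumes the default layout, where DSO plugins are installed in PREFIX/lib/openmpi with names of the form mca_<framework>_<component>.so; the function name is made up for this example.

```shell
#!/bin/sh
# count_plugin_dsos PREFIX: print how many component DSOs are installed
# under PREFIX. Assumes the default layout (PREFIX/lib/openmpi/mca_*.so).
# A missing directory is reported as 0.
count_plugin_dsos() {
    find "$1/lib/openmpi" -name 'mca_*.so' 2>/dev/null | wc -l | tr -d ' '
}
```

Running "count_plugin_dsos /opt/openmpi" on a default build should report a large number; a result of 0 after building with --disable-mca-dso or --disable-dlopen suggests that the plugins were rolled into the main libraries.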
| 59. How do I build an optimized version of Open MPI? |
If you have obtained a developer's checkout from Subversion
(or Mercurial), you must consult these
directions.
Building Open MPI from a tarball defaults to building an optimized
version. There is no need to do anything special.
| 60. Are VPATH and/or parallel builds supported? |
Yes, both VPATH and parallel builds are supported. This
allows Open MPI to be built in a different directory than where its
source code resides (helpful for multi-architecture builds). Open MPI
uses Automake for its build system.
For example:
shell$ gtar zxf openmpi-1.2.3.tar.gz
shell$ cd openmpi-1.2.3
shell$ mkdir build
shell$ cd build
shell$ ../configure ...
<... lots of output ...>
shell$ make -j 4
|
Running configure from a different directory from where it actually
resides triggers the VPATH build (i.e., it will configure and build
itself from the directory where configure was run, not from the
directory where configure resides).
Some versions of make support parallel builds. The example above
shows GNU make's "-j" option, which specifies how many compile
processes may be executing at any given time. We, the Open MPI Team,
have found that setting it to two or four times the number of processors
in a machine can significantly speed up an Open MPI compile (since
compiles tend to be much more IO bound than CPU bound).
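The "-j" value can be derived from the processor count instead of being hard-coded. The sketch below uses "getconf _NPROCESSORS_ONLN", which is widely available but not universal, so it falls back to 1 when the query fails; the variable names are illustrative.

```shell
#!/bin/sh
# Ask the system how many processors are online; fall back to 1 if the
# query is not supported on this platform.
ncpu=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || ncpu=1

# Guard against an empty or non-numeric answer.
case $ncpu in
    ''|*[!0-9]*) ncpu=1 ;;
esac

# Use twice the processor count as the number of parallel compile jobs.
njobs=$((ncpu * 2))

# Print the command instead of running it, so this is safe to try anywhere.
echo "make -j $njobs"
```

Replace the final echo with the make invocation itself once the value looks reasonable for your machine.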
| 61. Do I need any special tools to build Open MPI? |
If you are building Open MPI from a tarball, you need a C
compiler, a C++ compiler, and make. If you are building the Fortran
77 and/or Fortran 90 MPI bindings, you will need compilers for these
languages as well. You do not need any special version of the GNU
"Auto" tools (Autoconf, Automake, Libtool).
If you are building Open MPI from a Subversion checkout, you need some
additional tools. See the
Subversion access pages for more information.
| 62. How do I build Open MPI as a static library? |
As noted above, Open MPI defaults to building shared libraries
and building components as dynamic shared objects (DSOs, i.e.,
run-time plugins). Changing this build behavior is controlled via
command line options to Open MPI's configure script.
Building static libraries: You can disable building shared libraries
and enable building static libraries with the following options:
shell$ ./configure --enable-static --disable-shared ...
|
Similarly, you can build both static and shared libraries by simply
specifying --enable-static (and not specifying
--disable-shared), if desired.
Including components in libraries: Instead of building components as
DSOs, they can also be "rolled up" and included in their respective
libraries (e.g., libmpi). This is controlled with the
--enable-mca-static option. Some examples:
shell$ ./configure --enable-mca-static=pml ...
shell$ ./configure --enable-mca-static=pml,btl-openib,btl-self ...
|
Specifically, entire frameworks and/or individual components can be
specified to be rolled up into the library in a comma-separated list
as an argument to --enable-mca-static.
| 63. When I run 'make', it looks very much like the build system is going into a loop. |
Open MPI uses the GNU Automake software to build itself.
Automake uses a tightly-woven set of file timestamp-based
dependencies to compile and link software. This behavior, frequently
paired with messages similar to:
Warning: File `Makefile.am' has modification time 3.6e+04 s in the future
|
typically means that you are building on a networked filesystem where
the local time of the client machine that you are building on does not
match the time on the network filesystem server. This will result in
files with incorrect timestamps, and Automake degenerates into undefined
behavior.
Two solutions are possible:
- Ensure that the time between your network filesystem server and
client(s) is the same. This can be accomplished in a variety of ways
and is dependent upon your local setup; one method is to use an NTP
daemon to synchronize all machines to a common time server.
- Build on a local disk filesystem where network timestamps are not
a factor.
After implementing one of the two options, you will likely need to
re-run configure. Then Open MPI should build successfully.
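To check whether a source tree is affected, you can look for files whose timestamps are ahead of the local clock. The following sketch compares every file against a freshly created temporary file; it assumes that mktemp places that file on a local filesystem (typically /tmp), and the function name is made up for this example.

```shell
#!/bin/sh
# future_files DIR: list files under DIR whose modification time is later
# than "now" as seen by the local machine.
future_files() {
    # A freshly created temporary file serves as the "now" reference.
    ref=$(mktemp) || return 1
    find "$1" -type f -newer "$ref"
    rm -f "$ref"
}
```

Running "future_files ." in the top of the build tree should print nothing on a healthy system; any file it lists was stamped by a machine whose clock is ahead of the local one.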
| 64. Configure issues warnings about sed and unterminated
commands |
Some users have reported seeing warnings like this in the
final output from configure:
*** Final output
configure: creating ./config.status
config.status: creating ompi/include/ompi/version.h
sed: file ./confstatA1BhUF/subs-3.sed line 33: unterminated `s' command
sed: file ./confstatA1BhUF/subs-4.sed line 4: unterminated `s' command
config.status: creating orte/include/orte/version.h
|
These messages usually indicate a problem in the user's local shell
configuration. Ensure that when you run a new shell, no output is
sent to stdout. For example, if the output of this simple shell
script is more than just the hostname of your computer, you need to go
check your shell startup files to see where the extraneous output is
coming from (and eliminate it):
#!/bin/sh
hostname
exit 0
|
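A slightly more direct check is to count the bytes that a new shell prints when asked to do nothing. This is only a sketch: the function name is made up, and which startup files a non-interactive shell reads depends on the shell (csh and tcsh read .cshrc, for example, while bash normally skips .bashrc in this mode), so pass the shell you actually log in with.

```shell
#!/bin/sh
# startup_output_bytes [SHELL]: start SHELL non-interactively, run a
# command that prints nothing, and report how many bytes still reached
# stdout. Anything other than 0 points at the shell startup files.
startup_output_bytes() {
    "${1:-/bin/sh}" -c 'exit 0' 2>/dev/null | wc -c | tr -d ' '
}

startup_output_bytes /bin/sh
```

If the number printed is not 0, look through the startup files for echo statements or similar commands that run unconditionally.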
| 65. Open MPI configured ok, but I get "Makefile:602: *** missing separator" kinds of errors when building |
This is usually an indication that configure succeeded but
really shouldn't have. See this FAQ
entry for one possible cause.
| 66. Open MPI seems to default to building with the GNU compiler set. Can I use other compilers? |
Yes.
Open MPI uses a standard Autoconf "configure" script to probe the
current system and figure out how to build itself. One of the choices
it makes is which compiler set to use. Since Autoconf is a GNU
product, it defaults to the GNU compiler set. However, this is easily
overridden on the configure command line. For example, to build
Open MPI with the Intel compiler suite:
shell$ ./configure CC=icc CXX=icpc F77=ifort FC=ifort ...
|
Note that you can include additional parameters to configure,
implied by the "..." clause in the example above.
In particular, 4 switches on the configure command line are used to
specify the compiler suite:
- CC: Specifies the C compiler
- CXX: Specifies the C++ compiler
- F77: Specifies the Fortran 77 compiler
- FC: Specifies the Fortran 90 compiler
NOTE: The Open MPI team recommends using a
single compiler suite whenever possible. Unexpected or undefined
behavior can occur when you mix compiler suites in unsupported ways
(e.g., mixing Fortran 77 and Fortran 90 compilers between different
compiler suites is almost guaranteed not to work).
Here are some more examples for common compilers:
# Portland compilers
shell$ ./configure CC=pgcc CXX=pgCC F77=pgf77 FC=pgf90
# Pathscale compilers
shell$ ./configure CC=pathcc CXX=pathCC F77=pathf90 FC=pathf90
# Oracle Solaris Studio (Sun) compilers
shell$ ./configure CC=cc CXX=CC F77=f77 FC=f90
|
In all cases, the compilers must be found in your PATH and be able to
successfully compile and link non-MPI applications before Open MPI
will be able to be built properly.
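A quick pre-flight check can confirm that the chosen compilers are in your PATH before configure is run. The function below is a sketch with a made-up name; it verifies only that each command can be found, not that it can compile and link a program.

```shell
#!/bin/sh
# check_compilers NAME...: report whether each named compiler can be found
# in the PATH. Returns non-zero if any of them is missing.
check_compilers() {
    status=0
    for compiler in "$@"; do
        if command -v "$compiler" >/dev/null 2>&1; then
            echo "found:   $compiler"
        else
            echo "missing: $compiler"
            status=1
        fi
    done
    return "$status"
}
```

For the Intel example above, "check_compilers icc icpc ifort" reports each of the three compilers and fails if any one of them is not found.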
| 67. Can I pass specific flags to the compilers / linker used to build Open MPI? |
Yes.
Open MPI uses a standard Autoconf configure script to set itself up
for building. As such, there are a number of command line options
that can be passed to configure to customize flags that are passed
to the underlying compiler to build Open MPI:
- CFLAGS: Flags passed to the C compiler.
- CXXFLAGS: Flags passed to the C++ compiler.
- FFLAGS: Flags passed to the Fortran 77 compiler.
- FCFLAGS: Flags passed to the Fortran 90 compiler.
- LDFLAGS: Flags passed to the linker (not language-specific).
This flag is rarely required; Open MPI will usually pick up all
LDFLAGS that it needs by itself.
- LIBS: Extra libraries to link to Open MPI (not
language-specific). This flag is rarely required; Open MPI will
usually pick up all LIBS that it needs by itself.
- LD_LIBRARY_PATH: Note that we do not recommend setting
LD_LIBRARY_PATH via configure, but it is worth noting that you
should ensure that your LD_LIBRARY_PATH value is appropriate for
your build. Some users have been tripped up, for example, by
specifying a non-default Fortran compiler to FC and F77, but then
having Open MPI's configure script fail because the LD_LIBRARY_PATH
wasn't set properly to point to that Fortran compiler's support
libraries.
Note that the flags you specify must be compatible across all the
compilers. In particular, flags specified to one language compiler
must generate code that can be compiled and linked against code that
is generated by the other language compilers. For example, on a 64
bit system where the compiler default is to build 32 bit executables:
# Assuming the GNU compiler suite
shell$ ./configure CFLAGS=-m64 ...
|
will produce 64 bit C objects, but 32 bit objects for C++, Fortran 77,
and Fortran 90. These codes will be incompatible with each other, and
Open MPI will not build successfully. Instead, you must specify to build
64 bit objects for all languages:
# Assuming the GNU compiler suite
shell$ ./configure CFLAGS=-m64 CXXFLAGS=-m64 FFLAGS=-m64 FCFLAGS=-m64 ...
|
The above command line will pass "-m64" to all four compilers, and
therefore will produce 64 bit objects for all languages.
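To avoid forgetting one of the four variables, the common flag can be expanded from a single value. This is a small sketch with a made-up function name; it handles only a flag without embedded spaces.

```shell
#!/bin/sh
# uniform_flags FLAG: print configure arguments that pass the same FLAG
# to the C, C++, Fortran 77, and Fortran 90 compilers. FLAG must not
# contain spaces in this simple version.
uniform_flags() {
    printf '%s\n' "CFLAGS=$1 CXXFLAGS=$1 FFLAGS=$1 FCFLAGS=$1"
}

# Print the resulting command line rather than running it.
echo "./configure $(uniform_flags -m64) ..."
```

The printed line matches the four-variable configure command shown above, with the flag written only once in the script.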
| 68. I'm trying to build with the Intel compilers, but Open MPI
eventually fails to compile with really long error messages. What do
I do? |
A common mistake when building Open MPI with the Intel
compiler suite is to accidentally specify the Intel C compiler as the
C++ compiler. Specifically, recent versions of the Intel compiler
renamed the C++ compiler "icpc" (it used to be "icc", the same
as the C compiler). Users accustomed to the old name tend to specify
"icc" as the C++ compiler, which will then cause a failure late in
the Open MPI build process because a C++ code will be compiled with
the C compiler. Bad Things then happen.
The solution is to be sure to specify that the C++ compiler is
"icpc", not "icc". For example:
shell$ ./configure CC=icc CXX=icpc F77=ifort FC=ifort ...
|
For Googling purposes, here are some of the error messages that may be
issued when compiling the Open MPI C++ codes with the Intel C compiler
(icc), in no particular order:
IPO Error: unresolved : _ZNSsD1Ev
IPO Error: unresolved : _ZdlPv
IPO Error: unresolved : _ZNKSs4sizeEv
components.o(.text+0x17): In function `ompi_info::open_components()':
: undefined reference to `std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string()'
components.o(.text+0x64): In function `ompi_info::open_components()':
: undefined reference to `std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string()'
components.o(.text+0x70): In function `ompi_info::open_components()':
: undefined reference to `std::string::size() const'
components.o(.text+0x7d): In function `ompi_info::open_components()':
: undefined reference to `std::string::reserve(unsigned int)'
components.o(.text+0x8d): In function `ompi_info::open_components()':
: undefined reference to `std::string::append(char const*, unsigned int)'
components.o(.text+0x9a): In function `ompi_info::open_components()':
: undefined reference to `std::string::append(std::string const&)'
components.o(.text+0xaa): In function `ompi_info::open_components()':
: undefined reference to `std::string::operator=(std::string const&)'
components.o(.text+0xb3): In function `ompi_info::open_components()':
: undefined reference to `std::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string()'
|
There are many more error messages, but the above should be sufficient
for someone trying to find this FAQ entry via a web crawler.
| 69. When I build with the Intel compiler suite, linking user MPI
applications with the wrapper compilers results in warning messages.
What do I do? |
When Open MPI was built with some versions of the Intel
compilers on some platforms, you may see warnings similar to the
following when compiling MPI applications with Open MPI's wrapper
compilers:
shell$ mpicc hello.c -o hello
libimf.so: warning: warning: feupdateenv is not implemented and will always fail
shell$
|
This warning is generally harmless, but it can be alarming to some
users. To remove this warning, pass either the -shared-intel or
-i-dynamic options when linking your MPI application (the specific
option depends on your version of the Intel compilers; consult your
local documentation):
shell$ mpicc hello.c -o hello -shared-intel
shell$
|
You can also change the
default behavior of Open MPI's wrapper compilers to automatically
include this -shared-intel flag so that it is unnecessary to specify it
on the command line when linking MPI applications.
| 70. I'm trying to build with the IBM compilers, but Open MPI
eventually fails to compile. What do I do? |
Unfortunately there are some problems between Libtool (which
Open MPI uses for library support) and the IBM compilers when creating
shared libraries. Currently the only workaround is to disable shared
libraries and build Open MPI statically. For example:
shell$ ./configure CC=xlc CXX=xlc++ F77=xlf FC=xlf90 --disable-shared --enable-static ...
|
For Googling purposes, here are some error messages that may be
issued when the build fails:
xlc: 1501-216 command option --whole-archive is not recognized - passed to ld
xlc: 1501-216 command option --no-whole-archive is not recognized - passed to ld
xlc: 1501-218 file libopen-pal.so.0 contains an incorrect file suffix
xlc: 1501-228 input file libopen-pal.so.0 not found
|
| 71. I'm trying to build with the Oracle Solaris Studio (Sun) compilers on Linux, but Open MPI
eventually fails to compile. What do I do? |
Below are some known issues that impact Oracle Solaris Studio 12 Open MPI
builds. The easiest way to work around them is simply to use the
Oracle Solaris Studio Express compilers, which are
essentially a development branch of the Oracle Solaris Studio 12 compilers with fixes for the issues below.
- Oracle Solaris Studio defects:
- Open MPI defects:
- #747 bool problem with Sun compilers on Linux
- #875
mpiCC wrapper compiler problem with Sun CC, f77, f90 on Linux
- #916 gnu
ld versions < 2.17 do not support /dev/null being passed in
For the mpiCC, mpif90, or mpif77 wrapper compilers to function, #875 requires that -Wl,--export-dynamic be
removed from the following three files:
-
share/openmpi/mpic++-wrapper-data.txt
-
share/openmpi/mpif77-wrapper-data.txt
-
share/openmpi/mpif90-wrapper-data.txt
For older versions of OMPI with more recent Studio Fortran releases, #2632
describes a workaround to the configure file to handle improper autogen support for the Studio Fortran
compiler.
| 72. What configure options should I use when building with the Oracle Solaris Studio (Sun) compilers? |
The below configure options are suggested for use with the Oracle Solaris Studio (Sun) compilers:
--enable-heterogeneous
--enable-cxx-exceptions
--enable-shared
--enable-orterun-prefix-by-default
--enable-mpi-f90
--with-mpi-f90-size=small
--disable-mpi-threads
--disable-progress-threads
--disable-debug
|
Linux only:
--with-openib
--without-udapl
--disable-openib-ibcm (only in v1.5.4 and earlier)
|
Solaris x86 only:
CFLAGS="-xtarget=generic -xarch=sse2 -xprefetch -xprefetch_level=2 -xvector=simd -xdepend=yes -xbuiltin=%all -xO5"
FFLAGS="-xtarget=generic -xarch=sse2 -xprefetch -xprefetch_level=2 -xvector=simd -stackvar -xO5"
|
Solaris SPARC only:
CFLAGS="-xtarget=ultra3 -m32 -xarch=sparcvis2 -xprefetch -xprefetch_level=2 -xvector=lib -xdepend=yes -xbuiltin=%all -xO5"
FFLAGS="-xtarget=ultra3 -m32 -xarch=sparcvis2 -xprefetch -xprefetch_level=2 -xvector=lib -stackvar -xO5"
|
| 73. When building with the Oracle Solaris Studio 12 Update 1 (Sun) compilers on x86 Linux, the compiler loops on btl_sm.c. Is there a workaround? |
Apply Sun patch 141859-04.
You may also consider updating your Oracle Solaris Studio compilers to
the latest Oracle Solaris Studio Express.
| 74. How do I build OpenMPI on IBM QS22 cell blade machines with GCC and XLC/XLF compilers? |
You can use the following two scripts (contributed by IBM) to build Open MPI on QS22.
Script to build OpenMPI using the GCC compiler
#!/bin/bash
export PREFIX=/usr/local/openmpi-1.2.7_gcc
./configure \
CC=ppu-gcc CPP=ppu-cpp CXX=ppu-c++ CFLAGS=-m64 \
CXXFLAGS=-m64 FC=ppu-gfortran FCFLAGS=-m64 \
FFLAGS=-m64 CCASFLAGS=-m64 LDFLAGS=-m64 \
--prefix=$PREFIX \
--with-platform=optimized \
--disable-mpi-profile \
--with-openib=/usr \
--enable-ltdl-convenience \
--with-wrapper-cflags=-m64 \
--with-wrapper-ldflags=-m64 \
--with-wrapper-fflags=-m64 \
--with-wrapper-fcflags=-m64
make
make install
cat <<EOF >> $PREFIX/etc/openmpi-mca-params.conf
mpi_paffinity_alone = 1
mpi_leave_pinned = 1
btl_openib_want_fork_support = 0
EOF
cp config.status $PREFIX/config.status
|
Script to build OpenMPI using XLC and XLF compilers
#!/bin/bash
#
export PREFIX=/usr/local/openmpi-1.2.7_xl
./configure --prefix=$PREFIX \
--with-platform=optimized \
--disable-shared --enable-static \
CC=ppuxlc CXX=ppuxlc++ F77=ppuxlf FC=ppuxlf90 LD=ppuld \
--disable-mpi-profile \
--disable-heterogeneous \
--with-openib=/usr \
CFLAGS="-q64 -O3 -qarch=cellppu -qtune=cellppu" \
CXXFLAGS="-q64 -O3 -qarch=cellppu -qtune=cellppu" \
FFLAGS="-q64 -O3 -qarch=cellppu -qtune=cellppu" \
FCFLAGS="-q64 -O3 -qarch=cellppu -qtune=cellppu" \
CCASFLAGS="-q64 -O3 -qarch=cellppu -qtune=cellppu" \
LDFLAGS="-q64 -O3 -qarch=cellppu -qtune=cellppu" \
--enable-ltdl-convenience \
--with-wrapper-cflags="-q64 -O3 -qarch=cellppu -qtune=cellppu" \
--with-wrapper-ldflags="-q64 -O3 -qarch=cellppu -qtune=cellppu" \
--with-wrapper-fflags="-q64 -O3 -qarch=cellppu -qtune=cellppu" \
--with-wrapper-fcflags="-q64 -O3 -qarch=cellppu -qtune=cellppu" \
--enable-contrib-no-build=libnbc,vt
make
make install
cat <<EOF >> $PREFIX/etc/openmpi-mca-params.conf
mpi_paffinity_alone = 1
mpi_leave_pinned = 1
btl_openib_want_fork_support = 0
EOF
cp config.status $PREFIX/config.status
|
| 75. I'm trying to build with the PathScale 3.0 and 3.1 compilers on Linux, but all Open MPI commands seg fault. What do I do? |
The PathScale compiler authors have identified a bug in the
v3.0 and v3.1 versions of their compiler; you must disable certain
"builtin" functions when building Open MPI:
- With PathScale 3.0 and 3.1 compilers use the workaround options
-O2 and -fno-builtin in CFLAGS across the Open MPI build. For
example:
shell$ ./configure CFLAGS="-O2 -fno-builtin" ...
|
- With PathScale 3.2 beta and later, no workaround options are
required.
| 76. All MPI C++ API functions return errors (or otherwise fail)
when Open MPI is compiled with the PathScale compilers. What do I do? |
This is an old issue that seems to be a problem when
PathScale uses a back-end GCC 3.x compiler. Here's a proposed
solution from the PathScale support team (from July 2010):
The proposed work-around is to install gcc-4.x on the system and use
the pathCC -gnu4 option. Newer versions of the compiler (4.x and
beyond) should have this fixed, but we'll have to test to confirm it's
actually fixed and working correctly.
We don't anticipate that this will be much of a problem for Open MPI
users these days (our informal testing shows that not many users are
still using GCC 3.x), but this information is provided so that it is
Google-able for those still using older compilers.
| 77. How do I build Open MPI with support for Open IB (Infiniband),
mVAPI (Infiniband), GM (Myrinet), and/or MX (Myrinet)? |
To build support for high-speed interconnect networks, you
generally only have to specify the directory where its support header
files and libraries were installed to Open MPI's configure script.
You can specify where multiple packages were installed if you have
support for more than one kind of interconnect -- Open MPI will build
support for as many as it can.
You tell configure where support libraries are with the appropriate
--with command line switch. Here is the list of available switches:
- --with-openib=<dir>: Build support for OpenFabrics
(previously known as "Open IB", for Infiniband and iWARP networks --
note that iWARP support was added in the v1.3 series).
- --with-mvapi=<dir>: Build support for mVAPI (Infiniband
-- note that mVAPI support has been removed in the v1.3 series).
- --with-gm=<dir>: Build support for GM (Myrinet).
- --with-mx=<dir>: Build support for MX (Myrinet).
For example:
shell$ ./configure --with-mvapi=/path/to/mvapi/installation \
--with-gm=/path/to/gm/installation
|
These switches enable Open MPI's configure script to automatically
find all the right header files and libraries to support the various
networks that you specified.
You can verify that configure found everything properly by examining
its output -- it will test for each network's header files and
libraries and report whether it will build support (or not) for each
of them. Examining configure's output is the first place you
should look if you have a problem with Open MPI not correctly
supporting a specific network type.
If configure indicates that support for your networks will be
included, after you build and install Open MPI, you can run the
"ompi_info" command and look for components for your networks.
The v1.2 (and earlier) series has two openib components (your exact
version numbers may be different):
shell$ ompi_info | grep openib
MCA mpool: openib (MCA v1.0, API v1.0, Component v1.0)
MCA btl: openib (MCA v1.0, API v1.0, Component v1.0)
|
mVAPI components will be named "mvapi", GM components will be
named "gm", and MX components will be named "mx".
Note that the v1.3 series removed the "openib" mpool component and
also removed all support for mVAPI.
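A plain grep can match unrelated lines (for example, "tm" is a substring of many words), so the check can be wrapped in a small shell function. The following is only a sketch: has_component is a hypothetical helper name, not something shipped with Open MPI, and it assumes the "MCA <framework>: <component> (" line layout shown in the ompi_info output above.

```shell
# Hypothetical helper (not shipped with Open MPI): succeed if the ompi_info
# output on stdin lists a component with the given name in any framework.
has_component() {
    # Assumes lines shaped like "MCA btl: openib (MCA v1.0, ...)".
    grep -Eq "MCA [a-z0-9_]+: $1 \("
}

# Only query a real installation when ompi_info is actually on the PATH.
if command -v ompi_info >/dev/null 2>&1; then
    for name in openib gm mx; do
        if ompi_info | has_component "$name"; then
            echo "$name: component found"
        else
            echo "$name: component not found"
        fi
    done
fi
```

Because the pattern anchors on the component name followed by " (", it should not be fooled by other lines that merely contain the same letters.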
| 78. How do I build Open MPI with support for SLURM / XGrid? |
SLURM support is built automatically; there is nothing that
you need to do.
XGrid support is built automatically if the XGrid tools are installed.
| 79. How do I build Open MPI with support for SGE? |
Support for SGE first appeared in the Open MPI v1.2 series.
The method for configuring it is slightly different between Open MPI
v1.2 and v1.3.
For Open MPI v1.2, no extra configure arguments are needed as SGE
support is built in automatically. After Open MPI is installed, you
should see two components named gridengine.
shell$ ompi_info | grep gridengine
MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.5)
MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.5)
|
For Open MPI v1.3, you need to explicitly request the SGE support with
the "--with-sge" command line switch to the Open MPI configure
script. For example:
shell$ ./configure --with-sge
|
After Open MPI is installed, you should see one component named
gridengine.
shell$ ompi_info | grep gridengine
MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)
|
Open MPI v1.3 only has the one specific gridengine component as the
other functionality was rolled into other components.
Component versions may vary depending on the version of Open MPI 1.2 or
1.3 you are using.
| 80. How do I build Open MPI with support for PBS Pro / Open PBS / Torque? |
Support for PBS Pro, Open PBS, and Torque must be explicitly requested
with the "--with-tm" command line switch to Open MPI's configure
script. In general, the procedure is the same as building support for high-speed interconnect
networks, except that you use --with-tm. For example:
shell$ ./configure --with-tm=/path/to/pbs_or_torque/installation
|
After Open MPI is installed, you should see two components named
"tm":
shell$ ompi_info | grep tm
MCA pls: tm (MCA v1.0, API v1.0, Component v1.0)
MCA ras: tm (MCA v1.0, API v1.0, Component v1.0)
|
Specific frameworks and version numbers may vary, depending on your
version of Open MPI.
NOTE: Update to the note below
(May 2006), Torque 2.1.0p0 now includes support for shared libraries
and the workarounds listed below are no longer necessary. However,
this version of Torque changed other things that require upgrading
Open MPI to 1.0.3 or higher (as of this writing, v1.0.3 has not yet
been released -- nightly snapshot tarballs of what will become 1.0.3
are available at http://www.open-mpi.org/nightly/v1.0/).
NOTE: As of this writing
(October 2006), Open PBS and PBS Pro do not ship shared libraries (i.e., they only include
static libraries). Because of this, you may run into linking errors
when Open MPI tries to create dynamic plugin components for TM support
on some platforms. Notably, on at least some 64 bit Linux platforms
(e.g., AMD64), trying to create a dynamic plugin that links against a
static library will result in error messages such as:
relocation R_X86_64_32S against `a local symbol' can not be used when
making a shared object; recompile with -fPIC
|
Note that recent versions of Torque (as of October 2006) have started
shipping shared libraries and this issue does not occur.
There are two possible solutions in Open MPI 1.0.x:
- Recompile your PBS implementation with "-fPIC" (or whatever
the relevant flag is for your compiler to generate
position-independent code) and re-install. This will allow Open MPI
to generate dynamic plugins with the PBS/Torque libraries properly.
PRO: Open MPI enjoys the benefits of shared libraries and dynamic
plugins.
CON: Dynamic plugins can use more memory at run-time (e.g.,
operating systems tend to align each plugin on a page, rather than
densely packing them all into a single library).
CON: This is not possible for binary-only vendor distributions
(such as PBS Pro).
- Configure Open MPI to build a static library that includes all of
its components. Specifically, all of Open MPI's components will be
included in its libraries -- none will be discovered and opened at
run-time. This does not affect user MPI code at all (i.e., the
location of Open MPI's plugins is transparent to MPI applications).
Use the following options to Open MPI's
configure script:
shell$ ./configure --disable-shared --enable-static ...
|
Note that this option only changes the location of Open MPI's default
set of plugins (i.e., they are included in libmpi and friends
rather than being standalone dynamic shared objects that are
found/opened at run-time). This option does not change the fact
that Open MPI will still try to open other dynamic plugins at
run-time.
PRO: This works with binary-only vendor distributions (e.g., PBS
Pro).
CON: User applications are statically linked to Open MPI; if Open
MPI -- or any of its default set of components -- is updated, users
will need to re-link their MPI applications.
Both methods work equally well, but there are tradeoffs; each site
will likely need to make its own determination of which to use.
| 81. How do I build Open MPI with support for LoadLeveler? |
Support for LoadLeveler will be automatically built if the LoadLeveler
libraries and headers are in the default path. If not, support
must be explicitly requested with the "--with-loadleveler" command
line switch to Open MPI's configure script. In general, the procedure
is the same as building support for high-speed
interconnect networks, except that you use --with-loadleveler.
For example:
shell$ ./configure --with-loadleveler=/path/to/LoadLeveler/installation
|
After Open MPI is installed, you should see one or more components
named "loadleveler":
shell$ ompi_info | grep loadleveler
MCA ras: loadleveler (MCA v1.0, API v1.3, Component v1.3)
|
Specific frameworks and version numbers may vary, depending on your
version of Open MPI.
| 82. How do I build Open MPI with support for Platform LSF? |
Note that only Platform LSF 7.0.2 and later is supported.
Support for LSF will be automatically built if the LSF libraries and
headers are in the default path. If not, support must be explicitly
requested with the "--with-lsf" command line switch to Open MPI's
configure script. In general, the procedure is the same as building support for high-speed interconnect
networks, except that you use --with-lsf. For example:
shell$ ./configure --with-lsf=/path/to/lsf/installation
|
After Open MPI is installed, you should see a component named
"lsf":
shell$ ompi_info | grep lsf
MCA ess: lsf (MCA v2.0, API v1.3, Component v1.3)
MCA ras: lsf (MCA v2.0, API v1.3, Component v1.3)
MCA plm: lsf (MCA v2.0, API v1.3, Component v1.3)
|
Specific frameworks and version numbers may vary, depending on your
version of Open MPI.
| 83. How do I build Open MPI with processor affinity support? |
Open MPI currently only supports processor affinity for some
platforms. In general, processor affinity will automatically be built
if it is supported -- no additional command line flags to configure
should be necessary.
See this FAQ entry for
more details.
| 84. How do I build Open MPI with memory affinity / NUMA support (e.g., libnuma)? |
Open MPI currently only supports libnuma memory
affinity for Linux-based systems (please let us know if there are
other NUMA libraries that you need supported!).
Support for libnuma must be explicitly requested with the
"--with-libnuma" command line switch to Open MPI's configure
script. In general, the procedure is the same as building support for high-speed interconnect
networks, except that you use --with-libnuma. For example:
shell$ ./configure --with-libnuma=/path/to/libnuma/installation
|
After Open MPI is installed, you should see an maffinity component
named "libnuma":
shell$ ompi_info | grep libnuma
MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.0)
|
Specific frameworks and version numbers may vary, depending on your
version of Open MPI.
See this FAQ entry for
more details.
| 85. How do I build Open MPI with support for sending CUDA device memory? |
This feature currently only exists in the trunk version of the
Open MPI library. The support for CUDA device pointers needs to be
explicitly configured into the library. Here is the pertinent
information from the configure --help command.
--with-cuda(=DIR) Build cuda support, optionally adding DIR/include,
DIR/lib, and DIR/lib64
--with-cuda-libdir=DIR Search for cuda libraries in DIR
|
Here are some examples of configure commands that enable CUDA support.
1. Searches in default locations. Looks for cuda.h in
/usr/local/cuda/include and libcuda.so in /usr/lib64.
./configure --with-cuda
|
2. Searches for cuda.h in /usr/local/cuda-v4.0/cuda/include and libcuda.so in
the default location of /usr/lib64.
./configure --with-cuda=/usr/local/cuda-v4.0/cuda
|
3. Searches for cuda.h in /usr/local/cuda-v4.0/cuda/include and
libcuda.so in /usr/lib64. (same as previous one)
./configure --with-cuda=/usr/local/cuda-v4.0/cuda --with-cuda-libdir=/usr/lib64
|
If the cuda.h or libcuda.so files cannot be found, then the configure
will abort.
See this FAQ entry
for details on how to use the CUDA support.
| 86. How do I not build a specific plugin / component for Open MPI? |
The --enable-mca-no-build option to Open MPI's configure
script enables you to specify a list of components that you want to
skip building. This allows you to leave out support for specific
features in Open MPI if you do not want them.
It takes a single argument: a comma-delimited list of
framework-component pairs indicating which specific components you do
not want to build. For example:
shell$ ./configure --enable-mca-no-build=paffinity-linux,timer-solaris
|
Note that this option is really only useful for components that would
otherwise be built. For example, if you are on a machine without
Myrinet support, it is not necessary to specify:
shell$ ./configure --enable-mca-no-build=btl-gm
|
because the configure script will naturally see that you do not have
support for GM and will automatically skip the gm BTL component.
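When the list of components to skip is assembled by a script, the comma-delimited argument can be built mechanically. This is a minimal sketch; mca_no_build_arg is a hypothetical helper name, and the component names are simply the ones used in the example above.

```shell
# Hypothetical helper: join framework-component names into one
# --enable-mca-no-build argument for configure.
mca_no_build_arg() {
    local IFS=,          # "$*" joins the arguments with the first IFS character
    printf '%s%s\n' '--enable-mca-no-build=' "$*"
}

mca_no_build_arg paffinity-linux timer-solaris
# prints: --enable-mca-no-build=paffinity-linux,timer-solaris
```

The result can then be passed along, e.g. ./configure "$(mca_no_build_arg paffinity-linux timer-solaris)".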
| 87. What other options to [configure] exist? |
There are many options to Open MPI's configure script.
Please run the following to get a full list (including a short
description of each option):
shell$ ./configure --help
|
| 88. Why does compiling the Fortran 90 bindings take soooo long? |
NOTE: Starting with Open
MPI v1.7, if you are not using gfortran, building the Fortran 90 and
08 bindings does not suffer the same performance penalty that previous
versions incurred. The Open MPI developers encourage all users to
upgrade to the new Fortran bindings implementation -- including the
new MPI-3 Fortran'08 bindings -- when possible.
This is actually a design problem with the MPI F90 bindings
themselves. The issue is that since F90 is a strongly typed language,
we have to overload each function that takes a choice buffer with a
typed buffer. For example, MPI_SEND has many different overloaded
versions -- one for each type of the user buffer. Specifically, there
is an MPI_SEND that has the following types for the first argument:
- logical*1, logical*2, logical*4, logical*8, logical*16 (if
supported)
- integer*1, integer*2, integer*4, integer*8, integer*16 (if
supported)
- real*4, real*8, real*16 (if supported)
- complex*8, complex*16, complex*32 (if supported)
- character
On the surface, this is 17 bindings for MPI_SEND. Multiply this by
every MPI function that takes a choice buffer (50) and you get 850
overloaded functions. However, the problem gets worse -- for each
type, we also have to overload for each array dimension that needs to
be supported. Fortran allows up to 7 dimensional arrays, so this
becomes (17x7) = 119 versions of every MPI function that has a choice
buffer argument. This makes (17x7x50) = 5,950 MPI interface
functions.
To make matters even worse, consider the ~25 MPI functions that take
2 choice buffers. Functions have to be provided for all possible
combinations of types. This then becomes exponential -- the total
number of interface functions balloons up to 6.8M.
Additionally, F90 modules must all have their functions in a single
source file. Hence, all 6.8M functions must be in one .f90 file and
compiled as a single unit (currently, no F90 compiler that we are
aware of can handle 6.8M interface functions in a single module).
To limit this problem, Open MPI, by default, does not generate
interface functions for any of the 2-buffer MPI functions.
Additionally, we limit the maximum number of supported dimensions to 4
(instead of 7). This means that we're generating (17x4x50) = 3,400
interface functions in a single F90 module. So it's far smaller than
6.8M functions, but it's still quite a lot.
This is what makes compiling the F90 module take so long.
Note, however, you can limit the maximum number of dimensions that
Open MPI will generate for the F90 bindings with the configure switch
--with-f90-max-array-dim=DIM, where DIM is an integer <= 7. The
default value is 4. Decreasing this value makes the compilation go
faster, but obviously supports fewer dimensions.
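To get a feel for how the dimension limit affects the size of the module, the arithmetic used above (types x dimensions x functions) can be reproduced in the shell. This sketch only covers the single-choice-buffer functions and reuses the figures quoted in this entry (17 types, 50 functions); f90_interface_count is a hypothetical name.

```shell
# Number of generated interfaces for single-choice-buffer functions:
# (buffer types) x (array dimensions) x (MPI functions).
f90_interface_count() {
    echo $(( $1 * $2 * $3 ))
}

f90_interface_count 17 7 50   # all 7 dimensions: 5950
f90_interface_count 17 4 50   # default limit of 4: 3400
f90_interface_count 17 2 50   # --with-f90-max-array-dim=2: 1700
```

Under this simple model, halving the maximum dimension halves the number of interfaces the compiler has to process.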
Other than this limit on dimension size, there is little else that we
can do -- the MPI-2 F90 bindings were unfortunately not well thought
out in this regard.
Note, however, that the Open MPI team has proposed Fortran '03
bindings for MPI in a paper that was presented at the Euro
PVM/MPI'05 conference. These bindings avoid all the scalability
problems that are described above and have some other nice properties.
This is something that is being worked on in Open MPI, but there is
currently no estimated timeframe on when it will be available.
| 89. Does Open MPI support MPI_REAL16 and MPI_COMPLEX32? |
It depends. Note that these datatypes are optional in the MPI
standard.
Prior to v1.3, Open MPI supported MPI_REAL16 and MPI_COMPLEX32 if
a portable C integer type could be found that was the same size
(measured in bytes) as Fortran's REAL*16 type. It was later
discovered that even though the sizes may be the same, the bit
representations between C and Fortran may be different. Since Open
MPI's reduction routines are implemented in C, calling MPI_REDUCE (and
related functions) with MPI_REAL16 or MPI_COMPLEX32 would generate
undefined results (although message passing with these types in
homogeneous environments generally worked fine).
As such, Open MPI v1.3 made the test for supporting MPI_REAL16 and
MPI_COMPLEX32 more stringent: Open MPI will support these types only
if:
- An integer C type can be found that has the same size (measured
in bytes) as the Fortran REAL*16 type.
- The bit representation is the same between the C type and the
Fortran type.
Version 1.3.0 only checks for portable C types (e.g., long double).
A future version of Open MPI may include support for compiler-specific
/ non-portable C types. For example, the Intel compiler has specific
options for creating a C type that is the same as REAL*16, but we did
not have time to include this support in Open MPI v1.3.0.
| 90. Can I re-locate my Open MPI installation without re-configuring/re-compiling/re-installing from source? |
Starting with Open MPI v1.2.1, yes.
Background: Open MPI hard-codes some directory paths in its
executables based on installation paths specified by the configure
script. For example, if you configure with an installation prefix of
/opt/openmpi/, Open MPI encodes in its executables that it should be
able to find its help files in /opt/openmpi/share/openmpi.
The "installdirs" functionality in Open MPI lets you change any of
these hard-coded directory paths at run time
(assuming that you have already adjusted your PATH
and/or LD_LIBRARY_PATH environment variables to the new location
where Open MPI now resides). There are three methods:
- Move an existing Open MPI installation to a new prefix: Set the
OPAL_PREFIX environment variable before launching Open MPI. For
example, if Open MPI had initially been installed to /opt/openmpi
and the entire openmpi tree was later moved to /home/openmpi,
setting OPAL_PREFIX to /home/openmpi will enable Open MPI to
function properly.
- "Stage" an Open MPI installation in a temporary location: When
creating self-contained installation packages, systems such as RPM
install Open MPI into temporary locations. The package system then
bundles up everything under the temporary location into a package that
can be installed into its real location later. For example, when
creating an RPM that will be installed to /opt/openmpi, the RPM
system will transparently prepend a "destination directory" (or
"destdir") to the installation directory. As such, Open MPI will
think that it is installed in /opt/openmpi, but it is actually
temporarily installed in (for example)
/var/rpm/build.1234/opt/openmpi. If it is necessary to use Open
MPI while it is installed in this staging area, the OPAL_DESTDIR
environment variable can be used; setting OPAL_DESTDIR to
/var/rpm/build.1234 will automatically prefix every directory such
that Open MPI can function properly.
- Overriding individual directories: Open MPI uses the
GNU-specified directories (per Autoconf/Automake), which can be
overridden by setting environment variables directly related to their
common names. The list of environment variables that can be used is:
- OPAL_PREFIX
- OPAL_EXEC_PREFIX
- OPAL_BINDIR
- OPAL_SBINDIR
- OPAL_LIBEXECDIR
- OPAL_DATAROOTDIR
- OPAL_DATADIR
- OPAL_SYSCONFDIR
- OPAL_SHAREDSTATEDIR
- OPAL_LOCALSTATEDIR
- OPAL_LIBDIR
- OPAL_INCLUDEDIR
- OPAL_INFODIR
- OPAL_MANDIR
- OPAL_PKGDATADIR
- OPAL_PKGLIBDIR
- OPAL_PKGINCLUDEDIR
Note that not all of the directories listed above are used by Open
MPI; they are listed here in entirety for completeness.
Also note that several directories listed above are defined in terms
of other directories. For example, the $bindir is defined by
default as $prefix/bin. Hence, overriding the $prefix (via
OPAL_PREFIX) will automatically change the first part of the
$bindir (which is how method 1 described above works).
Alternatively, OPAL_BINDIR can be set to an absolute value that
ignores $prefix altogether.
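As an illustration of method 1, the environment for a moved installation might be set up as follows. This is a sketch under assumptions: use_relocated_ompi is a hypothetical helper name, /home/openmpi is the example location used above, and the tree is assumed to keep the default $prefix/bin and $prefix/lib layout.

```shell
# Hypothetical helper: point the current shell at a relocated Open MPI tree.
# Assumes the default layout ($prefix/bin and $prefix/lib).
use_relocated_ompi() {
    export OPAL_PREFIX="$1"
    export PATH="$1/bin:$PATH"
    # Append the old value only if LD_LIBRARY_PATH was already non-empty.
    export LD_LIBRARY_PATH="$1/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}

use_relocated_ompi /home/openmpi
echo "OPAL_PREFIX=$OPAL_PREFIX"
```

If an individual directory lives somewhere else entirely, the matching variable from the list above (for example OPAL_BINDIR) can be exported in addition.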
| 91. I'm still having problems / my problem is not listed here. What do I do? |
Please see this FAQ
category for troubleshooting tips and the Getting Help page -- it details
how to send a request to the Open MPI mailing lists.
| 92. In general, how do I build MPI applications with Open MPI? |
The Open MPI team strongly recommends that you simply use
Open MPI's "wrapper" compilers to compile your MPI applications.
That is, instead of using (for example) gcc to compile your program,
use mpicc. Open MPI provides a wrapper compiler for four languages:
| Language | Wrapper compiler name |
| C | mpicc |
| C++ | mpiCC, mpicxx, or mpic++ (note that mpiCC will not exist on case-insensitive filesystems) |
| Fortran | mpifort (for v1.7 and above); mpif77 and mpif90 (for older versions) |
Hence, if you expect to compile your program as:
shell$ gcc my_mpi_application.c -o my_mpi_application
|
Simply use the following instead:
shell$ mpicc my_mpi_application.c -o my_mpi_application
|
Note that Open MPI's wrapper compilers do not do any actual compiling
or linking; all they do is manipulate the command line and add in all
the relevant compiler / linker flags and then invoke the underlying
compiler / linker (hence, the name "wrapper" compiler). More
specifically, if you run into a compiler or linker error, check your
source code and/or back-end compiler -- it is usually not the fault of
the Open MPI wrapper compiler.
| 93. Wait -- what is mpifort? Shouldn't I use
mpif77 and mpif90? |
mpifort is a new name for the Fortran wrapper compiler that
debuted in Open MPI v1.7.
It supports compiling all versions of Fortran, and utilizing all
MPI Fortran interfaces (mpif.h, use mpi, and use
mpi_f08). There is no need to distinguish between "Fortran 77"
(which hasn't existed for 30+ years) and "Fortran 90" -- just use
mpifort to compile all your Fortran MPI applications and don't worry
about what dialect it is, nor which MPI Fortran interface it uses.
Other MPI implementations will also soon support a wrapper compiler
named mpifort, so hopefully we can move the whole world to this
simpler wrapper compiler name, and eliminate the use of mpif77 and
mpif90.
Specifically: mpif77 and mpif90 are
deprecated as of Open MPI v1.7. Although mpif77 and
mpif90 still exist in Open MPI v1.7 for legacy reasons, they will
likely be removed in some (undetermined) future release. It is in
your interest to convert to mpifort now.
Also note that these names are literally just sym links to mpifort
under the covers. So you're using mpifort whether you realize it or
not. :-)
Basically, the 1980's called; they want their mpif77 wrapper
compiler back. Let's let them have it.
| 94. I can't / don't want to use Open MPI's wrapper compilers.
What do I do? |
We repeat the above statement: the Open MPI Team strongly
recommends that you use the wrapper compilers to compile and link MPI
applications.
If you find yourself saying, "But I don't want to use wrapper
compilers!", please humor us and try them. See if they work for you.
Be sure to let us know if they do not work for you.
Many people base their "wrapper compilers suck!" mentality on bad
behavior from poorly-implemented wrapper compilers in the mid-1990's.
Things are much better these days; wrapper compilers can handle
almost any situation, and are far more reliable than you attempting to
hard-code the Open MPI-specific compiler and linker flags manually.
That being said, there are some -- very, very few -- situations
where using wrapper compilers can be problematic -- such as nesting
multiple wrapper compilers of multiple projects. Hence, Open MPI
provides a workaround to find out what command line flags you need to
compile MPI applications. There are generally two sets of flags that
you need: compile flags and link flags.
# Show the flags necessary to compile MPI C applications
shell$ mpicc --showme:compile
# Show the flags necessary to link MPI C applications
shell$ mpicc --showme:link
|
The --showme:* flags work with all Open MPI wrapper compilers
(specifically: mpicc, mpiCC / mpicxx / mpic++, mpifort, and
if you really must use them, mpif77, mpif90).
Hence, if you need to use a compiler other than Open MPI's
wrapper compilers, we advise you to run the appropriate Open MPI
wrapper compiler with the --showme flags to see what Open MPI needs
to compile / link, and then use those with your compiler.
NOTE: It is absolutely not sufficient
to simply add "-lmpi" to your link line and assume that you will
obtain a valid Open MPI executable.
NOTE: It is almost never a good idea to hard-code these results in a
Makefile (or other build system). It is almost always best to run
(for example) "mpicc --showme:compile" in a dynamic fashion to
find out what you need. For example, GNU Make allows running commands
and assigning their results to variables:
MPI_COMPILE_FLAGS = $(shell mpicc --showme:compile)
MPI_LINK_FLAGS = $(shell mpicc --showme:link)
my_app: my_app.c
$(CC) $(MPI_COMPILE_FLAGS) my_app.c $(MPI_LINK_FLAGS) -o my_app
|
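The same "ask the wrapper every time" approach can be used from a plain shell script. The following is a sketch only: mpi_command_line is a hypothetical helper name, and its optional fourth argument (which wrapper to query) exists purely for illustration.

```shell
# Hypothetical helper: print the command line for building an MPI program
# with a non-wrapper compiler, querying the wrapper for its flags each time.
#   $1 = compiler, $2 = source file, $3 = output name, $4 = wrapper (default: mpicc)
mpi_command_line() {
    local wrapper="${4:-mpicc}" compile_flags link_flags
    compile_flags=$("$wrapper" --showme:compile) || return 1
    link_flags=$("$wrapper" --showme:link) || return 1
    echo "$1 $compile_flags $2 $link_flags -o $3"
}

# Only call the real wrapper when it is installed.
if command -v mpicc >/dev/null 2>&1; then
    mpi_command_line gcc my_app.c my_app ||
        echo "mpicc did not accept --showme (not an Open MPI wrapper?)" >&2
fi
```

Printing the command rather than running it makes it easy to inspect what would be executed before committing to it in a build script.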
| 95. How do I override the flags specified by Open MPI's wrapper
compilers? (v1.0 series) |
NOTE: This answer
applies to the v1.0 series of Open MPI only. If you are using a later
series, please see this FAQ
entry.
The wrapper compilers each construct command lines in the following
form:
<compiler> <xCPPFLAGS> <xFLAGS> user_arguments <xLDFLAGS> <xLIBS>
|
Where <compiler> is replaced by the default back-end compiler for each
language, and "x" is customized for each language (i.e., C, C++, F77,
and F90).
By setting appropriate environment variables, a user can
override default values used by the wrapper compilers. The table
below lists the variables for each of the wrapper compilers; the
Generic set applies to any wrapper compiler if the corresponding
wrapper-specific variable is not set. For example, the value of
$OMPI_LDFLAGS will be used with mpicc only if
$OMPI_MPICC_LDFLAGS is not set.
| Wrapper Compiler | Compiler | Preprocessor Flags | Compiler Flags | Linker Flags | Linker Library Flags |
| Generic | | OMPI_CPPFLAGS, OMPI_CXXPPFLAGS, OMPI_F77PPFLAGS, OMPI_F90PPFLAGS | OMPI_CFLAGS, OMPI_CXXFLAGS, OMPI_F77FLAGS, OMPI_F90FLAGS | OMPI_LDFLAGS | OMPI_LIBS |
| mpicc | OMPI_MPICC | OMPI_MPICC_CPPFLAGS | OMPI_MPICC_CFLAGS | OMPI_MPICC_LDFLAGS | OMPI_MPICC_LIBS |
| mpicxx | OMPI_MPIXX | OMPI_MPICXX_CXXPPFLAGS | OMPI_MPICXX_CXXFLAGS | OMPI_MPICXX_LDFLAGS | OMPI_MPICXX_LIBS |
| mpif77 | OMPI_MPIF77 | OMPI_MPIF77_F77PPFLAGS | OMPI_MPIF77_F77FLAGS | OMPI_MPIF77_LDFLAGS | OMPI_MPIF77_LIBS |
| mpif90 | OMPI_MPIF90 | OMPI_MPIF90_F90PPFLAGS | OMPI_MPIF90_F90FLAGS | OMPI_MPIF90_LDFLAGS | OMPI_MPIF90_LIBS |
NOTE: If you set a variable listed above, Open MPI will entirely
replace the default value that was originally there. Hence, it is
advisable to only replace these values when absolutely necessary.
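The fallback rule between wrapper-specific and generic variables can be pictured with ordinary shell parameter expansion. This sketch only mimics the documented rule for mpicc's linker flags; it is not how the wrapper is actually implemented, effective_mpicc_ldflags is a hypothetical name, and the paths are made up.

```shell
# Sketch of the lookup order described above for mpicc's linker flags:
# the wrapper-specific variable wins if it is set; otherwise the generic
# variable is used (and if neither is set, nothing is printed here, whereas
# the real wrapper would fall back to its built-in default).
effective_mpicc_ldflags() {
    printf '%s\n' "${OMPI_MPICC_LDFLAGS-${OMPI_LDFLAGS-}}"
}

OMPI_LDFLAGS="-L/generic/lib"
unset OMPI_MPICC_LDFLAGS
effective_mpicc_ldflags        # prints -L/generic/lib

OMPI_MPICC_LDFLAGS="-L/mpicc/lib"
effective_mpicc_ldflags        # prints -L/mpicc/lib
```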
| 96. How do I override the flags specified by Open MPI's wrapper
compilers? (v1.1 series and beyond) |
NOTE: This answer
applies to the v1.1 and later series of Open MPI only. If you are
using the v1.0 series, please see this
FAQ entry.
The Open MPI wrapper compilers are driven by text files that
contain, among other things, the flags that are passed to the
underlying compiler. These text files are generated automatically for
Open MPI and are customized for the compiler set that was selected
when Open MPI was configured; it is not recommended that users edit
these files.
Note that changing the underlying compiler may not work at
all. For example, C++ and Fortran compilers are notoriously
binary incompatible with each other (sometimes even within multiple
releases of the same compiler). If you compile/install Open MPI with
C++ compiler XYZ and then use the OMPI_CXX environment
variable to change the mpicxx wrapper compiler to use the
ABC C++ compiler, your application code may not compile and/or link.
The traditional method of using multiple different compilers with Open
MPI is to install Open MPI multiple times; each installation should be
built/installed with a different compiler. This is annoying, but it
is beyond the scope of Open MPI to be able to fix.
However, there are cases where it may be necessary or desirable to
edit these files and add to or subtract from the flags that Open MPI
selected. These files are installed in $pkgdatadir, which defaults
to $prefix/share/openmpi/<wrapper_name>-wrapper-data.txt. A
few environment variables are available for run-time replacement of
the wrapper's default values (from the text files):
| Wrapper Compiler | Compiler | Preprocessor Flags | Compiler Flags | Linker Flags | Linker Library Flags | Data File |
| Open MPI wrapper compilers |
| mpicc | OMPI_CC | OMPI_CPPFLAGS | OMPI_CFLAGS | OMPI_LDFLAGS | OMPI_LIBS | mpicc-wrapper-data.txt |
| mpic++ | OMPI_CXX | OMPI_CPPFLAGS | OMPI_CXXFLAGS | OMPI_LDFLAGS | OMPI_LIBS | mpic++-wrapper-data.txt |
| mpiCC | OMPI_CXX | OMPI_CPPFLAGS | OMPI_CXXFLAGS | OMPI_LDFLAGS | OMPI_LIBS | mpiCC-wrapper-data.txt |
| mpifort | OMPI_FC | OMPI_CPPFLAGS | OMPI_FCFLAGS | OMPI_LDFLAGS | OMPI_LIBS | mpifort-wrapper-data.txt |
| mpif77 (deprecated as of v1.7) | OMPI_F77 | OMPI_CPPFLAGS | OMPI_FFLAGS | OMPI_LDFLAGS | OMPI_LIBS | mpif77-wrapper-data.txt |
| mpif90 (deprecated as of v1.7) | OMPI_FC | OMPI_CPPFLAGS | OMPI_FCFLAGS | OMPI_LDFLAGS | OMPI_LIBS | mpif90-wrapper-data.txt |
| OpenRTE wrapper compilers |
| ortecc | ORTE_CC | ORTE_CPPFLAGS | ORTE_CFLAGS | ORTE_LDFLAGS | ORTE_LIBS | ortecc-wrapper-data.txt |
| ortec++ | ORTE_CXX | ORTE_CPPFLAGS | ORTE_CXXFLAGS | ORTE_LDFLAGS | ORTE_LIBS | ortec++-wrapper-data.txt |
| OPAL wrapper compilers |
| opalcc | OPAL_CC | OPAL_CPPFLAGS | OPAL_CFLAGS | OPAL_LDFLAGS | OPAL_LIBS | opalcc-wrapper-data.txt |
| opalc++ | OPAL_CXX | OPAL_CPPFLAGS | OPAL_CXXFLAGS | OPAL_LDFLAGS | OPAL_LIBS | opalc++-wrapper-data.txt |
Note that the values of these fields can be directly influenced by
passing flags to Open MPI's configure script. The following options
are available to configure:
- --with-wrapper-cflags: Extra flags to add to CFLAGS when using
mpicc.
- --with-wrapper-cxxflags: Extra flags to add to CXXFLAGS when
using mpiCC.
- --with-wrapper-fflags: Extra flags to add to FFLAGS when using
mpif77 (this option has disappeared in
Open MPI 1.7 and will not return; see this FAQ entry for more
details).
- --with-wrapper-fcflags: Extra flags to add to FCFLAGS when
using mpif90 and mpifort.
- --with-wrapper-ldflags: Extra flags to add to LDFLAGS when
using any of the wrapper compilers.
- --with-wrapper-libs: Extra flags to add to LIBS when using any
of the wrapper compilers.
The files cited in the above table are fairly simplistic "key=value"
data formats. The following are several fields that are likely to be
interesting for end-users:
- project_short: Prefix for all environment variables. See
below.
- compiler_env: Specifies the base name of the environment
variable that can be used to override the wrapper's underlying
compiler at run-time. The full name of the environment variable is of
the form <project_short>_<compiler_env>; see table
above.
- compiler_flags_env: Specifies the base name of the environment
variable that can be used to override the wrapper's compiler flags at
run-time. The full name of the environment variable is of the form
<project_short>_<compiler_flags_env>; see table
above.
- compiler: The executable name of the underlying compiler.
- extra_includes: Relative to $installdir, a list of directories
to also list in the preprocessor flags to find header files.
- preprocessor_flags: A list of flags passed to the
preprocessor.
- compiler_flags: A list of flags passed to the compiler.
- linker_flags: A list of flags passed to the linker.
- libs: A list of libraries passed to the linker.
- required_file: If non-empty, check for the presence of this
file before continuing. If the file is not there, the wrapper will
abort saying that the language is not supported.
- includedir: Directory containing Open MPI's header files. The
proper compiler "include" flag is prepended to this directory and
added into the preprocessor flags.
- libdir: Directory containing Open MPI's library files. The
proper linker "library path" flag (e.g., -L) is prepended to this directory and
added into the linker flags.
- module_option: This field only appears in mpif90. It is the
flag that the Fortran 90 compiler requires to declare where module
files are located.
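Since the data files are described as simple "key=value" text, a single field can be read with standard tools instead of editing the file. This is a minimal sketch: wrapper_data_get is a hypothetical helper name, the demonstration file and its values are made up, and the code assumes each field sits on its own "key=value" line.

```shell
# Hypothetical helper: print the value of one key from a wrapper data file.
#   $1 = key (e.g. compiler), $2 = path to a <wrapper_name>-wrapper-data.txt file
# Assumes plain "key=value" lines; only the first match is printed.
wrapper_data_get() {
    sed -n "s/^$1=//p" "$2" | head -n 1
}

# Demonstration on a made-up file (the values below are illustrative only).
demo_file=$(mktemp)
printf '%s\n' 'project_short=OMPI' 'compiler_env=CC' 'compiler=gcc' > "$demo_file"
wrapper_data_get compiler "$demo_file"       # prints gcc
wrapper_data_get compiler_env "$demo_file"   # prints CC
rm -f "$demo_file"
```

Combining project_short and compiler_env from a real file in this way yields the name of the environment variable listed in the table above (OMPI_CC in the demonstration).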
| 97. How can I tell what the wrapper compiler default flags are? |
If the corresponding environment variables are not set, the
wrappers will add -I$includedir and -I$includedir/openmpi (which
usually map to $prefix/include and $prefix/include/openmpi,
respectively) to the xFLAGS area, and add -L$libdir (which usually
maps to $prefix/lib) to the xLDFLAGS area.
To obtain the values of the other flags, there are two main methods:
- Use the --showme option to any wrapper compiler. For example
(lines broken here for readability):
shell$ mpicc prog.c -o prog --showme
gcc -I/path/to/openmpi/include -I/path/to/openmpi/include/openmpi/ompi \
prog.c -o prog -L/path/to/openmpi/lib -lmpi \
-lopen-rte -lopen-pal -lutil -lnsl -ldl -Wl,--export-dynamic -lm
|
This shows a coarse-grained method for getting the entire command
line, but does not tell you what each set of flags is (xFLAGS,
xCPPFLAGS, xLDFLAGS, and xLIBS).
- Use the ompi_info command. For example:
shell$ ompi_info --all | grep wrapper
Wrapper extra CFLAGS:
Wrapper extra CXXFLAGS:
Wrapper extra FFLAGS:
Wrapper extra FCFLAGS:
Wrapper extra LDFLAGS:
Wrapper extra LIBS: -lutil -lnsl -ldl -Wl,--export-dynamic -lm
|
This installation is only adding options in the xLIBS areas of the
wrapper compilers; all other values are blank (remember: the -I's
and -L's are implicit).
Note that the --parsable option can be used to obtain
machine-parsable versions of this output. For example:
shell$ ompi_info --all --parsable | grep wrapper:extra
option:wrapper:extra_cflags:
option:wrapper:extra_cxxflags:
option:wrapper:extra_fflags:
option:wrapper:extra_fcflags:
option:wrapper:extra_ldflags:
option:wrapper:extra_libs:-lutil -lnsl -ldl -Wl,--export-dynamic -lm
|
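When a script needs only one of these values, the parsable format can be split on its colon separators. The following is a minimal sketch rather than an official recipe: it works on a copy of the sample output above stored in a variable (instead of calling ompi_info), and the helper name wrapper_extra is made up for illustration.

```shell
# Sample of `ompi_info --all --parsable | grep wrapper:extra` output, copied
# from the example above. On a real system you would capture the live output.
parsable_output='option:wrapper:extra_cflags:
option:wrapper:extra_cxxflags:
option:wrapper:extra_fflags:
option:wrapper:extra_fcflags:
option:wrapper:extra_ldflags:
option:wrapper:extra_libs:-lutil -lnsl -ldl -Wl,--export-dynamic -lm'

# wrapper_extra NAME: print the value of option:wrapper:extra_NAME.
# (Hypothetical helper; fields are colon-separated, the value is field 4 onward.)
wrapper_extra() {
    printf '%s\n' "$parsable_output" |
        grep "^option:wrapper:extra_$1:" |
        cut -d: -f4-
}

wrapper_extra libs      # the xLIBS additions
wrapper_extra cflags    # empty in this installation
```

Using "-f4-" rather than "-f4" keeps the value intact even if a flag itself happens to contain a colon.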
| 98. Why does "mpicc --showme <some flags>" not show any
MPI-relevant flags? |
The output of commands similar to the following may be
somewhat surprising:
shell$ mpicc -g --showme
gcc -g
shell$
|
Where are all the MPI-related flags, such as the necessary -I, -L, and
-l flags?
The short answer is that these flags are not included in the wrapper
compiler's underlying command line unless the wrapper compiler sees a
filename argument. Specifically (output artificially wrapped below for
readability):
shell$ mpicc -g --showme
gcc -g
shell$ mpicc -g foo.c --showme
gcc -I/opt/openmpi/include/openmpi -I/opt/openmpi/include -g foo.c
-Wl,-u,_munmap -Wl,-multiply_defined,suppress -L/opt/openmpi/lib -lmpi
-lopen-rte -lopen-pal -ldl
|
The second command had the filename "foo.c" in it, so the wrapper
added all the relevant flags. This is done specifically to allow
usage such as the following:
shell$ mpicc --version --showme
gcc --version
shell$ mpicc --version
i686-apple-darwin8-gcc-4.0.1 (GCC) 4.0.1 (Apple Computer, Inc. build 5363)
Copyright (C) 2005 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
shell$
|
That is, the wrapper compiler constructs the same underlying command
line whether or not "--showme" is used. The only difference is whether
the resulting command line is displayed or executed.
Hence, this behavior allows users to pass arguments to the underlying
compiler without intending to actually compile or link (such as
passing --version to query the underlying compiler's version). If the
wrapper compilers added more flags in these cases, some underlying
compilers would emit warnings.
| 99. Are there ways to just add flags to the wrapper compilers? |
Yes!
Open MPI's configure script allows you to add command line flags to
the wrappers on a permanent basis. The following configure options
are available:
- --with-wrapper-cflags=<flags>: These flags are added into
the
CFLAGS area in the mpicc wrapper compiler.
- --with-wrapper-cxxflags=<flags>: These flags are added into
the
CXXFLAGS area in the mpicxx wrapper compiler.
- --with-wrapper-fflags=<flags>: These flags are added into
the
FFLAGS area in the mpif77 wrapper compiler (this option was removed in Open MPI 1.7 and
will not return; see this
FAQ entry for more details).
- --with-wrapper-fcflags=<flags>: These flags are added into
the
FCFLAGS area in the mpif90 wrapper compiler.
- --with-wrapper-ldflags=<flags>: These flags are added into
the
LDFLAGS area in all the wrapper compilers.
- --with-wrapper-libs=<flags>: These flags are added into
the
LIBS area in all the wrapper compilers.
These configure options can be handy if you have some optional
compiler/linker flags that you need both Open MPI and all MPI
applications to be compiled with. Rather than trying to get all your
users to remember to pass the extra flags to the compiler when
compiling their applications, you can specify them with the configure
options shown above, thereby silently including them in the Open MPI
wrapper compilers -- your users will therefore be using the correct
flags without ever knowing it.
| 100. Why don't the wrapper compilers add "-rpath" (or similar)
flags by default? |
The default installation of Open MPI tries very hard to not
include any non-essential flags in the wrapper compilers. This is the
most conservative setting and allows the greatest flexibility for
end-users. If the wrapper compilers started adding flags to support
specific features (such as run-time locations for finding the Open MPI
libraries), such flags -- no matter how useful to some portion of
users -- would almost certainly break assumptions and functionality
for other users.
As a workaround, Open MPI provides several mechanisms for users to
manually override the flags in the wrapper compilers:
- First and simplest, you can add your own flags to the wrapper
compiler command line by simply listing them on the command line. For
example:
shell$ mpicc my_mpi_application.c -o my_mpi_application -rpath /path/to/openmpi/install/lib
|
- Use the --showme options to the wrapper compilers to
dynamically see what flags the wrappers are adding, and modify them as
appropriate. See this FAQ entry for more details.
- Use environment variables to override the arguments that the
wrappers insert. If you are using Open MPI 1.0.x, see this FAQ entry; otherwise, see this FAQ entry.
- If you are using Open MPI 1.1 or later, you can modify text files
that provide the system-wide default flags for the wrapper compilers.
See this FAQ entry for more details.
- If you are using Open MPI 1.1 or later, you can pass additional
flags in to the system-wide wrapper compiler default flags through
Open MPI's configure script. See this FAQ entry for more details.
You can use one or more of these methods to insert your own flags
(such as "-rpath" or similar).
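As an illustration of the configure-based method, the sketch below assembles a configure command line that bakes a run-time library search path into the wrappers. It assumes a GNU-style toolchain, where the option is spelled -Wl,-rpath,DIR when passed through the compiler driver, and an install prefix of /opt/openmpi. Both are assumptions, so adjust them for your compiler and site. The command is only printed, so it can be reviewed before it is run.

```shell
# Assumed install prefix; change to match your site.
prefix=/opt/openmpi

# GNU-style spelling of the run-time search path option (an assumption;
# other compilers and linkers use different flags).
rpath_flag="-Wl,-rpath,${prefix}/lib"

# --with-wrapper-ldflags adds the flag to the LDFLAGS area of all wrappers.
configure_cmd="./configure --prefix=${prefix} --with-wrapper-ldflags=${rpath_flag}"
printf '%s\n' "$configure_cmd"
```

Remember that a flag added this way applies to every application built with the wrappers, which is exactly the kind of site-wide policy decision the default installation avoids making for you.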
| 101. Can I build 100% static MPI applications? |
Fully static linking is not for the weak, and it is not
recommended. But it is possible, with some caveats.
- You must have static libraries available for everything that
your program links to. This includes Open MPI; you must have used the
--enable-static option to Open MPI's configure or otherwise have
available the static versions of the Open MPI libraries. Note that
Open MPI static builds default to including all of its plugins in
its libraries, as opposed to having each plugin in its own dynamic
shared object file, so all of Open MPI's code will be contained in
the static libraries -- even what is normally contained in Open MPI's
plugins. Note also that some popular Linux libraries do not have static
versions by default (e.g., libnuma), or require additional RPMs to be
installed to get the equivalent libraries.
- Open MPI must have been built without a memory manager. This
means that Open MPI must have been configured with the
--without-memory-manager flag. This is irrelevant on some platforms
for which Open MPI does not have a memory manager, but on some
platforms it is necessary (Linux). It is harmless to use this flag on
platforms where Open MPI does not have a memory manager. Not having a
memory manager means that Open MPI's mpi_leave_pinned behavior for
OS-bypass networks such as InfiniBand will not work.
- On some systems (Linux), you may see linker warnings about some
files requiring dynamic libraries for functions such as gethostname
and dlopen. These are ok, but do mean that you need to have the
shared libraries installed. You can disable all of Open MPI's
dlopen behavior (i.e., prevent it from trying to open any plugins)
by specifying the --disable-dlopen flag to Open MPI's configure
script. This will eliminate the linker warnings about dlopen.
For example, this is how to configure Open MPI to build static
libraries on Linux:
shell$ ./configure --without-memory-manager --without-libnuma \
--enable-static [...your other configure arguments...]
|
Some systems may have additional constraints about their support
libraries that require additional steps to produce working 100% static
MPI applications. For example, the libibverbs support library from
OpenIB / OFED has its own plugin system (which, by default, won't work
with an otherwise-static application); MPI applications need
additional compiler/linker flags to be specified to create a working
100% static MPI application. See this
FAQ entry for the details.
| 102. Can I build 100% static OpenFabrics / OpenIB / OFED MPI
applications on Linux? |
Fully static linking is not for the weak, and it is not
recommended. But it is possible. First, you must read this FAQ entry.
For an OpenFabrics / OpenIB / OFED application to be built statically,
you must have libibverbs v1.0.4 or later (v1.0.4 was released after
OFED 1.1, so if you have OFED 1.1, you will manually need to upgrade
your libibverbs). Both libibverbs and your verbs hardware plugin must
be available in static form.
Once all of that has been set up, run the following (artificially
wrapped sample output shown below -- your output may be slightly
different):
shell$ mpicc ring.c -o ring --showme
gcc -I/opt/openmpi/include/openmpi \
-I/opt/openmpi/include -pthread ring.c -o ring \
-L/usr/local/ofed/lib -L/usr/local/ofed/lib64/infiniband \
-L/usr/local/ofed/lib64 -L/opt/openmpi/lib -lmpi -lopen-rte \
-lopen-pal -libverbs -lrt -Wl,--export-dynamic -lnsl -lutil -lm -ldl
|
(or use whatever wrapper compiler is relevant -- the --showme flag
is the important part here)
This example shows the steps for the GNU compiler suite, but other
compilers will be similar. This example also assumes that the
OpenFabrics / OpenIB / OFED install was rooted at /usr/local/ofed;
some distributions install under /usr/ofed (or elsewhere). Finally,
some installations use the library directory "lib64" while others
use "lib". Adjust your directory names as appropriate.
Take the output from the above command and run it manually to
compile and link your application, adding the -static,
-Wl,--whole-archive, verbs plugin, and -Wl,--no-whole-archive
arguments as shown:
shell$ gcc -static -I/opt/openmpi/include/openmpi \
-I/opt/openmpi/include -pthread ring.c -o ring \
-L/usr/local/ofed/lib -L/usr/local/ofed/lib64/infiniband \
-L/usr/local/ofed/lib64 -L/opt/openmpi/lib -lmpi -lopen-rte \
-lopen-pal -Wl,--whole-archive -libverbs /usr/local/ofed/lib64/infiniband/mthca.a \
-Wl,--no-whole-archive -lrt -Wl,--export-dynamic -lnsl -lutil \
-lm -ldl
|
Note that the mthca.a file is the verbs plugin for Mellanox HCAs.
If you have an HCA from a different vendor (such as IBM or QLogic),
use the appropriate filename (look in $ofed_libdir/infiniband for
verbs plugin files for your hardware).
Specifically, these added arguments do the following:
- -static: Tell the linker to generate a static executable.
- -Wl,--whole-archive: Tell the linker to include the entire
ibverbs library in the executable.
- $ofed_root/lib64/infiniband/mthca.a: Include the Mellanox verbs
plugin in the executable.
- -Wl,--no-whole-archive: Tell the linker to return to the
default of not including entire libraries in the executable.
You can either add these arguments in manually, or you can see this FAQ entry to
modify the default behavior of the wrapper compilers to hide this
complexity from end users (but be aware that if you modify the wrapper
compilers' default behavior, all users will be creating static
applications!).
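If you prefer not to edit the command line by hand, the same transformation can be scripted. The sketch below is one possible approach, not part of Open MPI: it takes the sample --showme output from this entry (joined onto one line and stored in a variable), inserts -static after the compiler name, and wraps -libverbs and the verbs plugin in the whole-archive flags. The plugin path is the Mellanox example used above; substitute your own.

```shell
# Sample --showme output copied from this entry, joined onto one line.
# On a real system: showme=$(mpicc ring.c -o ring --showme)
showme='gcc -I/opt/openmpi/include/openmpi -I/opt/openmpi/include -pthread ring.c -o ring -L/usr/local/ofed/lib -L/usr/local/ofed/lib64/infiniband -L/usr/local/ofed/lib64 -L/opt/openmpi/lib -lmpi -lopen-rte -lopen-pal -libverbs -lrt -Wl,--export-dynamic -lnsl -lutil -lm -ldl'

# Verbs plugin archive for your hardware (Mellanox example from above).
plugin=/usr/local/ofed/lib64/infiniband/mthca.a

# 1st expression: add -static right after the compiler name.
# 2nd expression: surround -libverbs and the plugin with the whole-archive flags.
static_cmd=$(printf '%s\n' "$showme" | sed \
    -e 's|^\([^ ]*\) |\1 -static |' \
    -e "s| -libverbs | -Wl,--whole-archive -libverbs $plugin -Wl,--no-whole-archive |")

printf '%s\n' "$static_cmd"
```

The command is printed rather than executed so that you can inspect it first. Run the printed command once it matches what you expect.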
| 103. Why does it take soooo long to compile F90 MPI applications? |
NOTE: Starting with Open
MPI v1.7, if you are not using gfortran, building the Fortran 90 and
08 bindings does not suffer the same performance penalty that previous
versions incurred. The Open MPI developers encourage all users to
upgrade to the new Fortran bindings implementation -- including the
new MPI-3 Fortran'08 bindings -- when possible.
This is unfortunately due to a design flaw in the MPI F90
bindings themselves.
The answer to this question is exactly the same as it is for why it
takes so long to compile the MPI F90 bindings in the Open MPI
implementation; please see
this FAQ entry for the details.
| 104. How do I build BLACS with Open MPI? |
The blacs_install.ps file (available from the BLACS web site)
describes how to build BLACS, so we won't repeat much of it here
(especially since it might change in future versions). These
instructions only pertain to making Open MPI work correctly with
BLACS.
After selecting the appropriate starting Bmake.inc, make the
following changes to Sections 1, 2, and 3. The example below is from
the Bmake.MPI-SUN4SOL2; your Bmake.inc file may be different.
# Section 1:
# Ensure to use MPI for the communication layer
COMMLIB = MPI
# The MPIINCdir macro is used to link in mpif.h and
# must contain the location of Open MPI's mpif.h.
# The MPILIBdir and MPILIB macros are irrelevant
# and should be left empty.
MPIdir = /path/to/openmpi-1.6.4
MPILIBdir =
MPIINCdir = $(MPIdir)/include
MPILIB =
# Section 2:
# Set these values:
SYSINC =
INTFACE = -Df77IsF2C
SENDIS =
BUFF =
TRANSCOMM = -DUseMpi2
WHATMPI =
SYSERRORS =
# Section 3:
# You may need to specify the full path to
# mpif77 / mpicc if they aren't already in
# your path.
F77 = mpif77
F77LOADFLAGS =
CC = mpicc
CCLOADFLAGS =
|
The remainder of the values are fairly obvious and irrelevant to Open
MPI; you can set whatever optimization level you want, etc.
If you follow the rest of the instructions for building, BLACS will
build correctly and use Open MPI as its MPI communication layer.
| 105. How do I build ScaLAPACK with Open MPI? |
The scalapack_install.ps file (available from the ScaLAPACK web site)
describes how to build ScaLAPACK, so we won't repeat much of it here
(especially since it might change in future versions). These
instructions only pertain to making Open MPI work correctly with
ScaLAPACK. These instructions assume that you have built and
installed BLACS with Open MPI.
# Make sure you follow the instructions to build BLACS with Open MPI,
# and put its location in the following.
BLACSdir = <path where you installed BLACS>
# The MPI section is commented out. Uncomment it. The wrapper
# compiler will handle SMPLIB, so make it blank. The rest are correct
# as is.
USEMPI = -DUsingMpiBlacs
SMPLIB =
BLACSFINIT = $(BLACSdir)/blacsF77init_MPI-$(PLAT)-$(BLACSDBGLVL).a
BLACSCINIT = $(BLACSdir)/blacsCinit_MPI-$(PLAT)-$(BLACSDBGLVL).a
BLACSLIB = $(BLACSdir)/blacs_MPI-$(PLAT)-$(BLACSDBGLVL).a
TESTINGdir = $(home)/TESTING
# The PVMBLACS setup needs to be commented out.
#USEMPI =
#SMPLIB = $(PVM_ROOT)/lib/$(PLAT)/libpvm3.a -lnsl -lsocket
#BLACSFINIT =
#BLACSCINIT =
#BLACSLIB = $(BLACSdir)/blacs_PVM-$(PLAT)-$(BLACSDBGLVL).a
#TESTINGdir = $(HOME)/pvm3/bin/$(PLAT)
# Make sure that the BLASLIB points to the right place. We built this
# example on Solaris, hence the name below. The Linux version of the
# library (as of this writing) is blas_LINUX.a.
BLASLIB = $(LAPACKdir)/blas_solaris.a
# You may need to specify the full path to mpif77 / mpicc if they
# aren't already in your path.
F77 = mpif77
F77LOADFLAGS =
CC = mpicc
CCLOADFLAGS =
|
The remainder of the values are fairly obvious and irrelevant to Open
MPI; you can set whatever optimization level you want, etc.
If you follow the rest of the instructions for building, ScaLAPACK
will build correctly and use Open MPI as its MPI communication
layer.
| 106. How do I build PETSc with Open MPI? |
The only special configuration that you need to build PETSc is
to ensure that Open MPI's wrapper compilers (i.e., mpicc and
mpif77) are in your $PATH before running the PETSc configure.py
script.
PETSc should then automatically find Open MPI's wrapper compilers and
correctly build itself using Open MPI.
| 107. How do I build VASP with Open MPI? |
The following was reported by an Open MPI user who was able to
successfully build and run VASP with Open MPI:
I just compiled the latest VASP v4.6 using Open MPI v1.2.1, ifort
v9.1, ACML v3.6.0, BLACS with patch-03 and Scalapack v1.7.5 built with
ACML.
I configured Open MPI with --enable-static flag.
I used the VASP supplied makefile.linux_ifc_opt and only corrected
the paths to the ACML, scalapack, and BLACS dirs (I didn't lower the
optimization to -O0 for mpi.f like I suggested before). The -D's
are standard except I get a little better performance with
-DscaLAPACK (I tested it with out this option too):
CPP = $(CPP_) -DMPI -DHOST="LinuxIFC" -DIFC \
-Dkind8 -DNGZhalf -DCACHE_SIZE=4000 -DPGF90 -Davoidalloc \
-DMPI_BLOCK=2000 \
-Duse_cray_ptr -DscaLAPACK
|
Also, Blacs and Scalapack used the -D's suggested in the Open MPI FAQ.
| 108. Are other language / application bindings available for Open MPI? |
Other MPI language bindings and application-level programming
interfaces have been written by third parties. Here are links
to some of the available packages:
- bcMPI: bcMPI is a software package that implements MPI
extensions for MATLAB and GNU Octave. It consists of a core library
(libbcmpi) that interfaces to the MPI library, a toolbox for MATLAB
(mexmpi), and a toolbox for Octave (octmpi).
- MPI Toolbox for Octave (MPITB): Octave Linux users in a cluster with
several PCs can use MPITB in order to call MPI library routines from
within the Octave environment.
- Parallel::MPI: Perl bindings for MPI.
- mpi4py: MPI for Python (or mpi4py) provides bindings of the Message Passing
Interface (MPI) standard for the Python programming language, allowing
any Python program to exploit multiple processors.
- pyMPI: a project integrating the Message Passing Interface (MPI) into the
Python interpreter.
- Boost MPI: Boost C++ class library for MPI.
If you'd like to have your project listed here, send mail to the User's list.
| 109. What pre-requisites are necessary for running an Open MPI job? |
In general, Open MPI requires that its executables are in your
PATH on every node that you will run on and, if Open MPI was compiled
as dynamic libraries (which is the default), that the directory where its
libraries are located is in your LD_LIBRARY_PATH on every node.
Specifically, if Open MPI was installed with a prefix of /opt/openmpi,
then the following should be in your PATH and LD_LIBRARY_PATH:
PATH: /opt/openmpi/bin
LD_LIBRARY_PATH: /opt/openmpi/lib
|
Depending on your environment, you may need to set these values in
your shell startup files (e.g., .profile, .cshrc, etc.).
NOTE: there are exceptions to this rule -- notably the --prefix option to mpirun.
See this FAQ entry for more
details on how to add Open MPI to your PATH and LD_LIBRARY_PATH.
Additionally, Open MPI requires that jobs can be started on remote
nodes without any input from the keyboard. For example, if using
rsh or ssh as the remote agent, you must have your environment
set up to allow execution on remote nodes without entering a password
or passphrase.
| 110. What ABI guarantees does Open MPI provide? |
Open MPI's versioning and ABI scheme is described
here, but is summarized here in this FAQ entry for convenience.
Open MPI has provided forward application binary interface (ABI)
compatibility for MPI applications starting with v1.3.2. Prior to
that version, no ABI guarantees were provided.
NOTE: Prior to v1.3.2, subtle
and strange failures are almost guaranteed to occur if applications
were compiled and linked against shared libraries from one version of
Open MPI and then run with another. The Open MPI team strongly
discourages making any ABI assumptions before v1.3.2.
NOTE: ABI for the "use mpi"
Fortran interface was inadvertently broken in the v1.6.3 release, and
was restored in the v1.6.4 release. Any Fortran applications that
utilize the "use mpi" MPI interface that were compiled and linked
against the v1.6.3 release will not be link-time compatible with other
releases in the 1.5.x / 1.6.x series. Such applications remain source
compatible, however, and can be recompiled/re-linked with other Open
MPI releases.
Starting with v1.3.2, Open MPI provides forward ABI compatibility --
with respect to the MPI API only -- in all versions of a given feature
release series and its corresponding super stable
series. For example, on a single platform, an MPI application
linked against Open MPI v1.3.2 shared libraries can be updated to
point to the shared libraries in any successive v1.3.x or v1.4 release
and still work properly (e.g., via the LD_LIBRARY_PATH environment
variable or other operating system mechanism).
For the v1.5 series, this means that all releases of v1.5.x and v1.6.x
will be ABI compatible, per the above definition.
Open MPI reserves the right to break ABI compatibility at new feature
release series. For example, the same MPI application from above
(linked against Open MPI v1.3.2 shared libraries) will not work with
Open MPI v1.5 shared libraries. Similarly, MPI applications
compiled/linked against Open MPI 1.6.x will not be ABI compatible with
Open MPI 1.7.x.
| 111. Do I need a common filesystem on all my nodes? |
No, but it certainly makes life easier if you do.
A common environment to run Open MPI is in a "Beowulf"-class or
similar cluster (e.g., a bunch of 1U servers in a bunch of racks).
Simply stated, Open MPI can run on a group of servers or workstations
connected by a network. As mentioned above, there are several
prerequisites, however (for example, you typically must have an
account on all the machines, and you must be able to rsh or ssh between
the nodes without using a password).
Regardless of whether Open MPI is installed on a shared / networked
filesystem or independently on each node, it is usually easiest if
Open MPI is available in the same filesystem location on every node.
For example, if you install Open MPI to /opt/openmpi-1.6.4 on
one node, ensure that it is available in /opt/openmpi-1.6.4
on all nodes.
This FAQ entry
has a bunch more information about installation locations for Open
MPI.
| 112. How do I add Open MPI to my PATH and LD_LIBRARY_PATH? |
Open MPI must be able to find its executables in your PATH
on every node (if Open MPI was compiled as dynamic libraries, then its
library path must appear in LD_LIBRARY_PATH as well). As such, your
configuration/initialization files need to add Open MPI to your PATH
/ LD_LIBRARY_PATH properly.
How to do this may be highly dependent upon your local configuration,
so you may need to consult with your local system administrator. Some
system administrators take care of these details for you, some don't.
YMMV. Some common examples are included below, however.
You must have at least a minimum understanding of how your shell works
to get Open MPI in your PATH / LD_LIBRARY_PATH properly. Note
that Open MPI must be added to your PATH and LD_LIBRARY_PATH in
two situations: (1) when you log in to an interactive shell,
and (2) when you log in to non-interactive shells on remote nodes.
- If (1) is not configured properly, executables like mpicc will
not be found, and it is typically obvious what is wrong. The Open MPI
executable directory can manually be added to the PATH, or the
user's startup files can be modified such that the Open MPI
executables are added to the PATH every login. This latter approach
is preferred.
All shells have some kind of script file that is executed at login
time to set things like PATH and LD_LIBRARY_PATH and perform other
environmental setup tasks. This startup file is the one that needs to
be edited to add Open MPI to the PATH and LD_LIBRARY_PATH. Consult
the manual page for your shell for specific details (some shells are
picky about the permissions of the startup file, for example). The
table below lists some common shells and the startup files that they
read/execute upon login:
| Shell | Interactive login startup file |
| sh (Bourne shell, or bash named "sh") | .profile |
| csh | .cshrc followed by .login |
| tcsh | .tcshrc if it exists, .cshrc if it does not, followed by .login |
| bash | .bash_profile if it exists, or .bash_login if it exists, or .profile if it exists (in that order). Note that some Linux distributions automatically come with .bash_profile scripts for users that automatically execute .bashrc as well. Consult the bash man page for more information. |
- If (2) is not configured properly, executables like mpirun will
not function properly, and it can be somewhat confusing to figure out
(particularly for bash users).
The startup files in question here are the ones that are
automatically executed for a non-interactive login on a remote node
(e.g., "rsh othernode ps"). Note that not all shells support
this, and that some shells use different files for this than listed in
(1). Some shells will supersede (2) with (1). That is, fulfilling
(2) may automatically fulfill (1). The following table lists some
common shells and the startup file that is automatically executed,
either by Open MPI or by the shell itself:
| Shell | Non-interactive login startup file |
| sh (Bourne or bash named "sh") | This shell does not execute any file automatically, so Open MPI will execute the .profile script before invoking Open MPI executables on remote nodes |
| csh | .cshrc |
| tcsh | .tcshrc if it exists, or .cshrc if it does not |
| bash | .bashrc if it exists |
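For Bourne-style shells (sh, bash), a startup-file fragment along the following lines is one way to add Open MPI to both variables. This is a sketch under assumptions: the install prefix /opt/openmpi and the helper name prepend_dir are illustrative, and csh/tcsh users need the equivalent setenv commands instead. For bash, the tables above suggest placing it in .bashrc so that non-interactive logins pick it up as well.

```shell
# Assumed install prefix; change to match your installation.
OMPI_PREFIX=/opt/openmpi

# Prepend a directory to a colon-separated list only if it is not already
# there, so sourcing the startup file repeatedly does not grow the list.
# (prepend_dir is a hypothetical helper name.)
prepend_dir() {
    case ":$2:" in
        *":$1:"*) printf '%s\n' "$2" ;;      # already present: leave as-is
        "::")     printf '%s\n' "$1" ;;      # list was empty
        *)        printf '%s\n' "$1:$2" ;;   # put the new directory first
    esac
}

PATH=$(prepend_dir "$OMPI_PREFIX/bin" "$PATH")
LD_LIBRARY_PATH=$(prepend_dir "$OMPI_PREFIX/lib" "${LD_LIBRARY_PATH:-}")
export PATH LD_LIBRARY_PATH
```

The duplicate check matters in practice because some distributions source .bashrc from .bash_profile, so the same fragment may run more than once per login.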
| 113. What if I can't modify my PATH and/or LD_LIBRARY_PATH? |
There are some situations where you cannot modify the PATH or
LD_LIBRARY_PATH -- e.g., some ISV applications prefer to hide all
parallelism from the user, and therefore do not want to make the user
modify their shell startup files. Another case is where you want a
single user to be able to launch multiple MPI jobs simultaneously,
each with a different MPI implementation. Hence, setting shell
startup files to point to one MPI implementation would be problematic.
In such cases, you have two options:
- Use mpirun's --prefix command line option (described
below).
- Modify the wrapper compilers to include directives to include
run-time search locations for the Open MPI libraries (see this FAQ entry).
mpirun's --prefix command line option takes as an argument the
top-level directory where Open MPI was installed. While relative
directory names are possible, they can become ambiguous depending on
the job launcher used; using absolute directory names is strongly
recommended.
For example, say that Open MPI was installed into
/opt/openmpi-1.6.4. You would use the --prefix option like
this:
shell$ mpirun --prefix /opt/openmpi-1.6.4 -np 4 a.out
|
This will prefix the PATH and LD_LIBRARY_PATH on both the local
and remote hosts with /opt/openmpi-1.6.4/bin and
/opt/openmpi-1.6.4/lib, respectively. This is usually
unnecessary when using resource managers to launch jobs (e.g., SLURM,
Torque, etc.) because they tend to copy the entire local environment
-- to include the PATH and LD_LIBRARY_PATH -- to remote nodes
before execution. As such, if PATH and LD_LIBRARY_PATH are set
properly on the local node, the resource manager will automatically
propagate those values out to remote nodes. The --prefix option is
therefore usually most useful in rsh or ssh-based environments (or
similar).
Beginning with the 1.2 series, it is possible to make this the default
behavior by passing to configure the flag
--enable-mpirun-prefix-by-default. This will make mpirun behave
exactly the same as "mpirun --prefix $prefix ...", where $prefix is
the value given to --prefix in configure.
Finally, note that specifying the absolute pathname to mpirun is
equivalent to using the --prefix argument. For example, the
following is equivalent to the above command line that uses --prefix:
shell$ /opt/openmpi-1.6.4/bin/mpirun -np 4 a.out
|
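The relationship between the two forms can be seen by stripping the last two path components from the absolute mpirun pathname. The sketch below only illustrates that relationship for a launcher script that wants to print the equivalent --prefix command. It is not a description of how mpirun computes the prefix internally.

```shell
# Absolute path to mpirun, as in the example above.
mpirun_path=/opt/openmpi-1.6.4/bin/mpirun

# Strip "/bin/mpirun" to recover the installation prefix.
prefix=$(dirname "$(dirname "$mpirun_path")")

# Print the equivalent explicit --prefix form (not executed here).
printf 'mpirun --prefix %s -np 4 a.out\n' "$prefix"
```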
| 114. How do I launch Open MPI parallel jobs? |
Similar to many MPI implementations, Open MPI provides the
commands mpirun and mpiexec to launch MPI jobs. Several of the
questions in this FAQ category deal with using these commands.
Note, however, that these commands are identical.
Specifically, they are symbolic links to a common back-end launcher
command named orterun (Open MPI's run-time environment interaction
layer is named the Open Run-Time Environment, or ORTE -- hence
orterun).
As such, the rest of this FAQ usually refers only to mpirun, even
though the same discussions also apply to mpiexec and orterun
(because they are all, in fact, the same command).
| 115. How do I run a simple SPMD MPI job? |
Open MPI provides both mpirun and mpiexec commands. A simple way
to start a single program, multiple data (SPMD) application in
parallel is:
shell$ mpirun -np 4 my_parallel_application
|
This starts a four-process parallel application, running four copies
of the executable named my_parallel_application.
The rsh starter component accepts the --hostfile (also known as
--machinefile) option to indicate which hosts to start the processes
on:
shell$ mpirun --hostfile my_hostfile -np 4 my_parallel_application
|
The hostfile my_hostfile is a text file with hosts specified, one
per line. Each host can also specify a default and a maximum number of
slots to be used on that host (i.e., the number of available
processors on that host). Comments are also supported. For example:
# This is an example hostfile. Comments begin with #
#
# The following node is a single processor machine:
foo.example.com
# The following node is a dual-processor machine:
bar.example.com slots=2
# The following node is a quad-processor machine, and we absolutely
# want to disallow over-subscribing it:
yow.example.com slots=4 max-slots=4
|
slots and max-slots are discussed more in this FAQ entry.
Note, however, that not all environments require a hostfile. For
example, Open MPI will automatically detect when it is running in
batch / scheduled environments (such as SGE, PBS/Torque, SLURM, and
LoadLeveler) and use host information provided by those systems (i.e.,
it will ignore any provided hostfiles).
Also note that if using a launcher that uses a hostfile and no
hostfile is specified, all processes are launched on the local host.
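Before choosing a value for -np, it can help to total the slots that a hostfile provides. The sketch below recreates the example hostfile in a temporary file and sums its slots with awk. It assumes that a host listed without slots= contributes one slot, which matches the comments in the example hostfile but may not hold for every Open MPI version.

```shell
# Write the example hostfile from above to a temporary file.
hostfile=$(mktemp)
cat > "$hostfile" <<'EOF'
# This is an example hostfile.  Comments begin with #
foo.example.com
bar.example.com slots=2
yow.example.com slots=4 max-slots=4
EOF

# Sum the slots: skip comments and blank lines, read slots=N when present,
# and assume 1 slot for a host that does not specify a count.
total_slots=$(awk '
    /^[ \t]*(#|$)/ { next }
    {
        n = 1
        for (i = 2; i <= NF; i++)
            if ($i ~ /^slots=/) { split($i, kv, "="); n = kv[2] }
        total += n
    }
    END { print total + 0 }
' "$hostfile")

printf 'hostfile provides %s slots\n' "$total_slots"
rm -f "$hostfile"
```

Note that max-slots= is deliberately ignored here: it caps over-subscription rather than adding capacity.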
| 116. How do I run an MPMD MPI job? |
Both the mpirun and mpiexec commands support multiple
program, multiple data (MPMD) style launches, either from the command
line or from a file. For example:
shell$ mpirun -np 2 a.out : -np 2 b.out
|
This will launch a single parallel application, but the first two
processes will be instances of the a.out executable, and the second
two processes will be instances of the b.out executable. In MPI
terms, this will be a single MPI_COMM_WORLD, but the a.out
processes will be ranks 0 and 1 in MPI_COMM_WORLD, while the b.out
processes will be ranks 2 and 3 in MPI_COMM_WORLD.
mpirun (and mpiexec) can also accept a parallel application
specified in a file instead of on the command line. For example:
shell$ mpirun --app my_appfile
|
where the file my_appfile contains the following:
# Comments are supported; comments begin with #
# Application context files specify each sub-application in the
# parallel job, one per line. The first sub-application is the 2
# a.out processes:
-np 2 a.out
# The second sub-application is the 2 b.out processes:
-np 2 b.out
|
This will result in the same behavior as running a.out and b.out
from the command line.
Note that mpirun and mpiexec are identical in command-line options
and behavior; using the above command lines with mpiexec instead of
mpirun will result in the same behavior.
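To make the equivalence between the two forms concrete, the sketch below converts the simple appfile shown above into the colon-separated command line. It handles only what this example uses (comment lines, blank lines, and one sub-application per line) and prints the command instead of running it.

```shell
# Recreate the example appfile from above in a temporary file.
appfile=$(mktemp)
cat > "$appfile" <<'EOF'
# Comments are supported; comments begin with #
-np 2 a.out
# The second sub-application is the 2 b.out processes:
-np 2 b.out
EOF

# Skip comment and blank lines, then join the remaining lines with " : ".
cmdline=$(awk '
    /^[ \t]*(#|$)/ { next }
    { printf "%s%s", sep, $0; sep = " : " }
    END { print "" }
' "$appfile")

printf 'mpirun %s\n' "$cmdline"
rm -f "$appfile"
```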
| 117. I can run ompi_info and launch MPI jobs on a single host, but not across multiple hosts. Why? |
If you can run ompi_info and possibly even launch MPI
processes locally, but fail to launch MPI processes on remote hosts,
it is likely that you do not have your PATH and/or LD_LIBRARY_PATH
set up properly on the remote nodes.
Specifically, the Open MPI commands usually run properly even if
LD_LIBRARY_PATH is not set properly because they encode the
Open MPI library location in their executables and search there by
default. Hence, running ompi_info (and friends) usually works, even
in some improperly set up environments.
However, Open MPI's wrapper compilers do not encode the Open MPI
library locations in MPI executables by default (the wrappers only
specify a bare minimum of flags necessary to create MPI executables;
we consider any flags beyond this bare minimum set a local policy
decision). Hence, attempting to launch MPI executables in
environments where LD_LIBRARY_PATH is either not set or was set
improperly may result in messages about libmpi.so not being found.
You can change Open MPI's wrapper compiler behavior to specify the
run-time location of Open MPI's libraries, if you wish.
Depending on how Open MPI was configured
and/or invoked, it may even be possible to run MPI applications in
environments where PATH and/or LD_LIBRARY_PATH is not set, or is
set improperly. This can be desirable for environments where multiple
MPI implementations are installed, such as multiple versions of Open
MPI.
| 118. When I build Open MPI with the Intel compilers, I get warnings
about "orted" or my MPI application not finding libimf.so. What do I do? |
This usually happens because the Intel libraries cannot be
found on the node where Open MPI is attempting to launch an MPI
executable. For example:
shell$ mpirun -np 1 --host node1.example.com mpi_hello
orted: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
A daemon (pid 11893) died unexpectedly with status 127 while
attempting to launch so we are aborting.
[...more error messages...]
|
Open MPI first attempts to launch a "helper" daemon
(orted) on node1.example.com, but the launch fails because one
of orted's dependent libraries could not be found. This
particular library, libimf.so, is an Intel compiler library. As
such, it is likely that the user did not set up the Intel compiler
library in their environment properly on this node.
Double check that you have set up the Intel compiler environment on the
target node, for both interactive and non-interactive logins. It is a
common error to ensure that the Intel compiler environment is set up
properly for interactive logins, but not for
non-interactive logins. For example:
shell$ cd $HOME
shell$ mpicc mpi_hello.c -o mpi_hello
shell$ ./mpi_hello
Hello world, I am 0 of 1.
shell$ ssh node1.example.com
Welcome to node1.
node1 shell$ ./mpi_hello
Hello world, I am 0 of 1.
node1 shell$ exit
shell$ ssh node1.example.com $HOME/mpi_hello
mpi_hello: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory
shell$
|
The above example shows that running a trivial C program compiled by
the Intel compilers works fine on both the head node and node1 when
logging in interactively, but fails when run on node1
non-interactively. Check your shell startup files and verify
that the Intel compiler environment is set up properly for
non-interactive logins.
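One way to automate the non-interactive check is to run the standard Linux `ldd` tool over ssh (for example, `ssh node1.example.com ldd $HOME/mpi_hello`) and scan its output for unresolved libraries. The following is a minimal sketch under that assumption; the helper name `missing_libraries` is hypothetical, and the sample output, including its paths and addresses, is invented for illustration.

```python
def missing_libraries(ldd_output):
    """Return the names of shared libraries that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # ldd prints unresolved dependencies as "<name> => not found".
        if "=> not found" in line:
            missing.append(line.split("=>", 1)[0].strip())
    return missing

# Invented sample of what a non-interactive ldd run might print on a node
# where the Intel compiler environment has not been loaded.
SAMPLE_LDD_OUTPUT = (
    "\tlibmpi.so.0 => /opt/openmpi/lib/libmpi.so.0 (0x00002b5f0a2c4000)\n"
    "\tlibimf.so => not found\n"
    "\tlibc.so.6 => /lib64/libc.so.6 (0x00002b5f0a8d1000)\n"
)

if __name__ == "__main__":
    print(missing_libraries(SAMPLE_LDD_OUTPUT))
```

An empty result for the non-interactive run suggests that the dynamic linker can resolve every dependency in that environment.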
| 119. When I build Open MPI with the PGI compilers, I get warnings
about "orted" or my MPI application not finding libpgc.so. What do I do? |
This usually happens because the PGI libraries cannot be
found on the node where Open MPI is attempting to launch an MPI
executable. For example:
shell$ mpirun -np 1 --host node1.example.com mpi_hello
orted: error while loading shared libraries: libpgc.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
A daemon (pid 11893) died unexpectedly with status 127 while
attempting to launch so we are aborting.
[...more error messages...]
|
Open MPI first attempts to launch a "helper" daemon
(orted) on node1.example.com, but the launch fails because one
of orted's dependent libraries could not be found. This
particular library, libpgc.so, is a PGI compiler library. As
such, it is likely that the user did not set up the PGI compiler
library in their environment properly on this node.
Double check that you have set up the PGI compiler environment on the
target node, for both interactive and non-interactive logins. It is a
common error to ensure that the PGI compiler environment is set up
properly for interactive logins, but not for
non-interactive logins. For example:
shell$ cd $HOME
shell$ mpicc mpi_hello.c -o mpi_hello
shell$ ./mpi_hello
Hello world, I am 0 of 1.
shell$ ssh node1.example.com
Welcome to node1.
node1 shell$ ./mpi_hello
Hello world, I am 0 of 1.
node1 shell$ exit
shell$ ssh node1.example.com $HOME/mpi_hello
mpi_hello: error while loading shared libraries: libpgc.so: cannot open shared object file: No such file or directory
shell$
|
The above example shows that running a trivial C program compiled by
the PGI compilers works fine on both the head node and node1 when
logging in interactively, but fails when run on node1
non-interactively. Check your shell startup files and verify
that the PGI compiler environment is set up properly for
non-interactive logins.
| 120. When I build Open MPI with the Pathscale compilers, I get warnings
about "orted" or my MPI application not finding libmv.so. What do I do? |
This usually happens because the Pathscale libraries cannot be
found on the node where Open MPI is attempting to launch an MPI
executable. For example:
shell$ mpirun -np 1 --host node1.example.com mpi_hello
orted: error while loading shared libraries: libmv.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
A daemon (pid 11893) died unexpectedly with status 127 while
attempting to launch so we are aborting.
[...more error messages...]
|
Open MPI first attempts to launch a "helper" daemon
(orted) on node1.example.com, but the launch fails because one
of orted's dependent libraries could not be found. This
particular library, libmv.so, is a Pathscale compiler library. As
such, it is likely that the user did not set up the Pathscale compiler
library in their environment properly on this node.
Double check that you have set up the Pathscale compiler environment on the
target node, for both interactive and non-interactive logins. It is a
common error to ensure that the Pathscale compiler environment is set up
properly for interactive logins, but not for
non-interactive logins. For example:
shell$ cd $HOME
shell$ mpicc mpi_hello.c -o mpi_hello
shell$ ./mpi_hello
Hello world, I am 0 of 1.
shell$ ssh node1.example.com
Welcome to node1.
node1 shell$ ./mpi_hello
Hello world, I am 0 of 1.
node1 shell$ exit
shell$ ssh node1.example.com $HOME/mpi_hello
mpi_hello: error while loading shared libraries: libmv.so: cannot open shared object file: No such file or directory
shell$
|
The above example shows that running a trivial C program compiled by
the Pathscale compilers works fine on both the head node and node1 when
logging in interactively, but fails when run on node1
non-interactively. Check your shell startup files and verify
that the Pathscale compiler environment is set up properly for
non-interactive logins.
| 121. Can I run non-MPI programs with mpirun / mpiexec? |
Yes.
Indeed, Open MPI's mpirun and mpiexec are actually synonyms for
our underlying launcher named orterun (i.e., the Open Run-Time
Environment layer in Open MPI, or ORTE). So you can use mpirun and
mpiexec to launch any application. For example:
shell$ mpirun -np 2 --host a,b uptime
|
This will launch a copy of the unix command uptime on the hosts a
and b.
Other questions in the FAQ section deal with the specifics of the
mpirun command line interface; suffice it to say that it works
equally well for MPI and non-MPI applications.
| 122. Can I run GUI applications with Open MPI? |
Yes, but it will depend on your local setup and may require
additional setup.
In short: you will need to have X forwarding enabled from the remote
processes to the display where you want output to appear. In a secure
environment, you can simply allow all X requests to be shown on the
target display and set the DISPLAY environment variable in all MPI
processes' environments to the target display, perhaps something like
this:
shell$ hostname
my_desktop.secure-cluster.example.com
shell$ xhost +
shell$ mpirun -np 4 -x DISPLAY=my_desktop.secure-cluster.example.com a.out
|
However, this technique is not generally suitable for insecure
environments (because it allows anyone to read and write to your
display). A slightly more secure way is to only allow X connections
from the nodes where your application will be running:
shell$ hostname
my_desktop.secure-cluster.example.com
shell$ xhost +compute1 +compute2 +compute3 +compute4
compute1 being added to access control list
compute2 being added to access control list
compute3 being added to access control list
compute4 being added to access control list
shell$ mpirun -np 4 -x DISPLAY=my_desktop.secure-cluster.example.com a.out
|
(assuming that the four nodes you are running on are compute1
through compute4).
Other methods are available, but they involve sophisticated X
forwarding through mpirun and are generally more complicated than
desirable.
| 123. Can I run ncurses-based / curses-based / applications with
funky input schemes with Open MPI? |
Maybe. But probably not.
Open MPI provides fairly sophisticated stdin / stdout / stderr
forwarding. However, it does not work well with curses, ncurses,
readline, or other sophisticated I/O packages that generally require
direct control of the terminal.
Every application and I/O library is different -- you should try to
see if yours is supported. But chances are that it won't work.
Sorry. :-(
| 124. What other options are available to mpirun? |
mpirun supports the "--help" option which provides a usage
message and a summary of the options that it supports. It should be
considered the definitive list of what options are provided.
Several notable options are:
- --hostfile: Specify a hostfile for launchers (such as the rsh
launcher) that need to be told on which hosts to start parallel
applications.
- --host: Specify a host or list of hosts to run on (see this FAQ entry for more details)
- --np (or -np): Indicate the number of processes to
start.
- --mca (or -mca): Set MCA parameters (see the Run-Time Tuning FAQ)
- --wdir <directory>: Set the working directory of the
started applications. If not supplied, the current working directory
is assumed (or $HOME, if the current working directory does not
exist on all nodes).
- -x <env-variable-name>: The name of an environment
variable to export to the parallel application. The -x option can
be specified multiple times to export multiple environment
variables to the parallel application.
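When mpirun is driven from a script, it can help to assemble the argument list programmatically. The sketch below is a hypothetical helper (`build_mpirun_command` is not part of Open MPI) that uses only the options listed above; it builds the command line but does not run it.

```python
import shlex

def build_mpirun_command(program, np, hostfile=None, hosts=None,
                         mca=None, wdir=None, export=None):
    """Assemble an mpirun argument list from the options described above."""
    argv = ["mpirun", "-np", str(np)]
    if hostfile:
        argv += ["--hostfile", hostfile]
    if hosts:
        # --host takes a comma-delimited list of hosts.
        argv += ["--host", ",".join(hosts)]
    for name, value in (mca or {}).items():
        # Each MCA parameter is passed as "--mca <name> <value>".
        argv += ["--mca", name, str(value)]
    if wdir:
        argv += ["--wdir", wdir]
    for variable in export or []:
        # -x may be repeated, once per exported environment variable.
        argv += ["-x", variable]
    argv.append(program)
    return argv

if __name__ == "__main__":
    command = build_mpirun_command(
        "a.out", 4, hostfile="my_hosts",
        mca={"mpi_yield_when_idle": 1}, export=["DISPLAY"])
    print(shlex.join(command))
```

Returning a list rather than a single string lets the result be handed directly to `subprocess.run` without any shell quoting concerns.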
| 125. How do I use the --host option to mpirun? |
The --host option to mpirun takes a comma-delimited list
of hosts on which to run. For example:
shell$ mpirun -np 3 --host a,b,c hostname
|
This will launch one copy of hostname on each of the hosts a, b, and c.
--host works in two different ways:
- Exclusionary: If a list of hosts to run on has been provided by
another source (e.g., by a hostfile or a batch scheduler such as
SLURM, PBS/Torque, SGE, etc.), the hosts provided by the --host option
must be in the already-provided host list. If the --host-specified
nodes are not in the already-provided host list, mpirun will abort
without launching anything.
In this case, the --host option acts like an exclusionary filter --
it limits the scope of where processes will be scheduled from the
original list of hosts to produce a final list of hosts.
For example, say that the hostfile my_hosts contains the hosts
node1 through node4. If you run:
shell$ mpirun -np 1 --hostfile my_hosts --host node3 hostname
|
This will run a single copy of hostname on the host node3.
However, if you run:
shell$ mpirun -np 1 --hostfile my_hosts --host node17 hostname
|
This is an error (because node17 is not listed in my_hosts);
mpirun will abort.
Finally, note that in exclusionary mode, processes will only be
executed on the --host-specified hosts, even if it causes
oversubscription. For example:
shell$ mpirun -np 4 --host a uptime
|
This will launch 4 copies of uptime on host a.
- Inclusionary: If a list of hosts has not been provided by
another source, then the hosts provided by the --host option will be
used as the original and final host list.
In this case, --host acts as an inclusionary agent; all
--host-supplied hosts become available for scheduling processes.
For example (assume that you are not in a scheduling environment
where a list of nodes is being transparently supplied):
shell$ mpirun -np 3 --host a,b,c hostname
|
This will launch a single copy of hostname on the hosts a, b,
and c.
Note, too, that --host is essentially a per-application switch.
Hence, if you specify multiple applications (as in an MPMD job),
--host can be specified multiple times:
shell$ mpirun -np 1 --host a hostname : -np 1 --host b uptime
|
This will launch hostname on host a and uptime on host b.
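The two --host modes can be summarized in a small model. This is a sketch of the behavior described above, not Open MPI's code; `final_host_list` is a hypothetical name, and `allocated` stands for the host list supplied by a hostfile or batch scheduler (None when there is no such list).

```python
def final_host_list(requested, allocated=None):
    """Model how --host combines with an already-provided host list."""
    if allocated is None:
        # Inclusionary: the --host list is both the original and final list.
        return list(requested)
    # Exclusionary: every requested host must already be in the allocation.
    unknown = [host for host in requested if host not in allocated]
    if unknown:
        raise ValueError(f"hosts not in the provided list: {unknown}")
    return list(requested)

if __name__ == "__main__":
    my_hosts = ["node1", "node2", "node3", "node4"]
    print(final_host_list(["node3"], my_hosts))
    print(final_host_list(["a", "b", "c"]))
    try:
        final_host_list(["node17"], my_hosts)
    except ValueError as error:
        print("abort:", error)
```

The exception in the exclusionary branch corresponds to mpirun aborting without launching anything.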
| 126. How do I control how my processes are scheduled across nodes? |
The short version is that if you are not oversubscribing your
nodes (i.e., trying to run more processes than you have told Open MPI
are available on that node), scheduling is pretty simple and occurs
either on a by-slot or by-node round robin schedule. If you're
oversubscribing, the issue gets much more complicated -- keep reading.
The more complete answer is: Open MPI schedules processes to nodes by
asking two questions of each application on the mpirun command
line:
- How many processes should be launched?
- Where should those processes be launched?
The "how many" question is directly answered with the -np switch
to mpirun. The "where" question is a little more complicated, and
depends on three factors:
- The final node list (e.g., after --host exclusionary or
inclusionary processing)
- The scheduling policy (which applies to all applications in a
single job)
- The default and maximum number of slots on each host
As briefly mentioned in this FAQ
entry, slots are Open MPI's representation of how many
processors are available on a given host.
The default number of slots on any machine, if not explicitly
specified, is 1 (e.g., if a host is listed in a hostfile but has no
corresponding "slots" keyword). Schedulers (such as SLURM,
PBS/Torque, SGE, etc.) automatically provide an accurate default slot
count.
Max slot counts, however, are rarely specified by schedulers. The max
slot count for each node will default to "infinite" if it is not
provided (meaning that Open MPI will oversubscribe the node if you ask
it to -- see more on oversubscribing in this FAQ entry).
Open MPI currently supports two scheduling policies: by slot and by
node:
- By slot: This is the default scheduling policy, but it can also be
explicitly requested either by using the --byslot option to mpirun
or by setting the MCA parameter rmaps_base_schedule_policy to the
string "slot".
In this mode, Open MPI will schedule processes on a node until all of
its default slots are exhausted before proceeding to the next node.
In MPI terms, this means that Open MPI tries to maximize the number of
adjacent ranks in MPI_COMM_WORLD on the same host without
oversubscribing that host.
For example:
shell$ cat my-hosts
node0 slots=2 max_slots=20
node1 slots=2 max_slots=20
shell$ mpirun --hostfile my-hosts -np 8 --byslot hello | sort
Hello World I am rank 0 of 8 running on node0
Hello World I am rank 1 of 8 running on node0
Hello World I am rank 2 of 8 running on node1
Hello World I am rank 3 of 8 running on node1
Hello World I am rank 4 of 8 running on node0
Hello World I am rank 5 of 8 running on node0
Hello World I am rank 6 of 8 running on node1
Hello World I am rank 7 of 8 running on node1
|
- By node: This policy can be requested either by using the
--bynode option to mpirun or by setting the MCA parameter
rmaps_base_schedule_policy to the string "node".
In this mode, Open MPI will schedule a single process on each node in
a round-robin fashion (looping back to the beginning of the node list
as necessary) until all processes have been scheduled. Nodes are
skipped once their default slot counts are exhausted.
For example:
shell$ cat my-hosts
node0 slots=2 max_slots=20
node1 slots=2 max_slots=20
shell$ mpirun --hostfile my-hosts -np 8 --bynode hello | sort
Hello World I am rank 0 of 8 running on node0
Hello World I am rank 1 of 8 running on node1
Hello World I am rank 2 of 8 running on node0
Hello World I am rank 3 of 8 running on node1
Hello World I am rank 4 of 8 running on node0
Hello World I am rank 5 of 8 running on node1
Hello World I am rank 6 of 8 running on node0
Hello World I am rank 7 of 8 running on node1
|
In both policies, if the default slot count is exhausted on all nodes
while there are still processes to be scheduled, Open MPI will loop
through the list of nodes again and try to schedule one more process
to each node until all processes are scheduled. Nodes are skipped in
this process if their maximum slot count is exhausted. If the maximum
slot count is exhausted on all nodes while there are still processes
to be scheduled, Open MPI will abort without launching any processes.
NOTE: This is the scheduling policy in Open MPI because of a long
historical precedent in LAM/MPI. However, the scheduling of processes
to processors is a component in the RMAPS framework in Open MPI; it
can be changed. If you don't like how this scheduling occurs, please
let us know.
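The difference between the two policies is easiest to see in the case where no node is oversubscribed. The following sketch models only that case (the first pass over the default slot counts); it is an illustration, not the RMAPS component itself, and the name `schedule` and the (host, default_slots) tuple format are assumptions made for this example.

```python
def schedule(hosts, num_procs, policy="slot"):
    """Return the host chosen for each rank, without oversubscription.

    hosts is a list of (name, default_slots) tuples in node-list order.
    """
    if num_procs > sum(slots for _, slots in hosts):
        raise ValueError("oversubscription is not modeled in this sketch")
    placement = []
    if policy == "slot":
        # Fill every default slot on a node before moving to the next node.
        for name, slots in hosts:
            take = min(slots, num_procs - len(placement))
            placement += [name] * take
        return placement
    if policy == "node":
        # Place one process per node, round-robin, skipping full nodes.
        used = [0] * len(hosts)
        while len(placement) < num_procs:
            for index, (name, slots) in enumerate(hosts):
                if len(placement) == num_procs:
                    break
                if used[index] < slots:
                    used[index] += 1
                    placement.append(name)
        return placement
    raise ValueError(f"unknown policy: {policy!r}")

if __name__ == "__main__":
    my_hosts = [("node0", 2), ("node1", 2)]
    for policy in ("slot", "node"):
        for rank, host in enumerate(schedule(my_hosts, 4, policy)):
            print(f"{policy}: rank {rank} on {host}")
```

With two slots on each of two nodes and four processes, by-slot scheduling keeps adjacent ranks together on a node, while by-node scheduling alternates between the nodes.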
| 127. I'm not using a hostfile. How are slots calculated? |
If you are using a supported resource manager, Open MPI will
get the slot information directly from that entity. If you are using
the --host parameter to mpirun, be aware that each instance of a
hostname bumps up the internal slot count by one. For example:
shell$ mpirun --host node0,node0,node0,node0 ....
|
This tells Open MPI that host "node0" has a slot count of 4. This is
very different than, for example:
shell$ mpirun -np 4 --host node0 a.out
|
This tells Open MPI that host "node0" has a slot count of 1 but you
are running 4 processes on it. Specifically, Open MPI assumes that
you are oversubscribing the node.
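The slot counting rule for --host amounts to counting how often each hostname appears in the comma-delimited list. A minimal sketch (the function name `slots_from_host_option` is hypothetical):

```python
from collections import Counter

def slots_from_host_option(host_argument):
    """Count one slot for each appearance of a hostname in a --host list."""
    return dict(Counter(host_argument.split(",")))

if __name__ == "__main__":
    print(slots_from_host_option("node0,node0,node0,node0"))
    print(slots_from_host_option("node0"))
```

Comparing the resulting slot count with the -np value shows whether Open MPI will consider the node oversubscribed.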
| 128. Can I run multiple parallel processes on a uniprocessor machine? |
Yes.
But be very careful to ensure that Open MPI
knows that you are oversubscribing your node! If Open
MPI is unaware that you are oversubscribing a node, severe performance degradation can result.
See this FAQ entry for more details
on oversubscription.
| 129. Can I oversubscribe nodes (run more processes than processors)? |
Yes.
However, it is critical that Open MPI knows that you are
oversubscribing the node, or severe performance degradation can result.
The short explanation is as follows: never
specify a number of slots that is more than the available number of
processors. If you want to run 4
processes on a uniprocessor, then indicate that you only have 1 slot
but want to run 4 processes. For example:
shell$ cat my-hostfile
localhost
shell$ mpirun -np 4 --hostfile my-hostfile a.out
|
Specifically: do NOT have a
hostfile that contains "slots=4" (because there is only one
available processor).
Here's the full explanation:
Open MPI basically runs its message passing progression engine in two
modes: aggressive and degraded.
- Degraded: When Open MPI thinks that it is in an oversubscribed
mode (i.e., more processes are running than there are processors
available), MPI processes will automatically run in degraded mode
and frequently yield the processor to their peers, thereby allowing all
processes to make progress (be sure to see this
FAQ entry that describes how degraded mode affects processor and
memory affinity).
- Aggressive: When Open MPI thinks that it is in an exactly- or
under-subscribed mode (i.e., the number of running processes is equal
to or less than the number of available processors), MPI processes
will automatically run in aggressive mode, meaning that they will
never voluntarily give up the processor to other processes. With some
network transports, this means that Open MPI will spin in tight loops
attempting to make message passing progress, effectively causing other
processes to not get any CPU cycles (and therefore never make any
progress).
For example, on a uniprocessor node:
shell$ cat my-hostfile
localhost slots=4
shell$ mpirun -np 4 --hostfile my-hostfile a.out
|
This would cause all 4 MPI processes to run in aggressive mode
because Open MPI thinks that there are 4 available processors
to use. This is actually a lie (there is only 1 processor -- not 4),
and can cause extremely bad performance.
| 130. Can I force Aggressive or Degraded performance modes? |
Yes.
The MCA parameter mpi_yield_when_idle controls whether an MPI
process runs in Aggressive or Degraded performance mode. Setting it
to zero forces Aggressive mode; any other value forces Degraded mode
(see this FAQ
entry to see how to set MCA parameters).
Note that this value only affects the behavior of MPI processes when
they are blocking in MPI library calls. It does not affect behavior
of non-MPI processes, nor does it affect the behavior of a process
that is not inside an MPI library call.
Open MPI normally sets this parameter automatically (see this FAQ entry for details). Users are
cautioned against setting this parameter unless they are really,
absolutely, positively sure of what they are doing.
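The decision between the two modes, including the mpi_yield_when_idle override, can be modeled as follows. This is a sketch of the rule as described in this FAQ, not Open MPI's internal logic; the function `progress_mode` and its arguments are hypothetical, and `slots` is the slot count that Open MPI was told about for the node.

```python
def progress_mode(num_procs, slots, mpi_yield_when_idle=None):
    """Return 'aggressive' or 'degraded' for the processes on one node."""
    if mpi_yield_when_idle is not None:
        # An explicit setting wins: zero forces Aggressive mode,
        # any other value forces Degraded mode.
        return "aggressive" if mpi_yield_when_idle == 0 else "degraded"
    # Otherwise the mode follows from whether the node is oversubscribed.
    return "degraded" if num_procs > slots else "aggressive"

if __name__ == "__main__":
    # Uniprocessor correctly described with 1 slot.
    print(progress_mode(num_procs=4, slots=1))
    # The same machine mis-described as having 4 slots.
    print(progress_mode(num_procs=4, slots=4))
```

The second call illustrates the failure mode from the previous entry: the result depends on the slot count that was declared, not on the processors that actually exist.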
| 131. How do I run with the TotalView parallel debugger? |
Generally, you can run Open MPI processes with TotalView as
follows:
shell$ mpirun --debug ...mpirun arguments...
|
Assuming that TotalView is the first supported parallel debugger in
your path, Open MPI will automatically invoke the correct underlying
command to run your MPI process in the TotalView debugger. Be sure to
see this
FAQ entry for details about what versions of Open MPI and
TotalView are compatible.
For reference, this underlying command form is the following:
shell$ totalview mpirun -a ...mpirun arguments...
|
So if you wanted to run a 4-process MPI job of your a.out
executable, it would look like this:
shell$ totalview mpirun -a -np 4 a.out
|
Alternatively, Open MPI's mpirun offers the "-tv" convenience
option which does the same thing as TotalView's "-a" syntax. For
example:
shell$ mpirun -tv -np 4 a.out
|
Note that by default, TotalView will stop deep in the machine code of
mpirun itself, which is not what most users want. It is possible
to get TotalView to recognize that mpirun is simply a "starter"
program and should be (effectively) ignored. Specifically, TotalView
can be configured to skip mpirun (and mpiexec and orterun) and
jump right into your MPI application. This can be accomplished by
placing some startup instructions in a TotalView-specific file named
$HOME/.tvdrc.
Open MPI includes a sample TotalView startup file that performs this
function (see etc/openmpi-totalview.tcl in Open MPI distribution
tarballs; it is also installed, by default, to
$prefix/etc/openmpi-totalview.tcl in the Open MPI installation).
This file can be either copied to $HOME/.tvdrc or sourced from the
$HOME/.tvdrc file. For example, placing the following line in your
$HOME/.tvdrc (replacing /path/to/openmpi/installation with the
proper directory name, of course) will use the Open MPI-provided
startup file:
source /path/to/openmpi/installation/etc/openmpi-totalview.tcl
|
| 132. How do I run with the DDT parallel debugger? |
If you've used DDT at least once before (to use the
configuration wizard to setup support for Open MPI), you can start it
on the command line with:
shell$ mpirun --debug ...mpirun arguments...
|
Assuming that you are using Open MPI v1.2.4 or later, and assuming
that DDT is the first supported parallel debugger in your path, Open
MPI will automatically invoke the correct underlying command to run
your MPI process in the DDT debugger. For reference (or if you are
using an earlier version of Open MPI), this underlying command form is
the following:
shell$ ddt -n {nprocs} -start {exe-name}
|
Note that passing arbitrary arguments to Open MPI's mpirun is not
supported with the DDT debugger.
You can also attach to already-running processes with either of the
following two syntaxes:
shell$ ddt -attach {hostname1:pid} [{hostname2:pid} ...] {exec-name}
# Or
shell$ ddt -attach-file {filename of newline separated hostname:pid pairs} {exec-name}
|
DDT can even be configured to operate with cluster/resource schedulers
such that it can run on a local workstation, submit your MPI job via
the scheduler, and then attach to the MPI job when it starts.
See the official DDT documentation for more details.
| 133. What launchers are available? |
The documentation contained in the Open MPI tarball will have
the most up-to-date information, but as of v1.0, Open MPI supports:
- BProc versions 3 and 4 with LSF
- Sun Grid Engine (SGE), and the open source Grid Engine (support first introduced in Open MPI v1.2)
- PBS Pro, Torque, and Open PBS
- LoadLeveler scheduler (full support since 1.1.1)
- rsh / ssh
- SLURM
- XGrid
- Yod (Cray XT-3 and XT-4)
| 134. How do I specify to the rsh launcher to use rsh or ssh? |
See this FAQ entry.
| 135. How do I run with the SLURM and PBS/Torque launchers? |
If support for these systems is included in your Open MPI
installation (which you can check with the ompi_info command -- look
for components named "slurm" and/or "tm"), Open MPI will
automatically detect when it is running inside such jobs and will just
"do the Right Thing."
See this FAQ entry for
a description of how to run jobs in SLURM; see this FAQ entry for a description
of how to run jobs in PBS/Torque.
| 136. How do I run with the SGE launcher? |
Support for SGE is included in Open MPI version 1.2 and
later.
NOTE: To build SGE support in v1.3,
you will need to explicitly request the SGE support with the "--with-sge"
command line switch to Open MPI's configure script.
See this FAQ entry for
a description of how to correctly build Open MPI with SGE support.
To verify if support for SGE is configured into your Open MPI
installation, run ompi_info as shown below and look for gridengine.
The components you will see are slightly different between v1.2 and
v1.3.
For Open MPI 1.2:
shell$ ompi_info | grep gridengine
MCA ras: gridengine (MCA v1.0, API v1.0, Component v1.2)
MCA pls: gridengine (MCA v1.0, API v1.0, Component v1.2)
|
For Open MPI 1.3:
shell$ ompi_info | grep gridengine
MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)
|
Open MPI will automatically detect when it is running inside SGE and
will just "do the Right Thing."
Specifically, if you execute an mpirun command in a SGE job, it
will automatically use the SGE mechanisms to launch and kill
processes. There is no need to specify what nodes to run on -- Open
MPI will obtain this information directly from SGE. For example, this
will run the 4 MPI processes on the nodes that were allocated by
SGE:
# Get the environment variables for SGE
# (Assuming SGE is installed at /opt/sge and $SGE_CELL is 'default' in your environment)
# C shell settings
shell% source /opt/sge/default/common/settings.csh
# bourne shell settings
shell$ . /opt/sge/default/common/settings.sh
# Allocate a SGE interactive job with 4 slots
# from a parallel environment (PE) named 'orte'
shell$ qsh -pe orte 4
# Now run a 4-process Open MPI job
shell$ mpirun -np 4 a.out
|
There are also other ways to submit jobs under SGE:
# Submit a batch job with the 'mpirun' command embedded in a script
shell$ qsub -pe orte 4 my_mpirun_job.csh
# Submit an SGE and OMPI job and mpirun in one line
shell$ qrsh -V -pe orte 4 mpirun -np 4 hostname
# Use qstat(1) to show the status of SGE jobs and queues
shell$ qstat -f
|
As a reference to the setup, be sure you have a Parallel Environment
(PE) defined for submitting parallel jobs. You don't have to name your
PE "orte". The following example shows what a PE named 'orte' might
look like:
% qconf -sp orte
pe_name orte
slots 8
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule $round_robin
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
|
And be sure the queue will make use of the PE that you specified:
% qconf -sq all.q
...
pe_list make cre orte
...
|
To determine whether the SGE parallel job is successfully launched on the remote nodes,
you can pass the MCA parameter "--mca pls_gridengine_verbose 1" to mpirun.
This adds a -verbose flag to the qrsh -inherit command that is used to
send parallel tasks to the remote SGE execution hosts. It will show
whether the connections to the remote hosts are established successfully or not.
| 137. Can I suspend and resume my job? |
A new feature was added into Open MPI 1.3.1 that supports
suspend/resume of an MPI job. To suspend the job, you send a SIGTSTP
(not SIGSTOP) signal to mpirun. mpirun will catch this signal and
forward it to the a.outs as a SIGSTOP signal. To resume the job,
you send a SIGCONT signal to mpirun which will be caught and
forwarded to the a.outs.
By default, this feature is not enabled. This means that both the
SIGTSTP and SIGCONT signals will simply be consumed by the mpirun
process. To have them forwarded, you have to run the job with --mca
orte_forward_job_control 1. Here is an example on Solaris.
shell$ mpirun -mca orte_forward_job_control 1 -np 2 a.out
|
In another window, we suspend and continue the job.
shell$ prstat -p 15301,15303,15305
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
15305 rolfv 158M 22M cpu1 0 0 0:00:21 5.9% a.out/1
15303 rolfv 158M 22M cpu2 0 0 0:00:21 5.9% a.out/1
15301 rolfv 8128K 5144K sleep 59 0 0:00:00 0.0% orterun/1
shell$ kill -TSTP 15301
shell$ prstat -p 15301,15303,15305
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
15303 rolfv 158M 22M stop 30 0 0:01:44 21% a.out/1
15305 rolfv 158M 22M stop 20 0 0:01:44 21% a.out/1
15301 rolfv 8128K 5144K sleep 59 0 0:00:00 0.0% orterun/1
shell$ kill -CONT 15301
shell$ prstat -p 15301,15303,15305
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
15305 rolfv 158M 22M cpu1 0 0 0:02:06 17% a.out/1
15303 rolfv 158M 22M cpu3 0 0 0:02:06 17% a.out/1
15301 rolfv 8128K 5144K sleep 59 0 0:00:00 0.0% orterun/1
|
Note that all this does is stop the a.outs. It does not, for example,
free any pinned memory when the job is in the suspended state.
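The same suspend/resume sequence can be driven from a script instead of the kill command. The sketch below uses Python's standard `os.kill` and `signal` modules; the helper names are hypothetical, the `kill` parameter exists only so the functions can be exercised without a real job, and the PID is the mpirun (orterun) PID from the transcript above.

```python
import os
import signal

def suspend_job(mpirun_pid, kill=os.kill):
    """Ask mpirun to suspend the job by sending it SIGTSTP (not SIGSTOP)."""
    kill(mpirun_pid, signal.SIGTSTP)

def resume_job(mpirun_pid, kill=os.kill):
    """Ask mpirun to resume the job by sending it SIGCONT."""
    kill(mpirun_pid, signal.SIGCONT)

if __name__ == "__main__":
    # Dry run: record the signals instead of delivering them.
    sent = []
    record = lambda pid, sig: sent.append((pid, signal.Signals(sig).name))
    suspend_job(15301, kill=record)
    resume_job(15301, kill=record)
    print(sent)
```

As with the shell version, this only has an effect on the MPI processes if the job was started with orte_forward_job_control set to 1.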
To get this to work under the SGE environment, you have to change the
suspend_method entry in the queue. It has to be set to SIGTSTP. Here
is an example of what a queue should look like.
shell$ qconf -sq all.q
qname all.q
[...snip...]
starter_method NONE
suspend_method SIGTSTP
resume_method NONE
|
Note that if you need to suspend other types of jobs with SIGSTOP
(instead of SIGTSTP) in this queue then you need to provide a script
that can implement the correct signals for each job type.
| 138. Does the SGE tight integration support the -notify flag to qsub? |
If you are running SGE 6.2 Update 3 or later, then the -notify flag
is supported. If you are running earlier versions, then the -notify flag
will not work and using it will cause the job to be killed.
To use -notify, one has to be careful. First, let us review what
-notify does. Here is an excerpt from the qsub man page for the
-notify flag.
- -notify:
This flag, when set causes Sun Grid Engine to send
warning signals to a running job prior to sending the
signals themselves. If a SIGSTOP is pending, the job
will receive a SIGUSR1 several seconds before the SIGSTOP.
If a SIGKILL is pending, the job will receive a SIGUSR2
several seconds before the SIGKILL. The amount of time
delay is controlled by the notify parameter in each
queue configuration.
Let us assume that the reason you want to use
the -notify flag is to get the SIGUSR1 signal prior to getting the
SIGTSTP signal. As mentioned in this FAQ entry, one could
run the job as shown in this batch script.
#! /bin/bash
#$ -S /bin/bash
#$ -V
#$ -cwd
#$ -N Job1
#$ -pe orte 16
#$ -j y
#$ -l h_rt=00:20:00
mpirun -np 16 -mca orte_forward_job_control 1 a.out
|
However, one has to make one of two changes to this script for things
to work properly. By default, a SIGUSR1 signal will kill a
shell script. So we have to make sure that does not happen. Here
is one way to handle it.
#! /bin/bash
#$ -S /bin/bash
#$ -V
#$ -cwd
#$ -N Job1
#$ -pe orte 16
#$ -j y
#$ -l h_rt=00:20:00
exec mpirun -np 16 -mca orte_forward_job_control 1 a.out
|
Alternatively, one can catch the signals in the script instead of doing
an exec on the mpirun.
#! /bin/bash
#$ -S /bin/bash
#$ -V
#$ -cwd
#$ -N Job1
#$ -pe orte 16
#$ -j y
#$ -l h_rt=00:20:00
function sigusr1handler()
{
echo "SIGUSR1 caught by shell script" 1>&2
}
function sigusr2handler()
{
echo "SIGUSR2 caught by shell script" 1>&2
}
trap sigusr1handler SIGUSR1
trap sigusr2handler SIGUSR2
mpirun -np 16 -mca orte_forward_job_control 1 a.out
|
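The same idea can be sketched in Python for job wrappers that are not shell scripts. This is only an illustration, not part of Open MPI or SGE: the names note_signal and received are made up for the example, and the os.kill() call merely simulates the warning signal that SGE would deliver.

```python
import os
import signal

received = []  # signal numbers seen by the handler, in arrival order

def note_signal(signum, frame):
    """Record the warning signal instead of letting it terminate the wrapper."""
    received.append(signum)

# Replace the default action (terminate the process) for both warning signals.
signal.signal(signal.SIGUSR1, note_signal)
signal.signal(signal.SIGUSR2, note_signal)

# Simulate the early warning that would precede a suspend request.
os.kill(os.getpid(), signal.SIGUSR1)

print("wrapper still running; signals seen:", received)
```

In a real wrapper, the handlers would be installed before launching mpirun, so that the wrapper itself survives the warning signals.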
| 139. How do I run with LoadLeveler? |
If support for LoadLeveler is included in your Open MPI
installation (which you can check with the ompi_info command -- look
for components named "loadleveler"), Open MPI will
automatically detect when it is running inside such jobs and will just
"do the Right Thing."
Specifically, if you execute an mpirun command in a LoadLeveler job,
it will automatically determine what nodes and how many slots on each
node have been allocated to the current job. There is no need to
specify what nodes to run on. Open MPI will then attempt to launch the
job using whatever resource is available (on Linux rsh/ssh is used).
For example:
# Job to submit
shell$ cat job
#@ output = job.out
#@ error = job.err
#@ job_type = parallel
#@ node = 3
#@ tasks_per_node = 4
mpirun a.out
# Submit batch job to LoadLeveler
shell$ llsubmit job
|
This will run 4 MPI processes per node on the 3 nodes that were allocated by
LoadLeveler for this job.
For users of the Open MPI 1.1
series: Version 1.1.0 has a problem that
prevents Open MPI from determining which nodes
are available to it if the job has more than 128 tasks. In the 1.1.x
series starting with version 1.1.1, this can be worked around by
passing "-mca ras_loadleveler_priority 110" to mpirun. Version 1.2
and above work without any additional flags.
| 140. How do I load libmpi at runtime? |
If you want to load the shared library libmpi explicitly
at runtime either by using dlopen() from C/C++ or something like
the ctypes package from Python, some extra care is required. The
default configuration of Open MPI uses dlopen() internally to load
its support components. These components rely on symbols available in
libmpi. In order to make the symbols in libmpi available to the
components loaded by Open MPI at runtime, libmpi must be loaded with
the RTLD_GLOBAL option.
In C/C++, this option is specified as the second parameter to
dlopen(). When using ctypes with Python, this can be done with
the second (optional) parameter to CDLL(). For example (shown below
in Mac OS X, where Open MPI's shared library name ends in ".dylib";
other operating systems use other suffixes, such as ".so"):
from ctypes import *
mpi = CDLL('libmpi.0.dylib', RTLD_GLOBAL)
f = pythonapi.Py_GetArgcArgv
argc = c_int()
argv = POINTER(c_char_p)()
f(byref(argc), byref(argv))
mpi.MPI_Init(byref(argc), byref(argv))
mpi.MPI_Finalize()
|
Other scripting languages should have similar options when dynamically
loading shared libraries.
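To avoid hard-coding the platform-specific suffix, one possible approach is to let Python look up the library name first. The sketch below is an assumption-laden illustration: load_global is a made-up helper name, and it relies on ctypes.util.find_library being able to locate libmpi on your system (which is not guaranteed if Open MPI is installed in a non-standard prefix).

```python
import ctypes
import ctypes.util

def load_global(short_name):
    """Load a shared library with RTLD_GLOBAL; return None if it is not found."""
    # find_library maps a short name such as "mpi" to a platform-specific
    # file name or path, or returns None when nothing suitable is found.
    path = ctypes.util.find_library(short_name)
    if path is None:
        return None
    # RTLD_GLOBAL makes the library's symbols visible to components that
    # are loaded later.
    return ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)

mpi = load_global("mpi")
print("libmpi loaded" if mpi is not None else "libmpi not found")
```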
| 141. What MPI environmental variables exist? |
Beginning with the 1.3 release, Open MPI provides the following
environmental variables that will be defined on every
MPI process:
- OMPI_COMM_WORLD_SIZE - the number of processes in this process's MPI_COMM_WORLD
- OMPI_COMM_WORLD_RANK - the MPI rank of this process
- OMPI_COMM_WORLD_LOCAL_RANK - the relative rank of this process on this node
within its job. For example, if four processes in a job share a node, they
will each be given a local rank ranging from 0 to 3.
- OMPI_UNIVERSE_SIZE - the number of process slots allocated to this job. Note
that this may be different than the number of processes in the job.
- OMPI_COMM_WORLD_LOCAL_SIZE - the number of ranks from this job that are running on this node.
- OMPI_COMM_WORLD_NODE_RANK - the relative rank of this process on this node
looking across ALL jobs.
Open MPI guarantees that these variables will remain stable throughout future releases.
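As an illustration, a process (or a wrapper script started by mpirun) could read these variables as follows. This is a minimal sketch; the helper name world_info is made up for the example, and it returns None when the variables are absent (for instance, when the program is not started by mpirun).

```python
import os

def world_info(env=os.environ):
    """Return (rank, size, local_rank) as ints, or None if the variables are unset."""
    try:
        return (int(env["OMPI_COMM_WORLD_RANK"]),
                int(env["OMPI_COMM_WORLD_SIZE"]),
                int(env["OMPI_COMM_WORLD_LOCAL_RANK"]))
    except KeyError:
        # At least one variable is missing: probably not running under mpirun.
        return None

info = world_info()
if info is None:
    print("not running inside an Open MPI job")
else:
    print("rank %d of %d (local rank %d)" % info)
```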
| 142. How do I get my MPI job to wireup its MPI connections right away? |
By default, Open MPI opens MPI connections between processes in a
"lazy" fashion - i.e., the connections are only opened when the MPI process
actually attempts to send a message to another process for the first time. This is
done since (a) Open MPI has no idea what connections an application process will
really use, and (b) creating the connections takes time. Once
the connection is established, it remains "connected" until one of the two
connected processes terminates, so the creation time cost is paid only once.
Applications that require a fully connected topology, however, can see
improved startup time if they automatically "pre-connect" all their
processes during MPI_Init. Accordingly, Open MPI provides the MCA
parameter "mpi_preconnect_mpi" which directs Open MPI to establish a
"mostly" connected topology during MPI_Init (note that this MCA
parameter used to be named "mpi_preconnect_all" prior to Open MPI
v1.5; in v1.5, it was deprecated and replaced with
"mpi_preconnect_mpi"). This is accomplished in a somewhat scalable
fashion to help minimize startup time.
Users can set this parameter in two ways:
- in the environment as OMPI_MCA_mpi_preconnect_mpi=1
- on the cmd line as mpirun -mca mpi_preconnect_mpi 1
See this FAQ entry
for more details on how to set MCA parameters.
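The two forms above are equivalent, which can be illustrated with a small sketch for scripts that launch mpirun. The helper names mca_env and mca_args are made up for this example; only the OMPI_MCA_ prefix and the -mca option come from the text above.

```python
import os

def mca_env(params, base=None):
    """Return a copy of base (default: os.environ) with OMPI_MCA_<name> entries added."""
    env = dict(os.environ if base is None else base)
    for name, value in params.items():
        env["OMPI_MCA_" + name] = str(value)
    return env

def mca_args(params):
    """Return the equivalent mpirun arguments: -mca <name> <value> for each entry."""
    args = []
    for name, value in params.items():
        args += ["-mca", name, str(value)]
    return args

params = {"mpi_preconnect_mpi": 1}
print(["mpirun"] + mca_args(params) + ["a.out"])
```

Either the environment from mca_env() or the argument list from mca_args() could then be handed to whatever mechanism launches mpirun.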
| 143. What kind of CUDA support exists in Open MPI? |
Open MPI recently added support for sending and receiving CUDA device
memory directly. Prior to this support, the programmer would first
have to stage the data in host memory prior to making the MPI calls.
Now, the Open MPI library will automatically detect that the pointer
being passed in is a CUDA device memory pointer and do the right
thing.
The use of device pointers is supported in all of the send and receive
APIs as well as most of the collective APIs. Neither the collective
reduction APIs nor the one-sided APIs are supported. Here is the list
of APIs that currently support sending and receiving CUDA device memory:
MPI_Send, MPI_Bsend, MPI_Ssend, MPI_Rsend, MPI_Isend, MPI_Ibsend,
MPI_Issend, MPI_Irsend, MPI_Send_init, MPI_Bsend_init, MPI_Ssend_init,
MPI_Rsend_init, MPI_Recv, MPI_Irecv, MPI_Recv_init, MPI_Sendrecv,
MPI_Bcast, MPI_Gather, MPI_Gatherv, MPI_Allgather,
MPI_Allgatherv, MPI_Alltoall, MPI_Alltoallv, MPI_Scatter, MPI_Scatterv
Open MPI depends on various new features of CUDA 4.0, so one needs to
have the CUDA 4.0 driver and toolkit. One new feature of interest
is Unified Virtual Addressing (UVA), which gives all pointers within a
program unique addresses. In addition, there is a new API that
allows one to determine if a pointer is a CUDA device pointer or host
memory pointer. This API is used by the library to decide what needs
to be done with each buffer. In addition, CUDA 4.1 also provides the
ability to register host memory with the CUDA driver which can improve
performance. Until CUDA 4.1 is released, users may see a warning
about trying to register memory and failing. That is just a warning
and can be ignored as things will still work.
If utilizing the driver API, the application needs to ensure that it
has called cuInit() and cuCtxCreate() prior to calling MPI_Init.
With the CUDA runtime API, one needs to make sure that the runtime has
been initialized so that the MPI library has a valid CUDA context.
The Open MPI implementation essentially substitutes cuMemcpy calls for
memcpy calls in the library when device memory is detected. This
means there are some performance effects that should be noted. First,
in order to utilize the cuMemcpy, the library automatically switches
to protocols that first stage the data in host memory. Therefore, for
larger messages, there can be some performance degradation as the
large message RDMA protocols cannot be used for sending device memory
directly. Secondly, there is a latency hit on each cuMemcpy call of
around 10 usecs. This means that one might see an additional 20 usecs
overhead (copy in and copy out) on top of the transport latency.
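The arithmetic can be written out as a small sketch. The 10 usec figure is the approximate per-call cost quoted above; the 2 usec transport latency used in the example call is purely hypothetical.

```python
CUMEMCPY_LATENCY_USEC = 10.0  # approximate per-call cost quoted in the text

def staged_latency_usec(transport_latency_usec, copies=2):
    """Estimate one-way latency when data is staged through host memory.

    copies=2 accounts for the copy in (sender) and the copy out (receiver).
    """
    return transport_latency_usec + copies * CUMEMCPY_LATENCY_USEC

# Hypothetical transport latency of 2 usec: 2 + 2 * 10 = 22 usec.
print(staged_latency_usec(2.0))
```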
Derived datatypes, both contiguous and non-contiguous, are supported.
However, the non-contiguous datatypes currently have high overhead
because of the many calls to cuMemcpy to copy all the pieces of the
buffer into the intermediate buffer.
All of these issues are currently being investigated, and we hope to
improve on them.
See this FAQ entry
for details on how to configure CUDA support into the library.
| 144. Open MPI tells me that it fails to load components with a "file not found" error -- but the file is there! Why does it say this? |
Open MPI loads a lot of plugins at run time. It opens its
plugins via the excellent GNU Libtool libltdl
portability library. If a plugin fails to load, Open MPI queries
libltdl to get a printable string indicating why the plugin failed
to load.
Unfortunately, there is a well-known bug in libltdl that may cause a
"file not found" error message to be displayed, even when the file
is found. The "file not found" error usually masks the real,
underlying cause of the problem. For example:
mca: base: component_find: unable to open /opt/openmpi/mca_ras_dash_host: file not found (ignored)
|
Note that Open MPI put in a libltdl workaround starting with version
1.5. This workaround should print the real reason the plugin failed
to load instead of the erroneous "file not found" message.
There are two common underlying causes why a plugin fails to load:
- The plugin is for a different version of Open MPI. This FAQ entry has more information
about this case.
- The plugin cannot find shared libraries that it requires. For
example, if the
openib plugin fails to load, ensure that
libibverbs.so can be found by the linker at run time (e.g., check
the value of your LD_LIBRARY_PATH environment variable). The same is
true for any other plugins that have shared library dependencies (e.g.,
the mx BTL and MTL plugins need to be able to find
the libmyriexpress.so shared library at run time).
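A quick way to check the second case is to look for the library in the directories listed in LD_LIBRARY_PATH. The sketch below is a simplification: the helper name find_on_library_path is made up, and the real run-time linker also consults other locations (such as its cache and default system directories), so a miss here does not by itself prove that the library cannot be found.

```python
import os

def find_on_library_path(filename, search_path=None):
    """Return the first matching file in a colon-separated search path, or None."""
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue  # empty entries are skipped here for simplicity
        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate):
            return candidate
    return None

print(find_on_library_path("libibverbs.so"))
```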
| 145. I see strange messages about missing symbols in my application; what do these mean? |
Open MPI loads a lot of plugins at run time. It opens its
plugins via the excellent GNU Libtool libltdl
portability library. Sometimes a plugin can fail to load because it
can't resolve all the symbols that it needs. There are a few reasons
why this can happen.
- The plugin is for a different version of Open MPI. See this FAQ entry
for an explanation of how Open MPI might try to open the "wrong"
plugins.
- An application is trying to manually dynamically open
libmpi in
a private symbol space. For example, if an application is not linked
against libmpi, but rather calls something like this:
/* This is a Linux example -- the issue is similar/the same on other
operating systems */
handle = dlopen("libmpi.so", RTLD_NOW | RTLD_LOCAL);
|
This is due to some deep run time linker voodoo -- it is discussed
towards the end of this
post to the Open MPI developer's list. Briefly, the issue is
this:
- The dynamic library
libmpi is opened in a "local" symbol
space.
- MPI_INIT is invoked, which tries to open Open MPI's plugins.
- Open MPI's plugins rely on symbols in
libmpi (and other Open
MPI support libraries); these symbols must be resolved when the plugin
is loaded.
- However, since
libmpi was opened in a "local" symbol space,
its symbols are not available to the plugins that it opens.
- Hence, the plugin fails to load because it can't resolve all of
its symbols, and displays a warning message to that effect.
The ultimate fix for this issue is a bit bigger than Open MPI,
unfortunately -- it's a POSIX issue (as briefly described in the devel
posting, above).
However, there are several common workarounds:
- Dynamically open
libmpi in a public / global symbol scope --
not a private / local scope. This will enable libmpi's symbols to
be available for resolution when Open MPI dynamically opens its
plugins.
- If
libmpi is opened as part of some underlying framework where
it is not possible to change the private / local scope to a public /
global scope, then dynamically open libmpi in a public / global
scope before invoking the underlying framework. This sounds a little
gross (and it is), but at least the run-time linker is smart enough to
not load libmpi twice -- but it does keep libmpi in a public
scope.
- Use the
--disable-dlopen or
--disable-mca-dso options to Open MPI's configure script (see this FAQ entry for more
details on these options). These options slurp all of Open MPI's
plugins up into libmpi -- meaning that the plugins physically
reside in libmpi and will not be dynamically opened at run
time.
- Build Open MPI as a static library by configuring Open MPI with
--disable-shared and --enable-static. This has the same effect as
--disable-dlopen, but it also makes libmpi.a (as opposed to a
shared library).
| 146. What is mca_pml_teg.so? Why am I getting warnings about not finding the mca_ptl_base_modules_initialized symbol from it? |
You may wonder why you see this warning message (put here
verbatim so that it becomes web-searchable):
mca_pml_teg.so:undefined symbol:mca_ptl_base_modules_initialized
|
This happens when you upgrade to Open MPI v1.1 (or later) over an old
installation of Open MPI v1.0.x without previously uninstalling
v1.0.x. There are fairly uninteresting reasons why this problem
occurs; the simplest, safest solution is to uninstall version 1.0.x
and then re-install your newer version. For example:
shell# cd /path/to/openmpi-1.0
shell# make uninstall
[... lots of output ...]
shell# cd /path/to/openmpi-1.1
shell# make install
|
The above example shows changing into the Open MPI 1.1 directory to
re-install, but the same concept applies to any version after Open MPI
version 1.0.x.
Note that this problem is fairly specific to installing / upgrading
Open MPI from the source tarball. Pre-packaged installers (e.g., RPM)
typically do not incur this problem.
| 147. Can I build shared libraries on AIX with the IBM XL compilers? |
Short answer: in older versions of Open MPI, maybe.
Add "LDFLAGS=-Wl,-brtl" to your configure command line:
shell$ ./configure LDFLAGS=-Wl,-brtl ...
|
This enables "runtimelinking", which will make GNU Libtool name the
libraries properly (i.e., *.so). More importantly, runtimelinking
will cause the runtime linker to behave more or less like an ELF
linker would (with respect to symbol resolution).
Future versions of OMPI may not require this flag (and "runtimelinking"
on AIX).
NOTE: As of OMPI v1.2, AIX is
no longer supported.
| 148. Why am I getting a seg fault in libopal? |
It is likely that you did not get a segv in libopal; it is
likely that you are seeing a message like this (with OMPI v1.0 and v1.1):
[0] func:/opt/ompi/lib/libopal.so.0 [0x2a958de8a7]
|
or something like this (OMPI v1.2 and beyond; Linux output shown
below -- looks slightly different on other OS's):
[0] func:/opt/ompi/lib/libopal.so.0(opal_backtrace_print+0x2b) [0x2a958de8a7]
|
This is actually the function that is printing out the stack trace
message; it is not the function that caused the segv itself. The
function that caused the problem will be a few frames below this. Future
versions of OMPI will simply not display this libopal function in the
segv reporting to avoid confusion.
Let's provide a concrete example. Take the following trivial MPI
program that is guaranteed to cause a seg fault in MPI_COMM_WORLD rank
1:
shell$ cat segv.c
#include <mpi.h>
int main(int argc, char **argv)
{
int rank;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 1) {
char *d = 0;
/* This will cause a seg fault */
*d = 3;
}
MPI_Finalize();
return 0;
}
|
Running this code, you'll see something similar to the following:
shell$ mpicc segv.c -o segv -g
shell$ mpirun -np 2 --mca btl tcp,self segv
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:(nil)
[0] func:/opt/ompi/lib/libopal.so.0(opal_backtrace_print+0x2b) [0x2a958de8a7]
[1] func:/opt/ompi/lib/libopal.so.0 [0x2a958dd2b7]
[2] func:/lib64/tls/libpthread.so.0 [0x3be410c320]
[3] func:segv(main+0x3c) [0x400894]
[4] func:/lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x3be361c4bb]
[5] func:segv [0x4007ca]
*** End of error message ***
|
The real error was back up in main, which is #3 on the stack trace.
But Open MPI's stack-tracing function (opal_backtrace_print, in this
case) is what is displayed as #0, so it's an easy mistake to assume
that libopal is the culprit.
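The manual reading of the trace can be mimicked with a short script that skips the frames belonging to the trace-printing and signal-handling machinery. This is a sketch only: the regular expression is modeled on the Linux output shown above (other operating systems look different), and the function name and the default skip list are assumptions made for the example.

```python
import re

# One frame, e.g. "[3] func:segv(main+0x3c) [0x400894]"; the "(symbol+offset)"
# part is optional because some frames do not show a symbol name.
FRAME = re.compile(
    r"^\[(\d+)\] func:(\S+?)(?:\((\w+)\+0x[0-9a-fA-F]+\))? \[0x[0-9a-fA-F]+\]$")

def first_application_frame(trace, skip=("libopal", "libpthread")):
    """Return (frame number, object, symbol) of the first frame not in skip."""
    for line in trace.splitlines():
        m = FRAME.match(line.strip())
        if m and not any(name in m.group(2) for name in skip):
            return int(m.group(1)), m.group(2), m.group(3)
    return None
```

Applied to the trace above, this returns frame 3 (object segv, symbol main), matching the manual analysis.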
| 149. Why am I getting seg faults / MPI parameter errors when compiling C++ applications with the Intel 9.1 C++ compiler? |
Early versions of the Intel 9.1 C++ compiler series had
problems with the Open MPI C++ bindings. Even trivial MPI
applications that used the C++ MPI bindings could incur process
failures (such as segmentation violations) or generate MPI-level
errors complaining about invalid parameters.
Intel released a new version of their 9.1 series C++ compiler on
October 5, 2006 (build 44) that seems to solve all of these issues.
The Open MPI team recommends that all users needing the C++ MPI API
upgrade to this version (or later) if possible. Since the problems
are with the compiler, there is little that Open MPI can do to work
around the issue; upgrading the compiler seems to be the only
solution.
| 150. All my MPI applications segv! Why? (Intel Linux 12.1 compiler) |
Users have reported on the Open MPI users mailing list
multiple times that when they compile Open MPI with the Intel 12.1
compiler suite, Open MPI tools (such as the wrapper compilers,
including mpicc) and MPI applications will seg fault immediately.
As far as we know, this affects both Open MPI v1.4.4 (and later) and
v1.5.4 (and later).
Here's
one example of a user reporting this to the Open MPI User's list.
The cause of the problem has turned out to be a bug in early versions
of the Intel Linux 12.1 compiler series itself. If you upgrade your
Intel compiler to the latest version of the Intel 12.1 compiler suite
and rebuild Open MPI, the problem will go away.
| 151. Why can't I attach my parallel debugger (TotalView, DDT, fx2,
etc.) to parallel jobs? |
As noted in this FAQ
entry, Open MPI supports parallel debuggers that utilize the
TotalView API for parallel process attaching. However, it can
sometimes fail if Open MPI is not installed correctly. Symptoms of
this failure typically involve having the debugger hang (or crash)
when attempting to attach to a parallel MPI application.
Parallel debuggers may rely on having Open MPI's mpirun program
being compiled without optimization. Open MPI's configure and build
process therefore attempts to identify optimization flags and remove
them when compiling mpirun, but it does not have knowledge of all
optimization flags for all compilers. Hence, if you specify some
esoteric optimization flags to Open MPI's configure script, some
optimization flags may slip through the process and create an mpirun
that cannot be read by TotalView and other parallel debuggers.
If you run into this problem, you can manually build mpirun without
optimization flags. Go into the tree where you built Open MPI:
shell$ cd /path/to/openmpi/build/tree
shell$ cd orte/tools/orterun
shell$ make clean
[...output not shown...]
shell$ make all CFLAGS=-g
[...output not shown...]
shell$
|
This will build mpirun (also known as orterun) with just the "-g"
flag. Once this completes, run make install, also from within the
orte/tools/orterun directory, and possibly as root depending on
where you installed Open MPI. Using this new orterun (mpirun),
your parallel debugger should be able to attach to MPI jobs.
Additionally, a user reported to us that setting some TotalView flags
may be helpful with attaching. The user specifically cited the Open
MPI v1.3 series compiled with the Intel 11 compilers and TotalView
8.6, but it may also be helpful for other versions, too:
shell$ export with_tv_debug_flags="-O0 -g -fno-inline-functions"
|
| 152. When launching large MPI jobs, I see messages like: mca_oob_tcp_peer_complete_connect: connection failed: Connection timed out (110) - retrying |
This is a known issue in the Open MPI v1.2 series. Try the
following:
- If you are using Linux-based systems, increase some of the limits
on the node where
mpirun is invoked (you must have
administrator/root privileges to increase these limits):
# The default is 128; increase it to 10,000
shell# echo 10000 > /proc/sys/net/core/somaxconn
# The default is 1,000; increase it to 100,000
shell# echo 100000 > /proc/sys/net/core/netdev_max_backlog
|
- Set the
oob_tcp_listen_mode MCA parameter to the string value
listen_thread. This enables Open MPI's mpirun to respond much
more quickly to incoming TCP connections during job launch, for
example:
shell$ mpirun --mca oob_tcp_listen_mode listen_thread -np 1024 my_mpi_program
|
See this FAQ entry
for more details on how to set MCA parameters.
| 153. How do I find out what MCA parameters are being seen/used by my job? |
As described elsewhere, MCA parameters are the "life's blood" of
Open MPI. MCA parameters are used to control both detailed and large-scale
behavior of Open MPI and are present throughout the code base.
This raises an important question: since MCA parameters can be set from a
file, the environment, the command line, and even internally within Open MPI,
how do I actually know what MCA params my job is seeing, and their value?
One way, of course, is to use the ompi_info command, which is documented
elsewhere (you can use "man ompi_info", or "ompi_info --help" to get more info
on this command). However, this still doesn't fully answer the question since
ompi_info isn't an MPI process.
To help relieve this problem, Open MPI (starting with the 1.3 release)
provides the MCA parameter mpi_show_mca_params that directs the rank=0 MPI process to report the
name of MCA parameters, their current value as seen by that process, and
the source that set that value. The parameter can take several values that define
which MCA parameters to report:
- all: report all MCA params. Note that this typically generates a rather long
list of parameters since it includes all of the default parameters defined inside
Open MPI
- default: MCA params that are at their default settings - i.e., all
MCA params that are at the values set as default within Open MPI
- file: MCA params that had their value set by a file
- api: MCA params set using Open MPI's internal APIs, perhaps to override an incompatible
set of conditions specified by the user
- enviro: MCA params that obtained their value either from the local environment
or the command line. Open MPI treats environmental and command line parameters as
equivalent, so there currently is no way to separate these two sources
These options can be combined in any order by separating them with commas.
Here is an example of the output generated by this parameter:
$ mpirun -mca grpcomm basic -mca mpi_show_mca_params enviro ./hello
ess=env (environment or cmdline)
orte_ess_jobid=1016725505 (environment or cmdline)
orte_ess_vpid=0 (environment or cmdline)
grpcomm=basic (environment or cmdline)
mpi_yield_when_idle=0 (environment or cmdline)
mpi_show_mca_params=enviro (environment or cmdline)
Hello, World, I am 0 of 1
|
Note that several MCA parameters set by Open MPI itself for internal uses are displayed in addition to the
ones actually set by the user.
Since the output from this option can be long, and since it can be helpful to have a more
permanent record of the MCA parameters used for a job, a companion MCA parameter
mpi_show_mca_params_file is provided. If mpi_show_mca_params is also set, the output listing of MCA parameters
will be directed into the specified file instead of being printed to stdout.
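If the listing is saved to a file, it can be post-processed with a few lines of code. The sketch below assumes that every parameter line has the shape "name=value (source)" as in the example output above; only the "environment or cmdline" label appears in that example, so the handling of other source labels is an assumption. The function name parse_mca_report is made up.

```python
import re

# One report line, e.g. "grpcomm=basic (environment or cmdline)".
LINE = re.compile(r"^(\w+)=(.*) \((.+)\)$")

def parse_mca_report(text):
    """Map each MCA parameter name to a (value, source) tuple."""
    params = {}
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:  # lines that do not look like a report entry are ignored
            params[m.group(1)] = (m.group(2), m.group(3))
    return params
```

Ordinary program output, such as the "Hello, World" line in the example, does not match the pattern and is skipped.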
| 154. How do I debug Open MPI processes in parallel? |
This is a difficult question. Debugging in serial can be
tricky: errors, uninitialized variables, stack smashing, etc.
Debugging in parallel adds multiple different dimensions to this
problem: a greater propensity for race conditions, asynchronous
events, and the general difficulty of trying to understand N processes
simultaneously executing -- the problem becomes quite formidable.
This FAQ section does not provide any definitive solutions to
debugging in parallel. At best, it shows some general techniques and
a few specific examples that may be helpful to your situation.
But there are various controls within Open MPI that can help with
debugging. These are probably the most valuable entries in this FAQ
section.
| 155. What tools are available for debugging in parallel? |
There are two main categories of tools that can aid in
parallel debugging:
- Debuggers: Both serial and parallel debuggers are useful.
Serial debuggers are what most programmers are used to (e.g., gdb),
while parallel debuggers can attach to all the individual processes in
an MPI job simultaneously, treating the MPI application as a single
entity. This can be an extremely powerful abstraction, allowing the
user to control every aspect of the MPI job, manually replicate race
conditions, etc.
- Profilers: Tools that analyze your usage of MPI and display
statistics and meta information about your application's run. Some
tools present the information "live" (as it occurs), while others
collect the information and display it in a post mortem analysis.
Both freeware and commercial solutions are available for each kind of
tool.
| 156. How do I run with parallel debuggers? |
See these FAQ entries:
| 157. What controls does Open MPI have that aid in debugging? |
Open MPI has a series of MCA parameters for the MPI layer
itself that are designed to help with debugging. These parameters can
can be set in the
usual ways. MPI-level MCA parameters can be displayed by invoking
the following command:
shell$ ompi_info --param mpi all
|
Here is a summary of the debugging parameters for the MPI layer:
- mpi_param_check: If set to true (any positive value), and when
Open MPI is compiled with parameter checking enabled (the default),
the parameters to each MPI function can be passed through a series of
correctness checks. Problems such as passing illegal values (e.g.,
NULL or MPI_DATATYPE_NULL or other "bad" values) will be discovered
at run time and an MPI exception will be invoked (the default of which
is to print a short message and abort the entire MPI job). If set to
0, these checks are disabled, slightly increasing performance.
- mpi_show_handle_leaks: If set to true (any positive value),
OMPI will display lists of any MPI handles that were not freed before
MPI_FINALIZE (e.g., communicators, datatypes, requests, etc.).
- mpi_no_free_handles: If set to true (any positive value), do
not actually free MPI objects when their corresponding MPI "free"
function is invoked (e.g., do not free communicators when MPI_COMM_FREE
is invoked). This can be helpful in tracking down applications that
accidentally continue to use MPI handles after they have been
freed.
- mpi_show_mca_params: If set to true (any positive value), show
a list of all MCA parameters and their values during MPI_INIT. This
can be quite helpful for reproducibility of MPI applications.
- mpi_show_mca_params_file: If set to a non-empty value, and if
the value of mpi_show_mca_params is true, then output the list of
MCA parameters to the filename value. If this parameter is an empty
value, the list is sent to stderr.
- mpi_keep_peer_hostnames: If set to a true value (any positive
value), send the list of all hostnames involved in the MPI job to
every process in the job. This can help the specificity of error
messages that Open MPI emits if a problem occurs (i.e., Open MPI can
display the name of the peer host that it was trying to communicate
with), but it can somewhat slow down the startup of large-scale
MPI jobs.
- mpi_abort_delay: If nonzero, print out an identifying message
when MPI_ABORT is invoked showing the hostname and PID of the process
that invoked MPI_ABORT, and then delay that many seconds before
exiting. A negative value means to delay indefinitely. This allows a
user to manually come in and attach a debugger when an error occurs.
Remember that the default MPI error handler -- MPI_ERRORS_ABORT --
invokes MPI_ABORT, so this parameter can be useful to discover
problems identified by mpi_param_check.
- mpi_abort_print_stack: If nonzero, print out a stack trace (on
supported systems) when MPI_ABORT is invoked.
- mpi_ddt_<foo>_debug, where <foo> can be one of
pack, unpack, position, or copy: These are internal debugging
features that are not intended for end users (but
ompi_info will
report that they exist).
| 158. Do I need to build Open MPI with compiler/linker debugging
flags (such as -g) to be able to debug MPI applications? |
No.
If you build Open MPI without compiler/linker debugging flags (such as
-g), you will not be able to step inside MPI functions
when you debug your MPI applications. However, this is likely what
you want -- the internals of Open MPI are quite complex and you
probably don't want to start poking around in there.
You'll need to compile your own applications with -g (or whatever
your compiler's equivalent is), but unless you have a need/desire to
be able to step into MPI functions to see the internals of Open MPI,
you do not need to build Open MPI with -g.
| 159. Can I use serial debuggers (such as gdb) to debug MPI
applications? |
Yes; the Open MPI developers do this all the time.
There are two common ways to use serial debuggers:
- Attach to individual MPI processes after they are running.
For example, launch your MPI application as normal with mpirun.
Then login to the node(s) where your application is running and use
the --pid option to gdb to attach to your application.
An inelegant-but-functional technique commonly used with this method
is to insert the following code in your application where you want to
attach:
{
int i = 0;
char hostname[256];
gethostname(hostname, sizeof(hostname));
printf("PID %d on %s ready for attach\n", getpid(), hostname);
fflush(stdout);
while (0 == i)
sleep(5);
}
|
This code will print a line to stdout giving the name of the host
where the process is running and the PID to attach to. It will then
spin on the sleep() function forever waiting for you to attach with
a debugger. Using sleep() as the inside of the loop means that the
processor won't be pegged at 100% while waiting for you to attach.
Once you attach with a debugger, go up the function stack until you
are in this block of code (you'll likely attach during the sleep()),
then set the variable i to a nonzero value. With GDB, the syntax
is:
(gdb) set var i = 7
|
Then set a breakpoint after your block of code and continue execution
until the breakpoint is hit. Now you have control of your live MPI
application and can use the full functionality of the debugger.
You can even add conditionals to only allow this "pause" in the
application for specific MPI processes (e.g., MPI_COMM_WORLD rank 0,
or whatever process is misbehaving).
- Use
mpirun to launch xterms (or equivalent) with
serial debuggers.
This technique launches a separate window for each MPI process in
MPI_COMM_WORLD, each one running a serial debugger (such as gdb)
that will launch and run your MPI application. Having a separate
window for each MPI process can be quite handy for low process-count
MPI jobs, but requires a bit of setup and configuration that is
outside of Open MPI to work properly. A naive approach would be to
assume that the following would immediately work:
shell$ mpirun -np 4 xterm -e gdb my_mpi_application
|
Unfortunately, it likely won't work. Several factors must be
considered:
- What launcher is Open MPI using? In an rsh/ssh environment, Open
MPI will default to using
ssh when it is available, falling back to
rsh when ssh cannot be found in the $PATH. But note that Open
MPI closes the ssh (or rsh) sessions when the MPI job starts for
scalability reasons. This means that the built-in SSH X forwarding
tunnels will be shut down before the xterms can be launched.
Although it is possible to force Open MPI to keep its SSH connections
active (to keep the X tunneling available), we recommend using
non-SSH-tunneled X connections, if possible (see below).
- In non-rsh/ssh environments (such as when using resource
managers), the environment of the process invoking
mpirun may be
copied to all nodes. In this case, the DISPLAY environment variable
may not be suitable.
- Some operating systems default to disabling the X11 server from
listening for remote/network traffic. For example, see this
post on the user's mailing list, describing how to enable network
access to the X11 on Fedora Linux.
- There may be intermediate firewalls or other network blocks that
prevent X traffic from flowing between the hosts where the MPI
processes (and
xterms) are running and the host connected
to the output display.
The easiest way to get remote X applications (such as xterm) to
display on your local screen is to forgo the security of SSH-tunneled
X forwarding. In a closed environment such as an HPC cluster, this may
be an acceptable practice (indeed, you may not even have the option of
using SSH X forwarding if SSH logins to cluster nodes are disabled),
but check with your security administrator to be sure.
If using non-encrypted X11 forwarding is permissible, we recommend the
following:
- For each non-local host where you will be running an MPI process,
add it to your X server's permission list with the
xhost command.
For example:
shell$ cat my_hostfile
inky
blinky
stinky
clyde
shell$ for host in `cat my_hostfile` ; do xhost +$host ; done
|
- Use the
-x option to mpirun to export an appropriate DISPLAY
variable so that the launched X applications know where to send their
output. An appropriate value is usually (but not always) the
hostname containing the display where you want the output and the :0
(or :0.0) suffix. For example:
shell$ hostname
arcade.example.com
shell$ mpirun -np 4 --hostfile my_hostfile \
-x DISPLAY=arcade.example.com:0 xterm -e gdb my_mpi_application
|
Note that X traffic is fairly "heavy" -- if you are operating over a
slow network connection, it may take some time before the xterm
windows appear on your screen.
- If your
xterm supports it, the -hold option may be useful.
-hold tells xterm to stay open even when the application has
completed. This means that if something goes wrong (e.g., gdb fails
to execute, or unexpectedly dies, or ...), the xterm window will
stay open allowing you to see what happened, instead of closing
immediately and losing whatever error message may have been
output.
- When you have finished, you may wish to disable X11 network
permissions from the hosts that you were using. Use
xhost again to
disable these permissions:
shell$ for host in `cat my_hostfile` ; do xhost -$host ; done
|
Note that mpirun will not complete until all the xterms
complete.
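The grant and revoke steps above can be wrapped in a small shell helper. The following is only a sketch: the function name xhost_each and the DRY_RUN variable are made up for illustration and are not part of Open MPI or X11. It assumes one host per hostfile line and ignores anything after the host name (such as a slot count), blank lines, and lines starting with "#".

```shell
# Hypothetical helper: run "xhost +HOST" or "xhost -HOST" for every host
# listed in a hostfile. Set DRY_RUN=1 to print the commands instead of
# running them.
xhost_each() {
    sign=$1        # "+" to grant access, "-" to revoke it
    hostfile=$2

    # Keep the first word of each line (the host name); skip blank
    # lines and comment lines.
    awk '!/^[ \t]*(#|$)/ { print $1 }' "$hostfile" |
    while read -r host; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "xhost ${sign}${host}"
        else
            xhost "${sign}${host}"
        fi
    done
}
```

For example, run "xhost_each + my_hostfile" before launching the xterms and "xhost_each - my_hostfile" once the job has finished. With DRY_RUN=1 set, the commands are only printed.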
| 160. My process dies without any output. Why? |
There may be many reasons for this; the Open MPI Team
strongly encourages the use of tools (such as debuggers) whenever
possible.
One of the reasons, however, may come from inside Open MPI itself. If
your application fails due to memory corruption, Open MPI may
subsequently fail to output an error message before dying.
Specifically, starting with v1.3, Open MPI attempts to aggregate error
messages from multiple processes in an attempt to show unique error
messages only once (vs. one for each MPI process -- which can be
unwieldy, especially when running large MPI jobs).
However, this aggregation process requires allocating memory in the
MPI process when it displays the error message. If the process'
memory is already corrupted, Open MPI's attempt to allocate memory may
fail and the process will simply die, possibly silently. When Open
MPI does not attempt to aggregate error messages, most of its setup
work is done during MPI_INIT and no memory is allocated during the
"print the error" routine. It therefore almost always successfully
outputs error messages in real time -- but at the expense that you'll
potentially see the same error message for each MPI process that
encountered the error.
Hence, the error message aggregation is usually a good thing, but
sometimes it can mask a real error. You can disable Open MPI's error
message aggregation with the orte_base_help_aggregate MCA
parameter. For example:
shell$ mpirun --mca orte_base_help_aggregate 0 ...
|
| 161. What is Memchecker? |
The Memchecker MCA allows MPI-semantic checking of your application
(as well as of the internals of Open MPI) with the help of memory
checking tools such as Memcheck from the Valgrind suite
(http://www.valgrind.org/).
The Memchecker component is included in Open MPI v1.3 and later.
| 162. What kind of errors can Memchecker find? |
Memchecker is built on the Memcheck tool from Valgrind, so it
inherits all of Memcheck's capabilities. First, it checks all reads
and writes of memory, and intercepts calls to malloc/new/free/delete.
Most importantly, Memchecker is able to detect user buffer errors in
both non-blocking and one-sided communications, e.g., reading or
writing to buffers of active non-blocking receive operations and
writing to buffers of active non-blocking send operations.
Here are some examples of errors that Memchecker can detect:
Accessing buffer under control of non-blocking communication:
int buf;
MPI_Irecv(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
/* The following line will produce a memchecker warning */
buf = 4711;
MPI_Wait (&req, &status);
|
Wrong input parameters, e.g. wrongly sized send buffers:
char *send_buffer;
send_buffer = malloc(5);
memset(send_buffer, 0, 5);
/* The following line will produce a memchecker warning */
MPI_Send(send_buffer, 10, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
|
Accessing window under control of one-sided communication:
MPI_Get(A, 10, MPI_INT, 1, 0, 1, MPI_INT, win);
/* The following line will produce a memchecker warning */
A[0] = 4711;
MPI_Win_fence(0, win);
|
Uninitialized input buffers:
char *buffer;
buffer = malloc(10);
/* The following line will produce a memchecker warning */
MPI_Send(buffer, 10, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
|
Use of the uninitialized MPI_ERROR field of the MPI_Status structure
(the MPI-1 standard defines the MPI_ERROR field to be undefined for
single-completion calls such as MPI_Wait or MPI_Test; see MPI-1 p. 22):
MPI_Wait(&request, &status);
/* The following line will produce a memchecker warning */
if (status.MPI_ERROR != MPI_SUCCESS)
return ERROR;
|
| 163. How can I use Memchecker? |
To use Memchecker, you need Open MPI 1.3 or later, and
Valgrind 3.2.0 or later.
As this functionality is off by default, you need to turn it on
with the configure flag --enable-memchecker. configure will then
check for a recent Valgrind distribution and include the compilation
of ompi/opal/mca/memchecker. You can verify that the component has
been built by using the ompi_info application. Please note that
all of this only makes sense together with --enable-debug, which
Valgrind needs in order to output messages pointing directly to
the relevant source code lines. Without debugging information,
the messages from Valgrind are nearly useless.
Here is a configuration example to enable Memchecker:
shell$ ./configure --prefix=/path/to/openmpi --enable-debug \
--enable-memchecker --with-valgrind=/path/to/valgrind
|
To check whether Memchecker was successfully enabled after installation,
run the following command:
shell$ ompi_info | grep memchecker
|
You should see output like this:
MCA memchecker: valgrind (MCA v1.0, API v1.0, Component v1.3)
|
Otherwise, you probably didn't configure and install Open MPI correctly.
| 164. How to run my MPI application with Memchecker? |
First of all, you have to make sure that Valgrind 3.2.0 or
later is installed, and that Open MPI is compiled with Memchecker
enabled. Then simply run your application with Valgrind, e.g.:
shell$ mpirun -np 2 valgrind ./my_app
|
Or if you enabled Memchecker, but you don't want to check the
application at this time, then just run your application as
usual. E.g.:
shell$ mpirun -np 2 ./my_app
|
| 165. Does Memchecker cause performance degradation to my application? |
The configure option --enable-memchecker (together with --enable-debug) does
cause performance degradation, even if not running under Valgrind.
The following explains the mechanism and may help in making the decision
whether to provide a cluster-wide installation with --enable-memchecker.
There are two cases:
Further information and performance data with the NAS Parallel Benchmarks
may be found in the paper Memory Debugging of MPI-Parallel
Applications in Open MPI.
| 166. Is Open MPI 'Valgrind-clean' or how can I identify real errors? |
This issue has been raised many times on the mailing list, e.g.,
here or
here.
There are many situations where Open MPI purposefully does not initialize
memory that it subsequently communicates, e.g., by calling writev.
Furthermore, several cases are known where memory is not properly freed upon
MPI_Finalize.
This makes it harder to distinguish real errors from false positives.
Valgrind provides functionality to suppress errors and warnings from certain
function contexts.
To ease debugging with Valgrind, starting with v1.5, Open MPI
provides a so-called Valgrind suppression file that can be passed on the
command line:
shell$ mpirun -np 2 valgrind --suppressions=$PREFIX/share/openmpi/openmpi-valgrind.supp ./my_app
|
More information on suppression-files and how to generate
them can be found in
Valgrind's Documentation.
| 167. Can I make Open MPI use rsh instead of ssh? |
Yes. The method to do this has changed over the different
versions of Open MPI.
- v1.3 series: The
orte_rsh_agent MCA parameter accepts a
colon-delimited list of programs to search for in your path to use as
the remote startup agent (the MCA parameter name plm_rsh_agent also
works, but it is deprecated). The default value is "ssh : rsh",
meaning that it will look for ssh first, and if it doesn't find it,
use rsh. You can change the value of this parameter as relevant to
your environment, such as simply changing it to rsh or rsh : ssh
if you have a mixture.
- v1.1 and v1.2 series: The v1.1 and v1.2 method is exactly the
same as the v1.3 method, but the MCA parameter name is slightly
different:
pls_rsh_agent ("pls" vs. "plm"). Using the old
"pls" name will continue to work in the v1.3 series, but it is now
officially deprecated -- you'll receive a warning if you use it.
- v1.0 series: In the 1.0.x series, Open MPI defaults to using
ssh for remote startup of processes in unscheduled environments.
You can change this to rsh by setting the MCA
parameter pls_rsh_agent to rsh.
See this FAQ entry
for details on how to set MCA parameters -- particularly with
multi-word values.
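To make the search order concrete, the sketch below mimics in plain shell how a colon-delimited agent list such as "ssh : rsh" can be resolved against the $PATH: the first entry whose program can be found wins. It is an illustration only, not Open MPI's implementation, and the function name first_agent_in_path is hypothetical.

```shell
# Hypothetical illustration: print the first program from a
# colon-delimited list (e.g. "ssh : rsh") that can be found, or
# return 1 if none of them is available.
first_agent_in_path() {
    agents=$1

    # Split the list on colons into the positional parameters.
    saved_ifs=$IFS
    IFS=:
    set -- $agents
    IFS=$saved_ifs

    for agent in "$@"; do
        # Drop surrounding blanks and any arguments; keep the program name.
        prog=$(printf '%s\n' "$agent" | awk '{ print $1 }')
        if [ -n "$prog" ] && command -v "$prog" >/dev/null 2>&1; then
            printf '%s\n' "$prog"
            return 0
        fi
    done
    return 1
}
```

Calling first_agent_in_path "ssh : rsh" prints ssh when it is installed and rsh otherwise. Reversing the list to "rsh : ssh" prefers rsh, which is what changing the MCA parameter value achieves.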
| 168. What pre-requisites are necessary for running an Open MPI job under rsh/ssh? |
In general, they are the same for running Open MPI jobs in
other environments (see this FAQ
category for more general information).
| 169. How can I make ssh not ask me for a password? |
If you are using ssh to launch processes on remote nodes,
there are multiple ways.
Note that there are multiple versions of ssh available. References
to ssh in this text refer to OpenSSH.
This documentation provides an overview for using user keys and the
OpenSSH 2.x key management agent (if your OpenSSH only supports 1.x
key management, you should upgrade). See the OpenSSH documentation
for more details and a more thorough description. The process is
essentially the same for other versions of SSH, but the command names
and filenames may be slightly different. Consult your SSH
documentation for more details.
Normally, when you use ssh to connect to a remote host, it will
prompt you for your password. However, for mpirun (and mpiexec,
which, in Open MPI, is identical to mpirun) to work properly, you
need to be able to execute jobs on remote nodes without typing in a
password. In order to do this, you will need to set up a passphrase.
We recommend using RSA passphrases as they are generally "better"
(i.e., more secure) than DSA passphrases. As such, this text will
describe the process for RSA setup.
NOTE: This text will briefly
show you the steps involved in doing this, but the ssh documentation
is authoritative on these matters and should be consulted for more
information.
The first thing that you need to do is generate an RSA key pair with
ssh-keygen:
shell$ ssh-keygen -t rsa
Accept the default value for the file in which to store the key
($HOME/.ssh/id_rsa) and enter a passphrase for your key pair. You
may choose to not enter a passphrase and therefore obviate the need
for using the ssh-agent. However, this greatly
weakens the authentication that is possible, because your secret key
is potentially vulnerable to compromise because it is unencrypted.
It has been compared to the moral equivalent of leaving a plain text
copy of your password in your $HOME directory. See the ssh
documentation for more details.
Next, copy the $HOME/.ssh/id_rsa.pub file generated by ssh-keygen
to $HOME/.ssh/authorized_keys (or add it to the end of
authorized_keys if that file already exists):
shell$ cd $HOME/.ssh
shell$ cp id_rsa.pub authorized_keys
|
In order for RSA authentication to work, you need to have the
$HOME/.ssh directory in your home directory on all the machines where
you are running Open MPI. If your home directory is on a common
filesystem, this may be already taken care of. If not, you will need to
copy the $HOME/.ssh directory to your home directory on all Open
MPI nodes (be sure to do this in a secure manner -- perhaps using the
scp command -- particularly if your secret key is not encrypted).
ssh is very particular about file permissions. Ensure that your home
directory on all your machines is set to at least mode 755, your
$HOME/.ssh directory is also set to at least mode 755, and that the
following files inside $HOME/.ssh have at least the following
permissions:
-rw-r--r-- authorized_keys
-rw------- id_rsa
-rw-r--r-- id_rsa.pub
-rw-r--r-- known_hosts
|
The phrase "at least" in the above paragraph means the following:
- The files need to be readable by you
- The files should only be writable by you
- The files should not be executable
- Aside from
id_rsa, the files can be readable by others, but
do not need to be
- Your
$HOME and $HOME/.ssh directories can be readable by
others, but do not need to be
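The permission rules above can be applied with a few chmod calls. The helper below is a sketch only; the name tighten_ssh_perms is hypothetical, and it takes the home directory as an argument so that it can be tried on a scratch directory first. It removes group and other write access from the two directories, sets the public files to mode 644, and sets the private key to mode 600.

```shell
# Hypothetical helper: tighten permissions for HOME_DIR and HOME_DIR/.ssh
# following the rules listed above.
tighten_ssh_perms() {
    home_dir=$1
    ssh_dir=$home_dir/.ssh

    # Directories: remove write access for group and others.
    chmod go-w "$home_dir" "$ssh_dir"

    # Public files: readable by anyone, writable only by the owner.
    for name in authorized_keys id_rsa.pub known_hosts; do
        if [ -f "$ssh_dir/$name" ]; then
            chmod 644 "$ssh_dir/$name"
        fi
    done

    # The private key is readable and writable by the owner only.
    if [ -f "$ssh_dir/id_rsa" ]; then
        chmod 600 "$ssh_dir/id_rsa"
    fi
}
```

A typical call would be tighten_ssh_perms "$HOME", repeated on every node that does not share your home directory.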
You are now set up to use RSA authentication. However, when you ssh
to a remote host, you will still be asked for your RSA passphrase
(as opposed to your normal password). This is where the ssh-agent
program comes in. It allows you to type in your RSA passphrase once,
and then have all successive invocations of ssh automatically
authenticate you against the remote host. See the ssh-agent(1)
documentation for more details than what are provided here.
Additionally, check the documentation and setup of your local
environment; ssh-agent may already be set up for you (e.g., see if
the shell environment variable $SSH_AUTH_SOCK exists; if so,
ssh-agent is likely already running). If ssh-agent is not already
running, you can start it manually with the following:
shell$ eval `ssh-agent`
Note the specific invocation method: ssh-agent prints some shell
commands on its standard output (e.g., setting the SSH_AUTH_SOCK
environment variable), and eval executes them in your current shell.
You will probably want to start the ssh-agent before you start your
graphics / windowing system so that all your windows will inherit the
environment variables set by this command. Note that some sites
invoke ssh-agent for each user upon login automatically; be sure to
check and see if there is an ssh-agent running for you already.
Once the ssh-agent is running, you can tell it your passphrase by
running the ssh-add command:
shell$ ssh-add $HOME/.ssh/id_rsa
At this point, if you ssh to a remote host that has the same
$HOME/.ssh directory as your local one, you should not be prompted
for a password or passphrase. If you are, a common problem is that
the permissions in your $HOME/.ssh directory are not as they should
be.
Note that this text has covered the ssh commands in very little
detail. Please consult the ssh documentation for more information.
| 170. What is a .rhosts file? Do I need it? |
If you are using rsh to launch processes on remote nodes,
you will probably need to have a $HOME/.rhosts file.
This file allows you to execute commands on remote nodes without being
prompted for a password. The permissions on this file usually must be
0644 (rw-r--r--). It must exist in your home directory on every
node that you plan to use Open MPI with.
Each line in the .rhosts file indicates a machine and user that
programs may be launched from. For example, if the user
steve wishes to launch programs from the machine stevemachine to
the machines alpha, beta, and gamma, there must be a .rhosts
file on each of the three remote machines (alpha, beta, and
gamma) with at least the following line in it:
stevemachine steve
The first field indicates the name of the machine where jobs may
originate from; the second field indicates the user ID who may
originate jobs from that machine. It is better to supply a
fully-qualified domain name for the machine name (for security reasons
-- there may be many machines named stevemachine on the internet).
So the above example should be:
stevemachine.example.com steve
|
The Open MPI Team strongly discourages the use of "+" in the .rhosts
file. This is always a huge security hole.
If rsh does not find a matching line in the $HOME/.rhosts file, it
will prompt you for a password. Open MPI requires the password-less
execution of commands; if rsh prompts for a password, mpirun will
fail.
NOTE: Some implementations of
rsh are very picky about the format of text in the .rhosts file.
In particular, some do not allow leading white space on each line in
the .rhosts file, and will give a misleading "permission denied"
error if you have white space before the machine name.
NOTE: rsh is not considered "secure" or "safe" -- .rhosts
authentication is considered fairly weak. The Open MPI Team
recommends that you use ssh ("Secure Shell") to launch remote
programs as it uses a much stronger authentication system.
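Both pitfalls mentioned in this entry, leading white space and "+" entries, are easy to check for mechanically. The following sketch uses only grep; the function name check_rhosts is hypothetical, and the "+" test is deliberately simple (it looks for a "+" standing alone as a field).

```shell
# Hypothetical helper: report .rhosts lines that commonly cause trouble.
# Returns 0 if nothing suspicious was found, 1 otherwise.
check_rhosts() {
    file=$1
    status=0

    # Leading white space confuses some rsh implementations.
    if grep -n '^[[:space:]]' "$file" >&2; then
        echo "warning: the lines above start with white space" >&2
        status=1
    fi

    # A bare "+" in either field admits any host or any user.
    if grep -n -E '(^|[[:space:]])\+([[:space:]]|$)' "$file" >&2; then
        echo "warning: the lines above contain a \"+\" wildcard" >&2
        status=1
    fi

    return $status
}
```

Running check_rhosts "$HOME/.rhosts" on each node prints the offending lines with their line numbers on standard error.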
| 171. Should I use + in my .rhosts file? |
No!
While there are a very small number of cases where using "+" in
your .rhosts file may be acceptable, the Open MPI Team highly
recommends that you do not.
Using a "+" in your .rhosts file indicates that you will allow
any machine and/or any user to connect as you. This is extremely
dangerous, especially on machines that are connected to the internet.
Consider the fact that anyone on the internet can connect to your
machine (as you) -- it should strike fear into your heart.
The + should not be used for either field of the .rhosts file.
Instead, you should use the full and proper hostname and username of
accounts that are authorized to remotely login as you to that machine
(or machines). This is usually just a list of your own username on a
list of machines that you wish to run Open MPI with. See this FAQ entry for further details, as well
as your local rsh documentation.
Additionally, the Open MPI Team strongly recommends that rsh not be
used in unscheduled environments (especially those connected to the
internet) -- it is considered weak remote authentication. Instead, we
recommend the use of ssh -- the secure remote shell. See this FAQ entry for more details.
| 172. What versions of BProc does Open MPI work with? |
As of December 2005, Open MPI supports recent versions of
BProc, such as those found in Clustermatic.
We have not tested with older forks of the BProc project, such as
those from Scyld.
Since Open MPI's BProc support uses some advanced support from recent
BProc versions, it is somewhat doubtful (but totally untested) as to
whether it would work on Scyld systems.
Feedback and/or assistance in getting Open MPI to run properly on
Scyld systems would be greatly appreciated.
| 173. What pre-requisites are necessary for running an Open MPI job under BProc? |
In general, they are the same for running Open MPI jobs in
other environments (see this FAQ
category for more general information).
However, it is worth noting that BProc may not bring all
necessary dynamic libraries with a process when it is migrated to a
back-end compute node. Plus, Open MPI opens components on the fly
(i.e., after the process has started), so if these components are
unavailable on the back-end compute nodes, Open MPI applications may
fail.
In general, the Open MPI team recommends one of the following two
solutions when running on BProc clusters (in order):
- Compile Open MPI statically, meaning that Open MPI is built as
static ".a" libraries with all components included in the library
(as opposed to dynamic ".so" libraries, and separate ".so" files for
each component that are found and loaded at run-time) so that
applications do not need to find any shared libraries or components
when they are migrated to back-end compute nodes. This can be
accomplished by specifying --enable-static --disable-shared to the
configure script when building Open MPI.
- If you do not wish to use static compilation, ensure that Open MPI
is fully installed on all nodes (i.e., the head node and all compute
nodes) in the same directory location. For example, if Open MPI is
installed in
/opt/openmpi-1.6.4 on the head node, ensure that
it is also installed in that same directory on all the compute
nodes.
| 174. How do I run jobs under SLURM? |
The short answer is you can use mpirun as normal, or directly launch
your application using srun.
The longer answer is that Open MPI supports launching parallel jobs in
all three methods that SLURM supports:
- Launching via "salloc ...": supported (older versions of SLURM used "srun -A ...")
- Launching via "sbatch ...": supported (older versions of SLURM used "srun -B ...")
- Launching via "srun -n X my_mpi_application"
Specifically, you can launch Open MPI's mpirun in an interactive
SLURM allocation (via the salloc command) or you can submit a
script to SLURM (via the sbatch command), or you can "directly"
launch MPI executables via srun.
Open MPI automatically obtains both the list of hosts and how many
processes to start on each host from SLURM directly. Hence, it is
unnecessary to specify the --hostfile, --host, or -np options to
mpirun. Open MPI will also use SLURM-native mechanisms to launch
and kill processes (rsh and/or ssh are not required).
For example:
# Allocate a SLURM job with 4 nodes
shell$ salloc -N 4 sh
# Now run an Open MPI job on all the nodes allocated by SLURM
# (Note that you need to specify -np for the 1.0 and 1.1 series;
# the -np value is inferred directly from SLURM starting with the
# v1.2 series)
shell$ mpirun my_mpi_application
|
This will run the 4 MPI processes on the nodes that were allocated by
SLURM. Equivalently, you can do this:
# Allocate a SLURM job with 4 nodes and run your MPI application in it
shell$ salloc -N 4 mpirun my_mpi_application
|
Or, if submitting a script:
shell$ cat my_script.sh
#!/bin/sh
mpirun my_mpi_application
shell$ sbatch -N 4 my_script.sh
Submitted batch job 1234
shell$
|
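The node count can also be recorded in the batch script itself with an #SBATCH directive instead of being passed on the sbatch command line. This is a minimal sketch, assuming the same my_mpi_application as above and Open MPI v1.2 or later (so that -np can be omitted); the file name my_script.sh is just an example.

```shell
# Write a batch script that requests 4 nodes via an #SBATCH directive.
cat > my_script.sh <<'EOF'
#!/bin/sh
#SBATCH -N 4
# mpirun obtains the host list and process count from SLURM, so no
# --hostfile, --host, or -np option is given here.
mpirun my_mpi_application
EOF

# Submit it with:  sbatch my_script.sh
```

Keeping the resource request inside the script means that the same file can be resubmitted later without having to remember the command-line options.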
| 175. Does Open MPI support "srun -n X my_mpi_application"? |
Yes
| 176. I use SLURM on a cluster with the OpenFabrics network stack. Do I need to do anything special? |
Yes. You need to ensure that SLURM sets up the locked memory
limits properly. Be sure to see this FAQ entry about
locked memory and this FAQ entry for
references about SLURM.
| 177. How do I reduce startup time for jobs on large clusters? |
There are several ways to reduce the startup time on large clusters. Some
of them are described on this page. We continue to work on making startup even
faster, especially on the large clusters coming in future years.
Open MPI 1.3 is significantly faster and more robust than its predecessors. We
recommend that anyone running large jobs and/or on large clusters make the
upgrade to the 1.3 series.
| 178. Where should I put my libraries: Network vs. local filesystems? |
Open MPI itself doesn't really care where its libraries are stored. However, where they
are stored does have an impact on startup times, particularly for large clusters, which
can be mitigated somewhat through use of Open MPI's configuration options.
Startup times will always be minimized by storing the libraries local to each node, either
on local disk or in RAM-disk. The latter is sometimes problematic since the libraries do
consume some space, thus potentially reducing memory that would have been available for
MPI processes.
There are two main considerations for large clusters that need to place the Open MPI libraries
on networked file systems:
- While DSOs are more flexible, you definitely do not want to use them when the Open MPI
libraries will be mounted on a network file system! Doing so will lead to significant network
traffic and delayed start times, especially on clusters with a large number of nodes. Instead,
be sure to configure your build with --disable-dlopen. This will include the DSOs in the
main libraries, resulting in much faster startup times.
- Many networked file systems use automount for user level directories, as well as for some
locally administered system directories. There are many reasons why system administrators may
choose to automount such directories. MPI jobs, however, tend to launch very quickly, thereby
creating a situation wherein a large number of nodes will nearly simultaneously demand automount
of a specific directory. This can overload NFS servers, resulting in delayed response or even
failed automount requests.
Note that this applies to both automount of directories containing Open MPI libraries as well
as directories containing user applications. Since these are unlikely to be the same location,
multiple automount requests from each node are possible, thus increasing the level of traffic.
| 179. Static vs shared libraries? |
It is perfectly fine to use either shared or static libraries. Shared libraries will
save memory when operating multiple processes per node, especially on clusters with high
numbers of cores on a node, but can also take longer to launch on networked file systems
(see network vs. local filesystem FAQ entry
for suggestions on how to mitigate such problems).
| 180. How do I reduce the time to wireup OMPI's out-of-band communication system? |
Open MPI's run-time uses an out-of-band (OOB) communication subsystem to pass messages
during the launch, initialization, and termination stages for the job. These messages allow
mpirun to tell its daemons what processes to launch, and allow the daemons in turn to forward
stdio to mpirun, update mpirun on process status, etc.
The OOB uses TCP sockets for its communication, with each daemon opening a socket back to
mpirun upon startup. In a large cluster, this can mean thousands of connections being formed
on the node where mpirun resides, and requires that mpirun actually process all these connection
requests. By default, mpirun processes connection requests sequentially, so on large clusters
a backlog can build up that causes remote daemons to time out waiting for a response.
Fortunately, Open MPI provides an alternative mechanism for processing connection requests
that helps alleviate this problem. Setting the MCA parameter oob_tcp_listen_mode to
listen_thread causes mpirun to start up a separate thread dedicated to responding to connection
requests. Thus, remote daemons receive a quick response to their connection request, allowing
mpirun to deal with the message as soon as possible.
This parameter can be included in the default MCA parameter file, placed in the user's environment,
or added to the mpirun command line. See this FAQ entry
for more details on how to set MCA parameters.
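As a rough sketch, the three places look like this for oob_tcp_listen_mode. The helper name add_mca_param is hypothetical. The file locations mentioned in the comments ($HOME/.openmpi/mca-params.conf for a single user, etc/openmpi-mca-params.conf under the installation prefix for a site-wide default) and the OMPI_MCA_ environment prefix are the conventional ones, but treat them as assumptions and confirm them in the FAQ entry on setting MCA parameters for your version.

```shell
# Hypothetical helper: append "name = value" to an MCA parameter file,
# creating the containing directory if needed.
add_mca_param() {
    file=$1
    name=$2
    value=$3
    mkdir -p "$(dirname "$file")"
    printf '%s = %s\n' "$name" "$value" >> "$file"
}

# 1. Parameter file (per-user location assumed; the site-wide file is
#    assumed to live under the installation prefix):
#    add_mca_param "$HOME/.openmpi/mca-params.conf" oob_tcp_listen_mode listen_thread

# 2. Environment: prefix the parameter name with OMPI_MCA_.
export OMPI_MCA_oob_tcp_listen_mode=listen_thread

# 3. Command line:
#    mpirun --mca oob_tcp_listen_mode listen_thread ...
```

A site-wide file suits a setting that every job on a large cluster should pick up, while the environment variable and the command-line form are convenient for trying the setting on a single job first.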
| 181. Why is my job failing because of file descriptor limits? |
This is a known issue in Open MPI releases prior to the v1.3 series. The problem lies
in the connection topology for Open MPI's out-of-band (OOB) communication subsystem. Prior to the
1.3 series, a fully-connected topology was used that required every process to open a connection
to every other process in the job. This can rapidly overwhelm the usual system limits.
There are two methods you can use to circumvent the problem. First, upgrade to the v1.3 series if
you can - this would be our recommended approach as there are considerable improvements in that
series vs. the 1.2 one.
If you cannot upgrade and must stay with the v1.2 series, then you need to increase the number of
file descriptors in your system limits. This commonly requires that your system administrator
increase the number of file descriptors allowed by the system itself. The number required depends
both on the number of nodes in your cluster and the max number of processes you plan to run on
each node. Assuming you want to allow jobs that fully occupy the cluster, then the minimum number
of file descriptors you will need is roughly (#procs_on_a_node+1) * #procs_in_the_job.
It is always wise to have a few extra just in case :-)
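The estimate is easy to compute in the shell. The sketch below evaluates the formula for an example job and compares the result with the current shell's open-file limit as reported by ulimit -n. The function name min_oob_fds and the example sizes (8 processes per node, 512 processes in the job) are made up for illustration, and whether the shell's limit is the one that applies to your MPI processes depends on how they are launched.

```shell
# Rule of thumb from the text above:
#   (#procs_on_a_node + 1) * #procs_in_the_job
min_oob_fds() {
    procs_on_a_node=$1
    procs_in_the_job=$2
    echo $(( (procs_on_a_node + 1) * procs_in_the_job ))
}

# Example sizes (hypothetical): 8 processes per node, 512 in the job.
needed=$(min_oob_fds 8 512)
limit=$(ulimit -n)

# ulimit -n may print "unlimited", which cannot be compared numerically.
if [ "$limit" != "unlimited" ] && [ "$limit" -lt "$needed" ]; then
    echo "open-file limit $limit is below the estimated $needed descriptors"
else
    echo "open-file limit $limit covers the estimated $needed descriptors"
fi
```

For the example sizes the estimate is (8+1) * 512 = 4608 descriptors, before adding any safety margin or the extra descriptors needed by the MPI TCP transport.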
Note that this only covers the file descriptors needed for the out-of-band communication subsystem.
It specifically does not address file descriptors needed to support the MPI TCP transport, if that
is being used on your system. If it is, then additional file descriptors will be required for those
TCP sockets. Unfortunately, a simple formula cannot be provided for that value as it depends completely
on the number of point-to-point TCP connections being made. If you believe that users may want to
fully connect an MPI job via TCP, then it would be safest to simply double the number of file
descriptors calculated above.
This can, of course, get to be a really big number...which is why you might want to consider
upgrading to the v1.3 series, where OMPI only opens #nodes OOB connections on each node. We are
currently working on even more sparsely connected topologies for very large clusters, with the
goal of constraining the number of connections opened on a node to an arbitrary number as
specified by an MCA parameter.
| 182. I know my cluster's configuration - how can I take advantage of that knowledge? |
Clusters rarely change from day-to-day, and large clusters rarely change at all.
If you know your cluster's configuration, there are several steps you can take to
both reduce Open MPI's memory footprint and reduce the launch time of large-scale applications.
These steps use a combination of build-time configuration options to eliminate components -
thus eliminating their libraries and avoiding unnecessary component open/close operations - as
well as run-time MCA parameters to specify what modules to use by default for most users.
One way to save memory is to avoid building components that will actually never be selected
by the system. Unless MCA parameters specify which components to open, built components are
always opened and tested as to whether or not they should be selected for use.
If you know that a component can build on your system, but due to your
cluster's configuration will never actually be selected, then it is best to simply configure
OMPI to not build that component by using the --enable-mca-no-build configure option.
For example, if you know that your system will only utilize the "ob1"
component of the PML framework, then you can no_build all the others. This not only reduces
memory in the libraries, but also reduces memory footprint that is consumed by Open MPI opening
all the built components to see which of them can be selected to run.
In some cases, however, a user may optionally choose to use a component other than the default.
For example, you may want to build all of the routed framework components, even though the
vast majority of users will simply use the default binomial component. This means you have
to allow the system to build the other components, even though they may rarely be used.
You can still save launch time and memory, though, by setting the routed=binomial MCA parameter
in the default MCA parameter file. This causes OMPI to not open the other components during startup,
but allows users to override this on their command line or in their environment
so no functionality is lost - you just save some memory and time.
Rather than have to figure this all out by hand, we are working on a new OMPI tool called
ompi-profiler. When run on a cluster, it will tell you the selection results of all frameworks -
i.e., for each framework on each node, which component was selected to run - and a variety of other
information that will help you tailor Open MPI for your cluster. Stay tuned for more info as
we continue to work on ways to improve your performance...
| 183. What is the Modular Component Architecture (MCA)? |
The Modular Component Architecture (MCA) is the backbone for
much of Open MPI's functionality. It is a series of frameworks,
components, and modules that are assembled at run-time to create
an MPI implementation.
Frameworks: An MCA framework manages zero or more components at run
time and is targeted at a specific task (e.g., provide MPI collective
operation functionality). Each MCA framework supports a single
component type, but may support multiple versions of that type. The
framework uses the services from the MCA base functionality to find
and/or load components.
Components: An MCA component is an implementation of a framework's
interface. It is a standalone collection of code that can be bundled
into a plugin that can be inserted into the Open MPI code base,
either at run-time and/or compile-time.
Modules: An MCA module is an instance of a component (in the C++
sense of the word "instance"; an MCA component is analogous to a C++
class). For example, if a node running an Open MPI application has
multiple ethernet NICs, the Open MPI application will contain one TCP
MPI point-to-point component, but two TCP point-to-point modules.
Frameworks, components, and modules can be dynamic or static. That
is, they can be available as plugins or they may be compiled statically
into libraries (e.g., libmpi).
| 184. What are MCA parameters? |
MCA parameters are the basic unit of run-time tuning for Open
MPI. They are simple "key = value" pairs that are used extensively
throughout the code base. The general rules of thumb that the
developers use are:
- Instead of using a constant for an important value, make it an MCA
parameter
- If a task can be implemented in multiple, user-discernible ways,
implement as many as possible and make choosing between them be an MCA
parameter
For example, an easy MCA parameter to describe is the boundary between
short and long messages in TCP wire-line transmissions. "Short"
messages are sent eagerly whereas "long" messages use a rendezvous
protocol. The decision point between these two protocols is the
overall size of the message (in bytes). By making this value an MCA
parameter, it can be changed at run-time by the user or system
administrator to use a sensible value for a particular environment or
set of hardware (e.g., a value suitable for 100 Mbps Ethernet is
probably not suitable for Gigabit Ethernet, and may require a
different value for 10 Gigabit Ethernet).
Note that MCA parameters may be set in several different ways
(described in another FAQ entry). This allows, for example, system
administrators to fine-tune the Open MPI installation for their
hardware / environment such that normal users can simply use the
default values.
More specifically, HPC environments -- and the applications that run
on them -- tend to be unique. Providing extensive run-time tuning
capabilities through MCA parameters allows the customization of Open
MPI to each system's / user's / application's particular needs.
| 185. What frameworks are in Open MPI? |
There are three types of frameworks in Open MPI: those in the
MPI layer (OMPI), those in the run-time layer (ORTE), and those in the
operating system / platform layer (OPAL).
The specific list of frameworks varies between each major release
series of Open MPI. See the links below to FAQ entries for specific
versions of Open MPI:
| 186. What frameworks are in Open MPI v1.2 (and prior)? |
The comprehensive list of frameworks in Open MPI is
continually being augmented. As of August 2005, here is the current
list:
OMPI frameworks
- allocator: Memory allocator
- bml: BTL management layer (managing multiple devices)
- btl: Byte transfer layer (point-to-point byte movement)
- coll: MPI collective algorithms
- io: MPI-2 I/O functionality
- mpool: Memory pool management
- pml: Point-to-point management layer (fragmenting, reassembly,
top-layer protocols, etc.)
- osc: MPI-2 one-sided communication
- ptl: (outdated / deprecated) MPI point-to-point transport layer
- rcache: Memory registration management
- topo: MPI topology information
ORTE frameworks
- errmgr: Error manager
- gpr: General purpose registry
- iof: I/O forwarding
- ns: Name server
- oob: Out-of-band communication
- pls: Process launch subsystem
- ras: Resource allocation subsystem
- rds: Resource discovery subsystem
- rmaps: Resource mapping subsystem
- rmgr: Resource manager (upper meta layer for all other Resource
frameworks)
- rml: Remote messaging layer (routing of OOB messages)
- schema: Name schemas
- sds: Startup discovery services
- soh: State of health
OPAL frameworks
- maffinity: Memory affinity
- memory: Memory hooks
- paffinity: Processor affinity
- timer: High-resolution timers
| 187. What frameworks are in Open MPI v1.3? |
The comprehensive list of frameworks in Open MPI is
continually being augmented. As of November 2008, here is the current
list in the Open MPI v1.3 series:
OMPI frameworks
- allocator: Memory allocator
- bml: BTL management layer
- btl: MPI point-to-point Byte Transfer Layer, used for MPI
point-to-point messages on some types of networks
- coll: MPI collective algorithms
- crcp: Checkpoint/restart coordination protocol
- dpm: MPI-2 dynamic process management
- io: MPI-2 I/O
- mpool: Memory pooling
- mtl: Matching transport layer, used for MPI point-to-point
messages on some types of networks
- osc: MPI-2 one-sided communications
- pml: MPI point-to-point management layer
- pubsub: MPI-2 publish/subscribe management
- rcache: Memory registration cache
- topo: MPI topology routines
ORTE frameworks
- errmgr: RTE error manager
- ess: RTE environment-specific services
- filem: Remote file management
- grpcomm: RTE group communications
- iof: I/O forwarding
- odls: OpenRTE daemon local launch subsystem
- oob: Out of band messaging
- plm: Process lifecycle management
- ras: Resource allocation system
- rmaps: Resource mapping system
- rml: RTE message layer
- routed: Routing table for the RML
- snapc: Snapshot coordination
OPAL frameworks
- backtrace: Debugging call stack backtrace support
- carto: Cartography (host/network mapping) support
- crs: Checkpoint and restart service
- installdirs: Installation directory relocation services
- maffinity: Memory affinity
- memchecker: Run-time memory checking
- memcpy: Memcpy copy support
- memory: Memory management hooks
- paffinity: Processor affinity
- timer: High-resolution timers
| 188. How do I know what components are in my Open MPI installation? |
The ompi_info command, in addition to providing a wealth of
configuration information about your Open MPI installation, will list
all components (and the frameworks that they belong to) that are
available. These include system-provided components as well as
user-provided components.
| 189. How do I install my own components into an Open MPI installation? |
By default, Open MPI looks in two places for components at
run-time (in order):
- $prefix/lib/openmpi/: This is the system-provided components
directory, part of the installation tree of Open MPI itself.
- $HOME/.openmpi/components/: This is where users can drop their
own components that will automatically be "seen" by Open MPI at
run-time. This is ideal for developmental, private, or otherwise
unstable components.
Note that the directories and search ordering used for finding
components in Open MPI are themselves controlled by an MCA parameter.
Setting mca_component_path changes this value (a colon-delimited list of
directories).
Note also that components are only used on nodes where they are
"visible." Hence, if your $prefix/lib/openmpi/ is a directory on a
local disk that is not shared via a network filesystem to other nodes
where you run MPI jobs, then components that are installed to that
directory will only be used by MPI jobs running on the local node.
More specifically: components have the same visibility as normal
files. If you need a component to be available to all nodes where you
run MPI jobs, then you need to ensure that it is visible on all nodes
(typically either by installing it on all nodes for non-networked
filesystem installs, or by installing it in a directory that is
visible to all nodes via a networked filesystem). Open MPI does not
automatically send components to remote nodes when MPI jobs are run.
| 190. How do I know what MCA parameters are available? |
The ompi_info command can list the parameters for a given
component, all the parameters for a specific framework, or all
parameters. Most parameters contain a description of the parameter;
all will show the parameter's current value.
For example:
shell$ ompi_info --param all all
|
Shows all the MCA parameters for all components that ompi_info
finds, whereas:
shell$ ompi_info --param btl all
|
Shows all the MCA parameters for all BTL components that ompi_info
finds. Finally:
shell$ ompi_info --param btl tcp
|
Shows all the MCA parameters for the TCP BTL component.
| 191. How do I set the value of MCA parameters? |
There are four main ways to set MCA parameters; they are
searched in the order listed below.
- Command line: The highest-precedence method is setting MCA
parameters on the command line. For example:
shell$ mpirun --mca mpi_show_handle_leaks 1 -np 4 a.out
|
This sets the MCA parameter mpi_show_handle_leaks to the value of 1
before running a.out with four processes. In general, the format
used on the command line is "--mca <param_name> <value>".
Note that when setting multi-word values, you need to use quotes to ensure that the shell and Open MPI understand that they are a single value. For example:
shell$ mpirun --mca param "value with multiple words" ...
|
- Environment variable: Next, environment variables are searched.
Any environment variable named
OMPI_MCA_<param_name> will be
used. For example, the following has the same effect as the previous
example (for sh-flavored shells):
shell$ OMPI_MCA_mpi_show_handle_leaks=1
shell$ export OMPI_MCA_mpi_show_handle_leaks
shell$ mpirun -np 4 a.out
|
Or, for csh-flavored shells:
shell% setenv OMPI_MCA_mpi_show_handle_leaks 1
shell% mpirun -np 4 a.out
|
Note that setting environment variables to values with multiple words
requires quoting, such as:
# sh-flavored shells
shell$ OMPI_MCA_param="value with multiple words"
# csh-flavored shells
shell% setenv OMPI_MCA_param "value with multiple words"
|
- Aggregate MCA parameter files: Simple text files can be used to
set MCA parameter values for a specific application. See this FAQ entry (Open MPI version 1.3
and higher).
- Files: Finally, simple text files can be used to set MCA
parameter values. Parameters are set one per line (comments are
permitted). For example:
# This is a comment
# Set the same MCA parameter as in previous examples
mpi_show_handle_leaks = 1
|
Note that quotes are not necessary for setting multi-word values in
MCA parameter files. Indeed, if you use quotes in the MCA parameter
file, they will be used as part of the value itself. For example:
# The following two values are different:
param1 = value with multiple words
param2 = "value with multiple words"
|
By default, two files are searched (in order):
- $HOME/.openmpi/mca-params.conf: The user-supplied set of
values takes the highest precedence.
- $prefix/etc/openmpi-mca-params.conf: The system-supplied set
of values has a lower precedence.
More specifically, the MCA parameter mca_param_files specifies a
colon-delimited path of files to search for MCA parameters. Files to
the left have lower precedence; files to the right are higher
precedence.
Keep in mind that, just like components, these parameter files are
only relevant where they are "visible" (see this FAQ entry). Specifically,
Open MPI does not read all the values from these files during startup
and then send them to all nodes in the job -- the files are read on
each node during each process' startup. This is intended behavior: it
allows for per-node customization, which is especially relevant in
heterogeneous environments.
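As a small sketch of the precedence rules described in this entry, the same parameter can be set in the environment and on the command line. The command-line value is the one that takes effect. The run is guarded so that the snippet does nothing harmful on a machine without mpirun or without an a.out in the current directory.

```shell
# Lower precedence: the environment (sh-flavored shells).
OMPI_MCA_mpi_show_handle_leaks=0
export OMPI_MCA_mpi_show_handle_leaks

if command -v mpirun >/dev/null 2>&1 && [ -x ./a.out ]; then
    # Higher precedence: the command-line value (1) overrides the
    # environment value (0) for this run only.
    mpirun --mca mpi_show_handle_leaks 1 -np 4 ./a.out
else
    echo "mpirun or ./a.out not available; skipping the run"
fi
```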
| 192. What are Aggregate MCA (AMCA) parameter files? |
Starting with version 1.3, aggregate MCA (AMCA) parameter
files contain MCA parameter key/value pairs similar to the
$HOME/.openmpi/mca-params.conf file described in this FAQ entry.
The motivation behind AMCA parameter sets came from the realization
that for certain applications a large number of MCA parameters are
required for the application to run well and/or as the user
expects. Since these MCA parameters are application specific (or even
application run specific) they should not be set in a global manner,
but only pulled in as determined by the user.
MCA parameters set in AMCA parameter files will override any MCA
parameters supplied in global parameter files (e.g.,
$HOME/.openmpi/mca-params.conf), but not command line or environment
parameters.
AMCA parameter files are typically supplied on the command line via
the -am option.
For example, consider an AMCA parameter file called foo.conf
placed in the same directory as the application a.out. A user
will typically run the application as:
shell$ mpirun -np 2 a.out
|
To use the foo.conf AMCA parameter file this command line
changes to:
shell$ mpirun -np 2 -am foo.conf a.out
|
If the user wants to override a parameter set in foo.conf they
can add it to the command line as seen below.
shell$ mpirun -np 2 -am foo.conf -mca btl tcp,self a.out
|
AMCA parameter files can be coupled if more than one file is to be
used. If we have another AMCA parameter file called bar.conf
that we want to use we add it to the command line as follows:
shell$ mpirun -np 2 -am foo.conf:bar.conf a.out
|
AMCA parameter files are loaded in priority order. In this example, the
foo.conf AMCA file has priority over the bar.conf file. So
if the bar.conf file sets the MCA parameter
mpi_leave_pinned=0 and the foo.conf file sets
mpi_leave_pinned=1, then the value from foo.conf (1) will be used.
The location of an AMCA parameter file is resolved in much the same way
as the shell resolves a command name. If no path is given (e.g., foo.conf) then
Open MPI will search the $SYSCONFDIR/amca-param-sets directory, then
the current working directory. If a relative path is specified then
only that path will be searched (e.g., ./foo.conf,
baz/foo.conf). If an absolute path is specified then only that
path will be searched (e.g., /bip/boop/foo.conf).
Though the typical use case for AMCA parameter files is to be
specified on the command line, they can also be set as MCA parameters
in the environment. The MCA parameter (mca_base_param_file_prefix)
contains a ':' separated list of AMCA parameter files exactly as they
would be passed to the -am command line option. The MCA
parameter (mca_base_param_file_path) specifies the path to search for
AMCA files with relative paths. By default this is
$SYSCONFDIR/amca-param-sets/:$CWD.
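To make this concrete, here is a minimal sketch that creates two AMCA parameter files in a scratch directory and selects them through the environment instead of the -am option. The file contents are hypothetical values chosen only for illustration. The environment variable name follows the usual OMPI_MCA_<param_name> convention applied to mca_base_param_file_prefix.

```shell
# Work in a scratch directory so no real configuration is touched.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1

# foo.conf: application-specific settings (hypothetical values).
cat > foo.conf <<'EOF'
btl = tcp,self
mpi_leave_pinned = 1
EOF

# bar.conf: a second set; foo.conf has priority where both set a value.
cat > bar.conf <<'EOF'
mpi_leave_pinned = 0
EOF

# Same list as would be passed to "-am foo.conf:bar.conf".
OMPI_MCA_mca_base_param_file_prefix="foo.conf:bar.conf"
export OMPI_MCA_mca_base_param_file_prefix

if command -v mpirun >/dev/null 2>&1 && [ -x ./a.out ]; then
    mpirun -np 2 ./a.out
else
    echo "mpirun or ./a.out not available; skipping the run"
fi
```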
| 193. How do I select which components are used? |
Each MCA framework has a top-level MCA parameter that helps
guide which components are selected to be used at run-time.
Specifically, there is an MCA parameter of the same name as each MCA
framework that can be used to include or exclude components from a
given run.
For example, the btl MCA parameter is used to control which BTL
components are used (i.e., MPI point-to-point communications; see this FAQ entry for a full list of MCA
frameworks). It can take as a value a comma-separated list of
components with the optional prefix "^". For example:
# Tell Open MPI to exclude the tcp and openib BTL components
# and implicitly include all the rest
shell$ mpirun --mca btl ^tcp,openib ...
# Tell Open MPI to include *only* the components listed here and
# implicitly ignore all the rest (i.e., the loopback, shared memory,
# and Myrinet/GM MPI point-to-point components):
shell$ mpirun --mca btl self,sm,gm ...
|
Note that ^ can only be the prefix of the entire value because the
inclusive and exclusive behavior are mutually exclusive.
Specifically, since the exclusive behavior means "use all components
except these," it does not make sense to mix it with the inclusive
behavior of not specifying it (i.e., "use all of these components").
Hence, something like this:
shell$ mpirun --mca btl self,sm,openib,^tcp ...
|
does not make sense because it says both "use only the self, sm,
and openib components" and "use all components except tcp" and
will result in an error.
Just as with all MCA parameters, the btl parameter (and all
framework parameters) can be set in
multiple different ways.
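For instance, the same selection can be made through the environment. The sketch below also includes a small helper, is_valid_selection, that checks the "^ may only prefix the entire value" rule before a value is used. The helper is hypothetical and not part of Open MPI; Open MPI performs its own check and reports an error for mixed lists.

```shell
# Exclude the tcp and openib BTL components for every job started from
# this shell. Single quotes keep the shell from interpreting "^".
OMPI_MCA_btl='^tcp,openib'
export OMPI_MCA_btl

# Hypothetical helper: succeed only if "^" is absent or appears solely
# as the first character of the whole value.
is_valid_selection() {
    case "$1" in
        "")    return 1 ;;
        '^'*)  rest="${1#'^'}" ;;
        *)     rest="$1" ;;
    esac
    case "$rest" in
        "")      return 1 ;;
        *'^'*)   return 1 ;;
        *)       return 0 ;;
    esac
}

for value in '^tcp,openib' 'self,sm,gm' 'self,sm,openib,^tcp'; do
    if is_valid_selection "$value"; then
        echo "ok:      $value"
    else
        echo "invalid: $value"
    fi
done
```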
| 194. What is processor affinity? Does Open MPI support it? |
Open MPI supports processor affinity on a variety of systems
through process binding, in which each MPI process, along with its
threads, is "bound" to a specific subset of processing resources
(cores, sockets, etc.). That is, the operating system will constrain
that process to run on only that subset. (Other processes might be
allowed on the same resources.)
Affinity can improve performance by inhibiting excessive process
movement -- for example, away from "hot" caches or NUMA memory.
Judicious bindings can improve performance by reducing resource contention
(by spreading processes apart from one another) or improving interprocess
communications (by placing processes close to one another). Binding can
also improve performance reproducibility by eliminating variable process
placement. Unfortunately, binding can also degrade performance by
inhibiting the OS capability to balance loads.
You can run the "ompi_info" command and look for "paffinity"
components to see if your system is supported. For example:
$ ompi_info | grep paffinity
MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.0)
|
Note that processor affinity probably should not be used when a node
is over-subscribed (i.e., more processes are launched than there are
processors). This can lead to a serious degradation in performance
(even more than simply oversubscribing the node). Open MPI will
usually detect this situation and automatically disable the use of
processor affinity (and display run-time warnings to this effect).
Also see this FAQ entry for how to use
processor and memory affinity in Open MPI.
| 195. What is memory affinity? Does Open MPI support it? |
Memory affinity is only relevant for Non-Uniform Memory Access
(NUMA) machines, such as "big iron" SGI and Cray machines, or many
models of multi-processor Opteron machines. In a NUMA architecture,
memory is physically distributed throughout the machine even though it
is virtually treated as a single address space. That is, memory may
be physically local to one or more processors -- and therefore remote
to other processors.
Simply put: some memory will be faster to access (for a given process)
than others.
Open MPI supports general and specific memory affinity, meaning that
it generally tries to allocate all memory local to the processor that
asked for it. When shared memory is used for communication, Open MPI
uses memory affinity to make certain pages local to specific
processes in order to minimize memory network/bus traffic.
Open MPI supports memory affinity on a variety of systems. You can
run the "ompi_info" command and look for "maffinity"
components to see if your system is supported. For example:
$ ompi_info | grep maffinity
MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.0)
|
Note that memory affinity support is enabled
only when processor affinity is enabled. Specifically: using memory
affinity does not make sense if processor affinity is not enabled
because processes may allocate local memory and then move to a
different processor, potentially remote from the memory that it just
allocated.
Also see this FAQ entry for how to use
processor and memory affinity in Open MPI.
| 196. How do I tell Open MPI to use processor and/or memory affinity? |
Assuming that your system supports processor and memory
affinity (check ompi_info for "paffinity" and "maffinity"
components), you can explicitly tell Open MPI to use them when running
MPI jobs.
Note that memory affinity support is enabled
only when processor affinity is enabled. Specifically: using memory
affinity does not make sense if processor affinity is not enabled
because processes may allocate local memory and then move to a
different processor, potentially remote from the memory that it just
allocated.
Also note that processor and memory affinity is meaningless (but
harmless) on uniprocessor machines.
How to enable / use processor and memory affinity in Open MPI depends
on which version you are using:
| 197. How do I tell Open MPI to use processor and/or memory affinity
in Open MPI v1.2.x? (What is mpi_paffinity_alone?) |
Open MPI 1.2 offers only crude control, with the MCA
parameter "mpi_paffinity_alone". For example:
$ mpirun --mca mpi_paffinity_alone 1 -np 4 a.out
|
(Just like any other MCA parameter, mpi_paffinity_alone can be set
via any of the normal MCA parameter
mechanisms.)
On each node where your job is running, your job's MPI processes will
be bound, one-to-one, in the order of their global MPI ranks, to the
lowest-numbered processing units (for example, cores or hardware threads)
on the node as identified by the OS. Further, memory affinity will also
be enabled if it is supported on the node,
as described in a different FAQ entry.
If multiple jobs are launched on the same node in this manner, they will
compete for the same processing units and severe performance degradation
will likely result. Therefore, this MCA parameter is best used when you
know your job will be "alone" on the nodes where it will run.
Since each process is bound to a single processing unit, performance will
likely suffer catastrophically if processes are multi-threaded.
Depending on how processing units on your node are numbered, the binding
pattern may be good, bad, or even disastrous. For example, performance
might be best if processes are spread out over all processor sockets on
the node. The processor ID numbering, however, might lead to
mpi_paffinity_alone filling one socket before moving to another.
Indeed, on nodes with multiple hardware threads per core (e.g.,
"HyperThreads", "SMT", etc.), the numbering could lead to multiple
processes being bound to a core before the next core is considered.
In such cases, you should probably upgrade to a newer version of Open MPI
or use a different, external mechanism for processor binding.
Note that Open MPI will automatically disable processor affinity on
any node that is oversubscribed (i.e., where more Open MPI processes
are launched in a single job on a node than it has processors) and
will print out warnings to that effect.
Also note, however, that processor affinity is not mutually exclusive with
Degraded performance mode. Degraded mode is usually only used when
oversubscribing nodes (i.e., running more processes on a node than it
has processors -- see this FAQ entry for
more details about oversubscribing, as well as a definition of
Degraded performance mode). It is possible to manually select
Degraded performance mode and use processor affinity as long as you
are not oversubscribing.
| 198. How do I tell Open MPI to use processor and/or memory affinity
in Open MPI v1.3.x? (What are rank files?) |
Open MPI 1.3 supports the mpi_paffinity_alone MCA parameter
that is described in this FAQ
entry.
Open MPI 1.3 (and higher) also allows a different binding to be specified
for each process via a rankfile. Consider the following example:
shell$ cat rankfile
rank 0=host0 slot=2
rank 1=host1 slot=4-7,0
rank 2=host2 slot=1:0
rank 3=host3 slot=1:2-3
shell$ mpirun -np 4 -hostfile hostfile --rankfile rankfile ./my_mpi_application
or
shell$ mpirun -np 4 -hostfile hostfile --mca rmaps_rank_file_path rankfile ./my_mpi_application
|
The rank file specifies a host node and slot list binding for each
MPI process in your job. Note:
- Typically, the slot list is a comma-delimited list of ranges. The
numbering is OS/BIOS-dependent and refers to the finest grained processing
units identified by the OS -- for example, cores or hardware threads.
- Alternatively, a colon can be used in the slot list for socket:core
designations. For example, 1:2-3 means cores 2-3 of socket 1.
- It is strongly recommended that you provide a full rankfile when
using such affinity settings, otherwise there would be a very high
probability of processor oversubscription and performance degradation.
- The hosts specified in the rankfile must be known to mpirun,
for example via a list of hosts in a hostfile or as obtained from a
resource manager.
- The number of processes (np) must be provided on the mpirun command
line.
- If some processing units are not available -- e.g., due to
unpopulated sockets, idled cores, or BIOS settings -- the syntax assumes
a logical numbering in which numbers are contiguous despite the physical
gaps. You may refer to actual physical numbers with a "p" prefix.
For example, rank 4=host3 slot=p3:2
will bind rank 4 to physical socket 3, physical core 2.
Rank files are also discussed on the mpirun man page.
If you want to use the same "slot list" binding for each process,
presumably in cases where there is only one process per node, you can
specify this slot list on the command line rather than having to use a
rank file:
shell$ mpirun -np 4 -hostfile hostfile --slot-list 0:1 ./my_mpi_application
|
Remember, every process will use the same slot list. If multiple processes
run on the same host, they will bind to the same resources -- in this case,
socket0:core1, presumably oversubscribing that core and ruining performance.
Slot lists can be used to bind to multiple slots, which would be helpful for
multi-threaded processes. For example:
- Two threads per process: rank 0=host1 slot=0,1
- Four threads per process: rank 0=host1 slot=0,1,2,3
Note that no thread will be bound to a specific slot within the list. OMPI
only supports process level affinity; each thread will be bound to all
of the slots within the list.
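Writing a full rankfile by hand becomes tedious for larger jobs. The following is a minimal sketch of a generator that assigns one rank per slot, filling each host in turn. The make_rankfile function and the host names are hypothetical, and the sketch assumes that slots on every host are numbered from 0; adjust it to the numbering your OS reports.

```shell
# make_rankfile SLOTS_PER_HOST HOST...
# Emits one "rank N=host slot=S" line per slot, host by host.
make_rankfile() {
    slots_per_host="$1"
    shift
    rank=0
    for host in "$@"; do
        slot=0
        while [ "$slot" -lt "$slots_per_host" ]; do
            echo "rank $rank=$host slot=$slot"
            rank=$((rank + 1))
            slot=$((slot + 1))
        done
    done
}

# Two slots on each of two hosts gives a four-rank rankfile.
rankfile="$(mktemp)"
make_rankfile 2 host0 host1 > "$rankfile"
cat "$rankfile"
```

The resulting file can then be passed to mpirun with --rankfile, with -np set to the number of lines generated.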
| 199. How do I tell Open MPI to use processor and/or memory affinity
in Open MPI v1.4.x? (How do I use the --by* and --bind-to-* options?) |
Open MPI 1.4 supports all the same processor affinity controls
as Open MPI v1.3, but also
supports additional command-line binding switches to mpirun:
- --bind-to-none: Do not bind processes. (Default)
- --bind-to-core: Bind each MPI process to a core.
- --bind-to-socket: Bind each MPI process to a processor socket.
- --report-bindings: Report how the launched processes were bound
by Open MPI.
In the case of cores with multiple hardware threads (e.g., HyperThreads or
SMT), only the first hardware thread on each core is used with the
--bind-to-* options. This will hopefully be fixed in the Open MPI v1.5 series.
The above options are typically most useful when used with the
following switches that indicate how processes are to be laid out in
the MPI job. To be clear: if the following options are used without
a --bind-to-* option, they only have the effect of deciding which
node a process will run on. Only the --bind-to-* options actually
bind a process to a specific (set of) hardware resource(s).
- --byslot: Alias for --bycore.
- --bycore: When laying out processes, put sequential MPI
processes on adjacent processor cores. (Default)
- --bysocket: When laying out processes, put sequential MPI
processes on adjacent processor sockets.
- --bynode: When laying out processes, put sequential MPI
processes on adjacent nodes.
Note that --bycore and --bysocket lay processes out in terms of the
actual hardware rather than by some node-dependent numbering, which
is what mpi_paffinity_alone does as described
in this FAQ entry.
Finally, there is a poorly-named "combination" option that affects both process
layout counting and binding: --cpus-per-proc (and an even more poorly-named
alias --cpus-per-rank).
Editor's note: I feel that these options are poorly named for two
reasons: 1) "cpu" is not consistently defined (e.g., it may be a
core, or may be a hardware thread, or it may be something else), and
2) even though many users use the terms "rank" and "MPI process"
interchangeably, they are NOT the same thing.
This option does the following:
- Takes an integer argument (ncpus) that indicates how
many operating system processor IDs (which may be cores or may be
hardware threads) should be bound to each MPI process.
- Allocates and binds ncpus OS processor IDs to each MPI process.
For example, on a machine with 4 processor sockets, each with 4
processor cores, each with one hardware thread:
shell$ mpirun -np 8 --cpus-per-proc 2 my_mpi_process
|
This command will bind each MPI process to ncpus=2
cores. All cores on the machine will be used.
- Note that ncpus cannot be more than the number of OS processor
IDs in a single processor socket. Put loosely: --cpus-per-proc only
allows binding to multiple cores/threads within a single socket.
The --cpus-per-proc option can also be used with the --bind-to-* options
in some cases, but this code is not well tested and may result in
unexpected binding behavior. Test carefully to see where processes
actually get bound before relying on the behavior for production runs.
The --cpus-per-proc and other affinity-related command line options
are likely to be revamped some time during the Open MPI v1.5 series.
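As a sketch of how a layout switch and a binding switch from this entry combine, the following places sequential processes on adjacent sockets, binds each process to its socket, and asks Open MPI to report the result. The process count and the a.out program are placeholders, and the run is guarded so the snippet only prints the command when mpirun or the program is missing.

```shell
# Layout (--bysocket) plus binding (--bind-to-socket), with a report of
# where each process ended up.
binding_opts="--bysocket --bind-to-socket --report-bindings"

if command -v mpirun >/dev/null 2>&1 && [ -x ./a.out ]; then
    # $binding_opts is deliberately unquoted so it splits into options.
    mpirun -np 4 $binding_opts ./a.out
else
    echo "would run: mpirun -np 4 $binding_opts ./a.out"
fi
```

Reading the --report-bindings output is the safest way to confirm that processes landed where you intended.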
| 200. How do I tell Open MPI to use processor and/or memory affinity
in Open MPI v1.5.x? |
Open MPI 1.5 currently has the same processor affinity
controls as Open MPI v1.4. This
FAQ entry is a placeholder for future enhancements to the 1.5 series'
processor and memory affinity features.
Stay tuned!
| 201. Does Open MPI support calling fork(), system(), or popen() in MPI processes? |
It depends on a lot of factors, including (but not limited to):
- The operating system
- The underlying compute hardware
- The network stack (see this FAQ entry for more details)
- Interactions with other middleware in the MPI process
In some cases, Open MPI will determine that it is not safe to
fork(). In these cases, Open MPI will register a pthread_atfork()
callback to print a warning when the process forks.
This warning is helpful for legacy MPI applications where the current
maintainers are unaware that system() or popen() is being invoked from
an obscure subroutine nestled deep in millions of lines of Fortran code
(we've seen this kind of scenario many times).
However, this atfork handler can be dangerous because there is no way
to unregister an atfork handler. Hence, packages that
dynamically open Open MPI's libraries (e.g., Python bindings for Open
MPI) may fail if they finalize and unload libmpi, but later call
fork. The atfork system will try to invoke Open MPI's atfork handler;
nothing good can come of that.
For such scenarios, or if you simply want to disable printing the
warning, Open MPI can be set to never register the atfork handler with
the mpi_warn_on_fork MCA parameter. For example:
shell$ mpirun --mca mpi_warn_on_fork 0 ...
|
Of course, systems that dlopen libmpi may not use Open MPI's mpirun,
and therefore may need to use a
different mechanism to set MCA parameters.
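One such mechanism is the environment. As a minimal sketch for sh-flavored shells, the parameter can be exported before starting the program that loads libmpi, using the OMPI_MCA_<param_name> naming convention described elsewhere in this FAQ:

```shell
# Ask Open MPI not to register its atfork handler in any program
# subsequently started from this shell.
OMPI_MCA_mpi_warn_on_fork=0
export OMPI_MCA_mpi_warn_on_fork

# Confirm what child processes will inherit.
env | grep '^OMPI_MCA_mpi_warn_on_fork='
```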
| 202. I want to run some performance benchmarks with Open MPI. How do I do that? |
Running benchmarks correctly is extremely difficult.
There are many, many factors to take into account; it
is not as simple as just compiling and running a stock benchmark
application. This FAQ entry is by no means a definitive guide, but it
does try to offer some suggestions for generating accurate, meaningful
benchmarks.
- Decide exactly what you are benchmarking and set up your system
accordingly. For example, if you are trying to benchmark maximum
performance, then many of the suggestions listed below are extremely
relevant (be the only user on the systems and network in question, be
the only software running, use processor affinity, etc.). If you're
trying to benchmark average performance, some of the suggestions below
may be less relevant. Regardless, it is critical to know exactly
what you're trying to benchmark, and know (not guess) both your
system and the benchmark application itself well enough to understand
what the results mean.
To be specific, many benchmark applications are not well understood
for exactly what they are testing. There have been many cases where
users run a given benchmark application and wrongfully conclude that
their system's performance is bad -- solely on the basis of a single
benchmark that they did not understand. Read the documentation of the
benchmark carefully, and possibly even look into the code itself to
see exactly what it is testing.
Case in point: not all ping-pong benchmarks are created equal. Most
users assume that a ping-pong benchmark is a ping-pong benchmark is a
ping-pong benchmark. But this is not true; the common ping-pong
benchmarks tend to test subtly different things (e.g., NetPIPE, TCP
bench, IMB, OSU, etc.). Make sure you understand what your
benchmark is actually testing.
- Make sure that you are the only user on the systems where you
are running the benchmark to eliminate contention from other
processes.
- Make sure that you are the only user on the entire network /
interconnect to eliminate network traffic contention from other
processes. This is usually somewhat difficult to do, especially in
larger, shared systems. But your most accurate, repeatable results
will be achieved when you are the only user on the entire
network.
- Disable all services and daemons that are not being used. Even
"harmless" daemons consume system resources (such as RAM) and cause
"jitter" by occasionally waking up, consuming CPU cycles, reading
or writing to disk, etc. The optimum benchmark system has an absolute
minimum number of system services running.
- Use processor affinity on multi-processor/core machines to
prevent the operating system from moving MPI processes between
processors (and causing unnecessary cache thrashing, for
example).
On NUMA architectures, having processes bumped from one
socket to another is more expensive in terms of cache locality (with
all of the cache coherency overhead that comes with the lack of it)
than in terms of hypertransport routing (see below).
Non-NUMA architectures such as the Intel Woodcrest have a flat access
time to the South Bridge, but cache locality is still important so CPU
affinity is always a good thing to do.
- Be sure to understand your system's architecture, particularly
with respect to the memory, disk, and network characteristics, and
test accordingly. For example, on NUMA architectures, most common
being Opteron, the South Bridge is connected through a hypertransport
link to one CPU on one socket. Which socket depends on the
motherboard, but it should be described in the motherboard
documentation (it's not always socket 0!). If a process on the other
socket needs to write something to a NIC on a PCIE bus behind the
South Bridge, it needs to first hop through the first socket. On
modern machines (circa late 2006), this hop usually costs something
like 100ns (i.e., 0.1 us). If the socket is further away, like in a 4
or 8-socket configuration, there could potentially be more hops,
leading to more latency.
- Compile your benchmark with the appropriate compiler optimization
flags. With some MPI implementations, the compiler wrappers (like
mpicc, mpif90, etc.) add optimization flags automatically.
Open MPI does not. Add -O or other flags explicitly.
- Make sure your benchmark runs for a sufficient amount of time.
Short-running benchmarks are generally less accurate because they take
fewer samples; longer-running jobs tend to take more samples.
- If your benchmark is trying to benchmark extremely short events
(such as the time required for a single ping-pong of messages):
- Perform some "warmup" events first. Many MPI implementations
(including Open MPI) -- and other subsystems that the MPI uses
-- may use "lazy" semantics to set up and maintain streams of
communications. Hence, the first event (or first few events)
may well take significantly longer than subsequent events.
- Use a high-resolution timer if possible --
gettimeofday() only
returns millisecond precision (sometimes on the order of several
microseconds).
- Run the event many, many times (hundreds or thousands, depending
on the event and the time it takes). Not only does this provide
more samples, it may also be necessary, especially when the timer
you're using is several orders of magnitude less precise than the
event you're trying to benchmark.
- Decide whether you are reporting minimum, average, or maximum
numbers, and have good reasons why.
- Accurately label and report all results. Reproducibility is a
major goal of benchmarking; benchmark results are effectively useless
if they are not precisely labeled as to exactly what they are
reporting. Keep a log and detailed notes about the exact system
configuration that you are benchmarking. Note, for example, all
hardware and software characteristics (to include hardware, firmware,
and software versions as appropriate).
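The suggestions about warmup events, high-resolution timers, and repetition can be combined into a small timing harness. This is only a sketch: the function name benchmark and its default counts are arbitrary, and the timed event is a placeholder where a real benchmark would perform one ping-pong.

```python
import statistics
import time


def benchmark(event, warmup=100, iterations=10000):
    """Time a short event: warm up first, then sample it many times."""
    for _ in range(warmup):
        event()  # let lazily created resources get set up

    samples_ns = []
    for _ in range(iterations):
        start = time.perf_counter_ns()  # high-resolution monotonic timer
        event()
        samples_ns.append(time.perf_counter_ns() - start)

    # Report min, mean, and max so the reader can choose (and label) one.
    return {
        "iterations": iterations,
        "min_ns": min(samples_ns),
        "mean_ns": statistics.fmean(samples_ns),
        "max_ns": max(samples_ns),
    }


# Placeholder event; a real benchmark would do one ping-pong here.
print(benchmark(lambda: sum(range(100)), warmup=10, iterations=1000))
```

Whichever statistic is reported, record it together with the iteration count and the system configuration so that the run can be reproduced.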
| 203. I am getting an MPI_Win_free error from IMB-EXT -- what do I do? |
When you run IMB-EXT with Open MPI, you'll see a
message like this:
[node01.example.com:2228] *** An error occurred in MPI_Win_free
[node01.example.com:2228] *** on win
[node01.example.com:2228] *** MPI_ERR_RMA_SYNC: error while executing rma sync
[node01.example.com:2228] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
|
This is due to a bug in the Intel MPI Benchmarks, known to be in at
least versions v3.1 and v3.2. Intel was notified of this bug in May
of 2009, but there hasn't been a new IMB release since then.
Here is a small patch that fixes the bug in IMB v3.2:
diff -u imb-3.2-orig/src/IMB_window.c imb-3.2-fixed/src/IMB_window.c
--- imb-3.2-orig/src/IMB_window.c 2008-10-21 04:17:31.000000000 -0400
+++ imb-3.2-fixed/src/IMB_window.c 2009-07-20 09:02:45.000000000 -0400
@@ -140,6 +140,9 @@
c_info->rank, 0, 1, c_info->r_data_type,
c_info->WIN);
MPI_ERRHAND(ierr);
}
+ /* Added a call to MPI_WIN_FENCE, per MPI-2.1 11.2.1 */
+ ierr = MPI_Win_fence(0, c_info->WIN);
+ MPI_ERRHAND(ierr);
ierr = MPI_Win_free(&c_info->WIN);
MPI_ERRHAND(ierr);
}
|
And here is the corresponding patch for IMB v3.1:
Index: IMB_3.1/src/IMB_window.c
===================================================================
--- IMB_3.1/src/IMB_window.c(revision 1641)
+++ IMB_3.1/src/IMB_window.c(revision 1642)
@@ -140,6 +140,10 @@
c_info->rank, 0, 1, c_info->r_data_type, c_info->WIN);
MPI_ERRHAND(ierr);
}
+ /* Added a call to MPI_WIN_FENCE here, per MPI-2.1
+ 11.2.1 */
+ ierr = MPI_Win_fence(0, c_info->WIN);
+ MPI_ERRHAND(ierr);
ierr = MPI_Win_free(&c_info->WIN);
MPI_ERRHAND(ierr);
}
|
The sm BTL (shared-memory Byte Transfer Layer) is a low-latency, high-bandwidth
mechanism for transferring data between two processes via shared memory.
This BTL can only be used between processes executing on the same node.
The sm BTL has high exclusivity. That is, if one process can reach another
process via sm, then no other BTL will be considered for that connection.
Note that with OMPI 1.3.2, the sm so-called "FIFOs" were reimplemented and
the sizing of the shared-memory area was changed. So, much of this FAQ will
distinguish between releases up to OMPI 1.3.1 and releases starting with OMPI 1.3.2.
| 205. How do I specify use of sm for MPI messages? |
Typically, it is unnecessary to do so; OMPI will use the best BTL available
for each communication.
Nevertheless, you may use the MCA parameter btl. You should also specify the
self BTL for communications between a process and itself. Further, if not all
processes in your job will run on the same, single node, then you also need
to specify a BTL for internode communications. For example:
shell$ mpirun --mca btl self,sm,tcp -np 16 ./a.out
|
| 206. How does the sm BTL work? |
A point-to-point user message is broken up by the PML into fragments.
The sm BTL only has to transfer individual fragments. The steps are:
- The sender pulls a shared-memory fragment out of one of its free lists.
Each process has one free list for smaller (e.g., 4Kbyte) eager
fragments and another free list for larger (e.g., 32Kbyte) max fragments.
- The sender packs the user-message fragment into this shared-memory
fragment, including any header information.
- The sender posts a pointer to this shared fragment into the
appropriate FIFO (first-in-first-out) queue of the receiver.
- The receiver polls its FIFO(s). When it finds a new fragment
pointer, it unpacks data out of the shared-memory fragment and notifies
the sender that the shared fragment is ready for reuse (to be
returned to the sender's free list).
On each node where an MPI job has two or more processes running, the job creates
a file that each process mmaps into its address space. Shared-memory
resources that the job needs -- such as FIFOs and fragment free lists -- are
allocated from this shared-memory area.
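The four steps above can be modeled in a few lines. This is a toy, single-address-space simulation and not Open MPI's implementation: the class Process, the fragment count, and the omission of headers and of the larger max fragments are all simplifications made for illustration.

```python
from collections import deque

EAGER_SIZE = 4 * 1024  # example eager fragment size mentioned in the text


class Process:
    def __init__(self, name, num_fragments=8):
        self.name = name
        # Free list of reusable "shared-memory" fragments owned by this process.
        self.free_list = deque(bytearray(EAGER_SIZE) for _ in range(num_fragments))
        # FIFO of (sender, fragment, length) entries posted by senders.
        self.fifo = deque()

    def send(self, receiver, payload):
        """Send one fragment; the PML is assumed to have split the message."""
        assert len(payload) <= EAGER_SIZE
        frag = self.free_list.popleft()                    # 1. take a free fragment
        frag[:len(payload)] = payload                      # 2. pack data into it
        receiver.fifo.append((self, frag, len(payload)))   # 3. post to receiver FIFO

    def poll(self):
        """Drain the FIFO, unpack each fragment, and return it to its owner."""
        received = []
        while self.fifo:
            sender, frag, length = self.fifo.popleft()
            received.append(bytes(frag[:length]))          # 4. unpack the data
            sender.free_list.append(frag)                  #    and release the fragment
        return received


p0, p1 = Process("rank0"), Process("rank1")
p0.send(p1, b"hello")
print(p1.poll())
```

Note how the fragment always returns to the sender's free list: the receiver never keeps shared fragments, which is why free-list sizing is a sender-side concern.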
| 207. Why does my MPI job no longer start when there are too many processes on
one node? |
If you are using OMPI 1.3.1 or earlier, it is possible that the shared-memory
area set aside for your job was not created large enough. Make sure you're running
in 64-bit mode (compiled with -m64) and set the MCA parameter mpool_sm_max_size
to be very large -- even several Gbytes. Exactly how large is discussed further
below.
Regardless of which OMPI release you're using, make sure that there is sufficient
space for a large file to back the shared memory -- typically in /tmp.
| 208. How do I know what MCA parameters are available for tuning MPI performance? |
The ompi_info command can display all the parameters available for the
sm BTL and sm mpool:
shell$ ompi_info --param btl sm
shell$ ompi_info --param mpool sm
|
| 209. How can I tune these parameters to improve performance? |
Mostly, the default values of the MCA parameters have already been
chosen to give good performance. To improve performance further is a little
bit of an art. Sometimes, it's a matter of trading off performance for memory.
btl_sm_eager_limit:
If message data plus header information fits within this limit,
the message is sent "eagerly" -- that is, a sender attempts
to write its entire message to shared buffers without waiting for a receiver
to be ready. Above this size, a sender will only write the first part of a
message, then wait for the receiver to acknowledge that it is ready before continuing.
Eager sends can improve performance by decoupling senders from receivers.
btl_sm_max_send_size:
Large messages are sent in fragments of this size. Larger fragments can
lead to greater efficiencies, though they could perhaps also inhibit
pipelining between sender and receiver.
btl_sm_num_fifos:
Starting in OMPI 1.3.2, this is the number of FIFOs per receiving process.
By default, there is only one FIFO per process. Conceivably, if many senders
are all sending to the same process and contending for a single FIFO, there
will be congestion. If there are many FIFOs, however, the receiver must
poll more FIFOs to find incoming messages. Therefore, you might try
increasing this parameter slightly if you have many (at least dozens of)
processes all sending to the same process. For example, if 100 senders are
all contending for a single FIFO for a particular receiver, it may suffice
to increase btl_sm_num_fifos from 1 to 2.
btl_sm_fifo_size:
Starting in OMPI 1.3.2, FIFOs could no longer grow. If you believe the
FIFO is getting congested because a process falls far behind in reading
incoming message fragments, increase this size manually.
btl_sm_free_list_num:
This is the initial number of fragments on each (eager and max) free list.
The free lists can grow in response to resource congestion, but you can
increase this parameter to pre-reserve space for more fragments.
mpool_sm_min_size:
You can reserve headroom for the shared-memory area to grow by increasing
this parameter.
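When experimenting with several of these parameters at once, it can help to generate the mpirun command line from a table of settings. The sketch below only assembles the "--mca name value" syntax shown elsewhere in this FAQ; the helper name mpirun_command and the values used are illustrative, not recommendations.

```python
def mpirun_command(mca_params, nprocs, program):
    """Build an mpirun argument list with one '--mca name value' per parameter."""
    cmd = ["mpirun"]
    for name, value in mca_params.items():
        cmd += ["--mca", name, str(value)]
    cmd += ["-np", str(nprocs), program]
    return cmd


# Illustrative values only; these are not tuning recommendations.
tuning = {"btl_sm_num_fifos": 2, "btl_sm_free_list_num": 16}
print(" ".join(mpirun_command(tuning, 16, "./a.out")))
```

Keeping the settings in one table also makes it easy to log exactly which parameters each benchmark run used.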
| 210. Where is the file that sm will mmap in? |
The file will be in the OMPI session directory, which is typically
something like /tmp/openmpi-sessions-myusername@mynodename* .
The file itself will have the name
shared_mem_pool.mynodename. For example, the full path could be
/tmp/openmpi-sessions-myusername@node0_0/1543/1/shared_mem_pool.node0.
To place the session directory in a non-default location, use the MCA parameter
orte_tmpdir_base.
| 211. Why am I seeing incredibly poor performance with the sm BTL? |
The most common problem with the shared memory BTL is when the
Open MPI session directory is placed on a network filesystem (e.g., if
/tmp is not a local disk). This is because the shared memory BTL
places a memory-mapped file in the Open MPI session directory (see this entry for more details). If the
session directory is located on a network filesystem, the shared
memory BTL latency will be extremely high.
Try not mounting /tmp as a network filesystem, and/or moving the Open
MPI session directory to a local filesystem.
Some users have reported success and possible performance
optimizations with having /tmp mounted as a "tmpfs" filesystem
(i.e., a RAM-based filesystem). However, before configuring
your system this way, be aware of a few items:
- Open MPI writes a few small metadata files into
/tmp and may
therefore consume some extra memory that could have otherwise been
used for application instruction or data state.
- If you use the "filem" system in Open MPI for moving
executables between nodes, these files are stored under
/tmp.
- Open MPI's checkpoint / restart files can also be saved under
/tmp.
- If the Open MPI job is terminated abnormally, there are some
circumstances where files (including memory-mapped shared memory
files) can be left in
/tmp. This can happen, for example, when a
resource manager forcibly kills an Open MPI job and does not give it
the chance to clean up /tmp files and directories.
Some users have reported success with configuring their resource
manager to run a script between jobs to forcibly empty the /tmp
directory.
| 212. Can I use SysV instead of mmap? |
In the 1.3 and 1.4 Open MPI series, shared memory is established
via mmap. In future releases, there may be an option for using SysV
shared memory.
| 213. How much shared memory will my job use? |
Your job will create a shared-memory area on each node where it has
two or more processes. This area will be fixed during the lifetime of your
job. Shared-memory allocations (for FIFOs and fragment free lists) will be
made in this area. Here, we look at the size of that shared-memory area.
If you want just one, hard number, then go with approximately 128 Mbytes per
node per job, shared by all the job's processes on that node. That is, an OMPI
job will need more than a few Mbytes per node, but typically less than a few Gbytes.
Better yet, read on.
Up through OMPI 1.3.1, the shared-memory file would basically be sized:
nbytes = n * mpool_sm_per_peer_size
if ( nbytes < mpool_sm_min_size ) nbytes = mpool_sm_min_size
if ( nbytes > mpool_sm_max_size ) nbytes = mpool_sm_max_size
|
where n is the number of processes in the job running on that particular node
and the mpool_sm_* are MCA parameters.
For small n, this size is typically excessive. For large n (e.g.,
128 MPI processes on the same node), this size may not be sufficient for the job
to start.
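To see why the older sizing is excessive for small n and possibly too small for large n, the formula can be evaluated directly. The parameter values below are hypothetical, chosen only to show the clamping; check ompi_info for the actual defaults of your release.

```python
def sm_file_size_pre_1_3_2(n, per_peer_size, min_size, max_size):
    """Shared-memory file size (bytes) as described for OMPI <= 1.3.1."""
    nbytes = n * per_peer_size
    if nbytes < min_size:
        nbytes = min_size
    if nbytes > max_size:
        nbytes = max_size
    return nbytes


MB = 1024 * 1024
# Hypothetical parameter values, used only to illustrate the clamping.
for n in (2, 16, 128):
    size = sm_file_size_pre_1_3_2(n, per_peer_size=32 * MB,
                                  min_size=128 * MB, max_size=512 * MB)
    print(n, "processes:", size // MB, "MB")
```

With these assumed values, 2 processes are rounded up to the minimum, while 128 processes are capped at the maximum regardless of how much they actually need.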
Starting in OMPI 1.3.2, a more sophisticated formula was introduced to model more
closely how much memory was actually needed. That formula is somewhat complicated
and subject to change. It guarantees that there will be at least enough shared
memory for the program to start up and run. See this
FAQ item to see how much is needed. Alternatively, the motivated user can
examine the OMPI source code to see the formula used -- for example, here is the formula in OMPI revision SVN r20906.
OMPI 1.3.2 also uses the MCA parameter mpool_sm_min_size to set a minimum size
-- e.g., so that there is not only enough shared memory for the job to start, but
additionally headroom for further shared-memory allocations (e.g., of more eager
or max fragments).
Once the shared-memory area is established, it will not grow further during the
course of the MPI job's run.
| 214. How much shared memory do I need? |
In most cases, OMPI will start your job with sufficient shared memory.
Nevertheless, if OMPI doesn't get you enough shared memory (e.g., you're using OMPI 1.3.1
or earlier with roughly 128 processes or more on a single node) or you want to
trim shared-memory consumption, you may want to know how much shared memory is really needed.
As we saw earlier, the shared memory area contains:
- FIFOs
- eager fragments
- max fragments
In general, you need only enough shared memory for the FIFOs and fragments
that are allocated during MPI_Init().
Beyond that, you might want additional shared memory for performance reasons,
so that FIFOs and fragment lists can grow if your program's message traffic encounters
resource congestion. Even if there is no room to grow, however, your correctly
written MPI program should still run to completion in the face of congestion;
performance simply degrades somewhat. Note that while shared-memory resources
can grow after MPI_Init(), they cannot shrink.
So, how much shared memory is needed during MPI_Init()?
You need approximately the total of:
- FIFOs:
- (≤ OMPI 1.3.1):
3 × n × n × pagesize
- (≥ OMPI 1.3.2):
n × btl_sm_num_fifos × btl_sm_fifo_size × sizeof(void *)
- eager fragments:
n × ( 2 × n + btl_sm_free_list_inc ) × btl_sm_eager_limit
- max fragments:
n × btl_sm_free_list_num × btl_sm_max_send_size
where
- n is the number of MPI processes in your job on the node
- pagesize is the OS page size (4K for Linux and 8K for Solaris)
- btl_sm_* are MCA parameters
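The three terms can be added up with a short calculator. This is a sketch of the approximate formulas above and nothing more; the function name and the example parameter values are assumptions, so substitute the values that ompi_info reports on your system.

```python
def sm_needed_at_init(n, eager_limit, max_send_size, free_list_num,
                      free_list_inc, num_fifos=1, fifo_size=4096,
                      pagesize=4096, pointer_size=8, new_fifos=True):
    """Approximate bytes of shared memory needed during MPI_Init().

    new_fifos=True uses the OMPI >= 1.3.2 FIFO term, False the <= 1.3.1 one.
    pointer_size stands in for sizeof(void *).
    """
    if new_fifos:
        fifos = n * num_fifos * fifo_size * pointer_size
    else:
        fifos = 3 * n * n * pagesize
    eager = n * (2 * n + free_list_inc) * eager_limit
    max_frags = n * free_list_num * max_send_size
    return {"fifos": fifos, "eager": eager, "max": max_frags,
            "total": fifos + eager + max_frags}


# Hypothetical parameter values for 8 processes on one node.
print(sm_needed_at_init(n=8, eager_limit=4096, max_send_size=32768,
                        free_list_num=8, free_list_inc=64))
```

Because the eager term grows with n squared while the max-fragment term grows linearly, rerunning the calculation for a few values of n shows which term dominates on a given node.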
| 215. How can I decrease my shared-memory usage? |
There are two parts to this question.
First, how does one reduce how big the mmap file is? The answer is:
- up to OMPI 1.3.1: reduce mpool_sm_per_peer_size, mpool_sm_min_size,
and mpool_sm_max_size
- starting with OMPI 1.3.2: reduce mpool_sm_min_size
Second, how does one reduce how much shared memory is needed? (Just making
the mmap file smaller doesn't help if then your job won't start up.) The
answers are:
- For small values of n -- that is, for few processes per node --
shared-memory usage during MPI_Init() is predominantly for max free lists.
So, you can reduce the MCA parameter btl_sm_max_send_size. Alternatively,
you could reduce btl_sm_free_list_num, but it is already pretty small by
default.
- For large values of n -- that is, for many processes per node -- there
are two cases:
- up to OMPI 1.3.1: shared-memory usage is dominated by the
FIFOs, which consume a certain number of pages. Usage is
high and cannot be reduced much via MCA parameter tuning.
- starting with OMPI 1.3.2: shared-memory usage is dominated
by the eager free lists. So, you can reduce the MCA parameter
btl_sm_eager_limit.
| 216. How do I specify to use the TCP network for MPI messages? |
In general, you specify that the tcp BTL component should be
used. However, note that you should also specify that the self
BTL component should be used. self is for loopback communication
(i.e., when an MPI process sends to itself), and is technically a
different communication channel than TCP. For example:
shell$ mpirun --mca btl tcp,self ...
|
Failure to specify the self BTL may result in Open MPI being unable
to complete send-to-self scenarios (meaning that your program will run
fine until a process tries to send to itself).
Note that if the tcp BTL is available at run time (which it should
be on most POSIX-like systems), Open MPI should automatically use it
by default (ditto for self). Hence, it's usually unnecessary to
specify these options on the mpirun command line. They are
typically only used when you want to be absolutely positively
definitely sure to use the specific BTL.
If you are using a high-speed network (such as Myrinet or InfiniBand),
be sure to also see this FAQ entry.
| 217. But wait -- I'm using a high-speed network. Do I have to
disable the TCP BTL? |
No. Following the so-called "Law of Least Astonishment",
Open MPI assumes that if you have both a TCP network and at least one
high-speed network (such as Myrinet or InfiniBand), you will likely
only want to use the high-speed network(s) for MPI message passing.
Hence, the tcp BTL component will sense this and automatically
deactivate itself.
That being said, Open MPI may still use TCP for setup and teardown
information -- so you'll see traffic across your TCP network during
startup and shutdown of your MPI job. This is normal and does not
affect the MPI message passing channels.
| 218. How do I know what MCA parameters are available for tuning MPI performance? |
The ompi_info command can display all the parameters
available for the tcp BTL component (i.e., the component that uses
TCP for MPI communications):
shell$ ompi_info --param btl tcp
|
| 219. Does Open MPI use the TCP loopback interface? |
Usually not.
In general message passing usage, there are two scenarios where
the TCP loopback interface could be used:
- Sending a message from one process to itself
- Sending a message from one process to another process on the same
machine
The TCP BTL does not handle "send-to-self" scenarios in Open MPI;
indeed, it is not even capable of doing so. Instead, the self BTL
component is used for all send-to-self MPI communications (this allows
all Open MPI BTL components to avoid special case code for
send-to-self scenarios). The self component uses its own mechanisms
for send-to-self scenarios; it does not use network interfaces.
When sending to processes on the same machine, Open MPI will default
to using the shared memory (sm) BTL. If the user has
deactivated this BTL, depending on what other BTL components are
available, it is possible that the TCP BTL will be chosen for message
passing to processes on the same node, in which case the TCP loopback
device will likely be used. But this is not the default; either
shared memory has to fail to start up properly or the user must
specifically request not to use the shared memory BTL.
| 220. I have multiple TCP networks on some/all of my cluster nodes. Which ones will Open MPI use? |
In general, Open MPI will greedily use all TCP networks that
it finds per its reachability
computations.
To change this behavior, you can either specifically include certain
networks or specifically exclude certain networks. See this FAQ entry for more details.
| 221. I'm getting TCP-related errors. What do they mean? |
TCP-related errors are usually reported by Open MPI in a
message similar to these:
btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
mca_btl_tcp_frag_send: writev failed with errno=104
|
If an error number is displayed with no explanation string, you can
see what that specific error number means on your operating system
with the following command (the following example was run on Linux;
results may be different on other operating systems):
linux_shell$ perl -e 'die$!=113'
No route to host at -e line 1.
linux_shell$ perl -e 'die$!=104'
Connection reset by peer at -e line 1.
|
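If Perl is not available, the same lookup can be done with Python's standard library. This is a small sketch; the helper name describe_errno is made up, and the numeric values 113 and 104 are the Linux ones from the messages above, so the text printed on other operating systems may differ.

```python
import errno
import os


def describe_errno(number):
    """Return (symbolic name, message) for an errno value on this OS."""
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)


# On Linux, 113 and 104 are the values from the messages above.
for number in (113, 104):
    print(number, *describe_errno(number))
```

The symbolic name (for example ECONNRESET) is often easier to search for than the raw number, since numbers vary between operating systems.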
Two types of errors are commonly reported to the Open MPI user's
mailing list:
- No route to host: These types of errors usually mean that
there are multiple TCP interfaces available and they do not obey Open
MPI's assumptions about routability. See these two FAQ items for more
information:
- Connection reset by peer: These types of errors usually occur
after MPI_INIT has completed, and typically indicate that an MPI
process has died unexpectedly (e.g., due to a seg fault). The
specific error message indicates that a peer MPI process tried to
write to the now-dead MPI process and failed.
| 222. How do I tell Open MPI which TCP networks to use? |
In some parallel environments, it is not uncommon to have
multiple TCP interfaces on each node -- for example, one TCP network
may be "slow" and used for control information such as a batch
scheduler, a networked filesystem, and/or interactive logins. Another
TCP network (or networks) may be "fast" and be intended for parallel
applications to use during their runs. As another example, some
operating systems may also have virtual interfaces for communicating
with virtual machines.
Unless otherwise specified, Open MPI will greedily use all "up" TCP
networks that it can find and try to connect to all peers upon
demand (i.e., Open MPI does not open sockets to all of its MPI peers
during MPI_INIT -- see this FAQ entry
for more details). Hence, if you want MPI jobs to not use specific
TCP networks -- or not use any TCP networks at all -- then you need to
tell Open MPI.
NOTE: Aggressively using all "up" interfaces can cause problems in
some cases. For example, if you have a machine with a local-only
interface (e.g., the loopback device, or a virtual-machine bridge
device that can only be used on that machine, and cannot be used to
communicate with MPI processes on other machines), you will likely
need to tell Open MPI to ignore these networks. Open MPI usually
ignores loopback devices by default, but other local-only devices
must be manually ignored. Users have reported cases where RHEL6
automatically installed a "virbr0" device for Xen virtualization.
This interface was automatically given an IP address in the
192.168.1.0/24 subnet and marked as "up". Since Open MPI saw this
192.168.1.0/24 "up" interface in all MPI processes on all nodes, it
assumed that that network was usable for MPI communications. This is
obviously incorrect, and it led to MPI applications hanging when they
tried to send or receive MPI messages.
- To disable Open MPI from using TCP for MPI communications, the
btl MCA parameter should be set accordingly. You can either
exclude the TCP component or include all other components.
Specifically:
# This says to exclude the TCP BTL component
# (implicitly including all others)
shell$ mpirun --mca btl ^tcp ...
# This says to include only the listed BTL components
# (tcp is not listed, and therefore will not be used)
shell$ mpirun --mca btl self,openib ...
|
- If you want to use TCP for MPI communications, but want to
restrict it from certain networks, use the
btl_tcp_if_include or
btl_tcp_if_exclude MCA parameters (only one of the two should be
set). The values of these parameters can be a comma-delimited list of
network interfaces. For example:
# This says to not use the eth0 and lo interfaces.
# (and implicitly use all the rest) Per the description
# above, TCP loopback and all local-only devices *must*
# be included if the exclude list is specified.
shell$ mpirun --mca btl_tcp_if_exclude lo,eth0 ...
# This says to only use the eth1 and eth2 interfaces
# (and implicitly ignore the rest)
shell$ mpirun --mca btl_tcp_if_include eth1,eth2 ...
|
- Starting in the Open MPI v1.5 series, you can specify subnets in the
include or exclude lists in CIDR notation. For example:
# Only use the 192.168.1.0/24 and 10.10.0.0/16 subnets for MPI
# communications:
shell$ mpirun --mca btl_tcp_if_include 192.168.1.0/24,10.10.0.0/16 ...
|
NOTE: If you use the
btl_tcp_if_include and btl_tcp_if_exclude MCA parameters to shape
the behavior of the TCP BTL for MPI communications, you may also
need/want to investigate the corresponding MCA parameters
oob_tcp_if_include and oob_tcp_if_exclude, which are used to shape
non-MPI TCP-based communication (e.g., communications setup and
coordination during MPI_INIT and MPI_FINALIZE).
Note that Open MPI will still use TCP for control messages, such as
data between mpirun and the MPI processes, rendezvous information
during MPI_INIT, etc. To disable TCP altogether, you also need to
disable the tcp component from the OOB framework.
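Before settling on an include list, it can be useful to check which of a node's interfaces a given list would actually select. The sketch below only illustrates name and CIDR matching with Python's ipaddress module; it is not Open MPI's parser, and the interface table is a hypothetical node.

```python
import ipaddress


def select_interfaces(interfaces, include):
    """Keep interfaces named in `include` or whose address is in a listed subnet.

    interfaces: dict of interface name -> IP address string
    include:    comma-delimited interface names and/or CIDR subnets
    """
    names, subnets = set(), []
    for item in include.split(","):
        if "/" in item:
            subnets.append(ipaddress.ip_network(item))
        else:
            names.add(item)
    selected = []
    for name, addr in interfaces.items():
        ip = ipaddress.ip_address(addr)
        if name in names or any(ip in net for net in subnets):
            selected.append(name)
    return selected


# Hypothetical node with a loopback device and a local-only bridge.
node = {"lo": "127.0.0.1", "eth0": "10.10.3.7",
        "eth1": "192.168.1.5", "virbr0": "192.168.122.1"}
print(select_interfaces(node, "192.168.1.0/24,10.10.0.0/16"))
```

In this made-up example the virbr0 bridge falls outside both subnets, which is the kind of local-only device the NOTE above warns about.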
| 223. Does Open MPI open a bunch of sockets during MPI_INIT? |
Although Open MPI is likely to open multiple TCP sockets
during MPI_INIT, the tcp BTL component does not open one socket per
MPI peer process during MPI_INIT. Open MPI opens
sockets as they are required -- so the first time a process sends a
message to a peer and there is a TCP connection between the two, Open
MPI will automatically open a new socket.
Hence, you should not have scalability issues with running large
numbers of processes (e.g., running out of per-process file
descriptors) if your parallel application is sparse in its
communication with peers.
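The on-demand behavior can be pictured as a connection cache that is filled on first use. This is a conceptual sketch only: the class LazyConnections is invented for illustration and uses plain lists as stand-ins for sockets.

```python
class LazyConnections:
    """Open a connection to a peer only on first use, then reuse it."""

    def __init__(self, connect):
        self._connect = connect   # callable: peer rank -> connection object
        self._open = {}           # peer rank -> established connection

    def send(self, peer, message):
        if peer not in self._open:
            self._open[peer] = self._connect(peer)  # first message to this peer
        self._open[peer].append(message)

    def open_count(self):
        return len(self._open)


# Stand-in "connections" are plain lists; a real one would be a socket.
conns = LazyConnections(connect=lambda peer: [])
for peer in (1, 2, 1, 1, 2):      # sparse pattern: only ranks 1 and 2
    conns.send(peer, b"data")
print(conns.open_count())
```

The number of open connections tracks the number of distinct peers contacted, not the size of the job, which is why sparse communication patterns scale well.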
| 224. How does Open MPI know which TCP addresses are routable to each other in Open MPI 1.2? |
This is a fairly complicated question -- there can be
ambiguity when hosts have multiple TCP NICs and/or there are multiple
TCP networks that are not routable to each other in a single MPI job.
It is important to note that Open MPI's atomic unit of routing is a
process -- not an IP address. Hence, Open MPI makes connections
between processes, not nodes (these processes are almost always on
remote nodes, but it's still better to think in terms of processes,
not nodes).
Specifically, since OMPI can span multiple TCP networks, each MPI
process may be able to use multiple IP addresses to reach each other
MPI process (and vice versa). So for each process, Open MPI needs to
determine which IP address -- if any -- to use to connect to a peer MPI
process.
For example, say that you have a cluster with 16 nodes on a private
ethernet network. One of these nodes doubles as the head node for the
cluster and therefore has 2 ethernet NICs -- one to the external
network and one to the internal cluster network. But since 16 is a
nice number, you also want to use it for computation. So when
you mpirun spanning all 16 nodes, OMPI has to figure out to not use
the external NIC on the head node and only use the internal NIC.
To explain what happens, we need to explain some of what happens in
MPI_INIT. Even though Open MPI only makes TCP connections between
peer MPI processes upon demand (see this FAQ
entry), each process publishes its TCP contact information which
is then made available to all processes. Hence, every process knows
the TCP address(es) and corresponding port number(s) to contact every
other process.
But keep in mind that these addresses may span multiple TCP networks
and/or not be routable to each other. So when a connection is
requested, the TCP BTL component in Open MPI creates pairwise
combinations of all the TCP addresses of the localhost to all the TCP
addresses of the peer process, looking for a match.
A "match" is defined by the following rules:
- If the two IP addresses match after the subnet mask is applied,
assume that they are mutually routable and allow the connection
- If the two IP addresses are public, assume that they are mutually
routable and allow the connection
- Otherwise, the connection is disallowed (this is not an error --
we just disallow this connection in the hope that some other
device can be used to make a connection)
These rules tend to cover the following scenarios:
- A cluster on a private network with a head node that has a NIC on
the private network and the public network
- Clusters that have all public addresses
These rules do not cover the following cases:
- Running an MPI job that spans public and private networks
- Running an MPI job that spans a bunch of private networks with
narrowly-scoped netmasks, such as nodes that have IP addresses
192.168.1.10 and 192.168.2.10 with netmasks of 255.255.255.0 (i.e.,
the network fabric makes these two nodes be routable to each other,
even though the netmask implies that they are on different
subnets).
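The matching rules and the uncovered netmask case can be tried out with Python's ipaddress module. This is a sketch of the rules as stated above, not the Open MPI source; "public" is approximated here as "not is_private", which is an assumption of this example.

```python
import ipaddress


def is_match_1_2(local, peer):
    """Apply the routability rules above to two 'address/netmask' strings."""
    a = ipaddress.ip_interface(local)
    b = ipaddress.ip_interface(peer)
    if a.network == b.network:
        return True   # same subnet after the netmask is applied
    if not a.ip.is_private and not b.ip.is_private:
        return True   # both public: assume mutually routable
    return False      # disallow; another device may still connect


# The uncovered case from the text: routable in practice, but rejected.
print(is_match_1_2("192.168.1.10/255.255.255.0",
                   "192.168.2.10/255.255.255.0"))
```

Widening the netmask on both nodes (for example to 255.255.0.0) would make the two addresses match under the first rule.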
| 225. How does Open MPI know which TCP addresses are routable to each other in Open MPI 1.3 (and beyond)? |
The 1.3 series assumptions about routability are much different from
those in the 1.2 series. In the 1.3 series, we assume that
all interfaces are routable as long as they have the same address
family, IPv4 or IPv6. We use graph theory and give each possible
connection a weight depending on the quality of the connection. This
allows the library to select the best connections between nodes. This
method also supports striping but prevents more than one connection to
any interface.
The quality of the connection is defined as follows, with a higher
number meaning a better connection. Note that a connection between a
private address and a public address is given the weight
PRIVATE_DIFFERENT_NETWORK.
NO_CONNECTION = 0
PRIVATE_DIFFERENT_NETWORK = 1
PRIVATE_SAME_NETWORK = 2
PUBLIC_DIFFERENT_NETWORK = 3
PUBLIC_SAME_NETWORK = 4
|
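A minimal Python sketch of this weighting follows. It is an illustration, not the TCP BTL source: connection_weight is a hypothetical name, "same network" is assumed to mean equal networks after each netmask is applied, and "private" is whatever Python's ipaddress module flags as private.

```python
import ipaddress

NO_CONNECTION = 0
PRIVATE_DIFFERENT_NETWORK = 1
PRIVATE_SAME_NETWORK = 2
PUBLIC_DIFFERENT_NETWORK = 3
PUBLIC_SAME_NETWORK = 4

def connection_weight(local, peer):
    """Weight one candidate connection; arguments look like 'addr/netmask'."""
    a = ipaddress.ip_interface(local)
    b = ipaddress.ip_interface(peer)
    # Interfaces are only considered routable within one address family.
    if a.version != b.version:
        return NO_CONNECTION
    same_network = a.network == b.network
    if a.ip.is_private and b.ip.is_private:
        return PRIVATE_SAME_NETWORK if same_network else PRIVATE_DIFFERENT_NETWORK
    if not a.ip.is_private and not b.ip.is_private:
        return PUBLIC_SAME_NETWORK if same_network else PUBLIC_DIFFERENT_NETWORK
    # One private and one public address.
    return PRIVATE_DIFFERENT_NETWORK

print(connection_weight("10.8.47.1/255.255.255.0", "10.8.47.2/255.255.255.0"))
print(connection_weight("192.168.2.2/255.255.255.0", "192.168.1.2/255.255.255.0"))
```

The two calls print 2 and 1: the first pair shares a private network, the second pair is private but on different networks.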
At this point, an example will best illustrate how two processes on two
different nodes would connect up. Here we have two nodes with a variety
of interfaces.
      NodeA                 NodeB
 ---------------       ---------------
|     lo0       |     |     lo0       |
|  127.0.0.1    |     |  127.0.0.1    |
|  255.0.0.0    |     |  255.0.0.0    |
|               |     |               |
|     eth0      |     |     eth0      |
|  10.8.47.1    |     |  10.8.47.2    |
| 255.255.255.0 |     | 255.255.255.0 |
|               |     |               |
|     ibd0      |     |     ibd0      |
|  192.168.1.1  |     |  192.168.1.2  |
| 255.255.255.0 |     | 255.255.255.0 |
|               |     |               |
|     ibd1      |     |               |
|  192.168.2.2  |     |               |
| 255.255.255.0 |     |               |
 ---------------       ---------------
|
From these two nodes, the software builds up a bipartite graph that
shows all the possible connections with all the possible weights. The
lo0 interfaces are excluded because the btl_tcp_if_exclude MCA parameter
is set to lo by default. Here is what all the possible connections
with their weights look like.
NodeA                     NodeB
eth0 --------- 2 -------- eth0
    \
     \
      \------- 1 -------- ibd0
ibd0 --------- 1 -------- eth0
    \
     \
      \------- 2 -------- ibd0
ibd1 --------- 1 -------- eth0
    \
     \
      \------- 1 -------- ibd0
|
The library then examines all the connections and picks the optimal
ones. This leaves us with two connections being established between
the two nodes.
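To make the selection step concrete, here is a small brute-force sketch in Python. It is not the algorithm Open MPI uses (the text only says graph theory is involved); it simply tries every pairing that uses each interface at most once and keeps the one with the highest total weight. The weights are copied from the graph above, and best_connections is a hypothetical name.

```python
from itertools import permutations

# Weights copied from the bipartite graph: (NodeA interface, NodeB interface).
WEIGHTS = {
    ("eth0", "eth0"): 2, ("eth0", "ibd0"): 1,
    ("ibd0", "eth0"): 1, ("ibd0", "ibd0"): 2,
    ("ibd1", "eth0"): 1, ("ibd1", "ibd0"): 1,
}

def best_connections(weights):
    """Pick the highest-total-weight pairing that uses each interface once."""
    a_ifaces = sorted({a for a, _ in weights})
    b_ifaces = sorted({b for _, b in weights})
    # Assumes the first node has at least as many interfaces as the second.
    best, best_total = [], -1
    for chosen in permutations(a_ifaces, len(b_ifaces)):
        # Drop pairs with no usable connection (weight 0 or missing).
        pairs = [(a, b) for a, b in zip(chosen, b_ifaces)
                 if weights.get((a, b), 0) > 0]
        total = sum(weights[p] for p in pairs)
        if total > best_total:
            best, best_total = pairs, total
    return best

print(best_connections(WEIGHTS))
```

For this example the result is the pair of same-network links, eth0 to eth0 and ibd0 to ibd0, with a total weight of 4. That matches the two connections described above.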
If you are curious about the actual connect() calls being made by
the processes, then you can run with --mca btl_base_verbose 30.
This can be useful if you notice your job hanging and believe it may
be the library trying to make connections to unreachable hosts.
# Here is an example with some of the output deleted for clarity.
# One can see the connections that are attempted.
> mpirun --mca btl self,sm,tcp --mca btl_base_verbose 30 -np 2 -host NodeA,NodeB a.out
[...snip...]
[NodeA:18003] btl: tcp: attempting to connect() to address 10.8.47.2 on port 59822
[NodeA:18003] btl: tcp: attempting to connect() to address 192.168.1.2 on port 59822
[NodeB:16842] btl: tcp: attempting to connect() to address 192.168.1.1 on port 44500
[...snip...]
|
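When the verbose output is long, a short script can pull out just the connection attempts. The sketch below assumes the line format shown in the sample output above; other Open MPI versions may format these messages differently, and connect_attempts is a hypothetical helper name.

```python
import re

# Pattern based on the sample lines above; treat the format as an assumption.
CONNECT_RE = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\] btl: tcp: attempting to connect\(\) "
    r"to address (?P<addr>\S+) on port (?P<port>\d+)"
)

def connect_attempts(log_text):
    """Return (host, address, port) for each connect() attempt in the log."""
    return [
        (m.group("host"), m.group("addr"), int(m.group("port")))
        for m in CONNECT_RE.finditer(log_text)
    ]

sample = """\
[NodeA:18003] btl: tcp: attempting to connect() to address 10.8.47.2 on port 59822
[NodeA:18003] btl: tcp: attempting to connect() to address 192.168.1.2 on port 59822
[NodeB:16842] btl: tcp: attempting to connect() to address 192.168.1.1 on port 44500
"""

for host, addr, port in connect_attempts(sample):
    print(f"{host} -> {addr}:{port}")
```

Comparing the extracted addresses against the networks you expect to be reachable is a quick way to spot attempts to unreachable hosts.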
In case you want more details about the theory behind the connection
code, you can find the whole background story in
chapter 3 of this thesis
or check out a brief
IEEE paper.
| 226. Does Open MPI ever close TCP sockets? |
As of v1.2, no.
Although TCP sockets are opened "lazily" (meaning that MPI
connections / TCP sockets are only opened upon demand -- as opposed to
opening all possible sockets between MPI peer processes during
MPI_INIT), they are never closed.
| 227. Does Open MPI support IP interfaces that have more than one IP address? |
As of v1.6, no.
For example, if the output from your ifconfig has a single IP device
with multiple IP addresses like this:
0: eth0: mtu 1500 qdisc mq state UP qlen 1000
link/ether 00:18:ae:f4:d2:29 brd ff:ff:ff:ff:ff:ff
inet 192.168.0.3/24 brd 192.168.0.255 scope global eth0:1
inet 10.10.0.3/24 brd 10.10.0.255 scope global eth0
inet6 fe80::218:aef2:29b4:2c4/64 scope link
valid_lft forever preferred_lft forever
|
(note the two "inet" lines in there)
Then Open MPI will be unable to use this device.
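A quick way to check a node for this situation is to count the "inet" lines per device. The Python sketch below assumes output laid out like the listing above (a numbered header line per device, followed by its address lines); the function names are hypothetical and only IPv4 "inet" lines are counted.

```python
import re

def addresses_per_device(listing):
    """Map each device name to the IPv4 addresses listed under it."""
    devices = {}
    current = None
    for line in listing.splitlines():
        header = re.match(r"^\d+:\s+([^:\s]+):", line)
        if header:
            current = header.group(1)
            devices[current] = []
            continue
        # "inet6" lines do not match because whitespace must follow "inet".
        inet = re.match(r"^\s*inet\s+(\S+)", line)
        if inet and current is not None:
            devices[current].append(inet.group(1))
    return devices

def multi_address_devices(listing):
    """Return the devices that carry more than one IPv4 address."""
    found = addresses_per_device(listing)
    return sorted(dev for dev, addrs in found.items() if len(addrs) > 1)

sample = """\
0: eth0: mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:18:ae:f4:d2:29 brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.3/24 brd 192.168.0.255 scope global eth0:1
    inet 10.10.0.3/24 brd 10.10.0.255 scope global eth0
    inet6 fe80::218:aef2:29b4:2c4/64 scope link
"""

print(multi_address_devices(sample))
```

For the sample listing this reports eth0, the device with two "inet" lines.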
| 228. Does Open MPI support virtual IP interfaces? |
As of v1.6.2, no.
For example, if the output of your ifconfig has both "eth0" and
"eth0:0", Open MPI will get confused if you use the TCP BTL, and
will likely hang.
Note that using btl_tcp_if_include or btl_tcp_if_exclude to avoid
using the virtual interface will not solve the issue.
This may get fixed in a future release. See Trac bug
#3339 to follow the progress on this issue.
| 229. What Myrinet-based components does Open MPI have? |
Open MPI supports both GM and MX for MPI communications. GM is
supported by the gm component in the BTL framework. MX is supported by
both the mx component in the BTL framework and (as of v1.2) the mx
component in the MTL framework.
| 230. How do I specify to use the Myrinet GM network for MPI messages? |
In general, you specify that the gm BTL component should be used.
However, note that you should also specify that the self BTL component
should be used. self is for loopback communication (i.e., when an MPI
process sends to itself). This is technically a different
communication channel than Myrinet. For example:
shell$ mpirun --mca btl gm,self ...
|
Failure to specify the self BTL may result in Open MPI being unable
to complete send-to-self scenarios (meaning that your program will run
fine until a process tries to send to itself).
To use Open MPI's shared memory support for on-host communication
instead of GM's shared memory support, simply include the sm BTL.
For example:
shell$ mpirun --mca btl gm,sm,self ...
|
Finally, note that if the gm component is
available at run time, Open MPI should automatically use it by
default (ditto for self and sm). Hence, it's usually unnecessary to
specify these options on the mpirun command line. They are
typically only used when you want to be absolutely positively
definitely sure to use the specific BTL.
| 231. How do I specify to use the Myrinet MX network for MPI messages? |
As of version 1.2, Open MPI has two different components
to support Myrinet MX, the mx BTL and the mx MTL, only one of which can be
used at a time. Prior versions only have the mx BTL.
If available, the mx BTL is used by default. However, to be sure it is
selected you can specify it. Note that you should also specify the
self BTL component (for loopback communication) and the sm BTL
component (for on-host communication). For example:
shell$ mpirun --mca btl mx,sm,self ...
|
To use the mx MTL component, it must be specified. Also, you must use
the cm PML component. For example:
shell$ mpirun --mca mtl mx --mca pml cm ...
|
Note that one cannot use both the mx MTL and the mx BTL components
at once. Deciding which to use largely depends on the application being
run.
| 232. But wait -- I also have a TCP network. Do I need to explicitly
disable the TCP BTL? |
No. See this FAQ entry for more details.
| 233. How do I know what MCA parameters are available for tuning MPI performance? |
The ompi_info command can display all the parameters
available for the gm and mx BTL components and the mx MTL component:
# Show the gm BTL parameters
shell$ ompi_info --param btl gm
# Show the mx BTL parameters
shell$ ompi_info --param btl mx
# Show the mx MTL parameters
shell$ ompi_info --param mtl mx
|
| 234. I'm experiencing a problem with Open MPI on my Myrinet-based network; how do I troubleshoot and get help? |
To help us help you, please run through a few steps before sending an
e-mail. They both perform some basic troubleshooting and give us enough
information about your environment. Please include answers to the
following questions in your e-mail:
- Which Myricom software stack are you running: GM or MX? Which
version?
- Are you using "fma", the "gm_mapper", or the "mx_mapper"?
- If running GM, include the output from running gm_board_info
from a known "good" node and a known "bad" node.
If running MX, include the output from running mx_info from a known
"good" node and a known "bad" node.
- Is the "Map version" value from this output the same across
all nodes?
- NOTE: If the map version
is not the same, ensure that you are not running a mixture of FMA on
some nodes and the mapper on others. Also check the connectivity of
nodes that seem to have an inconsistent map version.
- What are the contents of the file
/var/run/fms/fma.log?
Gather up this information and see
this page about how to submit a help request to the user's mailing
list.
| 235. How do I adjust the MX first fragment size? Are there constraints? |
The MX library limits the maximum message fragment size for
both on-node and off-node messages. As of MX v1.0.3, the inter-node
maximum fragment size is 32k, and the intra-node maximum fragment size
is 16k -- fragments sent larger than these sizes will fail.
Open MPI automatically fragments large messages; it currently limits
its first fragment size on MX networks to the lower of these two
values -- 16k. As such, increasing the value of the MCA parameter
named btl_mx_first_frag_size larger than 16k may cause failures in
some cases (i.e., when using MX to send large messages to processes on
the same node); it will cause failures in all cases if it is set above
32k.
Note that this only affects the first fragment of messages; later
fragments do not have this size restriction. The MCA parameter
btl_mx_max_send_size can be used to vary the maximum size of
subsequent fragments.
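The constraints on btl_mx_first_frag_size can be summarized as a small check. This Python sketch only restates the limits quoted above for MX v1.0.3; it assumes "16k" and "32k" mean 16 KiB and 32 KiB, and first_frag_size_risk is a hypothetical helper, not part of Open MPI.

```python
KIB = 1024  # assumption: "16k" and "32k" in the text mean KiB

INTRA_NODE_MAX = 16 * KIB  # MX limit for on-node fragments
INTER_NODE_MAX = 32 * KIB  # MX limit for off-node fragments

def first_frag_size_risk(size_bytes):
    """Classify a candidate btl_mx_first_frag_size value."""
    if size_bytes <= INTRA_NODE_MAX:
        return "ok"
    if size_bytes <= INTER_NODE_MAX:
        # Off-node sends still fit, but on-node sends exceed the MX limit.
        return "may fail for same-node peers"
    return "fails in all cases"

for size in (8 * KIB, 24 * KIB, 64 * KIB):
    print(size, first_frag_size_risk(size))
```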
| 236. What is different between Sun Microsystems ClusterTools 7 and Open
MPI in regards to the uDAPL BTL? |
Sun's ClusterTools is based off of Open MPI with one significant
difference: Sun's ClusterTools includes uDAPL RDMA capabilities in the
uDAPL BTL. The Open MPI v1.2 uDAPL BTL does not include the RDMA
capabilities. These improvements do exist today in the Open MPI trunk
and will be included in future Open MPI releases.
| 237. What values are expected to be used by the btl_udapl_if_include and btl_udapl_if_exclude MCA parameters? |
The uDAPL BTL looks for a match in the uDAPL static registry, which is contained in the dat.conf file. Each line that is neither blank nor commented out is considered an interface. The first field of each interface entry is the value that must be supplied to the MCA parameter in question.
Solaris Example:
shell% datadm -v
ibd0 u1.2 nonthreadsafe default udapl_tavor.so.1 SUNW.1.0 " " "driver_name=tavor"
shell% mpirun --mca btl_udapl_if_include ibd0 ...
|
Linux Example:
shell% cat /etc/dat.conf
OpenIB-cma u1.2 nonthreadsafe default /usr/local/ofed/lib64/libdaplcma.so dapl.1.2 "ib0 0" ""
OpenIB-bond u1.2 nonthreadsafe default /usr/local/ofed/lib64/libdaplcma.so dapl.1.2 "bond0 0" ""
shell% mpirun --mca btl_udapl_if_exclude OpenIB-bond ...
|
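To list the names that these MCA parameters will accept, you can extract the first field of each registry entry. The Python sketch below assumes that comment lines in dat.conf start with "#"; dat_conf_interfaces is a hypothetical helper name, and the sample text reuses the Linux entries above with an added comment line.

```python
def dat_conf_interfaces(text):
    """Return the first field of every non-blank, non-comment line."""
    names = []
    for line in text.splitlines():
        stripped = line.strip()
        # Assumption: comments in dat.conf begin with '#'.
        if not stripped or stripped.startswith("#"):
            continue
        names.append(stripped.split()[0])
    return names

sample = '''\
# sample registry (comment line added for illustration)

OpenIB-cma u1.2 nonthreadsafe default /usr/local/ofed/lib64/libdaplcma.so dapl.1.2 "ib0 0" ""
OpenIB-bond u1.2 nonthreadsafe default /usr/local/ofed/lib64/libdaplcma.so dapl.1.2 "bond0 0" ""
'''

print(dat_conf_interfaces(sample))
```

Any name in the resulting list can be passed to btl_udapl_if_include or btl_udapl_if_exclude.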
| 238. Where is the static uDAPL Registry found? |
Solaris: /etc/dat/dat.conf
Linux: /etc/dat.conf
| 239. How come the value reported by "ifconfig" is not accepted by the btl_udapl_if_include/btl_udapl_if_exclude MCA parameter? |
uDAPL queries a static registry defined in the dat.conf file to find available interfaces which can be used. As such, the uDAPL BTL needs to match the names found in the registry and these may differ from what is reported by "ifconfig".
| 240. I get a warning message about not being able to register memory and possibly out of privileged memory while running on Solaris, what can I do? |
The error message probably looks something like this:
WARNING: The uDAPL BTL is not able to register memory. Possibly out of
allowed privileged memory (i.e. memory that can be pinned). Increasing
the allowed privileged memory may alleviate this issue.
|
One thing to do is increase the amount of available privileged
memory. On Solaris, your system administrator can increase the amount of
available privileged memory by editing the /etc/project file on the
nodes. For more information, see the Solaris "project" man page.
To increase the privileged memory, first determine the
amount currently available (a typical value is 978MB):
shell% prctl -n project.max-device-locked-memory -i project default
NAME    PRIVILEGE       VALUE    FLAG   ACTION   RECIPIENT
project.max-device-locked-memory
        privileged      978MB       -   deny             -
        system         16.0EB     max   deny             -
|
To increase the amount of privileged memory, edit the /etc/project file:
Default /etc/project file.
system:0::::
user.root:1::::
noproject:2::::
default:3::::
group.staff:10::::
|
Change it to, for example, 4GB:
system:0::::
user.root:1::::
noproject:2::::
default:3::::project.max-device-locked-memory=(priv, 4294967296, deny)
group.staff:10::::
|
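The value in the modified entry is given in bytes, so 4GB is written as 4294967296. The Python sketch below shows that arithmetic and rebuilds the entry in the same layout as the example above; project_line is a hypothetical helper, and the field layout is copied from the example rather than from the Solaris documentation.

```python
GIB = 1024 ** 3  # bytes in one gigabyte as used by the example (4GB = 4294967296)

def locked_memory_attribute(limit_bytes):
    """Format the attribute in the same way as the example entry."""
    return f"project.max-device-locked-memory=(priv, {limit_bytes}, deny)"

def project_line(name, project_id, limit_bytes):
    """Build one /etc/project entry with only the locked-memory attribute set."""
    return f"{name}:{project_id}::::{locked_memory_attribute(limit_bytes)}"

print(4 * GIB)
print(project_line("default", 3, 4 * GIB))
```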
| 241. What is special about MPI performance analysis? |
The synchronization among the MPI processes can be a key performance
concern. For example, if a serial program spends a lot of time in function
foo(), you should optimize foo(). In contrast, if an MPI process spends a lot of time
in MPI_Recv(), not only is the optimization target probably not MPI_Recv(),
but you should in fact probably be looking at some other process altogether.
You should ask, "What is happening on other processes when this process has
the long wait?"
Another issue is that a parallel program (in the case of MPI, a multi-process program)
can generate much more performance data than a serial program due to the greater
number of execution threads. Managing that data volume can be a challenge.
| 242. What are "profiling" and "tracing"? |
These terms are sometimes used to refer to two different kinds of
performance analysis.
In profiling, one aggregates statistics at run time -- e.g., total
amount of time spent in MPI, total number of messages or bytes sent, etc.
Data volumes are small.
In tracing, an event history is collected. It is common to display such event
history on a timeline display. Tracing data can provide much interesting detail,
but data volumes are large.
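The difference can be shown with a toy example in Python. The event list below is invented purely for illustration (it is not measured data), and profile_from_trace is a hypothetical name. The trace keeps every event with its timestamps, while the profile keeps only one small record per call name.

```python
from collections import defaultdict

# A trace keeps every event: (rank, call name, start time, end time).
# These values are made up for illustration.
trace = [
    (0, "MPI_Send", 0.10, 0.12),
    (1, "MPI_Recv", 0.05, 0.12),
    (0, "MPI_Send", 0.30, 0.31),
    (1, "MPI_Recv", 0.20, 0.31),
]

def profile_from_trace(events):
    """Aggregate a trace into per-call totals: (call count, total seconds)."""
    totals = defaultdict(lambda: [0, 0.0])
    for _rank, call, start, end in events:
        totals[call][0] += 1
        totals[call][1] += end - start
    return {call: (count, seconds) for call, (count, seconds) in totals.items()}

for call, (count, seconds) in sorted(profile_from_trace(trace).items()):
    print(f"{call}: {count} calls, {seconds:.2f} s")
```

The trace grows with every MPI call, whereas the profile stays the same size no matter how long the program runs. The cost is that the timeline is lost: the profile can no longer show which receive waited on which send.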
| 243. How do I sort out busy wait time from idle wait, user time from system
time, and so on? |
Don't.
MPI synchronization delays, which are key performance inhibitors you
will probably want to study, can show up as user or system time, all
depending on the MPI implementation, the type of wait, what run-time
settings you have chosen, etc. In many cases, it makes most sense for
you just to distinguish time spent inside MPI from time spent
outside MPI. Elapsed wallclock time will probably be your key metric.
Exactly how the MPI implementation spends time waiting is less important.
| 244. What is PMPI? |
PMPI refers to the MPI standard profiling interface.
Each standard MPI function can be called with an MPI_ or PMPI_ prefix.
For example, you
can call either MPI_Send() or PMPI_Send(). This feature of the MPI standard
allows one to write functions with the MPI_ prefix that call the equivalent
PMPI_ function. Specifically, a function so written has the behavior of
the standard function plus any other behavior one would like to add.
This is important for MPI performance analysis in at least two ways.
First, many performance analysis tools take advantage of PMPI. They
capture the MPI calls made by your program. They perform the associated
message-passing calls by calling PMPI functions, but also capture important
performance data.
Second, you can use such wrapper functions to customize MPI behavior.
E.g., you can add barrier operations to collective calls, write out
diagnostic information for certain MPI calls, etc.
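The wrapper pattern itself is easy to see in a few lines. The sketch below is a Python analogy only, not real MPI code: PMPI_Send here is a stand-in function with a simplified, made-up signature, and a real profiling wrapper would be written in C or Fortran against the actual MPI prototypes.

```python
import time

# Statistics gathered by the wrapper.
call_stats = {"MPI_Send": {"calls": 0, "seconds": 0.0}}

def PMPI_Send(buf, dest):
    """Stand-in for the real send; here it only reports the message size."""
    return len(buf)

def MPI_Send(buf, dest):
    """Wrapper with the MPI_ name: forward to PMPI_, then record statistics."""
    start = time.perf_counter()
    try:
        return PMPI_Send(buf, dest)
    finally:
        stats = call_stats["MPI_Send"]
        stats["calls"] += 1
        stats["seconds"] += time.perf_counter() - start

MPI_Send(b"hello", dest=1)
print(call_stats["MPI_Send"]["calls"])
```

The application keeps calling the MPI_ name and gets the standard behavior plus the extra bookkeeping, which is the same idea that performance tools rely on.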
OMPI generally layers the various function interfaces as follows:
- Fortran MPI_ interfaces are weak symbols for ...
- Fortran PMPI_ interfaces, which call ...
- C MPI_ interfaces, which are weak symbols for ...
- C PMPI_ interfaces, which provide the specified functionality.
Since OMPI generally implements MPI functionality for all languages in C,
you only need to provide profiling wrappers in C, even if your program
is in another programming language. Alternatively, you may write the wrappers in
your program's language, but if you provide wrappers in both languages
then both sets will be invoked.
There are a handful of exceptions. For example, MPI_ERRHANDLER_CREATE()
in Fortran does not call MPI_Errhandler_create(). Instead, it calls some
other low-level function. Thus, to intercept this particular Fortran call, you need a Fortran wrapper.
If you package your wrappers as a library, be sure to make that library dynamic.
A static library can experience the linker problems described in
the Complications section of the Profiling Interface chapter of the MPI standard.
See the section on Profiling Interface in the MPI standard for more details.
| 245. Should I use those switches --enable-mpi-profile and --enable-trace when
I configure OMPI? |
Probably not.
The --enable-mpi-profile switch enables building of the PMPI interfaces.
While this is important for performance analysis, this setting is already
turned on by default.
The --enable-trace switch enables internal tracing of OMPI/ORTE/OPAL calls. It
is used only for developer debugging, not MPI application performance tracing.
| 246. What support does OMPI have for performance analysis? |
The OMPI source base has some instrumentation to capture performance data,
but that data must be analyzed by other non-OMPI tools.
PERUSE was a proposed MPI standard that gives information about
low-level behavior of MPI internals. Check the PERUSE web site for
any information about analysis tools. When you configure OMPI, be
sure to use --enable-peruse. Information is available describing
its integration with OMPI.
Unfortunately, PERUSE didn't win standardization, so it didn't really
go anywhere. Open MPI may drop PERUSE support at some point in the
future.
MPI-3 standardised the MPIT tools interface API (see Chapter 14 in the
MPI-3.0 specification). As of v1.6.3, Open MPI does not yet support
this interface, but it is actively being developed. It is expected
that Open MPI will include a full implementation of MPIT in a future
release.
VampirTrace traces the entry to and exit from the MPI layer, along with important
performance data, writing data using the open OTF format. VT is available freely
and can be used with any MPI. Information is available
describing its integration with OMPI.
| 247. How do I view VampirTrace output? |
While OMPI includes VampirTrace instrumentation, it does not provide a
tool for viewing OTF trace data. There is simply a primitive otfdump utility
in the same directory where other OMPI commands (mpicc, mpirun, etc.) are
located.
Another simple utility, otfprofile, comes with OTF software and allows you to produce a short profile in LaTeX format from an OTF trace.
The main way to view OTF data is with the Vampir tool. Evaluation licenses are available.
| 248. Are there MPI performance analysis tools for OMPI that I can download for free? |
The OMPI distribution includes no such tools, but some general MPI tools can
be used with OMPI. You can search the Internet for such tools, and we
list a few candidates here.
For tracing, there are:
- MPE is a software package for MPI programmers. It is associated with MPICH, but works
with other MPIs. The Jumpshot trace viewer has similar functionality to Vampir.
- Sun Studio Performance Analyzer has MPI tracing support (like Vampir
and Jumpshot), but it also has whole-program performance analysis support.
- TAU can be used to trace data, but its viewer aggregates data. For viewing
TAU traces, one must convert data and use another viewer.
- Paraver can be used to view trace data, and has been used to view PERUSE events.
If you don't need traces and are concerned about large trace files,
profiling tools include:
There are also more sophisticated tools that attempt to incorporate analysis
heuristics, adapt data gathering based on performance characteristics, or
otherwise automate analysis based on expert knowledge. Examples include:
| 249. Any other kinds of tools I should know about? |
Well, there are other tools you should consider. Part of performance
analysis is not just analyzing performance per se, but generally understanding
the behavior of your program.
As such, debugging tools can help you step through or pry into the execution
of your MPI program. Popular tools include TotalView,
which can be downloaded for free trial use, and Allinea DDT
which also provides evaluation copies.
The command-line job inspection tool padb
has been ported to ORTE and OMPI.
| 250. How does Open MPI handle HFS+ / UFS filesystems? |
Generally, Open MPI does not care whether it is running from
an HFS+ or UFS filesystem. However, the C++ wrapper compiler historically
has been called mpiCC, which of course is the same file
as mpicc on a case-insensitive filesystem such as HFS+. During the configure
process, Open MPI will attempt to determine if the build filesystem is
case sensitive or not, and assume the install file system is the same
way. Generally, this is all that is needed to deal with HFS+.
However, if you are building on UFS and installing to HFS+, you should
specify --without-cs-fs to configure to make sure Open
MPI does not build the mpiCC wrapper. Likewise, if you
build on HFS+ and install to UFS, you may want to specify
--with-cs-fs to ensure that mpiCC is
installed.
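If you are unsure whether a given filesystem is case sensitive, you can test it directly. The Python sketch below is one simple way to do so and is not the check that Open MPI's configure script performs; is_case_sensitive is a hypothetical helper name.

```python
import os
import tempfile

def is_case_sensitive(directory):
    """Return True if the filesystem holding directory distinguishes case."""
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        lower = os.path.join(tmp, "mpicc")
        upper = os.path.join(tmp, "mpiCC")
        # Create only the lowercase name.
        with open(lower, "w"):
            pass
        # On a case-insensitive filesystem, "mpiCC" resolves to the same file.
        return not os.path.exists(upper)

print(is_case_sensitive(tempfile.gettempdir()))
```

Run it once against the build directory and once against the install directory; if the answers differ, that is the situation in which --with-cs-fs or --without-cs-fs is useful.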
| 251. How do I use the Open MPI wrapper compilers in XCode? |
XCode has a non-public interface for adding compilers to XCode. A
friendly Open MPI user sent in a configuration file for XCode 2.3,
MPICC.pbcompspec, which will add
support for the Open MPI wrapper compilers. The file should be
placed in /Library/Application Support/Apple/Developer Tools/Specifications/. Upon starting XCode, this file is loaded and added to the list of
known compilers.
To use the mpicc compiler, open the project, get info on the
target, click the rules tab, and add a new entry. Change the process rule
for "C source files" and select using MPICC.
Before moving the file, the ExecPath parameter should be set
to the location of the Open MPI install. The BasedOn parameter
should be updated to refer to the compiler version that mpicc
will invoke -- generally gcc-4.0 on OS X 10.4 machines.
Thanks to Karl Dockendorf for this information.
| 252. How do I run jobs under XGrid? |
XGrid support is included in Open MPI and will be built if the
XGrid tools are installed.
We unfortunately have little documentation on how to run with XGrid at
this point other than a fairly lengthy e-mail that Brian Barrett wrote
on the Open MPI user's mailing list:
Since Open MPI 1.1.2, we also support authentication using Kerberos.
The process is essentially the same, but there is no need to specify
the XGRID_PASSWORD field. Open MPI applications will then run as
the authenticated user, rather than nobody.
| 253. Where do I get more information about running under XGrid? |
Please write to us on the user's mailing list. Hopefully any
replies that we send will contain enough information to create proper
FAQs about how to use Open MPI with XGrid.
| 254. Is Open MPI included in OS X? |
Open MPI v1.2.3 was included in OS X starting with version 10.5
(Leopard). Note that Leopard does not include a Fortran compiler,
so the OS X-shipped version of Open MPI does not include Fortran
support.
If you need/want Fortran support, you will need to build your own copy
of Open MPI (presumably once you have a Fortran compiler installed).
The Open MPI team strongly recommends not overwriting the OS
X-installed version of Open MPI, but rather installing it somewhere
else (e.g., /opt/openmpi).
| 255. How do I not use the OS X-bundled Open MPI? |
There are a few reasons you might not want to use the OS
X-bundled Open MPI, such as wanting Fortran support, upgrading to a
new version, etc.
If you wish to use a community version of Open MPI, you can download
and build Open MPI on OS X just like any other supported platform. We
strongly recommend not replacing the OS X-installed Open MPI, but
rather installing to an alternate location (such as /opt/openmpi).
Once you successfully install Open MPI, be sure to prefix your PATH
with the bindir of Open MPI. This will ensure that you are using
your newly-installed Open MPI, not the OS X-installed Open MPI. For
example:
# Not showing the complete URL/tarball name because it changes over time :-)
shell$ wget http://www.open-mpi.org/.../open-mpi....
shell$ tar zxf openmpi-...gz
shell$ cd openmpi-...
shell$ ./configure --prefix=/opt/openmpi 2>&1 | tee config.out
[...lots of output...]
shell$ make -j 4 2>&1 | tee make.out
[...lots of output...]
shell$ sudo make install 2>&1 | tee install.out
[...lots of output...]
shell$ export PATH=/opt/openmpi/bin:$PATH
shell$ ompi_info
[...see output from newly-installed Open MPI...]
|
Of course, you'll want to make your PATH changes permanent. One
way to do this is to edit your shell startup
files.
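To confirm which installation your shell will pick up, it can help to list every copy of a command in PATH order. The Python sketch below does that for mpirun; find_all_in_path is a hypothetical helper, and the first entry printed is the one that will actually run.

```python
import os

def find_all_in_path(program, path):
    """Return every executable named program, in PATH search order."""
    hits = []
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, program)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            hits.append(candidate)
    return hits

for hit in find_all_in_path("mpirun", os.environ.get("PATH", "")):
    print(hit)
```

If the newly-installed copy (for example, the one under /opt/openmpi/bin) is not listed first, the PATH change has not taken effect in that shell.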
Note that there is no need to add Open MPI's libdir to
LD_LIBRARY_PATH; Open MPI's shared library build process
automatically uses the "rpath" mechanism to automatically find the
correct shared libraries (i.e., the ones associated with this build,
vs., for example, the OS X-shipped OMPI shared libraries). Also note
that we specifically do not recommend adding Open MPI's libdir to
DYLD_LIBRARY_PATH.
If you build static libraries for Open MPI, there is an ordering
problem such that /usr/lib/libmpi.dylib will be found before
$libdir/libmpi.a, and therefore user-linked MPI applications that
use mpicc (and friends) will use the "wrong" libmpi. This can be
fixed by editing
OMPI's wrapper compilers to force the use of the Right libraries,
such as with the following flag when configuring Open MPI:
shell$ ./configure --with-wrapper-ldflags="-Wl,-search_paths_first" ...
|
| 256. Is AIX a supported operating system for Open MPI? |
No. AIX used to be supported, but none of the current Open
MPI developers has any platforms that require AIX support for Open
MPI.
Since Open MPI is an open source project, its features and
requirements are driven by the union of its developers. Hence, AIX
support has fallen away because none of us currently use AIX. All
this means is that we do not develop or test on AIX; there is no
fundamental technology reason why Open MPI couldn't be supported on
AIX.
AIX support could certainly be re-instated if someone who wanted AIX
support joins the core group of developers and contributes the
development and testing to support AIX.
| 257. Does Open MPI work on AIX? |
There have been reports from random users that a small number
of changes are required to the Open MPI code base to make it work
under AIX. For example, see the following post on the Open MPI user's
list, reported by Ricardo Fonseca:
| 258. What is VampirTrace? |
VampirTrace is a program tracing package that can collect a
very fine grained event trace of your sequential or parallel
program. The traces can be visualized by the Vampir tool and a number
of other tools that read the Open Trace Format (OTF).
Tracing is interesting for performance analysis and optimization of
parallel and HPC (High Performance Computing) applications in general
and MPI programs in particular. In fact, that's where the letters
'mpi' in Vampir come from. Therefore, it is integrated into Open MPI
for convenience.
VampirTrace is included in Open MPI v1.3 and later.
VampirTrace consists of two main components: Firstly, the
instrumentation part which slightly modifies the target program in
order to be notified about run-time events of interest. Simply replace
the compiler wrappers to activate it: mpicc to mpicc-vt, mpicxx
to mpicxx-vt and so on (note that the *-vt variants of the wrapper
compilers are unavailable before Open MPI v1.3). Secondly, the
run-time measurement part is responsible for data collection. This
can only be effective when the first part was performed -- otherwise
there will be no effect on your program at all.
VampirTrace has been developed at ZIH, TU Dresden in collaboration with
the KOJAK project from JSC/FZ Juelich and is available as open source
software under BSD license, see ompi/contrib/vt/vt/COPYING.
The software is also available as a stand-alone source code
package. The latest version can always be found at http://www.tu-dresden.de/zih/vampirtrace/.
| 259. Where can I find the complete documentation of VampirTrace? |
Complete documentation for VampirTrace comes with the Open
MPI software package as PDF and HTML (in Open MPI v1.3 and later). You
can find it in the Open MPI source tree ompi/contrib/vt/vt/doc/ or
after installing Open MPI in
$(install-prefix)/share/vampirtrace/doc/.
| 260. How to instrument my MPI application with VampirTrace? |
All the necessary instrumentation of user functions as well as
MPI and OpenMP events is handled by special compiler wrappers (
mpicc-vt, mpicxx-vt, mpif77-vt, mpif90-vt ). Unlike the normal
wrappers ( mpicc and friends) these wrappers call VampirTrace's
compiler wrappers ( vtcc, vtcxx, vtf77, vtf90 ) instead of the
native compilers. The vt* wrappers use underlying platform compilers
to perform the necessary instrumentation of the program and link the
suitable VampirTrace library.
Original:
shell$ mpicc -c hello.c -o hello
|
With instrumentation:
shell$ mpicc-vt -c hello.c -o hello
|
For your application, simply change the compiler definitions in your
Makefile(s):
# original definitions in Makefile
## CC=mpicc
## CXX=mpicxx
## F90=mpif90
# replace with
CC=mpicc-vt
CXX=mpicxx-vt
F90=mpif90-vt
|
| 261. Does VampirTrace cause overhead to my application? |
By using the default MPI compiler wrappers ( mpicc etc.) your
application will be run without any changes at all. The VampirTrace
compiler wrappers ( mpicc-vt etc.) link the VampirTrace library which intercepts
MPI calls and some user level function/subroutine calls. This causes a certain
amount of runtime overhead to applications. Usually, the overhead is reasonably
small (0.x% - 5%) and VampirTrace by default enables precautions to avoid
excessive overhead. However, it can be configured to produce very substantial
overhead using non-default settings.
| 262. How can I change the underlying compiler of the mpi*-vt wrappers? |
Unlike the standard MPI compiler wrappers ( mpicc etc.) the
environment variables OMPI_CC, OMPI_CXX, OMPI_F77, OMPI_F90 do not
affect the VampirTrace compiler wrappers. Please, use the environment
variables VT_CC, VT_CXX, VT_F77, VT_F90 instead. In addition, you
can set the compiler with the wrapper's option -vt:[cc|cxx|f77|f90]
The following two are equivalent, setting the underlying compiler to
gcc:
shell$ VT_CC=gcc mpicc-vt -c hello.c -o hello
shell$ mpicc-vt -vt:cc gcc -c hello.c -o hello
|
Furthermore, you can modify the default settings in
/share/openmpi/mpi*-wrapper-data.txt.
| 263. How can I pass VampirTrace related configure options through the
Open MPI configure? |
To give options to the VampirTrace configure script you can add these
to the configure option --with-contrib-vt-flags.
The following example passes the options --with-papi-lib-dir and --with-papi-lib
to the VampirTrace configure script to specify the location and the name of the PAPI
library:
shell$ ./configure --with-contrib-vt-flags='--with-papi-lib-dir=/usr/lib64 --with-papi-lib=-lpapi64' ...
|
| 264. How to disable the integrated VampirTrace, completely? |
By default, the VampirTrace part of Open MPI will be built and
installed. If you would like to disable building and installing of
VampirTrace add the value vt to the configure option
--enable-contrib-no-build.
shell$ ./configure --enable-contrib-no-build=vt ...
|
|