
FAQ:
Troubleshooting building and running MPI jobs


Table of contents:

  1. What is mca_pml_teg.so? Why am I getting warnings about not finding the mca_ptl_base_modules_initialized symbol from it?
  2. Can I build shared libraries on AIX with the IBM XL compilers?
  3. Why am I getting a seg fault in libopal?
  4. Why am I getting seg faults / MPI parameter errors when compiling C++ applications with the Intel 9.1 C++ compiler?
  5. Why can't I attach my parallel debugger (TotalView, DDT, fx2, etc.) to parallel jobs?
  6. When launching large MPI jobs, I see messages like: mca_oob_tcp_peer_complete_connect: connection failed: Connection timed out (110) - retrying
  7. How do I find out what MCA parameters are being seen/used by my job?


1. What is mca_pml_teg.so? Why am I getting warnings about not finding the mca_ptl_base_modules_initialized symbol from it?

You may wonder why you see this warning message (put here verbatim so that it becomes web-searchable):

mca_pml_teg.so:undefined symbol:mca_ptl_base_modules_initialized

This happens when you upgrade to Open MPI v1.1 (or later) over an old installation of Open MPI v1.0.x without first uninstalling v1.0.x. The reasons this problem occurs are fairly uninteresting; the simplest, safest solution is to uninstall version 1.0.x and then re-install your newer version. For example:

shell# cd /path/to/openmpi-1.0
shell# make uninstall
[... lots of output ...]
shell# cd /path/to/openmpi-1.1
shell# make install

The above example shows changing into the Open MPI 1.1 directory to re-install, but the same concept applies to any version after Open MPI version 1.0.x.

Note that this problem is fairly specific to installing / upgrading Open MPI from the source tarball. Pre-packaged installers (e.g., RPM) typically do not incur this problem.
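
If you want to check whether files from the old plugin are still present before re-installing, a minimal sketch along the following lines may help. The helper name find_stale_teg and the /opt/ompi prefix are illustrative only, and the sketch assumes that the leftover files have names beginning with mca_pml_teg, as in the warning message above:

```shell
# Hypothetical helper (not part of Open MPI): list any files whose names
# start with "mca_pml_teg" anywhere under the given installation prefix.
find_stale_teg() {
    prefix="$1"
    find "$prefix" -name 'mca_pml_teg*' -print 2>/dev/null
}

# Example usage (the prefix is an assumption; use your own install path):
#   find_stale_teg /opt/ompi
```

If this prints any paths after you have installed v1.1 or later, the old v1.0.x installation was probably not removed first.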


2. Can I build shared libraries on AIX with the IBM XL compilers?

Short answer: in older versions of Open MPI, maybe.

Add "LDFLAGS=-Wl,-brtl" to your configure command line:

shell$ ./configure LDFLAGS=-Wl,-brtl ...

This enables "runtimelinking", which will make GNU Libtool name the libraries properly (i.e., *.so). More importantly, runtimelinking will cause the runtime linker to behave more or less like an ELF linker would (with respect to symbol resolution).

Future versions of OMPI may not require this flag (and "runtimelinking" on AIX).

NOTE: As of OMPI v1.2, AIX is no longer supported.


3. Why am I getting a seg fault in libopal?

You most likely did not get a segv in libopal; more likely, you are seeing a message like this (with OMPI v1.0 and v1.1):

[0] func:/opt/ompi/lib/libopal.so.0 [0x2a958de8a7]

or something like this (OMPI v1.2 and beyond; Linux output shown below -- looks slightly different on other OS's):

[0] func:/opt/ompi/lib/libopal.so.0(opal_backtrace_print+0x2b) [0x2a958de8a7]

This is actually the function that prints the stack trace message; it is not the function that caused the segv itself. The function that caused the problem will be a few frames below this one. Future versions of OMPI will simply not display this libopal function in the segv reporting to avoid confusion.

Let's provide a concrete example. Take the following trivial MPI program that is guaranteed to cause a seg fault in MPI_COMM_WORLD rank 1:

shell$ cat segv.c
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 1) {
        char *d = 0;
        /* This will cause a seg fault */
        *d = 3;
    }

    MPI_Finalize();
    return 0;
}

Running this code, you'll see something similar to the following:

shell$ mpicc segv.c -o segv -g
shell$ mpirun -np 2 --mca btl tcp,self segv
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:(nil)
[0] func:/opt/ompi/lib/libopal.so.0(opal_backtrace_print+0x2b) [0x2a958de8a7]
[1] func:/opt/ompi/lib/libopal.so.0 [0x2a958dd2b7]
[2] func:/lib64/tls/libpthread.so.0 [0x3be410c320]
[3] func:segv(main+0x3c) [0x400894]
[4] func:/lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x3be361c4bb]
[5] func:segv [0x4007ca]
*** End of error message ***

The real error was back up in main, which is #3 on the stack trace. But Open MPI's stack-tracing function (opal_backtrace_print, in this case) is what is displayed as #0, so it's an easy mistake to assume that libopal is the culprit.
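
As a rough aid for reading such traces, the following sketch filters a saved backtrace and prints the first frame that is in neither libopal nor libpthread. The function name first_suspect_frame is made up for illustration. Skipping libpthread is a heuristic based on the trace above (where it appears between the libopal frames and main), not a rule that Open MPI documents:

```shell
# Heuristic sketch: read a backtrace on stdin and print the first frame
# that is not inside libopal (the stack-trace printer) or libpthread
# (assumed here to be part of signal delivery, as in the trace above).
first_suspect_frame() {
    grep '^\[[0-9]*] func:' \
        | grep -v -e 'libopal' -e 'libpthread' \
        | head -n 1
}

# Example usage, assuming the error output was saved to a file:
#   first_suspect_frame < segv_output.txt
```

Applied to the trace shown above, this should pick out the frame in main rather than the libopal frames at the top.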


4. Why am I getting seg faults / MPI parameter errors when compiling C++ applications with the Intel 9.1 C++ compiler?

Early versions of the Intel 9.1 C++ compiler series had problems with the Open MPI C++ bindings. Even trivial MPI applications that used the C++ MPI bindings could incur process failures (such as segmentation violations) or generate MPI-level errors complaining about invalid parameters.

Intel released a new version of their 9.1 series C++ compiler on October 5, 2006 (build 44) that seems to solve all of these issues. The Open MPI team recommends that all users needing the C++ MPI API upgrade to this version (or later) if possible. Since the problems are with the compiler, there is little that Open MPI can do to work around the issue; upgrading the compiler seems to be the only solution.


5. Why can't I attach my parallel debugger (TotalView, DDT, fx2, etc.) to parallel jobs?

As noted in this FAQ entry, Open MPI supports parallel debuggers that utilize the TotalView API for parallel process attaching. However, attaching can sometimes fail if Open MPI is not installed correctly. Symptoms of this failure typically involve having the debugger hang (or crash) when attempting to attach to a parallel MPI application.

Parallel debuggers may rely on Open MPI's mpirun program being compiled without optimization. Open MPI's configure and build process therefore attempts to identify optimization flags and remove them when compiling mpirun, but it does not have knowledge of all optimization flags for all compilers. Hence, if you specify some esoteric optimization flags to Open MPI's configure script, some of them may slip through the process and create an mpirun that cannot be read by TotalView and other parallel debuggers.

If you run into this problem, you can manually build mpirun without optimization flags. Go into the tree where you built Open MPI:

shell$ cd /path/to/openmpi/build/tree
shell$ cd orte/tools/orterun
shell$ make clean
[...output not shown...]
shell$ make all CFLAGS=-g
[...output not shown...]
shell$

This will build mpirun (also known as orterun) with just the "-g" flag. Once this completes, run "make install", also from within the orte/tools/orterun directory, and possibly as root depending on where you installed Open MPI. Using this new orterun (mpirun), your parallel debugger should be able to attach to MPI jobs.

Additionally, a user reported to us that setting some TotalView flags may be helpful with attaching. The user specifically cited the Open MPI v1.3 series compiled with the Intel 11 compilers and TotalView 8.6, but it may also be helpful for other versions, too:

shell$ export with_tv_debug_flags="-O0 -g -fno-inline-functions"


6. When launching large MPI jobs, I see messages like: mca_oob_tcp_peer_complete_connect: connection failed: Connection timed out (110) - retrying

This is a known issue in the Open MPI v1.2 series. Try the following:

  1. If you are using Linux-based systems, increase some of the limits on the node where mpirun is invoked (you must have administrator/root privileges to increase these limits):

    # The default is 128; increase it to 10,000
    shell# echo 10000 > /proc/sys/net/core/somaxconn
    
    # The default is 1,000; increase it to 100,000
    shell# echo 100000 > /proc/sys/net/core/netdev_max_backlog
    

  2. Set the oob_tcp_listen_mode MCA parameter to the string value listen_thread. This enables Open MPI's mpirun to respond much more quickly to incoming TCP connections during job launch, for example:

    shell$ mpirun --mca oob_tcp_listen_mode listen_thread -np 1024 my_mpi_program
    

    See this FAQ entry for more details on how to set MCA parameters.
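
Before changing the kernel limits in step 1, it can be useful to see what they are currently set to. The following is a minimal sketch, not an Open MPI tool; the helper name check_limit is hypothetical, and the target values are simply the ones suggested above:

```shell
# Hypothetical helper: read a numeric limit from a file and report
# whether it is below the given target value.
# Returns 0 if the limit is at or above the target, 1 if it is below,
# and 2 if the file could not be read.
check_limit() {
    file="$1"
    target="$2"
    current=$(cat "$file") || return 2
    if [ "$current" -lt "$target" ]; then
        echo "$file: $current (below $target)"
        return 1
    fi
    echo "$file: $current (ok)"
}

# Example usage on the node where mpirun is invoked (Linux only):
#   check_limit /proc/sys/net/core/somaxconn 10000
#   check_limit /proc/sys/net/core/netdev_max_backlog 100000
```

Reading these files does not require root privileges; only writing new values does.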


7. How do I find out what MCA parameters are being seen/used by my job?

As described elsewhere, MCA parameters are the "life's blood" of Open MPI. MCA parameters are used to control both detailed and large-scale behavior of Open MPI and are present throughout the code base.

This raises an important question: since MCA parameters can be set from a file, the environment, the command line, and even internally within Open MPI, how do I actually know which MCA params my job is seeing, and what their values are?

One way, of course, is to use the ompi_info command, which is documented elsewhere (you can use "man ompi_info", or "ompi_info --help" to get more info on this command). However, this still doesn't fully answer the question since ompi_info isn't an MPI process.

To help relieve this problem, Open MPI (starting with the 1.3 release) provides the MCA parameter mpi_show_mca_params that directs the rank=0 MPI process to report the name of MCA parameters, their current value as seen by that process, and the source that set that value. The parameter can take several values that define which MCA parameters to report:

  • all: report all MCA params. Note that this typically generates a rather long list of parameters since it includes all of the default parameters defined inside Open MPI
  • default: MCA params that are still at the default values defined within Open MPI
  • file: MCA params that had their value set by a file
  • api: MCA params set using Open MPI's internal APIs, perhaps to override an incompatible set of conditions specified by the user
  • enviro: MCA params that obtained their value either from the local environment or the command line. Open MPI treats environmental and command line parameters as equivalent, so there currently is no way to separate these two sources

These options can be combined in any order by separating them with commas.

Here is an example of the output generated by this parameter:

shell$ mpirun -mca grpcomm basic -mca mpi_show_mca_params enviro ./hello
ess=env (environment or cmdline)
orte_ess_jobid=1016725505 (environment or cmdline)
orte_ess_vpid=0 (environment or cmdline)
grpcomm=basic (environment or cmdline)
mpi_yield_when_idle=0 (environment or cmdline)
mpi_show_mca_params=enviro (environment or cmdline)
Hello, World, I am 0 of 1

Note that several MCA parameters set by Open MPI itself for internal uses are displayed in addition to the ones actually set by the user.

Since the output from this option can be long, and since it can be helpful to have a more permanent record of the MCA parameters used for a job, a companion MCA parameter mpi_show_mca_params_file is provided. If mpi_show_mca_params is also set, the output listing of MCA parameters will be directed into the specified file instead of being printed to stdout.
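
Once the listing has been saved to a file, a single parameter can be looked up with a small filter. The sketch below is only an illustration: the helper name mca_lookup and the file name params.txt are made up, and it assumes that the saved file uses the same "name=value (source)" line format as the example output above:

```shell
# Hypothetical helper: print the line for one MCA parameter from a file
# of saved mpi_show_mca_params output.
# Assumes one "name=value (source)" entry per line.
mca_lookup() {
    name="$1"
    file="$2"
    grep -- "^${name}=" "$file"
}

# The file could be produced by a run such as the following, which
# combines two of the values listed above (params.txt is a made-up name):
#   mpirun -mca mpi_show_mca_params file,enviro \
#          -mca mpi_show_mca_params_file params.txt ./hello
# and then queried with:
#   mca_lookup grpcomm params.txt
```

The source shown in parentheses on the matching line tells you where that parameter's value came from.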