MPICH mpiexec (MPI) process terminating upon error, unable to debug in lldb


EDIT: I had a typo in my command to launch lldb (see the comment below), so I've updated the post to focus on a different, larger issue.

I'm trying to debug my MPI application in lldb so that I can inspect it when an error occurs (e.g., segv or abort). Here's how I'm invoking my MPI run:

/usr/local/bin/mpiexec -np 3 -disable-auto-cleanup xterm -e "lldb -s lldb.commands -- app_binary <args> ; sleep 100"

As soon as the run starts, I get the error trace below. I think the most relevant line is "PMI_Get_appnum returned -1".

[cli_0]: write_line error; fd=8 buf=:cmd=init pmi_version=1 pmi_subversion=1
:
system msg for write_line failure : Bad file descriptor
[cli_0]: Unable to write to PMI_fd
[cli_0]: write_line error; fd=8 buf=:cmd=get_appnum
:
system msg for write_line failure : Bad file descriptor
Fatal error in MPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(565): 
MPID_Init(175).......: channel initialization failed
MPID_Init(463).......: PMI_Get_appnum returned -1
[cli_0]: write_line error; fd=8 buf=:cmd=abort exitcode=1094415
:
system msg for write_line failure : Bad file descriptor
Process 19063 exited with status = 15 (0x0000000f) 

Unfortunately, some mailing lists suggest that this is a general bug with MPICH on OS X (see https://github.com/pmodels/mpich/issues/2063, currently still unresolved). Does anyone have a workaround?

mpi
lldb
mpich
asked on Stack Overflow Jul 17, 2019 by Kulluk007 • edited Jul 22, 2019 by Kulluk007

1 Answer


Since you're using lldb, you're probably also using clang, so you could compile your code with the address sanitizer (ASan), which adds runtime checks for memory errors.

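Just add the following flags to your compile and link commands: -g -fsanitize=address -fno-omit-frame-pointer -fsanitize-recover=address. The object files need to be built with the same flags, since that is where the checks are inserted. The link step would look like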

mpicc object.o -o exec -g -fsanitize=address -fno-omit-frame-pointer -fsanitize-recover=address

With the address sanitizer enabled, your program prints a short stack trace at the point where it indexes out of bounds or touches memory it doesn't own.

If you combine the address sanitizer with lldb, execution should stop at the line where the memory problem occurred. That said, I haven't had much success running lldb and MPI at the same time. Either way, the address sanitizer should help you.

answered on Stack Overflow Jul 19, 2019 by dpuleri

User contributions licensed under CC BY-SA 3.0