id	summary	reporter	owner	description	type	status	priority	milestone	component	resolution	keywords	cc
694	Fedora 11, MPICH2-1.1 collective abort of all ranks, killed by  signal 4	Tim Sliwinski <tss06@…>	buntinas	"

{{{

mpich2-maint,

I am working with a colleague to update a cluster based on mpich1 and a
much older version of Fedora. We were unable to get the Fedora 11
package to work and decided to build from source. The configure step (with
the --with-pm=mpd and --prefix options), make, and make install all
completed without any noticeable errors (as far as my untrained eyes can
tell from the log files), but we consistently hit the problem described
below. We are currently trying to set this up on a cluster of two nodes,
both running Fedora 11, named hydra1 and hydra3. hydra1 is the main server
where mpdboot is run.

First, when we start the mpd ring, we get a DeprecationWarning:

[administrator@hydra1 ~]$ mpdboot -n 2 -f ~/mpd.hosts
/home/administrator/mpich2-install/bin/mpdboot:56: DeprecationWarning:
The popen2 module is deprecated.  Use the subprocess module.
  from popen2   import Popen4, Popen3, popen2

mpd.hosts has the lines:

[administrator@hydra1 ~]$ cat mpd.hosts
hydra1
hydra3

But everything still seems to work from that point on:

[administrator@hydra1 ~]$ mpdtrace
hydra1
hydra3
[administrator@hydra1 ~]$ mpiexec -n 2 hostname
hydra1.<domain removed>
hydra3.<domain removed>

However, when we attempt to run any parallel code (either on a single
machine or the mpd ring using mpiexec) we get the following or something
similar:

[administrator@hydra1 ~]$ mpiexec -n 4 ~/cpi
rank 2 in job 1  hydra1.<domain removed>_54122   caused collective abort
of all ranks
  exit status of rank 2: killed by signal 9
rank 0 in job 1  hydra1.<domain removed>_54122   caused collective abort
of all ranks
  exit status of rank 0: killed by signal 4

We are unsure how to proceed. We have tried other programs built
for parallel execution, but the same error occurs. Even if we run
mpdboot from hydra3, we get exactly the same error, with hydra3
replacing hydra1 in the output above.

The executables are run from mpich2-install/bin in a home directory
shared via NFS, and the path variables are set on both machines.

Attached are the c.txt, m.txt, and mi.txt for your review.

Thanks for your help in advance!!!

-Timothy Sliwinski
Florida State University Student


}}}"	bug	closed	blocker	mpich2-1.1.1	mpich2	fixed		
