Ticket #694 (closed bug: fixed)
Fedora 11, MPICH2-1.1 collective abort of all ranks, killed by signal 4
| Reported by: | Tim Sliwinski <tss06@…> | Owned by: | buntinas |
|---|---|---|---|
| Priority: | blocker | Milestone: | mpich2-1.1.1 |
| Component: | mpich2 | Keywords: | |
| Cc: | | | |
Description
mpich2-maint,

I am working with a colleague to update a cluster that is currently based on mpich1 and a much older version of Fedora. We were unable to get the Fedora 11 package to work, so we decided to build from source. Configure (with the --with-pm=mpd and --prefix options), make, and make install all completed without any noticeable errors (to my untrained eye, going through the log files), but we consistently hit the problem described below.

Currently we have a cluster of two nodes that we are attempting to set this up on. Both run Fedora 11 and are named hydra1 and hydra3. hydra1 is the main server, where mpdboot is run.

First, when we start the mpd ring, we get a DeprecationWarning:

    [administrator@hydra1 ~]$ mpdboot -n 2 -f ~/mpd.hosts
    /home/administrator/mpich2-install/bin/mpdboot:56: DeprecationWarning: The popen2 module is deprecated. Use the subprocess module.
      from popen2 import Popen4, Popen3, popen2

mpd.hosts contains the lines:

    [administrator@hydra1 ~]$ cat mpd.hosts
    hydra1
    hydra3

Everything still seems to work from that point on:

    [administrator@hydra1 ~]$ mpdtrace
    hydra1
    hydra3
    [administrator@hydra1 ~]$ mpiexec -n 2 hostname
    hydra1.<domain removed>
    hydra3.<domain removed>

However, when we attempt to run any parallel code with mpiexec (either on a single machine or across the mpd ring), we get the following or something similar:

    [administrator@hydra1 ~]$ mpiexec -n 4 ~/cpi
    rank 2 in job 1 hydra1.<domain removed>_54122 caused collective abort of all ranks
      exit status of rank 2: killed by signal 9
    rank 0 in job 1 hydra1.<domain removed>_54122 caused collective abort of all ranks
      exit status of rank 0: killed by signal 4

We are unsure what to do with this. We have tried other programs built for parallel execution, but the same error occurs. If we run mpdboot from hydra3 instead, we get exactly the same error, with hydra3 replacing hydra1 in the output above.

The executables are being run from mpich2-install/bin in a home directory shared via NFS, and the path variables are defined on both machines.

Attached are c.txt, m.txt, and mi.txt for your review.

Thanks for your help in advance!!!

-Timothy Sliwinski
Florida State University Student
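For reference, the numeric exit statuses in the mpiexec output can be translated into signal names. The following is a minimal Python 3 sketch, not part of MPICH2; the helper name `describe_signal` is hypothetical, and the mapping shown assumes Linux signal numbering, under which signal 4 is SIGILL (illegal instruction) and signal 9 is SIGKILL.

```python
import signal


def describe_signal(signum):
    """Return the symbolic name for a signal number (hypothetical helper).

    Assumes the platform's own numbering; on Linux, 4 is SIGILL and 9 is SIGKILL.
    """
    try:
        return signal.Signals(signum).name
    except ValueError:
        # The number does not correspond to a signal known on this platform.
        return "unknown signal %d" % signum


# Rank/signal pairs copied from the mpiexec output in this report.
for rank, signum in [(2, 9), (0, 4)]:
    print("rank %d: killed by signal %d (%s)" % (rank, signum, describe_signal(signum)))
```

This only names the signals; it does not by itself explain why rank 0 received an illegal-instruction signal.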
