Ticket #905 (closed bug: fixed)

Opened 5 weeks ago

Last modified 3 weeks ago

mpiexec -machinefile :broken

Reported by: https://me.yahoo.com/a/R_cfb2Qo1sVFwXJFDN6j0k8vRQ--#638d0
Owned by: goodell
Priority: major
Milestone: mpich2-1.2.1
Component: mpich2
Keywords:
Cc:

Description (last modified by balaji)

Hi all,

I have set up an mpich2 cluster with one quad-core and three dual-core machines, and I have tested it with a simple program. It runs well, but the only way to make the quad-core node accept 4 processes and the dual-core nodes 2 processes each is via a machinefile, and that turned out to be broken. The source code of the test program is attached. Here is the output with two of the nodes removed from the mpd pool for simpler error output:

mpi@ubuntu:~$ mpich2version 
MPICH2 Version:         1.2
MPICH2 Release date:    Unknown, built on Thu Oct 22 21:45:41 EEST 2009
MPICH2 Device:          ch3:nemesis
MPICH2 configure:       --prefix=/home/mpi/mpich2
MPICH2 CC:      gcc  -O2
MPICH2 CXX:     c++  -O2
MPICH2 F77:      
MPICH2 F90:   

mpi@ubuntu:~$ mpdtrace
ubuntu
desky

mpi@ubuntu:~$ cat mmap 
desky:2
ubuntu:4

mpi@ubuntu:~$ mpiexec -n 6 ./pi_mpi
MPI procs to spawn: 6
Estimate accuracy: 0.00000000100000000000 (1.000e-09) with 64 bit arythmetic
Calculated: Pi = 3.14159265309880275652
Elapsed time: 18.924 s

mpi@ubuntu:~$ mpiexec -machinefile mmap -n 6 ./pi_mpi
rank 3 in job 5  ubuntu_52199   caused collective abort of all ranks
  exit status of rank 3: killed by signal 9 
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(394).................: Initialization failed
MPID_Init(118)........................: channel initialization failed
MPIDI_CH3_Init(43)....................: 
MPID_nem_init(202)....................: 
MPIDI_CH3I_Seg_commit(366)............: 
MPIU_SHMW_Hnd_deserialize(358)........: 
MPIU_SHMW_Seg_open(901)...............: 
MPIU_SHMW_Seg_create_attach_templ(673): open failed - No such file or directory
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(394).................: Initialization failed
MPID_Init(118)........................: channel initialization failed
MPIDI_CH3_Init(43)....................: 
MPID_nem_init(202)....................: 
MPIDI_CH3I_Seg_commit(366)............: 
MPIU_SHMW_Hnd_deserialize(358)........: 
MPIU_SHMW_Seg_open(901)...............: 
MPIU_SHMW_Seg_create_attach_templ(673): open failed - No such file or directory
rank 2 in job 5  ubuntu_52199   caused collective abort of all ranks
  exit status of rank 2: return code 1 
mpi@ubuntu:~$ mpiexec -machinefile mmap -n 4 ./pi_mpi
MPI procs to spawn: 4
Estimate accuracy: 0.00000000100000000000 (1.000e-09) with 64 bit arythmetic
Calculated: Pi = 3.14159265309204016603
Elapsed time: 19.383 s

mpi@ubuntu:~$

Attachments

pi_mpi.c (1.6 KB) - added by https://me.yahoo.com/a/R_cfb2Qo1sVFwXJFDN6j0k8vRQ--#638d0 5 weeks ago.
failing testcode

Change History

Changed 5 weeks ago by https://me.yahoo.com/a/R_cfb2Qo1sVFwXJFDN6j0k8vRQ--#638d0

failing testcode

Changed 4 weeks ago by https://me.yahoo.com/a/R_cfb2Qo1sVFwXJFDN6j0k8vRQ--#638d0

By the way, the same error appears with examples/cpi.

Changed 4 weeks ago by buntinas

Thanks for the bug report. I have been able to reproduce this here. We'll take a look at it.

Changed 4 weeks ago by buntinas

  • owner set to goodell
  • status changed from new to assigned

It looks like mpd is passing an incorrect mapping string. With the machine file:

octopus:2
octagon:4

mpd gives the mapping string "(vector,(0,2,3))", while the correct mapping is "(vector,(0,1,2),(1,1,4))", which is what hydra (on trunk) gives.
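To make the difference between the two strings concrete, here is a minimal sketch in Python. It is not code from mpd or hydra. It assumes each triple in the vector reads as (first node id, number of nodes, processes per node), which is consistent with the two strings quoted above. The function names are hypothetical, and a host listed without a ":count" suffix is assumed to mean one process.

```python
import re


def mapping_from_machinefile(lines):
    """Build a '(vector,...)' string with one triple per machinefile entry.

    Hypothetical helper: node ids are assigned in file order, and a host
    without ':count' is assumed to take a single process.
    """
    blocks = []
    node_id = 0
    for raw in lines:
        entry = raw.strip()
        if not entry:
            continue  # skip blank lines
        _host, _, count = entry.partition(":")
        procs = int(count) if count else 1
        blocks.append("(%d,1,%d)" % (node_id, procs))
        node_id += 1
    return "(vector," + ",".join(blocks) + ")"


def expand_mapping(mapping, nprocs):
    """Return the node id that each of the first nprocs ranks is mapped to.

    Assumes each triple means (first node id, number of nodes, procs per node).
    """
    triples = re.findall(r"\((\d+),(\d+),(\d+)\)", mapping)
    ranks = []
    for start, num_nodes, per_node in triples:
        for node in range(int(start), int(start) + int(num_nodes)):
            ranks.extend([node] * int(per_node))
    return ranks[:nprocs]


if __name__ == "__main__":
    good = mapping_from_machinefile(["octopus:2", "octagon:4"])
    bad = "(vector,(0,2,3))"
    print(good)
    print("expected placement:", expand_mapping(good, 6))
    print("mpd's placement:   ", expand_mapping(bad, 6))
```

Under that reading, the string from mpd claims three ranks per node, so rank 2 is told it shares node 0 with ranks 0 and 1 even though the machinefile places it on the second host. That would be consistent with the shared-memory "open failed - No such file or directory" error in the report, and with the -n 4 run succeeding.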

Giving this to Dave.

Changed 4 weeks ago by goodell

  • status changed from assigned to accepted
  • milestone set to mpich2-1.2.1

Changed 4 weeks ago by balaji

  • description modified

Cleaned up the description formatting a bit.

Changed 3 weeks ago by goodell

This should be fixed in r5639. This functionality in mpd is getting to be a bit gross :( Pavan is looking for a simpler solution within mpd, but really hydra is the right long term fix.

Changed 3 weeks ago by balaji

  • status changed from accepted to closed
  • resolution set to fixed

There doesn't seem to be a cleaner way to do this because of how mpd is structured (unless we restructure it entirely, which is wasteful given that it'll no longer be the default in the 1.3 series). Resolving.
