Ticket #623 (new bug)
Nemesis on Windows fails in MPI_Allreduce() with 32+ cores and 128+ procs
| Reported by: | jayesh | Owned by: | jayesh |
|---|---|---|---|
| Priority: | major | Milestone: | mpich2-1.3 |
| Component: | mpich2 | Keywords: | |
| Cc: | jeffb@… | | |
Description
This bug was reported by Jeff Baxter@MS.
=================================================
Thanks Jayesh,
The nemesis stuff seems cool, and I am seeing significant improvements on small-message allreduces, for example at 128-core (16-node) scale. I don’t seem to be seeing much improvement on bcast for either small or large messages, and I was wondering whether there were particular areas you had concentrated on, and which I should look at first.
One thing I do get consistently is a crash at large message sizes for allreduce. This is the output from a 4 MB allreduce across 128 cores; not sure if it is a known issue?
C:\mpich2drop>.\mpiexec -channel nemesis -machinefile \\marlinhn01\c$\mpich2drop\nodes.txt -n 128 c:\mpich2drop\colltestmpich2.exe allreduce 4000000 10
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(773)....................: MPI_Allreduce(sbuf=00000000065B0040, rbuf=0000000024E00040, count=4000000, MPI_CHAR, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Reduce(759)......................:
MPIR_Reduce_redscat_gather(485).......:
MPIC_Sendrecv(161)....................:
MPIC_Wait(405)........................:
MPIDI_CH3I_Progress(207)..............:
MPID_nem_handle_pkt(489)..............:
pkt_RTS_handler(238)..................:
do_cts(498)...........................:
MPID_nem_lmt_shm_start_recv(173)......:
MPID_nem_allocate_shm_region(824).....:
MPIU_SHMW_Seg_create_and_attach(933)..:
MPIU_SHMW_Seg_create_attach_templ(786): unable to allocate shared memory - CreateFileMapping Cannot create a file when that file already exists.
Cheers Jeff
=================================================
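The last frame of the error stack reports that a named shared-memory mapping could not be created because an object with that name already existed. The root cause is not established in this ticket. As a rough illustration of that failure class only, the sketch below uses Python's `multiprocessing.shared_memory` rather than the nemesis code path; the segment name `lmt_demo_<pid>` and the helper names are hypothetical. It shows that a second create under a name that is still in use is rejected.

```python
import os
from multiprocessing import shared_memory


def create_segment(name, size):
    """Create a named segment; return None if the name is already taken."""
    try:
        return shared_memory.SharedMemory(name=name, create=True, size=size)
    except FileExistsError:
        # Analogous in spirit to "Cannot create a file when that file
        # already exists" in the stack above (an assumption, not the
        # MPICH2 implementation).
        return None


def demo():
    # Hypothetical name; the pid suffix only keeps separate runs apart.
    name = f"lmt_demo_{os.getpid()}"
    first = create_segment(name, 4096)
    assert first is not None, "demo name unexpectedly in use"
    try:
        # A second create under the same name fails while the first exists.
        second = create_segment(name, 4096)
        collided = second is None
        if second is not None:
            second.close()
    finally:
        first.close()
        first.unlink()  # release the name (no-op on Windows)
    return collided


if __name__ == "__main__":
    print("name collision detected:", demo())
```

If the failure here turns out to be a collision of this kind, the usual remedies are to make each segment name unique per transfer or to treat an already-existing name as a retry condition. Whether either applies to the nemesis large-message path is an open question for this ticket.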
