Ticket #623 (new bug)

Opened 6 months ago

Last modified 3 weeks ago

Nemesis on Windows fails in MPI_Allreduce() (32+ cores and 128+ procs)

Reported by: jayesh
Owned by: jayesh
Priority: major
Milestone: mpich2-1.3
Component: mpich2
Keywords:
Cc: jeffb@…

Description

This bug was reported by Jeff Baxter@MS.

=================================================

Thanks Jayesh,

The nemesis stuff seems cool, and I am seeing significant improvements on small-message allreduces, for example at 128-core (16-node) scale. I don't seem to be seeing much improvement on bcast for either small or large messages, and I was wondering whether there were particular areas you had concentrated on, and which I should look at first. One thing I do get consistently is a crash at large message sizes for allreduce. This is the output from a 4MB allreduce across 128 cores; I am not sure whether it is a known issue.

C:\mpich2drop>.\mpiexec -channel nemesis -machinefile \\marlinhn01\c$\mpich2drop\nodes.txt -n 128 c:\mpich2drop\colltestmpich2.exe allreduce 4000000 10

Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(773)....................: MPI_Allreduce(sbuf=00000000065B0040, rbuf=0000000024E00040, count=4000000, MPI_CHAR, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Reduce(759)......................:
MPIR_Reduce_redscat_gather(485).......:
MPIC_Sendrecv(161)....................:
MPIC_Wait(405)........................:
MPIDI_CH3I_Progress(207)..............:
MPID_nem_handle_pkt(489)..............:
pkt_RTS_handler(238)..................:
do_cts(498)...........................:
MPID_nem_lmt_shm_start_recv(173)......:
MPID_nem_allocate_shm_region(824).....:
MPIU_SHMW_Seg_create_and_attach(933)..:
MPIU_SHMW_Seg_create_attach_templ(786): unable to allocate shared memory - CreateFileMapping Cannot create a file when that file already exists.

Cheers,
Jeff

=================================================

Attachments

part0001.html (0.6 KB) - added by Jayesh Krishna 5 months ago.
Added by email2trac

Change History

Changed 6 months ago by jayesh

Jeff,

I am not able to recreate the problem in our lab (I am running MPI_Allreduce() over a 2-byte to 6MB message range with 120 procs on 8 cores; as you mentioned in your email, I probably need ~32 cores to reproduce it). However, I think I know where the problem lies. On Windows we name the shared memory segments using the lower part of the query performance counter value, and this might be causing name conflicts among the shm segments used for large message transfers in nemesis.

Thanks for reporting the bug. I have created a ticket for it (https://trac.mcs.anl.gov/projects/mpich2/ticket/623) and will provide you with a custom build containing a fix over the weekend, which you can use for your testing.

Regards, Jayesh
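
The naming conflict suspected in the comment above can be illustrated with a small, portable sketch. It is not the MPICH2 code: the function names, the "shm_seg_" prefix, the 32-bit truncation width, and the pid/sequence alternative are all assumptions made for illustration. It only shows why a name built from the low bits of a counter can repeat, and how adding a per-process identifier would keep names on one node distinct.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical scheme: build a segment name from only the low 32 bits
 * of a counter value. Two counters that differ only in their high bits
 * produce the same name. */
static void name_from_low_bits(uint64_t counter, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    snprintf(buf, len, "shm_seg_%08" PRIx32,
             (uint32_t)(counter & 0xffffffffu));
}

/* Hypothetical alternative: prefix the name with the creating process id
 * and a per-process sequence number, so that two processes on the same
 * node cannot produce the same name even if the counter bits match. */
static void name_with_pid_seq(uint32_t pid, uint32_t seq, uint64_t counter,
                              char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    snprintf(buf, len, "shm_seg_%" PRIu32 "_%" PRIu32 "_%08" PRIx32,
             pid, seq, (uint32_t)(counter & 0xffffffffu));
}

/* Returns 1 if the two names are identical, 0 otherwise. */
static int names_collide(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

For example, the counter values 0x0000000112345678 and 0x0000000212345678 both map to "shm_seg_12345678" under the first scheme, so a second create call with that name would report that the object already exists. Under the second scheme, the same two counters with process ids 4100 and 4200 yield different names.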

Changed 6 months ago by jayesh

  • cc mpich-ms@… added

Changed 6 months ago by jayesh

  • milestone changed from mpich2-1.1 to mpich2-1.1.1

Changed 6 months ago by jayesh

  • cc jeffb@… added; mpich-ms@… removed

Changed 5 months ago by jayesh

  • milestone changed from mpich2-1.1.1 to mpich2-1.1.2

This might not get done by 1.1.1, so I am moving this ticket to 1.1.2. r4638 has a possible fix for the problem, but Jeff still gets failures with 32+ cores (this could be the same problem or a different one).

-Jayesh

Changed 5 months ago by Jayesh Krishna

Added by email2trac

Changed 5 months ago by Jayesh Krishna

Jeff,
 We can use this ticket to track the failures that you see with MPICH2+nemesis
on 32+ cores (keep replying to this email instead of mpich-ms).

Regards,
Jayesh

Changed 3 months ago by balaji

  • milestone changed from mpich2-1.1.2 to mpich2-1.2

Milestone mpich2-1.1.2 deleted

Changed 3 weeks ago by jayesh

  • milestone changed from mpich2-1.2.1 to mpich2-1.3

We will revisit this bug after we integrate the async progress engine for the Windows network module (which will be merged after 1.2.1).

-Jayesh
