Ticket #690 (closed bug: fixed)

Opened 4 months ago

Last modified 4 months ago

Nemesis' MPIU_Memcpy broken

Reported by: "Rajeev Thakur" <thakur@…>
Owned by: buntinas
Priority: minor
Milestone: mpich2-1.1.1
Component: mpich2
Keywords:
Cc:

Description

I noticed that the tests with Hydra were hanging (and not timing out)
for the shared library case in the nightlies (on crunch and triumph). In
trying to reproduce the problem manually, I find the following:

* build with shared library: all tests hang

* build without shared library: all tests either hang or give this error

crunch:/sandbox/thakur/tmp/test/mpi/attr% mpiexec -n 2 attrend
Fatal error in MPI_Finalize: Other MPI error, error stack:
MPI_Finalize(307)..................: MPI_Finalize failed
MPI_Finalize(198)..................:
MPID_Finalize(92)..................:
PMPI_Barrier(476)..................: MPI_Barrier(comm=0x44000002) failed
MPIR_Barrier(82)...................:
MPIC_Sendrecv(161).................:
MPIC_Wait(405).....................:
MPIDI_CH3I_Progress(150)...........:
MPID_nem_mpich2_blocking_recv(1074):
MPID_nem_tcp_connpoll(1663)........: Communication error
Unknown signal 271

The exact command line is
/homes/thakur/cvs/mpich2/configure --enable-nemesis-dbg-localoddeven
--disable-spawn --with-device=ch3:nemesis --enable-sharedlibs=gvv
--with-pm=hydra --prefix=/sandbox/thakur/mpich2-install

On 64-bit systems everything runs peachy.

Change History

Changed 4 months ago by Rajeev Thakur

  • id set to 690

Changed 4 months ago by Rajeev Thakur

>  The exact command line is
>  --disable-spawn --with-device=ch3:nemesis --enable-sharedlibs=gvv

That should be gcc. I wrote that option manually.

Changed 4 months ago by thakur

  • owner set to balaji
  • priority changed from major to blocker
  • milestone set to mpich2-1.1.1

Changed 4 months ago by thakur

Disabled this test in the nightlies as it runs all day.

Changed 4 months ago by thakur

Now that I disabled the hydra test, I find that it also fails in the same way with gforker. It does not fail with mpd.

Specifically, if you build on crunch or triumph with

/home/MPI/testing/mpich2/mpich2/configure --prefix=/sandbox/thakur/cb/mpi2-inst --enable-romio --enable-cxx --enable-sharedlibs=gcc --enable-nemesis-dbg-localoddeven --with-device=ch3:nemesis --with-pm=gforker

you get this

crunch:/sandbox/thakur/cb/mpich2/examples% mpiexec -n 4 cpi
Process 0 of 4 is on crunch.mcs.anl.gov
Process 2 of 4 is on crunch.mcs.anl.gov
Process 3 of 4 is on crunch.mcs.anl.gov
Process 1 of 4 is on crunch.mcs.anl.gov
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(1301)......................: MPI_Bcast(buf=0xbffe84c8, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(998).......................:
MPIR_Bcast_scatter_ring_allgather(842):
MPIR_Bcast_binomial(187)..............:
MPIC_Send(41).........................:
MPIC_Wait(405)........................:
MPIDI_CH3I_Progress(150)..............:
MPID_nem_mpich2_blocking_recv(1074)...:
MPID_nem_tcp_connpoll(1663)...........: Communication error
Unknown signal 271

I wonder if it is something related to the MPI_Init scalability fix that was done only for MPD, because MPD works.

Surprisingly, if you build without shared libraries it seems to work (e.g., nightly tests on schwinn).

Changed 4 months ago by balaji

I can reproduce this error for both shared and static libraries. I just need these two configure arguments: --enable-fast --enable-nemesis-dbg-localoddeven.

Changed 4 months ago by balaji

  • owner changed from balaji to buntinas
  • priority changed from blocker to minor
  • summary changed from Hydra not working on 32-bit systems to Nemesis' MPIU_Memcpy broken

It looks like the problem is that Nemesis' MPIU_Memcpy is broken in some cases. In r4707, we started using Nemesis' memcpy for all copies, which revealed this problem (earlier we were directly using memcpy). I didn't check why this doesn't show up for MPD, though. In r4913, I've temporarily fallen back to using the default MPIUI_Memcpy implementation so the nightly tests can proceed until this is fixed.

Darius: can you take a look at the MPIU_Memcpy implementation?

Rajeev: can you re-enable the nightly tests for this?

I'm dropping the priority level for this since the tests pass now. Still leaving this for 1.1.1, but we can decide whether this is critical enough for that.

Changed 4 months ago by goodell

Do we have a simple regression test for this that doesn't rely on noticing whether or not hydra hung last night? We can't really play around with stuff like memcpy; we need to have a high level of confidence that it is working correctly.
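
One possible shape for such a regression test, as a rough sketch only: it exercises a copy routine over a range of sizes and offsets and checks both the copied bytes and the bytes around them. The names copy_fn and check_copy are hypothetical and not part of MPICH2. A check like this would need to be built with the same optimization settings that triggered the failure, since a problem in an inlined copy routine may only appear once the compiler optimizes around it.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical signature for the routine under test (same shape as memcpy). */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Returns the number of (offset, size) combinations that copied incorrectly. */
static int check_copy(copy_fn copy)
{
    enum { MAX = 256, PAD = 8 };
    unsigned char src[MAX + 2 * PAD], dst[MAX + 2 * PAD];
    int failures = 0;

    for (size_t off = 0; off < PAD; off++) {
        for (size_t n = 0; n <= MAX; n++) {
            /* Fresh pattern in src, sentinel value in dst. */
            for (size_t i = 0; i < sizeof src; i++) {
                src[i] = (unsigned char)(i * 31u + n);
                dst[i] = 0xAA;
            }

            copy(dst + off, src + off, n);

            /* The payload must match the source. */
            int bad = memcmp(dst + off, src + off, n) != 0;

            /* Bytes outside [off, off + n) must still hold the sentinel. */
            for (size_t i = 0; i < sizeof dst; i++)
                if ((i < off || i >= off + n) && dst[i] != 0xAA)
                    bad = 1;

            failures += bad;
        }
    }
    return failures;
}
```

Calling check_copy(memcpy) gives a baseline; passing the routine under test in its place and expecting 0 failures is the actual check.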

Changed 4 months ago by balaji

If you revert r4913, even cpi hangs or errors out for all process managers, except MPD. I don't know why it doesn't hang for MPD.

Changed 4 months ago by buntinas

  • status changed from new to closed
  • resolution set to fixed

Fixed in r4959. We were missing "memory" in the clobber list of the inline assembly for the memory copy implementations.
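
For readers unfamiliar with this kind of fix, here is a minimal sketch of a "memory" clobber in GCC-style inline assembly. It is not the MPICH2 code from r4959: the function name asm_copy and the choice of rep movsb are assumptions made for illustration, and builds for other compilers or architectures fall back to memcpy.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not the MPICH2 source): copy n bytes from src to dst. */
static inline void asm_copy(void *dst, const void *src, size_t n)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* rep movsb copies the number of bytes held in the count register from
     * the source-index register to the destination-index register and
     * updates all three, so they are declared as read-write ("+") operands.
     * The "memory" clobber tells the compiler that the asm reads and writes
     * memory that is not listed as an operand. */
    __asm__ __volatile__("rep movsb"
                         : "+D"(dst), "+S"(src), "+c"(n)
                         :
                         : "memory");
#else
    /* Fallback for other compilers and architectures. */
    memcpy(dst, src, n);
#endif
}
```

Without the "memory" clobber, the compiler may assume the asm statement touches only its register operands. It is then free to keep stale values from the destination buffer in registers, or to move loads and stores across the copy, which is the kind of failure that tends to appear only in optimized builds.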
