Ticket #694 (closed bug: fixed)

Opened 5 months ago

Last modified 4 months ago

Fedora 11, MPICH2-1.1 collective abort of all ranks, killed by signal 4

Reported by: Tim Sliwinski <tss06@…>
Owned by: buntinas
Priority: blocker
Milestone: mpich2-1.1.1
Component: mpich2
Keywords:
Cc:

Description

mpich2-maint,

I am working with a colleague to update a cluster based on mpich1 and a
much older version of Fedora. We were unable to get the Fedora 11
package to work and decided to build from source. configure (with the
--with-pm=mpd and --prefix options), make, and make install all completed
without any noticeable errors (as far as my untrained eyes could tell from
the log files), but we consistently hit the problem described below. We are
currently trying to set this up on a two-node cluster. Both nodes run
Fedora 11 and are named hydra1 and hydra3. hydra1 is the main server where
mpdboot is run.

First, when we start the mpd ring, we get a DeprecationWarning:

[administrator@hydra1 ~]$ mpdboot -n 2 -f ~/mpd.hosts
/home/administrator/mpich2-install/bin/mpdboot:56: DeprecationWarning: The popen2 module is deprecated.  Use the subprocess module.
  from popen2   import Popen4, Popen3, popen2
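
For context, the warning comes from mpdboot importing the Python 2 popen2 module, and Python's documentation points to subprocess as the replacement. Below is a minimal Python 3 sketch of a Popen4-style call, with stdout and stderr merged into one stream. The helper name run_merged is hypothetical, and this is not the actual mpdboot code.

```python
import subprocess
import sys

def run_merged(argv):
    """Run argv and return (exit_code, combined stdout+stderr text)."""
    proc = subprocess.Popen(
        argv,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout, as Popen4 did
    )
    output, _ = proc.communicate()
    return proc.returncode, output.decode()

# Example: a child process that writes to both streams.
code, text = run_merged(
    [sys.executable, "-c", "import sys; print('out'); sys.stderr.write('err')"]
)
print(code, text)
```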

mpd.hosts has the lines:

[administrator@hydra1 ~]$ cat mpd.hosts
hydra1
hydra3

But everything still seems to work from that point on:

[administrator@hydra1 ~]$ mpdtrace
hydra1
hydra3
[administrator@hydra1 ~]$ mpiexec -n 2 hostname
hydra1.<domain removed>
hydra3.<domain removed>

However, when we attempt to run any parallel code (either on a single
machine or the mpd ring using mpiexec) we get the following or something
similar:

[administrator@hydra1 ~]$ mpiexec -n 4 ~/cpi
rank 2 in job 1  hydra1.<domain removed>_54122   caused collective abort of all ranks
  exit status of rank 2: killed by signal 9
rank 0 in job 1  hydra1.<domain removed>_54122   caused collective abort of all ranks
  exit status of rank 0: killed by signal 4

We are unsure what to do about this. We have tried other programs built
for parallel execution, but the same error occurs. Running mpdboot from
hydra3 instead gives exactly the same error, with hydra3 replacing hydra1
in the output above.

The executables are run from mpich2-install/bin in a home directory shared
via NFS, and the path variables are defined on both machines.

Attached are the c.txt, m.txt, and mi.txt for your review.

Thanks for your help in advance!!!

-Timothy Sliwinski
Florida State University Student


Attachments

c.txt (89.1 KB) - added by Tim Sliwinski 5 months ago.
Added by email2trac
m.txt (1.7 MB) - added by Tim Sliwinski 5 months ago.
Added by email2trac
mi.txt (83.1 KB) - added by Tim Sliwinski 5 months ago.
Added by email2trac
config.log (58.6 KB) - added by Timothy S. Sliwinski 5 months ago.
Added by email2trac
config.status (39.3 KB) - added by Timothy S. Sliwinski 5 months ago.
Added by email2trac
cpuinfo (412 bytes) - added by Timothy S. Sliwinski 5 months ago.
Added by email2trac

Change History

Changed 5 months ago by Tim Sliwinski

Added by email2trac

Changed 5 months ago by Tim Sliwinski

  • id set to 694

This message has 3 attachment(s)

Changed 5 months ago by Rajeev Thakur

Someone else reported this problem recently with Fedora 11 (signal 4, illegal
instruction). We need to look into it.
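
The signal numbers in the mpiexec output can be mapped to their symbolic names. Below is a minimal Python sketch, assuming Linux signal numbering (where 4 is SIGILL and 9 is SIGKILL). The helper name describe_signal is hypothetical.

```python
import signal

def describe_signal(signum):
    """Return the symbolic name for a signal number, or a fallback string."""
    try:
        return signal.Signals(signum).name
    except ValueError:
        # Number not known to this platform's signal table.
        return "unknown signal %d" % signum

for num in (4, 9):
    print(num, describe_signal(num))
```

Here rank 0 is the one that died with signal 4 (illegal instruction). Rank 2's signal 9 would be consistent with it being killed as part of the collective abort, though the ticket does not confirm that.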

In the meantime, try configuring with --with-device=ch3:ssm, then run make clean,
make, and make install.

If that also gives the same error, let us know. In that case, configure with
--with-device=ch3:sock. It should work.

Changed 5 months ago by Timothy S. Sliwinski

Rajeev,

I attempted to configure with --with-device=ch3:ssm and the problem
occurred again with the same error. However, --with-device=ch3:sock works
perfectly now.

Thanks for your help!

-Timothy Sliwinski


-----Original Message-----
From: owner-mpich2-bugs@mcs.anl.gov [mailto:owner-mpich2-bugs@mcs.anl.gov]
On Behalf Of mpich2
Sent: Thursday, July 02, 2009 3:28 PM
To: undisclosed-recipients:
Subject: Re: [mpich2-maint] #694: Fedora 11, MPICH2-1.1 collective abort of
all ranks, killed by signal 4

Changed 5 months ago by Rajeev Thakur

> I attempted to configure with --with-device=ch3:ssm and the
> problem did again occur with the same error code. However,
> --with-device=ch3:sock works perfectly now.

Thanks for letting us know. Something strange is happening with Fedora 11.
Fedora 10 works; everything else works. We will look into
it.

Rajeev

Changed 5 months ago by balaji

  • owner set to buntinas
  • priority changed from major to blocker
  • summary changed from Fedora 11, MPICH2-1.1 collective abort of all ranks, killed by signal 4 to Fedora 11, MPICH2-1.1 collective abort of all ranks, killed by signal 4
  • milestone set to mpich2-1.1.1

It looks like the atomic operations are in some way broken on Fedora 11. Can you send us your /proc/cpuinfo information as well as the src/openpa/config* files?

Bumping the priority level up for this one.

Changed 5 months ago by Timothy S. Sliwinski

Added by email2trac

Changed 5 months ago by Timothy S. Sliwinski

Balaji,

Attached are the files you requested. If you need anything else, I'll be
sure to get it to you as soon as I can.

-Timothy Sliwinski

-----Original Message-----
From: owner-mpich2-bugs@mcs.anl.gov [mailto:owner-mpich2-bugs@mcs.anl.gov]
On Behalf Of mpich2
Sent: Friday, July 03, 2009 12:36 PM
To: undisclosed-recipients:
Subject: Re: [mpich2-maint] #694: Fedora 11, MPICH2-1.1 collective abort of
all ranks, killed by signal 4

Changed 5 months ago by buntinas

Hi Tim,

I haven't been able to reproduce this on my x86_64 Fedora 11 machine, so we'll have to ask you to do some more tests for us.

From the build directory, where you ran configure, can you do this:

  cd src/openpa
  make check

And let us know if any of the tests don't pass?

Also, I just want to verify that you said this is reproducible with a single machine (i.e. when mpdtrace shows only one machine).

Thanks,
-d

Changed 4 months ago by buntinas

  • status changed from new to assigned

I found a Pentium 3 machine to try this on and I was able to reproduce the error. We get an illegal instruction on "mfence".

Pentium 3 doesn't support mfence or lfence. In configure we do test for supported architectures, but those tests only check that the code compiles and links, not that it runs (compile-and-link-only tests make cross compiling considerably less complicated), so we don't detect this. We'll have to add run tests to check for such things.
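
As an aside, mfence and lfence were introduced with SSE2, while the Pentium III implements only SSE, so the "flags" line of /proc/cpuinfo gives a quick hint. Below is a minimal Python sketch that uses the sse2 flag as a proxy. The function names and the sample text are illustrative assumptions, and the sample is not taken from the cpuinfo file attached to this ticket.

```python
def cpu_flags(cpuinfo_text):
    """Collect the feature flags from the 'flags' lines of /proc/cpuinfo text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags

def supports_mfence(cpuinfo_text):
    # Assumption: presence of the "sse2" flag is used as a proxy for
    # mfence/lfence support.
    return "sse2" in cpu_flags(cpuinfo_text)

# Illustrative Pentium III style entry (SSE but no SSE2); hypothetical sample.
SAMPLE_P3 = (
    "model name\t: Pentium III (Coppermine)\n"
    "flags\t\t: fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov "
    "pat pse36 mmx fxsr sse\n"
)

print("mfence/lfence expected to work:", supports_mfence(SAMPLE_P3))
```

On a real machine the text would come from reading /proc/cpuinfo, for example supports_mfence(open("/proc/cpuinfo").read()).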

We're working on a fix for this.

-d

Changed 4 months ago by Tim Sliwinski

buntinas,

I have run make check for openpa (cd src/openpa, then make check) and it finished
with 1 of 3 tests failing. Both sanity and test_primitives passed; however,
test_barriers failed with this output:

Testing memory barrier sanity           /bin/sh: line 4:  7297 Illegal instruction     ${dir}$tst
FAIL: test_barriers
===================================================================
1 of 3 tests failed
Please report to https://trac.mcs.anl.gov/projects/openpa/newticket
===================================================================
make[2]: *** [check-TESTS] Error 1
make[2]: Leaving directory `/tmp/administrator/mpich2/src/openpa/test'
make[1]: *** [check-am] Error 2
make[1]: Leaving directory `/tmp/administrator/mpich2/src/openpa/test'
make: *** [check-recursive] Error 1
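
The "Illegal instruction" message above is the shell reporting that the test binary was killed by SIGILL. A run test in that spirit can be sketched in Python as below: it runs a probe program and reports the signal that killed it, if any. The child process here only simulates the crash by sending itself SIGILL. The names are hypothetical, and this is not how openpa's test suite or configure is actually implemented.

```python
import signal
import subprocess
import sys

# Stand-in for a test binary that hits an unsupported instruction:
# the child process sends itself SIGILL.
CRASHING_CHILD = [
    sys.executable, "-c",
    "import os, signal; os.kill(os.getpid(), signal.SIGILL)",
]

def fatal_signal(argv):
    """Run argv; return the name of the signal that killed it, or None."""
    result = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    if result.returncode < 0:  # POSIX: negative value means killed by a signal
        return signal.Signals(-result.returncode).name
    return None

print(fatal_signal(CRASHING_CHILD))
```

A compile-and-link check would pass for such a program, since the failure only shows up when it is executed on the target CPU.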

Also, I can confirm that the initial problem occurred both with only one machine
in the mpdtrace output and with multiple machines.

Hope this helps!

-Timothy Sliwinski


----- Original Message -----
From: mpich2 <mpich2-maint@mcs.anl.gov>
Date: Wednesday, July 8, 2009 5:53 pm
Subject: Re: [mpich2-maint] #694: Fedora 11, MPICH2-1.1 collective abort of all
ranks, killed by  signal 4
To: undisclosed-recipients: ;

Changed 4 months ago by goodell

(minor housekeeping)

This is tracked on the openpa side at: https://trac.mcs.anl.gov/projects/openpa/ticket/10

Changed 4 months ago by Tim Sliwinski

buntinas,

Thanks for the update! Let me know if you'd like me to test here anything that
seems to work on your end. Currently, three computers are running with the
current "fix", and we look forward to replacing it with a stable version.

-Timothy Sliwinski

----- Original Message -----
From: mpich2 <mpich2-maint@mcs.anl.gov>
Date: Thursday, July 9, 2009 5:24 pm
Subject: Re: [mpich2-maint] #694: Fedora 11, MPICH2-1.1 collective abort of all
ranks, killed by  signal 4
To: undisclosed-recipients: ;

Changed 4 months ago by buntinas

  • status changed from assigned to closed
  • resolution set to fixed

This has been fixed in openpa revision 81 and in mpich2 r4945.

-d
