Ticket #902 (closed bug: fixed)

Opened 5 weeks ago

Last modified 2 weeks ago

mpd PM post-execution error: close failed in file object destructor

Reported by: Deji Akingunola <dakingun@…> Owned by: goodell
Priority: major Milestone: mpich2-1.2.1
Component: mpich2 Keywords:
Cc:

Description (last modified by balaji) (diff)

I've been testing MPICH2-1.2 for the upcoming Fedora 12 release and have noticed that running an MPI job with the mpd process manager always produces an error after execution. mpiexec.hydra doesn't produce the error.

[deji@rhema ~]$ mpiexec -np 2 ./cpi
Process 1 of 2 is on rhema
Process 0 of 2 is on rhema
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.000297
close failed in file object destructor:
Error in sys.excepthook:
Traceback (most recent call last):
  File "/usr/bin/mpdlib.py", line 162, in mpd_uncaught_except_tb
    tb = extract_tb(arg3)
TypeError: 'NoneType' object is not callable

Original exception was:
IOError: [Errno 9] Bad file descriptor
close failed in file object destructor:
Error in sys.excepthook:
Traceback (most recent call last):
  File "/usr/bin/mpdlib.py", line 162, in mpd_uncaught_except_tb
    tb = extract_tb(arg3)
TypeError: 'NoneType' object is not callable

Original exception was:
IOError: [Errno 9] Bad file descriptor

Change History

  Changed 4 weeks ago by balaji

  • owner set to goodell
  • status changed from new to assigned
  • description modified (diff)
  • milestone set to mpich2-1.2.1

Cleaned up the description a bit and assigned to Dave. Set milestone to 1.2.1 as well.

  Changed 3 weeks ago by Deji Akingunola <dakingun@…>

I think this issue is closely related to the one described at http://bugs.python.org/issue7111 . It's happening because in this instance I brought up the mpd ring by running 'mpd &'. If I drop the re-direction to null (the '&'), then mpiexec runs correctly without the errors and tracebacks as above.

follow-up: ↓ 4   Changed 3 weeks ago by goodell

  • status changed from assigned to accepted

Hmm... issue7111 might be the problem, but I suspect something more like this is happening: http://bugs.python.org/issue4192

We probably need to put a call in to either runner.wait() or runner.communicate(), but I'll have to read up a little bit more on this before we do anything.

What would really help is if I can get an environment setup that can reproduce the problem. What version of python are you using?

in reply to: ↑ 3   Changed 3 weeks ago by Deji Akingunola <dakingun@…>

Replying to goodell:

What would really help is if I can get an environment setup that can reproduce the problem. What version of python are you using?

I'm using python-2.6.4 . It seems python <= 2.6 doesn't exhibit this issue, which made me think the change causing this is the one at http://bugs.python.org/issue5313

  Changed 3 weeks ago by Deji Akingunola <dakingun@…>

I'm sorry, I'm actually using python-2.6.2

  Changed 3 weeks ago by goodell

issue5313 does indeed look to be the most promising possibility so far. I'll take a closer look and try harder to reproduce this later today.

  Changed 3 weeks ago by goodell

Found it. We do indeed have a variation on issue5313. The mpdman is closing all file descriptors greater than 2 via os.close(fd). This breaks on fc12 because the fc12 python interpreter opens a file descriptor to /etc/abrt/pyhook.conf via a higher-level file object, which then has its fd closed out from under it. When that higher-level file object is garbage collected at mpdman exit time, its attempt to close the already-closed fd results in an IOError(EBADF).
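
For anyone trying to reproduce the mechanism outside of mpd, here is a minimal sketch of it. It is not the mpdman code: the function name and the temporary file are made up for illustration, and it is written for Python 3, where IOError is an alias of OSError. It closes the descriptor behind a file object with os.close() and then reports the errno raised when the file object itself is closed.

```python
import errno
import os
import tempfile


def close_after_fd_closed():
    """Return the errno raised when a file object is closed after its
    descriptor was already closed underneath it (hypothetical helper)."""
    # Stand-in for a file the interpreter opened on its own behalf.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        path = tmp.name
    try:
        # Unbuffered, so close() acts directly on the descriptor.
        f = open(path, "rb", buffering=0)
        # What a blanket "close every fd greater than 2" loop does.
        os.close(f.fileno())
        try:
            # The file object still believes it owns the descriptor.
            f.close()
        except OSError as exc:
            return exc.errno
        return None
    finally:
        os.unlink(path)


if __name__ == "__main__":
    print(close_after_fd_closed() == errno.EBADF)
```

In the ticket's case the failing close happens implicitly when the file object is destroyed at exit rather than through an explicit f.close(), which is why the message reads "close failed in file object destructor".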

  Changed 2 weeks ago by goodell

  • status changed from accepted to closed
  • resolution set to fixed

Fixed and/or worked-around in r5720. There might be something smarter to do here, but for now we just won't close the extra file descriptors.
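
One possible "smarter" approach, offered only as a sketch and not as what r5720 does: rather than closing inherited descriptors inside the Python process, mark them close-on-exec so that they stay valid for the interpreter's own file objects and are dropped only when a child program is exec'ed. The function name and the default descriptor range below are assumptions for illustration. The code targets Python 3 on a POSIX system (on Python 2, fcntl raises IOError instead).

```python
import fcntl
import os


def mark_close_on_exec(min_fd=3, max_fd=256):
    """Set FD_CLOEXEC on every open descriptor in [min_fd, max_fd).

    Hypothetical helper. Returns the list of descriptors that were marked.
    """
    marked = []
    for fd in range(min_fd, max_fd):
        try:
            flags = fcntl.fcntl(fd, fcntl.F_GETFD)
        except OSError:
            continue  # this descriptor number is not open
        fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
        marked.append(fd)
    return marked
```

Because nothing is closed in the current process, a file object that wraps one of these descriptors can still close it normally when it is garbage collected.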
