Ticket #83 (new bug)

Opened 8 months ago

server assert() on shutdown

Reported by: carns
Owned by: slang
Priority: major
Component: SERVER
Version: HEAD
Keywords: assert shutdown server
Cc: harms@…

Description

Got this stack trace from a core file on intrepid:

(gdb) where
#0  0x00002ac734f9bb95 in raise () from /lib64/libc.so.6
#1  0x00002ac734f9cf90 in abort () from /lib64/libc.so.6
#2  0x00002ac734f95256 in __assert_fail () from /lib64/libc.so.6
#3  0x0000000000430f3c in job_desc_q_remove ()
#4  0x000000000042d131 in job_bmi_unexp_cancel ()
#5  0x00000000004128ed in server_sig_handler ()
#6  <signal handler called>
#7  0x00002ac734fd7435 in malloc_consolidate () from /lib64/libc.so.6
#8  0x00002ac734fd8813 in _int_free () from /lib64/libc.so.6
#9  0x00002ac734fd895c in free () from /lib64/libc.so.6
#10 0x00002ac734bd5da7 in __db_c_destroy () from /usr/lib64/libdb-4.3.so
#11 0x00002ac734bcfcd8 in __db_refresh () from /usr/lib64/libdb-4.3.so
#12 0x00002ac734bd02ce in __db_close () from /usr/lib64/libdb-4.3.so
#13 0x00002ac734bdf3e0 in __db_close_pp () from /usr/lib64/libdb-4.3.so
#14 0x0000000000466cc3 in db_close ()
#15 0x0000000000468849 in dbpf_collection_clear ()
#16 0x0000000000412bae in server_shutdown ()
#17 0x00000000004133ae in main ()

My theory is that the server received one signal and was partway through a graceful shutdown (in server_shutdown()) when a second signal arrived. The signal handler itself triggers cleanup of the unexpected BMI jobs (rather than leaving that to the shutdown function), so the cleanup was triggered again when it wasn't needed.

The HA shutdown script tries a few kills (separated by some period of time) before resorting to kill -9, so it may have managed to signal the server twice before it got fully shut down.
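
One possible way to avoid the double cleanup, sketched here under assumptions rather than taken from the actual server code: have the signal handler do nothing but set a flag, and let the normal shutdown path perform the job cleanup exactly once. All names below (shutdown_requested, cancel_unexpected_jobs(), shutdown_once(), install_shutdown_handlers()) are hypothetical stand-ins, and cancel_unexpected_jobs() only counts calls in place of the real job_bmi_unexp_cancel() work.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <signal.h>
#include <string.h>

/* Set by the handler, polled by the main loop. A volatile sig_atomic_t
 * is the object type that is safe to write from a signal handler. */
static volatile sig_atomic_t shutdown_requested = 0;

/* Hypothetical stand-in for the unexpected-job cleanup. It only counts
 * calls so that a second invocation can be detected. */
static int cleanup_calls = 0;

void cancel_unexpected_jobs(void)
{
    /* Invariant for this sketch: the job queue is torn down only once. */
    assert(cleanup_calls == 0);
    cleanup_calls++;
}

/* The handler does no cleanup itself. A repeated signal just stores
 * the same value again, so it is harmless. */
void shutdown_sig_handler(int signum)
{
    (void)signum;
    shutdown_requested = 1;
}

/* Install the handler for SIGTERM and SIGINT. Both signals are blocked
 * while the handler runs. Returns 0 on success, -1 on failure. */
int install_shutdown_handlers(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = shutdown_sig_handler;
    sigemptyset(&sa.sa_mask);
    sigaddset(&sa.sa_mask, SIGTERM);
    sigaddset(&sa.sa_mask, SIGINT);

    if (sigaction(SIGTERM, &sa, NULL) != 0)
        return -1;
    if (sigaction(SIGINT, &sa, NULL) != 0)
        return -1;
    return 0;
}

/* Called from the main loop (normal context, not from the handler).
 * Returns 1 if this call performed the cleanup, 0 if no shutdown was
 * requested or the cleanup has already been done. */
int shutdown_once(void)
{
    static int done = 0;

    if (!shutdown_requested || done)
        return 0;
    done = 1;
    cancel_unexpected_jobs();
    return 1;
}
```

With this arrangement a second kill from the HA script arriving mid-shutdown would only set the flag again. The cleanup would also no longer run inside a handler that may have interrupted free(), as frames #6 through #9 of the trace show happened here.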
