Ticket #106 (new bug)

Opened 6 months ago

Segmentation Fault with trove debugging enabled

Reported by: harms
Owned by: slang
Priority: major
Component: SERVER
Version: 2.8.1
Keywords:
Cc:

Description

Hey everyone -

Nick and I have been digging into trove and have found a bug. When trove debugging is enabled (via the "trove" flag in the config file), the server crashes during I/O (namely pvfs2-cp).

It sometimes runs for a few seconds before crashing, but it segfaults consistently every time I try to transfer a 256 MB file onto or off of the server. I tested release 2.8.1 on both 32-bit RHEL5 and 64-bit Fedora 10. Only one server was running, acting as both the metadata and the I/O server.

I believe it has something to do with threading, since it happens while printing a status message (I'm fairly certain the call to gossip_debug() on line 327 of dbpf-bstream.c is the culprit). Here is the last bit of the log file and the stack trace from gdb on 32-bit RHEL5:

[D 05/26 13:39] aio_progress_notification: BSTREAM_READ_LIST complete: aio_return() says 262144 [fd = 11]
[D 05/26 13:39] *** starting delayed ops if any (state is LIST_PROC_ALLPOSTED)
[D 05/26 13:39] DBPF I/O ops in progress: 1
[New Thread 0xb56a0b90 (LWP 1272)]
[Thread 0xb2cfeb90 (LWP 1271) exited]
[D 05/26 13:39] issue_or_delay_io_operation: lio_listio posted 0xa0d0ec8 (handle 9223372036854775805, ret 0)
[D 05/26 13:39] --- aio_progress_notification called with handle 9223372036854775805 (0xa0d0ec8)
[D 05/26 13:39] aio_progress_notification: BSTREAM_READ_LIST complete: aio_return() says 262144 [fd = 11]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xb56a0b90 (LWP 1272)]
0x00c04993 in strlen () from /lib/libc.so.6
(gdb) bt
#0 0x00c04993 in strlen () from /lib/libc.so.6
#1 0x00bd4bce in vfprintf () from /lib/libc.so.6
#2 0x00bf53b4 in vsnprintf () from /lib/libc.so.6
#3 0x08059bed in gossip_debug_fp_va (fp=0xb569fb5c,
    prefix=<value optimized out>, format=0xb569fc80 "*** starting delayed ops if any (state is ST
complete: aio_return() says 262144 [fd = 11]\n", ap=0xb56a00d0 "t: hpz\016\b", ts=13455348)
    at src/common/gossip/gossip.c:506
#4 0x0805a041 in gossip_debug (mask=65536, prefix=63 '?',
    format=0x80dc3b0 "*** starting delayed ops if any (state is %s)\n") at src/common/gossip/gossip.c:281
#5 0x080a9ed9 in aio_progress_notification (sig=
    {sival_int = 168627912, sival_ptr = 0xa0d0ec8})
    at src/io/trove/trove-dbpf/dbpf-bstream.c:237
#6 0x080ba89c in alt_lio_thread (foo=0xa0d0ce8)
    at src/io/trove/trove-dbpf/dbpf-alt-aio.c:275
#7 0x00d0f49b in start_thread () from /lib/libpthread.so.0
#8 0x00c6642e in clone () from /lib/libc.so.6
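The crash is in strlen() called from vsnprintf(), which suggests that the char * passed for the "%s" in the gossip_debug() format from aio_progress_notification() was not a valid string. That fits the threading theory: the notification appears to run on a thread started via alt_lio_thread (note the New Thread / exited messages in the log). As a hedged illustration only, the C sketch below shows two defensive measures one might try: a state-to-name lookup that never indexes past its table, and copying the state under a lock before formatting. The enum, the string table, struct io_op, and both helper functions are hypothetical; apart from LIST_PROC_ALLPOSTED (taken from the log above), none of these names are claimed to match the PVFS2 source.

```c
#include <pthread.h>
#include <stdio.h>

/* Hypothetical list-processing states. Only LIST_PROC_ALLPOSTED appears
 * in the log above; the other names and their order are assumptions. */
enum list_proc_state {
    LIST_PROC_INITIALIZED = 0,
    LIST_PROC_INPROGRESS,
    LIST_PROC_ALLCONVERTED,
    LIST_PROC_ALLPOSTED,
    LIST_PROC_NSTATES /* count of valid states, not a real state */
};

static const char *const list_proc_state_names[LIST_PROC_NSTATES] = {
    "LIST_PROC_INITIALIZED",
    "LIST_PROC_INPROGRESS",
    "LIST_PROC_ALLCONVERTED",
    "LIST_PROC_ALLPOSTED",
};

/* Map a state to its name without ever indexing past the table, so a
 * corrupted or stale value yields "UNKNOWN" instead of a wild pointer
 * that strlen() would dereference inside vsnprintf(). */
static const char *list_proc_state_name(int state)
{
    if (state < 0 || state >= LIST_PROC_NSTATES)
        return "UNKNOWN";
    return list_proc_state_names[state];
}

/* Hypothetical op record shared by the notification thread and the
 * thread that posts and retires I/O operations. */
struct io_op {
    pthread_mutex_t lock;
    int state;
};

/* Copy the state while holding the lock, then format the message after
 * releasing it, so the debug print never reads a half-updated state. */
static int format_delayed_ops_msg(struct io_op *op, char *buf, size_t len)
{
    int state;

    pthread_mutex_lock(&op->lock);
    state = op->state;
    pthread_mutex_unlock(&op->lock);

    return snprintf(buf, len,
                    "*** starting delayed ops if any (state is %s)\n",
                    list_proc_state_name(state));
}
```

This would only stop a bad value from reaching vsnprintf(); if the operation is freed by another thread while the notification is still running, the underlying race would still need to be fixed.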

Thanks,
- Dave
