Ticket #26 (closed defect: duplicate)
compute node freezes by cqdel
| Reported by: | kazutomo | Owned by: | kazutomo |
|---|---|---|---|
| Priority: | major | Milestone: | |
| Component: | ZeptoOS | Version: | |
| Keywords: | | Cc: | |
Description
Some compute nodes freeze (not all of them; for example, about 10 out of 64 nodes) when a job is killed by cqdel.
It seems that the kernel stops at spin_lock_irq(), which is called by get_signal_to_deliver().
Attachments
Change History
comment:1 Changed 15 years ago by kazutomo
- Owner changed from [email protected]… to kazutomo
- Priority changed from critical to major
- Status changed from new to assigned
Things change when I add debug messages. I still couldn't find the root cause, but I found that the BGP IPI mechanism sometimes gets messed up. Here is the trace.
Oops: kernel access of bad area, sig: 11 #2
SMP NR_CPUS=4
NIP: 835F5DC4 LR: 80005220 CTR: 835F5DC0
REGS: 8272dc10 TRAP: 0300 Not tainted (2.6.19.2)
MSR: 00021000 <ME> CR: 22008442 XER: 00000000
DAR: 00006133, DSISR: 00000000
signal.c(1984) cnt=16 pid=147
TASK = 80db8980[143] 'nsdperf' THREAD: 8272c000 CPU: 3
GPR00: 835F5DB0 8272DCC0 80DB8980 8000538C 00000000 00000007 FFFFC000 00000000
GPR08: 802EBA01 835F5D90 FDFF0000 00000000 80DB8B60 10028320 7E09F97C 100C0000
GPR16: 00000000 10030B48 10030B40 10004090 3002F018 00000001 33C50000 8272DE68
GPR24: 8272DF50 80240000 302C2031 00000000 00000000 00000001 00000000 80240000
NIP [835F5DC4] 0x835f5dc4
LR [80005220] smp_call_function_interrupt+0x40/0x74
signal.c(1861) cnt=20 pid=140
Call Trace:
[8272DCC0] [00000003] 0x3 (unreliable)
[8272DCE0] [80012B8C] bluegene_ipi_call_function+0x28/0x3c
[8272DCF0] [80045B80] handle_IRQ_event+0x54/0xa0
[8272DD10] [80045CB0] do_IRQ+0xe4/0x164
[8272DD40] [80006BC8] do_IRQ+0x100/0x104
signal.c(1984) cnt=17 pid=140
[8272DD60] [80002408] ret_from_except+0x0/0x18
[8272DE20] [8002FA00] get_signal_to_deliver+0x3d8/0x50c
[8272DE60] [80007D58] do_signal+0x44/0x664
[8272DF40] [80002834] do_user_signal+0x74/0xc4
Instruction dump:
signal.c(1857) cnt=22 pid=142
signal.c(1857) cnt=21 pid=141
kernel panic Aiee, killing interrupt handler!
signal.c(1861) cnt=21 pid=142
stop_this_cpu() cpuid=1
signal.c(1984) cnt=18 pid=142
bluegene_halt()
stop_this_cpu() cpuid=0
Kernel panic - not syncing: Aiee, killing interrupt handler!
smp_call_function on cpu 1: other cpus not responding (2) <- from smp_call_function()
IPI cpu 0 no response (me=1). Retrying. <- flush_tlb_others()??
Oops: kernel access of bad area, sig: 11 #2
# This problem does not happen with regular MPI tasks, so I changed the priority.
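The added debug messages are interleaved with the oops output in the trace above, which makes both hard to read. A minimal Python sketch for pulling them apart could look like the following. It assumes the debug messages always have the form `signal.c(<line>) cnt=<n> pid=<pid>`, as seen in the trace; the ticket does not say what `cnt` counts. The names `split_console` and `events_by_pid` are hypothetical helpers, not part of ZeptoOS.

```python
import re
from collections import defaultdict

# Ad-hoc debug message format inferred from the trace, e.g.
# "signal.c(1984) cnt=16 pid=147". The field meanings are assumptions.
DEBUG_RE = re.compile(r"signal\.c\((\d+)\) cnt=(\d+) pid=(\d+)")


def split_console(text):
    """Return (debug_events, remaining_text) for a chunk of console output."""
    events = [
        {"line": int(m.group(1)), "cnt": int(m.group(2)), "pid": int(m.group(3))}
        for m in DEBUG_RE.finditer(text)
    ]
    # Drop the debug messages, then squeeze the doubled spaces they leave behind.
    rest = re.sub(r"[ \t]{2,}", " ", DEBUG_RE.sub("", text)).strip()
    return events, rest


def events_by_pid(events):
    """Group (source line, cnt) pairs by pid, keeping console order."""
    grouped = defaultdict(list)
    for ev in events:
        grouped[ev["pid"]].append((ev["line"], ev["cnt"]))
    return dict(grouped)


# Short excerpt copied from the trace in this ticket.
SAMPLE = (
    "DAR: 00006133, DSISR: 00000000 signal.c(1984) cnt=16 pid=147 "
    "TASK = 80db8980[143] 'nsdperf' THREAD: 8272c000 CPU: 3 "
    "LR [80005220] smp_call_function_interrupt+0x40/0x74 "
    "signal.c(1861) cnt=20 pid=140 Call Trace: "
    "[8272DD40] [80006BC8] do_IRQ+0x100/0x104 signal.c(1984) cnt=17 pid=140 "
    "[8272DD60] [80002408] ret_from_except+0x0/0x18"
)

if __name__ == "__main__":
    events, rest = split_console(SAMPLE)
    for pid, items in sorted(events_by_pid(events).items()):
        print(pid, items)
    print(rest)
```

Grouping by pid makes it easier to see which signal.c lines each task reached around the time of the oops, while the remaining text is the oops output without the debug noise.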