Ticket #26 (closed defect: duplicate)

Opened 15 years ago

Last modified 14 years ago

compute node freezes by cqdel

Reported by: kazutomo Owned by: kazutomo
Priority: major Milestone:
Component: ZeptoOS Version:
Keywords: Cc:

Description

Some compute nodes freeze (not all; for example, roughly 10 out of 64 nodes) when a job is killed by cqdel.

It seems that the kernel stops at spin_lock_irq(), which is called by get_signal_to_deliver().

Attachments

tractmp.txt (1.7 KB) - added by kazutomo 15 years ago.

Change History

comment:1 Changed 15 years ago by kazutomo

  • Owner changed from [email protected] to kazutomo
  • Priority changed from critical to major
  • Status changed from new to assigned

The behavior changes when I add debug messages. I still couldn't find the root cause, but I found that the BGP IPI mechanism sometimes gets messed up. Here is the trace.

Oops: kernel access of bad area, sig: 11 #2
SMP NR_CPUS=4
NIP: 835F5DC4 LR: 80005220 CTR: 835F5DC0
REGS: 8272dc10 TRAP: 0300 Not tainted (2.6.19.2)
MSR: 00021000 <ME> CR: 22008442 XER: 00000000
DAR: 00006133, DSISR: 00000000
signal.c(1984) cnt=16 pid=147
TASK = 80db8980[143] 'nsdperf' THREAD: 8272c000 CPU: 3
GPR00: 835F5DB0 8272DCC0 80DB8980 8000538C 00000000 00000007 FFFFC000 00000000
GPR08: 802EBA01 835F5D90 FDFF0000 00000000 80DB8B60 10028320 7E09F97C 100C0000
GPR16: 00000000 10030B48 10030B40 10004090 3002F018 00000001 33C50000 8272DE68
GPR24: 8272DF50 80240000 302C2031 00000000 00000000 00000001 00000000 80240000
NIP [835F5DC4] 0x835f5dc4
LR [80005220] smp_call_function_interrupt+0x40/0x74
signal.c(1861) cnt=20 pid=140
Call Trace:
[8272DCC0] [00000003] 0x3 (unreliable)
[8272DCE0] [80012B8C] bluegene_ipi_call_function+0x28/0x3c
[8272DCF0] [80045B80] handle_IRQ_event+0x54/0xa0
[8272DD10] [80045CB0] do_IRQ+0xe4/0x164
[8272DD40] [80006BC8] do_IRQ+0x100/0x104
signal.c(1984) cnt=17 pid=140
[8272DD60] [80002408] ret_from_except+0x0/0x18
[8272DE20] [8002FA00] get_signal_to_deliver+0x3d8/0x50c
[8272DE60] [80007D58] do_signal+0x44/0x664
[8272DF40] [80002834] do_user_signal+0x74/0xc4
Instruction dump:
signal.c(1857) cnt=22 pid=142
signal.c(1857) cnt=21 pid=141
kernel panic Aiee, killing interrupt handler!
signal.c(1861) cnt=21 pid=142
stop_this_cpu() cpuid=1
signal.c(1984) cnt=18 pid=142
bluegene_halt()
stop_this_cpu() cpuid=0
Kernel panic - not syncing: Aiee, killing interrupt handler!

smp_call_function on cpu 1: other cpus not responding (2) <- from smpcallfunction

IPI cpu 0 no response (me=1). Retrying. <- flush_tlb_others()??

Oops: kernel access of bad area, sig: 11 #2
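
The added debug prints (signal.c(line) cnt=N pid=N) come from several processes and are interleaved with the Oops output, which makes the trace hard to read. A minimal sketch for separating the two, assuming every debug print has exactly that form; the helper name split_debug_prints is hypothetical and not part of ZeptoOS:

```python
import re

# Assumed shape of the debug prints seen in the trace:
#   signal.c(<source line>) cnt=<counter> pid=<pid>
DEBUG_RE = re.compile(r"signal\.c\((\d+)\) cnt=(\d+) pid=(\d+)")


def split_debug_prints(log):
    """Return (events, oops_text) for a raw console log string.

    events    -- list of (source_line, cnt, pid) tuples in log order
    oops_text -- the log with the debug prints removed and
                 whitespace collapsed to single spaces
    """
    events = [(int(line), int(cnt), int(pid))
              for line, cnt, pid in DEBUG_RE.findall(log)]
    oops_text = " ".join(DEBUG_RE.sub(" ", log).split())
    return events, oops_text


# Fragment copied from the trace above.
sample = ("DAR: 00006133, DSISR: 00000000 signal.c(1984) cnt=16 pid=147 "
          "TASK = 80db8980[143] 'nsdperf' THREAD: 8272c000 CPU: 3")
events, oops_text = split_debug_prints(sample)
```

Grouping the resulting events by pid shows which signal.c line each process last reported before the panic.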

# This problem does not happen with regular MPI tasks, so I changed the priority.

Changed 15 years ago by kazutomo

comment:2 Changed 15 years ago by kazutomo

  • Milestone set to Release before SC08

comment:3 Changed 14 years ago by anonymous

  • Milestone 0 V1R3 release deleted

comment:4 Changed 14 years ago by kazutomo

  • Status changed from assigned to closed
  • Resolution set to duplicate