May 2010 Archives

crock reboot again

Crock needed to reboot again because the dom0 kernel hung, with the same error:
BUG: soft lockup detected on CPU#0!

Call Trace:
 <IRQ> [<ffffffff8025758a>] softlockup_tick+0xce/0xe0
 [<ffffffff8020df48>] timer_interrupt+0x3a0/0x3fa
 [<ffffffff80257874>] handle_IRQ_event+0x4e/0x96
 [<ffffffff80257960>] __do_IRQ+0xa4/0x105
 [<ffffffff8020bd5c>] do_IRQ+0x44/0x4d
 [<ffffffff8034c980>] evtchn_do_upcall+0x19e/0x250
 [<ffffffff80209d8e>] do_hypervisor_callback+0x1e/0x2c
 <EOI> [<ffffffff803581ea>] show_rd_sect+0x0/0x68
 [<ffffffff802ebbf9>] __read_lock_failed+0x5/0x14
 [<ffffffff80343f3e>] get_device+0x17/0x20
 [<ffffffff803fc3fd>] .text.lock.spinlock+0x53/0x8a
 [<ffffffff80358211>] show_rd_sect+0x27/0x68
 [<ffffffff802bc351>] sysfs_read_file+0xa5/0x12e
 [<ffffffff8027e3f5>] vfs_read+0xcb/0x171
 [<ffffffff8027e7d4>] sys_read+0x45/0x6e
 [<ffffffff802097b2>] tracesys+0xab/0xb5

So we're thinking this is a hardware problem and plan to put crock's disks into a new system that should be more stable.

Knife reboot

Knife stopped responding again (the other time was March 28) and I rebooted it from the hypervisor. We may need to move the disks to a new system.

ungraceful reboot of crock

The closest thing I have to a clue right now is
May 20 04:42:07 sysl@deathboat Buffering: S32.crock [10000 from 0000020e:9e83bdb6 to 00000000:00430076.  BUG: soft lockup detected on CPU#0!    Call Trace:    [] softlockup_tick+0xce/0xe0   [] timer_interrupt+0x3a0/0x3fa   [] handle_IRQ_event+0x4e/0x96   [] __do_IRQ+0xa4/0x105   [] do_IRQ+]
May 20 04:42:07 sysl@deathboat Buffering: S32.crock [0x44/0x4d   [] evtchn_do_upcall+0x19e/0x250   [] do_hypervisor_callback+0x1e]
May 20 04:42:07 sysl@deathboat Buffering: S32.crock [/0x2c    [] show_rd_sect+0x0/0x68   [] __read_lock_failed+0x5/0x14   [<]
May 20 04:42:07 sysl@deathboat Buffering: S32.crock [ffffffff80343f3e>] get_device+0x17/0x20   [] .text.lock.spinlock+0x53/0x8a   [] show_rd_sect+0x27/0x68   [] sysfs_read_file+0xa5/0x12e   [] vfs_read+0xc]

(This is from the syslog on our serial console server, deathboat; the carriage returns are messed up.) Anyhow, this mostly affects people who signed up in the last few days. Note, this was the last gasp of our socket F hardware; I put this up because I was having a hard time with the new G34 hardware. I have plenty of spares for the socket F stuff, so if it crashes again before I figure it out, I'll just replace it with other hardware and chuck the old stuff. I will update when I know more.
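For what it's worth, here is a rough Python sketch of how those chunked console-server lines could be stitched back together. The syslog path, and the assumption that the payload is everything between the bracket after "Buffering: S32.crock" and the final bracket on the line, are guesses based on the lines above, not how deathboat actually logs:

#!/usr/bin/env python
# Reassemble serial console output that the console server's syslog split
# into "Buffering: S32.crock [...]" chunks, losing line breaks along the way.
# Assumption: chunks arrive in order and are plain byte-stream splits, so
# straight concatenation reconstructs the original console output.
import re
import sys

CHUNK = re.compile(r'Buffering: S32\.crock \[(.*)\]\s*$')

def reassemble(lines):
    chunks = []
    for line in lines:
        m = CHUNK.search(line)
        if m:
            chunks.append(m.group(1))
    payload = ''.join(chunks)
    # Put each "[<address>] symbol+offset/length" frame on its own line
    # so the call trace is readable again.
    return re.sub(r'\s+(\[<?[0-9a-f]*>?\])', r'\n  \1', payload)

if __name__ == '__main__':
    print(reassemble(sys.stdin))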

Hamper will be slow while the rebuild finishes.

md1 : active raid1 sde2[2] sdb2[3](F) sda2[0]
      477901504 blocks [2/1] [U_]
      [>....................]  recovery =  0.9% (4778240/477901504) finish=243.6min speed=32367K/sec


I will be /extremely surprised/ if the rebuild finishes in anything like the listed estimate.   From past experience, we are probably looking at 10 hours or so of crummy performance on hamper.
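To show where that 243.6-minute figure comes from (and why I don't trust it), here is the arithmetic from the recovery line above as a small Python sketch. The 13000 K/sec rate in the last line is purely an assumption I picked to illustrate how you land near the 10-hour mark once the array is under normal load; it isn't a measurement:

#!/usr/bin/env python
# Sanity-check md's own ETA from the /proc/mdstat recovery line above.
# md reports the *instantaneous* rebuild speed, so the estimate only holds
# if the array stays as idle as it is right now.
total_kb = 477901504      # array size in 1K blocks, from "477901504 blocks"
done_kb = 4778240         # rebuilt so far, from "(4778240/477901504)"
speed_kb_s = 32367        # current speed, from "speed=32367K/sec"

remaining_kb = total_kb - done_kb
eta_min = remaining_kb / float(speed_kb_s) / 60
print("ETA at current speed: %.1f min" % eta_min)   # ~243.6, matching md

# Hypothetical: if normal traffic drags the rebuild down to ~13 MB/s,
# the same math lands around the 10-hour figure above.
print("ETA at 13000 K/sec:   %.1f min" % (remaining_kb / 13000.0 / 60))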