August 2010 Archives

crock reboot again

Crock hit the "soft lockup detected on CPU#0!" bug, so I rebooted it. Hopefully this is fixed in Xen 4, in which case we can upgrade the dom0s that have this problem.

rebooting cattle and girdle

Cattle and girdle both have failed disks in their RAID mirrors. We will need to reboot them this afternoon (in the next few hours) to replace the disks, because SATA hot swap is broken in sata_nv, the NVIDIA chipset driver. If all goes well, everything will be back up by 5 pm Pacific time and the RAID will be rebuilding, so I/O will be slow for some time.

Edit: there was a problem; we brought the wrong disk. Cattle and girdle were built back when we were using consumer-grade drives like morons, and we sized the RAID for 1.5TB disks, so the 1TB 'enterprise' disks won't work without a /whole lot/ of work. We ended up hitting Fry's and buying two more 2TB 'consumer grade' disks that we'll just short-stroke down to 1.5TB, because they didn't have any 1.5TB disks that were faster than 6000rpm. The last disks lasted north of a year; if we can get another year out of these new disks, I'll be happy.

Anyhow, cattle is coming back up as we speak;  when cattle is up we will shut down girdle and replace that drive, too. 

Horn froze up this morning. It was down for about 3 hours before I rebooted it. Unfortunately, I'm an asshole and didn't remember that horn and chariot are both in the same chassis, so by yanking power to horn, I also yanked power to chariot. So chariot also got an unclean reboot (though, as it wasn't frozen up, total downtime there was more like 15-30 minutes, depending on what order your domain is started in).

Everyone should be back up at this point.

possible dish disk failure

While the VPSes on dish were all restarting after the unclean reboot earlier today, one of the disks in the RAID started throwing a lot of SATA link errors (log below) and the load average became very high. After 1 hour, the SATA link stopped reporting errors, the Linux RAID driver started rebuilding the mirror, and the load went back to normal. We are going to run more SMART tests on the drive and may need to replace it later this week; hopefully we can also find out what was wrong with the SATA link. There should be no data loss, because the other disk in the mirror is still working well.

SCSI device sda: drive cache: write back
ata1.00: exception Emask 0x40 SAct 0x1 SErr 0x800 action 0x2
ata1.00: (irq_stat 0x40000008)
ata1.00: tag 0 cmd 0x60 Emask 0x49 stat 0x41 err 0x40 (internal error)
SCSI device sda: 1953525168 512-byte hdwr sectors (1000205 MB)
sda: Write Protect is off

BUG: soft lockup detected on CPU#0!

Call Trace:
  [] softlockup_tick+0xce/0xe0
 [] timer_interrupt+0x3a8/0x402
 [] handle_IRQ_event+0x4e/0x96
 [] __do_IRQ+0xa4/0x105
 [] do_IRQ+0x44/0x4d
 [] evtchn_do_upcall+0x19e/0x256
 [] do_hypervisor_callback+0x1e/0x2c
  [] show_rd_sect+0x0/0x68
 [] __read_lock_failed+0x8/0x14
 [] get_device+0x17/0x20
 [] .text.lock.spinlock+0x53/0x8a
 [] show_rd_sect+0x27/0x68
 [] sysfs_read_file+0xa5/0x12c
 [] vfs_read+0xcb/0x171
 [] sys_read+0x45/0x6e
 [] tracesys+0xab/0xb5

I will be tracking my debugging process here. (As of this moment, the server has been rebooted, and all domains should be back within 10 minutes or so.)

Everyone ought to be back up now; please complain to support@ if you still have issues.

Edit: we're now having an 'infinite retry' disk error:

SCSI device sda: drive cache: write back
ata1.00: limiting speed to UDMA/16
ata1.00: exception Emask 0x40 SAct 0x1 SErr 0x800 action 0x2
ata1.00: (irq_stat 0x40000008)
ata1.00: tag 0 cmd 0x60 Emask 0x41 stat 0x41 err 0x4 (internal error)
SCSI device sda: 1953525168 512-byte hdwr sectors (1000205 MB)
sda: Write Protect is off
SCSI device sda: drive cache: write back
ata1.00: limiting speed to PIO4
ata1.00: exception Emask 0x40 SAct 0x1 SErr 0x800 action 0x2
ata1.00: (irq_stat 0x40000008)
ata1.00: tag 0 cmd 0x60 Emask 0x41 stat 0x41 err 0x4 (internal error)
end_request: I/O error, dev sda, sector 603497953
SCSI device sda: 1953525168 512-byte hdwr sectors (1000205 MB)
sda: Write Protect is off
SCSI device sda: drive cache: write back
ata1.00: limiting speed to PIO3
ata1.00: exception Emask 0x40 SAct 0x0 SErr 0x800 action 0x2
ata1.00: (irq_stat 0x40000001)
ata1.00: tag 0 cmd 0x24 Emask 0x41 stat 0x41 err 0x4 (internal error)
SCSI device sda: 1953525168 512-byte hdwr sectors (1000205 MB)
sda: Write Protect is off
SCSI device sda: drive cache: write back
That is weird, as I'd bet money that's an 'enterprise grade' drive, which ought to fail straight out rather than looping like that. I'm heading down now.
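For anyone who wants to spot this kind of retry loop in their own logs, here is a rough Python sketch that counts the link-speed downgrades and I/O errors in a kernel log. It is not what we ran on dish. The patterns are copied from the excerpt above and other kernels may word these messages differently; the `summarize` function and the `SAMPLE` text are made up for illustration.

```python
import re
from collections import Counter

# Patterns copied from the log excerpt in this post (an assumption:
# other kernel versions may format these messages differently).
SPEED_RE = re.compile(r"^(ata\d+\.\d+): limiting speed to (\S+)")
IO_ERR_RE = re.compile(r"^end_request: I/O error, dev (\w+), sector (\d+)")
EXC_RE = re.compile(r"^(ata\d+\.\d+): exception Emask")


def summarize(lines):
    """Collect speed downgrades, I/O errors and exception counts per port."""
    downgrades = []        # (port, new speed) in the order they were logged
    io_errors = []         # (device, sector)
    exceptions = Counter() # port -> number of exception lines
    for raw in lines:
        line = raw.strip()
        m = SPEED_RE.match(line)
        if m:
            downgrades.append((m.group(1), m.group(2)))
            continue
        m = IO_ERR_RE.match(line)
        if m:
            io_errors.append((m.group(1), int(m.group(2))))
            continue
        m = EXC_RE.match(line)
        if m:
            exceptions[m.group(1)] += 1
    return {
        "downgrades": downgrades,
        "io_errors": io_errors,
        "exceptions": dict(exceptions),
    }


# A few lines taken from the excerpt above, used only as sample input.
SAMPLE = """\
ata1.00: limiting speed to UDMA/16
ata1.00: exception Emask 0x40 SAct 0x1 SErr 0x800 action 0x2
ata1.00: limiting speed to PIO4
ata1.00: exception Emask 0x40 SAct 0x1 SErr 0x800 action 0x2
end_request: I/O error, dev sda, sector 603497953
ata1.00: limiting speed to PIO3
ata1.00: exception Emask 0x40 SAct 0x0 SErr 0x800 action 0x2
"""

if __name__ == "__main__":
    print(summarize(SAMPLE.splitlines()))
```

A port that keeps stepping down through slower and slower modes while the same exception repeats is the looping behaviour described above; a drive that fails outright would normally stop producing new lines instead.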