August 2010 Archives
Edit: there was a problem; we brought the wrong disk. cattle and girdle were built back when we were using consumer-grade drives like morons, and we sized the RAID for 1.5TB disks, so the 1TB 'enterprise' disks won't work without a /whole lot/ of work. We ended up hitting Fry's and buying two more 2TB 'consumer grade' disks that we'll just short-stroke down to 1.5TB, 'cause they didn't have any 1.5TB disks that were faster than 6000rpm. The last disks lasted north of a year; if we can get another year out of these new disks, I'll be happy.
Anyhow, cattle is coming back up as we speak; when cattle is up we will shut down girdle and replace that drive, too.
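For anyone curious what short-stroking a 2TB disk down to 1.5TB works out to, here is a rough sketch of the arithmetic. It assumes "1.5TB" and "2TB" mean decimal terabytes, 512-byte sectors (as the kernel reports elsewhere on this page), and a 1 MiB alignment; all names are made up for illustration. On a real rebuild you would copy the exact partition size from the surviving RAID member rather than trust round numbers like these.

```python
SECTOR_BYTES = 512                   # logical sector size (assumption, matches the logs on this page)
MEMBER_BYTES = 1_500_000_000_000     # assumption: existing RAID member is a nominal 1.5 TB
DISK_BYTES = 2_000_000_000_000       # assumption: replacement disk is a nominal 2 TB
ALIGN_SECTORS = 2048                 # assumption: align the partition to 1 MiB

def short_stroke_sectors(member_bytes, sector_bytes=SECTOR_BYTES, align=ALIGN_SECTORS):
    """Smallest aligned sector count that is at least as big as the old member."""
    sectors = -(-member_bytes // sector_bytes)   # ceiling division
    remainder = sectors % align
    if remainder:
        sectors += align - remainder             # round up so the partition is never too small
    return sectors

used = short_stroke_sectors(MEMBER_BYTES)
fraction = used * SECTOR_BYTES / DISK_BYTES

print(f"use {used} sectors, roughly {fraction:.0%} of the disk")
```

The partition covers only the first three quarters or so of the disk, and the rest is simply left unused.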
horn froze up this morning. It was down for about 3 hours before I rebooted it. Unfortunately, I'm an asshole and didn't remember that horn and chariot are both in the same chassis, so by yanking power to horn, I also yanked power to chariot. So chariot also got an unclean reboot (though, as it wasn't frozen up, total downtime there was more like 15-30 minutes, depending on what order your domain is started in).
Everyone should be back up at this point.
SCSI device sda: drive cache: write back
ata1.00: exception Emask 0x40 SAct 0x1 SErr 0x800 action 0x2
ata1.00: (irq_stat 0x40000008)
ata1.00: tag 0 cmd 0x60 Emask 0x49 stat 0x41 err 0x40 (internal error)
SCSI device sda: 1953525168 512-byte hdwr sectors (1000205 MB)
sda: Write Protect is off
BUG: soft lockup detected on CPU#0!
Call Trace:
[ ] softlockup_tick+0xce/0xe0
[ ] timer_interrupt+0x3a8/0x402
[ ] handle_IRQ_event+0x4e/0x96
[ ] __do_IRQ+0xa4/0x105
[ ] do_IRQ+0x44/0x4d
[ ] evtchn_do_upcall+0x19e/0x256
[ ] do_hypervisor_callback+0x1e/0x2c
[ ] show_rd_sect+0x0/0x68
[ ] __read_lock_failed+0x8/0x14
[ ] get_device+0x17/0x20
[ ] .text.lock.spinlock+0x53/0x8a
[ ] show_rd_sect+0x27/0x68
[ ] sysfs_read_file+0xa5/0x12c
[ ] vfs_read+0xcb/0x171
[ ] sys_read+0x45/0x6e
[ ] tracesys+0xab/0xb5
I will be tracking my debugging process here. (As of this moment, the server has been rebooted, and all domains should be back within 10 minutes or so.)
Everyone ought to be back up now; please complain to support@ if you still have issues.
Edit: we're now having an 'infinite retry' disk error:
SCSI device sda: drive cache: write back
ata1.00: limiting speed to UDMA/16
ata1.00: exception Emask 0x40 SAct 0x1 SErr 0x800 action 0x2
ata1.00: (irq_stat 0x40000008)
ata1.00: tag 0 cmd 0x60 Emask 0x41 stat 0x41 err 0x4 (internal error)
SCSI device sda: 1953525168 512-byte hdwr sectors (1000205 MB)
sda: Write Protect is off
SCSI device sda: drive cache: write back
ata1.00: limiting speed to PIO4
ata1.00: exception Emask 0x40 SAct 0x1 SErr 0x800 action 0x2
ata1.00: (irq_stat 0x40000008)
ata1.00: tag 0 cmd 0x60 Emask 0x41 stat 0x41 err 0x4 (internal error)
end_request: I/O error, dev sda, sector 603497953
SCSI device sda: 1953525168 512-byte hdwr sectors (1000205 MB)
sda: Write Protect is off
SCSI device sda: drive cache: write back
ata1.00: limiting speed to PIO3
ata1.00: exception Emask 0x40 SAct 0x0 SErr 0x800 action 0x2
ata1.00: (irq_stat 0x40000001)
ata1.00: tag 0 cmd 0x24 Emask 0x41 stat 0x41 err 0x4 (internal error)
SCSI device sda: 1953525168 512-byte hdwr sectors (1000205 MB)
sda: Write Protect is off
SCSI device sda: drive cache: write back
Which is weird, as I'd bet money that's an 'enterprise grade' drive that ought to fail straight out rather than looping like that. I'm heading down now.
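A loop like the one in the log above could probably be spotted automatically by counting how many times the kernel steps a port down to a slower transfer mode. The sketch below is only an illustration: the function names and the threshold of three downgrades are my own assumptions, not anything libata defines, and the sample is a trimmed copy of the lines above.

```python
import re

# Matches the "limiting speed to <mode>" lines seen in the log excerpt.
SPEED_RE = re.compile(r"^(ata\d+\.\d+): limiting speed to (\S+)")

def speed_downgrades(log_text):
    """Return {port: [modes, ...]} in the order the kernel stepped down."""
    steps = {}
    for line in log_text.splitlines():
        match = SPEED_RE.match(line.strip())
        if match:
            steps.setdefault(match.group(1), []).append(match.group(2))
    return steps

def looks_like_retry_loop(log_text, threshold=3):
    """Heuristic: flag ports with at least `threshold` downgrades (arbitrary cutoff)."""
    return [port for port, modes in speed_downgrades(log_text).items()
            if len(modes) >= threshold]

sample = """\
ata1.00: limiting speed to UDMA/16
ata1.00: tag 0 cmd 0x60 Emask 0x41 stat 0x41 err 0x4 (internal error)
ata1.00: limiting speed to PIO4
end_request: I/O error, dev sda, sector 603497953
ata1.00: limiting speed to PIO3
"""

print(speed_downgrades(sample))
print(looks_like_retry_loop(sample))
```

Something like this could run against the dom0's kernel log and page someone before the box locks up, though a drive that keeps retrying instead of failing still needs a human to pull it.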