Last night we replaced a hard drive in rutledge, and the RAID was rebuilding normally until the disk completely froze. I rebooted it and I'm letting the RAID rebuild in single-user mode now. When it's done, I will update the blog here. Email support@prgmr.com if you have any questions. Thanks!
Update 20111214 8:36AM PST: Luke here. The dang thing rebuilt, then rebuilt again. I'm suspecting a bad drive. SMART on the thing hangs, and it reports drive errors (that all have to do with SMART). So I don't have real solid evidence that the drive is bad, but no SMART, if you ask me, is enough reason to trash the drive anyhow.
Error 1 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 51 01 37 00 00 00 Error: ABRT
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
b0 d6 01 be 4f c2 00 00 00:00:46.516 SMART WRITE LOG
b0 d5 20 bf 4f c2 00 00 00:00:46.016 SMART READ LOG
b0 d6 01 be 4f c2 00 00 00:00:46.006 SMART WRITE LOG
b0 d5 01 bf 4f c2 00 00 00:00:45.517 SMART READ LOG
b0 d6 01 be 4f c2 00 00 00:00:45.507 SMART WRITE LOG
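For anyone wondering how to read the register line in that log, here is a rough sketch in Python that decodes the ER (error) and ST (status) bytes into bit names. The bit names are my assumption based on the older ATA register conventions (some bits were renamed or obsoleted in later revisions), and the function names are made up for illustration; this is not part of smartctl.

```python
# Bit names for the ATA status (ST) and error (ER) registers.
# Assumption: names follow older ATA conventions; later revisions
# renamed or obsoleted some of these bits.
STATUS_BITS = {
    0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x10: "DSC",
    0x08: "DRQ", 0x04: "CORR", 0x02: "IDX", 0x01: "ERR",
}
ERROR_BITS = {
    0x80: "ICRC", 0x40: "UNC", 0x20: "MC", 0x10: "IDNF",
    0x08: "MCR", 0x04: "ABRT", 0x02: "NM", 0x01: "AMNF",
}


def decode_bits(value, names):
    """Return the names of every bit set in value, highest bit first."""
    return [name for bit, name in names.items() if value & bit]


def decode_register_line(line):
    """Decode the first two hex fields (ER, ST) of a smartctl register line."""
    fields = line.split()
    er = int(fields[0], 16)
    st = int(fields[1], 16)
    return {
        "error": decode_bits(er, ERROR_BITS),
        "status": decode_bits(st, STATUS_BITS),
    }


if __name__ == "__main__":
    print(decode_register_line("04 51 01 37 00 00 00 Error: ABRT"))
```

On the line from the log above, ER=04 decodes to ABRT (the drive aborted the command) and ST=51 decodes to DRDY, DSC and ERR, which matches the "Error: ABRT" that smartctl printed.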
Edit at 20111214 10:30AM PST:
The thing rebuilt successfully and was rebooted about an hour ago. SMART tests now look good (at least the short and conveyance tests; the long test still has 10% to go, and I'll update when that's done).
I'm no longer at all sure it was a disk problem; I've seen errors like this when a rebuild was running too fast. (There's a /sys/ entry that lets you limit rebuild speed, and we need to tweak that down next time. It used to limit itself to something reasonable.)
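As a rough sketch of what tweaking that down could look like: on Linux software RAID (md), each array has a sync_speed_max file under /sys/block/<array>/md/, and I'm assuming that is the entry meant here. The array name md0, the value 20000 and the helper names below are placeholders for illustration, not what we actually run. The value is roughly in kilobytes per second, and writing it needs root.

```python
from pathlib import Path


def sync_speed_max_path(array, sysfs_root="/sys"):
    """Build the path to an md array's per-array rebuild speed cap.

    sysfs_root is a parameter only so the helper can be pointed at a
    scratch directory for testing; on a real box it stays "/sys".
    """
    return Path(sysfs_root) / "block" / array / "md" / "sync_speed_max"


def set_rebuild_speed_max(array, kb_per_sec, sysfs_root="/sys"):
    """Cap the rebuild (resync) speed for one md array."""
    path = sync_speed_max_path(array, sysfs_root)
    path.write_text(f"{kb_per_sec}\n")
    return path


def get_rebuild_speed_max(array, sysfs_root="/sys"):
    """Read back whatever the sync_speed_max file currently contains."""
    return sync_speed_max_path(array, sysfs_root).read_text().strip()


if __name__ == "__main__":
    # Hypothetical example: cap md0 at about 20 MB/s during a rebuild.
    set_rebuild_speed_max("md0", 20000)
    print(get_rebuild_speed_max("md0"))
```

The setting does not survive a reboot, so it would need to be reapplied before the next rebuild.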
I'll update again when the SMART error clears; for now, the machine is up, and I don't expect any more reboots.