so I had a customer the other day complaining that his domain was
crashing. Now, his console was spewing oom-killer errors, and it was a 64MiB domain, and the guy was trying to run drupal. (yeah, in 64 MiB ram. laughing would be rude, but I tried to explain that the very idea of drupal in that little ram was silly) So I gave him a little more ram and showed him how to add swap, figuring that would solve his problem.
He came back and complained of another error:
[98570.577133] INFO: task cron:9347 blocked for more than 120 seconds.
[98570.577153] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
Now, I haven't seen that before...ever. I could have made some jokes about trying to run drupal in 64GiB ram, but I am unnerved when I see new error messages (I've been doing this for half my life. I've had root on more than 60,000 servers. There are not many error messages I have not seen.)
a little searching came up with some possible debian bugs, but none of my other debian users are having trouble. So I nosed around on the Dom0, and found a bunch of disk errors:
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata2.00: (BMDMA stat 0x0)
ata2.00: tag 0 cmd 0xc8 Emask 0x9 stat 0x51 err 0x40 (media error)
ata2: EH complete
raid1: sda2: redirecting sector 133440463 to another mirror
SCSI device sdb: 1953525168 512-byte hdwr sectors (1000205 MB)
sdb: Write Protect is off
sdb: Mode Sense: 00 3a 00 00
SCSI device sdb: drive cache: write back
SCSI device sdb: 1953525168 512-byte hdwr sectors (1000205 MB)
sdb: Write Protect is off
sdb: Mode Sense: 00 3a 00 00
SCSI device sdb: drive cache: write back
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata2.00: (BMDMA stat 0x0)
ata2.00: tag 0 cmd 0xc8 Emask 0x9 stat 0x51 err 0x40 (media error)
ata2: EH complete
SCSI device sdb: 1953525168 512-byte hdwr sectors (1000205 MB)
sdb: Write Protect is off
sdb: Mode Sense: 00 3a 00 00
SCSI device sdb: drive cache: write back
I ran a smart test, and sure enough, /dev/sdb is bad. it failed the 'short' test with a read error. I removed it from the mirror, and will replace it tomorrow. (yes, this means that stables' disk performance will suck this Monday, as the rebuild goes.)
the smartctl output:
SMART Error Log Version: 1
ATA Error Count: 27 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 27 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 b3 38 f4 ec Error: UNC at LBA = 0x0cf438b3 = 217331891
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 ad 38 f4 ec 0a 45d+15:35:34.008 READ DMA
c8 00 08 35 50 f5 ec 0a 45d+15:35:34.008 READ DMA
c8 00 08 ed 4f f5 ec 0a 45d+15:35:34.008 READ DMA
c8 00 08 bd 4f f5 ec 0a 45d+15:35:34.008 READ DMA
c8 00 20 cd ec 7c ed 0a 45d+15:35:34.008 READ DMA
Error 26 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 bb 38 f4 ec Error: UNC at LBA = 0x0cf438bb = 217331899
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 b5 38 f4 ec 0a 45d+15:35:19.338 READ DMA
ec 00 00 bb 38 f4 a0 0a 45d+15:35:18.368 IDENTIFY DEVICE
Error 25 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 bb 38 f4 ec Error: UNC at LBA = 0x0cf438bb = 217331899
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 b5 38 f4 ec 0a 45d+15:35:17.418 READ DMA
ec 00 00 bb 38 f4 a0 0a 45d+15:35:16.448 IDENTIFY DEVICE
Error 24 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 bb 38 f4 ec Error: UNC at LBA = 0x0cf438bb = 217331899
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 b5 38 f4 ec 0a 45d+15:35:15.488 READ DMA
ec 00 00 bb 38 f4 a0 0a 45d+15:35:14.648 IDENTIFY DEVICE
Error 23 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 bb 38 f4 ec Error: UNC at LBA = 0x0cf438bb = 217331899
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 b5 38 f4 ec 0a 45d+15:35:13.648 READ DMA
ec 00 00 bb 38 f4 a0 0a 45d+15:35:12.608 IDENTIFY DEVICE
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 20% 4884 217331891
# 2 Extended offline Completed without error 00% 7 -
# 3 Short offline Completed without error 00% 3 -
SMART Selective self-test log data structure revision number 1
(the '# 1 Short offline Completed: read failure ' makes it absolutely clear that the drive is bad; but usually I return for warranty any disk that has more than zero errors in the 'smart error log')
So yea, if I had ignored him, many others would have suffered similar errors, and well, half-bad disks are much worse than fully-bad disks; they can, in fact, lead to data loss. It's a mirror, so as long as I get a new drive out there tomorrow, nobody should notice (other than, as I said, disk performance sucking during the rebuild tomorrow. I have lowered the rebuild speed, so maybe it won't have such a horrific impact this time. )