June 2009 Archives

Nick is provisioning users now.  

In other news, it turns out I don't have plague.  It's just allergies.  The doctor gave me some stuff, and I'm feeling a bit better already, so look for me to, ah, actually do something rather than just sitting here while Nick works like I have been for the last few days.  Thanks Nick. 

Apples is the first of a batch of 3 servers that have low-power 2.2Ghz shanghai quad-core opterons.  

rebuilding disk on stables

| | Comments (1)
ah. that disk that died on stables? i replaced it the other night but I did not start the rebuild until just now. I set:

[root@stables md]# pwd
/sys/block/md0/md
[root@stables md]# echo "10000"> sync_speed_max

so the rebuild should take around twice as long as usual, but hopefully will cause less performance degradation. Feedback is appreciated.

so I had a customer the other day complaining that his domain was crashing. Now, his console was spewing oom-killer errors, and it was a 64MiB domain, and the guy was trying to run drupal. (yeah, in 64 MiB ram. laughing would be rude, but I tried to explain that the very idea of drupal in that little ram was silly) So I gave him a little more ram and showed him how to add swap, figuring that would solve his problem.

He came back and complained of another error:

[98570.577133] INFO: task cron:9347 blocked for more than 120 seconds.
[98570.577153] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.

Now, I haven't seen that before...ever. I could have made some jokes about trying to run drupal in 64GiB ram, but I am unnerved when I see new error messages (I've been doing this for half my life. I've had root on more than 60,000 servers. There are not many error messages I have not seen.)

a little searching came up with some possible debian bugs, but none of my other debian users are having trouble. So I nosed around on the Dom0, and found a bunch of disk errors:


ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata2.00: (BMDMA stat 0x0)
ata2.00: tag 0 cmd 0xc8 Emask 0x9 stat 0x51 err 0x40 (media error)
ata2: EH complete
raid1: sda2: redirecting sector 133440463 to another mirror
SCSI device sdb: 1953525168 512-byte hdwr sectors (1000205 MB)
sdb: Write Protect is off
sdb: Mode Sense: 00 3a 00 00
SCSI device sdb: drive cache: write back
SCSI device sdb: 1953525168 512-byte hdwr sectors (1000205 MB)
sdb: Write Protect is off
sdb: Mode Sense: 00 3a 00 00
SCSI device sdb: drive cache: write back
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata2.00: (BMDMA stat 0x0)
ata2.00: tag 0 cmd 0xc8 Emask 0x9 stat 0x51 err 0x40 (media error)
ata2: EH complete
SCSI device sdb: 1953525168 512-byte hdwr sectors (1000205 MB)
sdb: Write Protect is off
sdb: Mode Sense: 00 3a 00 00
SCSI device sdb: drive cache: write back
I ran a smart test, and sure enough, /dev/sdb is bad. it failed the 'short' test with a read error. I removed it from the mirror, and will replace it tomorrow. (yes, this means that stables' disk performance will suck this Monday, as the rebuild goes.)

the smartctl output:

SMART Error Log Version: 1
ATA Error Count: 27 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 27 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 b3 38 f4 ec  Error: UNC at LBA = 0x0cf438b3 = 217331891

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 ad 38 f4 ec 0a  45d+15:35:34.008  READ DMA
  c8 00 08 35 50 f5 ec 0a  45d+15:35:34.008  READ DMA
  c8 00 08 ed 4f f5 ec 0a  45d+15:35:34.008  READ DMA
  c8 00 08 bd 4f f5 ec 0a  45d+15:35:34.008  READ DMA
  c8 00 20 cd ec 7c ed 0a  45d+15:35:34.008  READ DMA

Error 26 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 bb 38 f4 ec  Error: UNC at LBA = 0x0cf438bb = 217331899

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 b5 38 f4 ec 0a  45d+15:35:19.338  READ DMA
  ec 00 00 bb 38 f4 a0 0a  45d+15:35:18.368  IDENTIFY DEVICE

Error 25 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 bb 38 f4 ec  Error: UNC at LBA = 0x0cf438bb = 217331899

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 b5 38 f4 ec 0a  45d+15:35:17.418  READ DMA
  ec 00 00 bb 38 f4 a0 0a  45d+15:35:16.448  IDENTIFY DEVICE

Error 24 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 bb 38 f4 ec  Error: UNC at LBA = 0x0cf438bb = 217331899

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 b5 38 f4 ec 0a  45d+15:35:15.488  READ DMA
  ec 00 00 bb 38 f4 a0 0a  45d+15:35:14.648  IDENTIFY DEVICE

Error 23 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 bb 38 f4 ec  Error: UNC at LBA = 0x0cf438bb = 217331899

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 b5 38 f4 ec 0a  45d+15:35:13.648  READ DMA
  ec 00 00 bb 38 f4 a0 0a  45d+15:35:12.608  IDENTIFY DEVICE

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       20%      4884         217331891
# 2  Extended offline    Completed without error       00%         7         -
# 3  Short offline       Completed without error       00%         3         -

SMART Selective self-test log data structure revision number 1
(the '# 1 Short offline Completed: read failure ' makes it absolutely clear that the drive is bad; but usually I return for warranty any disk that has more than zero errors in the 'smart error log')

So yea, if I had ignored him, many others would have suffered similar errors, and well, half-bad disks are much worse than fully-bad disks; they can, in fact, lead to data loss. It's a mirror, so as long as I get a new drive out there tomorrow, nobody should notice (other than, as I said, disk performance sucking during the rebuild tomorrow. I have lowered the rebuild speed, so maybe it won't have such a horrific impact this time. )