Update: We're starting later for the first move. Now it will start more like midnight or after.
Recently in hosting status Category
[root@hydra /]# uptime
18:56:26 up 410 days, 15:44, 2 users, load average: 0.09, 0.29, 0.25
18:12:12 up 451 days, 17:42, 8 users, load average: 0.01, 0.03, 0.00
as usual, if we don't screw it up it will be 20 minutes downtime and no reboot for you, due to xm save/restore
also: http://lists.xensource.com/archives/html/xen-devel/2009-10/msg00873.html
worst-case I will reboot the server with a known-good xen kernel. (I was using 3.4.2-rc; I will downgrade to 3.4.1.)
One way or another, the problem will be solved tonight.
Users who have not shut down are thus far unaffected. I will try to reboot without interfering much with their operation.
I should have reported this the other night
Update: rebooting branch (2:56 PST on Oct 19th)
Update: branch is up and everything is confirmed good (4:08 PST on Oct 19th)
BR-charon#show ip traffic
IP Statistics
  1916350241 received, 52550054 sent, 585142657 forwarded
  629624 filtered, 67 fragmented, 78 reassembled, 2033812 bad header
  14173 no route, 0 unknown proto, 0 no buffer, 632845943 other errors
so we're swapping it out with a procurve 2824 with firmware that was fresh this year (downright modern!)
anyhow, there shouldn't be more than 120 seconds or so downtime for anyone. we're doing the move incrementally. It should be done tonight.
Update: we are done.
This is still fallout from the failure of that cheap samsung; I am having trouble with hotplug on the 2.6.18.8-xen kernel, and need to reboot the box.
All users of birds.prgmr.com should have received a credit for two months for the earlier unplanned outage, so you won't receive credits for this outage.
He came back and complained of another error:
[98570.577133] INFO: task cron:9347 blocked for more than 120 seconds.
[98570.577153] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Now, I haven't seen that before...ever. I could have made some jokes about trying to run drupal in 64GiB ram, but I am unnerved when I see new error messages (I've been doing this for half my life. I've had root on more than 60,000 servers. There are not many error messages I have not seen.)
a little searching came up with some possible debian bugs, but none of my other debian users are having trouble. So I nosed around on the Dom0, and found a bunch of disk errors:
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata2.00: (BMDMA stat 0x0)
ata2.00: tag 0 cmd 0xc8 Emask 0x9 stat 0x51 err 0x40 (media error)
ata2: EH complete
raid1: sda2: redirecting sector 133440463 to another mirror
SCSI device sdb: 1953525168 512-byte hdwr sectors (1000205 MB)
sdb: Write Protect is off
sdb: Mode Sense: 00 3a 00 00
SCSI device sdb: drive cache: write back
SCSI device sdb: 1953525168 512-byte hdwr sectors (1000205 MB)
sdb: Write Protect is off
sdb: Mode Sense: 00 3a 00 00
SCSI device sdb: drive cache: write back
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata2.00: (BMDMA stat 0x0)
ata2.00: tag 0 cmd 0xc8 Emask 0x9 stat 0x51 err 0x40 (media error)
ata2: EH complete
SCSI device sdb: 1953525168 512-byte hdwr sectors (1000205 MB)
sdb: Write Protect is off
sdb: Mode Sense: 00 3a 00 00
SCSI device sdb: drive cache: write back

I ran a smart test, and sure enough, /dev/sdb is bad. it failed the 'short' test with a read error. I removed it from the mirror, and will replace it tomorrow. (yes, this means that stables' disk performance will suck this Monday, as the rebuild goes.)
the smartctl output:
SMART Error Log Version: 1
ATA Error Count: 27 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 27 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 b3 38 f4 ec Error: UNC at LBA = 0x0cf438b3 = 217331891
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 ad 38 f4 ec 0a 45d+15:35:34.008 READ DMA
c8 00 08 35 50 f5 ec 0a 45d+15:35:34.008 READ DMA
c8 00 08 ed 4f f5 ec 0a 45d+15:35:34.008 READ DMA
c8 00 08 bd 4f f5 ec 0a 45d+15:35:34.008 READ DMA
c8 00 20 cd ec 7c ed 0a 45d+15:35:34.008 READ DMA
Error 26 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 bb 38 f4 ec Error: UNC at LBA = 0x0cf438bb = 217331899
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 b5 38 f4 ec 0a 45d+15:35:19.338 READ DMA
ec 00 00 bb 38 f4 a0 0a 45d+15:35:18.368 IDENTIFY DEVICE
Error 25 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 bb 38 f4 ec Error: UNC at LBA = 0x0cf438bb = 217331899
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 b5 38 f4 ec 0a 45d+15:35:17.418 READ DMA
ec 00 00 bb 38 f4 a0 0a 45d+15:35:16.448 IDENTIFY DEVICE
Error 24 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 bb 38 f4 ec Error: UNC at LBA = 0x0cf438bb = 217331899
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 b5 38 f4 ec 0a 45d+15:35:15.488 READ DMA
ec 00 00 bb 38 f4 a0 0a 45d+15:35:14.648 IDENTIFY DEVICE
Error 23 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 bb 38 f4 ec Error: UNC at LBA = 0x0cf438bb = 217331899
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 b5 38 f4 ec 0a 45d+15:35:13.648 READ DMA
ec 00 00 bb 38 f4 a0 0a 45d+15:35:12.608 IDENTIFY DEVICE
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       20%      4884         217331891
# 2  Extended offline    Completed without error       00%         7         -
# 3  Short offline       Completed without error       00%         3         -
SMART Selective self-test log data structure revision number 1
(the '# 1 Short offline Completed: read failure ' makes it absolutely clear that the drive is bad; but usually I return for warranty any disk that has more than zero errors in the 'smart error log')
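As a sanity check, the LBA that smartctl prints next to each UNC error can be rebuilt by hand from the register dump. Here is a minimal Python sketch, assuming the usual 28-bit LBA layout for READ DMA (command c8): SN carries bits 0-7, CL bits 8-15, CH bits 16-23, and the low nibble of DH bits 24-27. The second helper assumes the Powered_Up_Time field is a 32-bit millisecond counter, which would explain the "wraps after 49.710 days" note. Both function names are made up for illustration and are not part of smartctl.

```python
def lba28_from_registers(sn: int, cl: int, ch: int, dh: int) -> int:
    """Rebuild a 28-bit LBA from the ATA task-file registers.

    Only the low four bits of DH belong to the address; the high
    bits are drive-select and mode flags (assumed LBA28 layout).
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


def powered_up_wrap_days(counter_bits: int = 32) -> float:
    """Days until a millisecond counter of the given width wraps.

    The 32-bit width is an assumption, not something the log states.
    """
    milliseconds = 2 ** counter_bits
    return milliseconds / 1000 / 86400


# Registers after Error 27: SN=b3 CL=38 CH=f4 DH=ec
error_27 = lba28_from_registers(0xB3, 0x38, 0xF4, 0xEC)
# Registers after Errors 23 through 26: SN=bb CL=38 CH=f4 DH=ec
error_26 = lba28_from_registers(0xBB, 0x38, 0xF4, 0xEC)

print(hex(error_27), error_27)
print(hex(error_26), error_26)
print(round(powered_up_wrap_days(), 3))
```

The value decoded for Error 27 is the same sector the self-test log lists as LBA_of_first_error, so the short test tripped over the very block the earlier READ DMA commands could not read.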
So yeah, if I had ignored him, many others would have suffered similar errors, and well, half-bad disks are much worse than fully-bad disks; they can, in fact, lead to data loss. It's a mirror, so as long as I get a new drive out there tomorrow, nobody should notice (other than, as I said, disk performance sucking during the rebuild tomorrow. I have lowered the rebuild speed, so maybe it won't have such a horrific impact this time.)
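For anyone curious about the rebuild speed knob: on Linux software RAID (md) the resync rate is bounded by /proc/sys/dev/raid/speed_limit_min and /proc/sys/dev/raid/speed_limit_max, both in KiB/s per device. The post does not say which setting was changed or to what value, so the Python sketch below is an assumption throughout. The helper name, the `raid_dir` parameter, and the 10000 KiB/s figure are illustrative only.

```python
from pathlib import Path


def set_rebuild_speed_max(kib_per_sec: int, raid_dir: str = "/proc/sys/dev/raid") -> int:
    """Cap the md resync rate and return the previous cap in KiB/s.

    raid_dir is a parameter so the function can be tried against a
    scratch directory; writing the real file requires root.
    """
    if kib_per_sec <= 0:
        raise ValueError("speed limit must be a positive number of KiB/s")
    limit_file = Path(raid_dir) / "speed_limit_max"
    previous = int(limit_file.read_text().strip())
    limit_file.write_text(f"{kib_per_sec}\n")
    return previous


if __name__ == "__main__":
    import tempfile

    # Dry run against a temporary directory instead of the live /proc tree.
    with tempfile.TemporaryDirectory() as scratch:
        (Path(scratch) / "speed_limit_max").write_text("200000\n")
        old = set_rebuild_speed_max(10000, raid_dir=scratch)
        print("previous cap:", old)
        print("new cap:", (Path(scratch) / "speed_limit_max").read_text().strip())
```

A lower cap stretches the rebuild out, which also means the mirror runs on one disk for longer, so the value is a tradeoff between guest I/O and time spent without redundancy.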