Recently in the hosting status category

We are planning to move the servers apples, cerberus, bowl and branch
to the new data center on Tuesday evening this week (June 21, 2011).
Expect the downtime to start sometime after 9 PM PDT; everything
should be back up by midnight. Branch also needs a new disk, so we are
going to take it down earlier and rebuild the RAID with the new disk before
the move. It will be down starting at 7 PM PDT, and when it finishes
rebuilding we will start shutting down the other three servers.
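
For the curious, replacing a failed disk in a Linux software RAID mirror generally goes something like this; a rough sketch only, with /dev/md0 and the sdb partitions as placeholder names rather than whatever branch actually uses:

# mark the old disk failed and pull it from the array
mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm --manage /dev/md0 --remove /dev/sdb1
# after physically swapping the drive, copy the partition table
# over from the surviving disk, then add the new partition back
sfdisk -d /dev/sda | sfdisk /dev/sdb
mdadm --manage /dev/md0 --add /dev/sdb1
# watch the rebuild
cat /proc/mdstat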

If that goes well, hopefully we can move the second group of servers from the eliteweb cabinet on Wednesday night (knife, cauldron and council). See http://book.xen.prgmr.com/mediawiki/index.php/EGI_Moving
If you have any questions, please email support@prgmr.com. That second move might be delayed if there is a lot of support email to answer from the earlier move, or just a lot of regular tickets. Thanks!

Update: We're starting later for the first move. Now it will start more like midnight or after.

Update 00:08: We are rebooting and rebuilding branch before the move. Expect branch downtime to start in a few minutes; expect downtime for the other servers to start in maybe three hours.

Update 00:35: Branch is down and the RAID is rebuilding. Nobody else is rebooting yet.

Update 05:54: Branch is done rebuilding; apples, cerberus, and bowl are down for the move.

Update 07:29: All servers are coming back up, but all network connectivity is down. At this point, my provider seems to think the problem is somewhere in between, and not in my network, but it's too early to say.

Update 08:26: Cerberus doesn't want to boot because I removed a bad drive (a spare had already replaced it in the RAID). I put the drive back and it boots. Bowl is also down; I don't know what that problem is yet.

Update 08:38: Bowl and branch are both starting xendomains. The problem with bowl was that an upgrade blew out the GRUB menu.lst file; the problem with branch was that we had incorrectly set "power on after power failure" to off.

Update 08:51: All users should now be up and running. Complain otherwise.




IPv6 routing move

Tonight I moved the IPv6 routing to our newer software router, as we did with IPv4 last Thursday. It looks like it's working. Email support@prgmr.com and let us know if you are having any trouble. Thanks!

hydra rebooting shortly

We're trying to see if we can xm save the domains like we did on lion, rather than rebooting them as we did on boar, but it's a pretty old box, so we might end up rebooting you.

[root@hydra /]# uptime
 18:56:26 up 410 days, 15:44,  2 users,  load average: 0.09, 0.29, 0.25

All servers will be rebooted (as lion was today) for some kernel upgrades, and to consolidate all my he.net servers into one rack.
[root@lion ~]# uptime
 18:12:12 up 451 days, 17:42,  8 users,  load average: 0.01, 0.03, 0.00


As usual, if we don't screw it up, it will be about 20 minutes of downtime and no reboot for you, thanks to xm save/restore.
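
For anyone wondering how that works: we checkpoint each DomU's memory to disk before the Dom0 reboot and restore it afterward, so the guest only sees a pause, not a reboot. Roughly (the domain name and save path here are just examples):

# before the Dom0 reboot: checkpoint the guest's memory to a file
xm save yourdomain /var/lib/xen/save/yourdomain.save
# ... reboot the Dom0, new kernel, etc ...
# afterward: bring the guest back exactly where it left off
xm restore /var/lib/xen/save/yourdomain.save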

Xen domains not starting up.

http://book.xen.prgmr.com/mediawiki/index.php/Vif_doesnt_go_away_when_shutting_down

also: http://lists.xensource.com/archives/html/xen-devel/2009-10/msg00873.html
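
In the meantime, a quick way to spot the leftover vifs is to compare the running domains against the interfaces still attached to the bridge; something like this, assuming the usual bridged networking setup (interface and bridge names will vary):

# which domain IDs are actually still running
xm list
# vif<domid>.0 interfaces hanging around for domains that are
# no longer in the list above are the stale ones
brctl show
ip link | grep vif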

Worst case, I will reboot the server with a known-good Xen kernel. (I was using 3.4.2-rc; I will downgrade to 3.4.1.)

One way or another, the problem will be solved tonight.

Users who have not shut down are thus far unaffected. I will try to reboot without interfering much with their operation.

I should have reported this the other night


Update: rebooting branch (2:56 PST on Oct 19th).
Update: branch is up and everything is confirmed good (4:08 PST on Oct 19th).

 
We've been having some mysterious packet loss issues that look a lot like we are oversubscribing a 50 Mbps connection, but we're not; we have a 100 Mbps commit on a gig port. Our upstream believes the problem to be with our router, so I've been working on it. Anyhow, I found this on my Foundry:

BR-charon#show ip traffic
IP Statistics
  1916350241 received, 52550054 sent, 585142657 forwarded
  629624 filtered, 67 fragmented, 78 reassembled, 2033812 bad header
  14173 no route, 0 unknown proto, 0 no buffer, 632845943 other errors

So we're swapping it out for a ProCurve 2824 with firmware that was fresh this year (downright modern!).

Anyhow, there shouldn't be more than 120 seconds or so of downtime for anyone. We're doing the move incrementally; it should be done tonight.
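
If you want to check things from your end while this is going on, a plain ping or mtr run gives a decent read on packet loss (mtr may not be installed in your image by default, and the target below is just an example):

# 100 pings, then look at the "packet loss" summary line
ping -c 100 prgmr.com
# or per-hop loss, to see roughly where packets are disappearing
mtr --report --report-cycles 100 prgmr.com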

Update: we are done.

birds.prgmr.com will be rebooted at 4 PM on Monday, August 31st. We will configure Xen to save and restore all DomUs, so if everything goes as planned, you will see a 10-minute network outage, but none of the DomUs will appear to have been rebooted.
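
The usual way to do that with Xen of this vintage is the xendomains init script, which can save every running domain on shutdown and restore them on boot; the relevant knobs look roughly like this (the file location and path are the common defaults, shown as an example rather than our exact config):

# /etc/sysconfig/xendomains (illustrative excerpt)
# save running domains here when the Dom0 shuts down
XENDOMAINS_SAVE=/var/lib/xen/save
# restore any saved domains when the Dom0 boots back up
XENDOMAINS_RESTORE=true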

This is still fallout from the failure of that cheap Samsung; I am having trouble with hotplug on the 2.6.18.8-xen kernel, and need to reboot the box.

All users of birds.prgmr.com should have received a credit for two months for the earlier unplanned outage, so you won't receive credits for this outage.

So I had a customer the other day complaining that his domain was crashing. His console was spewing oom-killer errors, it was a 64 MiB domain, and the guy was trying to run Drupal. (Yeah, in 64 MiB of RAM. Laughing would be rude, but I tried to explain that the very idea of Drupal in that little RAM was silly.) So I gave him a little more RAM and showed him how to add swap, figuring that would solve his problem.
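
For anyone else in the same boat, adding a swap file inside your DomU looks roughly like this (size and path are just examples; a swap partition works too):

# create a 256 MiB file and turn it into swap
dd if=/dev/zero of=/swapfile bs=1M count=256
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
# make it stick across reboots
echo '/swapfile none swap sw 0 0' >> /etc/fstab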

He came back and complained of another error:

[98570.577133] INFO: task cron:9347 blocked for more than 120 seconds.
[98570.577153] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.

Now, I haven't seen that before... ever. I could have made some jokes about trying to run Drupal in 64 GiB of RAM, but I am unnerved when I see new error messages. (I've been doing this for half my life; I've had root on more than 60,000 servers. There are not many error messages I have not seen.)

A little searching came up with some possible Debian bugs, but none of my other Debian users are having trouble. So I nosed around on the Dom0 and found a bunch of disk errors:


ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata2.00: (BMDMA stat 0x0)
ata2.00: tag 0 cmd 0xc8 Emask 0x9 stat 0x51 err 0x40 (media error)
ata2: EH complete
raid1: sda2: redirecting sector 133440463 to another mirror
SCSI device sdb: 1953525168 512-byte hdwr sectors (1000205 MB)
sdb: Write Protect is off
sdb: Mode Sense: 00 3a 00 00
SCSI device sdb: drive cache: write back
SCSI device sdb: 1953525168 512-byte hdwr sectors (1000205 MB)
sdb: Write Protect is off
sdb: Mode Sense: 00 3a 00 00
SCSI device sdb: drive cache: write back
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata2.00: (BMDMA stat 0x0)
ata2.00: tag 0 cmd 0xc8 Emask 0x9 stat 0x51 err 0x40 (media error)
ata2: EH complete
SCSI device sdb: 1953525168 512-byte hdwr sectors (1000205 MB)
sdb: Write Protect is off
sdb: Mode Sense: 00 3a 00 00
SCSI device sdb: drive cache: write back
I ran a SMART test, and sure enough, /dev/sdb is bad. It failed the 'short' test with a read error. I removed it from the mirror and will replace it tomorrow. (Yes, this means that stables' disk performance will suck this Monday, while the rebuild runs.)
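
Running that check yourself is straightforward (the device name is whatever your suspect disk is; /dev/sdb here is just the example from this box):

# start the short self-test; it runs in the background for a few minutes
smartctl -t short /dev/sdb
# then look at the self-test results and the error log
smartctl -l selftest /dev/sdb
smartctl -l error /dev/sdb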

the smartctl output:

SMART Error Log Version: 1
ATA Error Count: 27 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 27 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 b3 38 f4 ec  Error: UNC at LBA = 0x0cf438b3 = 217331891

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 ad 38 f4 ec 0a  45d+15:35:34.008  READ DMA
  c8 00 08 35 50 f5 ec 0a  45d+15:35:34.008  READ DMA
  c8 00 08 ed 4f f5 ec 0a  45d+15:35:34.008  READ DMA
  c8 00 08 bd 4f f5 ec 0a  45d+15:35:34.008  READ DMA
  c8 00 20 cd ec 7c ed 0a  45d+15:35:34.008  READ DMA

Error 26 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 bb 38 f4 ec  Error: UNC at LBA = 0x0cf438bb = 217331899

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 b5 38 f4 ec 0a  45d+15:35:19.338  READ DMA
  ec 00 00 bb 38 f4 a0 0a  45d+15:35:18.368  IDENTIFY DEVICE

Error 25 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 bb 38 f4 ec  Error: UNC at LBA = 0x0cf438bb = 217331899

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 b5 38 f4 ec 0a  45d+15:35:17.418  READ DMA
  ec 00 00 bb 38 f4 a0 0a  45d+15:35:16.448  IDENTIFY DEVICE

Error 24 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 bb 38 f4 ec  Error: UNC at LBA = 0x0cf438bb = 217331899

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 b5 38 f4 ec 0a  45d+15:35:15.488  READ DMA
  ec 00 00 bb 38 f4 a0 0a  45d+15:35:14.648  IDENTIFY DEVICE

Error 23 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 bb 38 f4 ec  Error: UNC at LBA = 0x0cf438bb = 217331899

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 b5 38 f4 ec 0a  45d+15:35:13.648  READ DMA
  ec 00 00 bb 38 f4 a0 0a  45d+15:35:12.608  IDENTIFY DEVICE

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       20%      4884         217331891
# 2  Extended offline    Completed without error       00%         7         -
# 3  Short offline       Completed without error       00%         3         -

SMART Selective self-test log data structure revision number 1
(The '# 1 Short offline Completed: read failure' line makes it absolutely clear that the drive is bad, but usually I return for warranty any disk that has more than zero errors in the SMART error log.)

So yeah, if I had ignored him, many other users would have hit similar errors, and half-bad disks are much worse than fully-bad disks; they can, in fact, lead to data loss. It's a mirror, so as long as I get a new drive out there tomorrow, nobody should notice (other than, as I said, disk performance sucking during the rebuild tomorrow; I have lowered the rebuild speed, so maybe it won't have such a horrific impact this time).
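
Lowering the md rebuild speed is just a sysctl; something like this, with the numbers (in KB/s) as examples to tune rather than recommendations:

# cap resync/rebuild at about 5 MB/s so the array stays responsive
sysctl -w dev.raid.speed_limit_max=5000
sysctl -w dev.raid.speed_limit_min=1000
# equivalently: echo 5000 > /proc/sys/dev/raid/speed_limit_max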

We're still a little backlogged, so provisioning will be slower than usual for a while longer, but the delay shouldn't exceed 24 hours, so you can order now if you like.
