May 2011 Archives

mantle crashed

Today mantle crashed and stopped in the BIOS with the error "HT Link SYNC Error". It booted up normally after power cycling, and we have emailed Supermicro to investigate this error. Once we know everybody is back up and everything is OK again, we will update this post. Thanks, and email support@prgmr.com if you are still having trouble.

ipv6 routing move

Tonight I moved the IPv6 routing to our newer software router, as we did with IPv4 last Thursday. It looks like it's working. Email support@prgmr.com and let us know if you are having any trouble. Thanks!

bug: soft lockup on cauldron

cauldron hung around the same time that our router started having huge packet loss. It's coming back up now.


network problems at svtix

After the problems with packet loss that started during the server move, we moved the routing from our HP layer 3 switch to the Linux router we were using for BGP. It helped a lot, but that router has Realtek ethernet chipsets (Supermicro X7SLA motherboard) that are also starting to have problems. Earlier this week we noticed the kernel logging
[891920.491578] NETDEV WATCHDOG: eth0: transmit timed out
[891920.516542] r8169: eth0: link up
about every 20 minutes. At about 10:30 this morning it started doing that every few minutes, and packet loss to the Linux router became much higher at the same time.

We've been planning to set up another dedicated PC router with Intel chipsets (Supermicro X7SBL) this week, so that should be ready today. Hopefully it will fix things enough that we can get back to moving the rest of the Xen servers to Market Post Tower, which should also help decrease the network load.

Update: We've upgraded the kernel on the router from 2.6.26 to 2.6.32, and so far it's helping.

Update: We're all finished moving to this linux router now, and things seem to be working. Email support if you are still having trouble. Thanks!
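
The watchdog messages quoted above carry kernel timestamps in seconds, so the gap between consecutive timeouts can be read straight out of the log. Here is a minimal Python sketch of that, assuming log lines in the format shown in this post. The name `watchdog_intervals` is made up for illustration; it is not a tool we run on the router. Gaps near 1200 seconds would match "about every 20 minutes", and much smaller gaps would match what we saw this morning.

```python
import re

# Matches kernel log lines such as:
#   [891920.491578] NETDEV WATCHDOG: eth0: transmit timed out
# (format assumed from the lines quoted in this post)
WATCHDOG_RE = re.compile(
    r"^\[\s*(\d+\.\d+)\]\s+NETDEV WATCHDOG: ([^:\s]+): transmit timed out"
)


def watchdog_intervals(lines):
    """Return {interface: [seconds between consecutive watchdog timeouts]}."""
    last_seen = {}   # interface -> timestamp of its previous timeout
    intervals = {}   # interface -> list of gaps in seconds
    for line in lines:
        match = WATCHDOG_RE.match(line.strip())
        if not match:
            continue  # not a watchdog timeout line
        timestamp = float(match.group(1))
        iface = match.group(2)
        gaps = intervals.setdefault(iface, [])
        if iface in last_seen:
            gaps.append(timestamp - last_seen[iface])
        last_seen[iface] = timestamp
    return intervals
```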

DHCP is back

DHCP was broken for some customers this week. These days, when we set people up, we assign static IP addresses, so most people were not affected unless they tried to boot into the rescue image or had switched back to DHCP (common for customers who have installed their own image or re-installed).

The problem was that when we moved routers, we did not set up DHCP forwarding on the new router. If you were taken down by this problem, email us and we'll give you some credit.

The problem is fixed now; you may need to kick your DHCP client (rebooting should do the job).
Personalities : [raid1] 
md1 : active raid1 sdb2[1] sda2[0]
      966277504 blocks [2/2] [UU]
      
md2 : active raid1 sde2[2] sdd2[1] sdc2[3](F)
      966277504 blocks [2/1] [_U]
      [>....................]  recovery =  0.0% (717568/966277504) finish=583.0min speed=27598K/sec
      
md0 : active raid1 sde1[2] sdd1[3] sdc1[4](F) sdb1[1] sda1[0]
      10482304 blocks [4/4] [UUUU]


packet loss at SVTIX

We're replacing the router we have now with a Linux-based router, which should either fix the problem (it has more CPU, and CPU seems to be the bottleneck) or give us more tools to figure out what the problem is.

We currently have most subnets moved over, and they are looking good.

edit at 07:33: we have all Xen guests moved over, well, sort of. The old router is still in the loop, and we're not really comfortable removing it until we get some sleep. Packet loss is hovering around 0.1%, which is not great, but it's good enough that we should probably stop messing with it for now.

move to MPT, day 2

Tonight, we are continuing our move.

Load 1 includes: pearl, jay, rutledge, chessboard, whetstone, and mantle.

edit at 22:35: load 1 downtime starts.
edit at 00:13: load 1 downtime ends. Everyone should be back up.

It has begun

We've begun moving servers from SVTIX to Market Post Tower.

I'm starting with test servers blood and frogs. No customers are impacted yet, but blood
and frogs and the associated rails are on my cart and about to begin the five block dash to the other data center.

edit at 00:54: downtime is beginning now for customers on beak, dow and hydra.

edit at 02:06: beak, dow and hydra are back and domains are starting up again

edit at 02:15: beak, dow and hydra are back, all domains started. Complain if you are still down.

edit at 03:18: lozenge, dish, crock, coat, coins, and chime are shutting down.

edit at 03:44: dish and crock crashed with "BUG: soft lockup detected on CPU#0!". It is unknown how many domains were shut down uncleanly. All others in the list were shut down cleanly.

edit at 05:17: lozenge, crock, coat, coins, and chime are back up and are starting up guests. dish is still down.

edit at 05:46:  dish is up, guests on dish are starting
  
edit at 05:59: all guests are now up. 

  
We're moving most of our servers from SVTIX[1] to Market Post Tower[2] this
weekend. We've got layer 2 between the two locations already, so we'll
be bringing servers down and moving them in batches of five.
Each user should experience something like two hours of downtime, if all
goes well.

This, along with the bandwidth outage earlier this month, puts me way
over the SLA, so all users we move will be receiving 25% of a month off.

I am sorry for the short notice;  I've made several bad decisions
that led to this short notice, but I think I am correcting some of those.
The deal wasn't even final until two days ago, but the solution to that 
is more transparency, not less.  Well, more transparency and longer
term agreements.   I am sorry.  
 
I'm now signing a long-term contract, and I'll put data as to when these
contracts expire in a publicly accessible place, so this sort of thing
will be more predictable in the future.  

Now, Market Post Tower is generally considered a better data center
than SVTIX. It's usually more expensive, it's much nicer looking, and it
has /much/ better and cheaper bandwidth available. I will immediately
double the bandwidth "don't worry about it" allowances for everyone.
Now, SVTIX has a better history for power, but CoreSite, I am assured,
has recently upgraded the power systems and the bad old days should be
behind us.

In any case, I am maintaining space at both MPT and SVTIX, though SVTIX
will be largely co-location; most of my Xen hosts will be moved to MPT.
For now, my bandwidth at SVTIX will be coming from MPT, which should
be an improvement for users still at SVTIX.

Please see http://wiki.xen.prgmr.com/xenophilia/ for the blow-by-blow.

[1] http://www.svtix.com/ at 250 Market St. in San Jose
[2] http://en.wikipedia.org/wiki/Market_Post_Tower at 55 S. Market in San Jose

I am replacing Table tonight

From the email I just sent to table users:

Your vps is on a piece of hardware that we call table;  it's a 6 core
socket c32 opteron with 16GiB ram.   Ever since the power outage at he.net
Fremont, it has rebooted once every 24-48 hours.   On a hunch, we replaced
the power supply, but this just delayed the expected reboot another 12
hours; just long enough to let us think we had solved the problem.

Anyhow, there's not much more I can do with the server online, so
here is what I'm going to do. I'm pulling one of my spare 8 core dual-socket
Nehalem Xeon boxes out of the spares pool and putting all the spare RAM I
can in it (I have 2GiB modules coming out my ears). That server will only
have 12GiB RAM, but as there is 5GiB free on table's 16GiB, this should
work just fine. This new server has considerably more horsepower than the
old one.

The plan tonight is to gracefully shut down table, then to swap the
drives into this new server, then take table out for better testing.
Assuming that the problem is something other than the hard drives or the
data on the hard drives, this should solve our problem.

Now, I know the reliability of table has been completely unacceptable,
so in addition to the 1/4 month of credit everyone on prgmr.com servers
at he.net Fremont is getting, all table users will get another free month.
I understand this doesn't make up for the problem, but consider it an
apology.


edit: table is back up, I'm bringing up the xendomains as we speak.

edit: all domains are back up.  complain loudly if yours is still broken. 

table outage

table is still having problems after yesterday's power issue. We are attempting to rebuild the mirror in single-user mode, which should finish in 1-2 hours. This solved a similar problem on other hosts at he.net on Friday.

update: Table is back, users are coming back up.  

update: all users on table should be up.  

update: we replaced the power supply, and that looked like it fixed the issue, but it rebooted again last night. I'm preparing a complete hardware move now.

According to the incident report I got from my upstream, there was a PG&E outage that caused power loss to parts of southern Alameda County.

he.net Fremont 1 was affected when the automatic transfer switches properly started the backup generator but failed to cut over. The incident lasted approximately one hour.

Now, this affected network connectivity to SVTIX, as right now SVTIX is he.net bandwidth only, and it goes through he.net Fremont 1, so we had an hour or so of network interruption at SVTIX.

I'll be taking steps to make my network at svtix more diverse later today.  For now, all xen servers at he.net save for ingot should be back up.   45 minute ETA on ingot.  

Update: KVM guests are up. coral and ingot are still down with weird disk issues. I'm bringing hardware and boot disks down to the co-lo to support Nick in person.

edit: ingot is up. ETA on coral is 110 minutes.

Note: the main prgmr.com website is on coral, as is our support@prgmr.com email system. lsc@prgmr.com still works.

edit: coral is up.

edit: we missed table.prgmr.com; it's still down. Nick is working on it now.