Recently in outage Category

bull is down again after the move

It looks like bull went down around 11:30 Pacific Time on Tuesday, with no errors on the console. I suspect a power supply or other hardware failure, because when we first turned it on there was a loud pop. Trying again a few minutes later worked, but resetting the power port no longer helps. Luke is going to try putting its disks in a newer system.

We will update here when it's working again, and users on bull will also get a free month. Please email support@prgmr.com if you have any questions. Thanks.
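For the curious, "resetting the power port" means remotely cycling an outlet on the switched PDU in the rack. A rough sketch of what that can look like with an APC-style PDU over SNMP is below; the hostname, community string, outlet number, and even the PDU brand are assumptions for illustration, not our actual setup.

    # cycle outlet 3 on an APC-style switched PDU via SNMP
    # (sPDUOutletCtl values: 1 = on, 2 = off, 3 = reboot)
    # hostname, community, and outlet number are placeholders
    snmpset -v1 -c private pdu.example.com \
        .1.3.6.1.4.1.318.1.1.4.4.2.1.3.3 i 3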

Edit at 01:35 by Luke:

I've got a spare PSU, and if that doesn't work I have a whole spare server that is largely identical to bull; I'll just swap the disks and be done with it. 


Edit at 02:42 by Luke:

I swapped the PSU and bull is coming back online.

Edit at 02:51 by Luke:

All customers on bull are back up.


We are planning to move the servers apples, cerberus, bowl, and branch to the new data center on Tuesday evening this week (June 21, 2011). Expect the downtime to start sometime after 9PM PDT; everything should be back up by midnight. Branch also needs a new disk, so we are going to take it down earlier and rebuild the RAID with the new disk before the move. It will be down starting at 7PM PDT, and when it finishes rebuilding we will start shutting down the other three servers.
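For those wondering what rebuilding the RAID involves: assuming Linux software RAID (md), the disk swap is roughly the mdadm sequence below. Array and device names are examples, not branch's actual layout.

    # fail and remove the dying disk from the array (names are examples)
    mdadm /dev/md0 --fail /dev/sdb1
    mdadm /dev/md0 --remove /dev/sdb1

    # after physically swapping in the new disk, copy the partition
    # table from the surviving disk and add the new partition back
    sfdisk -d /dev/sda | sfdisk /dev/sdb
    mdadm /dev/md0 --add /dev/sdb1

    # watch the rebuild progress
    cat /proc/mdstat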

If that goes well, hopefully we can move the second group of servers from the eliteweb cabinet on Wednesday night (knife, cauldron, and council). See http://book.xen.prgmr.com/mediawiki/index.php/EGI_Moving
If you have any questions, please email support@prgmr.com. Replies might be delayed if there is a lot of support email to answer from the earlier move, or just regular tickets. Thanks!

Update: We're starting the first move later than planned; it will now start around midnight or after.

Update 00:08: We are rebooting and rebuilding branch before the move. Expect branch downtime to start in a few minutes, and downtime for the other servers to start in about three hours.

Update 00:35: Branch is down and the RAID is rebuilding. No other servers are rebooting yet.

Update 05:54: Branch is done rebuilding; apples, cerberus, and bowl are down for the move.

Update 07:29: All servers are coming back up, but all network connectivity is down. At this point, my provider seems to think the problem is somewhere in between and not in my network, but it's too early to say.

Update 08:26: Cerberus doesn't want to boot because I removed a bad drive (a spare had already replaced it in the RAID). I put the drive back and it boots. Bowl is also down; I don't know what that problem is yet.

Update 08:38: Bowl and branch are both starting xendomains. The problem with bowl was that an upgrade blew away its menu.lst file (more on menu.lst below). The problem with branch was that we had incorrectly set "power on after power fail" to off.

Update 08:51: All users should now be up and running. Complain otherwise.
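Since menu.lst keeps coming up: it's the GRUB (legacy) boot menu, and on a Xen dom0 like bowl it looks roughly like the sketch below. Kernel versions, paths, and options here are illustrative, not bowl's actual configuration.

    # /boot/grub/menu.lst - minimal GRUB legacy entry for a Xen dom0
    # (versions, paths, and options are examples)
    default 0
    timeout 5

    title Xen / Linux dom0
        root (hd0,0)
        kernel /boot/xen.gz dom0_mem=512M
        module /boot/vmlinuz-2.6.18-xen ro root=/dev/md0 console=ttyS0
        module /boot/initrd-2.6.18-xen.img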




mantle crashed

Mantle crashed today and stopped in the BIOS with the error "HT Link SYNC Error". It booted up normally after power cycling, and we have emailed Supermicro to investigate the error. When we know everybody is back up and things are all OK again, we will update this post. Thanks, and email support@prgmr.com if you are still having trouble.

We are preparing to move our servers from SVTIX to Market Post Tower. We are going to change some of our VLAN numbers during the move so we can share VLAN number space with EGI and run VLANs over their network between the two sites. I changed one of the VLAN numbers that I didn't think would affect the routing for everything here, but it did.

Luckily I was able to get to SVTIX quickly and fix it within an hour, but of course it was a careless mistake: I should have made the change while I was already on site. I'm still planning to be here tomorrow for more of this work, which is when I should have done this, so users should be prepared for another, shorter outage tomorrow. We are going to move our port with EGI to a tagged port so that we can run more VLANs over it for moving the servers. Anyway, sorry for the trouble. I should know better, and in the future I will make sure to do these changes when I'm on site.
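For illustration, moving a port from untagged to tagged means turning it into an 802.1Q trunk. On an IOS-style switch that looks roughly like the following; the interface name and VLAN IDs are made up, and our actual switch platform isn't mentioned here.

    ! hypothetical IOS-style config: carry several VLANs, tagged,
    ! over the single uplink port to EGI
    interface GigabitEthernet0/1
     switchport trunk encapsulation dot1q
     switchport mode trunk
     switchport trunk allowed vlan 100,200,300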
Last night the PDU at our provider's facility (Rippleweb, at the Herakles data center in Sacramento) somehow reset and cycled all of the power ports. Apparently, when they moved our servers to a different room, they also put them on a different kind of PDU than they used before, and it has problems. They have now replaced it with the old kind of PDU. Everybody should be back up by now; email us if you are still having trouble. The RAID is also resyncing, so expect slow disk performance until that is done. Thanks.

crock reboot

Crock crashed at 7:06 PST this morning with the same "soft lockup detected on cpu #0" error we've seen before with it and some of the other dom0s, mostly running Xen 3.4 I think. I've rebooted it and domUs are starting to come back up.

Update: everybody on crock should be back up now except for one person whose menu.lst file seems to be unreadable. Email support@prgmr.com if you are still having trouble. Thanks.
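Guests here presumably boot through pygrub, which parses the guest's own menu.lst out of its disk image, so an unreadable menu.lst keeps that one domU from starting. A quick way to check whether a guest's boot menu parses is to run pygrub against its disk by hand; the device path below is an example, not the affected customer's volume.

    # run pygrub directly against a guest's disk image; if its menu.lst
    # is readable this presents the guest's boot menu (path is an example)
    pygrub /dev/guests/example-domU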

hamper reboot

Hamper.prgmr.com rebooted just now, and we don't have logging set up for its serial console, so we may not be able to find out why. I'm looking through the log files for other clues, though. I'll update here again when everybody on hamper is up and running. We're also planning to set up conserver with logging for the servers in Fremont, like we have at SVTIX; given this reboot, we should probably do that sooner. A sketch of the config is below.
Update: Everybody should be back up now, except for one person, whom I emailed. Email support if you still have problems. Thanks!
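As for the conserver setup mentioned above: per-console logging is configured in conserver.cf, roughly like the sketch below. The device, baud rate, and log path are assumptions for illustration, not hamper's actual settings.

    # conserver.cf - give every console its own logfile ('&' expands
    # to the console name); all values here are illustrative
    default * {
        master localhost;
        logfile /var/log/consoles/&;
        timestamp 1hab;   # hourly stamps, plus activity and breaks
        rw *;             # who may attach; tighten this in practice
    }
    console hamper {
        type device;
        device /dev/ttyS0;
        baud 9600;
        parity none;
    }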

council network outage

Council ran out of memory in the dom0, which didn't have any swap set up, and the network stopped working. When I looked at the serial console, setting the peth0 interface down and then up fixed the network; then I saw the out-of-memory error in dmesg and added swap space. Let us know at support@prgmr.com if there are any more problems. The downtime was about two hours.
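Concretely, the fix was bouncing the bridged physical interface and then giving the dom0 some swap. A sketch, assuming a swap file rather than a dedicated partition (size and path are examples):

    # bounce the bridged physical interface (this is what restored
    # the network)
    ifconfig peth0 down
    ifconfig peth0 up

    # confirm the cause in the kernel log
    dmesg | grep -i 'out of memory'

    # add swap so the dom0 has headroom (size and path are examples)
    dd if=/dev/zero of=/swapfile bs=1M count=1024
    mkswap /swapfile
    swapon /swapfile
    echo '/swapfile none swap sw 0 0' >> /etc/fstab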

another power outage at he.net

Around 7:30 this morning there was another power outage at Fremont. We will post more information when we have it; now that power is back, we will make sure everything boots up correctly. boar, hydra, stables, birds, and hamper are back up, but lion and the KVM servers are not. Luke and I are heading to Fremont now to look at them. If your VPS is on one of the servers that is back up and is still not working, email support@prgmr.com.

Edit: Word has it that he.net was in 'reduced redundancy' mode to work on a UPS when the power event occurred.

svtix upstream outage

The fiber optic line card for our provider EGI Hosting's connection at SVTIX is apparently still having problems. EGI is working it out with XO, who owns the line. Here is what they said:
We apologize for any inconvenience this might have caused.
There seems to be a problem with one of the XO devices that one of our transport lines into SVTIX is connected to.
XO moved our port to another line card yesterday as it was rebooting itself.
The new line card we were allocated today had frozen and needed a restart, which is the reason for today's outage.
We have been working very closely with XO for the past two days, and they think that today's issue is not related to yesterday's. They believe a card reload was needed in order to push all the recent configurations to the router, and that we should not be seeing this anymore. We have nevertheless asked XO to escalate today's event to their tier 3 support for further investigation.

This is the most up-to-date information we have on the situation. We certainly hope that XO has truly fixed the issue.
They should be getting back to us with a final confirmation later today or early tomorrow.

Please be advised that these issues are affecting every carrier in SVTIX that is connected through this XO device; it is not isolated to EGI's network.
We will update this post again when we have more information.
