July 2012 Archives

lozenges crash

lozenges has crashed for the second time tonight; it went down again while trying to start up the VPSes, so I booted it into single user mode. The raid is trying to resync; hopefully it will be able to boot up again once the raid is sorted out. I'm also looking for clues to a bad disk.
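
For anyone curious what "looking for clues to a bad disk" means in practice, this is roughly the kind of check involved (a sketch only, assuming smartmontools is installed; the device name is just an example):

# dump the full SMART report for a suspect disk (device name is illustrative)
smartctl -a /dev/sda
# the attributes that usually give a failing drive away
smartctl -A /dev/sda | grep -i -e reallocated -e pending -e uncorrectable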

Update at 22:38: the raid finished resyncing in single user mode, and guests are booting up again. I hope it sticks this time; if not, I'm guessing lozenges has some sort of hardware failure.

double disk failure on crock

So crock's raid looks like it has had two disks fail out of the same mirror, which is quite horrible. /proc/mdstat now shows:
Personalities : [raid1] 
md2 : active raid1 sdc2[1] sdb2[0]
      477901504 blocks [2/2] [UU]
      
md1 : active raid1 sdd2[2](F) sda2[0]
      477901504 blocks [2/1] [U_]
      
md0 : active raid1 sdd1[4](F) sdc1[2] sdb1[1] sda1[5](F)
      10482304 blocks [4/2] [_UU_]
      
unused devices: <none>
So sda1 has failed out of md0, which means sda2 should probably also be failed out of md1, but that wouldn't leave any working disks in md1 at all! I'm going to go to MPT (Market Post Tower), grab a spare disk there, and then see about replacing sdd :-/ I may end up having to do an offline recovery or something, though :( I'm going to power off crock until I get there and then try to recover in rescue mode.
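
For reference, replacing a dead member usually comes down to something like this with mdadm (a sketch only; /dev/sde stands in for whatever the new disk shows up as, and the partition table is copied from a surviving disk):

# fail and remove the dead member from the array
mdadm /dev/md0 --fail /dev/sdd1 --remove /dev/sdd1
# copy the partition table from a good disk onto the replacement
sfdisk -d /dev/sda | sfdisk /dev/sde
# add the new partition back in and let the resync run
mdadm /dev/md0 --add /dev/sde1
cat /proc/mdstat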

Update at 20:27: After booting up in single user mode, I added a new drive to the raid and it was able to fully recover! At least according to the raid rebuild; we still have to see the results of the fscks on the users' filesystems. Before booting up multiuser again, I'm also rebuilding the raid onto a second new drive, so when we bring it up multiuser it will be completely set. At the current estimate, that will be done in 42 minutes. -Nick
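
(That time estimate comes straight from the kernel; a couple of ways to keep an eye on it, for the curious:)

# the finish= field in /proc/mdstat is where the rebuild ETA comes from
watch -n 60 cat /proc/mdstat
# or ask mdadm for the state of a particular array
mdadm --detail /dev/md1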

Update at 21:29: The raid has completely finished rebuilding with 4 good drives, and guests are starting up. Let us know if you have any data loss! We will also be giving a free month to all users on crock. Thanks, Nick

Update at 23:14: crock crashed because I didn't fix the IPv6 multicast problem correctly, but now it is working. I think I'm done here; I'm not being careful enough about this anymore. Somehow IPv6 is also working without having set the ports to be multicast routers.
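
(If the issue is on the Linux side rather than the physical switch, the relevant knob would be the bridge's multicast snooping setting; a rough sketch of checking it, assuming the guests hang off a bridge named something like xenbr0, which is a guess:)

# 1 = snooping on: IPv6 multicast only reaches ports that joined a group or are marked as mrouter ports
cat /sys/class/net/xenbr0/bridge/multicast_snooping
# turning it off floods multicast out every port, which sidesteps the mrouter-port question entirely
echo 0 > /sys/class/net/xenbr0/bridge/multicast_snooping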

network problem at rippleweb

There seems to have just been a network outage with Rippleweb, our provider at the Herakles data center in Sacramento. I don't have any information about what the problem was yet, but I was able to reach Rippleweb on the phone and they said they will tell us what happened once it is fixed.

network downtime

We just had a short network downtime when I restarted quagga after upgrading the package and it didn't start up properly. I'm still not sure why, but I reinstalled the old package and it started fine. We also now have all the customers we moved from SVTIX to Market Post Tower routed through the router at Market Post Tower; this way, when we have another provider at Market Post Tower, the customers there will have a more direct route. We still need to install the newer version of quagga (it fixes a BGP security flaw), so I will try upgrading it again tomorrow at midnight PDT. We also need to get the outgoing prefix list correct, which is what caused a problem before. Hopefully, if that causes more downtime, I will be able to figure out what's wrong tomorrow. Thanks.
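
For the curious: the outgoing prefix list is just the filter that keeps us from announcing routes that aren't ours. In Quagga's bgpd.conf it looks roughly like this (the prefix, AS numbers, and neighbor address below are placeholders, not our real ones):

! permit only our own aggregate outbound, deny everything else
ip prefix-list ANNOUNCE-OUT seq 5 permit 192.0.2.0/24
ip prefix-list ANNOUNCE-OUT seq 10 deny any
!
router bgp 64496
 neighbor 198.51.100.1 remote-as 64511
 neighbor 198.51.100.1 prefix-list ANNOUNCE-OUT out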
-Nick

Distros updated

We now have images for:

Fedora 17
CentOS 6.3

The Debian image has been updated from 6.0.0 to 6.0.3.

I've also put a distrolist file in /distros on all of the dom0s where I've updated /distros, so you can tell what versions the images on your dom0 are (they are always named distroXX.tar.gz, where XX is 32 or 64, for 32-bit or 64-bit).
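
A quick way to check, assuming your dom0 is one of the updated ones:

# the distrolist file says which versions the current images were built from
cat /distros/distrolist
# the images themselves, 32- and 64-bit
ls -lh /distros/*.tar.gz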

We will eventually put the new images in /distros on every dom0; however, if you want them sooner, please email support@prgmr.com.
No downtime expected.  (fuller is now on the CentOS5-xen kernel, which should be able to get through rebuilding even a giant 2TB drive. It's also raid10, which is a whole lot better for rebuilds like this than raid5 or raid6.)
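
As an aside, md's rebuild speed is capped by a pair of kernel tunables, which is worth knowing when a rebuild of a drive this size is dragging (the value below is just an example):

# current floor and ceiling for resync/rebuild speed, in KB/s
cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max
# raise the ceiling if the box can spare the I/O during the rebuild
echo 200000 > /proc/sys/dev/raid/speed_limit_max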

fuller and hydra rebuilding;  I've gotta go down to the parking substructure and get a fresh drive from the van for crock.


Fuller and hydra look good; crock is going to need a reboot.  Scheduling a reboot and kernel upgrade for 20:00 PST Friday evening.

mysterious taney hang.

My pager just went off; taney was hung hard, and nothing had been written to the serial console for months. I'm trying to fix the serial console and then will bring up the xen guests.

update 22:09 local time:   serial works in non-xen mode. 

update 22:13 local time:  I think I fixed the console for xen mode;  rebooting and bringing up xen guests.
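
For reference, getting console output in xen mode generally means pointing both the hypervisor and the dom0 kernel at the serial port in the grub entry. A rough sketch of the pieces involved (baud rate, serial unit, paths, and root device here are assumptions, not taney's actual config):

# hypervisor line: hand the physical serial port to Xen
kernel /xen.gz com1=115200,8n1 console=com1,vga
# dom0 line: log to the Xen virtual console (xvc0 on the old CentOS 5 xenified kernels, hvc0 on newer dom0 kernels)
module /vmlinuz-xen ro root=/dev/md0 console=xvc0
module /initrd-xen.img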

update 22:26:  all guests back up.