April 2012 Archives

New IPs for some dom0s and IPv6 for all dom0s

The dom0s at MPT that don't yet have one of our new IPs will be getting them soon. Some are already done, and the rest will follow. The old IPs will keep working for now; they will be removed at a later date.

I'm also adding IPv6 addresses for all of our dom0s at MPT, SVTIX, and he.net.

Please let me know if you experience any issues.

nb

Disk slowness on whetstone

whetstone.prgmr.com is currently experiencing slow disk performance.

Luke has determined the cause is that whetstone has a bad drive in its
RAID array. He has pulled the drive and will install a replacement
soon. Performance will be degraded for a while after the replacement is
installed while the array is rebuilding.

We will let you know when the rebuild is complete. Please check back on this blog entry for more updates.

update on rehnquist

well, it's down again, so I don't know what the heck is going on.  I'm going to swap to new hardware this evening (will involve a graceful shutdown) 

Note, until then, all new provisioning is on hold.

taking it down for reboot now.

Ugh.  that took way longer than it should have, but it's done now.  it's back.  sorry.   I need to test my netboot rescue images sometime when it's not an emergency (and I probably should have a backup rescue usb key on me, and all of that would not have been required if I had remembered to put the new driver in the initrd before swapping cards.) 

rehnquist rebooted again

sorry, I should not have waited to replace that sata card.  I'm bringing the new one down right now.

rehnquist crash.




sata_mv 0000:03:06.0: PCI ERROR; PCI IRQ cause=0x30000040


I rebooted it; it will be back shortly.  Unless that error means something rather different than I think, I will be shutting down to replace the sata_mv card soon (it's a used PCI-X card; I should not have used it.)

The sata_mv card in question is the older marvell supermicro 8 port sata card:

http://www.supermicro.com/products/accessories/addon/AOC-SAT2-MV8.cfm

which I only used because it was all I could find; the store was out of what I use on
some of the other rebuilt mcp55 servers like burger:
http://www.supermicro.com/products/accessories/addon/AOC-SASLP-MV8.cfm

Update: rehnquist crashed again this morning, and I rebooted it. -Nick 11:36

We're going to be replacing sphinx with manticore. It should be a matter of moving the cables; if we don't screw it up, it should be a matter of seconds. Worst case, 5 minutes downtime and we roll back. It's the same quagga config, it's just better hardware.

And we're back.  It was about 5 minutes, but spread out, which makes it worse: around 30 seconds at about 17:17, then around a minute at 17:23, then around two minutes at about 17:53.

We screwed up the vlan config.  We use a quagga software router, and the vlans are written in /etc/network/interfaces, while everything else is in quagga.  Since we haven't rebooted the router in... a long time[1], the interfaces file had not been exercised in a while, and it turned out we had an error in it.  We rolled back, figured out the problem, fixed it, and rolled forward.

Anyhow, we're back online with a quagga box with a rather more powerful CPU (an E3-1220; the full power 3.1GHz quad core version, not the dual core low power version I've been talking about using as a utility server) and we're keeping the old quagga server around just in case something horrible happens.

[1]root@sphinx:~# uptime
 17:39:21 up 190 days, 17:47,  5 users,  load average: 0.00, 0.00, 0.00
 
Also note, the MAC address of the router changed, so people who had statically routed to the link-local address fe80::230:48ff:febc:a19a were broken until just now, when Nick bound it to the new router.  Don't use that as the default gateway, please.
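
For anyone wondering why a hardware swap changes a link-local address at all: an autoconfigured link-local address is normally built from the interface's MAC address using the modified EUI-64 scheme, so a new NIC means a new address. Here is a minimal Python sketch of that derivation. The function name is made up for illustration, and the MAC 00:30:48:bc:a1:9a is not something we published; it is just what you get by working backwards from the address above, assuming that address was autoconfigured the usual way.

```python
import ipaddress


def mac_to_link_local(mac: str) -> ipaddress.IPv6Address:
    """Derive the modified EUI-64 link-local address for a MAC (hypothetical helper)."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC like 00:11:22:33:44:55")
    # Flip the universal/local bit of the first octet.
    octets[0] ^= 0x02
    # Insert ff:fe in the middle to stretch 48 bits to a 64-bit interface ID.
    interface_id = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    # Prefix fe80::/64 followed by the interface ID.
    return ipaddress.IPv6Address(b"\xfe\x80" + b"\x00" * 6 + interface_id)


# Assumed MAC, inferred from the old link-local address in this post.
print(mac_to_link_local("00:30:48:bc:a1:9a"))
```

Any other MAC gives a different address, which is why a static route pointed at the old link-local address stopped working when the router hardware changed.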