October 2012 Archives

if we feel alert enough for another go after that, we might move the remaining servers (birds, stables, and lion... birds and stables users were warned, but we probably aren't moving them tonight, just out of fear of doing too much at once.)  but we will see.

we are at he.net, we are eating dinner, then downtime will begin.


update 21:31
tooth downtime begins

update 21:33

pillar downtime begins

update 21:35 kvm node downtime begins.

update 00:57 -  pillar and tooth are back


update 01:37 - the KVM hosts are online, the DRBDs look good, and we're bringing the KVM guests up now.

packet loss at our fremont location

http://connexinternet.com/mrtg/r1.0/r1.0_prgmr.com.html

(I believe the problem was at that router.  No confirmation)

I think the problem lasted for 10 minutes or so and things are okay now.  I need to go back to sleep to prepare for moving more servers tonight.

Note, due to the moves, a lot of that traffic goes down the link above, then through the VPN, and then to 55 s. market (most likely through this link):

http://traffic.he.net/port.php?key=NckClsQDVOBA6eIv5yB2Te14ILiLy9G+42RSVtlPLQda1k7oYS8PVdz1xjzBCjhEfPDtohjwQkNqIlUguWKRSQ==


We are going to move the servers pillar, tooth, and the kvm servers out of Fremont at 8pm on Wednesday night (October 31). We were going to move stables and birds as well, but it seems like too much, so they will be delayed. They are in the other cabinet, which we don't need to move out of immediately anyway. We will also send out another email to the users on stables and birds notifying them of the delay.

Unplanned downtime on horn

so I get paged about horn during rush hour.  Nothing.  No serial.  It's on a 2-in-1U, so I can't flip the PDU port. 

I head in and that side of the 2-in-1U is /hot/; the other side is just fine.  (It feels like there is plenty of air flowing through that side, but I don't have a meter or anything, and I can't de-rack that server without taking down its twin.)  

Anyhow, I take another server out of my van and swap disks.  The new server has an LSI SAS controller and an Intel NIC, so I'm doing the necessary with the kernel to make that work now.  It shouldn't be too much longer.
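
For the record, "the necessary" is mostly just making sure the kernel we boot actually has drivers for the new hardware.  A quick sanity check looks something like this (a sketch; I'm assuming an mpt2sas-era LSI controller and an e1000e/igb-era Intel NIC, so substitute the driver names for whatever the cards actually are, and substitute the new kernel's version for uname -r if you haven't booted it yet):

# see what hardware is there and which driver the running kernel bound to it
lspci -k | grep -iA3 -E 'lsi|sas|ethernet'

# make sure the kernel config actually has those drivers built
grep -iE 'mpt2?sas|e1000e|igb' /boot/config-$(uname -r)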


Update: 21:11

Still down.  Working on the kernel upgrade.





update 21:37

Kernel upgrade complete, system booted, test domains up, customer domains booting.  I think we are okay. 



update 21:57

and I discovered that I screwed up IPv6.   fixed.
so I think we're as ready as we are going to be.  We've got all the destination ports set up; all that remains is to shut down the servers, unscrew the rails, and strap the servers down in the back of my van.


22:45: downtime begins

01:55:  confirmed Ingot is back

01:55: confirmed scepter is back

01:57: confirmed stone is back

Table and coral are both still down.
[lsc@bowl ~]$ cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdc2[0] sdb2[1]
      972566976 blocks [2/2] [UU]
      
md2 : active raid1 sdg2[2] sdf2[3](F) sde2[1] sdd2[4](F) sda2[5](F)
      972566976 blocks [2/1] [_U]
      [>....................]  recovery =  0.8% (8320448/972566976) finish=1388.9min speed=11568K/sec
     
md0 : active raid1 sdg1[0] sdf1[4](F) sde1[1] sdd1[5](F) sdc1[2] sdb1[3] sda1[6](F)
      2096384 blocks [4/4] [UUUU]
     
unused devices: <none>
[lsc@bowl ~]$
 
(bowl is from back when we used LVM to do striping.  I will be happy when we are off all of these servers.)

 

[lsc@rehnquist ~]$ cat /proc/mdstat
Personalities : [raid1] [raid10]
md1 : active raid10 sdf2[4] sde2[2] sdd2[0] sdc2[5](F) sdb2[3] sda2[6](F)
      955802624 blocks 256K chunks 2 near-copies [4/3] [U_UU]
      [====>................]  recovery = 20.6% (98878656/477901312) finish=141.7min speed=44576K/sec
     
md0 : active raid1 sdf1[1] sde1[2] sdd1[0] sdc1[4](F) sdb1[3] sda1[5](F)
      10482304 blocks [4/4] [UUUU]
     
unused devices: <none>


Rehnquist is more modern. 

Both are rebuilding; no downtime expected. 
We are planning to move the servers ingot, coral, scepter, stone, and table to Market Post Tower on Monday, October 29. There will be no need to change IP addresses for this. The downtime should start at 9:30pm Pacific time and last 2-4 hours.

Update: this was going to be on Saturday, but we didn't post this soon enough, so we are postponing it until Monday evening.

going down now (20:20 local time)


ugh.  that took way too long.  22:33 local time, and it's coming back up. 


Note, this was my fault.  I thought I was having serial console problems, but I wasn't.  prgmr.com policy is to set 'power on after power fail' to on (so that we can always turn things on by bouncing the PDU port; prgmr policy is also to have all PDUs be remotely rebootable), but this server was set incorrectly.  I plugged it in and went to work on the serial console, where I expected problems.  It took me a very long time to realize that the box was actually off.  
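
As an aside, on boxes with a working BMC you can check and fix this setting without crawling through the BIOS.  A sketch, assuming ipmitool is installed and talking to the local BMC:

ipmitool chassis status | grep -i 'power restore'   # show the current policy
ipmitool chassis policy always-on                   # power up whenever power is applied

(The second command is what our policy is supposed to be everywhere; worth checking on anything that gets re-racked.)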

I blame the fact that I didn't schedule anyone else to stand by with me.  I think it's usually best to have one person in the data center and one person in a comfortable chair with good internet access.   This was just me, though.



Also note, I warned the customers with the following message:

We'll be moving hamper to 55 s. market tomorrow night as part of our
plan to move out of he.net Fremont.  This will give us more reliable power
and additionally will actually save us money.   He.net has the cheapest
racks around, but limits power so much that it's cheaper to go with more
expensive, higher density racks at 55 s. market.  



but I failed to notify the other prgmr.com employees properly. guh. anyhow, we have a bunch more of these to do this week. The next ones, I hope, will go more smoothly.



replaced bad disk in ingot.prgmr.com

[lsc@ingot ~]$ cat /proc/mdstat
Personalities : [raid1] [raid10]
md1 : active raid10 sde2[4] sdb2[1] sda2[0] sdc2[5](F) sdd2[3]
      955802624 blocks 256K chunks 2 near-copies [4/3] [UU_U]
      [====>................]  recovery = 23.0% (110047232/477901312) finish=342.3min speed=17908K/sec
      
md0 : active raid1 sde1[2] sdb1[1] sda1[0] sdc1[4](F) sdd1[3]
      10482304 blocks [4/4] [UUUU]
     
unused devices: <none>
 

everything looks good.

recent packet loss.

so yeah, uh, this is what I think about the recent packet loss issues.

I mean, the root of the problem (I think) is that we are overwhelming the 1G link between 1435 and 1460 at MPT.  Most of this is not prgmr.com traffic; most of it is traffic being trucked from 250 stockton to 55 s. market and pushed out he.net.

Here are the two sides of the link in question


http://panoptes.prgmr.com/cgi-bin/14all.cgi?log=bairin_24

http://panoptes.prgmr.com/cgi-bin/14all.cgi?log=biruwa_2


compared to our he.net link, here:

 http://traffic.he.net/port.php?key=NckClsQDVOBA6eIv5yB2Te14ILiLy9G+42RSVtlPLQda1k7oYS8PVdz1xjzBCjhEfPDtohjwQkNqIlUguWKRSQ==

so more traffic is going over our cross connect than over the he.net link.  This is because of my poor network design.  



As you can see, we aren't pegging it hard or anything; 70% is our 5-minute peak.  My belief (now that the cable tested good) is that we are seeing 'microbursts', wherein traffic is peaking above 1G/sec.

With my switches, HP ProCurve 2824s, I don't know how to verify my theory without a mirror port.    

I should bring in a server, set up a mirror port on that thing, and measure the throughput every 5 seconds rather than every 5 minutes or something; that would allow me to prove or disprove the microbursts theory. 
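
Something like this would do for the fine-grained measurement (a rough sketch; it assumes the mirrored traffic lands on an interface called eth1 on the monitoring box, so adjust the name and the interval to taste):

#!/bin/bash
# sample the mirror interface's rx byte counter every few seconds and print Mbit/s,
# so peaks that a 5-minute MRTG average hides become visible
IF=eth1        # hypothetical name of the interface plugged into the mirror port
INTERVAL=5     # seconds between samples
prev=$(cat /sys/class/net/$IF/statistics/rx_bytes)
while sleep $INTERVAL; do
    cur=$(cat /sys/class/net/$IF/statistics/rx_bytes)
    echo "$(date '+%H:%M:%S')  $(( (cur - prev) * 8 / INTERVAL / 1000000 )) Mbit/s"
    prev=$cur
done

Even 5-second averages can hide sub-second bursts, so drops on the switch port while the averages still look fine would also point at microbursts.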

The customer at 250 stockton who accounts for 500 or so Mbps of this traffic is in the process of moving off our network, which should be complete within a week.  That will resolve the issue for us, at least until our own traffic increases about 3x. 

but the problem of this cross connect getting saturated before the he.net uplink will come back when our traffic grows (and as you have noticed, my bandwidth allocations are currently, well, 2007.  I really, really need to start giving customers more bandwidth, which /will/ cause me to use more bandwidth overall.) 

The problem here is that the cross connect is part of a 'router on a stick' configuration, without a big enough stick.  I mean, packets from Cogent to 1460 (where most of the dom0s are) go from Cogent to 250 stockton, over the transport link from egi to suite 1460 at 55 s. market, up the cross connect from 1460 to 1435, down the same cross connect, and then to the server in question.

So, uh, yeah.  Three ways to solve this:

1. put a router at 55 s. market in suite 1460.  This is probably the cleanest way to do it, but now we've got another router that can break.  This option gets rid of the router on a stick configuration entirely.

2. get a bigger stick.  For like $300-$350 a month (vs the $100-$175 I'm paying for a copper cross connect) I can get an optical link from 1460 to 1435.  I then need 10G-capable switches in both 1460 and 1435.  Figure 2-6 grand, depending on brand, features, and whether I go used or not.  (I can also just get a second copper link; this will cost about the same monthly, but could be done with our current switches.)

3. traffic engineering.  Make sure that all traffic at 250 stockton goes in/out Cogent and all traffic at 55 s. market goes in/out he.net (maybe just prepend so hard that unless a link is completely down, traffic goes in/out the local link?  roughly what's sketched below.)
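
For what it's worth, "prepend so hard" would look something like this on the router that speaks BGP to Cogent, assuming a Quagga/vtysh setup; the ASN (65000), the neighbor address (192.0.2.1), the example prefix, and the route-map name are all made up for the sketch, not our actual config:

# announce the 55 s. market prefixes toward Cogent with a heavily padded AS path,
# so inbound traffic for them prefers the he.net side unless that link is dead
vtysh -c 'configure terminal' \
      -c 'ip prefix-list MARKET-PREFIXES permit 198.51.100.0/24' \
      -c 'route-map COGENT-OUT permit 10' \
      -c 'match ip address prefix-list MARKET-PREFIXES' \
      -c 'set as-path prepend 65000 65000 65000 65000 65000' \
      -c 'route-map COGENT-OUT permit 20' \
      -c 'router bgp 65000' \
      -c 'neighbor 192.0.2.1 route-map COGENT-OUT out' \
      -c 'end'
vtysh -c 'clear ip bgp 192.0.2.1 soft out'   # re-announce with the new policy

Prepending only steers inbound traffic; the outbound half of "use the local link" would be local-preference on the routes learned from each site's own upstream.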



I do have an extra box that could be used as a router right now.   Several, in fact. 

The UnixSurplus guy has some Arista 48-port 1G switches with 4 10GbE SFP+ ports that he might give me a sweetheart deal on (I have optics); he will also rent 'em to me if I want.

so I guess I'm kinda leaning towards the router.  Cheaper.  Of course, I also don't like those HP switches; I wouldn't mind getting rid of 'em in favour of the Aristas.  I wouldn't mind one bit.  But I'm generally not a fan of router on a stick designs, and yeah, the router would save me a good chunk of change.  

 

Anyhow, I need a day off, so I'm taking today off, but I thought customers have a right to know what the hell is going on.  Nick is around today and he might set up a router or at least the SPAN port.  If not, I'll at least get a server with a mirror port set up on that ProCurve on Monday.

10/12/12 07:33:34 FFI: port 2-High collision or drop rate.
                  See help.


This is either a bad cable or "microbursts".  Let's hope it's a bad cable.  
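
One quick way to lean one way or the other: a bad cable usually shows up as CRC/frame errors, while congestion just shows up as drops on otherwise clean frames.  The switch's own port counters are the more direct place to look for a switch-to-switch cross connect, but the same distinction is visible from a Linux box on either end of a suspect link (eth0 here is a placeholder for whatever interface faces the link):

# errors vs drops: CRC/frame errors point at layer 1, clean-but-dropped points at congestion
ip -s link show eth0
ethtool -S eth0 | grep -iE 'crc|frame|drop|discard'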

Network going down for around 5 minutes at 9:00PM for test. 

(Yes, I have multiple upstreams now, but due to bad design on my part, all traffic to C07 and C08, where most of you are, goes over this one cross connect.  My fault, sorry.) 

and we are back.

It wasn't the cable.

an experiment in support.

I have cleared the support queue.  Any problems you are having are therefore figments of your imagination.

To be fair, it wasn't entirely me.  I fed Nick tickets at a slow rate, with specific and actionable questions, and acted on his advice.  I did the same with Luke, who was also busy moving people off Cerberus.

I had several reasons for doing this.  First, the queue needs to get done.  This is a customer-facing business, and support is important, and month-long delays without replies aren't acceptable.  So that's a simple reason.  But beyond that, I wanted to demonstrate that clearing the queue really isn't all that difficult.  I estimate that there were about 30 tickets, plus a few that came in during the day.  I spent about 6 hours working on them, plus about 6 hours from Nick, Luke, and Megan, although our time accounting is necessarily fuzzy.  That works out to about 20 staff-minutes per ticket.  My feeling is that about 10 tickets usually come in per day.  That means that a single support person, particularly with assistance from the rest of the staff, can easily handle the load.  Even a half-time support person could do it quite easily (again, assuming that the front-line support requires about half the time budget, with the rest done by others.)

I therefore conclude that clearing the queue every day is a reasonable goal, and that getting someone to specialize in doing it would be a good investment.
[root@taney ~]# date
Mon Oct  1 13:34:55 PDT 2012
[root@taney ~]# hwclock
Thu 03 Jan 2002 10:32:22 AM PST  -0.517048 seconds

It's set now, sorry.


so, marshall went down.  "Invalid memory configuration for cpu 1" - only seeing half the RAM, too.  So I swapped it out with an 8-core E5 with just as much RAM.  Much more downtime than there should have been, but marshall users are back now, with newer and stronger hardware. 

oh, also note, the blog is on marshall, so most of the reports went on twitter.

http://twitter.com/prgmrcom

note, I screwed up the time on the new server.  Don't just check 'date'; also check 'hwclock', and run hwclock --systohc if date is right and hwclock is wrong.  
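
In other words, after building up a replacement box, something like this (ntpdate here is just one way to get the system clock right first; use whatever you normally use):

date                      # the system clock -- what the guests see
hwclock                   # the hardware clock -- what the box boots with
ntpdate -u pool.ntp.org   # fix the system clock first if it's off
hwclock --systohc         # then write the (now correct) system time to the RTC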

looks like this caused a bunch of guests to hang on fsck.  I will go through and manually restart them.
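
For the curious, "manually restart" is roughly the following per guest, assuming marshall is one of our normal Xen boxes with the classic xm toolstack and per-guest config files under /etc/xen (guestname and the path are placeholders):

xm list                        # find the hung domain
xm destroy guestname           # hard-stop it (like pulling the plug)
xm create /etc/xen/guestname   # boot it again from its config file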