August 2009 Archives

birds.prgmr.com will be rebooted at 4pm Monday August 31st.  We will configure Xen to save and restore all DomUs, so if everything goes as planned, you will see a 10 minute network outage, but none of the DomUs will appear to have been rebooted.

This is still fallout from the failure of that cheap Samsung; I am having trouble with hotplug on the 2.6.18.8-xen kernel, and need to reboot the box.

All users of birds.prgmr.com should have received a credit for two months for the earlier unplanned outage, so you won't receive credits for this outage.
I suspect disk problems, but the investigation is ongoing.  All users of girdle get a month credit.
No data loss, but everyone was down for a day.  I will credit all customers on birds 1 month for that.  The problem was that we got a bad disk and md did not fail it, and we were unable to fail it manually.  We tried a reboot, but the good drive did not have the grub bootloader installed.  Ugh.

Anyhow, birds is back, but it's running on only one drive; my replacement drive is DOA or there is a cabling problem.  I will try another drive tonight, as that is easier than fiddling with the cables.  If there is going to be a reboot, I will give you some warning.

Also, we either need to learn how to properly tune md so it fails drives rather than hanging when they are bad, or we need to move away from md.  I will write more about that after I fix the RAID.


at least there was no data loss. 


  
Looks like there is something wrong with our hack to import PayPal payments to Freeside, and we are getting complaints of double billing.  Sorry.  We will have this sorted out shortly.
It still appears to be routing, so if you statically assign your old IP you will get IPv6 again. The default gw is fe80::20c:42ff:fe20:6aed and the netmask is /64.

If you have trouble, email support@prgmr.com. Nick knows a lot more than I do about IPv6.

I've complained, and hope to have my provider fix the router advertisements tomorrow morning.

and given 2 months credit.  (If you would rather have a refund, contact me.)  We have finished all users who were set up before the attack.  (If we missed you, please contact support@prgmr.com or lsc@prgmr.com.)

Stay tuned as we describe how we will be improving our infrastructure so that we are better prepared next time.
I have no idea who the attackers were (other than that the heaviest attackers were sending packets from China) or why the host was attacked. However, I do have packet dumps, and it looks like a simple syn flood. The attacker either has a large botnet or is spoofing the source IPs, which makes it much harder to filter by source.

Now, this was my first DoS attack, so I was completely unprepared. I was figuring 'eh, I've got a 100Mbps commit. Don't worry about it.' I didn't even have bandwidth monitoring set up. Midway through the attack I set up a SPAN port on my Cisco and started capturing packets on a spare Linux box.

I've spent a bunch of time going through tcpdump output with perl, essentially duplicating some of the really basic functionality of something like pmacct, and doing it badly. But I'm learning (and learning how to use pmacct), and even with my silly perl regexes and my tcpdump output, I am seeing some rather interesting patterns in the data. Over a 5 hour period, I got 102755167 packets. 26149105 of those were TCP packets with a data payload of zero (which is how TCP connects look, and I believe how syn floods would also look).

Here is a sampling of the tcpdump output:

15:37:09.233702 IP 60.173.66.174.2887 > 74.113.30.166.80: S 2892936907:2892936907(0) win 65535
15:37:09.233734 IP 60.173.66.174.2888 > 74.113.30.166.80: S 1941102781:1941102781(0) win 65535
15:37:09.764316 IP 60.173.66.174.2890 > 74.113.30.166.80: S 2691934649:2691934649(0) win 65535
15:37:10.295979 IP 60.173.66.174.2892 > 74.113.30.166.80: S 2875418850:2875418850(0) win 65535
15:37:10.825192 IP 60.173.66.174.2894 > 74.113.30.166.80: S 2935806847:2935806847(0) win 65535
15:37:11.354255 IP 60.173.66.174.2896 > 74.113.30.166.80: S 1707919298:1707919298(0) win 65535
15:37:11.885067 IP 60.173.66.174.2898 > 74.113.30.166.80: S 955913454:955913454(0) win 65535
15:37:12.175995 IP 60.173.66.174.2887 > 74.113.30.166.80: S 2892936907:2892936907(0) win 65535
15:37:12.176011 IP 60.173.66.174.2888 > 74.113.30.166.80: S 1941102781:1941102781(0) win 65535
15:37:12.416718 IP 60.173.66.174.2900 > 74.113.30.166.80: S 3541068542:3541068542(0) win 65535
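For anyone who wants to do the same sort of counting, here is a minimal sketch of the idea in Python rather than perl. It is not the script I actually used; the regex and the function name count_empty_syns are made up for illustration, and it assumes tcpdump lines in the same text format as the sample above. It counts zero-payload SYN packets per source IP.

```python
import re
from collections import Counter

# Matches lines shaped like the sample above:
#   15:37:09.233702 IP 60.173.66.174.2887 > 74.113.30.166.80: S 123:123(0) win 65535
# Assumption: this older tcpdump text format, IPv4 only.
PACKET_RE = re.compile(
    r"^\S+ IP (?P<src>\d+\.\d+\.\d+\.\d+)\.\d+ > "
    r"(?P<dst>\d+\.\d+\.\d+\.\d+)\.\d+: "
    r"(?P<flags>\S+) \d+:\d+\((?P<length>\d+)\)"
)

def count_empty_syns(lines):
    """Count SYN packets with a zero-length payload, keyed by source IP."""
    per_source = Counter()
    for line in lines:
        m = PACKET_RE.match(line.strip())
        if m and m.group("flags") == "S" and m.group("length") == "0":
            per_source[m.group("src")] += 1
    return per_source

sample = [
    "15:37:09.233702 IP 60.173.66.174.2887 > 74.113.30.166.80: S 2892936907:2892936907(0) win 65535",
    "15:37:09.233734 IP 60.173.66.174.2888 > 74.113.30.166.80: S 1941102781:1941102781(0) win 65535",
    "15:37:09.764316 IP 60.173.66.174.2890 > 74.113.30.166.80: S 2691934649:2691934649(0) win 65535",
]

for src, count in count_empty_syns(sample).most_common(10):
    print(src, count)
```

With spoofed sources the per-source counts will be spread thin, so counting by destination instead (the dst group) is probably the more useful view for finding the target.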

I also got a boatload of pings: 20153524 'ICMP echo' packets, almost all 'length 72'. (If I had my monitoring equipment up, and if I was paged when this started, I believe I could have killed it at my border pretty easily. Of course, if the pps kill upstream routers, that doesn't help, but if I have tools to blackhole IPs upstream, well, then it does help.)

Also, from what I'm seeing, at least in terms of megs/sec, I was doing OK; the problem was that the routers couldn't handle the PPS. Now, obviously, it's completely irrational to think that just 'cause I'm buying 100Mbps of bandwidth I can take that 100Mbps in the smallest packets anyone can send. (What's a TCP packet with no data? 64 bytes? Yeah, that's a lot of packets.) But it's not something I'd have thought to monitor before, as I'm usually charged on megs/sec.
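To put rough numbers on that, here is a back-of-the-envelope sketch. The 100Mbps and 64 byte figures are the ones from this post; the 20 bytes of per-frame overhead (preamble plus inter-frame gap) is standard Ethernet, and the function names are just made up for illustration. The average only covers what my capture saw over those 5 hours, so treat it as a lower bound on the peaks.

```python
LINK_BPS = 100_000_000      # the 100Mbps commit
MIN_FRAME_BYTES = 64        # minimum Ethernet frame, e.g. a bare TCP SYN
WIRE_OVERHEAD_BYTES = 20    # 8 bytes preamble/SFD + 12 bytes inter-frame gap

def max_pps(link_bps, frame_bytes, overhead_bytes=0):
    """Packets per second needed to fill a link with frames of one size."""
    return link_bps / ((frame_bytes + overhead_bytes) * 8)

def average_pps(packets, seconds):
    """Mean packet rate over a capture window."""
    return packets / seconds

# Naive figure: just divide the link speed by the frame size.
print(f"naive max:    {max_pps(LINK_BPS, MIN_FRAME_BYTES):,.0f} pps")
# Counting the overhead each frame really costs on the wire.
print(f"on the wire:  {max_pps(LINK_BPS, MIN_FRAME_BYTES, WIRE_OVERHEAD_BYTES):,.0f} pps")
# What the capture averaged: 102755167 packets in 5 hours.
print(f"capture avg:  {average_pps(102755167, 5 * 3600):,.0f} pps")
```

The point is that a 100Mbps link full of minimum-size packets is well over a hundred thousand packets per second, which is a very different load on a router than the same megs/sec in full-size packets.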

I need to get something so I can announce null-routed /32s to my upstreams. I honestly don't know what the standard procedure is, but I'd really prefer to have something I could do programmatically. I mean, it's my IP space we are talking about blackholing, so it shouldn't require supervision.

I believe that if I am doing my own BGP, the best way to do it would be to have my upstreams configure their BGP routers to accept /32 routes from me. Then I just announce my /22 or whatever as usual, and announce the /32s I need to kill with a null route. They will travel as far upstream as the routers are configured to accept /32s from me, which could be far enough that it becomes no longer our problem. That would be programmatic and under my control.
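The "programmatic" part could start with something as small as the sketch below: take the addresses I want to kill, refuse anything that isn't inside my own space, and emit the /32s to hand to whatever actually speaks BGP. This is only an illustration; the function name blackhole_routes and the prefix (a documentation range, not my real allocation) are made up, and it does not talk to any router.

```python
import ipaddress

def blackhole_routes(my_prefixes, targets):
    """Return host routes (/32 for IPv4) for targets inside my own prefixes.

    Raises ValueError for any target outside my space, so a typo can't
    turn into an attempt to blackhole someone else's address.
    """
    networks = [ipaddress.ip_network(p) for p in my_prefixes]
    routes = []
    for target in targets:
        addr = ipaddress.ip_address(target)
        if not any(addr in net for net in networks):
            raise ValueError(f"{addr} is not in my address space")
        routes.append(ipaddress.ip_network(f"{addr}/{addr.max_prefixlen}"))
    return routes

# Hypothetical example: 198.51.100.0/24 stands in for my real prefix.
for route in blackhole_routes(["198.51.100.0/24"], ["198.51.100.166"]):
    print(route)
```

The check against my own prefixes is the whole reason this shouldn't need supervision: the upstream would filter on the same thing, so the script should too.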

Of course, I'd rather null-route the source of the attack, but that becomes pretty difficult in cases like this, where src IPs are either spoofed or coming from a large botnet.

That's the other thing; in the case of a syn flood like this, assuming that my upstreams could handle it without charging massive overages (or without their routers falling over), I would put up an OpenBSD box with synproxy and protect my customer. The question there becomes 'how many packets is too many?'

I mean, I can 'finish the job' for the attackers, and kick off my customer, but that seems pretty unfair to me. (well, and the guy is paying me, after all. I like getting paid.)

I got lucky, though; the vast majority of my customers are on another link with another provider. Only 18 Xen customers were taken down, and one co-lo customer. This was a 'cheap lesson' compared to what could have been. If I was this unprepared and someone hit my main uplink like this, I'd be in much worse shape. I mean, I've got some serious egg on my face, and I've seriously damaged the good reputation I've been building for the last two years. (Yeah, I've been doing this for 4 years, but, well, until 2 years ago I had serious reliability issues that were fixed by the move to new hardware and internal disk.) Also, the target of the DoS is a customer of a fairly large co-location customer, so it wasn't that cheap. But it certainly could have been a lot worse.

(All 18 Xen customers are pretty new, and they all get a free month or a refund.)