June 2010 Archives

hamper to be rebooted shortly


replacing more disk

It has come to my attention that prgmr.com does not have a written, publicly accessible privacy policy. Below, I have pasted a first draft. Please give me feedback. Note that I've been editing this draft in place; this is /not/ the final version. I'm just soliciting feedback.

prgmr.com will not release private customer data except in the following cases:

1. In order to comply with ARIN requirements for new IP blocks, we will release
   the customer's name or business name to ARIN. We will be executing the ARIN
   non-disclosure agreement, which requires that ARIN keep your names secret
   except in the case of a court order [1].

2. We will comply with any valid court orders issued by courts that have 
   jurisdiction.

3. we use automated and manual processes to examine network traffic while looking for problems.  

4. we will never examine your disk without permission.   (we may ask you to let us examine your disk or to leave, but if you don't give us permission, we won't examine the disk without a court order.)

5. we may examine network traffic with both manual and automated processes.   the results of this examination won't be shared without a court order.  

6. we may log and examine your serial console while looking for system problems. 




If this document needs to be amended, I will do my best to minimize the impact
on customers, and I will email the address on file with a notice.  If customers
wish to quit a long term contract because of an amendment to this document, any
early termination fees will be waived, and the customer will be given a prorated 
refund based on time used.  
 



[1]https://www.arin.net/resources/agreements/nda.pdf 

Data retention is kind of a sticky thing. See, the longer I keep the data, the easier it is for me to spot trends and ongoing problems. But obviously, customers don't want me to keep shit around forever, and without a defined data retention policy, I think it's legally harder for me to tell law enforcement "we don't have that data" when they come knocking.

What if I had a clause that said "I give you access to all data I'm retaining about you at http://blah/customer" - it would be more work for me but it would allow me to have longer data retention (which is good for troubleshooting) without pissing off customers, especially if I add a 'delete this' button... but I don't know where that puts me legally.

Of course, that is technically more difficult... but I could release a tool that others could use. (I'd tie the login to the customer's email address.)

so, for the past few days I've been trying to get my initial IP allocation from ARIN. Here is what they say:

hostmaster@arin.net writes:

> Hello,
>
> Thank you for your reply.  This is close to what we needed but we
> still need you to provide the actual customer name for each IP
> assignment in the list provided please.
>

I called and asked if this was also policy when DSL providers asked for IP addresses that would be statically assigned, and they said this was true of all static IP addresses.

I explained that my existing policy prevents me from releasing personal information without a court order.

So, for now, I will be buying more IPs from my upstream. Until this is solved, we will not be giving people more than one IP per VPS.

If you are okay with me giving your full name to ARIN (under NDA), please email me. If 1024 of you are okay with that, my problem is solved. Please note, they aren't looking for email, postal address, or anything else, just your name.

ugh. all users of hamper will get a month credit. heading down to work on it now.

edit: hamper is back up.

So, hamper had 4 drives (a stripe of mirrors) plus a fifth as a hot spare. Many months ago, one of the drives began failing. I removed that drive from the RAID and rebuilt onto the hot spare.

Earlier today, when I was dealing with the DoS, I thought I'd pull the failed drive and return it for warranty service. Bad idea. The computer seized up.

It appears that for 8 hours, writes didn't go through to the hard drives. I have reset the drive, and hamper appears functional again. The outage appears to be from 02:36:36 to 20:46:47 PST.

Other than the 8 hours of no writes, it appears that there was no data loss. If you are on hamper and are still having problems, please let me know.

This has encouraged me to accelerate my long-talked-about backup plan.

We're currently seeing 160 megabits/sec on a 100 megabit pipe, so it's almost certainly a DoS of some sort. We're working on it, and hopefully will get it cleaned up faster than the problem at SVTIX. (It appears to be unrelated; a different customer.)

robe network driver

Robe had been down since yesterday with the "too many iterations (6) in nv_nic_irq_rx" problem in the forcedeth Ethernet driver. The problem was unrelated to the DDoS attack yesterday. Users on robe will get an additional free month.

network problems at SVTIX


We've been working on it for 8 hours. will update when we know more.

Edit: looks like it /may/ be a DDoS. I mean, there /is/ a DDoS of one of my customers, and that /may/ be why the router can't stand up. I asked my upstream to blackhole the target IP. (I hate finishing the job for the attacker, but at the moment it's my only choice.) If that fixes it, we should be up within half an hour. If not, then I will probably need to replace my router, a process that will probably take closer to 5 hours.

Yes, in fact it was a DDoS. Null routing the target at my upstream solved the problem.

I do want to make a personal apology for taking more than 8 hours to figure this out (and it was Nick who deserves credit for finally figuring it out, not me). I can explain some of it by the fact that the problem happened about the time I normally go to sleep, and the symptoms /looked/ a lot like the MAC address conflict I had quite some time ago. But still, I should have figured this out in half an hour. There's really no excuse.

As per policy, all affected customers will get 1 month credit. (I probably won't get the credits sent out for a few days, but you /will/ get them.) This will be painful, but not fatal.

There was a miscommunication, and past due notices were sent out to everyone more than 10 days late, when they should have only been sent to new people who were 10 days past due. (Most notably, this caught my customers who are paying twice a month because they have two accounts billed on different days. Obviously, I'm a little quicker to shut you off if you've never given me any money.)

Anyhow, please accept my apology, and don't worry, we won't shut off existing customers for at least 30 days. 

Edit: please note, nobody was cut off. We just sent out the warning emails, and some people were understandably worried that their domain might get shut down.

on logging serial consoles.


So every now and again a customer will complain of a crashing domain. Occasionally, it is an early sign of a hardware problem that I need to deal with, so I don't want to just ignore it.

Now, the problem is that, as with a physical server, once the domain has rebooted, most of the information about why it crashed is gone. (What little is left is in /var/log on the guest, and as a general rule we don't like mucking around in the guest. That's your business, not ours.)

Now, on a physical server, we solve this by using a logging serial console. (I recommend Opengear if you have the money, and a used Cyclades if you don't. The 'buddy system' (making one server the console server for the next, then the next server the console server for the first) usually requires adding USB serial dongles, but is cheaper still for installations with only a few servers. I personally like the IOgear brand USB -> serial dongles Fry's has.)

I can turn on debug logging in xenconsoled, and that will log the console for all domains to a file (one file for each domain). Then I can use those logs to troubleshoot the problem. The thing is, apparently some people have privacy concerns with this, so I haven't done it yet.

Now, personally, I don't think serial consoles are that sensitive. I mean, it's common to leave terminals in data centers where passersby can see the output. The console logs will allow me to see what program is crashing, which may be sensitive, and depending on how you have the thing configured, I can see when people log in and log out.

So, I have several options.

  1. I could leave it as is, and continue to go back and forth and guess if someone asks me why something crashed after a reboot
  2. I can log all consoles and delete the data once a week or once a month
  3. I can apply a patch to log some people's consoles and not others, and let the user decide

Obviously, option 2 makes my life a /whole lot/ easier. Option 3 is better than option 1, but it still means maintaining an out-of-tree xenconsoled (or pushing the patch upstream).
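
For what it's worth, the "delete the data once a week or once a month" half of option 2 could be a very small cron job. Here is a rough sketch in Python, not something that is running anywhere. It assumes the console logs end up as one .log file per domain in a single directory. The function name, the directory argument, and the retention window are all made up for illustration.

```python
import time
from pathlib import Path


def purge_old_logs(log_dir, max_age_days, now=None):
    """Delete *.log files in log_dir last modified more than max_age_days ago.

    log_dir and max_age_days are placeholders chosen by the caller; nothing
    here knows where xenconsoled actually writes its logs.
    Returns the sorted names of the files that were removed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    removed = []
    for path in Path(log_dir).glob("*.log"):
        # Only plain files whose last write is older than the cutoff.
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```

Calling something like `purge_old_logs("/path/to/console/logs", 7)` from a daily cron job would give roughly a one-week window. Note that this goes by file modification time, so the log of a domain that is still writing to its console never ages out. Real retention would also need the log files rotated, so that old output ends up in files that stop changing.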