June 2010 Archives
prgmr.com will not release private customer data except in the following cases:
1. In order to comply with ARIN requirements for new IP blocks, we will release the name or business name to ARIN. We will be executing the ARIN non-disclosure agreement, which requires that ARIN keep your names secret except in the case of a court order.
2. We will comply with any valid court orders issued by courts that have jurisdiction.
3. We use automated and manual processes to examine network traffic while looking for problems.
4. We will never examine your disk without permission. (We may ask you to let us examine your disk or to leave, but if you don't give us permission, we won't examine the disk without a court order.)
5. We may examine network traffic with both manual and automated processes. The results of this examination won't be shared without a court order.
6. We may log and examine your serial console while looking for system problems.
If this document needs to be amended, I will do my best to minimize the impact on customers, and I will email the address on file with a notice. If customers wish to quit a long term contract because of an amendment to this document, any early termination fees will be waived, and the customer will be given a prorated refund based on time used.
https://www.arin.net/resources/agreements/nda.pdf
Data retention is kind of a sticky thing. See, the longer I keep the data, the easier it is for me to spot trends and ongoing problems. But obviously, customers don't want me to keep shit around forever, and without a defined data retention policy, I think it's legally harder for me to tell law enforcement "we don't have that data" when they come knocking.
What if I had a clause that said "I give you access to all data I'm retaining about you at http://blah/customer" - it would be more work for me but it would allow me to have longer data retention (which is good for troubleshooting) without pissing off customers, especially if I add a 'delete this' button... but I don't know where that puts me legally.
of course, that is technically more difficult... but I could release a tool that others could use. (I'd tie the login to the email)
so, for the past few days I've been trying to get my initial IP allocation from ARIN. Here is what they say:
email@example.com writes:
> Hello,
>
> Thank you for your reply. This is close to what we needed but we
> still need you to provide the actual customer name for each IP
> assignment in the list provided please.
>
I called and asked if this was also policy when DSL providers asked for IP addresses that would be statically assigned, and they said this was true of all static IP addresses.
I explained that my existing policy prevents me from releasing personal information without a court order.
So, for now, I will be buying more IPs from my upstream. Until this is solved, we will not be giving people more than one IP per VPS.
If you are okay with me giving your full name to ARIN (under NDA) please email me. if 1024 of you are okay with that, my problem is solved. Please note, they aren't looking for email, postal address or anything else, just your name.
edit: hamper is back up.
So, hamper had 4 drives (a stripe of mirrors) with a fifth, a spare. many months ago, one of the drives began failing. I removed that drive from the raid, and rebuilt onto the hot spare.
Earlier today, when I was dealing with the DoS, I thought I'd pull the drive and return it for warranty service. Bad idea. The computer seized up.
It appears that for 8 hours, writes didn't go through to the hard drives. I have reset the drive, and hamper appears functional again. the outage appears to be from 02:36:36 to 20:46:47 PST
Other than the 8 hours of no writes, it appears that there was no data loss. if you are on hamper and are still having problems, please let me know.
This has encouraged me to accelerate my long-talked-about backup plan.
We've been working on it for 8 hours. will update when we know more.
Edit: looks like it /may/ be a DDoS. I mean, there /is/ a DDoS of one of my customers, and that /may/ be why the router can't stand up. I asked my upstream to blackhole the target IP (I hate finishing the job for the attacker, but at the moment it's my only choice.) If that fixes it, we should be up within 1/2 hour. If not, then I will probably need to replace my router, a process that will probably take closer to 5 hours.
Yes, in fact it was a DDoS. Null routing the target at my upstream solved the problem.
I do want to make a personal apology for taking more than 8 hours to figure this out (and it was Nick who deserves credit for finally figuring it out, not me) - I can explain some of it by the fact that the problem happened about the time I normally go to sleep, and the symptoms /looked/ a lot like the MAC address conflict I had quite some time ago. But still, I should have figured this out in half an hour. There's really no excuse.
As per policy, all affected customers will get 1 month credit. (I probably won't get the credits sent out for a few days, but you /will/ get them) - this will be painful, but not fatal.
Anyhow, please accept my apology, and don't worry, we won't shut off existing customers for at least 30 days.
edit: Please note; nobody was cut off. we just sent out the warning emails, and some people were understandably worried that their domain might get shut down.
So every now and again a customer will complain of a crashing domain. Occasionally, it is an early sign of a hardware problem that I need to deal with, so I don't want to just ignore it.
Now, the problem is that like a physical server, once the domain has rebooted, most of the information about why it crashed is gone. (and what little is left is in /var/log on the guest, and as a general rule we don't like mucking around in the guest. that's your business, not ours.)
Now, on a physical server, we solve this by using a logging serial console. (I recommend Opengear if you have the money, and a used Cyclades if you don't. The 'buddy system' (making one server the console server for the next, then the next server the console server for the first) usually requires adding USB serial dongles, but is cheaper still for installations with only a few servers. I personally like the IOgear brand USB -> serial dongles Fry's has.)
I can turn on debug logging in xenconsoled and that will log the console for all domains to a file (one file for each domain) then I can use those logs to troubleshoot the problem. The thing is, apparently some people have privacy concerns with this, so I haven't done it yet.
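As a rough sketch of how those logs might get used after a crash: assuming xenconsoled writes one file per domain into a directory like /var/log/xen/console, with names like guest-<domain>.log (both the path and the naming are assumptions here, so check your own setup), something like this would pull the last few lines a domain printed before it went down. The helper name is made up for illustration.

```python
from collections import deque
from pathlib import Path

# Assumed location and naming of per-domain console logs; adjust to your setup.
LOG_DIR = Path("/var/log/xen/console")


def last_console_lines(domain, count=50, log_dir=LOG_DIR):
    """Return the last `count` lines of a domain's console log (hypothetical helper)."""
    log_file = Path(log_dir) / f"guest-{domain}.log"
    # Console output can contain arbitrary bytes, so replace anything undecodable.
    with open(log_file, "r", errors="replace") as handle:
        # A bounded deque keeps only the final `count` lines while reading.
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]
```

Calling last_console_lines("somedomain", count=20) would return the final 20 lines as a list of strings, which is usually enough to see a kernel panic or an out-of-memory kill.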
Now, personally, I don't think serial consoles are that sensitive. I mean, it's common to leave terminals in data centers where passers-by can see the output. Console logs would let me see what program is crashing, which may be sensitive, and depending on how you have the thing configured, I could see when people log in and log out.
So, I have several options.
- I could leave it as is, and continue to go back and forth and guess if someone asks me why something crashed after a reboot
- I can log all consoles and delete the data once a week or once a month
- I can apply a patch to log some people's consoles and not others, and let the user decide
Obviously, option 2 makes my life a /whole lot/ easier. Option 3 is better than option 1, but it still means maintaining an out-of-tree xenconsoled (or pushing it upstream)
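For what it's worth, the "delete the data once a week" half of option 2 could be a small script run from cron. This is only a sketch under assumptions: the log directory, the 7 day retention period, and the function name are all placeholders I picked for illustration, not anything that is set up today.

```python
import time
from pathlib import Path

# Assumed log directory and retention period; both are placeholders.
LOG_DIR = Path("/var/log/xen/console")
RETENTION_DAYS = 7


def purge_old_console_logs(log_dir=LOG_DIR, retention_days=RETENTION_DAYS, now=None):
    """Delete console logs not modified within the retention window.

    Returns the names of the files that were removed.
    """
    now = time.time() if now is None else now
    cutoff = now - retention_days * 24 * 60 * 60
    removed = []
    for log_file in sorted(Path(log_dir).glob("*.log")):
        # Only touch regular files whose last write is older than the cutoff.
        if log_file.is_file() and log_file.stat().st_mtime < cutoff:
            log_file.unlink()
            removed.append(log_file.name)
    return removed
```

One catch with going by modification time: a domain that keeps writing to its console never ages out, so its file would hold more than a week of history. A real setup would probably need to rotate the files first and purge the rotated copies.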