November 2012 Archives

Final(?) status update for boutros

By using iostat, unmounted drives, and dd, we were able to determine that the partition tables should all have been on the mirrored drives in slot 0 and slot 1 (/dev/sda and /dev/sdb) of the original machine.  Neither of these was the drive that originally failed.
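
For the curious, here's a minimal sketch of that kind of raw-sector check, not the exact commands we ran: read the first 512 bytes of a device (what dd would give you) and look for the 0x55AA signature that marks an MBR partition table.  The device names below are just examples.

    #!/usr/bin/env python
    # Sketch only: check whether a block device carries an MBR partition
    # table by reading its first sector and looking for the 0x55AA boot
    # signature at offset 510.  Device paths below are examples.
    import sys

    def has_mbr_signature(device):
        with open(device, 'rb') as disk:
            sector = disk.read(512)  # first sector, same as dd bs=512 count=1
        return len(sector) == 512 and sector[510:512] == b'\x55\xaa'

    if __name__ == '__main__':
        for dev in sys.argv[1:] or ['/dev/sda', '/dev/sdb']:
            status = 'has an MBR signature' if has_mbr_signature(dev) \
                     else 'has NO partition table signature'
            print('%s %s' % (dev, status))

Run as root against the raw devices; a disk that lost its partition table fails the check even if the data past sector 0 is still intact.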

Because slot 0 fell out of the RAID on the original machine and was re-added later, it would have been rebuilt off of slot 1.  It is possible, though highly unlikely, that there was some silent corruption on slot 1.

We will be testing 99-raid-check, a script that checks for data mismatches within the RAID, to see whether running it hurts performance unacceptably.  The unmodified script only runs when the array is idle, but for prgmr.com it will have to run while the array is in use.  If it does not significantly hurt performance, it will typically run every week.
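
For anyone curious about the mechanics: on Linux md, a consistency check is kicked off through sysfs and the result shows up in mismatch_cnt.  The real 99-raid-check is a shell script; the following is just a rough Python sketch of the same idea, assuming an array at /dev/md0.

    #!/usr/bin/env python
    # Rough sketch of an md consistency check: ask the kernel to scrub
    # /dev/md0 via sysfs, wait for it to finish, then report the
    # mismatch count it found.  /dev/md0 is an example array name.
    import time

    MD = '/sys/block/md0/md'

    def read_attr(name):
        with open('%s/%s' % (MD, name)) as f:
            return f.read().strip()

    def write_attr(name, value):
        with open('%s/%s' % (MD, name), 'w') as f:
            f.write(value)

    if __name__ == '__main__':
        write_attr('sync_action', 'check')         # start the scrub
        while read_attr('sync_action') != 'idle':  # poll until it finishes
            time.sleep(60)
        mismatches = int(read_attr('mismatch_cnt'))
        print('mismatch_cnt = %d' % mismatches)
        if mismatches:
            print('WARNING: the array has data mismatches')

The open question for us is what that scrub does to guest I/O latency while it runs, which is why we're testing it before scheduling it weekly.
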
It looks like the data for 75% of the domains on boutros is gone.   srn
is still working on it, and seems to think it's worth her time; meanwhile,
I'm going to set everyone up with new domains.   We'll get you the data later,
but at least for now you will have something.

The remaining 25% have some minor corruption but seem mostly okay,
to the extent that we have poked at them.

Of course, you can ask for a refund;  this server mostly houses people who
signed up in the last month or so.  Considering the unacceptable level of
service I have given you, I think it's pretty reasonable for you to demand
a refund and leave.  Email support@prgmr.com and we'll give you a full
refund.

If you are willing to stick around and give me another chance, we will
double your RAM and quadruple your disk (you get to keep the upgrade for
as long as you continue paying for your current plan).   We will also
give you a 3-month credit.   (We're doing the credits by hand; it's
pretty haphazard.  If you don't get it within the
next few days, complain to support@prgmr.com.)

I'll post another blog entry as I know more.   If you want to look at the
post-mortem notes srn has made, they are linked below, but it's still
raw stuff.  We should have more definitive 'what happened' answers in
the next few days.

http://wiki.prgmr.com/mediawiki/index.php/20121124boutros_post-mortem


hamper rebooted


I just rebooted hamper.  I was trying to fix the multicast_router thing and the box crashed.

I chose... poorly

(well, it's also possible, maybe even likely, that the parts are fine and I somehow screwed up the backplane when I removed it to check it out.)

Anyhow, the story on boutros is that I had a bad drive, I went in to replace it, and 3 other drives failed immediately after I plugged in the new one.    I'm force-rebuilding the RAID now; I don't anticipate significant data loss.
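
For reference, force-rebuilding here means reassembling the array from members md no longer trusts because their event counts disagree.  Below is a sketch of the sort of invocation involved; the device names are placeholders, not the actual boutros layout.

    #!/usr/bin/env python
    # Sketch with placeholder device names: force-reassemble an md array
    # from members that were kicked out, then look at the resync progress.
    import subprocess

    ARRAY = '/dev/md0'
    MEMBERS = ['/dev/sd%s2' % letter for letter in 'abcdefgh']

    # --force lets mdadm use members whose event counts are stale
    subprocess.check_call(['mdadm', '--assemble', '--force', ARRAY] + MEMBERS)

    # rebuild/resync progress is visible in /proc/mdstat
    print(open('/proc/mdstat').read())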

Uh, yeah.  So I won't ever use these backplanes again (this is the first one in production... well, I have one in a dedicated server, but that's less important.  Only two drives there, and mirrors are more resistant to this sort of thing than RAID 10 setups.)

For now, boutros is in its old chassis (and still using that backplane).  I may swap it into a new chassis, or I may not; the priority is getting the RAID rebuilt.

Update, 03:29 local time: the rebuild finished, and then it said:

raid10: Disk failure on sde2, disabling device.

Upon reboot, it again only sees 5 of the 8 devices.

I'm wiped out.  At this point I'm more likely to cause data loss than anything else.  Nick has left, so this will wait until I've slept; then I'll head in and deal with it.

In the morning, my plan is to remove the backplane and connect the disks directly (with a bunch of spares, too) and see if that helps; I think it will.


Update, 09:31 local time:

I'm up.  I'm going to go remove that backplane now.

Update, 13:20:

The disks are all in a brand-new motherboard and chassis; I'm bringing it all back to 250 Stockton and racking it up now.

So yeah, uh, hamper's network hung with the above error.  The kernel was terrifyingly out of date, so I upgraded to the centos5.8-latest kernel-xen and rebooted; well, then we had

ata2.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen

...  


and the thing hung for a few minutes before it decided to kick the bad disk out of the RAID.  I mean, it did so; now guests are coming back up.  The box should be fine now (of course, I need to replace the bad drive).


Edit: the thing froze up again as I ran some more SMART tests on the bad drive before failing it out of md0.   Sorry.   Now it's failed out of md0, so it should be good.   I'll go replace the drive later.
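
For the record, kicking a bad member out of an md array and later swapping in a replacement looks roughly like the following; this is a sketch with placeholder device names, not the exact commands run on hamper.

    #!/usr/bin/env python
    # Sketch with placeholder device names: fail a flaky member out of
    # md0, remove it, and (after the physical swap) add the replacement.
    import subprocess

    ARRAY, BAD, NEW = '/dev/md0', '/dev/sdb1', '/dev/sdc1'

    # mark the bad member failed, then remove it from the array
    subprocess.check_call(['mdadm', '--manage', ARRAY, '--fail', BAD])
    subprocess.check_call(['mdadm', '--manage', ARRAY, '--remove', BAD])

    # once the drive has been physically replaced, add the new one and
    # let md rebuild onto it
    subprocess.check_call(['mdadm', '--manage', ARRAY, '--add', NEW])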

that was a little frightening

Billing is back up (along with signups); report any anomalies.

some lessons need to be learned twice

I used to zip-tie all the power cords to the servers so that you couldn't pull them out by accident, because I had accidentally unplugged a server.

Well, several years later, I don't put zip ties on my power cords anymore.  So, tonight?  I slid birds/stables in... and disconnected Gladwynn.


I need to have a new power cord retention policy.  Ouch. 

We are planning to move birds, stables, and lion out of Fremont on Sunday night starting at 8pm.  We will shut down birds and stables first, then have he.net move the routing for the subnets, except 216.218.210.64/27 and 216.218.223.64/26 (needed for lion and the openvpn tunnel), to our router gryphon.prgmr.com at MPT.  Then we will shut down the tunnel, shut down lion, and have he.net move the routing for those other 2 subnets.  We will take the 3 servers to MPT and start them up.  If it all goes well, we will be done by midnight.
-Nick