December 2011 Archives

taney ethernet going up and down

Taney's ethernet link seems to be going up and down:
[13275237.374946] eth0: port 1(peth0) entering disabled state
Dec 27 07:11:36 taney kernel: [13275240.495729] e1000e: peth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
Dec 27 07:11:36 taney kernel: [13275240.496284] eth0: port 1(peth0) entering forwarding state
Dec 27 07:12:17 taney kernel: [13275281.374172] eth0: port 1(peth0) entering disabled state
Dec 27 07:12:20 taney kernel: [13275284.454919] e1000e: peth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
Dec 27 07:12:20 taney kernel: [13275284.455487] eth0: port 1(peth0) entering forwarding state
The log on the switch just says something similar, with no errors counted on either end.
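
For anyone following along at home, the host-side link state and error counters come from ethtool; roughly this (interface name as in the logs above, output not shown):

ethtool peth0                    # link state, speed, and duplex
ethtool -S peth0 | grep -i err   # dump NIC statistics, keep error counters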

Luke is going to Market Post Tower now to try replacing the ethernet cable.
Update: Luke plugged the ethernet cable into eth1 instead of eth0, and it seems to be fixed now. -Nick

stone down hard

I/O errors on both drives. I'll be heading in to deal with it personally. One drive failed out of the RAID last night, and the other is giving I/O errors (and hanging the box) this morning. I hope it's a backplane issue. Otherwise, I guess it's data recovery time.

Edit: weird. On the upside, it looks like the data is still there. I'm bringing in spare hardware in case it's a backplane problem or similar bullshit.

Update: We put the drives in the new system and it's starting up now. We replaced the drive that failed first with a new drive, just in case it turns out that it was actually a problem with the hard drive. Stone should be good now.
We also found that we don't know how to pass break (for magic SysRq) through our terminal system, so we need to fix that.
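
For reference, magic SysRq has to be enabled in the kernel too, and on a serial line it's sent as a break followed by the command key. The pieces look roughly like this (a sketch, not our exact setup):

# allow magic sysrq keys at all
echo 1 > /proc/sys/kernel/sysrq
# without a working break, sysrq commands can be faked from a local shell,
# e.g. 's' to sync all filesystems:
echo s > /proc/sysrq-trigger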

Anyhow, it isn't coming up after a remote reboot; Nick is heading to the co-lo now (this one is near his house) to deal with the problem in person.

Also, we need to adjust the monitoring system; it didn't register the host as down until we rebooted it, since the box was still responding to ping and presenting an SSH banner.
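
A check that would have caught this needs to actually log in and run something, not just see the banner; a minimal sketch (hostname and account are made up):

# fail unless a real command completes over ssh within a deadline; a hung
# filesystem stalls this even while ping and the sshd banner still answer
timeout 20 ssh -o BatchMode=yes -o ConnectTimeout=5 monitor@stone.prgmr.com \
    'cat /proc/loadavg' > /dev/null || echo "stone: ssh exec check FAILED"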

Update: It is now rebuilding the RAID mirror in single-user mode and should be done in about an hour; then I will boot it up multi-user again. Unfortunately, the logs don't show any errors before it crashed (and that was at 2:32 last night, so it's been down for a while; the monitoring fix mentioned above should help). We also don't have console logging here like we do at SVTIX and Market Post Tower, so we don't really know why the filesystem stopped responding. -Nick
Update: Stone has booted up in multi-user mode now and the guests are starting. Email support@prgmr.com if you have problems after this. Thanks! -Nick
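
On the console logging point: a sketch of what that usually involves, with hypothetical port and speed (the Xen boxes also need the hypervisor's own console flags):

# generic Linux: send kernel messages to the first serial port by
# appending to the kernel line in grub:
#   console=ttyS0,115200n8 console=tty0
# on a Xen dom0, roughly:
#   xen.gz:    com1=115200,8n1 console=com1,vga
#   vmlinuz:   console=hvc0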

Which is to say: we swapped a hot-swap drive, and it hung during the RAID rebuild. I suspect we need to limit the RAID rebuild speed, but I don't have hard evidence. We will research this.

The annoying thing is that it hangs rather than printing something useful to the console.

Anyhow, we took the box down and are rebuilding the RAID in single-user mode.


sh-3.2# cat /proc/mdstat 
Personalities : [raid1] [raid10] 
md1 : active raid10 sda2[0] sdb2[1] sdc2[2] sdd2[4]
      1048578048 blocks 256K chunks 2 near-copies [4/3] [UUU_]
      [=>...................]  recovery =  6.9% (36467328/524289024) finish=71.9min speed=112928K/sec
      
md0 : active raid1 sdd1[3] sdc1[2] sdb1[1] sda1[0]
      10482304 blocks [4/4] [UUUU]
      
unused devices: <none>
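
(For those checking the math, that finish estimate is just remaining work over current speed: (524289024 − 36467328) KiB ÷ 112928 KiB/s ≈ 4320 seconds, or about 72 minutes, which matches the 71.9min above.)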


So figure we'll be back in around an hour and a half.


edit at 20:40 PST: rebooted. Guests are coming back up. I expect no further problems (at least until the next disk fails; we will hopefully have figured out a solution by then).

rutledge raid rebuild

Last night we replaced a hard drive in rutledge, and the RAID was rebuilding normally until the disk completely froze. I rebooted it, and I'm letting the RAID rebuild in single-user mode now. When it's done, I will update the blog here. Email support@prgmr.com if you have any questions. Thanks!

update 20111214 8:36AM PST: Luke here. The dang thing rebuilt, then rebuilt again. I'm suspecting a bad drive. SMART on the thing hangs, and the drive reports errors that all have to do with SMART. So I don't have real solid evidence that the drive is bad, but a drive with no working SMART is, if you ask me, reason enough to trash it anyhow.

Error 1 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 01 37 00 00 00  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  b0 d6 01 be 4f c2 00 00      00:00:46.516  SMART WRITE LOG
  b0 d5 20 bf 4f c2 00 00      00:00:46.016  SMART READ LOG
  b0 d6 01 be 4f c2 00 00      00:00:46.006  SMART WRITE LOG
  b0 d5 01 bf 4f c2 00 00      00:00:45.517  SMART READ LOG
  b0 d6 01 be 4f c2 00 00      00:00:45.507  SMART WRITE LOG

edit at 20111214 10:30am PST:
the thing rebuilt successfully and was rebooted about an hour ago. SMART tests now look good (at least the short and conveyance tests; the long test still has 10% to go, and I'll update when that's done).
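
For reference, kicking those tests off and reading the results with smartctl looks roughly like this (device name hypothetical):

smartctl -t short /dev/sda        # quick electrical/mechanical self-test
smartctl -t conveyance /dev/sda   # checks for damage in transit
smartctl -t long /dev/sda         # full surface scan; takes hours
smartctl -l selftest /dev/sda     # show the results once a test finishes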

I'm no longer at all sure it was a disk problem; I've seen errors like this when an array was rebuilding too fast. (There's a /sys/ entry that lets you limit rebuild speed, and we need to tweak that down next time. It used to be that md limited itself to something reasonable.)
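
Assuming the knob in question is the standard md one, turning the ceiling down looks something like this (20000 KiB/s is just an illustrative number):

# system-wide rebuild/resync ceiling, in KiB/s per device
echo 20000 > /proc/sys/dev/raid/speed_limit_max
# or per-array:
echo 20000 > /sys/block/md1/md/sync_speed_max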

I'll update again when the SMART error clears; for now, the machine is up, and I don't expect any more reboots.

disk replacement on birds

Birds has a bad disk that needs replacing, and it's been making everything slow. People have also been complaining about guests not being able to start after shutting down, so hopefully that will clear up once the RAID mirror is synced onto a new disk and running at full speed. If there are still problems, we will work on moving people off of birds to a newer system. Thanks!
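
For the curious, the usual md mirror swap goes something like this (device and array names are hypothetical, not birds' actual layout):

# mark the dying disk failed and pull it from the mirror
mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
# after physically swapping the disk, partition it to match, then re-add
mdadm /dev/md0 --add /dev/sdb1
# the mirror resyncs in the background; watch progress:
cat /proc/mdstat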