October 2013 Archives

Chime reboot

I have no idea what happened; for some reason the serial console cable was disconnected, so I have no serial console log. It's back up now (and the console is reconnected and logging).



Outage on White

edit: Note, this was down a long time, like half a day. It didn't set off my pager, since the machine was still pingable and answering on ssh (though it wasn't actually authenticating unless your keys were already cached in memory). This has happened several times now: outages run long because my pager doesn't go off while the machine is still half up.
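
(The real fix is probably on the monitoring side: a check that does an actual ssh login and touches the disk, instead of just pinging. Something like the sketch below, which is untested and uses placeholder host/alert addresses, might have caught this:)

#!/bin/sh
# untested sketch: page if a real ssh login plus a disk write doesn't finish in 30 seconds.
# needs timeout(1) from GNU coreutils; HOST and the mail address are placeholders.
HOST=white.prgmr.com
if ! timeout 30 ssh -o BatchMode=yes -o ConnectTimeout=10 "$HOST" \
      'touch /tmp/.alive.$$ && rm -f /tmp/.alive.$$' >/dev/null 2>&1 ; then
    echo "$HOST failed the deep check" | mail -s "$HOST deep check failed" pager@example.com
fi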

We had a hung disk.  I've brought the system back up but am still troubleshooting.  I'd say there is a 90% chance that I'll figure it out without requiring another reboot.    (one of the disks is suspect,  but I'm poking further 'cause there is some conflicting evidence)  


So, the evidence that it's sdd:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Fatal or unknown error        80%     26894         -


I have /never/ seen this ....  but also:


196 Reallocated_Event_Count 0x0032   189   189   000    Old_age   Always       -       11
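
(Both of those came out of smartctl; for reference, assuming smartmontools is installed, roughly:)

[lsc@white ~]$ sudo smartctl -t short /dev/sdd     # kick off a short self-test
[lsc@white ~]$ sudo smartctl -l selftest /dev/sdd  # the self-test log above
[lsc@white ~]$ sudo smartctl -A /dev/sdd           # the attribute table the 196 line came from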


Bad sectors are bad, so I'd like to shoot the disk. But:

[lsc@white ~]$ sudo -i iostat -x /dev/sdb /dev/sdc /dev/sdd /dev/sde
Linux 2.6.18-371.1.2.el5xen (white.prgmr.com)     10/27/2013

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7.76    0.02   14.87   18.60    0.07   58.68

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb              94.92    16.91 171.32  8.24 20463.91   200.08   115.09     3.29   18.32   2.69  48.36
sdc             159.85    17.49 191.02  6.55 21081.82   191.24   107.67     3.71   18.78   2.60  51.37
sdd              86.65    17.50 170.97  6.55 20368.76   191.24   115.82     2.47   13.90   2.08  36.97
sde             170.26    16.91 193.60  8.24 21210.90   200.09   106.08     6.45   31.93   3.46  69.92



If I'm reading that right, sdd is the /fastest/  (well, lowest latency, which is what I care about) disk on that server.    (sda, in this case, is a spare and ignored.)
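
(If you just want the await numbers side by side, something like this works; the $10 matches the column layout shown above and may differ with other sysstat versions:)

[lsc@white ~]$ sudo -i iostat -x /dev/sdb /dev/sdc /dev/sdd /dev/sde | awk '/^sd/ {print $1, $10}'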


Anyhow, it's only a 500GB disk (and thus will be quicker to rebuild), so I think I might just fail sdd and rebuild onto the spare.
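
(Roughly like this, once I'm sure; the array and partition names below are guesses and need to be checked against /proc/mdstat first:)

[lsc@white ~]$ cat /proc/mdstat                                   # confirm which array sdd is actually in
[lsc@white ~]$ sudo mdadm /dev/md1 --fail /dev/sdd1 --remove /dev/sdd1
[lsc@white ~]$ sudo mdadm /dev/md1 --add /dev/sda1                # only needed if sda isn't already a hot spare
[lsc@white ~]$ watch cat /proc/mdstat                             # watch the rebuild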


Also note, I put more RAM in that box while it was down, in preparation for customer upgrades (which I haven't done yet).

Wilson outage

so looks to me like we've got a bad disk. 


sd 0:0:0:0:
sd 0:0:0:0: timing out command, waited 20s
Oct 20 11:40:12 wilson kernel: mptscsih: ioc0: attempting task abort! (sc=ffff88003a83a6c0)
Oct 20 11:40:12 wilson kernel: sd 0:0:0:0:
[Sun Oct 20 19:25:40 2013]Oct 20 11:40:12 wilson kernel:         command: ATA command pass through(16): 85 08 0e 00 d5 00 01 00 06 00 4f 00 c2 00 b0 00
Oct 20 11:40:12 wilson kernel: mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff88003a83a6c0)

....

Oct 20 18:10:23 wilson kernel: sd 0:0:0:0: timing out command, waited 20s
Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
raid10: sda2: rescheduling sector 2741196983
[Mon Oct 21 03:45:59 2013]raid10: sda2: rescheduling sector 2741196991
raid10: sda2: rescheduling sector 2741196999
raid10: sda2: rescheduling sector 2741197007
raid10: sda2: rescheduling sector 2741197015
raid10: sda2: rescheduling sector 2741197023
raid10: sda2: rescheduling sector 2741197031
raid10: sda2: rescheduling sector 2741197039
sd 0:0:0:0: rejecting I/O to device being removed
sd 0:0:0:0: rejecting I/O to device being removed
sd 0:0:0:0: rejecting I/O to device being removed
[Mon Oct 21 03:45:59 2013]raid1: Disk failure on sda1, disabling device.
        Operation continuing on 7 devices


...




  Vendor: ATA       Model: ST31000340NS      Rev: SN16
  Type:   Direct-Access                      ANSI SCSI revision: 05
SCSI device sdk: 1953525168 512-byte hdwr sectors (1000205 MB)
sdk: Write Protect is off
SCSI device sdk: drive cache: write back
SCSI device sdk: 1953525168 512-byte hdwr sectors (1000205 MB)
sdk: Write Protect is off
SCSI device sdk: drive cache: write back
[Mon Oct 21 03:46:11 2013]sd 0:0:5:0: Attached scsi disk sdk
sd 0:0:5:0: Attached scsi generic sg0 type 0
INFO: task md1_raid10:752 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
md1_raid10    D ffff88003eb57628     0   752     35           755   713 (L-TLB)
 ffff88003e3f1dc0  0000000000000246  0000000000000000  0000000000000400
 000000000000000a  ffff88003f2eb0c0  ffff88003e2e87e0  0000000000001577
 ffff88003f2eb2a8  0000000000000000
Call Trace:
 [<ffffffff8022f2b1>] __wake_up+0x38/0x4f
[Mon Oct 21 03:48:43 2013] [<ffffffff8817e875>] :raid10:raid10d+0x558/0x9bb
 [<ffffffff8026082b>] error_exit+0x0/0x6e
 [<ffffffff8029ed86>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8028abb2>] default_wake_function+0x0/0xe
 [<ffffffff8029ed86>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8026365a>] schedule_timeout+0x1e/0xad
 [<ffffffff80264929>] _spin_lock_irqsave+0x9/0x14
 [<ffffffff8029ed86>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80415359>] md_thread+0xf8/0x10e
 [<ffffffff8029ef9e>] autoremove_wake_function+0x0/0x2e
[Mon Oct 21 03:48:43 2013] [<ffffffff80415261>] md_thread+0x0/0x10e



...


After the reboot, I see:

md: bind<sdg1>
md: bind<sdh1>
md: running: <sdh1><sdg1><sdf1><sde1><sdd1><sdc1><sdb1><sda1>
md: kicking non-fresh sda1 from array!
md: unbind<sda1>
md: export_rdev(sda1)


which means it's booting without sda. 

uh, but not in md1.  gah.  I will have to fail sda out of md1.  man...
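
(Probably something like this, assuming sda2 is the md1 member, which is what the raid10 messages above suggest:)

[root@wilson ~]# cat /proc/mdstat                       # see what md1 still thinks it has
[root@wilson ~]# mdadm /dev/md1 --fail /dev/sda2 --remove /dev/sda2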

The machine is taking forever and a half to fsck /var, which is on md1. I will report back.



....


I rebooted the server and replaced the drive; users should be up again.
Users were moved to hughes, and given the long-promised 2x RAM, 4x disk upgrade.
Note, I'm not 100% sure this will fix it (srn will look into it when she gets up), but I'm upgrading CentOS while we're at it: kernel-xen goes from 2.6.18-348.2.1.el5 to 2.6.18-348.18.1.el5, and xen goes from xen-3.0.3-142.el5_9.2 to xen-3.0.3-142.el5_9.3.
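
(The CentOS side of that is just yum plus a reboot into the new kernel; roughly:)

# yum update kernel-xen xen     # should pull in 2.6.18-348.18.1.el5 and xen-3.0.3-142.el5_9.3
# reboot                        # the new kernel-xen only takes effect after a reboot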

This happened right after I set:
[root@hughes ~]# brctl setportmcrouter xenbr0 vif0.0 2



BUG: soft lockup - CPU#0 stuck for 60s! [swapper:0]
CPU 0:
Modules linked in: ebtable_filter ebtables nls_utf8 hfsplus netloop netbk blktap blkbk lockd sunrpc ip_conntrack_netbios_ns ipt_REJECT xt_state xt_physdev bridge iptable_filter ipt_MASQUERADE iptable_nat ip_nat ip_conntrack nfnetlink ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i libcxgbi cxgb3 libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi loop dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi ac parport_pc lp parport joydev sg i2c_i801 pcspkr i2c_core igb 8021q dca tpm_tis tpm tpm_bios serio_raw serial_core dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ahci libata raid10 shpchp mptsas mptscsih mptbase scsi_transport_sas sd_mod scsi_mod raid1 ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 0, comm: swapper Not tainted 2.6.18-308.13.1.el5xen #1
RIP: e030:[<ffffffff88790ff0>]  [<ffffffff88790ff0>] :ebtables:ebt_do_table+0xb0/0x2dd
RSP: e02b:ffffffff8079fba0  EFLAGS: 00000246
RAX: 0000000000000002 RBX: ffffffff8879e720 RCX: ffffc200004e4000
RDX: 0000000000000000 RSI: ffffffff8079fce0 RDI: ffffffff8879e75c
RBP: ffff880032812800 R08: ffffffff8879e720 R09: ffffffff8879e6b0
R10: ffff8800253f7280 R11: ffff880032812800 R12: ffff88003f434000
R13: ffffc200004e6030 R14: 0000000000000000 R15: ffffc200004e4080
FS:  00002b3e56b6f170(0000) GS:ffffffff80637000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000

Call Trace:
 <IRQ>  [<ffffffff80235317>] nf_iterate+0x41/0x7d
 [<ffffffff88675195>] :bridge:br_forward_finish+0x0/0x51
 [<ffffffff80258672>] nf_hook_slow+0x58/0xbc
 [<ffffffff88675195>] :bridge:br_forward_finish+0x0/0x51
 [<ffffffff80230c32>] dev_queue_xmit+0x3ba/0x3ef
 [<ffffffff886751e6>] :bridge:__br_forward+0x0/0x9c
 [<ffffffff886751e6>] :bridge:__br_forward+0x0/0x9c
 [<ffffffff8867523e>] :bridge:__br_forward+0x58/0x9c
 [<ffffffff88674d7e>] :bridge:deliver_clone+0x36/0x3d
 [<ffffffff88674da9>] :bridge:maybe_deliver+0x24/0x35
 [<ffffffff88674e20>] :bridge:br_multicast_flood+0x66/0x106
 [<ffffffff88675d41>] :bridge:br_handle_frame_finish+0x0/0x1d3
 [<ffffffff88675e21>] :bridge:br_handle_frame_finish+0xe0/0x1d3
 [<ffffffff88676099>] :bridge:br_handle_frame+0x185/0x1a4
 [<ffffffff802213f1>] netif_receive_skb+0x3a8/0x4c4
 [<ffffffff88276e67>] :igb:igb_poll+0x73e/0xb55
 [<ffffffff8020d017>] net_rx_action+0xb4/0x1c6
 [<ffffffff80212f29>] __do_softirq+0x8d/0x13b
 [<ffffffff80260da4>] call_softirq+0x1c/0x278
 [<ffffffff8026eb89>] do_softirq+0x31/0x90
 [<ffffffff802608d6>] do_hypervisor_callback+0x1e/0x2c
 <EOI>  [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
 [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
 [<ffffffff8026ffc8>] raw_safe_halt+0x87/0xab
 [<ffffffff8026d573>] xen_idle+0x38/0x4a
 [<ffffffff8024ad21>] cpu_idle+0x97/0xba
 [<ffffffff80762b11>] start_kernel+0x21f/0x224
 [<ffffffff807621e5>] _sinittext+0x1e5/0x1eb


Council reboot

At approximately 9:50 PST we rebooted council due to a loss of communication with a disk that caused the root file system to be remounted read-only.  Approximately 24 guests were affected.  The problem was possibly due to a transient backplane failure.  Everyone is back up but we will identify alternate host(s) to move the affected guests to.
(this is only girdle and cattle at this point) 
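
(For the record, a read-only remount like this is easy to confirm before deciding to reboot; something along these lines:)

[root@council ~]# grep ' / ' /proc/mounts     # look for "ro" in the mount options for /
[root@council ~]# touch /root/rw-test         # fails with "Read-only file system" if / has gone ro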


It sounds like the rippleweb folks are on it.

I do have plans to migrate out of rippleweb, but those plans aren't super clear. (I mean, the service has been good, though they have had network outages. It's just nice to have all my stuff where I can watch and control it.)
