December 2012 Archives

hang/unclean reboot of coat

I'm not sure what the problem is; I will troubleshoot more after sleep.

Note: the reboot didn't help. I'm booting into non-Xen and poking around. I will upgrade the kernel and see if that helps.


14:10 <@prgmrcom> see, it's not packet loss.  It's... something slowing down
                  coat so much that logins on the serial console time out.
14:11 <@prgmrcom> fortunately I ssh'd in before that... but it's not doing me a
                  whole hell of a lot of good.
14:11 <@prgmrcom> typing on that link is also really slow.  Top isn't coming up.
14:12 <@prgmrcom> I mean, I type 'top\n'  and it types t....o....p....   and
                  then it sits there. 
14:12 <@prgmrcom> Cpu(s):  2.8%us,  3.5%sy,  0.0%ni, 32.3%id,  5.5%wa,  0.0%hi,
                  55.6%si,  0.3%st
14:13 <@prgmrcom> top - 06:12:23 up 36 min,  2 users,  load average: 42.76,
                  38.79, 29.70
14:13 <@prgmrcom> huh.
14:13 <@prgmrcom> the load is terrible, but 32% id isn't terrible. 
14:13 <@prgmrcom> Cpu(s):100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,
                  0.0%si,  0.0%st
14:13 <@prgmrcom> oh
14:13 <@prgmrcom> that's more what I expect
14:13 <@prgmrcom>  3300 root      17   0  230m  25m 1664 S 99.9  2.5   0:21.62
                  xend              
14:13 <@prgmrcom> hm?
14:14 <@prgmrcom> weird


So, uh, I found one domain using a /lot/ of soft interrupts. I have disabled that domain and we are back (except for that user, whom I have emailed).

15:29 < LowRadio> taking a long time to boot, i guess everyone else vps is booting also
15:29 <@prgmrcom> fuck.   I feel really uncomfortable, 'cause I don't really understand what was
                  going on.  I mean, /proc/interrupts was incremetning for that guy's vm... but
                  that much?  I don't know.  
15:29 < LowRadio> luke I would'nt know
15:30 <@prgmrcom> but yeah, disable that vm and the problem goes away and the system is as it
                  should be (which is kinda slow;  it's an old box, it's rebuilding a RAID, and
                  everyone is booting at once, so yeah.)
15:31 < LowRadio> that is odd that vm would effect the whole system
15:31 <@prgmrcom> not odd... very bad.  
15:31 <@prgmrcom> but I don't really understand how interupts work.
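One way to see which interrupt source is climbing, as with the /proc/interrupts counters mentioned in the log, is to take two snapshots and compare them. The following is only a minimal sketch, not what was run on coat: the function names `parse_interrupts` and `busiest` are made up for illustration, and it assumes the usual /proc/interrupts layout (a header row of CPU columns, then one `label: count count ... description` row per interrupt).

```python
import time


def parse_interrupts(text):
    """Return {irq_label: total count across all CPUs} for /proc/interrupts-style text."""
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        counts = []
        for field in rest.split()[:ncpus]:
            if not field.isdigit():
                break  # rows such as ERR carry fewer columns than there are CPUs
            counts.append(int(field))
        totals[label.strip()] = sum(counts)
    return totals


def busiest(before, after, top=5):
    """Return the `top` interrupt labels with the largest increase between two snapshots."""
    deltas = {irq: after[irq] - before.get(irq, 0) for irq in after}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)[:top]


if __name__ == "__main__":
    # Sample the live counters twice, five seconds apart (Linux only).
    with open("/proc/interrupts") as f:
        first = parse_interrupts(f.read())
    time.sleep(5)
    with open("/proc/interrupts") as f:
        second = parse_interrupts(f.read())
    for irq, delta in busiest(first, second):
        print(irq, delta)
```

The label printed is only the first column; matching it back to a particular guest still means reading the description column in /proc/interrupts by hand.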

I killed my 71.19.144.0/20 announcements and announced just the /24s for each location at the location where they live. (There would have been no real harm in keeping the /20 as well, since the more specific route wins, but I took it out.) Somehow I forgot 71.19.154.0/24, and then I went and had a long dinner. I came back to a bunch of complaints.

Now 71.19.154.0/24 is being announced out the he.net port at 55 s. market, and 71.19.144.0/20 is being announced out both upstreams. I'm going to be up for a while handling tickets before I sleep.
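The "more specific route wins" rule is longest-prefix matching. Here is a small sketch of it using Python's standard `ipaddress` module; `best_route` is a made-up helper, and 71.19.145.0/24 and the host address 71.19.154.10 are placeholders chosen for illustration, not a statement of what is actually assigned where.

```python
import ipaddress


def best_route(destination, announced):
    """Return the most specific announced prefix covering destination, or None."""
    addr = ipaddress.ip_address(destination)
    matches = [
        ipaddress.ip_network(prefix)
        for prefix in announced
        if addr in ipaddress.ip_network(prefix)
    ]
    # Longest prefix length wins, which is how routers choose between overlapping routes.
    return max(matches, key=lambda net: net.prefixlen, default=None)


if __name__ == "__main__":
    # With both the /20 and the /24 announced, the /24 is preferred.
    print(best_route("71.19.154.10", ["71.19.144.0/20", "71.19.154.0/24"]))
    # With the /20 withdrawn and the /24 forgotten, nothing covers the address.
    print(best_route("71.19.154.10", ["71.19.145.0/24"]))
```

The second call is the failure mode: once the covering /20 is gone, any /24 left off the list is simply unreachable.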

So yeah, uh, the point-to-point link from 55 s. market to 250 stockton died this evening. No problem, right? We have a transit provider at 250 stockton (cogent) and one at 55 s. market (he.net).

The problem was that instead of announcing the /24 blocks we use at 250 stockton to cogent, the /24 blocks at 55 s. market to he.net, and then the /20 to both, we just announced the /20 to both. So traffic for either site could arrive through either provider, and with the link down it had no way to cross to the other site.

I made the more specific announcements, which brought us back up (though traffic from one site to the other was still broken).

At this point, the point-to-point link is back up, so everything should be better now.

Hey, sorry it took me so long to do this, but things have been busy. I gave credit to all users affected by the cerberus outage:

http://blog.prgmr.com/xenophilia/2012/09/an-inconvienent-page.html


I'm applying a full month of credit because this was a particularly long outage; we actually ended up moving everyone to new servers and off cerberus completely.


Total credit 432.2.

the IPv6 crash. Again. boutros

So, there is a problem in CentOS6/xen: for some reason, sometimes IPv6 won't work. The bridge doesn't forward the multicast packets as it must for the neighbour discovery protocol to work.


In the places where IPv4 would use broadcast (FF:FF:FF:FF:FF:FF) ethernet frames to map IPs to MAC addresses, IPv6 uses multicast frames (any destination MAC with the low bit of the first octet set; IPv6 neighbour discovery specifically uses 33:33:xx:xx:xx:xx), which should be treated the same by a bridge but apparently aren't always.
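To make the addressing concrete, here is a rough sketch of how a neighbour solicitation's destination is derived: the target's solicited-node multicast group (ff02::1:ff plus the low 24 bits of the unicast address), then the ethernet MAC (33:33 plus the low 32 bits of that group). The function names are made up for illustration, and 2001:db8::1234:5678 is a documentation address, not one of ours.

```python
import ipaddress


def solicited_node_multicast(unicast):
    """Return the solicited-node multicast group for an IPv6 unicast address."""
    addr = int(ipaddress.IPv6Address(unicast))
    base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    return ipaddress.IPv6Address(base | (addr & 0xFFFFFF))  # keep the low 24 bits


def multicast_mac(group):
    """Return the ethernet destination MAC used for an IPv6 multicast group."""
    low32 = int(ipaddress.IPv6Address(str(group))) & 0xFFFFFFFF
    octets = [0x33, 0x33] + [(low32 >> shift) & 0xFF for shift in (24, 16, 8, 0)]
    return ":".join(f"{octet:02x}" for octet in octets)


def is_multicast_mac(mac):
    """True if the group bit (low bit of the first octet) is set; broadcast counts too."""
    return int(mac.split(":")[0], 16) & 1 == 1


if __name__ == "__main__":
    group = solicited_node_multicast("2001:db8::1234:5678")
    print(group, multicast_mac(group))
```

If the bridge drops frames sent to that 33:33 address, the solicitation never reaches the guest and neighbour discovery for it silently fails.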

So the idea is that when IPv6 doesn't work, you run brctl setportmcrouter xenbr0 [interface not seeing multicast] 2

This usually fixes it. But sometimes? It crashes the box, with the backtrace below.



related:

http://blog.prgmr.com/xenophilia/2012/06/ipv6-problem-on-black-solved-g.html


http://blog.prgmr.com/xenophilia/2012/08/another-ipv6multicast-related.html

BUG: soft lockup - CPU#0 stuck for 60s! [swapper:0]
CPU 0:
Modules linked in: xt_physdev ipt_MASQUERADE iptable_nat ip_nat netloop netbk blktap blkbk bridge lockd sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i libcxgbi cxgb3 libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi ac parport_pc lp parport joydev sg pcspkr i2c_i801 igb serio_raw serial_core i2c_core 8021q dca tpm_tis tpm tpm_bios dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ahci libata raid10 shpchp mptsas mptscsih mptbase scsi_transport_sas sd_mod scsi_mod raid1 ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 0, comm: swapper Not tainted 2.6.18-308.20.1.el5xen #1
RIP: e030:[<ffffffff886df191>]  [<ffffffff886df191>] :bridge:br_dev_queue_push_xmit+0x1fc/0x200
RSP: e02b:ffffffff8079fcc0  EFLAGS: 00000206
RAX: 0000000000000001 RBX: ffff8800219e1280 RCX: 000000000000020b
RDX: 0000000000000000 RSI: ffff8800219e1280 RDI: 0000000000000000
RBP: ffffffff886df1e6 R08: 0000000000000000 R09: ffffffff886def95
R10: 0000000000000000 R11: 0000000000000001 R12: ffff88002931b000
R13: ffff88002931b220 R14: ffff880018803a80 R15: ffffffff886df1e6
FS:  00002aaf6e0c26e0(0000) GS:ffffffff80637000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000

Call Trace:
 <IRQ>  [<ffffffff886df1e4>] :bridge:br_forward_finish+0x4f/0x51
 [<ffffffff886df24f>] :bridge:__br_forward+0x69/0x9c
 [<ffffffff886ded7e>] :bridge:deliver_clone+0x36/0x3d
 [<ffffffff886deda9>] :bridge:maybe_deliver+0x24/0x35
 [<ffffffff886dee20>] :bridge:br_multicast_flood+0x66/0x106
 [<ffffffff886dfd41>] :bridge:br_handle_frame_finish+0x0/0x1d3
 [<ffffffff886dfe21>] :bridge:br_handle_frame_finish+0xe0/0x1d3
 [<ffffffff886e0099>] :bridge:br_handle_frame+0x185/0x1a4
 [<ffffffff8022143d>] netif_receive_skb+0x3a8/0x4c4
 [<ffffffff88286e60>] :igb:igb_poll+0x73e/0xb55
 [<ffffffff8020d00b>] net_rx_action+0xb4/0x1c6
 [<ffffffff80212f44>] __do_softirq+0x8d/0x13b
 [<ffffffff80260da4>] call_softirq+0x1c/0x278
 [<ffffffff8026eb89>] do_softirq+0x31/0x90
 [<ffffffff802608d6>] do_hypervisor_callback+0x1e/0x2c
 <EOI>  [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
 [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
 [<ffffffff8026ffc8>] raw_safe_halt+0x87/0xab
 [<ffffffff8026d573>] xen_idle+0x38/0x4a
 [<ffffffff8024ad6d>] cpu_idle+0x97/0xba
 [<ffffffff80762b11>] start_kernel+0x21f/0x224
 [<ffffffff807621e5>] _sinittext+0x1e5/0x1eb