August 2011 Archives

Update:  all customers should be back.   The RAID was broken by the unclean restart, so expect degraded I/O while it finishes rebuilding.  

Update:  I fixed the grub boot records.   The box is back up;  the virtuals will be coming online one at a time.

Update: knife isn't rebooting remotely.  (I mean, the PDUs cycle power, but knife hangs up somewhere between POST and boot)  I'm heading down to the co-lo to jerk with it in person.


not a good morning.  I'm rebooting knife right now.

BUG: soft lockup detected on CPU#0!

Call Trace:
 <IRQ> [<ffffffff8025758a>] softlockup_tick+0xce/0xe0
 [<ffffffff8020df48>] timer_interrupt+0x3a0/0x3fa
[Wed Aug 31 00:44:24 2011] [<ffffffff80257874>] handle_IRQ_event+0x4e/0x96
 [<ffffffff80257960>] __do_IRQ+0xa4/0x105
 [<ffffffff88204e6f>] :bridge:br_forward_finish+0x0/0x51
 [<ffffffff8020bd5c>] do_IRQ+0x44/0x4d
 [<ffffffff8034c980>] evtchn_do_upcall+0x19e/0x250
 [<ffffffff80209d8e>] do_hypervisor_callback+0x1e/0x2c
 [<ffffffff8020528a>] hypercall_page+0x28a/0x1000
 [<ffffffff8020528a>] hypercall_page+0x28a/0x1000
 [<ffffffff8036109a>] net_tx_action+0xa82/0x103f
 [<ffffffff803b68f2>] nf_hook_slow+0x58/0xbc
[Wed Aug 31 00:44:24 2011] [<ffffffff8820593e>] :bridge:br_handle_frame_finish+0x0/0xf8
 [<ffffffff88205ba4>] :bridge:br_handle_frame+0x16e/0x1a2
 [<ffffffff8039edb6>] netif_receive_skb+0x1ca/0x2ea
 [<ffffffff80233b33>] tasklet_action+0xa7/0x133
 [<ffffffff802339f8>] __do_softirq+0x83/0x117
 [<ffffffff8034c980>] evtchn_do_upcall+0x19e/0x250
 [<ffffffff803fc644>] call_softirq+0x1c/0x26
 [<ffffffff8020c055>] do_softirq+0x6a/0xed
 [<ffffffff80209d8e>] do_hypervisor_callback+0x1e/0x2c
 <EOI> [<ffffffff803581ea>] show_rd_sect+0x0/0x68
[Wed Aug 31 00:44:24 2011] [<ffffffff802ebbfc>] __read_lock_failed+0x8/0x14
 [<ffffffff80343f3e>] get_device+0x17/0x20
 [<ffffffff803fc3fd>] .text.lock.spinlock+0x53/0x8a
 [<ffffffff80358211>] show_rd_sect+0x27/0x68
 [<ffffffff802bc351>] sysfs_read_file+0xa5/0x12e
 [<ffffffff8027e3f5>] vfs_read+0xcb/0x171
 [<ffffffff8027e7d4>] sys_read+0x45/0x6e
 [<ffffffff802097b2>] tracesys+0xab/0xb5

Minor clarifications to the AUP

I forgot to take a copy before I started (yes, it should be in git with the important stuff, but it's not; it's just on the server like the rest of the webpages)  so if you want to look at the old copy, you can see the wayback copy:

http://web.archive.org/web/20100101015856/http://prgmr.com/aup.html

as usual, the live copy is at http://prgmr.com/aup.html

My intent here is to make clear the fact that we have several clauses prohibiting unsolicited bulk mail and spam in addition to the main "no bulk mail of any kind without permission" prohibition.  At Chris' suggestion, I also made more of the clauses stand on their own by appending an "is prohibited" to each.

Being as I didn't make any substantive changes, I'm not sending out an email, but I wanted to note it here anyhow.  

'power event' at he.net

still working on getting ingot back up.  no further info at this time.  

stables had a problem (xend didn't start??  will figure it out after the carnage is cleaned up)  but it's coming up now.

Ingot is back.  If you are still down, complain loudly.  

here is the official report from he.net:

Jewel is down again.

Please note, I'm adding the new updates at the top;  jewel is up right now.  Read down for the history.  

edit at 20:33:  we've disabled hyperthreading and we've got it on the new kernel and on the e1000e ethernet adaptor.  It should be up and stable;  the raid is still rebuilding, but yeah, I think we are in okay shape for now.    We'll be moving people off this server as we get more capacity up;  email us if you want to move to the front of the line.  

The raid is still rebuilding, so expect less than stellar performance.   

edit at 18:13:   a new crash

[  163.452490] physdev match: using --physdev-out in the OUTPUT, FORWARD and POSTROUTING chains for non-bridged traffic is not supported anymore.
[  163.452944] physdev match: using --physdev-out in the OUTPUT, FORWARD and POSTROUTING chains for non-bridged traffic is not supported anymore.
(XEN) ----[ Xen-4.0.1  x86_64  debug=n  Not tainted ]----
(XEN) CPU:    2
(XEN) RIP:    e008:[<ffff82c480167617>] do_nmi_stats+0x27/0x110
(XEN) RFLAGS: 0000000000010602   CONTEXT: hypervisor
(XEN) rax: 0000000000000001   rbx: 0000000000000027   rcx: 0000000000000000
(XEN) rdx: ffff8304b9340000   rsi: ffff8304b9340000   rdi: ffff830459777c20
(XEN) rbp: ffff8304b9340000   rsp: ffff83043ff27cd8   r8:  ffff8300bf42c000
(XEN) r9:  ffff8304b9340000   r10: 1480000000000002   r11: 0000000000000246
(XEN) r12: ffff8304b9340000   r13: ffff8304b9340000   r14: 0000000000000000
(XEN) r15: 00000008a5b57027   cr0: 000000008005003b   cr4: 00000000000026f0
(XEN) cr3: 0000000843101000   cr2: 00007f1c17d0aa60
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen stack trace from rsp=ffff83043ff27cd8:
(XEN)    0000000000000000 ffff82c48015bcd8 0000000000000000 ffff82c48022a400
(XEN)    0000000200000000 ffff8300bf42c000 ffff8304b9340000 0000000000000000
(XEN)    ffffffffffffffff 0000000000000027 ffff8304b9340000 00000000008430ff
(XEN)    ffff83084300b248 000000000084300b 00000008a5b57027 ffff82c48016465b
(XEN)    ffffffffffffffff ffff82c48015f903 0000000000000000 ffff8300bf42c000
(XEN)    0000000000000000 000000000084300b ffff8304b9340000 00000008a5b57027
(XEN)    000000000084300b 0000000000000000 0000000000000000 00000001ffffffff
(XEN)    ffff8304b9340000 000000000084300b 0000000000000000 00000000008430ff
(XEN)    ffff83084300b248 000000000084300b 00000008a5b57027 ffff82c480165d36
(XEN)    0000000000000002 ffff8300bf42c030 0000000000000282 0000000000000020
(XEN)    0000000000000000 0000000000000282 ffff8300bf42c000 ffff8300bf2fa000
(XEN)    00007ff000000001 0000000000000000 000002004f1028e8 ffff82f610860160
(XEN)    000000498011b816 ffff8304b9340000 ffff8304b9340000 ffff8300bf42c000
(XEN)    000000084300b067 ffff82c48025a100 ffff82c480145405 ffff8300bf42c030
(XEN)    0000000000097490 0000004980145405 000000084300b248 00000008a5b57027
(XEN)    ffff83043ff27f28 0000000000000002 0000000000000000 ffff83043ff27f28
(XEN)    0000000180372980 0000000000000001 ffff82c48025a080 ffff8300bf42c000
(XEN)    000000000096ff90 00000000008430ff 00000000000000f0 0000000000000aa1
(XEN)    0000000000097000 ffff82c4801e5169 0000000000097000 0000000000000aa1
(XEN)    00000000000000f0 00000000008430ff 000000000096ff90 000000001e1ff000
(XEN) Xen call trace:
(XEN)    [<ffff82c480167617>] do_nmi_stats+0x27/0x110
(XEN)    [<ffff82c48015bcd8>] get_page+0x28/0xf0
(XEN)    [<ffff82c48016465b>] mod_l1_entry+0x37b/0x9c0
(XEN)    [<ffff82c48015f903>] get_page_and_type_from_pagenr+0x93/0xf0
(XEN)    [<ffff82c480165d36>] do_mmu_update+0x9f6/0x1a70
(XEN)    [<ffff82c480145405>] reprogram_timer+0x55/0x90
(XEN)    [<ffff82c4801e5169>] syscall_enter+0xa9/0xae
(XEN)    
(XEN) 
(XEN) ****************************************
(XEN) Panic on CPU 2:
(XEN) FATAL TRAP: vector = 6 (invalid opcode)
(XEN) ****************************************
(XEN) 


I've seen some similar things having to do with the intel hardware virtualization, so I disabled all hardware virtualization, and I disabled hyperthreading.  booting again.  


edit at 17:00:   the kernel was upgraded some time ago, but the system still crashes when we start guests.  We can't get it to crash without starting guests, so we're grasping here;  we're going to use the onboard e1000e rather than the USB ethernet adapter, now that we have the good kernel in place.  Nick is en route to the data center.

original post:

We're going to update the kernel to latest (the old one we were using was built for our amd mcp55 systems, and it's in a modern intel server right now)  


(XEN) ----[ Xen-3.4.1  x86_64  debug=n  Not tainted ]----
(XEN) CPU:    2
(XEN) RIP:    e008:[<ffff828c801207d0>] compat_xen_version+0x3e0/0x420
[Sat Aug  6 14:14:12 2011](XEN) RFLAGS: 0000000000010082   CONTEXT: hypervisor
(XEN) rax: 00000000000003b8   rbx: ffff8308df024ce0   rcx: 0000000000000004
(XEN) rdx: 00002072659edae4   rsi: ffff830c3fdc7e88   rdi: ffff828c802855a0
(XEN) rbp: ffff8308df024cb0   rsp: ffff830c3fdc7ec8   r8:  0000000000012252
(XEN) r9:  0000000000000004   r10: 0000000000000005   r11: 0000000000000000
(XEN) r12: ffff8308df024d70   r13: 00002072659ec3c7   r14: 0000000000000000
(XEN) r15: ffff828c80221100   cr0: 000000008005003b   cr4: 00000000000026f0
(XEN) cr3: 00000009030ae000   cr2: 00000000000000d4
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen stack trace from rsp=ffff830c3fdc7ec8:
[Sat Aug  6 14:14:12 2011](XEN)    ffff82ec80173b66 ffff828c8025f900 000000008025e900 ffff830c3fdc7f28
(XEN)    ffff828c8025e900 ffff828c8021f5b0 00002072659ec3c7 0000000000000000
(XEN)    ffff828c80138ed7 0000000000002000 ffff8300bf23c000 ffff8300bf0e4000
(XEN)    0000000001301c00 ffffffff8057d160 ffffffff8057c520 ffffffffffffffff
(XEN)    0000000000631918 0000000000000000 0000000000000246 0000000000631918
(XEN)    ffff880001939ee0 ffffffff805cbe48 0000000000000000 ffffffff802083aa
(XEN)    ffffffff80553f28 0000000000000000 0000000000000001 0000010000000000
(XEN)    ffffffff802083aa 000000000000e033 0000000000000246 ffffffff80553f10
(XEN)    000000000000e02b 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000002 ffff8300bf23c000
[Sat Aug  6 14:14:12 2011](XEN) Xen call trace:
(XEN)    [<ffff828c801207d0>] compat_xen_version+0x3e0/0x420
(XEN)    [<ffff828c80138ed7>] idle_loop+0x47/0xa0
(XEN)    
(XEN) Pagetable walk from 00000000000000d4:
(XEN)  L4[0x000] = 00000009030e8067 000000000001f2cc
(XEN)  L3[0x000] = 00000009030e7067 000000000001f2cd
(XEN)  L2[0x000] = 0000000000000000 ffffffffffffffff 
(XEN) 
(XEN) ****************************************
[Sat Aug  6 14:14:12 2011](XEN) Panic on CPU 2:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0000]
(XEN) Faulting linear address: 00000000000000d4
(XEN) ****************************************
(XEN) 
(XEN) Reboot in five seconds...

Hardware issues with jewel

We believe the problem is a bad PSU;  Nick is working on it right now.  

Update:  We decided that it'd be faster to swap the drives into ellsworth, one of our dual quad-core intel servers, which is actually quite a nice server; much newer than Jewel was.   We're having some kernel issues so for now, it's running on Nick's USB ethernet adapter, which I'm not particularly happy about, but it's better than Nick and I screwing around with the kernel on this old, touchy setup while we are half-asleep.

The plan is to move everyone off this box on to newer systems with more and newer drives starting on Sunday, after we're fresh.

Meanwhile, all users on Jewel will get a free month.  Megan is working on that now; credits will be applied, uh, probably within a week.