Please note: I'm adding new updates at the top. jewel is up right now; read down for the history.
edit at 20:33: we've disabled hyperthreading, and the server is now running the new kernel and using the e1000e ethernet adapter. It should be up and stable. The RAID is still rebuilding, so expect less than stellar performance, but I think we are in okay shape for now. We'll be moving people off this server as we bring more capacity up; email us if you want to move to the front of the line.
edit at 18:13: a new crash
[ 163.452490] physdev match: using --physdev-out in the OUTPUT, FORWARD and POSTROUTING chains for non-bridged traffic is not supported anymore.
[ 163.452944] physdev match: using --physdev-out in the OUTPUT, FORWARD and POSTROUTING chains for non-bridged traffic is not supported anymore.
(XEN) ----[ Xen-4.0.1 x86_64 debug=n Not tainted ]----
(XEN) CPU: 2
(XEN) RIP: e008:[<ffff82c480167617>] do_nmi_stats+0x27/0x110
(XEN) RFLAGS: 0000000000010602 CONTEXT: hypervisor
(XEN) rax: 0000000000000001 rbx: 0000000000000027 rcx: 0000000000000000
(XEN) rdx: ffff8304b9340000 rsi: ffff8304b9340000 rdi: ffff830459777c20
(XEN) rbp: ffff8304b9340000 rsp: ffff83043ff27cd8 r8: ffff8300bf42c000
(XEN) r9: ffff8304b9340000 r10: 1480000000000002 r11: 0000000000000246
(XEN) r12: ffff8304b9340000 r13: ffff8304b9340000 r14: 0000000000000000
(XEN) r15: 00000008a5b57027 cr0: 000000008005003b cr4: 00000000000026f0
(XEN) cr3: 0000000843101000 cr2: 00007f1c17d0aa60
(XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e010 cs: e008
(XEN) Xen stack trace from rsp=ffff83043ff27cd8:
(XEN) 0000000000000000 ffff82c48015bcd8 0000000000000000 ffff82c48022a400
(XEN) 0000000200000000 ffff8300bf42c000 ffff8304b9340000 0000000000000000
(XEN) ffffffffffffffff 0000000000000027 ffff8304b9340000 00000000008430ff
(XEN) ffff83084300b248 000000000084300b 00000008a5b57027 ffff82c48016465b
(XEN) ffffffffffffffff ffff82c48015f903 0000000000000000 ffff8300bf42c000
(XEN) 0000000000000000 000000000084300b ffff8304b9340000 00000008a5b57027
(XEN) 000000000084300b 0000000000000000 0000000000000000 00000001ffffffff
(XEN) ffff8304b9340000 000000000084300b 0000000000000000 00000000008430ff
(XEN) ffff83084300b248 000000000084300b 00000008a5b57027 ffff82c480165d36
(XEN) 0000000000000002 ffff8300bf42c030 0000000000000282 0000000000000020
(XEN) 0000000000000000 0000000000000282 ffff8300bf42c000 ffff8300bf2fa000
(XEN) 00007ff000000001 0000000000000000 000002004f1028e8 ffff82f610860160
(XEN) 000000498011b816 ffff8304b9340000 ffff8304b9340000 ffff8300bf42c000
(XEN) 000000084300b067 ffff82c48025a100 ffff82c480145405 ffff8300bf42c030
(XEN) 0000000000097490 0000004980145405 000000084300b248 00000008a5b57027
(XEN) ffff83043ff27f28 0000000000000002 0000000000000000 ffff83043ff27f28
(XEN) 0000000180372980 0000000000000001 ffff82c48025a080 ffff8300bf42c000
(XEN) 000000000096ff90 00000000008430ff 00000000000000f0 0000000000000aa1
(XEN) 0000000000097000 ffff82c4801e5169 0000000000097000 0000000000000aa1
(XEN) 00000000000000f0 00000000008430ff 000000000096ff90 000000001e1ff000
(XEN) Xen call trace:
(XEN) [<ffff82c480167617>] do_nmi_stats+0x27/0x110
(XEN) [<ffff82c48015bcd8>] get_page+0x28/0xf0
(XEN) [<ffff82c48016465b>] mod_l1_entry+0x37b/0x9c0
(XEN) [<ffff82c48015f903>] get_page_and_type_from_pagenr+0x93/0xf0
(XEN) [<ffff82c480165d36>] do_mmu_update+0x9f6/0x1a70
(XEN) [<ffff82c480145405>] reprogram_timer+0x55/0x90
(XEN) [<ffff82c4801e5169>] syscall_enter+0xa9/0xae
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 2:
(XEN) FATAL TRAP: vector = 6 (invalid opcode)
(XEN) ****************************************
(XEN)
I've seen similar problems related to Intel hardware virtualization, so I disabled all hardware virtualization as well as hyperthreading. Booting again.
edit at 17:00: the kernel was upgraded some time ago, but the system still crashes when we start guests. We can't get it to crash without starting guests, so we're grasping at straws here. Now that we have the good kernel in place, we're going to use the onboard e1000e rather than the USB adapter. Nick is en route to the data center.
original post:
We're going to update the kernel to the latest version (the old one we were using was built for our AMD MCP55 systems, and it's running on a modern Intel server right now).
(XEN) ----[ Xen-3.4.1 x86_64 debug=n Not tainted ]----
(XEN) CPU: 2
(XEN) RIP: e008:[<ffff828c801207d0>] compat_xen_version+0x3e0/0x420
[Sat Aug 6 14:14:12 2011](XEN) RFLAGS: 0000000000010082 CONTEXT: hypervisor
(XEN) rax: 00000000000003b8 rbx: ffff8308df024ce0 rcx: 0000000000000004
(XEN) rdx: 00002072659edae4 rsi: ffff830c3fdc7e88 rdi: ffff828c802855a0
(XEN) rbp: ffff8308df024cb0 rsp: ffff830c3fdc7ec8 r8: 0000000000012252
(XEN) r9: 0000000000000004 r10: 0000000000000005 r11: 0000000000000000
(XEN) r12: ffff8308df024d70 r13: 00002072659ec3c7 r14: 0000000000000000
(XEN) r15: ffff828c80221100 cr0: 000000008005003b cr4: 00000000000026f0
(XEN) cr3: 00000009030ae000 cr2: 00000000000000d4
(XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e010 cs: e008
(XEN) Xen stack trace from rsp=ffff830c3fdc7ec8:
[Sat Aug 6 14:14:12 2011](XEN) ffff82ec80173b66 ffff828c8025f900 000000008025e900 ffff830c3fdc7f28
(XEN) ffff828c8025e900 ffff828c8021f5b0 00002072659ec3c7 0000000000000000
(XEN) ffff828c80138ed7 0000000000002000 ffff8300bf23c000 ffff8300bf0e4000
(XEN) 0000000001301c00 ffffffff8057d160 ffffffff8057c520 ffffffffffffffff
(XEN) 0000000000631918 0000000000000000 0000000000000246 0000000000631918
(XEN) ffff880001939ee0 ffffffff805cbe48 0000000000000000 ffffffff802083aa
(XEN) ffffffff80553f28 0000000000000000 0000000000000001 0000010000000000
(XEN) ffffffff802083aa 000000000000e033 0000000000000246 ffffffff80553f10
(XEN) 000000000000e02b 0000000000000000 0000000000000000 0000000000000000
(XEN) 0000000000000000 0000000000000002 ffff8300bf23c000
[Sat Aug 6 14:14:12 2011](XEN) Xen call trace:
(XEN) [<ffff828c801207d0>] compat_xen_version+0x3e0/0x420
(XEN) [<ffff828c80138ed7>] idle_loop+0x47/0xa0
(XEN)
(XEN) Pagetable walk from 00000000000000d4:
(XEN) L4[0x000] = 00000009030e8067 000000000001f2cc
(XEN) L3[0x000] = 00000009030e7067 000000000001f2cd
(XEN) L2[0x000] = 0000000000000000 ffffffffffffffff
(XEN)
(XEN) ****************************************
[Sat Aug 6 14:14:12 2011](XEN) Panic on CPU 2:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0000]
(XEN) Faulting linear address: 00000000000000d4
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...
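
For anyone comparing the two panics above, here is a minimal sketch of pulling the panic reason and the call-trace function names out of a saved console log. It assumes the log looks like the dumps in this post (a "Xen call trace:" line followed by frame lines, and a "FATAL ..." line); the function name summarize_panic is made up for illustration and is not part of any Xen tooling.

```python
import re

# Matches call-trace frames such as:
#   (XEN) [<ffff82c480167617>] do_nmi_stats+0x27/0x110
FRAME_RE = re.compile(r"\[<([0-9a-f]+)>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")

# Matches the panic reason line, e.g. "(XEN) FATAL PAGE FAULT"
REASON_RE = re.compile(r"\(XEN\)\s+(FATAL .*)$")


def summarize_panic(log_text):
    """Return (reason, frames) for the first panic found in log_text.

    reason is the text of the first "FATAL ..." line (or None),
    frames is the list of function names listed under "Xen call trace:".
    """
    frames = []
    reason = None
    in_trace = False
    for line in log_text.splitlines():
        if "Xen call trace:" in line:
            in_trace = True
            continue
        if in_trace:
            m = FRAME_RE.search(line)
            if m:
                frames.append(m.group(2))
            else:
                # First non-frame line ends the call trace.
                in_trace = False
        m = REASON_RE.search(line)
        if m and reason is None:
            reason = m.group(1).strip()
    return reason, frames


if __name__ == "__main__":
    import sys
    # Usage (hypothetical file name): python summarize_panic.py console.log
    with open(sys.argv[1]) as f:
        reason, frames = summarize_panic(f.read())
    print("reason:", reason)
    print("trace: ", " <- ".join(frames))
```

Run against the two dumps in this post, this makes it easier to see that the crashes are in different code paths (an invalid opcode under do_mmu_update versus a page fault from the idle loop).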