February 2013 Archives

About the new data center setup

As you may have heard, we're moving data centers.  

Right now? We have two 3.84kW racks of prgmr.com/xen stuff at SVTIX, 250 Stockton in San Jose (we have 2x 1.9kW racks of quarter-rack co-lo there, too).

We have two 3.8kW racks in suite 1460 at 55 S. Market, rented through EGIHosting (they also rent us a 1 gigabit port with a 200Mbps commit, and a gigabit connection from 55 S. Market to 250 Stockton).

We have one 3.3kW rack in suite 1435 at 55 S. Market, direct with CoreSite.

(We also have two servers with RippleWeb in Sacramento.)

That's it.

So now we're moving to CoreSite Santa Clara. We're getting 4x 5kW racks there, and moving out of everything but the one 3.3kW rack at 55 S. Market. This will actually decrease our costs (slightly; not by a lot) and it will significantly increase our capacity.

This will also put us in Santa Clara, which means we can sell bandwidth to anyone on the SVP fiber ring.

We have an upcoming project with unixsurplus.com that will likely result in inexpensive dedicated servers (and, unlike the prgmr.com 'servers of opportunity' stuff, ones actually available in a timely manner).

We need to be out of the old space by 2013-05-13.  Emails will be going out this week to the users who will be moved.


Note, the new switching infrastructure should be considerably better; I'm moving to switches with 10GbE uplinks, so the problems we've had in the past with the network falling over when we went over 500Mbps will go away.  We've got a 10GbE Cogent port (5G commit) at CoreSite Santa Clara, 1GbE from CoreSite Santa Clara to 55 S. Market, and a 1GbE he.net port at 55 S. Market that we are keeping.  (There is actually some trouble moving the Cogent port... but I'm sure that will be worked out before it becomes a huge deal.)

Rehnquist is back up

It was smartctl crashing the box.

04:31 <@prgmrcom> looks like it's the -data  (while using a marvell sas card)
                  that causes the panic
04:33 <@prgmrcom> -d marvell doesn't work either, but no -d at all seems to be
                  okay on the command line.  lets see if it crashes the box.
04:34 <@prgmrcom> while technically incorrect (-d ata, that is)  it shouldn't
                  crash the thing, and it didn't crash the thing before the
                  upgrade.
04:35 <@prgmrcom> okay, it didn't crash.  booting xen, I guess.
04:42 <@prgmrcom> replacing disk too
04:45 < srn_prgmr> nb: it looks like smartd.conf should be edited to remove
                   explicit references to "-d" before performing a yum upgrade
04:47 < Bugged> that seems failure-prone
04:48 < srn_prgmr> Well, that's what was causing the crash
04:48 < srn_prgmr> I haven't been able to find a related ticket
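For reference, here is the kind of smartd.conf change srn_prgmr is describing. A minimal sketch, assuming a single Marvell-attached disk; the device path and the -a/-m directives are illustrative, not copied from rehnquist's actual config:

    # /etc/smartd.conf
    # before: an explicit -d type; ioctls issued for it panicked the
    # upgraded kernel
    /dev/sda -d ata -a -m root

    # after: no -d at all, so smartd autodetects the device type --
    # the same thing that turned out to be safe on the command line
    /dev/sda -a -m root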

Further breakage on rehnquist

Update:
<prgmrcom> [18:25:50] huh... is my system trying to use dmraid rather than mdadm?
<prgmrcom> [19:25:14] I am suspecting hardware.
<prgmrcom> [19:25:53] leaving
<ryk> [19:29:50] prgmrcom: the list got the email the 1st time
<E6Dev> [20:07:17] So is there any update on rehnquist then?
<prgmrcom> [21:14:58] ugh.
<prgmrcom> [21:15:02] I'm at the colo
<prgmrcom> [21:15:07] messing with rehnquist now.
<prgmrcom> [21:15:17] fucking spare hardware won't boot at all.
<prgmrcom> [21:15:35] so I'm reduced to in-field screwing around with hardware, always a bad idea
<prgmrcom> [21:15:44] always.

Starting puppet: [  OK  ]
Starting smartd: Unable to handle kernel paging request at 0000000000002e38 RIP:
 [<ffffffff880ddc75>] :libata:ata_find_dev+0x24/0x73
PGD 0
Oops: 0000 [1] SMP
last sysfs file: /devices/system/cpu/cpu0/topology/thread_siblings
CPU 7
Modules linked in: ipt_MASQUERADE iptable_nat ip_nat bridge lockd sunrpc cpufreq_ondemand powernow_k8 freq_table mperf ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i libcxgbi cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi dm_multipath scsi_dh video backlight sbs power_meter i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport sg k10temp i2c_nforce2 hwmon pcspkr serio_raw i2c_core amd64_edac_mod edac_mc e1000e tpm_tis tpm tpm_bios dm_snapshot dm_zero dm_mirror dm_log dm_mod sata_mv raid10 shpchp mvsas libsas libata scsi_transport_sas sd_mod scsi_mod raid1 ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 4051, comm: smartd Not tainted 2.6.18-348.1.1.el5 #1
RIP: 0010:[<ffffffff880ddc75>]  [<ffffffff880ddc75>] :libata:ata_find_dev+0x24/0x73
RSP: 0018:ffff81032428fcb0  EFLAGS: 00010286
RAX: 00000000000023f0 RBX: 00007fff449760f0 RCX: ffff810626c02418
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff810626c00028
RBP: ffff810627ffe000 R08: 0000000000000000 R09: ffff810332b708c0
R10: 0000000000000000 R11: 0000000000000000 R12: 00007fff449760f0
R13: 000000000000030d R14: ffff8106266b4680 R15: ffff81010b154858
FS:  00002b6cf1489b50(0000) GS:ffff810332acb440(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000002e38 CR3: 0000000325fc7000 CR4: 00000000000006e0
Process smartd (pid: 4051, threadinfo ffff81032428e000, task ffff810326e9e7e0)
Stack:  ffffffff880ddd7a 00007fff449760f0 ffffffff880e0100 0000000000000000
 00000000ffffffed 000000000000030d ffff810627ffe000 00007fff449760f0
 ffffffea0b154740 000000000000030d ffff810627ffe000 00007fff449760f0
Call Trace:
 [<ffffffff880ddd7a>] :libata:ata_scsi_find_dev+0x6/0x21
 [<ffffffff880e0100>] :libata:ata_scsi_ioctl+0x92/0x1b5
 [<ffffffff88085dd3>] :scsi_mod:scsi_ioctl+0x2cc/0x2f5
 [<ffffffff8014df26>] blkdev_driver_ioctl+0x5d/0x72
 [<ffffffff8014e577>] blkdev_ioctl+0x63c/0x697
 [<ffffffff800227f1>] __up_read+0x19/0x7f
 [<ffffffff800671cf>] do_page_fault+0x4cc/0x842
 [<ffffffff800e8e37>] block_ioctl+0x1b/0x1f
 [<ffffffff80042496>] do_ioctl+0x21/0x6b
 [<ffffffff800304e0>] vfs_ioctl+0x457/0x4b9
 [<ffffffff800baf21>] audit_syscall_entry+0x1a8/0x1d3
 [<ffffffff8004c89a>] sys_ioctl+0x59/0x78
 [<ffffffff8005d29e>] tracesys+0xd5/0xdf


Code: 48 3b 8a 38 2e 00 00 75 0b f6 42 18 01 b8 02 00 00 00 75 05
RIP  [<ffffffff880ddc75>] :libata:ata_find_dev+0x24/0x73
 RSP <ffff81032428fcb0>
CR2: 0000000000002e38
 <0>Kernel panic - not syncing: Fatal exception

Taft rebooted itself

Update: When it came back up, I tried to finish the update and it rebooted again.  So, to get the kernel update that should fix it, I booted into the non-Xen kernel and finished the update there.  I then rebooted the host to bring up the Xen kernel, and it hung while rebooting.  Luke is on the way to the data center to look at it.

Taft had a kernel panic and rebooted itself while I was installing updates with yum.  Guests are coming back up now.

PCI-DMA: Out of SW-IOMMU space for 4096 bytes at device 0000:04:00.0
mpt2sas0: chain buffers not available
PCI-DMA: Out of SW-IOMMU space for 4096 bytes at device 0000:04:00.0
mpt2sas0: chain buffers not available
Feb 14 18:05:34 taft kernel: PCI-DMA: Out of SW-IOMMU space for 4096 bytes at device 0000:04:00.0^M
Feb 14 18:05:34 taft kernel: mpt2sas0: chain buffers not available^M
Feb 14 18:05:34 taft kernel: PCI-DMA: Out of SW-IOMMU space for 4096 bytes at device 0000:04:00.0^M
[Thu Feb 14 18:05:34 2013]Feb 14 18:05:34 taft kernel: mpt2sas0: chain buffers not available^M
PCI-DMA: Out of SW-IOMMU space for 4096 bytes at device 0000:04:00.0
Unable to handle kernel paging request at ffff88002fc89010 RIP:
 [<ffffffff880e66ee>] :mpt2sas:_scsih_qcmd+0x476/0x6e4
PGD 13eb067 PUD 13ec067 PMD 156b067 PTE 0
Oops: 0000 [1] SMP
last sysfs file: /devices/system/cpu/cpu0/topology/physical_package_id
CPU 0
Modules linked in: xt_physdev ipt_MASQUERADE netloop netbk iptable_nat blktap ip_nat blkbk bridge lockd sunrpc ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables be2iscsi ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic ipv6 xfrm_nalgo crypto_api uio cxgb3i libcxgbi cxgb3 8021q libiscsi_tcp libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi loop dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi ac parport_pc lp parport joydev sg pcspkr e1000e i2c_i801 i2c_core serio_raw tpm_tis serial_core tpm tpm_bios dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ahci libata raid10 shpchp mpt2sas scsi_transport_sas sd_mod scsi_mod raid1 ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 3, comm: ksoftirqd/0 Not tainted 2.6.18-308.4.1.el5xen #1
[Thu Feb 14 18:05:50 2013]RIP: e030:[<ffffffff880e66ee>]  [<ffffffff880e66ee>] :mpt2sas:_scsih_qcmd+0x476/0x6e4
RSP: e02b:ffffffff8079bd38  EFLAGS: 00010002
RAX: ffff88003e9004f8 RBX: 0000000000000009 RCX: ffffffff880d7057
RDX: ffff88003e9004f8 RSI: 0000000014000030 RDI: ffff88003e009da8
RBP: ffff88002fc89000 R08: ffff880006624000 R09: 0000000000000000
R10: 00000008e8ccc000 R11: 0000000000000000 R12: ffff88000c676e00
R13: ffff88003e8f2978 R14: ffff88003e009db0 R15: 00000000fffa74fa
FS:  00002af38b4a0170(0000) GS:ffffffff80634000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000
Process ksoftirqd/0 (pid: 3, threadinfo ffff880006728000, task ffff88000670b7e0)
[Thu Feb 14 18:05:50 2013]Stack:  ffff88003e233c00  000007322f173fc0  ffff88003e9b9940  ffff88003e9004f8
 ffff88003e9b9900  d50000008023f468  fffffff494000000  0000000300000732
 140000000000000f  ffffffff8021d1d6
Call Trace:
 <IRQ>  [<ffffffff8021d1d6>] __mod_timer+0xff/0x10e
 [<ffffffff88084db2>] :scsi_mod:scsi_dispatch_cmd+0x2ac/0x366
 [<ffffffff8808a4de>] :scsi_mod:scsi_request_fn+0x2c7/0x39e
 [<ffffffff8025e5ba>] blk_run_queue+0x41/0x73
 [<ffffffff880893b5>] :scsi_mod:scsi_next_command+0x2d/0x39
 [<ffffffff88089536>] :scsi_mod:scsi_end_request+0xbf/0xcd
[Thu Feb 14 18:05:50 2013] [<ffffffff880896b4>] :scsi_mod:scsi_io_completion+0x170/0x329
 [<ffffffff880b67ce>] :sd_mod:sd_rw_intr+0x21e/0x258
 [<ffffffff8808992e>] :scsi_mod:scsi_device_unbusy+0x67/0x81
 [<ffffffff80238f47>] blk_done_softirq+0x67/0x75
 [<ffffffff80212eb8>] __do_softirq+0x8d/0x13b
 [<ffffffff8025fda4>] call_softirq+0x1c/0x278
 <EOI>  [<ffffffff8029132c>] ksoftirqd+0x0/0xbf
 [<ffffffff8026db89>] do_softirq+0x31/0x90
 [<ffffffff8029138b>] ksoftirqd+0x5f/0xbf
 [<ffffffff802338c6>] kthread+0xfe/0x132
[Thu Feb 14 18:05:50 2013] [<ffffffff8025fb2c>] child_rip+0xa/0x12
 [<ffffffff802337c8>] kthread+0x0/0x132
 [<ffffffff8025fb22>] child_rip+0x0/0x12


Code: 48 8b 55 10 48 8b 88 c0 03 00 00 75 06 8b 74 24 30 eb 04 8b
RIP  [<ffffffff880e66ee>] :mpt2sas:_scsih_qcmd+0x476/0x6e4
 RSP <ffffffff8079bd38>
CR2: ffff88002fc89010
 <0>Kernel panic - not syncing: Fatal exception
[Thu Feb 14 18:05:51 2013] (XEN) Domain 0 crashed: rebooting machine in 5 seconds.
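For what it's worth, the repeated "Out of SW-IOMMU space" messages mean the software-IOMMU bounce-buffer pool was exhausted before the panic. A possible mitigation (my assumption, not something we've verified fixes this) is to give dom0 a larger pool via the swiotlb= kernel parameter. A sketch against a CentOS 5 grub.conf; the root= value is made up, and the units of swiotlb= differ between kernel trees (slab count upstream, megabytes in some Xen-patched trees), so check the running kernel's Documentation/kernel-parameters.txt first:

    title CentOS (2.6.18-308.4.1.el5xen)
            root (hd0,0)
            kernel /xen.gz-2.6.18-308.4.1.el5
            # swiotlb=64 requests a larger bounce-buffer pool for dom0
            module /vmlinuz-2.6.18-308.4.1.el5xen ro root=/dev/md1 swiotlb=64
            module /initrd-2.6.18-308.4.1.el5xen.img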


Mailing Lists

prgmr.com now has mailing lists to help us better communicate with our users.

We now have announce, maintenance, and discuss lists.

We highly recommend that everyone sign up for the announce and maintenance lists so
they receive important announcements from us and notification of upcoming
maintenance.  We also offer the discuss list so that people can discuss whatever
topics they would like.

To subscribe to these lists, either go to http://lists.prgmr.com or email
listname-subscribe@lists.prgmr.com, where listname is "announce", "maintenance", or
"discuss" (for example, announce-subscribe@lists.prgmr.com).  To post to the discuss
list, send your mail to discuss@lists.prgmr.com.  The announce and maintenance lists
are moderated and may only be posted to by prgmr.com staff.

That 5 minute outage for some of you just now?  That was me trying to add another vlan to a trunk.

 switchport trunk allowed vlan <newvlan>  


when you already have a bunch of allowed vlans?  Bad idea.  Without the "add" keyword, that command replaces the trunk's entire allowed list with just the new vlan, cutting everyone else off; see the sketch below.
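The difference, on a Cisco-style switch (vlan 42 here stands in for the new vlan):

    ! destructive: the allowed list becomes only vlan 42
    switchport trunk allowed vlan 42

    ! additive: vlan 42 joins the existing allowed list
    switchport trunk allowed vlan add 42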

Fortunately, I hadn't saved the switch config, so one reboot of the switch later, we were back.   I'm sorry.