September 2011 Archives

network issues at he.net fremont 1

my upstream at he.net is having some DoS-related problems.  see http://status.liscolo.net/

debian mirror

Our current debian images are configured to use mirrors.kernel.org in /etc/apt/sources.list for package updates. It's normally a reliable server, but since it's down right now, we have set up mirrors.prgmr.com, a VPS running apt-cacher-ng, to serve debian packages for now. Eventually we plan to set up a dedicated server with full debian, centos and ubuntu mirrors, but this will help while kernel.org is down and until we get the hardware for the full mirror. To use mirrors.prgmr.com, set this in /etc/apt/sources.list:
deb http://mirrors.prgmr.com/debian/ squeeze main
deb-src http://mirrors.prgmr.com/debian/ squeeze main

deb http://security.debian.org/ squeeze/updates main
deb-src http://security.debian.org/ squeeze/updates main
Or just search and replace mirrors.kernel.org with mirrors.prgmr.com. If you are running debian lenny, make sure your sources.list still says lenny rather than squeeze until you are ready to upgrade. Email support@prgmr.com if you have any questions. Thanks!
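
If you'd rather script the search and replace, something like this should do it (a quick sketch; it assumes the stock sources.list pointing at mirrors.kernel.org and that you are running it as root):

# keep a backup, then point sources.list at mirrors.prgmr.com
cp /etc/apt/sources.list /etc/apt/sources.list.bak
sed -i 's/mirrors\.kernel\.org/mirrors.prgmr.com/g' /etc/apt/sources.list
# refresh the package lists from the new mirror
apt-get update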

soft lockup on crock

we will do something about this, I promise.

for now, though, rebooting it. 

BUG: soft lockup detected on CPU#0!

[Thu Sep 22 19:41:27 2011]Call Trace:
 <IRQ> [<ffffffff8025758a>] softlockup_tick+0xce/0xe0
 [<ffffffff8020df48>] timer_interrupt+0x3a0/0x3fa
 [<ffffffff80257874>] handle_IRQ_event+0x4e/0x96
 [<ffffffff80257960>] __do_IRQ+0xa4/0x105
 [<ffffffff8020bd5c>] do_IRQ+0x44/0x4d
 [<ffffffff8034c980>] evtchn_do_upcall+0x19e/0x250
 [<ffffffff80209d8e>] do_hypervisor_callback+0x1e/0x2c
 <EOI> [<ffffffff803581ea>] show_rd_sect+0x0/0x68
 [<ffffffff802ebbf9>] __read_lock_failed+0x5/0x14
[Thu Sep 22 19:41:27 2011] [<ffffffff80343f3e>] get_device+0x17/0x20
 [<ffffffff803fc3fd>] .text.lock.spinlock+0x53/0x8a
 [<ffffffff80358211>] show_rd_sect+0x27/0x68
 [<ffffffff802bc351>] sysfs_read_file+0xa5/0x12e
 [<ffffffff8027e3f5>] vfs_read+0xcb/0x171
 [<ffffffff8027e7d4>] sys_read+0x45/0x6e
 [<ffffffff802097b2>] tracesys+0xab/0xb5


In the process of moving the xen dom0s from SVTIX to Market Post Tower, we have been using sphinx.prgmr.com for routing at both data centers. We now have a router set up at Market Post Tower as well (shinjuku.prgmr.com), and on Friday night we plan to move the gateway addresses for the xen subnets to the new router.

The downtime for this should be separate for each subnet, and should only last a few minutes. Once that is done, we will move the address that our upstream routes to over to shinjuku and route the remaining subnets for colocation back to sphinx at SVTIX. This will mean downtime for everyone at SVTIX and Market Post Tower, and should also only take a few minutes. We will then reboot shinjuku to make sure all the settings are saved correctly.
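
For the curious, the per-subnet part of a gateway move like this boils down to roughly the following on the Linux side (a sketch only; 203.0.113.1/24 and eth0 below are example values, not our actual addresses or interfaces):

# on shinjuku: bring up the subnet's gateway address (example values only)
ip addr add 203.0.113.1/24 dev eth0
# send gratuitous ARP so machines on the subnet update their ARP caches quickly
arping -U -I eth0 -c 3 203.0.113.1
# on sphinx: remove the old copy of the gateway address once traffic has shifted
ip addr del 203.0.113.1/24 dev eth0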

After this is all working, we plan to start using BGP at Market Post Tower, and eventually at SVTIX as well. We're also planning on adding transit providers at both data centers, which should help with the kind of problems the XO leased line caused earlier this year. Thanks, and please email support@prgmr.com if you have any questions.

we plan on starting this at 21:00, and only the network will be disturbed; all guests will remain unmolested (save for a short network outage).


This went badly.  Shinjuku didn't come back up with the new configuration, so we rolled back to the old router.  Sorry.

crock soft lockup.

rebooting now

BUG: soft lockup detected on CPU#0!

[Thu Sep 15 09:24:24 2011]Call Trace:
 <IRQ> [<ffffffff8025758a>] softlockup_tick+0xce/0xe0
 [<ffffffff8020df48>] timer_interrupt+0x3a0/0x3fa
 [<ffffffff80257874>] handle_IRQ_event+0x4e/0x96
 [<ffffffff80257960>] __do_IRQ+0xa4/0x105
 [<ffffffff8020bd5c>] do_IRQ+0x44/0x4d
 [<ffffffff8034c980>] evtchn_do_upcall+0x19e/0x250
 [<ffffffff80209d8e>] do_hypervisor_callback+0x1e/0x2c
 <EOI> [<ffffffff803581ea>] show_rd_sect+0x0/0x68
 [<ffffffff802ebbfc>] __read_lock_failed+0x8/0x14
[Thu Sep 15 09:24:24 2011] [<ffffffff80343f3e>] get_device+0x17/0x20
 [<ffffffff803fc3fd>] .text.lock.spinlock+0x53/0x8a
 [<ffffffff80358211>] show_rd_sect+0x27/0x68
 [<ffffffff802bc351>] sysfs_read_file+0xa5/0x12e
 [<ffffffff8027e3f5>] vfs_read+0xcb/0x171
 [<ffffffff8027e7d4>] sys_read+0x45/0x6e
 [<ffffffff802097b2>] tracesys+0xab/0xb5


Crock is back up now, as are all users on crock.

bull is down again after the move

It looks like bull went down around 11:30 Pacific Time on Tuesday, with no errors on the console. I suspect a power supply or other hardware failure, because there was a loud pop when we first turned it on. Trying again a few minutes later worked, but resetting the power port now doesn't help. Luke is going to try putting its disks in a newer system.

We will update here when it's working again, and users on bull will also get a free month. Please email support@prgmr.com if you have any questions. Thanks.

edit at 01:35 by Luke

I've got a spare PSU, and if that doesn't work I have a whole spare server that is largely identical to bull; I'll just swap the disks and be done with it. 


edit at 02:42 by Luke:

I swapped the PSU and bull is coming back online.

edit at 02:51 by Luke:

all customers on bull are back up.


Robe, Mares and Bull moving tonight

Bull is going down first and coming up last, 'cause it needs to rebuild its disks before moving.

Bull is going down now;  the others will start going down once we prep MPT to receive the servers. 

edit at 23:23:  robe and mares going down now.

edit at 05:25:  both bull and mares are old servers that had weird disk configurations, and lots of bad disks.   I think it's all sorted out and you should be coming up now.

edit at 06:09:  everyone should be back.  ugh.  sorry.

all users were notified the other night.   we're shutting down now, 21:51, 51 minutes behind schedule, but everything looks good so far.

edit at 23:45:  all dom0 servers except halter are coming back up.  not all guests are up yet, but the dom0s are up and guests are starting.  halter is still down.


edit at 23:57:  sword, chariot, horn and seashell are all up, and all guests on them are up.  Halter is still down due to networking mistakes.

edit at 00:24:  halter is back up, but not all guests are back up yet.

edit at 00:30:  all guests should be back up again.

clarification to privacy policy

diff privacy.txt.2011090200 privacy.txt
18,20d17
< I won't be releasing names to ARIN of my existing customers until 
< after I email everyone a notice.  


This email went out some time ago.  Everyone who was an existing customer before that version of privacy.txt went up has long since been notified.

If anyone still wants to opt out, set your company name to "WITHHOLD FROM ARIN" or just email us and we'll set it up.  We don't update ARIN very often; we have done so once so far, after sending that message and waiting for opt-outs.


I am still irritated with ARIN about this, but what are you going to do?