June 2011 Archives

crock hung approx. 12 hours ago

I rebooted it just now and it's back.    I will take steps to prevent myself from sleeping through outages like this.

I believe the crash was related to the "BUG: soft lockup" problem that has been plaguing us for some time.  I am testing a new kernel that I hope will solve this problem.

 http://xenbits.xen.org/people/mayoung/testing/SRPMS/
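For anyone curious, getting one of those test kernels onto a dom0 is roughly the usual SRPM rebuild; this is just a sketch, and the package filename and output path below are placeholders, not the actual package we are testing:

wget http://xenbits.xen.org/people/mayoung/testing/SRPMS/kernel-2.6.32.xx.src.rpm   # placeholder filename
rpmbuild --rebuild kernel-2.6.32.xx.src.rpm        # builds binary RPMs from the source RPM
rpm -ivh RPMS/x86_64/kernel-*.rpm                  # output path varies by distro; -i installs alongside the running kernel, then reboot into it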

edit: please note, this is unrelated to the wiki.xen.prgmr.com outage

crock hung approx. 12 hours ago

I rebooted it just now and it's back.   Sorry.   We will be giving people a credit of 25% of a month for this outage.

server move tonight

affected customers should have gotten email; this is our status update.  Downtime is about to begin for knife, council, and cauldron.

edit at 00:57:  customers are coming back up now.   If you wish to log in to your console server and manually start your server, you can do so now; otherwise your box should be started within the next 20 minutes or so.

edit at 01:09: knife is completely up

edit at 01:14: council and cauldron are both up.  

also note, we won't be moving any more servers for at least 48 hours.   

lozenges crash

Lozenges crashed this morning after we went home from moving some of the other servers to Market Post Tower. It looks like a bug in Xen 4.0.1:
 (XEN) Xen BUG at page_alloc.c:1204
(XEN) ----[ Xen-4.0.1  x86_64  debug=n  Not tainted ]----
(XEN) CPU:    0
(XEN) RIP:    e008:[<ffff82c4801159f2>] free_domheap_pages+0x1f2/0x380
(XEN) RFLAGS: 0000000000010206   CONTEXT: hypervisor
(XEN) rax: 007fffffffffffff   rbx: ffff83007bfd0000   rcx: 0000000000000000
(XEN) rdx: ffff82f600a7f660   rsi: 0000000000000000   rdi: ffff83007bfd0014
(XEN) rbp: ffff830051d13000   rsp: ffff82c48035fa58   r8:  0000000000000000
(XEN) r9:  0000000000000000   r10: ffff83007bfd0018   r11: 0000000000000000
[Wed Jun 22 11:49:16 2011](XEN) r12: 0000000000000001   r13: ffff82f600a7f660   r14: 0000000000000000
(XEN) r15: ffff83007bfd0014   cr0: 000000008005003b   cr4: 00000000000006f0
(XEN) cr3: 000000031280c000   cr2: ffff8800369c0358
(XEN) ds: 0000   es: 0000   fs: 0063   gs: 0000   ss: e010   cs: e008
(XEN) Xen stack trace from rsp=ffff82c48035fa58:
(XEN)    000000017bfd0000 ffff830051d12f08 ffff830051d13000 ffff83007bfd0000
(XEN)    ffff83007bfd0000 0000000000000000 ffff830000000000 ffff82c48015ce25
(XEN)    0000000002150c70 000000003e2c7d38 0000000000000156 1400000000000001
(XEN)    ffff82f600a3a240 1400000000000001 ffff82c48035ff28 0000000000000000
(XEN)    ffff830000000000 ffff82c48015d179 0000000100000246 ffff82f600a3a240
[Wed Jun 22 11:49:16 2011](XEN)    000000000007ea58 ffff82f600fd4b00 ffff83007bfd0000 ffff83007ea58000
(XEN)    ffff82c48035ff28 ffff82c48015c951 0000000000000052 000000000007ea58
(XEN)    ffff82f600fd4b00 ffff83007bfd0000 ffff83007ea58000 ffff82c48015cbef
(XEN)    0000000100db89c0 0000000000000000 0000000000000156 2400000000000001
(XEN)    ffff82f600fd4b00 2400000000000001 ffff82c48035ff28 0000000000000001
(XEN)    ffff830000000000 ffff82c48015d179 0000000100000242 ffff82f600fd4b00
(XEN)    ffff83007ea59000 ffff82f600fd4b20 0000000000000000 000000000007ea59
(XEN)    ffff83007ea59000 ffff82c48015d987 0000000000000000 ffff83007ea59000
(XEN)    ffff82f600fd4b20 ffff82c48015cd79 0000000100000000 00000000000534b0
(XEN)    ffff83007bfd0000 3400000000000001 ffff82f600fd4b20 3400000000000001
[Wed Jun 22 11:49:16 2011](XEN)    ffff82c48035ff28 0000000000000001 ffff830000000000 ffff82c48015d179
(XEN)    ffff82f600a7bcc0 ffff82f600fd4b20 0000000000000000 ffff82f600fd4e80
(XEN)    000000000007ea74 ffff83007bfd0000 ffff83007ea74000 ffff82c48015d825
(XEN)    0000000000000001 0000000000000140 0000000000000000 ffff82c48015cb41
(XEN)    0000000100a7bcc0 00000000ffffffe0 0000000000000156 4400000000000001
(XEN) Xen call trace:
(XEN)    [<ffff82c4801159f2>] free_domheap_pages+0x1f2/0x380
(XEN)    [<ffff82c48015ce25>] free_page_type+0x4c5/0x670
(XEN)    [<ffff82c48015d179>] __put_page_type+0x1a9/0x290
(XEN)    [<ffff82c48015c951>] put_page_from_l2e+0xe1/0xf0
[Wed Jun 22 11:49:16 2011](XEN)    [<ffff82c48015cbef>] free_page_type+0x28f/0x670
(XEN)    [<ffff82c48015d179>] __put_page_type+0x1a9/0x290
(XEN)    [<ffff82c48015d987>] put_page_from_l3e+0x157/0x170
(XEN)    [<ffff82c48015cd79>] free_page_type+0x419/0x670
(XEN)    [<ffff82c48015d179>] __put_page_type+0x1a9/0x290
(XEN)    [<ffff82c48015d825>] put_page_from_l4e+0xd5/0xe0
(XEN)    [<ffff82c48015cb41>] free_page_type+0x1e1/0x670
(XEN)    [<ffff82c48015d179>] __put_page_type+0x1a9/0x290
(XEN)    [<ffff82c48014be85>] relinquish_memory+0x1e5/0x500
(XEN)    [<ffff82c48014c64d>] domain_relinquish_resources+0x1ad/0x280
[Wed Jun 22 11:49:16 2011](XEN)    [<ffff82c480106250>] domain_kill+0x80/0xf0
(XEN)    [<ffff82c48010430e>] do_domctl+0x1be/0xff0
(XEN)    [<ffff82c48011bc70>] get_cpu_idle_time+0x20/0x30
(XEN)    [<ffff82c4801e5169>] syscall_enter+0xa9/0xae
(XEN)   
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) Xen BUG at page_alloc.c:1204
(XEN) ****************************************
[Wed Jun 22 11:49:16 2011](XEN)
(XEN) Reboot in five seconds...
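(For reference: we capture output like the above over the serial console, since the box reboots itself.  On a dom0 that is still up, something like the following shows the hypervisor's message buffer and the running Xen version; "xm" is the Xen 4.0-era toolstack command.)

xm dmesg | tail -50       # the Xen message buffer, where traces like this land
xm info | grep -i xen_    # xen_major / xen_minor / xen_extra show the running hypervisor version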
So hopefully it's fixed in 4.0.2, which just came out. Meanwhile, it looks like everybody is back up, but we will probably not put new customers on lozenges for a while. Please email support@prgmr.com if you are still having trouble. Thanks.

network outage this morning

right now, my upstream believes the problem is not my equipment;  they gave me a 30-minute ETA, but it sounds like it's too early to tell what the problem is.


As of this moment, we believe the problem is unrelated to the server move, but like I said, this is very early in the process and it's possible that's wrong.  I will update in 30.
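(For those following along: the usual way we tell "our gear" from "upstream" is to trace from a host inside the rack to somewhere outside and see where the path dies.  A rough sketch; the target address below is just an arbitrary external host, not anything we specifically test against.)

mtr --report --report-cycles 10 8.8.8.8   # per-hop loss/latency summary
traceroute 8.8.8.8                        # quick look at where the path stops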

update: 08:06  we appear to be back.   I will update when we know more about what the actual problem was.  

update: 8:25:  bowl and cerberus are still down, working on it...  bowl and cerberus are related to last night's move.


update: 9:57

my upstream says:
"Appologies,
There have been multiple links affected by this fiber cut,
It is still in progress of being repaired, however the primarly link to SVTIX
+should be working now.

We are still waiting for details on the cause of the disruption.
"


update at 2011-06-25:
" Sorry for the delay, the lates news we have is "I can tell you right now that
+there was a fiber cut due to construction excavating to put in a new water
+main. ", and we already request a RFO for this outage, and it take 7-10
+business day for them to process.
"
We are planning to move the servers apples, cerberus, bowl and branch on Tuesday evening this week (June 21, 2011) to the new data center. Expect the downtime to start sometime after 9PM PDT, and everything should be back up by midnight. Branch needs a new disk also, so we are going to take it down earlier and rebuild the RAID with the new disk before the move. It will be down starting at 7PM PDT, and when it finishes rebuilding we will start shutting down the other three servers.
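For the curious, the disk swap on branch is the standard md mirror procedure, roughly the following; the array and device names here are illustrative, not branch's actual layout:

mdadm /dev/md1 --fail /dev/sdb2 --remove /dev/sdb2   # drop the failing member out of the mirror
# swap the physical drive and partition it to match the old one
mdadm /dev/md1 --add /dev/sdc2                       # add the new member; the rebuild starts immediately
cat /proc/mdstat                                     # watch rebuild progress and the ETA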

If that goes well, hopefully we can move the second group of servers from the eliteweb cabinet on Wednesday night (knife, cauldron and council). See http://book.xen.prgmr.com/mediawiki/index.php/EGI_Moving
If you have any questions, please email support@prgmr.com. Replies might be delayed if there is a backlog of support email from the earlier move, or just regular tickets, to get through. Thanks!

Update: We're starting later for the first move. Now it will start more like midnight or after.

Update 00:08  -  we are rebooting and rebuilding branch before the move.  Expect branch downtime to start in a few minutes, and downtime for the other servers to start in maybe three hours.

Update 00:35: Branch is down and the RAID is rebuilding.     nobody else is rebooting yet.

Update 05:54:  branch is done rebuilding; apples, cerberus, and bowl are down for the move.

Update: 07:29:  all servers are coming back up, but all network connectivity is down.  At this point, my provider seems to think the problem is somewhere in between and not on my network, but it's too early to say.

Update: 08:26:  Cerberus doesn't want to boot because I removed a bad drive (a spare had already replaced it in the RAID).  I put the drive back and it boots.    Bowl is also down; I don't know what that problem is yet.

Update: 08:38: bowl and branch are both starting xendomains.  The problem with bowl was that an upgrade blew out the menu.lst file; the problem with branch was that we had incorrectly set "power on after power failure" to off.
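(For reference, the Xen entry in a GRUB legacy menu.lst looks roughly like the stanza below; the paths, versions, and root device are illustrative rather than bowl's actual configuration.  If an upgrade rewrites the file without these lines, the dom0 doesn't come up under Xen.)

title Xen / dom0 kernel
        root (hd0,0)
        kernel /boot/xen.gz dom0_mem=1024M
        module /boot/vmlinuz-2.6.18-xen ro root=/dev/md0 console=tty0
        module /boot/initrd-2.6.18-xen.img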

update: 08:51:  all users should now be up and running.  complain otherwise.   




Halter has 2 failed disks. They are on different mirrors, so there is no data loss, but it really needs to be taken care of. Because halter doesn't have an AHCI SATA chipset, we need to reboot it to detect the new disks. We will check on VPSes that don't boot up by themselves, but if you still have problems email support@prgmr.com. Thanks! -Nick

we are attempting to fix this with the next kernel revision.  If nothing else, we'll be cycling that hardware out fairly soon anyhow.

edit at 1:01 :  guest domains are going down now.

edit at 2:09:   drives are replaced, rebuilding.   We will bring customers online in approx. an hour.  
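(The output below is just cat /proc/mdstat off halter's console; mdadm's detail view gives the same picture per array, e.g. for md1 as shown below:)

cat /proc/mdstat          # overall state and rebuild progress for every md array
mdadm --detail /dev/md1   # per-array view: members, which slot is rebuilding, percent complete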


Personalities : [raid1]
md1 : active raid1 sdc2[2] sdb2[1]
      477901504 blocks [2/1] [_U]
      [==>..................]  recovery = 13.4% (64153152/477901504) finish=65.6min speed=105075K/sec

md2 : active raid1 sda2[2] sdd2[1]
      477901504 blocks [2/1] [_U]
      [==>..................]  recovery = 11.2% (53639488/477901504) finish=75.6min speed=93412K/sec
 
this is one of the old servers that uses two RAID1s striped together with LVM rather than a single RAID10.  RAID10 rebuilds /much/ faster under load than the striped RAID1 setup we used on halter and all older servers.  Considering that the thing lost two disks in as many weeks, we will keep it down until it's done rebuilding.  (The rebuild would take 10x-15x as long if we did it while customers were online.)
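(Part of the "much slower under load" behavior is md's rebuild throttling: the rebuild is only guaranteed a floor of speed_limit_min KB/s when there is competing I/O, and can run up to speed_limit_max when the array is otherwise idle, which is roughly what you see with the guests shut down.  A sketch of the knobs involved, with an illustrative value:)

cat /proc/sys/dev/raid/speed_limit_min    # guaranteed rebuild rate under load, in KB/s (default 1000)
cat /proc/sys/dev/raid/speed_limit_max    # rebuild rate cap when the array is idle (default 200000)
echo 50000 > /proc/sys/dev/raid/speed_limit_min   # example: push the rebuild along even with guests running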


edit at 2:27:  ignore my approx an hour comment.  current mdstat output:

                                                                                
Personalities : [raid1]
md1 : active raid1 sdc2[2] sdb2[1]
      477901504 blocks [2/1] [_U]
      [======>..............]  recovery = 34.2% (163703680/477901504) finish=93.9min speed=55756K/sec

md2 : active raid1 sda2[2] sdd2[1]
      477901504 blocks [2/1] [_U]
      [======>..............]  recovery = 31.0% (148422272/477901504) finish=98.2min speed=55877K/sec


edit at 4:43:  the disks are finally done rebuilding and customers are coming back up.  

edit at 4:48:  halter is back, all users on halter are back.  

we need to give all of you a credit for this.

cauldron crash

cauldron crashed three hours ago.  It's coming back up now.  Investigation to follow.


BUG: soft lockup detected on CPU#0!

[Wed Jun  1 15:55:45 2011]Call Trace:
 <IRQ> [<ffffffff8025758a>] softlockup_tick+0xce/0xe0
 [<ffffffff8020df48>] timer_interrupt+0x3a0/0x3fa
 [<ffffffff80257874>] handle_IRQ_event+0x4e/0x96
 [<ffffffff80257960>] __do_IRQ+0xa4/0x105
 [<ffffffff8020bd5c>] do_IRQ+0x44/0x4d
 [<ffffffff8034c980>] evtchn_do_upcall+0x19e/0x250
 [<ffffffff80209d8e>] do_hypervisor_callback+0x1e/0x2c
 <EOI> [<ffffffff803581ea>] show_rd_sect+0x0/0x68
 [<ffffffff802ebbfc>] __read_lock_failed+0x8/0x14
[Wed Jun  1 15:55:45 2011] [<ffffffff80343f3e>] get_device+0x17/0x20
 [<ffffffff803fc3fd>] .text.lock.spinlock+0x53/0x8a
 [<ffffffff80358211>] show_rd_sect+0x27/0x68
 [<ffffffff802bc351>] sysfs_read_file+0xa5/0x12e
 [<ffffffff8027e3f5>] vfs_read+0xcb/0x171
 [<ffffffff8027e7d4>] sys_read+0x45/0x6e
 [<ffffffff802097b2>] tracesys+0xab/0xb5


edit:  all users on cauldron are up and back online.