September 2010 Archives

22 guests rebooted on Mantle

Note that not all guests on Mantle were rebooted. Downtime was approximately 15 minutes. The cause was (my own) human error: I plugged the keyboard into the wrong server when I started the reboot. We were able to cancel the reboot before all guests went down, and then we brought the affected guests back up.

a discussion on the SLA

So, according to some metrics, we had 3 hours of downtime over the last two days. But it was spread over two days, so it really should count for more.

So, here's what svtix said about the matter:

In consideration of the downtime experienced in our SVTIX data center on September 13 and 14, I am crediting your account for three days of service. This will be applied to your current invoice.

Now, this seems to be how most of my competitors do it, too. At best, they give you a symbolic apology.

The thing is that if I had taken the SLA payout from my last network outage and, instead of giving those credits, spent the money on a new router and a secondary, redundant upstream, this problem would not have been a big deal at all. Customers would not have experienced downtime.

So yeah, while an SLA is a good way of estimating the cost of a problem and aligning the interests of the owner with the interests of the customers with respect to downtime, I think that when the company is in 'full growth' mode like prgmr.com is, it might hurt more than it helps, by removing some of the working capital that would otherwise have paid for infrastructure upgrades.
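
To put rough numbers on this, here is a small sketch of the arithmetic. The 3 hours of downtime and the three-day credit come from the post above; the 30-day billing month and the function names are my own assumptions for illustration, not anything from svtix's or prgmr.com's actual SLA terms.

```python
HOURS_PER_DAY = 24


def availability(downtime_hours, period_days=30):
    """Fraction of the period the service was up.

    period_days=30 is an assumed billing month, not a figure from any SLA.
    """
    total_hours = period_days * HOURS_PER_DAY
    return 1 - downtime_hours / total_hours


def credit_ratio(credit_days, downtime_hours):
    """How many hours of credit were given per hour of downtime."""
    return credit_days * HOURS_PER_DAY / downtime_hours


if __name__ == "__main__":
    downtime_hours = 3   # "3 hours of downtime", by some metrics
    credit_days = 3      # svtix credited three days of service

    print(f"availability over an assumed 30-day month: {availability(downtime_hours):.3%}")
    print(f"credit hours per downtime hour: {credit_ratio(credit_days, downtime_hours):.0f}")
```

Under those assumptions, a three-day credit for roughly three hours of downtime pays back about 24 times the time actually lost, which is the kind of money that could otherwise go toward a router or a second upstream.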

svtix upstream outage

The fiber optic line card for our provider EGI Hosting's connection at SVTIX is apparently still having problems. EGI is working it out with XO, who owns the line. Here is what they said:
We apologize for any inconvenience this might have caused.
There seems to be a problem with one of the XO devices that one of our transport lines into SVTIX is connected to.
XO moved our port to another line card yesterday as it was rebooting itself.
The new line card we were allocated today had frozen and needed a restart and thus is the reason for today's outage.
We have been working very closely with XO in the past two days and they think that today's issue is not related to yesterday's issue. They think that a card reload was needed in order to push all the recent configurations to the router, and they believe we should not be seeing this anymore. We have however asked XO to escalate today's event to their tier 3 support for further investigation.

This is the most up-to-date information we have on the situation. We certainly hope that XO have truly fixed the issue.
They should be getting back to us with a final confirmation later today or early tomorrow.

Please be advised that these issues are affecting every carrier in SVTIX that is connected through this XO device, it is not isolated to EGI's network only.
We will update this post again when we have more information.

NetBSD with Xen 4.0 (and Gentoo!)

Our 2 newest servers, mantle and chessboard, are running Xen 4.0.1, and a customer reported that the network driver in NetBSD as a domU doesn't work anymore. Apparently this is because the driver uses "flipping mode" instead of "copying mode", according to these posts on the NetBSD port-xen and xen-users mailing lists. The same customer also made this really nice wiki page about Gentoo as a domU, so they receive a free month. If anyone does want to run NetBSD, we can also find room for you on a dom0 with Xen 3, where NetBSD works fine.

knife reboot

Knife rebooted this morning. There was little or no downtime, but we also don't know what caused it. Our newer servers have their consoles logged with conserver, but knife is on our other console server, which doesn't have logging set up, so we probably missed an error on the serial console. We're planning to move the older console server to the conserver setup later this month to improve the logging.