Saturday, July 04, 2009

Colotastrophe: The Day After

Our servers are a bunch of prima donnas. They demand to be pampered in the greatest colocation facility in the world (if you believe Fisher Plaza's own promotional video touting that fact), resting on pillows of AC and fed power in Waterford crystal goblets. We literally pay more for the 5 cabinets that house the servers* than we do for our entire Groundspeak office - and then some.

Around 5am Pacific today, all of our grumpy but lucid Groundspeak servers woke from their slumber to greet geocachers** who were, as one user wrote, scratching their arms in search of their next geocaching fix. Most were just happy to have the servers back online, but others were asking questions about disaster recovery and communication in a crisis. Instead of finger pointing, however cathartic that might be, I'd like to focus on what worked, what didn't, and how we can try to avert some of these issues if (and when) this happens again.

To set the stage, we have been hosted at Internap in Fisher Plaza since 2002, and in that time we have had only 2 significant events related directly to facility issues. The last issue lasted around 8 hours, while this one is, by far, the most significant downtime in the history of the web site: 29 hours in total. Unfortunately those 29 hours fell during the geocaching peak season, on the busiest weekend of the year and, to compound things, a day off from work for many. The Fates definitely conspired to pick the worst day to bring the Geocaching.com site down.

What Worked

The usefulness of Twitter and Facebook became obvious during this crisis. Our web servers and email servers were all located at Fisher Plaza, so we had very few options for posting updates and had to rely on outside systems to communicate with our community and our partners. I switched from Groundspeak email to my Gmail account, and my iPhone running Tweetie helped me get information out while I was "on the scene." By the end of the day I had added 800+ followers on Twitter, which, in the past, I had used as a toy for logging geocaching finds with my family and for the occasional Groundspeak update.

Also, although we didn't need the backups this time, we have daily backups of all our systems. Since this happened before our nightly backups occurred, it was close to the worst time for a data failure: at most we would have lost a day of data. In a catastrophic event this isn't a total Fail. It just sucks.

What Didn't Work

Although I won't finger point at the cause of this issue, I will point out that the Fisher Plaza staff provided no official communication to tenants about what the first responders at the scene had found. Many clients of the building were in the dark, both figuratively and literally, while we waited outside for news of what really happened. Instead we had to turn to Twitter to piece it together. Was it a fire? (yes) Did the sprinklers turn on? (yes) OMG! Our machines are fried! (no, just the generator) If someone with some authority had walked out of the building and told us what they knew, we could have passed that information on to our customers. Internap did a relatively good job of giving status updates, though they were sparse and sometimes repeated. I'd give Internap a C and Fisher Plaza an F for communication.

I'll be just as hard on us and say that we should get an F for communication preparedness. Although I think we did a good job of working around our own issues with Facebook and Twitter (and this blog), we were unable to make updates available on our web pages and our iPhone application. The reason some sites could do this and others could not is that our entire server infrastructure was in the Fisher Plaza basket; the other companies likely had better ways to fail over to a new location. Our only alternative, pointing DNS to another server, could actually have made it harder to get back online, since cached DNS records would leave many people pointing at the wrong machine after power was restored. Since we anticipated only a ~12-hour outage, it made no sense to do something that could take another 24 hours to correct for some users.
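The DNS trade-off above comes down to record TTLs: resolvers cache an answer for the length of the TTL, so a long TTL means any emergency repointing propagates slowly, and then takes just as long to undo. A minimal sketch of what a fail-over-friendly zone entry might look like (the host names use documentation IP ranges, and all names and addresses here are invented for illustration, not Groundspeak's actual configuration):

```
; Hypothetical BIND-style zone fragment - names and IPs are examples only.
; A short TTL (300 s = 5 min) lets a "sorry page" redirect take effect
; quickly, and lets traffic snap back once the primary site has power again.
$TTL 300
www     300   IN  A   192.0.2.10     ; primary site (example address)
; During an outage, repoint www at a status page hosted elsewhere:
; www   300   IN  A   198.51.100.7   ; temporary "sorry" server (example)
```

The catch, as the post notes, is that a short TTL has to be in place *before* the disaster; lowering it afterward doesn't help clients whose resolvers already cached the old record.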

What Next?

There are some obvious things to do to correct what didn't work, and some solutions that will require more thought. I'll highlight a couple of high-level things we'll consider and implement.

We're not a bank, so although 29 hours is a long time to be down, we do not plan to duplicate our infrastructure to make it completely redundant. It is just too expensive to make fiscal sense. Instead, we'll ensure that in the case of a catastrophic event we have the best backups and the best procedures for restoring those backups to a new system. We already have a good system, but we'll make it even better.
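The point above - that a backup is only as good as a tested restore onto a fresh machine - can be rehearsed as a small drill. A hedged sketch in Python (the file names and layout are invented for illustration; real backups here would be database dumps, not a single text file):

```python
import filecmp
import tarfile
import tempfile
from pathlib import Path


def backup_and_verify() -> bool:
    """Archive a 'live' data directory, restore it to a fresh location
    (as if to a brand-new server), and verify the copy byte-for-byte."""
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp)

        # Stand-in for the nightly backup source (hypothetical data).
        live = work / "live"
        live.mkdir()
        (live / "logs.db").write_text("cache log 2009-07-03\n")

        # The "nightly backup": a compressed archive of the live data.
        archive = work / "nightly.tar.gz"
        with tarfile.open(archive, "w:gz") as tar:
            tar.add(live, arcname="live")

        # The drill: restore onto a fresh path, simulating a new system.
        restore = work / "restore"
        restore.mkdir()
        with tarfile.open(archive) as tar:
            tar.extractall(restore)

        # Verify the restored data matches the original exactly.
        return filecmp.cmp(live / "logs.db",
                           restore / "live" / "logs.db",
                           shallow=False)


print("restore OK" if backup_and_verify() else "restore FAILED")
```

Running a drill like this on a schedule is what turns "we have backups" into "we have a tested recovery path."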

We'll have a better system for communicating with our customers, so those communication channels will be the focus of our redundancy planning. This includes rerouting web and email traffic. Even streaming my Twitter account on the front page of Geocaching.com would have been helpful for letting people know what was happening.

Lastly, we're going to create an official disaster recovery plan so everyone at Groundspeak knows what to do in a catastrophic event. We should always understand the worst case scenario and how to recover from it. We owe this to our customers.

For those in the US, have a Happy 4th of July! And thanks to everyone for your ongoing support of Groundspeak and the geocaching activity. From the Tweets and Facebook posts, it's clear you definitely enjoy geocaching. Now go out and find a cache!

* we're not using all of the cabinets at Internap yet but we're still paying for them

** although we also run Waymarking.com and Wherigo.com, the geocaching community is easily the largest and most vocal, so I'm focusing on them for the blog. I know everyone else is just as excited to see our other sites back online.

22 comments:

Absoblogginlutely! said...

Thanks for all your hard work this weekend. Thankfully my pocket queries were sent just before the fire ;-)
It raised some interesting questions, a lot of which you seem to have answered, and also raised some questions in my mind about how the company I work for and my clients would react in a similar situation.

You mentioned that your backups occurred at the worst possible time during the day - are you going to take steps to change this? Can you afford to lose a day's worth of data? I bet there would be some pretty mad people if a day's worth of logs, photos and tracks were deleted.

As to the web rerouting - couldn't you set a short TTL on the DNS and point to a blog or Twitter account that would tell people how to refresh their DNS once the servers were back up?

Bill said...

Jeremy,

Thank you for all that you have done to keep those affected by the Fisher Plaza fire updated. You and others on Twitter were our lifeline. I am in Florida. My hosting company (dotster.com) is evidently in the Plaza and was shut down, thus shutting my web sites and e-mail down. Yet the only way I could discover what was happening to them, and to my web sites, was you and Twitter.

I guess I am now a full fan of Twitter.

Your blog also helped.

Thanks

burgi dad said...

Jeremy,

Thanks for the update on what happened and your plans for any future events.. I definitely understand not going for 100% redundancy due to the costs.. The Facebook updates worked very well for me, and it sounded like you had everything in place you could once given the okay on power and A/C..

I hope your servers are protected by halon and not water.. That expense is well worth it..

Keep up the good work Mr. Pres / co-owner.. Your customers really do appreciate it..

Dennis B
(Burgi Dad)

Pavel said...

Well, the good news is that once you have a DR plan, you're most likely not going to need it :)

Furthermore, I'd like to point out that Fisher Plaza as a colocation center sucks big time. Which genius decided to use sprinklers in generator and server rooms? I don't know what the hosting prices are in the US, but down here in the EU I can afford quite a nice geographical cluster for the simple webhosting I'm running as a side job (although I don't have 7 racks full of servers:)

Anyway, I'm glad that the gc.com is back online, all the other details are now irrelevant...

Scott said...

My .02 is that even though you may not have been happy with Groundspeak's communication to its customers, as a customer of GC.com and Groundspeak I was more than happy with the continuing coverage by the GC frontman. Nice job, all of you.

Steve said...

I have to say that you did one of the best jobs I have ever seen in providing updates to the community. It goes a long way toward keeping us all calm.

waypointazoid said...

A lot of us are not that informed on all that goes into maintaining gc.com. As the simple webmaster of a small geocaching association, from the beginning I was impressed with how you and Elias did everything you could to let everyone know why they couldn't access the site.

I knew there'd be those without iPhones and other media access who would have no idea why their busiest geocaching day was going down the tubes. I stayed with it all night and kept posting to our website an on-the-fly account of what was happening. I couldn't have done it if you had not stayed with us on Twitter, Jeremy.

On a site that usually gets maybe 50 hits on a given day, suddenly myscsg.com was getting thousands. Other companies affected by the outage were posting a link to the SCSG site on Twitter. As far as I know, there was nowhere else to go to get a coherent chronology of why this was happening. I suddenly found myself and our site needed! I stayed with it all night posting your tweets and any other info I could gather, and was genuinely impressed with how you used what you had to get the word out.

I know I can say with confidence that you and Elias did more communicating to your customers with less available equipment than this layman could have ever dreamed possible, and I know I can speak for all the simple geocachers out here when I say we are pleased and impressed with your efforts! Y'all saved the holiday! You guys rock!

OneMoreBeer said...

Thanks for this, Jeremy...I'm one of your 800+ new followers on Twitter. At the end of the day this was not a life-and-death situation, but the fact that you and Elias were on the scene PDQ demonstrates your professionalism.

We're just about to take our annual vacation, and my wife will be furious that I can now Twitter on my mobile...the PQ will be set to run the night before ;)

geoid: 83192

Jens L. said...

Good job. I am lucky to be able to read it here, so far away in Berlin.

Slideshow Bob said...

Although I sympathize with the unfortunate timing, I'd just like to represent the masses and say I knew nothing at all of what was going on, apart from it being evident that a major issue had occurred.
As a professional DBA I am surprised that you don't back up transaction logs on a periodic basis, allowing point-in-time recovery. Maybe something you could look into?
I was certainly very glad I maintain an off-line database.
Hope you were able to enjoy the holiday to some extent.
Bob.

Jeremy said...

Hi,
I agree that a DNS entry with a short TTL, say 15 minutes, pointing to a web server hosting a sorry page would have been better. Also, you might consider an entry for the API record so that apps would return something meaningful too. I do highly available IT infrastructure for a living, so I'd be happy to provide free advice to you guys about what your options are and the associated costs.

Greg said...

Absoblogginlutely! You'll need to re-read what Jeremy said: "Since this happened before our nightly backups occurred it was close to the worst time for a data failure."

That means the disaster happened at the worst possible time, not the backups. You can't schedule disasters, which means you can't schedule a backup around one. The disaster happening at the worst possible time does not mean there was a better time to schedule the backup.

frankbroughton said...

Jeremy,

I have an idea for ya to mull over. The hosting company I use for my sites has a blog set up in a different data center than the main equipment that hosts its forums and client section. This blog, among other things, is used during emergency downtime. It's only been used once that I recall, as their forum and other items are now on failover.

Groundspeak could do the same thing and use this blog as an avenue for announcing site updates and such. Use the blog instead of the thread in the forum for the site update announcements. The weekly newsletter could even be posted on the blog, along with other things. Keep it simple so you do not have to host it on a dedicated server.

A link to the blog could even be set up from the current forum thread. vBulletin allows a thread to be a link; I suppose IPB does too.

Customers would get used to checking the blog, and thus it would be perfect for future emergencies (hopefully never needed, but in reality it will be).

Just priming the pump with ideas.

Thanks for the yeoman job of keeping us updated yesterday. You did very well. I grade you an A.

Frank Broughton

mtn-man said...

Thanks for the excellent recap and honest assessment. I am sorry that you and Elias lost your holiday. I like FrankBroughton's idea of a separate blog and posting the weekly email and information that is posted in the announcements section of the forums. When people get used to it, it would be a great location for an emergency update.

Take a day off next week guys! If it makes you feel any better, I am working this entire weekend and did not get home last night until 2:40 AM. Same tomorrow.

schnider said...

I will be the first to say that I really appreciate all that Jeremy and his wonderful staff do for us. That being said, even though some of us feel like we would die without geocaching, it isn't a life or death situation. I'd say that since I started in 2003, Groundspeak's uptime has been phenomenal! I am sure there are some cachers out there who feel Groundspeak needs to install a fully robust, immediately switching, all-the-bells-and-whistles, super-duper back-up system. But I know I survived without it for 29 hours. I don't think suicide ever popped into my head one single time.

Thanks again Geocaching staff!

javapgmr said...

Jeremy,

As far as I'm concerned you did what needed to be done. You can always 'what if' yourself to death on DR exercises and still miss something. You've learned some valuable lessons. I never really saw the need for Twitter before, but I was one of the 800+ who started following you just to know the status. I can see how it can really help at times like these. Again, you've learned something, no one was hurt, and the world didn't end.

You and your team all did a great job. I trust that you've got some things to think about.

Hockeyhick said...

It goes without saying that those of us that were networking your twitter and facebook updates throughout the night with other geocaching group sites helped spread the word to the starving fun-seekers out there. Thanks for your firm's dedication and like mtn-man said....take a few days off...you all deserve it!

AntonD said...

That is the way to go. Learn from the situation and the mistakes made, take one on the chin, fix it and move on.
Glad to hear that all ended well.

geotrowel said...

Just to say thanks to everyone involved for getting things up-and-running so quickly. I wasn't a huge fan of Twitter before this happened, although I had an account, but this event has made me a fan! Thanks for all your updates, Jeremy; it was great to be kept informed as to what was occurring, and I was able to keep friends who aren't twitterers in the loop, too.

Thanks again!

Javelot said...

Hi Jeremy,
Many thanks for all your very good job during this difficult moment. I would give you an A for communication.
Regards and Happy Geocaching

flame-red said...

Jeremy,

Thanks for letting us know what happened. Also, thanks for doing all you can to make geocaching enjoyable for all. Some things, while foreseeable, are not always preventable.

HAPPY GEOCACHING!!!

rheingauer said...

This wasn't your failure. Just keep this great idea running. Today I renewed my premium account ;)

Cheers ;) Jan aka rheingauer