By some strange coincidence, a few days after Matthew MacDonald Wallace's talk on change management at unified.diff, I found myself dealing with an update gone wrong.
My hosting provider does a pretty good job of keeping things updated regularly, and I get decent results when performing the occasional vulnerability scan against the server, so it's reasonably safe against the average script kiddie using Metasploit. The provider also responds personally to queries within minutes, which is exactly the sort of communication businesses need.
From what I gather, a recent update, probably developed for a different production system, took out the platform's control panel, SSH, email and other services for a few days. It was the first time this had happened in three years, and it came without warning.
You might expect I was stuck at this point. Actually no, because I anticipate this kind of thing, so I already had a redundant system, a continuity plan of sorts, and a fairly recent backup. I had carefully chosen a domain registrar that allows redirection to this blog and a fallback email system under the usual address, and everything was back up within an hour of swapping the nameservers. Best of all, it cost hardly anything to implement.
But this post isn't entirely about my domain. It's about the need for any cloud- or hosting-reliant business to ensure continuity. The lessons here are:
* Have a redundancy solution for everything that's business critical. You will depend on it some day.
* Choose a registrar that allows efficient domain administration, quick redirection, and anything else that eases switching between third-party providers in an emergency.
* Make sure updates are tested on the target platform before deployment. The production environment can (and should) be replicated in virtual machines beforehand.
* Make sure there’s good communication with third-party providers. This is particularly essential for businesses that outsource data management.
* Any changes should be documented, so problems can be traced and resolved without making things worse. If it’s not documented, it didn’t happen.
* Make sure data is backed up regularly.
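On that last point, regular backups don't need elaborate tooling to be useful; even a date-stamped copy run from cron would have covered my situation. Here's a minimal sketch of that idea, where the paths and the `backup_snapshot` function name are my own illustrative choices, not anything specific to my provider's setup:

```shell
#!/bin/sh
# Minimal daily-snapshot sketch (hypothetical paths): copy a source tree
# into a date-stamped directory so each day's backup stays separate.
backup_snapshot() {
    src="$1"                  # directory to back up, e.g. the site's webroot
    dest="$2"                 # where snapshots accumulate
    stamp=$(date +%Y-%m-%d)   # one snapshot per day
    mkdir -p "$dest/$stamp"
    cp -a "$src/." "$dest/$stamp/"
    echo "$dest/$stamp"       # print the snapshot path for logging
}
```

Dropping a call to something like this into a crontab entry, and rsyncing the snapshot directory to a second machine, gives you both the "backed up regularly" and the "redundancy" boxes ticked for very little effort.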