Dec 28, 2012
First, let me get rid of a bee in my bonnet. At a recent round table, while many argued that business continuity was just one aspect of disaster recovery, I argued that business continuity and disaster recovery were two completely different things. To me, disaster recovery starts from the premise that a catastrophic failure will inevitably disrupt the business and focuses on getting things up and running again as quickly as possible — preferably before the business fails. On the other hand, business continuity means designing systems that will survive failure of the components, by deploying redundant systems — preferably automatically and immediately — thus allowing the business to keep going with minimal disruption.
Now that’s off my chest, let’s look at what this means at a technology platform level. Disaster recovery can be dealt with by storing images of applications and backups of data off-site, which then can be loaded back onto new equipment as and when needed — simple enough, in theory. In practice, both the work involved and the discipline required present formidable operational challenges, and the impact on the business can mean that most of an organization’s prospects and customers decide to go elsewhere, and then never return.
Business continuity needs to deliver the capability for workloads, applications, and data to switch from one physical environment to another in as close to real time as possible. For true business continuity, this will need to be done across as much distance as possible, so that the business can survive an incident that takes out a whole region, such as flood, earthquake, or civil unrest. Every aspect has to be “mirrored” — there have to be copies of apps and live data running somewhere else, ready to take over should the working system become unusable for any reason. This “master” and “slave” approach has been used for years, but not always with success.
Long distances bring in issues for data. Running an app alongside its associated data works well at data center speeds, but trying to replicate all transactional data over a WAN can lead to jitter, data collisions, and packet loss, all of which have to be detected and remediated in real time. WAN acceleration technologies can help here by minimizing the volume of data being pushed down the pipe.
The failure of the master system may not be the result of equipment failure or a natural disaster, but may be caused by data corruption. Direct mirroring with zero intelligence means that the data corruption is also mirrored; you now have two dead systems. So some form of store-and-forward mechanism is required. Because a log of transactions is kept between the systems, this also makes it possible to measure which transactions (if any) were lost between the failure of the master and the cutting in of the slave system.
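To make the store-and-forward idea concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than any product's API: the names StoreAndForwardLog, Standby, and checksum are hypothetical, and a SHA-256 digest stands in for whatever integrity check a real system would use. Transactions are logged with a digest, forwarded to the standby only if they still match it, and whatever remains in the log is what would be lost at failover.

```python
import hashlib
import json
from dataclasses import dataclass, field


def checksum(payload: dict) -> str:
    """Digest of a payload, used here as a simple integrity check."""
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()


@dataclass
class Transaction:
    seq: int
    payload: dict
    digest: str


@dataclass
class Standby:
    """Hypothetical standby ("slave") system that replays forwarded transactions."""
    data: dict = field(default_factory=dict)
    last_applied: int = 0

    def apply(self, txn: Transaction) -> None:
        self.data.update(txn.payload)
        self.last_applied = txn.seq


@dataclass
class StoreAndForwardLog:
    """Hypothetical transaction log that sits between master and standby."""
    pending: list = field(default_factory=list)
    next_seq: int = 1

    def record(self, payload: dict) -> Transaction:
        """Store a transaction from the master along with its digest."""
        txn = Transaction(self.next_seq, payload, checksum(payload))
        self.pending.append(txn)
        self.next_seq += 1
        return txn

    def forward(self, standby: Standby, limit=None) -> int:
        """Forward pending transactions in order; stop at the first corrupt one."""
        applied = 0
        while self.pending and (limit is None or applied < limit):
            txn = self.pending[0]
            if checksum(txn.payload) != txn.digest:
                break  # corruption detected: do not mirror it to the standby
            standby.apply(txn)
            self.pending.pop(0)
            applied += 1
        return applied

    def unreplicated(self) -> list:
        """Sequence numbers that would be lost if the master failed right now."""
        return [txn.seq for txn in self.pending]
```

In this sketch, calling unreplicated() at the moment of failover gives the measure of what was lost, and a transaction that fails its integrity check stays in the log instead of being copied across.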
Lastly, there is the “elegance” with which a system fails over. Historically, this required special software to monitor the “heartbeat” of the master and slave systems, along with proprietary means of managing state, the IP addresses of servers and clients, logical unit numbers (LUNs) for storage, and so on, so that users do not have to do anything themselves to switch over from the master to the slave system.
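The heartbeat part of that arrangement can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how any particular clustering product works: the HeartbeatMonitor class and its timeout value are hypothetical, timestamps are passed in explicitly to keep the example deterministic, and the hard parts mentioned above (state, IP addresses, LUNs) are left out entirely.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Hypothetical monitor that promotes the slave when the master goes quiet."""
    timeout: float            # seconds of silence tolerated before failing over
    last_beat: float = 0.0    # time of the most recent heartbeat from the master
    active: str = "master"    # which system is currently serving users

    def beat(self, now: float) -> None:
        """Record a heartbeat received from the master at time `now`."""
        self.last_beat = now

    def check(self, now: float) -> str:
        """Fail over to the slave if the master has been silent for too long."""
        if self.active == "master" and now - self.last_beat > self.timeout:
            self.active = "slave"
        return self.active


monitor = HeartbeatMonitor(timeout=3.0)
monitor.beat(10.0)
print(monitor.check(12.0))   # master is still within the timeout
print(monitor.check(13.5))   # silence exceeded the timeout, so the slave takes over
```

One design choice worth noting: the sketch never switches back to the master automatically, on the assumption that a recovered master should be checked before it is trusted again.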
With virtualization becoming more mainstream, such failover is now far easier to manage, and cloud computing is further helping to abstract the application and data from the physical equipment.
If you want to be wedded to a backup and restore disaster recovery approach, you could become famous as the person who helped put the organization out of business. If you want to be the White Knight, then make sure you push for a real business continuity approach.
Image credit: Alan Cleaver (flickr)