Feb 6, 2015
Business continuity is the new Holy Grail. Where once people looked to ‘rapid’ backup and restore capabilities to recover from a disaster, this is no longer good enough in many cases. Backup windows are commonly too short to move large data sets from one place to another, and the very act of recovering that data can take far too long – during which time, the business is hemorrhaging. A new kind of business continuity architecture gives IT the ability to improve uptime even over long distances.
Although many vendors out there interchange the terms disaster recovery (DR) and business continuity (BC), there are major differences between them in real terms. DR is all about how rapidly you can regain business capabilities after an IT disaster, whereas BC is all about how you can retain business capabilities during an IT disaster.
What is needed for BC? There are the obvious things: spare capacity in hardware and software, along with managing the data. Then there is connectivity and performance – and this is where things can get a little sticky. A major disaster could be caused by something very local – a fire or similar – where BC can be carried out through mirroring across a short distance to a second facility. What happens when the disaster is larger, though – maybe an earthquake, a flood or a major public issue, such as a state of emergency being declared? You may then be looking at the need to mirror to a facility many hundreds or even thousands of miles away.
In itself, this may not be a problem – hardware and software are the same pretty much anywhere, and virtualization has made setting up cost-effective mirroring possible. It is the data where problems tend to come in – and where a different approach may be required.
Mirroring data across a long distance has been an issue for many for a long time. Highly transactional data can run into all sorts of problems when being replicated, with latency and jitter causing the data to be too easily corrupted. For live, synchronous replication, the need for each packet to be acknowledged by the receiving end can also cause problems – and these get bigger the longer the distances involved.
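To put rough numbers on why distance hurts synchronous replication, here is a back-of-the-envelope sketch. It assumes light travels through optical fibre at about 200,000 km per second, and it ignores routing, switching and the fact that cables rarely follow a straight line, so real figures will be worse. The function names and the distances are illustrative only.

```python
# Assumption: signal speed in optical fibre is roughly 200,000 km/s
# (about two-thirds of the speed of light in a vacuum).
FIBRE_KM_PER_SECOND = 200_000.0


def min_round_trip_ms(distance_km: float) -> float:
    """Best-case time for a write to reach the remote site and be acknowledged."""
    return 2 * distance_km / FIBRE_KM_PER_SECOND * 1000


def max_serial_writes_per_second(distance_km: float) -> float:
    """Upper bound on writes per second if each one waits for its acknowledgement."""
    return 1000 / min_round_trip_ms(distance_km)


# Illustrative distances: a nearby site, a regional site, an ocean crossing.
for km in (100, 1_000, 6_000):
    rtt = min_round_trip_ms(km)
    rate = max_serial_writes_per_second(km)
    print(f"{km:>5} km: at least {rtt:.1f} ms per acknowledged write, "
          f"at most {rate:.0f} strictly serial writes per second")
```

Even in this idealized picture, a write that must be acknowledged from 6,000 km away costs tens of milliseconds before the application can move on, which is why vendors tend to quote limits in the hundreds of miles.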
When looking to vendors that specialize in BC, many will talk about maximum distances in the hundreds of miles – some even less. If you really want to deal with the possibility of global continuity, hundreds of miles is not good enough. Many vendors will then shift the discussion to a purely asynchronous approach: the data is stored locally and is trickled off to the remote facility as soon as possible in a way that doesn’t need the timeliness of a synchronous approach.
Should a major disaster occur, however, any data still waiting for an asynchronous transfer could also be lost – and the business then has to try to figure out what data has been lost.
There is a way of combining both approaches to gain global BC, though. A synchronous link can be set up from one facility to another, and then from that facility to another – a daisy-chain approach to get your data as far away from your facility as possible in almost real time. The facilities involved may not need to be expensive – it may be that all that is required is for data appliances to be installed in your offices around the world, or housed in a low-cost co-location facility. Each copy will not be an exact copy of the previous one – each link will, by the very nature of physical laws, introduce a degree of time lag, which means that what you are doing is “near synchronous”, not fully synchronous.
This is where a transaction index needs to be used. By indexing the transactions (i.e. not transferring all the data, but just a log of the bare metadata around it), a very small audit log can be created and pushed through a near-synchronous or asynchronous link in almost real time, ahead of the actual data. Therefore, you can see what data was on the way when the disaster occurred by looking at the index – and can then carry out business recovery procedures to contact those who are affected by the data losses.
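As a minimal sketch of the idea, assume each index entry carries only a transaction ID, an owner to contact and the payload size, and that the remote replica records which payloads have actually arrived. All of the names here (`IndexEntry`, `Replica`, `in_flight`) are hypothetical and not taken from any vendor's product.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class IndexEntry:
    """Bare metadata about a transaction; deliberately excludes the payload."""
    txn_id: int
    owner: str        # who to contact if this transaction is lost
    size_bytes: int   # size of the payload that follows later


@dataclass
class Replica:
    """Hypothetical remote copy that receives index entries ahead of the data."""
    index: list = field(default_factory=list)
    applied: set = field(default_factory=set)

    def receive_index(self, entry: IndexEntry) -> None:
        self.index.append(entry)

    def receive_data(self, txn_id: int) -> None:
        self.applied.add(txn_id)

    def in_flight(self) -> list:
        """Transactions the index knows about whose data never arrived."""
        return [e for e in self.index if e.txn_id not in self.applied]


replica = Replica()
for entry in (
    IndexEntry(1, "alice@example.com", 4_096),
    IndexEntry(2, "bob@example.com", 1_048_576),
    IndexEntry(3, "carol@example.com", 512),
):
    replica.receive_index(entry)

# Only the first payload made it across before the primary site was lost.
replica.receive_data(1)

for entry in replica.in_flight():
    print(f"Transaction {entry.txn_id} lost in transit; contact {entry.owner}")
```

The point of the design is that the index entries are tiny compared with the payloads, so they stand a much better chance of reaching the remote site before a disaster cuts the link.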
Even with the use of near-synchronous replicas, at some stage, the data will hit a physical location where synchronous is no longer an option – for example, going from the East Coast of the US over to Europe, or from the West Coast to Asia.
This is where asynchronous comes in: to get the data across a few thousand miles of sea means that a trickle approach is still the best way of managing this.
It would take a very major disaster to hit all your near-synchronous appliances at the same time. The idea is that should your main facility be destroyed, you can fail over as required to any of the mirrors that you have created. Should there be any reason why you need to fail over to a different geography, then you need to fall back to the asynchronous replica – and unless you know exactly what has happened, you are not really better off than previously: you still don’t know what data was lost while it was waiting for asynchronous transfer. At this point, that transaction index log becomes all important.
Business continuity has to be a strategic focus for any IT team. Depending on backup and recovery just isn’t enough any longer.
If your business is global, then plan for a global continuity network now.