I try to write for the Silver Peak blog on a regular basis, usually aiming for a weekly post…sometimes even making my deadline. Most of the time I write at the last minute, after spending a couple of days thinking about what I want to say. Last week I was asked to write about Hurricane Sandy and its impact on replication. The request was timely, because I already use Sandy as an example of the need for long-distance replication.
It is not uncommon for businesses in Manhattan to replicate to New Jersey, typically somewhere within synchronous distance. One of the interesting things about synchronous replication is the latency limitation. Latency is really a function of distance, the speed of light in fiber, and conversion from light to electricity for the purposes of switching and routing. Most storage vendors have a 5 ms limitation for synchronous replication. The funny thing about the 5 ms limitation is that it is mostly artificial. You can replicate synchronously around the world; the problem is that the application is held up until the final write complete is returned. Waiting for the final commit is the limiting factor on application performance, and is the reason that most vendors limit the supported latency to 5 ms.
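To put a rough number on why waiting for the final commit matters, here is a minimal Python sketch. It assumes a deliberately simple model that does not come from any vendor: writes are issued one at a time, each costs a fixed local write time, and each waits exactly one network round trip for the remote write complete. The function names and the 1 ms local write time are made up for illustration.

```python
def sync_write_latency_ms(local_write_ms: float, rtt_ms: float) -> float:
    """Time the application waits for one synchronous write.

    Assumption: one local write plus one full round trip to the
    remote array before the write complete comes back.
    """
    if local_write_ms < 0 or rtt_ms < 0:
        raise ValueError("latencies must be non-negative")
    return local_write_ms + rtt_ms


def max_serial_writes_per_second(local_write_ms: float, rtt_ms: float) -> float:
    """Upper bound on writes per second for a single serialized writer."""
    total_ms = sync_write_latency_ms(local_write_ms, rtt_ms)
    if total_ms == 0:
        raise ValueError("total latency must be greater than zero")
    return 1000.0 / total_ms


if __name__ == "__main__":
    # Hypothetical 1 ms local write, compared across round-trip times.
    for rtt in (0.0, 5.0, 150.0):
        rate = max_serial_writes_per_second(1.0, rtt)
        print(f"RTT {rtt:6.1f} ms: {rate:8.1f} writes/s")
```

Under those assumptions, going from a 5 ms round trip to a 150 ms round trip takes a single serialized writer from roughly 167 writes per second to fewer than 7. Real applications issue many writes in parallel, so treat this as an illustration of the trend rather than a prediction.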
We all learned in school that the speed of light is roughly 300,000 km (186,000 miles) per second, but that figure only holds in a vacuum. When you shoot light down a fiber optic cable it slows down, and you have to start taking the refractive index into account, and we start talking about core, cladding, and other things that most people have no interest in. Because of fiber speed, switching, and routing, the distance we can cover within a 5 ms limit on synchronous replication is much shorter than the vacuum speed of light alone would suggest. And it is this limitation that left a lot of businesses exposed when Hurricane Sandy came knocking on the data center door in Manhattan.
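As a back-of-the-envelope check on how far 5 ms actually gets you, here is a small Python sketch. It rests on assumptions that are mine rather than any vendor's: the 5 ms budget is treated as a round-trip figure, the refractive index of 1.468 is a typical value for single-mode fiber, and `equipment_delay_ms` is a hypothetical allowance for switching and routing.

```python
C_VACUUM_KM_PER_S = 299_792.458  # speed of light in a vacuum


def max_fiber_distance_km(
    rtt_budget_ms: float,
    refractive_index: float = 1.468,   # assumed typical single-mode fiber
    equipment_delay_ms: float = 0.0,   # assumed total switching/routing delay
) -> float:
    """One-way fiber length that fits inside a round-trip latency budget."""
    if refractive_index < 1.0:
        raise ValueError("refractive index cannot be below 1.0")
    # Whatever the equipment consumes is not available for propagation.
    propagation_ms = rtt_budget_ms - equipment_delay_ms
    if propagation_ms <= 0:
        return 0.0
    # Light travels at c / n inside the fiber.
    speed_km_per_ms = C_VACUUM_KM_PER_S / refractive_index / 1000.0
    # Halve the result because the signal has to go out and come back.
    return speed_km_per_ms * propagation_ms / 2.0


if __name__ == "__main__":
    print(f"vacuum:          {max_fiber_distance_km(5.0, refractive_index=1.0):.0f} km")
    print(f"fiber:           {max_fiber_distance_km(5.0):.0f} km")
    print(f"fiber + 1 ms hw: {max_fiber_distance_km(5.0, equipment_delay_ms=1.0):.0f} km")
```

With no equipment delay this works out to roughly 510 km of fiber one way, versus about 750 km if light traveled at its vacuum speed. Fiber routes are rarely straight lines and every switch and router eats into the budget, so the practical distance is shorter still.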
Sandy was a large storm; in fact it was the second-largest storm since detailed records started being kept in 1988. Sandy covered 560,000 square miles before landfall, and covered 943 miles of U.S. coastline when it came ashore on October 29th. With a storm that large, synchronous replication isn’t going to help you. If you are really lucky, your disaster recovery site might stay up and running for 15 minutes longer than your primary site. This is the problem that a lot of businesses faced: a flooded primary site and a flooded disaster recovery site. Having a synchronous copy of your data that is inaccessible doesn’t solve the problem of disaster recovery. With a large storm, or any other large disaster, distance is your friend.
When I was in New York recently, I heard the story of a business that was down for eight days after the storm. This business probably had a disaster recovery plan, and they probably had a copy of their critical data across the river in New Jersey. Eight days to recover isn’t bad if your customers don’t rely on you or expect you to be open for business. In the world of always on, always available, short attention spans, eight days is a lifetime for a business to be unavailable.
The outcome of all of this (the latency, the flooding, the lost data) is that most businesses have realized that synchronous replication is fine for general uptime and highly localized outages. But for true disaster recovery, data needs to be sent somewhere out of the region, preferably across the country or across an ocean. Of course this means asynchronous replication and the possibility of data loss, but a small amount of lost data is nothing compared to eight days of downtime.