Darkness into Light

Big Data Follows The Darkness

Are you following the darkness? No, not the glam rockers — the regular night-time variety. It used to be that people talked of following the sun, by which they meant passing tasks from site to site around the world — California to Singapore to Ireland, say — so there was always someone awake in their normal working day to handle support calls or work on the software project.

While most software applications aren’t too worried about daylight, Big Data is different. Tools such as Hadoop need processing and storage capacity, and lots of it, and the best place to find that is often in a data center where everyone has gone home for the night. Big Data analysis can also derive large subsets of data that need further processing, perhaps in combination with other datasets. The problem, of course, is moving those databases, subsets, or object stores from the daylight site to the night-time site.

These data movements come on top of the need to aggregate that data in the first place, which means moving it from collection points to a central analytics engine. Plus it must be replicated for protection, especially if the collected data is transient and cannot be recreated — and then of course there is the need to distribute the results of the analysis.

Fatter pipes are one potential solution, but moving huge data volumes is already difficult and expensive enough when the distances involved are merely national or continental. The ranges we are talking about for follow-the-darkness (or sun, or moon) are global. And at these distances, it is likely that your limiting factor is not bandwidth but the latency of the link. For instance, even if you have a 155 Mbps WAN, if it has 80 ms latency and 0.1% loss then you might only be able to send 10 Mbps of replication traffic over it, and adding more bandwidth will not bring any extra speed.
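The effect of latency and loss can be sketched with the well-known Mathis et al. single-flow TCP throughput approximation. The numbers below are illustrative only — real replication traffic depends on the protocol and tuning, so they will not match any particular product's figures exactly — but they show why distance, not bandwidth, dominates:

```python
import math

def tcp_throughput_mbps(mss_bytes: int, rtt_s: float, loss_rate: float) -> float:
    """Approximate achievable throughput (Mbps) for one TCP flow,
    using the Mathis approximation: (MSS / RTT) / sqrt(loss_rate)."""
    return (mss_bytes * 8 / rtt_s) / math.sqrt(loss_rate) / 1e6

# A long-haul link with 80 ms round-trip latency and 0.1% packet loss:
one_flow = tcp_throughput_mbps(mss_bytes=1460, rtt_s=0.080, loss_rate=0.001)
print(f"One TCP flow manages roughly {one_flow:.1f} Mbps")

# Halving the RTT doubles the estimate; adding raw link bandwidth changes nothing.
print(f"At 40 ms RTT: roughly {tcp_throughput_mbps(1460, 0.040, 0.001):.1f} Mbps")
```

Note that the nominal 155 Mbps link capacity never appears in the formula — which is exactly the point.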

There are a few things you can do to help solve this problem. One is of course WAN optimization: add a WAN appliance at each site, and your replication or move times will come down dramatically. It should also help a lot with Big Data tools such as Hadoop, because it is not unusual for the input data to contain a lot of duplication or redundancy which a WAN optimization appliance can compress down.

Thinking laterally, there is another possibility, which is to move your Big Data to the cloud. The question of when and where to process it then becomes somebody else’s problem. It does bring a host of other questions, though, from obvious ones such as security to the matter of how you will get your data into the cloud and your results out again. Indeed, uploading Big Data to the cloud through the Internet has been likened to filling a swimming pool through a drinking straw.
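To put the drinking-straw analogy in numbers, here is a quick back-of-the-envelope sketch of upload times. The dataset size, link speeds, and the 80% sustained-efficiency figure are all illustrative assumptions, not measurements from any provider:

```python
def transfer_days(data_tb: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Days needed to push data_tb terabytes over a link_mbps uplink,
    assuming only `efficiency` of the nominal rate is sustained."""
    bits = data_tb * 1e12 * 8
    seconds = bits / (link_mbps * 1e6 * efficiency)
    return seconds / 86_400

# A hypothetical 50 TB dataset over common uplink speeds:
for mbps in (100, 1_000, 10_000):
    print(f"{mbps:>6} Mbps: {transfer_days(50, mbps):6.1f} days")
```

Even at a full gigabit per second, a modest Big Data set takes the better part of a week to upload — before you have processed a single byte.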

And let’s face it, if you have multiple data centers each with spare night-time number-crunching capacity, you may as well utilize it. Follow the darkness…

Image credit: jasleen_kaur (flickr) – CC-BY-SA