NYC Subway Railyard

Keeping the Metadata Moving: Why Big Data Depends on the WAN

We all know that Big Data (the use of new search and discovery technologies to extract value from huge volumes of information) is coming. We also know that many networks will have to be upgraded or enhanced to cope with it. But how many of us have a handle on just how much extra WAN loading it might bring?

Key concerns here are transient data and metadata — data about other data. That’s because Big Data is not about the static content we store and create, both voluntarily and involuntarily, but about the metadata that swirls around it. It’s not the photograph that matters, but the identities of the people pictured and the implied connections, and it’s not that you bought X, but that you did it after looking at Y and just before also buying Z.

You get some idea of the potential scale of this if you look behind the headlines of a new IDC study, sponsored by EMC and titled ‘Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East’. Most news reports have focused on its predictions that we will create and replicate 2.8ZB (that’s zettabytes, so 2.8 million million gigabytes) of data in 2012, up from 1.8ZB in 2011, and that our ‘digital universe’ will reach 40ZB by 2020.

That last figure represents “a 50-fold growth from the beginning of 2010”, and is more than 5TB for every person on the planet, says IDC.
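To make those numbers a little more tangible, here is a rough back-of-envelope sketch in Python. It assumes decimal (SI) units and a 2020 world population of about 7.6 billion; that population figure is my assumption, not one quoted from the study, and the function names are purely illustrative.

```python
# Back-of-envelope arithmetic for the headline figures quoted above.
# Units are decimal (SI): 1 ZB = 10**21 bytes.
GB = 10**9
TB = 10**12
ZB = 10**21

# ASSUMPTION: approximate world population in 2020 (not taken from the study).
ASSUMED_POPULATION_2020 = 7.6e9


def zb_to_gb(zettabytes: float) -> float:
    """Convert zettabytes to gigabytes."""
    return zettabytes * ZB / GB


def per_person_tb(total_zb: float, population: float) -> float:
    """Share of a data volume per person, in terabytes."""
    return total_zb * ZB / population / TB


def annual_growth(start: float, end: float, years: int) -> float:
    """Compound annual growth rate implied by two data points."""
    return (end / start) ** (1 / years) - 1


if __name__ == "__main__":
    print(f"2.8 ZB = {zb_to_gb(2.8):.2e} GB")
    print(f"40 ZB per person = {per_person_tb(40, ASSUMED_POPULATION_2020):.2f} TB")
    print(f"Implied growth 2012-2020 = {annual_growth(2.8, 40, 8):.1%} a year")
```

On those assumptions, 40ZB works out at a little over 5TB per head, in line with the study's claim, and getting from 2.8ZB in 2012 to 40ZB in 2020 implies compound growth of roughly 39 percent a year.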

Part of this is down to the Internet of Things: those industrial machines, medical devices, security cameras, embedded systems, sensors, and other automated gear that also generate data. These already generate 30 percent of the digital universe, but by 2020 they will account for more like 42 percent.

What should worry networkers, though, is that only a part of the digital universe is easily measurable data at rest. Elsewhere in the study, the authors note that much of it is transient: phone calls that are not recorded, digital TV images that are watched but not saved, packets temporarily stored in routers, digital surveillance images purged from memory when new images come in, and so on.

This is why, when University of Southern California researchers asked in 2011, ‘How much information is there in the world?’, they came up with a figure of 1.9ZB for the amount sent by humanity through broadcast technology alone, and that figure was for 2007! Two-way communications in 2007 (the year the first iPhone was released, and when Facebook climbed to 50 million users) added another 65 exabytes, or 65 million terabytes.

Smartphones are now widespread, of course, and Facebook has over a billion registered users. OK, we can ignore much of the broadcast aspect as it is analogue, but some of that transient digital data is bound to flow over corporate networks, because it is needed in order to put the static data to work. Indeed, the IDC researchers note that while 68 percent of the digital universe is created and consumed by consumers, enterprises have responsibility and liability for nearly 80 percent of the digital universe.

That’s because more than twice as much ambient information is created and stored about us as we as individuals create each year. This ambient information comprises medical scans, financial records, CCTV images and so on, plus, of course, the metadata generated about our own documents, photos and music downloads. Oh, and metadata is growing twice as fast as the digital universe as a whole.

So as well as having to be moved around in vast amounts, Big Data must also be protected and kept secure, both for regulatory and compliance reasons and simply to maintain customer trust. To put that in perspective, IDC reckons that 35 percent of the information in the digital universe needs protection, but that only 19 percent actually is protected. Big Data is therefore more than a storage and servers challenge: it will be just as big a challenge, if not bigger, for the network.

Big Data is not new: the basic ideas have been around for years in the likes of data warehousing, business intelligence, and so on. What is enabling it to become widespread is inexpensive storage, a proliferation of sensor and data capture technology, increasing connections to information via the cloud and virtualized storage infrastructures, and innovative software and analysis tools.

But none of that will work without the networking being in place — and since much of it crosses functional and organizational boundaries, that means the WAN. So if your organization wants to participate in Big Data, you had better make sure the WAN is included in its plans.

Image credit: joiseyshowaa (flickr)

About the author
Bryan Betts