“Big Data” or just “a lot of data”?

Over the past few years, the bandwagons have kept rolling – full of shiny promises, covered in good words and often thoroughly confused in how they have been messaged.  From virtualization through cloud and now on to Big Data – end users only just begin to understand one when the next rolls into town.

And Big Data certainly seems to be confusing many.  The two words themselves may not be helping.  It is not really about “data”: it is more about information and intellectual property, which come in many different formats.  And it certainly isn’t about “big”: analyzing even small amounts of the right sort of information needs a change in how the underlying assets are dealt with.

Don’t get me wrong.  Big data is an issue – but not necessarily as many commentators see it.  For many, big data is all about how much data there is to be dealt with.  This is not (necessarily) a big data issue – it is a “lot of data” issue.  OK – I can see the brows furrowing as you read this – what is this guy up to, arguing semantics?

Look at it this way – if we take an organization such as CERN in Switzerland, it has databases containing petabytes of data collected from massive experiments like the Large Hadron Collider (which seeks to unravel the origins of the universe).  Note the key here – these are databases, not datasets (semantics again).

A database can be dealt with through simple evolution of database technology combined with data analytics and reporting tools – maybe using in-memory or flash-based approaches to speed things up, and more visual tools to make the analytics easier for the end user.  By contrast, a dataset is a collection of all the different data sources that I may need to investigate, aggregate, collate, analyze and report on to provide the information I need.

Let’s take a company like British Pathé as an example – it has massive archives of text, image, sound and video.  Sure, it could push all of these into a standard database as binary large objects (BLOBs) and apply SQL queries against them, but it wouldn’t be able to analyze effectively what it actually holds.

Enter Big Data in its true form.  This variety of data is a key aspect, requiring a different approach to how the data is dealt with.  For example, although much of the data will be perceived to be unstructured, it will actually be underpinned by structure, such as XML or even just a comma- or tab-delimited file.  Voice can be machine-transcribed into text, making analytics so much easier.  Video can be parsed using color, shape and texture analysis so that metadata identifying scenes can be added to the files.
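As a rough illustration of that underlying structure, the sketch below reads a comma-delimited snippet and an XML snippet into the same list of records using only the Python standard library.  The sample inputs, the field names (title, year) and the helper function names are all invented for the example; they are assumptions, not anything taken from a real archive.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample inputs; the field names are invented for illustration.
CSV_TEXT = "title,year\nNewsreel A,1936\nNewsreel B,1948\n"
XML_TEXT = "<archive><item><title>Newsreel C</title><year>1952</year></item></archive>"


def records_from_csv(text):
    """Read comma-delimited text into a list of dicts keyed by the header row."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def records_from_xml(text):
    """Turn each <item> element into a dict of its child tag names and text."""
    root = ET.fromstring(text)
    return [{child.tag: child.text for child in item} for item in root.iter("item")]


# Both sources end up in one uniform shape that can be analyzed together.
records = records_from_csv(CSV_TEXT) + records_from_xml(XML_TEXT)
for record in records:
    print(record["year"], record["title"])
```

Once the different sources share one shape like this, the same filtering and reporting logic can run over all of them.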

The use of emerging systems, such as Hadoop and NoSQL databases such as MongoDB or Cassandra, can also help.  Hadoop’s MapReduce can help with a large data problem in the un- or semi-structured space by reducing the volumes of data under examination, and can also help in deciding whether a particular piece of data is best dealt with in a NoSQL or a conventional SQL database, each having its own strengths and weaknesses.  Once the multiple different types of data are under control, analysis and reporting become so much easier.
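To make the map-and-reduce idea concrete, here is a minimal sketch of the pattern in plain Python.  It is not Hadoop’s API and does not run across a cluster; it only shows how many input records are collapsed into a small summary.  The sample items, the format labels and the function names are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical archive items; the format labels are invented for illustration.
ITEMS = [
    {"id": 1, "format": "video"},
    {"id": 2, "format": "text"},
    {"id": 3, "format": "video"},
    {"id": 4, "format": "sound"},
]


def map_phase(item):
    """Emit one (key, value) pair per input record."""
    yield item["format"], 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single summary value."""
    return key, sum(values)


pairs = [pair for item in ITEMS for pair in map_phase(item)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

In a real deployment the map and reduce steps would be spread over many machines, but the shape of the work is the same: a large pile of records goes in and a much smaller summary comes out.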

Big data can make a world of difference in how an organization deals with all the sources of information available to it.  Just don’t fall for the glib sales talk from vendors with too much vested interest – if a SQL vendor says they are all you need, or you hear the same from a NoSQL vendor, show them the door.  A hybrid solution, one that includes Hadoop to help refine the less structured data, is the way forward.

Image source: scjody (flickr) — CC-BY-SA