Big Data: What Does It Really Mean?

By Sandeep Katakol

Did you know?

Facebook’s over 1 billion users have uploaded a staggering 240 billion photographs.

YouTube users upload 20 hours of new video every minute of the day.

There are 634 million websites (with 51 million being added to the web every year).

These are just a few examples; many more statistics like these could be cited. The point is that this is an ‘astronomical’ amount of data to store and process. And the story doesn’t end there: the data grows exponentially every day. A huge data explosion is under way.

IBM says, “Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone.”

According to estimates, the volume of business data worldwide, across all companies, doubles every 1.2 years.

Such huge volumes of data are created by pictures, videos, posts, likes and tweets on social sites, satellites, sensors, business transactions, scientific experiments (CERN produces 15 petabytes of data annually), GPS signals, CCTV footage, TV shows, news: almost everything we do. A new name has been coined for this data, “Big Data”.

Sheer volume does not define big data. It has other dimensions too.

Volume: As the name indicates, this dimension characterizes the sheer amount of data. A terabyte is no longer remarkable, even for an individual. Petabytes (1,000 TB), exabytes (1,000 PB) and even zettabytes (1,000 EB) are the new terms to learn.

The estimated size of the digital universe in 2011 was 1.8 zettabytes. It is predicted to grow 44-fold between 2009 and 2020, to 35 zettabytes.
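The scale of these units can be made concrete with a few lines of code. This is an illustrative sketch (the function and unit list are not from any particular library) using the decimal convention, where each unit is 1,000 times the previous one:

```python
# Decimal (SI) byte units: each step is a factor of 1,000.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def to_bytes(value, unit):
    """Convert a value expressed in `unit` into plain bytes."""
    return value * 1000 ** UNITS.index(unit)

# One petabyte is a quadrillion bytes...
print(to_bytes(1, "PB"))   # 1000000000000000
# ...and a zettabyte is a million times larger still.
print(to_bytes(1, "ZB"))   # 1000000000000000000000
```

At this scale even the unit names stop being intuitive, which is why the 1.8-zettabyte figure above is usually quoted as a multiple rather than a raw byte count.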

Velocity: The speed at which data is generated, stored and retrieved is an essential property of big data.

Walmart handles more than a million customer transactions every hour, feeding databases estimated at 2.5 petabytes (2.5 million gigabytes). Around 2.9 million emails are sent every second.

Variety: Big data is not just structured data arranged in rows and columns. It is both structured and unstructured, containing pictures, videos, bit streams, logs, big tables, machine-readable data, sensor data and more.

Data of such volume and variety cannot be handled by our traditional RDBMS systems. The constraints imposed by the ACID properties (atomicity, consistency, isolation, durability) introduce latency in data retrieval that becomes unbearable at big-data scale, and they also create scalability problems. So some newer databases give up one or more ACID properties to guarantee scalability, availability and performance. Facebook’s Cassandra, Amazon’s Dynamo and Hadoop-based databases relax strict consistency (guaranteeing only a weaker ‘eventual consistency’) in exchange for better availability and performance. For Facebook, for example, keeping a post’s like count accurate to the last microsecond matters less than making the post itself available to the user quickly. These kinds of database systems are called “NoSQL” (Not Only SQL) systems.
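The idea of eventual consistency can be sketched with a toy in-memory example. Everything here is illustrative: real systems such as Cassandra and Dynamo use quorums, vector clocks and anti-entropy protocols rather than this simple last-write-wins merge, but the observable behaviour — a stale read that converges after the replicas synchronize — is the same in spirit:

```python
# Toy sketch of eventual consistency (last-write-wins). Illustrative only;
# not how Cassandra or Dynamo are actually implemented.
import time

class Replica:
    def __init__(self):
        self.store = {}  # key -> (timestamp, value)

    def write(self, key, value):
        # Tag each write with a wall-clock timestamp for conflict resolution.
        self.store[key] = (time.time(), value)

    def read(self, key):
        entry = self.store.get(key)
        return entry[1] if entry else None

    def sync(self, other):
        """Merge state from another replica, keeping the newer version of each key."""
        for key, (ts, val) in other.store.items():
            if key not in self.store or self.store[key][0] < ts:
                self.store[key] = (ts, val)

a, b = Replica(), Replica()
a.write("likes:post42", 10)       # the write lands on replica a only
print(b.read("likes:post42"))     # None: b has not seen the write yet (a stale read)
b.sync(a)                         # background synchronization between replicas
print(b.read("likes:post42"))     # 10: the replicas have converged
```

The trade-off is exactly the one described above: between writing and syncing, a reader of replica `b` sees stale data, but both replicas stay available the whole time.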

Big data offers many benefits and numerous opportunities: market insights, event prediction, social media sentiment analysis, real-time fraud detection, customized web advertisements, tailored plans for specific customers and traffic management, to name a few. Big data analytics may uncover hidden patterns, unknown correlations and other information that provides a competitive advantage and can help companies make better business decisions.

An infographic related to this topic can be found here.