
Big Data Basics

 


This term has surely gotten a lot of hype in the past few years. Yet for some time now, everyone has been aware of the hype while remaining unaware of what the term actually means. Let us change that in this blog. Everybody just loves to talk about how Big Data is the future, how Fortune 500 companies are leveraging it to make better decisions, and so on.
What exactly is Big Data? Well, it is an electro-pop band from Brooklyn (I am not kidding!). Music aside, the first logical thought that comes to mind is that it is data that is BIG! But then, how much data is enough to be considered BIG? The threshold for the "bigness" of data is not constant. Decades ago a gigabyte was big, then a terabyte, then a petabyte, and so on. This will keep on changing.
The point is, volume is not the only aspect that defines Big Data.

4 Vs of Big Data



Volume: Big Data involves huge amounts of data, in the magnitude of terabytes and petabytes. This is because, thanks to digitization, data is now captured almost everywhere and anywhere.

Variety: Data comes in all kinds of formats these days: structured, in the form of the good ol' databases, and unstructured, in the form of text documents, email, video, and audio. Merging and managing such different forms is another aspect of Big Data.

Velocity: The speed at which data is generated by different systems, like point-of-sale (POS) terminals, sensors, the internet, etc., is tremendous. Effectively handling this velocity is one more aspect of Big Data.

Veracity: Such a large amount of data surely isn't clean. There are noise and abnormalities in the data, and this uncertainty is exactly what veracity in Big Data refers to.

Ok. So now we have our data, or rather our "Big Data". What do we do with it? The first term that gets thrown around is Predictive Analytics!
Now, what does that cool-sounding term mean?


Predictive Analytics


Let us take the example of Netflix. Data is generated by users who register on Netflix: what the user clicks, what the user watches, and what the user bookmarks is all captured. And of course, there is the huge catalogue of content that Netflix streams to the user as video. An application of predictive analytics is merging this data to predict what the user will like next and generate a suggestion list!
Sounds simple, doesn't it? But what goes on behind the scenes is a lot of data crunching using complex statistical methods and data models.
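
To make this concrete, here is a minimal sketch of the idea in Python. The users, titles, and scoring rule are made up purely for illustration; real recommenders use far more sophisticated statistical models.

    from collections import Counter

    # Toy watch histories: user -> set of titles (made-up data).
    history = {
        "ana":   {"Stranger Things", "Dark", "Black Mirror"},
        "ben":   {"Dark", "Black Mirror", "Mindhunter"},
        "chloe": {"Stranger Things", "Mindhunter"},
    }

    def recommend(user, history, top_n=2):
        """Suggest unseen titles watched by users with overlapping taste."""
        seen = history[user]
        scores = Counter()
        for other, titles in history.items():
            if other == user or not (titles & seen):
                continue  # skip the user themselves and non-overlapping users
            for title in titles - seen:
                scores[title] += 1  # each co-watcher casts one 'vote'
        return [title for title, _ in scores.most_common(top_n)]

    print(recommend("ana", history))  # ['Mindhunter']

Every user who shares a title with you votes for the titles you have not seen yet. At Netflix's scale, even this simple counting becomes a Big Data problem.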

So much data, and so much processing of data, surely can't be done by standalone computers, of course! Enter Distributed Computing.


Distributed Computing

Imagine a complex job that is beyond the capacity of a single person. To get it done, the job can be broken down into simple tasks and distributed to different people. The outcome of each task can then be combined to get the desired job done.
This analogy applies to distributed computing too. When a complex analysis is to be performed on a large volume of data, the job can be divided and delegated to a grid of connected computers with good computational power. Each computer is called a node. These nodes process the tasks given to them, and their outcomes are combined. This is an effective way to apply complex predictive analytics algorithms to Big Data.
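
For a minimal taste of this in code, here is a Python sketch that uses one machine's worker processes as stand-ins for the nodes of a cluster: the job is split into chunks, each worker handles one chunk, and the partial outcomes are combined.

    from multiprocessing import Pool

    def task(chunk):
        """The 'simple task' each node performs: sum the squares in its chunk."""
        return sum(x * x for x in chunk)

    if __name__ == "__main__":
        data = list(range(1_000_000))
        # Divide the job into 4 chunks, one per worker ('node').
        chunks = [data[i::4] for i in range(4)]
        with Pool(processes=4) as pool:
            partial = pool.map(task, chunks)  # delegate the tasks
        print(sum(partial))                   # combine the outcomes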


Hadoop



The word Hadoop is almost synonymous with Big Data because of how often the two are mentioned together. But of course, Hadoop is just a technology, one that was created when the data on the web started exploding and went beyond the ability of traditional systems to handle it. In simple words, Hadoop is a new way to store and process data. It enables distributed computing on huge amounts of data across clusters of inexpensive servers that both store and process the data. The storage part of Hadoop is called the Hadoop Distributed File System (HDFS) and the processing part is called MapReduce.


HDFS

Suppose we have 10 terabytes of data. HDFS will split such large files into blocks and spread them across multiple computers. At the same time, these distributed blocks are also replicated, so that the data remains available even if a node fails.
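
A rough Python sketch of the bookkeeping involved (the 128 MB block size and replication factor of 3 are common HDFS defaults; the 40-node cluster is invented):

    FILE_SIZE_MB = 10 * 1024 * 1024   # 10 terabytes expressed in megabytes
    BLOCK_SIZE_MB = 128               # a common HDFS default block size
    REPLICATION = 3                   # a common HDFS default replication factor
    nodes = [f"node{i}" for i in range(1, 41)]  # a made-up 40-node cluster

    num_blocks = -(-FILE_SIZE_MB // BLOCK_SIZE_MB)  # ceiling division
    print(f"{num_blocks} blocks, {num_blocks * REPLICATION} replicas in total")

    # Place each block's replicas on different nodes, round-robin style.
    # (Real HDFS placement is rack-aware; this only shows the gist.)
    placement = {
        block: [nodes[(block + r) % len(nodes)] for r in range(REPLICATION)]
        for block in range(3)  # show the first three blocks only
    }
    print(placement)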


MapReduce

MapReduce is an application framework for processing the data in the files that have already been split across multiple computers in the cluster. With MapReduce, multiple blocks can be processed simultaneously, thus minimizing the computation time.
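
The classic illustration is word count. Below is a tiny single-machine Python simulation of the phases: map emits (word, 1) pairs from each block of text, the pairs are grouped by word (the "shuffle"), and reduce sums the counts per word.

    from collections import defaultdict

    blocks = ["big data is big", "data is everywhere"]  # stand-ins for file blocks

    # Map phase: each block independently emits (word, 1) pairs.
    mapped = [(word, 1) for block in blocks for word in block.split()]

    # Shuffle phase: group the pairs by word.
    groups = defaultdict(list)
    for word, one in mapped:
        groups[word].append(one)

    # Reduce phase: sum the counts for each word.
    counts = {word: sum(ones) for word, ones in groups.items()}
    print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}

In real MapReduce, the map and reduce tasks run on different nodes of the cluster; only the idea is simulated here.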

Some more buzzwords that get thrown in with Hadoop are HBase, Hive, and Pig. Let's see what they mean.

HBase

HBase is a database built on top of HDFS, providing fault-tolerant storage and fast lookups. For example, searching for 40 specific items in a group of 1 billion records is made possible by HBase.
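
As a sketch of what that looks like from Python: the happybase library talks to HBase through its Thrift gateway, so this assumes a running HBase cluster with the Thrift server enabled, and the table, column family, and row key below are made up.

    import happybase  # third-party client; install with: pip install happybase

    # Assumes an HBase Thrift server is reachable on localhost.
    connection = happybase.Connection("localhost")
    table = connection.table("users")  # made-up table with an 'info' column family

    # Write one row, then read it straight back by its row key.
    table.put(b"user#42", {b"info:name": b"Ada", b"info:plan": b"premium"})
    row = table.row(b"user#42")
    print(row[b"info:name"])  # b'Ada'

Lookups by row key like this stay fast even across billions of rows, which is exactly the kind of search described above.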


Pig

Pig is a data analysis platform used to analyze large amounts of data in the Hadoop ecosystem. In simple terms, when you write a program in Pig Latin (a SQL-like language for Pig), the Pig infrastructure breaks it down into several MapReduce programs and executes them in parallel.
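
To get a feel for the language, here is the classic word count written in Pig Latin; the input and output paths are placeholders.

    -- Load each line of the input file (the path is a placeholder).
    lines   = LOAD 'input.txt' AS (line:chararray);
    -- Split every line into individual words.
    words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    -- Group identical words together and count each group.
    grouped = GROUP words BY word;
    counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;
    STORE counts INTO 'wordcount_output';

Behind the scenes, Pig compiles these five statements into MapReduce jobs and runs them in parallel, exactly as described above.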


Hive

Hive is a data warehouse infrastructure, and its main component is a SQL-like language called HiveQL. It was developed at Facebook because the learning curve for Pig was high. Hive also leverages the distributed functionality of MapReduce to help analyze data.
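
Since HiveQL reads much like plain SQL, a query such as the following sketch (the table and its columns are made up) is compiled by Hive into MapReduce jobs behind the scenes.

    -- A made-up table of page views, purely for illustration.
    CREATE TABLE page_views (user_id STRING, url STRING, view_time TIMESTAMP);

    -- Find the ten most viewed URLs; Hive turns this into MapReduce jobs.
    SELECT url, COUNT(*) AS views
    FROM page_views
    GROUP BY url
    ORDER BY views DESC
    LIMIT 10;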

All of the above combined make up the Hadoop Ecosystem. Of course, there are many more add-ons, but these are the most basic components with which a big data infrastructure can be built and the data analyzed.


Stay tuned for the next post on Data Mining. Happy Learning!

~Slice of BI