Big data means, simply, a lot of data. Experts describe it in terms of four Vs: volume, velocity, veracity, and variety. A data set counts as big data if it fits one or more of them. We are living in the age of big data, and the figures below support this to some extent.
Over 90% of all the data in the world was created in the past two years. By the year 2020, the amount of digital information in existence is expected to have grown from 3.2 zettabytes to 40 zettabytes. The total amount of data captured and stored by industry doubles every 1.2 years. In two days we create as much information as we did from the beginning of time until 2003.
All of these trends created the need for a system that can handle big data and analyze it quickly. This is how Hadoop came into existence, although many other systems and frameworks were, and still are, used for handling big data.
Big data has been around for a long time. In fact, you can handle high volumes of data with massively parallel processing (MPP) databases, such as those offered by Greenplum, Aster Data, and Vertica, and these vendors are incorporating Hadoop into their platforms.
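At the storage layer, Hadoop provides HDFS, the Hadoop Distributed File System, which is a way to create clustered, distributed storage and can run on any server. HDFS is fast, secure, and fault tolerant: it stores copies of each block of data on more than one node, so losing a single node does not lose the data.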
MapReduce is the processing core of Hadoop. It has every data node process the data stored locally on that node, which makes it fast and very powerful.
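To make the map and reduce steps concrete, here is a minimal single-machine sketch in Python. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, run_job) and the sample chunks are made up for illustration, and the "nodes" are simulated by a plain loop. It only shows the shape of the idea, using the classic word-count job.

```python
from collections import defaultdict

def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all the values for one key into a single result."""
    return key, sum(values)

def run_job(chunks):
    mapped = []
    # On a real cluster each chunk would be mapped on the node that stores it;
    # here a simple loop stands in for those nodes.
    for chunk in chunks:
        mapped.extend(map_phase(chunk))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

# Hypothetical input: two chunks, as if stored on two different data nodes.
chunks = ["big data is big", "hadoop handles big data"]
word_counts = run_job(chunks)
```

Because each call to map_phase only looks at its own chunk, the map work can be spread across as many nodes as there are chunks, and only the small (word, count) pairs need to travel over the network.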
Hadoop is not actually an analytic platform. It can be used alongside a traditional analytic platform, and a common way to analyze the data is to write MapReduce jobs in the R programming language.
Hadoop can also be used for archiving and for ETL, which stands for extract, transform, and load. Moreover, Hadoop can be used for filtering. The Hadoop platform provides many opportunities for extracting, transforming, and processing data.
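The three ETL steps, with filtering as part of the transform step, can be sketched in a few lines of plain Python. This is only an illustration of the pattern, not of any Hadoop API: the sample records, the column names, and the extract, transform, and load functions are all hypothetical, and the "load" target is just an in-memory dictionary.

```python
import csv
import io

# Hypothetical raw input; on Hadoop this would be files stored in HDFS.
RAW = """user,country,amount
alice,US,10.50
bob,DE,not_a_number
carol,US,4.25
"""

def extract(text):
    """Extract: parse the raw text into one dictionary per record."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: filter out malformed records and convert amounts to numbers."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # filtering step: drop records that cannot be parsed
        clean.append({"user": row["user"], "country": row["country"], "amount": amount})
    return clean

def load(rows):
    """Load: write the cleaned data to its destination (here, totals per country)."""
    totals = {}
    for row in rows:
        totals[row["country"]] = totals.get(row["country"], 0.0) + row["amount"]
    return totals

totals = load(transform(extract(RAW)))
```

In this sketch the malformed record for bob is filtered out during the transform step, so only the two US records reach the load step.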
Scaling is a major concern in the data world. In the Hadoop ecosystem, Accumulo can be used to scale data storage. Accumulo is inspired by Google's Bigtable design and is built on top of Hadoop. It adds a few improvements to the Bigtable design: cell-based access control, and a server-side programming mechanism that can modify key-value pairs at various points in the data management process.
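Cell-based access control is easier to picture with a toy model. The Python sketch below is not the Accumulo API; the cell layout, the label names, and the scan function are assumptions made for illustration. It also simplifies the rule to "the reader must hold every label on the cell", whereas real Accumulo visibility expressions can also combine labels with OR and parentheses.

```python
# Toy model only: each cell carries its own set of required security labels.
cells = [
    # (row, column, required_labels, value)
    ("patient1", "name", frozenset(), "Ann"),
    ("patient1", "diagnosis", frozenset({"doctor"}), "flu"),
    ("patient1", "billing", frozenset({"finance", "audit"}), "120.00"),
]

def scan(cells, authorizations):
    """Return only the cells whose required labels are all held by the reader."""
    auths = set(authorizations)
    return {
        (row, column): value
        for row, column, labels, value in cells
        if labels <= auths  # subset test: every required label is present
    }

# Two hypothetical readers with different authorizations see different cells
# of the same row.
doctor_view = scan(cells, {"doctor"})
finance_view = scan(cells, {"finance", "audit"})
```

The point of the design is that access is decided per cell rather than per table or per row, so one row can safely hold values meant for different audiences.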
Components of Hadoop
Hive:
Hive is a data warehouse application that provides a high-level language for expressing data analysis programs. It provides an SQL-like environment.
Pig:
Apache Pig provides a high-level language for expressing data analysis programs over large datasets. Pig's language layer consists of a textual language called Pig Latin.