
What is Big Data?

Big data is a collection of large datasets that cannot be
processed using traditional computing techniques. It is
not a single technique or tool; rather, it has become a
complete subject that involves various tools,
techniques, and frameworks.

What Comes Under Big Data?


Big data involves the data produced by different
devices and applications. Given below are some of the
fields that come under the umbrella of Big Data.
Black Box Data − This is a component of helicopters,
airplanes, jets, and so on. It captures the voices of the flight
crew, recordings from microphones and earphones, and the
performance information of the aircraft.
Social Media Data − Social media sites such as Facebook and
Twitter hold information and the views posted by millions of
people across the globe.
Stock Exchange Data − Stock exchange data holds
information about the ‘buy’ and ‘sell’ decisions that customers
make on shares of different companies.
Power Grid Data − Power grid data holds information about
the power consumed by a particular node with respect to a base
station.
Transport Data − Transport data includes the model, capacity,
distance, and availability of a vehicle.
Search Engine Data − Search engines retrieve lots of
data from different databases.

Benefits of Big Data


Using the information kept in social networks such
as Facebook, marketing agencies learn how their
campaigns, promotions, and other advertising
media are received.
Using information from social media, such as the
preferences and product perceptions of their
consumers, product companies and retail
organizations plan their production.
Using data from patients' previous medical
histories, hospitals provide better and quicker
service.

Hadoop

Hadoop is an open-source, Java-based framework used
for storing and processing big data. The data is stored
on inexpensive commodity servers that run as clusters.
Its distributed file system enables concurrent processing
and fault tolerance. Developed by Doug Cutting and
Michael J. Cafarella, Hadoop uses the MapReduce
programming model for faster storage and retrieval of
data from its nodes.

From a business point of view, too, there are direct and
indirect benefits. By using open-source technology on
inexpensive servers that are mostly in the cloud (and
sometimes on-premises), organizations achieve
significant cost savings.
Additionally, the ability to collect massive data, and the
insights derived from crunching this data, results in
better business decisions in the real-world—such as the
ability to focus on the right consumer segment, weed
out or fix erroneous processes, optimize floor
operations, provide relevant search results, perform
predictive analytics, and so on.
How Hadoop Improves on Traditional Databases
Hadoop solves two key challenges with traditional
databases:

1. Capacity: Hadoop stores large volumes of data.
By using a distributed file system called the HDFS
(Hadoop Distributed File System), the data is split into
chunks and saved across clusters of commodity servers.
As these commodity servers are built with simple
hardware configurations, they are economical and
easily scalable as the data grows.
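
As a rough sketch of how an application interacts with this
storage layer, the following example uses Hadoop's Java
FileSystem API to write a small file into HDFS and read it
back. The NameNode address and the file path are placeholders,
not values from a real cluster.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import java.nio.charset.StandardCharsets;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            // Point the client at the NameNode; this URI is a placeholder.
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

            try (FileSystem fs = FileSystem.get(conf)) {
                Path file = new Path("/data/sample.txt");

                // Write a small file; HDFS splits larger files into blocks
                // and replicates them across the commodity servers.
                try (FSDataOutputStream out = fs.create(file, true)) {
                    out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
                }

                // Read the file back.
                try (FSDataInputStream in = fs.open(file)) {
                    byte[] buf = new byte[(int) fs.getFileStatus(file).getLen()];
                    in.readFully(buf);
                    System.out.println(new String(buf, StandardCharsets.UTF_8));
                }
            }
        }
    }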

2. Speed: Hadoop stores and retrieves data faster.
Hadoop uses the MapReduce functional programming
model to perform parallel processing across data sets.
So, when a query is sent to the database, instead of
handling data sequentially, tasks are split and
concurrently run across distributed servers. Finally, the
output of all tasks is collated and sent back to the
application, drastically improving the processing speed.
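
To make this concrete, below is a minimal word-count job
written against Hadoop's Java MapReduce API. The map tasks run
in parallel over splits of the input and the reduce tasks
collate the partial counts; the input and output paths are
assumed to be passed on the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: runs in parallel on each split of the input data.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: collates the partial counts from the mappers.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }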

The Hadoop Ecosystem: Supplementary Components


The following are a few supplementary
components that are extensively used in the
Hadoop ecosystem.

Hive: Data Warehousing


Hive is a data warehousing system that helps to
query large datasets in the HDFS. Before Hive,
developers were faced with the challenge of
creating complex MapReduce jobs to query the
Hadoop data. Hive uses HQL (Hive Query
Language), which resembles the syntax of SQL.
Since most developers come from a SQL
background, Hive is easy for them to pick up.
The advantage of Hive is that a JDBC/ODBC driver
acts as an interface between the application and
the HDFS. Originally developed by the
Facebook team, Hive is now an open source
technology.
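
As a sketch of that JDBC interface, the example below connects
to a HiveServer2 endpoint and runs an HQL query. The host name,
the credentials, and the sales table are hypothetical.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
        public static void main(String[] args) throws Exception {
            // Load the Hive JDBC driver and point it at HiveServer2.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            String url = "jdbc:hive2://hive-server-host:10000/default";

            try (Connection conn = DriverManager.getConnection(url, "hiveuser", "");
                 Statement stmt = conn.createStatement()) {

                // HQL resembles SQL; Hive turns this into jobs over HDFS data.
                ResultSet rs = stmt.executeQuery(
                    "SELECT product, COUNT(*) AS orders " +
                    "FROM sales GROUP BY product");

                while (rs.next()) {
                    System.out.println(
                        rs.getString("product") + "\t" + rs.getLong("orders"));
                }
            }
        }
    }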

Pig: Reduce MapReduce Functions


Pig, initially developed by Yahoo!, is similar to
Hive in that it eliminates the need to create
MapReduce functions to query the HDFS. Like HQL,
the language used here, called “Pig Latin,” is
closer to SQL than to raw MapReduce code. “Pig Latin”
is a high-level data-flow language layered on top of
MapReduce.
Pig also has a runtime environment that interfaces
with HDFS. Scripts in languages such as Java or
Python can also be embedded inside Pig.
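
A minimal sketch of such a Pig Latin data flow, submitted here
through the PigServer Java API, might look like the following;
the access-log input and the output path are hypothetical.

    import org.apache.pig.PigServer;

    public class PigExample {
        public static void main(String[] args) throws Exception {
            // "mapreduce" mode runs against the cluster; paths are placeholders.
            PigServer pig = new PigServer("mapreduce");

            // Each statement is one Pig Latin step in the data flow.
            pig.registerQuery("logs = LOAD '/data/access_log' USING PigStorage(' ') "
                    + "AS (ip:chararray, url:chararray);");
            pig.registerQuery("grouped = GROUP logs BY url;");
            pig.registerQuery("hits = FOREACH grouped GENERATE group, COUNT(logs);");

            // STORE triggers the underlying MapReduce jobs and writes to HDFS.
            pig.store("hits", "/output/url_hits");
            pig.shutdown();
        }
    }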

Flume: Big Data Ingestion


Flume is a big data ingestion tool that acts as a
courier service between multiple data sources and
the HDFS. It collects, aggregates, and sends
huge amounts of streaming data (e.g. log
files, events) generated by applications such
as social media sites, IoT apps, and
ecommerce portals into the HDFS.
Flume is feature-rich; it:
Has a distributed architecture.
Ensures reliable data transfer.
Is fault-tolerant.
Has the flexibility to collect data in batches or in
real time.
Can be scaled horizontally to handle more traffic,
as needed.
Data sources communicate with Flume agents —
every agent has a source, channel, and a sink. The
source collects data from the sender, the channel
temporarily stores the data, and finally, the sink
transfers data to the destination, which is a
Hadoop server.

Sqoop: Data Ingestion for Relational Databases


Sqoop (“SQL to Hadoop”) is another data
ingestion tool like Flume. While Flume works on
unstructured or semi-structured data, Sqoop is
used to export data from and import data into
relational databases. As most enterprise data is
stored in relational databases, Sqoop is used to
import that data into Hadoop for analysts to
examine.
Database admins and developers can use a simple
command line interface to export and import data.
Sqoop converts these commands into MapReduce
jobs, which are run over the HDFS using YARN.
Sqoop is also fault-tolerant and performs
concurrent operations like Flume.

Zookeeper: Coordination of Distributed Applications


Zookeeper is a service that coordinates
distributed applications. In the Hadoop
framework, it acts as an admin tool with a
centralized registry that has information about the
cluster of distributed servers it manages. Some of
its key functions are:
Maintaining configuration information (shared
state of configuration data)
Naming service (assignment of name to each
server)
Synchronization service (handles deadlocks, race
condition, and data inconsistency)
Leader election (elects a leader among the servers
through consensus)
The cluster of servers that the Zookeeper service
runs on is called an “ensemble.” The ensemble
elects a leader among the group, with the rest
behaving as followers. All write operations from
clients need to be routed through the leader,
whereas read operations can go directly to any
server.
Zookeeper provides high reliability and resilience
through fail-safe synchronization, atomicity, and
serialization of messages.
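
As a small illustration of the centralized registry, the
sketch below uses the standard ZooKeeper Java client to store
and read a piece of shared configuration. The ensemble
addresses and the znode path are placeholders.

    import java.nio.charset.StandardCharsets;
    import java.util.concurrent.CountDownLatch;

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZookeeperExample {
        public static void main(String[] args) throws Exception {
            // Connect to the ensemble and wait for the session to be established.
            CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000,
                    event -> connected.countDown());
            connected.await();

            // Store a shared configuration value in the centralized registry.
            String path = "/batch-size";
            if (zk.exists(path, false) == null) {
                zk.create(path, "500".getBytes(StandardCharsets.UTF_8),
                        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            }

            // Any client in the cluster can read the same value back.
            byte[] data = zk.getData(path, false, null);
            System.out.println(new String(data, StandardCharsets.UTF_8));

            zk.close();
        }
    }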

Kafka: Faster Data Transfers


Kafka is a distributed publish-subscribe
messaging system that is often used with
Hadoop for faster data transfers. A Kafka
cluster consists of a group of servers that act as an
intermediary between producers and consumers.
In the context of big data, an example of a
producer could be a sensor gathering temperature
data to relay back to the server. Consumers are
the Hadoop servers. The producers publish
messages on a topic and the consumers pull
messages by listening to the topic.
A single topic can be split further into partitions.
All messages with the same key arrive at the same
partition. A consumer can listen to one or more
partitions.
By grouping messages under one key and
getting a consumer to cater to specific
partitions, many consumers can listen on the
same topic at the same time. Thus, a topic is
parallelized, increasing the throughput of the
system. Kafka is widely adopted for its speed,
scalability, and robust replication.
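
To illustrate keyed messages, here is a small producer written
with the Kafka Java client. The broker addresses, the
temperature topic, and the sensor key are assumptions made for
the example; messages that share a key end up in the same
partition.

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class TemperatureProducer {
        public static void main(String[] args) {
            // Broker list and serializers; the addresses are placeholders.
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka1:9092,kafka2:9092");
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                // Records sharing a key (here, the sensor id) land in the same
                // partition, so a consumer sees that sensor's readings in order.
                producer.send(new ProducerRecord<>("temperature", "sensor-42", "21.5"));
                producer.send(new ProducerRecord<>("temperature", "sensor-42", "21.7"));
            }
        }
    }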

HBase: Non-Relational Database


HBase is a column-oriented, non-relational
database that sits on top of HDFS. One of the
challenges with HDFS is that it can only do
batch processing. So for simple interactive
queries, data still has to be processed in batches,
leading to high latency.
HBase solves this challenge by allowing queries for
single rows across huge tables with low latency. It
achieves this by internally using hash tables. It is
modelled along the lines of Google’s BigTable, which
runs on top of the Google File System (GFS).
HBase is scalable, supports failover when a node
goes down, and works well with unstructured as well
as semi-structured data. Hence, it is ideal for
querying big data stores for analytical purposes.
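
As an illustration of such a low-latency, single-row lookup,
the sketch below uses the HBase Java client to fetch one row by
its key. The users table, the info:email column, and the row
key are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseGetExample {
        public static void main(String[] args) throws Exception {
            // Zookeeper quorum, table, and column names are placeholders.
            Configuration conf = HBaseConfiguration.create();
            conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");

            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("users"))) {

                // Fetch a single row by key without scanning the whole table.
                Get get = new Get(Bytes.toBytes("user-1001"));
                Result result = table.get(get);
                byte[] email = result.getValue(
                        Bytes.toBytes("info"), Bytes.toBytes("email"));
                System.out.println(Bytes.toString(email));
            }
        }
    }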
