School of Computer Engineering: Kalinga Institute of Industrial Technology Deemed To Be University Bhubaneswar-751024
Strictly for internal circulation (within KIIT) and reference only. Not for outside circulation without permission
Hadoop
Apache open-source software framework
Inspired by:
- Google MapReduce
- Google File System
Let’s look at a few statistics to get an idea of how much data gets generated every day, every
minute, and every second.
❑ Every day
❑ The NYSE generates trade data for about 1.5 billion shares
❑ Facebook stores 2.7 billion comments and likes
❑ Google processes about 24 petabytes of data
❑ Every minute
❑ Facebook users share nearly 2.5 million pieces of content.
❑ Amazon generates over $80,000 in online sales
❑ Twitter users tweet nearly 300,000 times.
❑ Instagram users post nearly 220,000 new photos
❑ Apple users download nearly 50,000 apps.
❑ Email users send over 2000 million messages
❑ YouTube users upload 72 hrs of new video content
❑ Every second
❑ Banking applications process more than 10,000 credit card transactions.
Data Challenges
To process, analyze, and make sense of these different kinds of data, a system is
needed that scales and addresses these challenges.
Hadoop was created by Doug Cutting, the creator of Apache Lucene (a text search
library). Hadoop began as part of Apache Nutch (an open-source web search
engine), which itself was a sub-project of the Lucene project. The name Hadoop is
not an acronym; it’s a made-up name.
Key Aspects of Hadoop
Hadoop addresses four key aspects of Big Data: Data Management, Data Access,
Data Processing, and Data Storage. At the core of the platform are HDFS (storage)
and YARN (resource management).
HDFS
(Figure: an HDFS cluster consisting of one NameNode and multiple DataNodes.)
❑ The Hadoop Distributed File System (HDFS) is the primary data storage
system used by Hadoop applications.
❑ HDFS holds very large amounts of data and employs a NameNode and
DataNode architecture to implement a distributed file system that provides
high-performance access to data across highly scalable Hadoop clusters.
❑ To store such huge data, the files are stored across multiple machines.
❑ These files are stored in redundant fashion to rescue the system from
possible data losses in case of failure.
❑ It runs on commodity hardware.
❑ Unlike other distributed systems, HDFS is highly fault-tolerant and designed
using low-cost hardware.
1. Metadata stored about the file consists of file name, file path, number of
blocks, block Ids, replication level.
2. This metadata information is stored on the local disk. Namenode uses two
files for storing this metadata information.
❑ FsImage ❑ EditLog
3. The NameNode in HDFS also keeps in its memory the locations of the DataNodes
that store the blocks for any given file. Using that information, the NameNode
can reconstruct the whole file by getting the locations of all the blocks of a
given file.
Example
(File Name, numReplicas, rack-ids, machine-ids, block-ids, …)
/user/in4072/data/part-0, 3, r:3, M3, {1, 3}, …
/user/in4072/data/part-1, 3, r:2, M1, {2, 4, 5}, …
/user/in4072/data/part-2, 3, r:1, M2, {6, 9, 8}, …
(Figure: the NameNode and Secondary NameNode, annotated with points 1–3 above.)
With Hadoop 2.0, HDFS now has automated failover with a hot standby and
full stack resiliency built into the platform.
1. Automated Failover: Hadoop pro-actively detects NameNode host and
process failures and will automatically switch to the standby NameNode to
maintain availability for the HDFS service. There is no need for human
intervention in the process – System Administrators can sleep in peace!
2. Hot Standby: Both Active and Standby NameNodes have up to date HDFS
metadata, ensuring seamless failover even for large clusters – which means
no downtime for your HDP cluster!
3. Full Stack Resiliency: The entire Hadoop stack (MapReduce, Hive, Pig,
HBase, Oozie etc.) has been certified to handle a NameNode failure scenario
without losing data or the job progress. This is vital to ensure long running
jobs that are critical to complete on schedule will not be adversely affected
during a NameNode failure scenario.
All machines in rack are connected using the same network switch and if that
network goes down then all machines in that rack will be out of service. Thus
the rack is down. Rack Awareness was introduced by Apache Hadoop to
overcome this issue. In Rack Awareness, NameNode chooses the DataNode
which is closer to the same rack or nearby rack. NameNode maintains Rack ids
of each DataNode to achieve rack information. Thus, this concept chooses
DataNodes based on the rack information. The NameNode in Hadoop ensures
that all the replicas are not stored on the same or a single rack. The default
replication factor is 3. Therefore, according to the Rack Awareness algorithm:
❑ When the Hadoop framework creates a new block, it places the first replica on
the local node, the second one on a node in a different rack, and the third one
on a different node in the same remote rack.
❑ When re-replicating a block, if the number of existing replicas is one, place
the second on a different rack.
❑ When number of existing replicas are two, if the two replicas are in the
same rack, place the third one on a different rack.
Rack Awareness & Replication
(Figure: blocks B1, B2 and B3, each replicated three times across DataNodes DN1–DN4 spread over three racks.)
HDFS follows a write-once, read-many model. Files already stored in HDFS
cannot be edited, but data can be appended by reopening the file.
(Figure: anatomy of a file write — the client JVM uses DistributedFileSystem to contact the NameNode, writes packets through FSDataOutputStream to a pipeline of DataNodes (DataNode1 → DataNode2 → DataNode3), receives acknowledgement packets back along the pipeline, and finally signals completion: steps 4 "Write Packet", 5 "Acknowledge Packet", 7 "Complete".)
Anatomy of File Read
(Figure: anatomy of a file read — the client JVM uses DistributedFileSystem to contact the NameNode, then reads the blocks from the DataNodes through FSDataInputStream: read steps 4.1, 4.2 and 4.3.)
1. The client opens the file that it wishes to read from by calling open() on the
DistributedFileSystem
2. DistributedFileSystem communicates with the NameNode to get the
location of the data blocks. NameNode returns the address of the
DataNodes that the data blocks are stored on. Subsequent to this,
DistributedFileSystem returns DFSInputStream (i.e. a class) to the
client to read from the file.
3. The client then calls read() on the stream DFSInputStream, which has the
addresses of the DataNodes for the first few blocks of the file, and connects to
the closest DataNode for the first block in the file.
4. Client calls read() repeatedly to stream the data from the DataNode.
5. When the end of the block is reached, DFSInputStream closes the
connection with the DataNode. It repeats the steps to find the best
DataNode for the next block and subsequent blocks.
6. When the client completes the reading of the file, it calls close() on
FSDataInputStream to close the connection.
HDFS Commands
❑ To get the list of directories and files at the root of HDFS:
hadoop fs -ls /
❑ To create a directory (say, sample) in HDFS:
hadoop fs -mkdir /sample
❑ To copy a file from local file system to HDFS:
hadoop fs -put /root/sample/test.txt /sample/test.txt
❑ To copy a file from HDFS to local file system:
hadoop fs -get /sample/test.txt /root/sample/test.txt
❑ To display the contents of an HDFS file on console:
hadoop fs -cat /sample/test.txt
❑ To copy a file from one directory to another on HDFS:
hadoop fs -cp /sample/test.txt /sample1
❑ To remove a directory from HDFS:
hadoop fs -rm -r /sample1
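Two related commands are often useful alongside these (a quick sketch, assuming a standard Hadoop installation and the /sample/test.txt file used above): -setrep changes a file's replication factor, and fsck reports where its blocks and replicas actually live.
hadoop fs -setrep -w 2 /sample/test.txt
hdfs fsck /sample/test.txt -files -blocks -locations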
At the crux of MapReduce are two functions: Map and Reduce. They are
sequenced one after the other.
❑ The Map function takes input from the disk as <key,value> pairs, processes
them, and produces another set of intermediate <key,value> pairs as output.
❑ The Reduce function also takes inputs as <key,value> pairs, and produces
<key,value> pairs as output.
The types of keys and values differ based on the use case. All inputs and outputs
are stored in the HDFS. While the map is a mandatory step to filter and sort the
initial data, the reduce function is optional.
<k1, v1> -> Map() -> list(<k2, v2>)
<k2, list(v2)> -> Reduce() -> list(<k3, v3>)
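As a small worked illustration of these signatures (a hypothetical one-line input; with the default text input, k1 is the byte offset of the line and v1 is the line itself):
<0, "car bus car"> -> Map() -> list(<car, 1>, <bus, 1>, <car, 1>)
<bus, {1}> -> Reduce() -> <bus, 1>
<car, {1, 1}> -> Reduce() -> <car, 2>
Here k2 is a word, v2 is a count of 1, and the framework groups all values with the same k2 before calling Reduce().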
Mappers and Reducers are the Hadoop servers that run the Map and Reduce
functions respectively. It doesn’t matter if these are the same or different
servers.
❑ Map: The input data is first split into smaller blocks. Each block is then
assigned to a mapper for processing. For example, if a file has 100 records
to be processed, 100 mappers can run together to process one record each.
Or maybe 50 mappers can run together to process two records each. The
Hadoop framework decides how many mappers to use, based on the size of
the data to be processed and the memory block available on each mapper
server.
Working of MapReduce cont’d
❑ Reduce: After all the mappers complete processing, the framework shuffles
and sorts the results before passing them on to the reducers. A reducer
cannot start while a mapper is still in progress. All the map output values
that have the same key are assigned to a single reducer, which then
aggregates the values for that key.
Class Exercise 1: Draw the MapReduce process to count the number of words for the input:
Dog Cat Rat
Car Car Rat
Dog car Rat
Rat Rat Rat
Class Exercise 2: Draw the MapReduce process to find the maximum electrical consumption for each year (the year-wise consumption table is not reproduced here).
When a dataset is stored in HDFS, it is divided in to blocks and stored across the
DataNodes in the Hadoop cluster. When a MapReduce job is executed against the
dataset the individual Mappers will process the blocks (input splits). When the
data is not available for the Mapper in the same node where it is being executed,
the data needs to be copied over the network from the DataNode which has the
data to the DataNode which is executing the Mapper task. Imagine a MapReduce
job with over 100 Mappers where each Mapper tries to copy data from another
DataNode in the cluster at the same time: this would result in serious network
congestion, which is not ideal. So it is always more effective and cheaper to move
the computation closer to the data than to move the data closer to the computation.
When the data is located on the same node as the Mapper working on the data,
it is referred to as Data Local. In this case the proximity of the data is closer to
the computation. The ApplicationMaster (MRv2) prefers the node which has
the data that is needed by the Mapper to execute the Mapper.
YARN stands for “Yet Another Resource Negotiator” and is the architectural
center of Hadoop 2.0 that allows multiple data processing engines such as
interactive SQL, real-time streaming, data science and batch processing to
handle data stored in a single platform, unlocking an entirely new approach to
analytics.
In Hadoop 1.0 (MRv1), a single JobTracker handled both resource management and
job scheduling, and this design resulted in a scalability bottleneck. IBM
mentioned in its article that, according to Yahoo!, the practical limits of such a
design are reached with a cluster of 5,000 nodes and 40,000 tasks running
concurrently. Apart from this limitation, the utilization of computational
resources was inefficient in MRv1. Also, the Hadoop framework was limited
only to the MapReduce processing paradigm.
To overcome all these issues, YARN was introduced in Hadoop version 2.0 in the
year 2012 by Yahoo and Hortonworks. The basic idea behind YARN is to relieve
MapReduce by taking over the responsibility of Resource Management and Job
Scheduling. YARN started to give Hadoop the ability to run non-MapReduce jobs
within the Hadoop framework.
With the introduction of YARN, the Hadoop ecosystem was completely
revolutionized. It became much more flexible, efficient and scalable. When
Yahoo went live with YARN in the first quarter of 2013, it aided the company to
shrink the size of its Hadoop cluster from 40,000 nodes to 32,000 nodes.
Introduction to YARN
YARN allows different data processing methods like graph processing, interactive
processing, stream processing as well as batch processing to run and process data stored
in HDFS. Therefore YARN opens up Hadoop to other types of distributed applications
beyond MapReduce. YARN enabled the users to perform operations as per requirement
by using a variety of tools like Spark for real-time processing, Hive for SQL, HBase for
NoSQL and others. Apart from Resource Management, YARN also performs Job
Scheduling. YARN performs all your processing activities by allocating resources and
scheduling tasks. Apache Hadoop YARN architecture consists of the following main
components :
❑ Resource Manager: Runs on a master daemon and manages the resource allocation
in the cluster.
❑ Node Manager: They run on the slave daemons and are responsible for the
execution of a task on every single Data Node.
❑ Application Master: Manages the user job lifecycle and resource needs of
individual applications. It works along with the Node Manager and monitors the
execution of tasks.
❑ Container: Package of resources including RAM, CPU, Network, HDD etc on a single
node.
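These components can be inspected from the command line (a brief sketch, assuming YARN is running and the yarn client is on the PATH):
$ yarn node -list → lists the Node Managers registered with the Resource Manager
$ yarn application -list → lists running applications and their current state
$ yarn application -status <application_id> → shows details of one application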
Components of YARN
Node Manager:
❑ It takes care of individual nodes in a Hadoop cluster and manages user jobs
and workflow on the given node.
❑ It registers with the Resource Manager and sends heartbeats with the
health status of the node.
❑ Its primary goal is to manage application containers assigned to it by the
resource manager.
❑ It keeps up-to-date with the Resource Manager.
❑ Application Master requests the assigned container from the Node Manager
by sending it a Container Launch Context(CLC) which includes everything
the application needs in order to run. The Node Manager creates the
requested container process and starts it.
❑ Monitors resource usage (memory, CPU) of individual containers.
❑ Performs Log management.
❑ It also kills the container as directed by the Resource Manager.
Programmers who are not so good at Java normally used to struggle working
with Hadoop, especially while performing any MapReduce tasks. Apache Pig is a
boon for all such programmers.
❑ Using Pig Latin, programmers can perform MapReduce tasks easily
without having to type complex codes in Java.
❑ Apache Pig uses multi-query approach, thereby reducing the length of
codes. For example, an operation that would require you to type 200 lines of
code (LoC) in Java can be easily done by typing as few as 10 LoC in
Apache Pig. Ultimately Apache Pig reduces the development time by almost
16 times.
❑ Pig Latin is SQL-like language and it is easy to learn Apache Pig when
you are familiar with SQL.
❑ Apache Pig provides many built-in operators to support data operations like
joins, filters, ordering, etc. In addition, it also provides nested data
types like tuples, bags, and maps that are missing from MapReduce.
Application
Apache Pig is generally used by data scientists for performing tasks involving
ad-hoc processing and quick prototyping. Apache Pig is used:
❑ To process huge data sources such as web logs.
❑ To perform data processing for search platforms.
❑ To process time sensitive data loads.
History
In 2006, Apache Pig was developed as a research project at Yahoo, especially to
create and execute MapReduce jobs on every dataset. In 2007, Apache Pig was
open sourced via Apache incubator. In 2008, the first release of Apache Pig came
out. In 2010, Apache Pig graduated as an Apache top-level project.
The language used to analyze data in Hadoop using Pig is known as Pig Latin. It
is a high level data processing language which provides a rich set of data types
and operators to perform various operations on the data.
❑ Parser: Initially the Pig Scripts are handled by the Parser. It checks the
syntax of the script, does type checking, and other miscellaneous checks.
The output of the parser will be a DAG (directed acyclic graph), which
represents the Pig Latin statements and logical operators. In the DAG, the
logical operators of the script are represented as the nodes and the data
flows are represented as edges.
❑ Optimizer: The logical plan (DAG) is passed to the logical optimizer, which
carries out the logical optimizations such as projection and pushdown.
❑ Compiler: The compiler compiles the optimized logical plan into a series of
MapReduce jobs.
❑ Execution engine: Finally the MapReduce jobs are submitted to Hadoop in
a sorted order. Finally, these MapReduce jobs are executed on Hadoop
producing the desired results.
❑ Atom: Any single value in Pig Latin, irrespective of its data type, is known as an
Atom. It is stored as a string and can be used as a string or a number. int, long, float,
double, chararray, and bytearray are the atomic values of Pig. A piece of data or a
simple atomic value is known as a field. Ex: ‘Abhi’
❑ Tuple: A record that is formed by an ordered set of fields is known as a tuple, the
fields can be of any type. A tuple is similar to a row in a table of RDBMS. Ex: (‘Abhi’,
14)
❑ Bag: A bag is an unordered set of tuples. In other words, a collection of tuples
(non-unique) is known as a bag. Each tuple can have any number of fields (flexible
schema). A bag is represented by ‘{}’. It is similar to a table in RDBMS, but unlike a
table in RDBMS, it is not necessary that every tuple contain the same number of
fields or that the fields in the same position (column) have the same type. Ex:
{(‘Abhi’), (‘Manu’, (14, 21))}
❑ Map: A map (or data map) is a set of key-value pairs. The key needs to be of type
chararray and should be unique. The value might be of any type. It is represented by
‘[]’. Ex: [‘name’#’Raju’, ‘age’#30]
❑ Relation: A relation is a bag of tuples. The relations in Pig Latin are unordered
(there is no guarantee that tuples are processed in any particular order).
Apache Pig Data Model cont’d
Data Types: int, long, float, double, chararray, bytearray, tuple, bag (a
collection of tuples), and map.
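As an illustrative sketch (the emp.txt file and its schema are hypothetical), a LOAD statement can declare these complex types directly in its schema:
grunt> emp = LOAD 'hdfs://localhost:9000/pig_data/emp.txt' USING PigStorage(',')
as (id:int, name:chararray, skills:bag{t:tuple(skill:chararray)}, contact:map[]);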
Apache Pig can run in two modes, namely, Local Mode and HDFS mode.
❑ Local Mode: In this mode, all the files are installed and run from local host
and local file system. There is no need of Hadoop or HDFS. This mode is
generally used for testing purpose.
❑ MapReduce Mode: In MapReduce mode, the data loaded or processed
exists in the Hadoop Distributed File System (HDFS). In this mode,
whenever Pig Latin statements are executed to process the data, a
MapReduce job is invoked in the back-end to perform a particular
operation on the data that exists in the HDFS.
Apache Pig scripts can be executed in three ways, namely, interactive mode,
batch mode, and embedded mode.
❑ Interactive Mode (Grunt shell) − You can run Apache Pig in interactive
mode using the Grunt shell. In this shell, you can enter the Pig Latin
statements and get the output (using Dump operator).
❑ Batch Mode (Script) − You can run Apache Pig in Batch mode by writing
the Pig Latin script in a single file with .pig extension.
❑ Embedded Mode (UDF) − Apache Pig provides the provision of defining
our own functions (User Defined Functions) in programming languages
such as Java, and using them in our script.
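For reference, the mode and the execution way are chosen when Pig is launched (a sketch; sample_script.pig is a hypothetical script file):
$ pig -x local → Grunt shell in local mode
$ pig -x mapreduce → Grunt shell in MapReduce mode (the default)
$ pig -x mapreduce sample_script.pig → runs the script in batch mode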
Operator Description
Loading and Storing
LOAD To Load the data from the file system (local/HDFS) into a relation.
STORE To save a relation to the file system (local/HDFS).
Filtering
FILTER To remove unwanted rows from a relation.
DISTINCT To remove duplicate rows from a relation.
Diagnostic Operators
DUMP To print the contents of a relation on the console.
DESCRIBE To describe the schema of a relation.
Combining and Splitting
UNION To combine two or more relations into a single relation.
SPLIT To split a single relation into two or more relations.
Operator Description
Grouping and Joining
JOIN To join two or more relations.
COGROUP To group the data in two or more relations.
GROUP To group the data in a single relation.
Sorting
ORDER To arrange a relation in a sorted order based on one or more fields
(ascending or descending).
LIMIT To get a limited number of tuples from a relation.
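A few of these operators in use, as a sketch against the student relation defined in the LOAD example that follows (the filter value 'Kolkata' is illustrative):
grunt> kolkata_students = FILTER student BY city == 'Kolkata';
grunt> students_by_city = GROUP student BY city;
grunt> ordered_students = ORDER student BY firstname ASC;
grunt> first_three = LIMIT ordered_students 3;
grunt> DUMP first_three;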
To analyze data using Apache Pig, the data has to be loaded into Apache Pig. In
MapReduce mode, Pig reads (loads) data from HDFS and stores the results
back in HDFS. The below dataset contains personal details like id, first name,
last name, phone number and city, of six students and stored in student_data.txt
with csv format in Hadoop cluster - hdfs://localhost:9000/pig_data/ path.
The data is loaded into Apache Pig from the file system (HDFS/local)
using the LOAD operator of Pig Latin.
Syntax: Relation_name = LOAD 'Input file path' USING function as schema
where:
❑ relation_name − the relation in which we want to store the data.
❑ Input file path − HDFS directory where the file is stored (MapReduce
mode)
❑ function − function from the set of load functions provided by Apache Pig
(BinStorage, JsonLoader, PigStorage, TextLoader).
❑ Schema − define the schema of the data. The required schema can be
defined as − (column1 : data type, column2 : data type);
Example:
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
USING PigStorage(',') as ( id:int, firstname:chararray, lastname:chararray,
phone:chararray, city:chararray );
The loaded data can be stored in the HDFS file system using the store
operator.
Syntax: STORE relation_name INTO ' required_directory_path ' [USING
function];
❑ relation_name − the relation to be stored.
❑ required_directory_path − the HDFS directory where the relation is to be
stored (MapReduce mode).
❑ function − a function from the set of store functions provided by Apache Pig
(e.g., PigStorage, BinStorage).
Example:
grunt> STORE student INTO ' hdfs://localhost:9000/pig_Output/' USING
PigStorage (',');
This would store the relation in the HDFS directory “/pig_Output/” wherein
student is the relation as explained in LOAD operator.
❑ Dump Operator: It is used to run the Pig Latin statements and display
the results on the screen. It is generally used for debugging Purpose.
Syntax: grunt> Dump Relation_Name
Example: grunt> Dump student → Assume data is read into a relation
student using the LOAD operator
❑ Describe Operator: It is used to view the schema of a relation.
Syntax: grunt> Describe Relation_Name
Example: grunt> Describe student → Assume data is read into a relation
student using the LOAD operator
❑ Explain Operator: It is used to display the logical, physical, and
MapReduce execution plans of a relation
Syntax: grunt> explain Relation_Name
Example: grunt> explain student → Assume data is read into a relation
student using the LOAD operator
The JOIN operator is used to combine records from two or more relations. While
performing a join operation, we declare one (or a group of) tuple(s) from each
relation, as keys. When these keys match, the two particular tuples are matched,
else the records are dropped. Joins can be of the following types −
❑ Self-join: It is used to join a table with itself as if the table were two
relations, temporarily renaming at least one relation.
❑ Inner-join: returns rows when there is a match in both tables.
❑ Outer-join:
❑ Left outer join: returns all rows from the left table, even if there are no
matches in the right relation.
❑ Right outer join: returns all rows from the right table, even if there are
no matches in the left table.
❑ Full outer join: returns rows when there is a match in one of the
relations.
Self-join: Generally, in Apache Pig, to perform self-join, we will load the same
data multiple times, under different aliases (names). Therefore let us load the
contents of the file customers.txt as two tables as shown below.
grunt> customers1 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt'
USING PigStorage(',') as (id:int, name:chararray, age:int, address:chararray,
salary:int);
grunt> customers2 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt'
USING PigStorage(',') as (id:int, name:chararray, age:int, address:chararray,
salary:int);
Example: Let us perform self-join operation on the relation customers, by
joining the two relations customers1 and customers2 as shown below.
grunt> customers3 = JOIN customers1 BY id, customers2 BY id;
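Inner and outer joins follow the same pattern (a sketch; the orders.txt file and its schema are hypothetical):
grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',')
as (oid:int, odate:chararray, customer_id:int, amount:int);
grunt> inner_j = JOIN customers1 BY id, orders BY customer_id;
grunt> left_j = JOIN customers1 BY id LEFT OUTER, orders BY customer_id;
grunt> full_j = JOIN customers1 BY id FULL OUTER, orders BY customer_id;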
The UNION operator of Pig Latin is used to merge the content of two
relations. To perform UNION operation on two relations, their columns and
domains must be identical.
Example:
grunt> student1 = LOAD 'hdfs://localhost:9000/pig_data/student_data1.txt'
USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray,
phone:chararray, city:chararray);
grunt> student2 = LOAD 'hdfs://localhost:9000/pig_data/student_data2.txt'
USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray,
phone:chararray, city:chararray);
grunt> student = UNION student1, student2;
grunt> Dump student;
The SPLIT operator is used to split a relation into two or more relations.
Example:
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING
PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int,
phone:chararray, city:chararray);
Let us now split the relation into two, one listing the students of age less than
23, and the other listing the students having the age between 22 and 25.
grunt> SPLIT student into student1 if age<23, student2 if (age>22 and age<25);
grunt> Dump student1;
grunt> Dump student2;
Apache Hive is a data warehouse system built on top of Hadoop and is used
for analyzing structured and semi-structured data. Hive abstracts the
complexity of Hadoop MapReduce. Basically, it provides a mechanism to
project structure onto the data and perform queries written in HQL (Hive
Query Language) that are similar to SQL statements. Internally, these
queries or HQL gets converted to map reduce jobs by the Hive compiler.
Therefore, one doesn’t need to worry about writing complex MapReduce
programs to process data using Hadoop. It is targeted towards users who are
comfortable with SQL. Apache Hive supports Data Definition Language
(DDL), Data Manipulation Language (DML) and Pluggable Functions.
DDL: create table, create index, create views.
DML: Select, Where, group by, Join, Order By
Pluggable Functions:
UDF: User Defined Function
UDAF: User Defined Aggregate Function
UDTF: User Defined Table Function
Hive Data Types
Numeric Types
❑ TINYINT (1-byte signed integer, from -128 to 127)
❑ SMALLINT (2-byte signed integer, from -32,768 to 32,767)
❑ INT (4-byte signed integer, from -2,147,483,648 to 2,147,483,647)
❑ BIGINT (8-byte signed integer, from -9,223,372,036,854,775,808 to
9,223,372,036,854,775,807)
❑ FLOAT (4-byte single precision floating point number)
❑ DOUBLE (8-byte double precision floating point number)
❑ DECIMAL (Hive 0.13.0 introduced user definable precision and scale)
Date/Time Types: TIMESTAMP, DATE
String Types: STRING, VARCHAR, CHAR
Misc Types: BOOLEAN, BINARY
Self Study – Built-in Operators
Relational Operators, Arithmetic Operators, Logical Operators, and Complex Operators.
The following clauses specify a comment, row-format fields (field terminator and
line terminator), and the stored file type.
COMMENT ‘Employee details’
FIELDS TERMINATED BY ‘\t’
LINES TERMINATED BY ‘\n’
STORED IN TEXT FILE
The following query creates a table named employee using the above data.
hive> CREATE TABLE IF NOT EXISTS employee ( eid int, name String,
salary String)
COMMENT ‘Employee details’
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘\t’
LINES TERMINATED BY ‘\n’
STORED AS TEXTFILE;
Generally, after creating a table in SQL, we can insert data using the Insert
statement. But in Hive, we can insert data using the LOAD DATA statement.
While inserting data into Hive, it is better to use LOAD DATA to store bulk
records. There are two ways to load data: one is from local file system and
second is from Hadoop file system.
Syntax: LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename;
Let's insert the following data into the table. It is a text file named sample.txt in
the /home/user directory.
The following query loads the given text into the table:
hive> LOAD DATA LOCAL INPATH '/home/user/sample.txt' OVERWRITE INTO TABLE employee;
For example, a table named Tab1 contains employee data such as id, name, dept,
and yoj (i.e., year of joining). Suppose you need to retrieve the details of all
employees who joined in 2012. A query searches the whole table for the
required information. However, if you partition the employee data with
the year and store it in a separate file, it reduces the query processing
time.
Without partitioning: /tab1/employeedata/file1
With partitioning by year: /tab1/employeedata/2012/file2, /tab1/employeedata/2013/file3
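A partitioned version of Tab1 could be created and loaded as follows (a sketch; the input path is hypothetical):
hive> CREATE TABLE Tab1 (id INT, name STRING, dept STRING) PARTITIONED BY (yoj INT);
hive> LOAD DATA LOCAL INPATH '/home/user/tab1_2012.txt' INTO TABLE Tab1 PARTITION (yoj=2012);
hive> SELECT * FROM Tab1 WHERE yoj = 2012; → scans only the 2012 partition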
The Hive Query Language (HiveQL) is a query language for Hive to process and
analyze structured data in a Metastore.
Select-Where Statement
SELECT statement is used to retrieve the data from a table. WHERE clause
works similar to a condition. It filters the data using the condition and gives you
a finite result. The built-in operators and functions generate an expression,
which fulfils the condition.
Syntax:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];
Select-Where Example
Generate a query to retrieve the employee details who earn a salary of more
than Rs 30000.
hive> SELECT * FROM employee WHERE salary>30000;
Generate a query to retrieve the details of employees who belong to the IT dept and
draw more than Rs 30000.
hive> SELECT * FROM employee WHERE dept='IT' AND salary>30000;
Other HiveQL topics: Select–Order By, Select–Group By, Select–Joins, Views, Indexes, and Built-in Functions.
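For instance, grouping and ordering look like this in HiveQL (a sketch against the employee table used above, assuming it has a dept column and a numeric salary):
hive> SELECT dept, count(*) FROM employee GROUP BY dept;
hive> SELECT * FROM employee ORDER BY salary DESC LIMIT 5;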
Sqoop tool: the Import operation moves data from an RDBMS (MySQL, Oracle, DB2, SQL Server) into the Hadoop file system (HDFS, Hive, HBase), and the Export operation moves it back from Hadoop to the RDBMS.
Sqoop Import: The import tool imports individual tables from RDBMS to HDFS.
Each row in a table is treated as a record in HDFS. All records are stored as text
data in text files or as binary data in Avro and Sequence files.
Sqoop Export: The export tool exports a set of files from HDFS back to an
RDBMS. The files given as input to Sqoop contain records, which are called
rows in the table. Those are read and parsed into a set of records and delimited with a
user-specified delimiter.
Sqoop - Import
The ‘Import tool’ imports individual tables from RDBMS to HDFS. Each row in a
table is treated as a record in HDFS. All records are stored as text data in the text
files or as binary data in Avro and Sequence files.
Syntax: $ sqoop import (generic-args) (import-args)
$ sqoop-import (generic args) (import args) → alternative approach
The Hadoop specific generic arguments must precede any import arguments,
and the import arguments can be of any order.
Importing a Table into HDFS
$ sqoop import --connect --table --username --password --target-dir
--connect Takes JDBC url and connects to database
--table Source table name to be imported
--username Username to connect to database
--password Password of the connecting user
--target-dir Imports data to the specified directory
Sqoop – Import cont’d
Let us take an example of three tables named emp (ID, Name, Deg, Salary,
Dept), emp_address (ID, House_No, Street, City, Pincode), and emp_contact (ID,
Phone_No, Email), which are in a database called userdb on a MySQL database
server.
Importing a Table into HDFS:
$ sqoop import --connect jdbc:mysql://localhost/userdb \
--username root --table emp
Importing a Table into a Target Directory:
$ sqoop import --connect jdbc:mysql://localhost/userdb \
--username root --table emp_address --target-dir /emp_addr
Importing a Subset of Table Data:
$ sqoop import --connect jdbc:mysql://localhost/userdb \
--username root --table emp_contact --where "Phone_No = '953467890'"
Sqoop – Import Example cont’d
Incremental Import:
$ sqoop import --connect jdbc:mysql://localhost/userdb --username root \
--table emp --incremental append --check-column ID --last-value 1205
Import All Tables
Syntax:
$ sqoop import-all-tables (generic-args) (import-args)
$ sqoop-import-all-tables (generic-args) (import-args) → alternative approach
Note: If import-all-tables is used, it is mandatory that every table in that
database must have a primary key field.
Example:
$ sqoop import-all-tables --connect jdbc:mysql://localhost/userdb \
--username root
The ‘Export tool’ exports data back from HDFS to the RDBMS database. The
target table must exist in the target database. The files which are given as input
to Sqoop contain records, which are called rows in the table. Those are read and
parsed into a set of records and delimited with a user-specified delimiter.
The default operation is to insert all the records from the input files into the
database table using the INSERT statement. In update mode, Sqoop generates
UPDATE statements that replace the existing records in the database.
Syntax: $ sqoop export (generic-args) (export-args)
$ sqoop-export (generic-args) (export-args) → alternative approach
Example:
Let us take an example of employee data in a file in HDFS. The employee data
is available in the emp_data file in the ‘emp/’ directory in HDFS.
It is mandatory that the table to be exported is created manually and is present
in the database to which the data has to be exported.
Sqoop – Export Example
The following query is to create the table ‘employee’ in mysql command line.
$ mysql
mysql> USE db;
mysql> CREATE TABLE employee (id INT NOT NULL PRIMARY KEY,
name VARCHAR(20), deg VARCHAR(20), salary INT, dept VARCHAR(10));
The following command is used to export the table data (which is in emp_data
file on HDFS) to the employee table in db database of Mysql database server.
$ sqoop export --connect jdbc:mysql://localhost/db --username root \
--table employee --export-dir /emp/emp_data
The following command is used to verify the table in mysql command line.
mysql>select * from employee;
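Sqoop's update mode mentioned earlier is selected with the --update-key and --update-mode arguments (a sketch using the same table and export directory). Here allowinsert also inserts rows whose key is not already present, while updateonly (the default) only updates matching rows.
$ sqoop export --connect jdbc:mysql://localhost/db --username root \
--table employee --export-dir /emp/emp_data \
--update-key id --update-mode allowinsert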
One can store the data in HDFS either directly or through HBase. Data consumer
reads/accesses the data in HDFS randomly using HBase. HBase sits on top of
the Hadoop File System and provides read and write access.
(Figure: the data producer writes data into HBase, which reads from and writes to HDFS; the data consumer reads the data back through HBase.)
HDFS vs HBase:
❑ HDFS is a distributed file system suitable for storing large files, whereas HBase is a
database built on top of HDFS.
❑ HDFS does not support fast individual record lookups, whereas HBase provides fast
lookups for larger tables.
❑ HDFS provides high-latency batch processing, whereas HBase provides low-latency
access to single rows from billions of records (random access).
❑ HDFS provides only sequential access of data, whereas HBase internally uses hash
tables, provides random access, and stores the data in indexed HDFS files for faster
lookups.
HBase is a column-oriented database and the tables in it are sorted by row. The
table schema defines only column families, which are the key value pairs. A table
can have multiple column families and each column family can have any number of
columns. Subsequent column values are stored contiguously on the disk. Each
cell value of the table has a timestamp. In short, in an HBase:
❑ Table is a collection of rows.
❑ Row is a collection of column families.
❑ Column family is a collection of columns.
❑ Column is a collection of key value pairs.
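For illustration, an HBase table with two column families can be created and populated from the HBase shell (a sketch; the table name 'emp' and its column families are illustrative):
hbase> create 'emp', 'personal', 'professional'
hbase> put 'emp', '1', 'personal:name', 'Abhi'
hbase> put 'emp', '1', 'professional:designation', 'Manager'
hbase> get 'emp', '1'
hbase> scan 'emp'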
(Figure: HBase architecture — Client, HMaster, HRegion Servers, and HDFS.)
HRegion Servers: When a Region Server receives write and read requests from
the client, it assigns the request to a specific region, where the actual column
family resides. The client can contact HRegion Servers directly; HMaster's
permission is not mandatory for the client to communicate with HRegion Servers.
The client requires HMaster's help only when operations related to metadata and
schema changes are required.
HRegionServer is the Region Server implementation. It is responsible for
serving and managing regions or data that is present in a distributed
cluster. The region servers run on Data Nodes present in the Hadoop cluster.
HRegions: HRegions are the basic building elements of an HBase cluster; they
hold the distributed portions of tables and are comprised of column families. A
region contains multiple stores, one for each column family, and consists mainly of
two components: MemStore and HFile.
Source: 1keydata
RDBMS: An RDBMS is a data/information management system. In an RDBMS, tables are
used for data/information storage; each row of a table represents a record and each
column represents an attribute of the data. The organization of data and its manipulation
processes differ in an RDBMS from other databases. An RDBMS ensures the ACID
(atomicity, consistency, isolation, durability) properties required for designing a database.
The purpose of an RDBMS is to store, manage, and retrieve data as quickly and reliably as
possible.
Hadoop: An open-source software framework used for storing data and running
applications on a group of commodity hardware. It has large storage capacity and high
processing power, and can manage multiple concurrent processes at the same time. It is
used in predictive analysis, data mining and machine learning. It can handle both
structured and unstructured forms of data, and is more flexible in storing, processing, and
managing data than a traditional RDBMS. Unlike traditional systems, Hadoop enables
multiple analytical processes on the same data at the same time. It supports scalability
very flexibly.
RDBMS vs Hadoop:
❑ RDBMS: traditional row-column based databases, basically used for data storage,
manipulation and retrieval. Hadoop: an open-source software used for storing data
and running applications or processes concurrently.
❑ RDBMS is best suited for an OLTP environment; Hadoop is best suited for BIG data.
❑ In RDBMS, mostly structured data is processed; in Hadoop, both structured and
unstructured data are processed.
❑ RDBMS requires data normalization; Hadoop does not.
❑ The RDBMS data schema is static; the Hadoop data schema is dynamic.
❑ RDBMS cost is applicable for licensed software; Hadoop is free of cost, as it is
open-source software.
❑ RDBMS offers high data integrity; Hadoop offers lower data integrity than an RDBMS.
Databases are broadly classified into relational databases (serving OLAP and OLTP workloads) and non-relational databases.
❑ The non-relational database does not require a fixed schema, and avoids joins. There
are no tables, rows, primary keys or foreign keys.
❑ It is used for distributed data stores and specifically targeted for big data, for
example Google or Facebook which collects terabytes of data every day for their
users.
❑ Traditional relational database uses SQL syntax to store and retrieve data for further
insights. Instead, a non-relational database encompasses a wide range of database
technologies that can store structured, semi-structured, and unstructured data.
❑ It adheres to Brewer’s CAP theorem.
❑ The tables are stored as ASCII files and usually, each field is separated by tabs.
❑ The data scales horizontally.
Non-relational database
Why: In today’s time data is becoming easier to access and capture through
third parties such as Facebook, Google+ and others. Personal user information,
social graphs, geo location data, user-generated content and machine logging
data are just a few examples where the data has been increasing exponentially.
To serve these workloads properly, it is required to process huge amounts of data
for which SQL databases were never designed. NoSQL databases evolved
to handle such huge data properly.
Uses:
Log analysis
Time-based data
There are mainly four categories of NoSQL databases. Each of these categories
has its unique attributes and limitations.
Document-Oriented NoSQL DB stores and retrieves data as a key value pair but
the value part is stored as a document. The document is stored in JSON or XML
formats. The document type is mostly used for CMS (Content Management
Systems), blogging platforms, real-time analytics & e-commerce applications. It
should not be used for complex transactions which require multiple operations or
for queries against varying aggregate structures.
SQL (relational table):
ID, Name, Age, State
1, John, 27, California
NoSQL – Document-Based (key-value pair where the value is a document):
Key (ID): 1
Value (JSON): { "Name": "John", "Age": 27, "State": "California" }
A graph type database stores entities as well the relations amongst those
entities. The entity is stored as a node with the relationship as edges. An edge
gives a relationship between nodes. Every node and edge has a unique identifier.
Graph-based databases are mostly used for social networks, logistics, and spatial data.
Limitations of NoSQL databases:
❑ No standardization rules
❑ Limited query capabilities
❑ RDBMS databases and tools are comparatively mature
❑ It does not offer any traditional database capabilities, like consistency when
multiple transactions are performed simultaneously.
❑ When the volume of data increases, it becomes difficult to maintain unique keys.
❑ Doesn't work as well with relational data
❑ The learning curve is steep for new developers
❑ Being open-source options, they are not yet as popular with enterprises.
SQL vs NoSQL:
❑ SQL: Relational database. NoSQL: Non-relational, distributed database.
❑ SQL: Relational model. NoSQL: Model-less approach.
❑ SQL: Pre-defined schema. NoSQL: Dynamic schema for unstructured data.
❑ SQL: Table-based databases. NoSQL: Document-based, graph-based, wide-column
store, or key-value pair databases.
❑ SQL: Vertically scalable (by increasing system resources). NoSQL: Horizontally
scalable (by creating a cluster of commodity machines).
❑ SQL: Uses SQL. NoSQL: Uses UnQL (Unstructured Query Language).
❑ SQL: Not preferred for large datasets. NoSQL: Largely preferred for large datasets.
❑ SQL: Not a best fit for hierarchical data. NoSQL: Best fit for hierarchical storage as it
follows the key-value pair style of storing data, similar to JSON.
❑ SQL: Emphasis on ACID properties. NoSQL: Follows Brewer’s CAP theorem.
❑ SQL: Excellent support from vendors. NoSQL: Relies heavily on community support.
❑ SQL: Supports complex querying and data-keeping needs. NoSQL: Does not have
good support for complex querying.
❑ SQL: Can be configured for strong consistency. NoSQL: A few support strong
consistency (e.g., MongoDB); a few others can be configured for eventual
consistency (e.g., Cassandra).
❑ SQL examples: Oracle, DB2, MySQL, MS SQL, PostgreSQL, etc. NoSQL examples:
MongoDB, HBase, Cassandra, Redis, Neo4j, CouchDB, Couchbase, Riak, etc.
A guideline on the database type to use based on the functionality of the data:
❑ User sessions — Rapid access for reads and writes; no need to be durable →
Key-Value
❑ Financial — Needs transactional updates; tabular structure fits the data → RDBMS
❑ POS (Point of sale) — Depends on size and rate of ingest; lots of writes, infrequent
reads mostly for analytics → RDBMS (if modest), Key-Value or Document (if ingest is
very high), or Column if analytics is key
❑ Shopping Cart — High availability across multiple locations; can merge inconsistent
writes → Document or Key-Value
❑ Recommendations — Rapidly traverse links between friends, product purchases, and
ratings → Graph (Column if simple)
Consider a company that sells musical instruments and accessories online (and
in a network of shops). At a high-level, there are a number of problems that a
company needs to solve to be successful:
❑ Attract customers to its stores (both virtual and physical).
❑ Present them with relevant products.
❑ Once they decide to buy, process the payment and organize shipping.
To solve these problems a company might choose from a number of available
technologies that were designed to solve these problems:
❑ Store all the products in a document-based database. There are multiple
advantages of document databases: flexible schema, high availability, and
replication, among others.
❑ Model the recommendations using a graph-based database as such
databases reflect the factual and abstract relationships between customers
and their preferences.
❑ Once a product is sold, the transaction normally has a well-structured
schema. To store such data relational databases are best suited.
Benefits of sharding:
❑ Simplifies operations
❑ Eliminates fragmentation
❑ Creates faster response time
❑ Improves efficiency
❑ Scales extremely well
❑ Allows database engineers/architects to decide where data is stored
❑ Key based sharding, also known as hash based sharding, involves using a
value taken from newly written data — such as a customer’s ID, a client
application’s IP address, a ZIP code, etc. and plugging it into a hash function
to determine which shard the data should go to. A hash function is a function
that takes as input a piece of data (for example, a customer email) and
outputs a discrete value, known as a hash value. In the case of sharding, the
hash value is a shard ID used to determine which shard the incoming data
will be stored on.
❑ To ensure that entries are placed in the correct shards and in a consistent
manner, the values entered into the hash function should all come from the
same column. This column is known as a shard key. In simple terms,
shard keys are similar to primary keys in that both are columns which are
used to establish a unique identifier for individual rows.
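For example (an illustrative sketch with 4 shards and a deliberately simple hash function): shard_id = hash(shard_key) mod 4. If the hash of customer ID 1005 is 1005 itself, the row goes to shard 1005 mod 4 = 1, and customer ID 1006 goes to shard 2. The same key always hashes to the same shard, which is what keeps placement consistent.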
❑ To design a directory based sharding, one must create and maintain a lookup
table that uses a shard key to keep track of which shard holds which data. A
lookup table is a table that holds a static set of information about where
specific data can be found.
❑ The Delivery Zone column is defined as a shard key. Data from the shard key
is written to the lookup table along with whatever shard each respective row
should be written to. This is similar to range based sharding, but instead of
determining which range the shard key’s data falls into, each key is tied to its
own specific shard.
❑ The main appeal of directory based sharding is its flexibility.
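An illustrative lookup table (the zone names and shard labels are hypothetical):
Delivery Zone (shard key) → Shard
North → Shard A
South → Shard B
East → Shard C
West → Shard D
Every row whose Delivery Zone is South is written to and read from Shard B, regardless of how many zones or shards are later added to the table.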
(Figure: parallel computing — Big Data instructions I1 … Ip … In processed by multiple CPUs at different time slots t1 … tp … tn within a single machine.)
Distributed Computing
(Figure: distributed computing — the workload is spread across multiple machines coordinated through a Name Node.)
Infographics are:
❑ Best for telling a premeditated story and offer subjectivity
❑ Best for guiding the audience to conclusions and pointing out
relationships
❑ Created manually for one specific dataset
It is used for Marketing content, Resumes, Blog posts, and Case studies
etc.
Data visualizations are:
❑ Best for allowing the audience to draw their own conclusions, and
offer objectivity
❑ Ideal for understanding data at a glance
❑ Automatically generated for arbitrary datasets
It is used for Dashboards, Scorecards, Newsletters, Reports, and
Editorials etc.
Data Visualization Purpose
❑ Map
❑ Parallel Coordinate Plot
❑ Venn Diagram
❑ Timeline
❑ Euler Diagram
❑ Hyperbolic Trees
❑ Cluster Diagram
❑ Ordinogram
❑ Isoline
❑ Isosurface
❑ Streamline
❑ Direct Volume Rendering (DVR)
❑ List your friends that play Soccer OR Tennis.
❑ List your friends that play Soccer AND Tennis.
❑ Draw the Venn Diagram to show people that play Soccer but NOT Tennis.
❑ Draw the Venn Diagram to show people that play Soccer or play Tennis, but not both.
Timeline
Source: datavizcatalogue.com
Timeline cont’d
Source: officetimeline.com
Euler Diagram
Source: wikipedia
Class Exercise
Draw the Euler diagram of the sets, X = {1, 2, 5, 8}, Y = {1, 6, 9} and Z={4, 7, 8 ,
9}. Then draw the equivalent Venn Diagram.
Hyperbolic Trees
Data visualization types: Linear Data Visualization, Planar Data Visualization, Volumetric Data Visualization, Multidimensional Data Visualization, Temporal Data Visualization, Tree Data Visualization, and Network Data Visualization.
Difference between Sequential Computing and Parallel Computing
❑ The reasons why a system should be built distributed, not just parallel,
are: Scalability, Reliability, Data sharing, Resources sharing,
Heterogeneity and modularity, Geographic construction, and
Economic. Explain each of the terms in detail.
❑ Design considerations for distributed systems are: No global clock,
Geographical distribution, No shared memory, Independence and
heterogeneity, Fail-over mechanism, and Security concerns. Explain
each of the terms.
❑ Explain the following image.
Infographics are:
❑ Best for telling a premeditated story and offer subjectivity
❑ Best for guiding the audience to conclusions and pointing out
relationships
❑ Created manually for one specific dataset
❑ It is used for Marketing content, Resumes, Blog posts, and Case
studies etc.
Data visualizations are:
❑ Best for allowing the audience to draw their own conclusions, and
offer objectivity
❑ Ideal for understanding data at a glance
❑ Automatically generated for arbitrary datasets
❑ It is used for Dashboards, Scorecards, Newsletters, Reports, and
Editorials etc.
Class Work
Big data can be presented in various forms, which include simple line
diagrams, bar graphs, tables, metrics, etc. Techniques used for visual
representation of big data are as follows:
❑ Map
❑ Parallel Coordinate Plot
❑ Venn Diagram
❑ Timeline
❑ Euler Diagram
❑ Hyperbolic Trees
❑ Cluster Diagram
❑ Ordinogram
❑ Isoline
❑ Isosurface
❑ Streamline
❑ Direct Volume Rendering (DVR)
❑ List your friends that play Soccer OR Tennis.
❑ List your friends that play Soccer AND Tennis.
❑ Draw the Venn Diagram to show people that play Soccer but NOT Tennis.
❑ Draw the Venn Diagram to show people that play Soccer or play Tennis, but not both.
Class Work
Let A, B and C represent people who like apples, bananas, and carrots
respectively. The number of people in A = 10 (A1 to A10), B = 12 (B1 to B12)
and C = 16 (C1 to C16). Three people are such that they enjoy apples, bananas as
well as carrots. Two of them like apples and bananas. Let three people like
apples and carrots. Also, four people are such that they like bananas and carrots.
Draw the Venn diagram illustrating:
❑ How many people like apples only?
❑ How many people like only one of the three?
❑ How many people like all three?
Class Work
Draw a timeline chart for the following events:
Year Event
1941 First electronic computer
1947 First commercial stored program counter
1972 PROLOG language revealed
1950 Birth of AI
1956 Dartmouth conference
1957 Logic Theorist development
1968 Micro-world program SHRDLU created
1958 Lisp language developed
1963 Start of DoD’s advance research project
1970 First Expert System
1991 AI system beats human chess master
Class Work
Let A, B and C represent people who like apples, bananas, and carrots
respectively. The number of people in A = 10 (A1 to A10), B = 12 (B1 to B12)
and C = 16 (C1 to C16). Three people are such that they enjoy apples, bananas as
well as carrots. Two of them like apples and bananas. Let three people like
apples and carrots. Also, four people are such that they like bananas and carrots.
Draw the Euler diagram illustrating:
❑ How many people like apples only?
❑ How many people like only one of the three?
❑ How many people like all three?