
Big Data (Unit 2)

 Apache Hadoop
Hadoop is an open-source framework, overseen by the Apache Software Foundation and written in Java, for storing and processing huge datasets on clusters of commodity hardware.
Big data poses two main problems: the first is storing such a huge amount of data, and the second is processing the stored data. Traditional approaches such as RDBMS are not sufficient, owing to the heterogeneity of the data. Hadoop therefore comes as the solution to the problem of big data, i.e. storing and processing big data, with some extra capabilities.

 History of Hadoop

 Hadoop was started by Doug Cutting and Mike Cafarella in 2002, when they began working on the Apache Nutch project. Apache Nutch aimed to build a search engine system that could index 1 billion pages. After a lot of research on Nutch, they concluded that such a system would cost around half a million dollars in hardware, along with a monthly running cost of approximately $30,000, which is very expensive. They realized that their project architecture would not be capable of working with billions of pages on the web, so they looked for a feasible solution that could reduce the implementation cost as well as solve the problem of storing and processing large datasets.
 In 2003, they came across a paper published by Google that described the architecture of Google's distributed file system, GFS (Google File System), for storing large data sets. They realized that this paper could solve their problem of storing the very large files generated by the web crawling and indexing processes. But this paper was only half the solution to their problem.
 In 2004, Google published another paper, on the MapReduce technique, which was the solution for processing those large datasets. For Doug Cutting and Mike Cafarella this paper was the other half of the solution for their Nutch project. Google had described both techniques (GFS & MapReduce) only in papers; no open-source implementation was available. Doug Cutting knew from his work on Apache Lucene (a free and open-source information retrieval software library, originally written in Java by Doug Cutting in 1999) that open source is a great way to spread technology to more people. So, together with Mike Cafarella, he started implementing Google's techniques (GFS & MapReduce) as open source in the Apache Nutch project.
 In 2005, Cutting found that Nutch was limited to clusters of only 20 to 40 nodes. He soon realized two problems:
(a) Nutch wouldn't achieve its potential until it ran reliably on larger clusters,
(b) and that looked impossible with just two people (Doug Cutting & Mike Cafarella).
The engineering task in the Nutch project was much bigger than he had realized, so he started to look for a job with a company interested in investing in their efforts. He found Yahoo!, which had a large team of engineers eager to work on the project.
 So in 2006, Doug Cutting joined Yahoo along with the Nutch project. He wanted to provide the world with an open-source, reliable, scalable computing framework, with the help of Yahoo. At Yahoo he first separated the distributed computing parts from Nutch and formed a new project, Hadoop. (He gave it the name Hadoop after a yellow toy elephant owned by his son; the name was easy to pronounce and was a unique word.) He then wanted to make Hadoop work well on thousands of nodes, so with GFS and MapReduce as the blueprint he started to work on Hadoop.
 In 2007, Yahoo successfully tested Hadoop on a 1000-node cluster and started using it.
 In January 2008, Yahoo released Hadoop as an open-source project to the ASF (Apache Software Foundation). And in July 2008, the Apache Software Foundation successfully tested a 4000-node cluster with Hadoop.
 In 2009, Hadoop was successfully tested to sort a petabyte (PB) of data in less than 17 hours, handling billions of searches and indexing millions of web pages. Doug Cutting then left Yahoo and joined Cloudera to take on the challenge of spreading Hadoop to other industries.
 In December 2011, the Apache Software Foundation released Apache Hadoop version 1.0.
 Later, in August 2013, version 2.0.6 became available.
 Currently we have Apache Hadoop version 3.0, which was released in December 2017.

Let's summarize the history of Hadoop in the following short steps:

 In 2002, Doug Cutting and Mike Cafarella started to work on a project, Apache Nutch.
It is an open source web crawler software project.
 While working on Apache Nutch, they were dealing with big data. Storing that data was very costly, which became a problem for the project. This problem became one of the important reasons for the emergence of Hadoop.
 In 2003, Google introduced a file system known as GFS (Google file system). It is a
proprietary distributed file system developed to provide efficient access to data.
 In 2004, Google released a white paper on Map Reduce. This technique simplifies the
data processing on large clusters.
 In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS (Nutch Distributed File System). Along with this file system they also implemented MapReduce.



 In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, Doug Cutting introduced a new project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System). Hadoop's first version, 0.1.0, was released in this year.
 Doug Cutting named his project Hadoop after his son's toy elephant.
 In 2007, Yahoo ran two clusters of 1000 machines.
 In 2008, Hadoop became the fastest system to sort 1 terabyte of data, on a 900-node cluster within 209 seconds.
 In 2013, Hadoop 2.2 was released.
 In 2017, Hadoop 3.0 was released.

 HDFS (Hadoop Distributed File System)


With growing data velocity, the data size easily outgrows the storage limit of a single machine. A solution is to store the data across a network of machines. Such file systems are called distributed file systems. Since data is stored across a network, all the complications of a network come in.

This is where Hadoop comes in. It provides one of the most reliable file systems. HDFS (Hadoop Distributed File System) is a uniquely designed file system that provides storage for extremely large files with a streaming data access pattern, and it runs on commodity hardware. HDFS is the primary storage system used by Hadoop applications. This open-source framework works by rapidly transferring data between nodes. It is often used by companies that need to handle and store big data. HDFS is a key component of many Hadoop systems, as it provides a means for managing big data as well as supporting big data analytics.

 Extremely large files: Here we are talking about data in the range of petabytes (1000 TB).
 Streaming data access pattern: HDFS is designed on the principle of write once, read many times. Once data is written, large portions of the dataset can be processed any number of times.
 Commodity hardware: Hardware that is inexpensive and easily available in the market. This is one of the features that especially distinguishes HDFS from other file systems.

 Where to use HDFS

 Very Large Files: Files should be of hundreds of megabytes, gigabytes or more.


 Streaming Data Access: The time to read the whole data set is more important than the latency in reading the first record. HDFS is built on the write-once, read-many-times pattern.
 Commodity Hardware: It works on low-cost hardware.

 Where not to use HDFS

 Low latency data access: Applications that require very little time to access the first record should not use HDFS, as it gives importance to reading the whole data set rather than the time to fetch the first record.
 Small file problem: Having lots of small files results in lots of seeks and lots of movement from one data node to another to retrieve each small file; this whole process is a very inefficient data access pattern.
 Multiple writes: HDFS should not be used when files have to be modified repeatedly or written by multiple writers; it follows the write-once model.

 How does HDFS work?


The way HDFS works is by having a main "NameNode" and multiple "DataNodes" on a commodity hardware cluster. All the nodes are usually organized within the same physical rack in the data center. Data is then broken down into separate "blocks" that are distributed among the various data nodes for storage. Blocks are also replicated across nodes to reduce the likelihood of failure.

The NameNode is the "smart" node in the cluster. It knows exactly which data node contains which blocks and where the data nodes are located within the machine cluster. The NameNode also manages access to the files, including reads, writes, creates, deletes and replication of data blocks across different data nodes.

The NameNode operates in a “loosely coupled” way with the data nodes. This means the
elements of the cluster can dynamically adapt to the real-time demand of server capacity by
adding or subtracting nodes as the system sees fit.

The data nodes constantly communicate with the NameNode to see if they need to complete a certain task. This constant communication ensures that the NameNode is aware of each data node's status at all times. Since the NameNode assigns tasks to the individual DataNodes, should it realize that a DataNode is not functioning properly it can immediately re-assign that node's task to a different node containing the same data block. Data nodes also communicate with each other so they can cooperate during normal file operations. Clearly the NameNode is critical to the whole system and should be replicated to prevent system failure.

Again, data blocks are replicated across multiple data nodes and access is managed by the NameNode. This means that when a data node no longer sends a "life signal" (heartbeat) to the NameNode, the NameNode unmaps the data node from the cluster and keeps operating with the other data nodes as if nothing had happened. When this data node comes back to life, or a different (new) data node is detected, that new data node is (re-)added to the system. That is what makes HDFS resilient and self-healing. Since data blocks are replicated across several data nodes, the failure of one server will not corrupt a file.



1. Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS blocks are 128 MB by default, and this is configurable. Files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike in a regular file system, a file in HDFS that is smaller than the block size does not occupy the full block's size; e.g. a 5 MB file stored in HDFS with a block size of 128 MB takes only 5 MB of space. The HDFS block size is large in order to minimize the cost of seeks.
2. Name Node: HDFS works in a master-worker pattern where the name node acts as master. The Name Node is the controller and manager of HDFS, as it knows the status and the metadata of all the files in HDFS; the metadata information being file permissions, names and the location of each block. The metadata is small, so it is stored in the memory of the name node, allowing faster access to data. Moreover, the HDFS cluster is accessed by multiple clients concurrently, so all this information is handled by a single machine. File system operations like opening, closing, renaming etc. are executed by it.
3. Data Node: Data nodes store and retrieve blocks when they are told to by the client or the name node. They report back to the name node periodically with the list of blocks that they are storing. The data node, being commodity hardware, also does the work of block creation, deletion and replication as instructed by the name node. Since all the metadata is stored in the name node, it is very important: if it fails, the file system cannot be used, as there would be no way of knowing how to reconstruct the files from the blocks present in the data nodes. To overcome this, the concept of the secondary name node arises.
4. Secondary Name Node: It is a separate physical machine which acts as a helper of the name node. It performs periodic checkpoints: it communicates with the name node and takes snapshots of the metadata, which helps minimize downtime and loss of data.
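
For illustration, here is a minimal Java sketch of an HDFS client that writes a small file and reads it back through Hadoop's FileSystem API. The NameNode address (hdfs://namenode:9000) and the path /user/demo/sample.txt are assumptions chosen only for this example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class HdfsReadWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");   // assumed NameNode address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/sample.txt");   // assumed path

            // Write: the client asks the NameNode where to place the blocks, then
            // streams the bytes to DataNodes; each block is replicated (3 copies by default).
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the NameNode returns the block locations and the client reads
            // the data directly from the DataNodes.
            try (FSDataInputStream in = fs.open(file);
                 BufferedReader reader = new BufferedReader(
                         new InputStreamReader(in, StandardCharsets.UTF_8))) {
                System.out.println(reader.readLine());
            }
        }
    }
}

Note that the NameNode only supplies metadata (block locations); the file bytes themselves flow directly between the client and the DataNodes.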

 Features of HDFS

There are several features that make HDFS particularly useful, including:

 Data replication. This is used to ensure that the data is always available and prevents
data loss. For example, when a node crashes or there is a hardware failure, replicated data
can be pulled from elsewhere within a cluster, so processing continues while data is
recovered.
 Fault tolerance and reliability. HDFS' ability to replicate file blocks and store them
across nodes in a large cluster ensures fault tolerance and reliability.
 High availability. As mentioned earlier, because of replication across nodes, data remains available even if a NameNode or a DataNode fails.
 Scalability. Because HDFS stores data on various nodes in the cluster, as requirements
increase, a cluster can scale to hundreds of nodes.
 High throughput. Because HDFS stores data in a distributed manner, the data can be
processed in parallel on a cluster of nodes. This, plus data locality (see next bullet), cut
the processing time and enable high throughput.
 Data locality. With HDFS, computation happens on the DataNodes where the data
resides, rather than having the data move to where the computational unit is. By
minimizing the distance between the data and the computing process, this approach
decreases network congestion and boosts a system's overall throughput.
 Distributed data storage. This is one of the most important features of HDFS and makes Hadoop very powerful. Here, data is divided into multiple blocks and stored across nodes.
 Portability. HDFS is designed in such a way that it can easily be ported from one platform to another.

 Goals of HDFS
Fault detection and recovery − Since HDFS includes a large number of commodity hardware components, failure of components is frequent. Therefore HDFS should have mechanisms for quick and automatic fault detection and recovery.

Huge datasets − HDFS should have hundreds of nodes per cluster to manage the applications
having huge datasets. HDFS accommodates applications that have data sets typically gigabytes
to terabytes in size.

Hardware at data − A requested task can be done efficiently, when the computation takes place
near the data. Especially where huge datasets are involved, it reduces the network traffic and
increases the throughput.

Access to streaming data - HDFS is intended more for batch processing versus interactive use,
so the emphasis in the design is for high data throughput rates, which accommodate streaming
access to data sets.

Portability − To facilitate adoption, HDFS is designed to be portable across multiple hardware platforms and to be compatible with a variety of underlying operating systems.

Coherence model − Applications that run on HDFS are required to follow the write-once-read-many approach, so a file, once created, need not be changed. However, it can be appended to and truncated.

 Benefits of using HDFS

 1. Cost :- Hadoop is open-source and uses cost-effective commodity hardware, which provides a cost-efficient model, unlike traditional relational databases that require expensive hardware and high-end processors to deal with big data. The problem with traditional relational databases is that storing massive volumes of data is not cost-effective, so companies started discarding raw data, which may mean losing the correct picture of their business. Hadoop thus gives us two main cost benefits: it is open source, i.e. free to use, and it runs on commodity hardware, which is also inexpensive.
 2. Scalability :- Hadoop is a highly scalable model. A large amount of data is divided across multiple inexpensive machines in a cluster and processed in parallel. The number of these machines, or nodes, can be increased or decreased as per the enterprise's requirements. In a traditional RDBMS (Relational DataBase Management System), the system cannot easily be scaled to handle such large amounts of data.
 3. Flexibility :- Hadoop is designed in such a way that it can deal with any kind of dataset, whether structured (MySQL data), semi-structured (XML, JSON) or unstructured (images and videos), very efficiently. This means it can easily process any kind of data independently of its structure, which makes it highly flexible. This is very useful for enterprises, as they can process large datasets easily and use Hadoop to analyze valuable insights from sources like social media, email, etc. With this flexibility, Hadoop can be used for log processing, data warehousing, fraud detection, etc.
 4. Speed :- Hadoop manages its storage with a distributed file system, HDFS (Hadoop Distributed File System). In a DFS, a large file is broken into small file blocks that are distributed among the nodes available in a Hadoop cluster. Because this massive number of file blocks is processed in parallel, Hadoop is faster and provides a higher level of performance compared with traditional database management systems. When you are dealing with a large amount of unstructured data, speed is an important factor; with Hadoop you can easily access terabytes of data in just a few minutes.
 5. Fault Tolerance :- Hadoop uses commodity hardware (inexpensive systems) which can crash at any moment. In Hadoop, data is replicated on various DataNodes in the cluster, which ensures the availability of data even if one of your systems crashes. If the machine you are reading from faces a technical issue, the data can also be read from other nodes in the Hadoop cluster, because the data is copied or replicated by default. Hadoop makes 3 copies of each file block and stores them on different nodes.
 6. High Throughput :- Hadoop works on a distributed file system where various jobs are assigned to various DataNodes in the cluster, and these portions of the data are processed in parallel in the Hadoop cluster, which produces high throughput. Throughput is nothing but the amount of work done per unit time.
 7. Minimum Network Traffic :- In Hadoop, each task is divided into various small sub-tasks, which are then assigned to the DataNodes available in the Hadoop cluster. Each DataNode processes a small amount of data, which leads to low traffic in the Hadoop cluster.
 8. Compatibility and portability :- HDFS is designed to be portable across a variety of
hardware setups and compatible with several underlying operating systems ultimately
providing users optionality to use HDFS with their own tailored setup. These advantages
are especially significant when dealing with big data and were made possible with the
particular way HDFS handles data.
 9. Data locality:- When it comes to the Hadoop file system, the data resides in data
nodes, as opposed to having the data move to where the computational unit is. By
shortening the distance between the data and the computing process, it decreases network
congestion and makes the system more effective and efficient.
 10. Stores large amounts of data:- Data storage is what HDFS is all about - meaning
data of all varieties and sizes - but particularly large amounts of data from corporations
that are struggling to store it. This includes both structured and unstructured data.

 Disadvantages of HDFS

 1. Problem with Small Files :- Hadoop performs efficiently over a small number of files of large size. Hadoop stores files in the form of file blocks, which are 128 MB (by default) to 256 MB in size. Hadoop struggles when it needs to access a large number of small files; so many small files overload the NameNode and make it difficult to work.
 2. Vulnerability :- Hadoop is a framework written in Java, and Java is one of the most commonly used programming languages, which makes Hadoop a more attractive target; known Java vulnerabilities can be exploited by cyber-criminals.
 3. Low Performance in Small Data Environments :- Hadoop is mainly designed for dealing with large datasets, so it can be efficiently utilized by organizations that generate a massive volume of data. Its efficiency decreases in small-data environments.
 4. Lack of Security :- Data is everything for an organization, yet by default the security features in Hadoop are disabled. So whoever manages the data needs to be careful about this security aspect and take appropriate action. Hadoop uses Kerberos for security, which is not easy to manage, and storage and network encryption are missing from Kerberos, which makes this an even greater concern.
 5. Processing Overhead :- Read/write operations in Hadoop are expensive, since we are dealing with data in the TB or PB range. In Hadoop, data is read from and written to disk, which makes it difficult to perform in-memory computation and leads to processing overhead.



 6. Supports Only Batch Processing :- A batch process runs in the background and does not have any kind of interaction with the user. The engine used for these processes inside the Hadoop core is not very efficient for anything else; producing output with low latency is not possible with it.

 Components of Hadoop
1. MapReduce :- MapReduce handles computational processing; it is the data processing layer of Hadoop. It is a software framework for easily writing applications that process the vast amounts of structured and unstructured data stored in the Hadoop Distributed File System (HDFS). It processes huge amounts of data in parallel by dividing the job (the submitted job) into a set of independent tasks (sub-jobs).

In Hadoop, MapReduce works by breaking the processing into phases: Map and Reduce. The
Map is the first phase of processing, where we specify all the complex logic/business
rules/costly code. Reduce is the second phase of processing, where we specify light-weight
processing like aggregation/summation.
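
As an illustration of these two phases, the following is a minimal Java sketch of a word-count Mapper and Reducer written against Hadoop's MapReduce API; the class names are chosen only for this example.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: tokenize each input line and emit (word, 1) pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce phase: light-weight aggregation, summing the counts for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}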

2. HDFS :- HDFS (Hadoop Distributed File System) is the storage layer of Hadoop. It also follows the master-slave pattern: the NameNode acts as the master and stores the metadata about the DataNodes, while the DataNodes act as slaves and store the actual data on local disks in parallel.

3. YARN :- YARN is used for resource allocation. It is the processing framework in Hadoop which provides resource management, and it allows multiple data processing engines, such as real-time streaming, data science and batch processing, to handle data stored on a single platform.

4. Hadoop Common: Hadoop Common refers to the collection of common utilities and
libraries that support other Hadoop modules. It is an essential part or module of the Apache
Hadoop Framework, along with the Hadoop Distributed File System (HDFS), Hadoop YARN
and Hadoop MapReduce. Like all other modules, Hadoop Common assumes that hardware
failures are common and that these should be automatically handled in software by the
Hadoop Framework. These Java libraries are used to start Hadoop and are used by other
Hadoop modules.

 Data formats used in Hadoop

1. Text/CSV Files :- The text/CSV file is the most commonly used data file format. It is the most human-readable and also ubiquitously easy to parse. It is the format of choice when exporting data from an RDBMS table. It can be read or written in any programming language and is mostly delimited by a comma or a tab. Text and CSV files are quite common, and Hadoop developers and data scientists frequently receive text and CSV files to work on.
2. JSON Records :- JSON is a text format that stores metadata with the data, so it fully supports schema evolution. You can easily add or remove attributes for each datum. However, because it is a text format, it does not support block compression. JSON Records files contain records where each line is its own JSON datum; metadata is stored with the data and the file is splittable, but again it does not support block compression.
3. Avro Files :- The Avro file format offers efficient storage due to its optimized binary encoding. It is widely supported both inside and outside the Hadoop ecosystem. The Avro file format is ideal for long-term storage of important data. It can be read from and written to in many languages, such as Java, Scala and so on. Schema metadata can be embedded in the file to ensure that it will always be readable, and schema evolution can accommodate changes. The Avro file format is considered the best choice for general-purpose storage in Hadoop. It can be processed by many languages (currently C, C++, C#, Java, Python and Ruby). A key feature of the Avro format is its robust support for data schemas that change over time, i.e. schema evolution. Avro handles schema changes like missing fields, added fields and changed fields.
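
To make the idea of an embedded schema concrete, here is a small Java sketch that writes one record to an Avro data file and reads it back using Avro's generic API; the User schema and the local file name are invented for this example.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroDemo {
    public static void main(String[] args) throws Exception {
        // The schema travels with the file, so any reader always knows how to decode it.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Alice");
        user.put("age", 30);

        File file = new File("users.avro");   // a local file, just for illustration

        // Write the record in Avro's compact binary encoding.
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, file);
            writer.append(user);
        }

        // Read it back; the reader obtains the schema from the file itself.
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord r : reader) {
                System.out.println(r.get("name") + " " + r.get("age"));
            }
        }
    }
}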

4. Sequence Files :- The SequenceFile format can be used to store binary data, such as images. Sequence files store key-value pairs in a binary container format and are more efficient than a text file. However, sequence files are not human-readable.
5. RC Files :- The RC File (Record Columnar File) was the first columnar file format in Hadoop and has significant compression and query performance benefits. But it does not support schema evolution; if you want to add anything to an RC file you have to rewrite the whole file, which is a slow process.
6. ORC Files :- The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data. It was designed to overcome the limitations of other file formats. It compresses better, but still does not support schema evolution. It is worth noting that ORC is a format primarily backed by Hortonworks, and it is not supported by Cloudera Impala. ORC is the compressed version of the RC file and offers all the benefits of the RC file with some enhancements: ORC files compress better than RC files, enabling faster queries. But it does not support schema evolution.
7. Parquet Files :- The Parquet file format is also a columnar format. Just like the ORC file, it is great for compression with great query performance, and it is especially efficient when querying data from specific columns. The Parquet format is computationally intensive on the write side, but it reduces a lot of I/O cost to give great read performance. It enjoys more freedom than ORC in schema evolution, in that it can add new columns at the end of the structure. It is also backed by Cloudera and optimized for Impala.

 Analysing data with Hadoop


Data will always be a part of your workflows, no matter where you work or what you do. The
amount of data produced every day is truly staggering. Having a large amount of data isn’t the
problem, but being able to store and process that data is truly challenging. With every
organization generating data like never before, companies are constantly seeking to pave the way
in Digital Transformation. With Hadoop Big Data Tools, you can efficiently handle absorption,
analysis, storage, and data maintenance issues. You can further execute Advanced Analytics,
Predictive Analytics, Data Mining, and Machine Learning applications. Hadoop Big Data Tools
can make your journey into Big Data quite easy. There is a wide range of analytical tools available in the market that help Hadoop deal with astronomically large data efficiently.

1. Apache Spark

Apache Spark is an open-source processing engine that is designed for ease of analytics operations. It is a cluster computing platform that is designed to be fast and made for general-purpose use. Spark covers various batch applications, machine learning, streaming data processing, and interactive queries.

Features of Spark:

 In memory processing
 Tight Integration Of component
 Easy and In-expensive



 A powerful processing engine makes it very fast
 Spark Streaming provides a high-level library for stream processing
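
For illustration, here is a minimal word-count sketch using Spark's Java API (this assumes Spark 2.x or later; the HDFS input and output paths are invented for the example):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // "local[*]" runs Spark on the local machine; on a cluster you would submit to YARN.
        SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/input.txt");   // assumed path

            JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);   // aggregation happens in memory across the cluster

            counts.saveAsTextFile("hdfs:///user/demo/output");   // assumed path
        }
    }
}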

2. Map Reduce

MapReduce is a programming model and processing framework that runs on the YARN framework. The primary feature of MapReduce is that it performs distributed processing in parallel across a Hadoop cluster, which is what makes Hadoop work so fast, because when we are dealing with big data, serial processing is no longer of any use.

Features of Map-Reduce:

 Scalable
 Fault Tolerance
 Parallel Processing
 Tunable Replication
 Load Balancing

3. Apache Hive

Apache Hive is a data warehousing tool that is built on top of Hadoop; data warehousing is nothing but storing, at one fixed location, data generated from various sources. Hive is one of the best tools used for data analysis on Hadoop. Anyone who knows SQL can comfortably use Apache Hive. The query language of Hive is known as HQL or HiveQL.

Features of Hive:

 Queries are similar to SQL queries.
 Hive supports different storage types: HBase, ORC, plain text, etc.
 Hive has built-in functions for data mining and other tasks.
 Hive operates on compressed data that is present inside the Hadoop ecosystem.

4. Apache Impala

Apache Impala is an open-source SQL engine designed for Hadoop. Impala overcomes the
speed-related issue in Apache Hive with its faster-processing speed. Apache Impala uses similar
kinds of SQL syntax, ODBC driver, and user interface as that of Apache Hive. Apache Impala
can easily be integrated with Hadoop for data analytics purposes.

Features of Impala:

 Easy-Integration
 Scalability
 Security
 In Memory data processing

5. Apache Mahout

The name Mahout is taken from the Hindi word mahavat, which means elephant rider. Apache Mahout runs its algorithms on top of Hadoop, so it is named Mahout. Mahout is mainly used for implementing various machine learning algorithms on Hadoop, such as classification, collaborative filtering and recommendation. Apache Mahout can also run its machine learning algorithms without Hadoop.

Features of Mahout:

 Used for Machine Learning Application


 Mahout has Vector and Matrix libraries
 Ability to analyze large datasets quickly

6. Apache Pig

Pig was initially developed by Yahoo to make programming easier. Apache Pig has the capability to process extensive datasets, as it works on top of Hadoop. Apache Pig is used for analysing massive datasets by representing them as data flows. Apache Pig also raises the level of abstraction for processing enormous datasets. Pig Latin is the scripting language that the developer uses for working on the Pig framework; it runs on the Pig runtime.

Features of Pig:

 Easy to program
 Rich set of operators
 Ability to handle various kind of data
 Extensibility

7. HBase

HBase is a non-relational, NoSQL, distributed and column-oriented database. HBase consists of various tables, where each table has multiple rows. These rows have multiple column families, and each column family has columns that contain key-value pairs. HBase works on top of HDFS (Hadoop Distributed File System). We use HBase for looking up small amounts of data within much larger datasets.

Features of HBase:

 HBase has Linear and Modular Scalability


 JAVA API can easily be used for client access
 Block cache for real time data queries
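
To illustrate the Java client API mentioned above, here is a minimal sketch that puts one key-value pair into a table and reads it back. The ZooKeeper address, the table name "users" and the column family "info" are assumptions for this example, and the table is assumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk-host");   // assumed ZooKeeper quorum

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {   // assumed table

            // Store a key-value pair in the "info" column family for row key "user1".
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Delhi"));
            table.put(put);

            // Look the same cell up again by row key.
            Result result = table.get(new Get(Bytes.toBytes("user1")));
            byte[] city = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
            System.out.println(Bytes.toString(city));
        }
    }
}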

8. Apache Sqoop

Sqoop is a command-line tool developed by Apache. The primary purpose of Apache Sqoop is to import structured data from an RDBMS (Relational Database Management System) such as MySQL, SQL Server or Oracle into HDFS (Hadoop Distributed File System). Sqoop can also export the data from HDFS back to an RDBMS.

Features of Sqoop:

 Sqoop can Import Data To Hive or HBase


 Connecting to database server
 Controlling parallelism



9. Tableau

Tableau is data visualization software that can be used for data analytics and business intelligence. It provides a variety of interactive visualizations to showcase the insights of the data, can translate queries into visualizations, and can import data of all ranges and sizes. Tableau offers rapid analysis and processing, generating useful visualization charts on interactive dashboards and worksheets.

Features of Tableau:

 Tableau supports bar charts, histograms, pie charts, motion charts, bullet charts, Gantt charts and many more
 Secure and Robust
 Interactive Dashboard and worksheets

10. Apache Storm

Apache Storm is a free, open-source, distributed real-time computation system built using programming languages like Clojure and Java, and it can be used with many programming languages. Apache Storm is used for stream processing, which is very fast. Daemons like Nimbus, Zookeeper and Supervisor are used in Apache Storm. Apache Storm can be used for real-time processing, online machine learning, and much more. Companies like Yahoo, Spotify, Twitter and many others use Apache Storm.

Features of Storm:

 Easy to operate
 Each node can process millions of tuples per second
 Scalable and fault tolerant

 Scaling out
Scaling :- Scaling is growing an infrastructure (compute, storage, and networking) larger so that the applications riding on that infrastructure can serve more people at a time. When architects talk about the ability of an infrastructure design to scale, they mean that the design as conceived is able to grow larger over time without wholesale replacement. The terms "scale up" and "scale out" refer to the way in which the infrastructure is grown. The primary benefit of Hadoop is its scalability: one can easily scale the cluster by adding more nodes.

Scale out (horizontal scalability) :- Scaling out is basically the addition of more machines, i.e. growing the cluster. In horizontal scaling, instead of increasing the hardware capacity of an individual machine, you add more nodes to the existing cluster and, most importantly, you can add more machines without stopping the system. Therefore there is no downtime while scaling out. In the end, to meet your requirements, you will have more machines working in parallel.

[While scaling out involves adding more discrete units to a system in order to add capacity, scaling up
involves building existing units by integrating resources into them.]



When to Scale Out Infrastructure

 When you need a long-term scaling strategy: The incremental nature of scaling out allows you
to scale your infrastructure for expected, long-term data growth. Components can be added or
removed depending on your goals.
 When upgrades need to be flexible: Scaling out avoids the limitations of depreciating
technology, as well as vendor lock-in for specific hardware technologies.
 When storage workloads need to be distributed: Scaling out is perfect for use cases that
require workloads to be distributed across several storage nodes.

Strengths

 Embraces newer server technologies: Because the architecture is not constrained by older
hardware, scale-out infrastructure is not affected by capacity and performance issues as much as
scale-up infrastructure.
 Adaptability to demand changes: Scale-out architecture makes it easier to adapt to changes in
demand since services and hardware can be removed or added to satisfy demand needs. This also
makes it easy to carry out resource scaling.
 Cost management: Scaling out follows an incremental model, which makes costs more
predictable. Furthermore, such a model allows you to pay for the resources required as you need
them.

Weaknesses

 Limited rack space: Scale-out infrastructure poses the risk of running out of rack space.
Theoretically, rack space can get to a point where it cannot support increasing demand, showing
that scaling out is not always the approach to handle greater demand.
 Increased operational costs: The introduction of more server resources introduces additional
costs, such as licensing, cooling, and power.
 Higher upfront costs: Setting up a scale-out system requires a sizable investment, as you’re not
just upgrading existing infrastructure.

Scale up (vertical scalability) :- In vertical scaling, you increase the hardware capacity of an individual machine. In other words, you add more RAM or CPU to your existing system to make it more robust and powerful.

[Alternately, scaling up may involve adding storage capacity to an existing server, whether that's a piece
of hardware or a logically partitioned virtual machine.]

When to Scale Up Infrastructure

 When there's a performance impact: A good indicator of when to scale up is when your workloads start reaching performance limits, resulting in increased latency and performance bottlenecks caused by I/O and CPU capacity.
 When storage optimization does not work: Whenever the effectiveness of optimization
solutions for performance and capacity diminishes, it may be time to scale up.

Strengths

 Relative speed: Replacing a resource, such as a single processor with a dual processor,
means that the throughput of the CPU is doubled. The same can be done to resources
such as dynamic random access memory (DRAM) to improve dynamic memory
performance.



 Simplicity: Increasing the size of an existing system means that network connectivity
and software configuration do not change. As a result, the time and effort saved ensure
the scaling process is much more straightforward compared to scaling out architecture.
 Cost-effectiveness: A scale-up approach is cheaper compared to scaling out, as
networking hardware and licensing cost much less. Additionally, operational costs such
as cooling are lower with scale-up architectures.
 Limited energy consumption: As less physical equipment is needed in comparison to
scaling out, the overall energy consumption related to scaling up is significantly lessened.

Weaknesses

 Latency: Introducing higher capacity machines may not guarantee that a workload runs
faster. Latency may be introduced in scale-up architecture for a use case such as video
processing, which in turn may lead to worse performance.
 Labor and risks: Upgrading the system can be cumbersome, as you may, for instance,
have to copy data to a new server. Switchover to a new server may result in downtime
and poses a risk of data loss during the process.
 Aging hardware: The constraints of aging equipment lead to diminishing effectiveness
and efficiency with time. Backup and recovery times are examples of functionality that is
negatively impacted by diminishing performance and capacity.

 Hadoop streaming
Hadoop Streaming is a utility that comes with the Hadoop distribution. It can be used to execute programs for big data analysis. Hadoop Streaming can be performed using languages like Python, Java, PHP, Scala, Perl, UNIX shell scripts, and many more. The utility allows us to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer. It uses Unix streams as the interface between Hadoop and our MapReduce program, so that we can write our MapReduce program in any language which can read from standard input and write to standard output.

Features of Hadoop Streaming

Some of the key features associated with Hadoop Streaming are as follows :
 Hadoop Streaming is a part of the Hadoop distribution.
 It facilitates ease of writing Map Reduce programs and codes.
 Hadoop Streaming supports almost all types of programming languages such as Python,
C++, Ruby, Perl etc.
 The entire Hadoop Streaming framework runs on Java. However, the codes might be
written in different languages as mentioned in the above point.
 The Hadoop Streaming process uses Unix Streams that act as an interface between
Hadoop and Map Reduce programs.
 Hadoop Streaming uses various Streaming Command Options and the two mandatory
ones are – -input directoryname or filename and -output directoryname
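
As a sketch of the contract a streaming mapper must follow (read lines from standard input, write tab-separated key/value pairs to standard output), here is a small stand-alone Java program; any executable that behaves like this, in any language, could be supplied through the -mapper option.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// A minimal streaming mapper: reads raw lines from standard input and writes
// tab-separated (word, 1) pairs to standard output, which is the interface
// Hadoop Streaming expects from a mapper executable.
public class StreamingWordCountMapper {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        String line;
        while ((line = in.readLine()) != null) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    System.out.println(word + "\t1");   // key<TAB>value
                }
            }
        }
    }
}

An equivalent script could just as easily be written in Python or Perl; the only requirement is the standard-input/standard-output contract.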

 Hadoop pipes
Apache Hadoop provides an adapter layer called pipes which allows C++ application code to be
used in MapReduce programs. Applications that require high numerical performance may see
better throughput if written in C++ and used through Pipes. The primary approach is to split the
C++ code into a separate process that does the application-specific work. In many ways, the approach is similar to Hadoop Streaming, but it uses Writable serialization to convert the
types into bytes that are sent to the process via a socket.

 Hadoop Ecosystem

Apache Hadoop ecosystem refers to the various components of the Apache Hadoop software
library; it includes open source projects as well as a complete range of complementary tools.
Some of the most well-known tools of the Hadoop ecosystem include

 HDFS -> Hadoop Distributed File System


 YARN -> Yet Another Resource Negotiator
 MapReduce -> Data processing using programming
 Spark -> In-memory Data Processing
 PIG, HIVE-> Data Processing Services using Query (SQL-like)
 HBase -> NoSQL Database
 Mahout, Spark MLlib -> Machine Learning
 Apache Drill -> SQL on Hadoop
 Zookeeper -> Managing Cluster
 Oozie -> Job Scheduling
 Flume, Sqoop -> Data Ingesting Services
 Solr & Lucene -> Searching & Indexing
 Ambari -> Provision, Monitor and Maintain cluster

 HDFS

 Hadoop Distributed File System is the core component or you can say, the backbone of
Hadoop Ecosystem.
 HDFS is the one, which makes it possible to store different types of large data sets (i.e.
structured, unstructured and semi structured data).
 HDFS creates a level of abstraction over the resources, from where we can see the whole
HDFS as a single unit.



 It helps us in storing our data across various nodes and maintaining the log file about the
stored data (metadata).
 HDFS has two core components, i.e. NameNode and DataNode
1. The NameNode is the main node and it doesn’t store the actual data. It contains
metadata, just like a log file or you can say as a table of content. Therefore, it
requires less storage and high computational resources.
2. On the other hand, all your data is stored on the DataNodes and hence it requires
more storage resources. These DataNodes are commodity hardware (like your
laptops and desktops) in the distributed environment. That’s the reason, why
Hadoop solutions are very cost effective.
3. You always communicate with the NameNode while writing data. The NameNode then tells the client where to store and replicate the data across the various DataNodes.
 YARN

Think of YARN as the brain of your Hadoop ecosystem. It performs all your processing activities by allocating resources and scheduling tasks.

 It has two major components, i.e. ResourceManager and NodeManager.


1. The ResourceManager is again the main node in the processing department.
2. It receives the processing requests and then passes the parts of the requests to the corresponding NodeManagers, where the actual processing takes place.
3. NodeManagers are installed on every DataNode and are responsible for the execution of tasks on each DataNode.

The ResourceManager itself has two components, i.e. the Scheduler and the ApplicationsManager.
1. Scheduler: Based on your application's resource requirements, the Scheduler runs scheduling algorithms and allocates the resources.
2. ApplicationsManager: The ApplicationsManager accepts job submissions and negotiates containers (i.e. the DataNode environment where a process executes) for executing the application-specific ApplicationMaster, monitoring its progress. ApplicationMasters are daemons which reside on the DataNodes and communicate with containers for the execution of tasks on each DataNode.

 MAPREDUCE

It is the core component of processing in the Hadoop ecosystem, as it provides the logic of processing. In other words, MapReduce is a software framework which helps in writing applications that process large data sets using distributed and parallel algorithms inside the Hadoop environment.
 In a MapReduce program, Map() and Reduce() are two functions.
1. The Map function performs actions like filtering, grouping and sorting.
2. While Reduce function aggregates and summarizes the result produced by map
function.
3. The result generated by the Map function is a key value pair (K, V) which acts as
the input for Reduce function.
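
To show how these pieces are wired together, here is a minimal driver sketch that configures a job around the WordCountMapper and WordCountReducer sketched earlier in this unit; the input and output paths are taken from the command line, and this is an illustration rather than a prescribed program.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // Wires the Map() and Reduce() functions together and declares the (K, V)
        // types that flow between them.
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);      // Map phase: filtering/grouping/sorting logic
        job.setCombinerClass(WordCountReducer.class);   // optional local aggregation on each node
        job.setReducerClass(WordCountReducer.class);    // Reduce phase: aggregation/summarization
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}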

 APACHE PIG

Mr. Satish Kumar Singh Page 16


 PIG has two parts: Pig Latin, the language and the pig runtime, for the execution
environment. You can better understand it as Java and JVM.
 It supports the Pig Latin language, which has an SQL-like command structure.
 The compiler internally converts Pig Latin to MapReduce. It produces a sequential set of MapReduce jobs, and that is an abstraction (which works like a black box).
 PIG was initially developed by Yahoo.
 It gives you a platform for building data flow for ETL (Extract, Transform and Load),
processing and analyzing huge data sets.

How Pig works?


In Pig, the load command first loads the data. Then we perform various functions on it, like grouping, filtering, joining, sorting, etc. At last, you can either dump the data on the screen or store the result back in HDFS.

 APACHE HIVE

 Facebook created HIVE for people who are fluent with SQL. Thus, HIVE makes them
feel at home while working in a Hadoop Ecosystem.
 Basically, HIVE is a data warehousing component which performs reading, writing and
managing large data sets in a distributed environment using SQL-like interface.

HIVE + SQL = HQL


 The query language of Hive is called Hive Query Language (HQL), which is very similar to SQL.
 It has 2 basic components: the Hive command line and the JDBC/ODBC driver.
 The Hive command line interface is used to execute HQL commands.
 Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) are used to establish connections from data storage applications.
 Secondly, Hive is highly scalable, as it can serve both purposes, i.e. large data set processing (batch query processing) and real-time processing (interactive query processing).
 It supports all primitive data types of SQL.
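
To illustrate the JDBC route, here is a minimal Java sketch that connects to HiveServer2 and runs an HQL query. The host, port, credentials and the employees table are assumptions for this example, and the Hive JDBC driver jar must be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcDemo {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC URL; host name, port and database are assumed values.
        String url = "jdbc:hive2://hiveserver:10000/default";

        try (Connection con = DriverManager.getConnection(url, "hive", "");
             Statement stmt = con.createStatement()) {

            // HQL looks very much like SQL; this query and table are hypothetical.
            ResultSet rs = stmt.executeQuery(
                "SELECT department, COUNT(*) AS emp_count "
              + "FROM employees GROUP BY department");

            while (rs.next()) {
                System.out.println(rs.getString("department") + " : " + rs.getLong("emp_count"));
            }
        }
    }
}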

 APACHE MAHOUT

Now, let us talk about Mahout which is renowned for machine learning. Mahout provides
an environment for creating machine learning applications which are scalable.
So, What is machine learning?
Machine learning algorithms allow us to build self-learning machines that evolve by themselves without being explicitly programmed. Based on user behaviour, data patterns and past experience, they make important future decisions. You can call it a descendant of Artificial Intelligence (AI).
What does Mahout do?
It performs collaborative filtering, clustering and classification. Some people also consider frequent itemset mining to be one of Mahout's functions. Let us understand them individually:
1. Collaborative filtering: Mahout mines user behaviours, patterns and characteristics, and based on these it predicts and makes recommendations to users. The typical use case is an e-commerce website.
2. Clustering: It organizes a similar group of data together like articles can contain blogs,
news, research papers etc.



3. Classification: It means classifying and categorizing data into various sub-departments
like articles can be categorized into blogs, news, essay, research papers and other
categories.
4. Frequent itemset mining: Here Mahout checks which objects are likely to appear together and makes suggestions if one of them is missing. For example, a cell phone and its cover are generally bought together, so if you search for a cell phone it will also recommend the cover and cases.

Mahout provides a command line to invoke the various algorithms. It has a predefined library which already contains different built-in algorithms for different use cases.

 APACHE SPARK
 Apache Spark is a framework for real-time data analytics in a distributed computing environment.
 Spark is written in Scala and was originally developed at the University of California, Berkeley.
 It executes in-memory computations to increase the speed of data processing over MapReduce.
 It can be up to 100x faster than Hadoop MapReduce for large-scale data processing, by exploiting in-memory computation and other optimizations. It therefore requires more memory and processing power than MapReduce.

 APACHE HBASE

 HBase is an open source, non-relational distributed database. In other words, it is a


NoSQL database.
 It supports all types of data and that is why, it’s capable of handling anything and
everything inside a Hadoop ecosystem.
 It is modelled after Google's BigTable, which is a distributed storage system designed to cope with large data sets.
 HBase was designed to run on top of HDFS and provides BigTable-like capabilities.
 It gives us a fault-tolerant way of storing sparse data, which is common in most big data use cases.
 HBase itself is written in Java, whereas HBase applications can be accessed through REST, Avro and Thrift APIs.

 APACHE DRILL

As the name suggests, Apache Drill is used to drill into any kind of data. It is an open-source application which works in a distributed environment to analyze large data sets.
 It is an open-source counterpart of Google's Dremel.
 It supports different kinds of NoSQL databases and file systems, which is a powerful feature of Drill. For example: Azure Blob Storage, Google Cloud Storage, HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Swift, NAS and local files. So, basically, the main aim behind Apache Drill is to provide scalability so that we can process petabytes and exabytes of data efficiently (or, you could say, in minutes).
 The main power of Apache Drill lies in combining a variety of data stores with a single query.
 Apache Drill basically follows ANSI SQL.
 It has a powerful scalability factor, supporting millions of users and serving their query requests over large-scale data.



 APACHE ZOOKEEPER

 Apache Zookeeper is the coordinator of any Hadoop job which includes a combination of
various services in a Hadoop Ecosystem.
 Apache Zookeeper coordinates with various services in a distributed environment.

Before Zookeeper, it was very difficult and time consuming to coordinate between the different services in the Hadoop ecosystem. The services had many problems with interactions, such as sharing common configuration while synchronizing data. Even when the services were configured, changes in their configurations made the system complex and difficult to handle. Grouping and naming were also time-consuming.
Due to the above problems, Zookeeper was introduced. It saves a lot of time by
performing synchronization, configuration maintenance, grouping and naming.
Although it’s a simple service, it can be used to build powerful solutions.
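
As a small illustration of configuration maintenance, here is a Java sketch that stores a shared configuration value in a znode and reads it back. The ZooKeeper address, the znode path and the stored value are assumptions for this example.

import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // "zk-host:2181" is an assumed address; the watcher just waits until the session is up.
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Store a shared configuration value under a znode (created only if it does not exist).
        String path = "/demo-config";
        if (zk.exists(path, false) == null) {
            zk.create(path, "replication=3".getBytes(StandardCharsets.UTF_8),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Any service in the cluster can now read the same value from the same znode.
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data, StandardCharsets.UTF_8));

        zk.close();
    }
}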

 APACHE OOZIE

Consider Apache Oozie as a clock and alarm service inside the Hadoop ecosystem. For Hadoop jobs, Oozie acts as a scheduler: it schedules Hadoop jobs and binds them together as one logical unit of work.
There are two kinds of Oozie jobs:
1. Oozie workflow: a sequential set of actions to be executed. You can think of it as a relay race, where each athlete waits for the previous one to complete his part.
2. Oozie coordinator: Oozie jobs which are triggered when data is made available to them. Think of this as the response-stimulus system in our body: just as we respond to an external stimulus, an Oozie coordinator responds to the availability of data and rests otherwise.

 APACHE FLUME

Ingesting data is an important part of our Hadoop Ecosystem.


 Flume is a service which helps in ingesting unstructured and semi-structured data into HDFS.
 It gives us a reliable, distributed solution and helps us in collecting, aggregating and moving large amounts of data.
 It helps us to ingest online streaming data from various sources, like network traffic, social media, email messages, log files, etc., into HDFS.
Now, let us understand the architecture of Flume: a Flume agent ingests streaming data from various data sources (for example, a web server) into HDFS. Twitter is among the famous sources of streaming data.
The flume agent has 3 components: source, sink and channel.
1. Source: it accepts the data from the incoming stream and stores the data in the channel.
2. Channel: it acts as the local storage or the primary storage. A Channel is a
temporary storage between the source of data and persistent data in the HDFS.
3. Sink: Then, our last component i.e. Sink, collects the data from the channel and
commits or writes the data in the HDFS permanently.

 APACHE SQOOP

Now, let us talk about another data ingesting service i.e. Sqoop. The major difference
between Flume and Sqoop is that:
 Flume only ingests unstructured data or semi-structured data into HDFS.
 While Sqoop can both import structured data from an RDBMS or enterprise data warehouse into HDFS, and export it back again.
When we submit a Sqoop import command, the main task gets divided into sub-tasks which are handled internally by individual map tasks. Each map task imports a part of the data into the Hadoop ecosystem; collectively, all the map tasks import the whole data.
Export works similarly: when we submit an export job, it is mapped into map tasks which bring chunks of data from HDFS. These chunks are exported to a structured data destination, and by combining all the exported chunks we receive the whole data at the destination, which in most cases is an RDBMS (MySQL/Oracle/SQL Server).

 APACHE SOLR & LUCENE


Apache Solr and Apache Lucene are the two services which are used for searching and
indexing in Hadoop Ecosystem.
 Apache Lucene is Java-based and also helps in spell checking.
 If Apache Lucene is the engine, Apache Solr is the car built around it. Solr is a complete application built around Lucene.
 Solr uses the Lucene Java search library as its core for search and full-text indexing.
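
As a small illustration of indexing and searching with the Lucene Java library (assuming a reasonably recent Lucene version, e.g. 8.x; the field name and document text are invented for the example):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();   // an in-memory index, just for illustration

        // Index one document with a full-text "content" field.
        try (IndexWriter writer =
                 new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new TextField("content",
                    "hadoop distributed file system stores big data", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Search the index for the term "hadoop" and print the stored content of each hit.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(new TermQuery(new Term("content", "hadoop")), 10);
            for (ScoreDoc hit : hits.scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("content"));
            }
        }
    }
}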

 APACHE AMBARI
Ambari is an Apache Software Foundation project which aims at making the Hadoop ecosystem more manageable. It includes software for provisioning, managing and monitoring Apache Hadoop clusters.
Ambari provides:
1. Hadoop cluster provisioning:
 It gives us a step-by-step process for installing Hadoop services across a number of hosts.
 It also handles configuration of Hadoop services over a cluster.
2. Hadoop cluster management:
 It provides a central management service for starting, stopping and re-configuring
Hadoop services across the cluster.
3. Hadoop cluster monitoring:
 For monitoring health and status, Ambari provides us a dashboard.
 The Ambari Alert framework is an alerting service which notifies the user whenever attention is needed, for example if a node goes down or a node is running low on disk space.
