
Table of Contents:

Introduction to Big Data--------------------------------------------------------------------------------------------- 3


Hadoop Introduction------------------------------------------------------------------------------8
Hadoop Ecosystem-------------------------------------------------------------------------------63
Starting HDFS------------------------------------------------------------------------------------70
Installation of Hadoop---------------------------------------------------------------------------72
MapReduce in Hadoop---------------------------------------------------------------------------78
YARN in Hadoop---------------------------------------------------------------------------------80
Pig in Hadoop------------------------------------------------------------------------------------83
Hadoop Hive-------------------------------------------------------------------------------------97
Hadoop Streaming-------------------------------------------------------------------------------------------------- 104
Sqoop-------------------------------------------------------------------------------------------113
Impala------------------------------------------------------------------------------------------ 116
Oozie in Hadoop-------------------------------------------------------------------------------116
Apache Flume in Hadoop----------------------------------------------------------------------118
Zookeeper--------------------------------------------------------------------------------------120
Hue--------------------------------------------------------------------------------------------- 120
Kafka overview---------------------------------------------------------------------------------121
Apache Atlas Overview-------------------------------------------------------------------------123
Spark vs MapReduce------------------------------------------------------------------------------------------------ 127
How Does Spark Have an Edge over MapReduce----------------------------------------------128
Hadoop NoSQL-------------------------------------------------------------------------------------------------------- 132
Apache HBase----------------------------------------------------------------------------------134
Apache Cassandra------------------------------------------------------------------------------141
MongoDB---------------------------------------------------------------------------------------164
Hadoop Security Overview--------------------------------------------------------------------------------------- 181
Hadoop Security Features----------------------------------------------------------------------183
Security Administration------------------------------------------------------------------------183
Top 5 Big Data Vendors------------------------------------------------------------------------------------------- 189
Cloudera & Hortonworks-----------------------------------------------------------------------189
Amazon Web Services Elastic MapReduce Hadoop Distribution-------------------------------199
Microsoft Hadoop Distribution-----------------------------------------------------------------203

HPE Ezmeral Data Fabric (formerly MapR Data Platform)------------------------------------225
IBM InfoSphere Insights------------------------------------------------------------------------232
8 Applications of Big Data in Real Life----------------------------------------------------------------------238
Big Data in Education Industry-----------------------------------------------------------------238
Big Data in Healthcare Industry----------------------------------------------------------------239
Big Data in Government Sector----------------------------------------------------------------239
Big Data in Media and Entertainment Industry------------------------------------------------240
Big Data in Weather Patterns------------------------------------------------------------------241
Big Data in Transportation Industry------------------------------------------------------------241
Big Data in Banking Sector---------------------------------------------------------------------242
Big Data in Transforming Real Estate----------------------------------------------------------242
Kick-start Your Career in Big Data and Hadoop--------------------------------------------------------245
Career prospects upon completion of Hadoop Certification-----------------------------------246
Most Valuable Data Science Skills Of 2020----------------------------------------------------------------248
Non-Technical Expertise------------------------------------------------------------------------249
Big Data Job Responsibilities and Skills--------------------------------------------------------------------250
Hadoop Developer Roles and Responsibilities-------------------------------------------------250
Hadoop Architect Roles and Responsibilities---------------------------------------------------251
Hadoop Administrator Roles and Responsibilities---------------------------------------------251
Hadoop Tester Roles and Responsibilities------------------------------------------------------252
Data Scientist Roles and Responsibilities-------------------------------------------------------252
Big Data Terminologies You Must Know-------------------------------------------------------------------254
Key Takeaways-------------------------------------------------------------------------------------------------------- 259

Introduction to Big Data

Big Data is a term that is used for denoting the collection of datasets that are large and
complex, making it very difficult to process using legacy data processing applications.

Around 2005, people began to realize just how much data users generated through
Facebook, YouTube, and other online services. Hadoop (an open-source framework created
specifically to store and analyze big data sets) was developed that same year. NoSQL also
began to gain popularity during this time.

The development of open-source frameworks, such as Hadoop (and more recently, Spark)
was essential for the growth of big data because they make big data easier to work with and
cheaper to store. In the years since then, the volume of big data has skyrocketed. Users are
still generating huge amounts of data but it’s not just humans who are doing it.

With the advent of the Internet of Things (IoT), more objects and devices are connected to
the internet, gathering data on customer usage patterns and product performance. The
emergence of machine learning has produced still more data.

While big data has come far, its usefulness is only just beginning. Cloud computing has
expanded big data possibilities even further. The cloud offers truly elastic scalability, where
developers can simply spin up ad hoc clusters to test a subset of data.

So, basically, our legacy or traditional systems can't process a large amount of data in one go. But how do you classify data as problematic and hard to process? To understand 'What is Big Data?', we need to be able to categorize this data. Let's see how.

Categorizing Data as Big Data

We have five Vs:


1. Volume: This refers to data that is tremendously large. As you can see from the image,
the volume of data has been rising exponentially. In 2016, the data created was only 8 ZB,
and by 2020 it had grown to 40 ZB, which is extremely large.
2. Variety: A reason for this rapid growth of data volume is that the data is coming from
different sources in various formats. The data is categorized as follows:
a) Structured Data: Here, data is present in a structured schema, along with all the
required columns. It is in a structured or tabular format. Data that is stored in a
relational database management system is an example of structured data. For
example, in the below-given employee table, which is present in a database, the data
is in a structured format.

Emp ID   Emp Name   Gender   Department   Salary
2383     ABC        Male     Finance      6,50,000
4623     XYZ        Male     Admin        50,00,000

b) Semi-structured Data: In this form of data, the schema is not strictly defined,
i.e., it carries elements of both structured and unstructured data. So, basically, semi-structured
data has some structure, but it is not enforced by a fixed schema, e.g., JSON, XML, CSV, TSV,
and email. Web application data such as transaction history files and log files, in contrast, is
unstructured, while OLTP (Online Transaction Processing) systems are built to work with
structured data that is stored in relations, i.e., tables.
c) Unstructured Data: This data format includes all the unstructured files such as
video files, log files, audio files, and image files. Any data which has an
unfamiliar model or structure is categorized as unstructured data. Since its size is
large, unstructured data poses various challenges in terms of processing for
deriving value out of it. An example of this is a complex data source that contains a
blend of text files, videos, and images. Several organizations have a lot of data
available with them, but these organizations don't know how to derive value out of it
since the data is in its raw form.
d) Quasi-structured Data: This data format consists of textual data with inconsistent
data formats that can be formatted with effort and time, and with the help of several
tools. For example, web server logs, i.e., a log file that is automatically created and
maintained by some server which contains a list of activities.
3. Velocity: The speed of data accumulation also plays a role in determining whether the
data is categorized into big data or normal data.

As can be seen from the image below, at first, mainframes were used and fewer people
used computers. Then came the client/server model, and more and more computers came into
use. After this, web applications came into the picture and spread across the Internet, and
everyone began using these applications. These applications were then accessed from more and
more devices, such as mobiles, as they were very easy to reach. Hence, a lot of data!

4. Value: How do we extract value from the data? This is where our fourth V comes in: it deals
with the mechanism of drawing the correct meaning out of data. First of all, you need to mine the
data, i.e., turn raw data into useful data. Then, an analysis is done on the data
that you have cleaned or retrieved out of the raw data. Finally, you need to make sure that
whatever analysis you have done benefits your business, such as finding insights and
results that were not possible earlier.

You need to make sure that whatever raw data you are given, you have cleaned it so that it can
be used for deriving business insights. After you have cleaned the data, a challenge pops up:
during the process of dumping a huge amount of data, some packets of data might get lost. To
resolve this issue, our next V comes into the picture.

5. Veracity: Since packets of data can get lost during execution, we need to start again from
the stage of mining raw data in order to convert it into valuable data, and this process
goes on. Also, there will be uncertainties and inconsistencies in the data. To overcome this,
our last V comes into place, i.e., Veracity. Veracity means the trustworthiness and quality of
data. It is necessary that the veracity of the data is maintained. For example, think about
Facebook posts, with hashtags, abbreviations, images, videos, etc., which make them
unreliable and hamper the quality of their content. Collecting loads and loads of data is of
no use if the quality and trustworthiness of the data is not up to the mark.

Now that you have a fair idea of what Big Data is, let's check out the major sectors using
Big Data on an everyday basis.

Major Sectors Using Big Data Every Day

Banking

Since there is a massive amount of data gushing in from innumerable sources, banks
need to find uncommon and unconventional ways to manage big data. It is also
essential to examine customer requirements, render services according to their
specifications, and reduce risks while sustaining regulatory compliance. Financial institutions
therefore turn to Big Data Analytics to solve this problem.
 NYSE (New York Stock Exchange): NYSE generates about one terabyte of new
trade data every single day. So imagine, if one terabyte of data is generated every day,
in a whole year how much data there would be to process. This is what Big Data is used
for.

Government

Government agencies utilize Big Data for running agencies, managing
utilities, dealing with traffic jams, and limiting the effects of crime. However, apart from the
benefits of Big Data, the government also has to address concerns of transparency and
privacy.
 Aadhar Card: The Indian government keeps a record of all 1.21 billion citizens. This
huge dataset is stored and analyzed to find out several things, such as the number of youth
in the country, based on which several schemes are designed to target the maximum
population. All this big data can't be stored in a traditional database, so it is
stored and analyzed using several Big Data Analytics tools.

Education

Big Data produces a vital impact on students, school systems, and curriculums in the
education sector. By interpreting big data, people can ensure students' growth, identify at-risk
students, and achieve an improved system for the evaluation and assistance of principals
and teachers.
 Example: The education sector holds a lot of information with regard to curriculum,
students, and faculty. This information is analyzed to get insights that can enhance the
operational adequacy of educational organizations. Collecting and analyzing information
about a student, such as attendance, test scores, grades, and other issues, takes
up a lot of data. So, big data makes way for a progressive framework wherein
this data can be stored and analyzed, making it easier for the institutes to work with.

Big Data in Healthcare

When it comes to what Big Data is in healthcare, we can see that it is being used
enormously. It includes collecting data, analyzing it, and leveraging it for customers. Also,
patients' clinical data is too complex to be processed or understood by traditional systems. Since
big data is processed by Machine Learning algorithms and Data Scientists, tackling such
huge data becomes manageable.
 Example: Nowadays, doctors rely mostly on patients’ clinical records, which means
that a lot of data needs to be gathered, that too for different patients. Obviously, it is
not possible for old or traditional data storage methods to store this data. Since there is
a large amount of data coming from different sources, in various formats, the need to
handle this large amount of data is increased, and that is why the Big Data approach is
needed.

E-commerce

Maintaining customer relationships is most important in the e-commerce industry.
E-commerce websites have different marketing strategies to retail their merchandise to their
customers, to manage transactions, and to implement better tactics using innovative ideas
with Big Data to improve their businesses.
 Flipkart: Flipkart is a huge e-commerce website dealing with lots of traffic on a daily
basis. But, when there is a pre-announced sale on Flipkart, traffic grows exponentially and
can actually crash the website. So, to handle this kind of traffic and data, Flipkart uses
Big Data. Big Data can actually help in organizing and analyzing the data for further use.

Social Media

Social media in the current scenario is considered as the largest data generator. The stats
have shown that around 500+ terabytes of new data get generated into the databases of
social media every day, particularly in the case of Facebook. The data generated mainly
consist of videos, photos, message exchanges, etc. A single activity on any social media site
generates a lot of data which is again stored and gets processed whenever required. Since
the data stored is in terabytes, it would take a lot of time for processing if it is done by our
legacy systems. Big Data is a solution to this problem.

Why is Big Data so important?

Although big data may not immediately kill your business, neglecting it for a long period
won’t be a solution. The impact of big data on your business should be measured to make it
easy to determine a return on investment. Hence, big data is a problem definitely worth
looking into.

Whenever you visit a website, you might have noticed that on the right panel or top panel or
somewhere on the screen, you will find a recommendation field which is basically an
advertisement that is related to your preferences. How does the advertisement company
know that you would be interested in it?

Well, everything you surf on the Internet is stored, and all this data is analyzed properly so
that whatever you are surfing for, or whatever you are interested in, comes up. Obviously, you
will be interested in that particular advertisement and will keep surfing further. But, mind you!
The amount of data generated by a single user is so huge that it is considered big data.

Have you ever noticed when you go to YouTube, YouTube knows what kind of videos you
would like to watch and what you must be looking for next? Similarly, Amazon shows you
the type of products you must be looking to buy. Even if you have searched for a pair
of earphones just once, you will keep on getting recommendations for earphones, again and again,
and that too on different websites.
How does this happen? It happens because of Big Data Analytics.

What is Big Data Analytics?

Big Data Analytics examines large and different types of data in order to uncover the hidden
patterns, insights, and correlations. Basically, Big Data Analytics is helping large companies
facilitate their growth and development. And it majorly includes applying various data
mining algorithms on a certain dataset.

How is Big Data Analytics used today?

Big Data Analytics is used in a number of industries to allow organizations and companies to
make better decisions, as well as verify and disprove existing theories or models. The focus
of Data Analytics lies in inference, which is the process of deriving conclusions that are solely
based on what the researcher already knows.

Benefits of Big Data Analytics

Big Data Analytics is indeed a revolution in the field of Information Technology. The use of
Data Analytics by various companies is increasing every year, and their primary focus is on
their customers. Hence, the field is flourishing in Business-to-Consumer (B2C) applications.

Hadoop Introduction

Hadoop is the most important framework for working with Big Data. Hadoop's biggest
strength is scalability: it upgrades from working on a single node to thousands of nodes
seamlessly, without any issue.

Big Data spans many different domains: the data we need to manage comes from videos,
text, transactional data, sensor information, statistical data, social media
conversations, search engine queries, e-commerce data, financial information, weather data,
news updates, forum discussions, executive reports, and so on.
Doug Cutting and his team developed an open-source project known as Hadoop, which
allows you to handle very large amounts of data. Hadoop runs
applications on the basis of MapReduce, where the data is processed in parallel and
the entire statistical analysis is accomplished on a large amount of data.

It is a framework based on Java programming. It is intended to scale up from a
single server to thousands of machines, each offering local computation and storage, and it
supports large collections of datasets in a distributed computing environment.

The Apache Hadoop software library is a framework that allows the distributed processing of
huge datasets across clusters of computers using simple programming models.

History of Hadoop

Apache Hadoop was born to enhance the usage of big data and solve its major issues. The
web media was generating loads of information on a daily basis, and it was becoming very
difficult to manage the data of around one billion pages of content. To address this,
Google invented a new methodology of processing data popularly known as MapReduce.
A year later, Google published a white paper describing the MapReduce framework. Doug
Cutting and Mike Cafarella, inspired by this white paper, created Hadoop to apply
these concepts to an open-source software framework which supported the Nutch search
engine project. Compared with the original case study, Hadoop was designed with much
simpler storage infrastructure facilities.

Hadoop was created by Doug Cutting, who is also the creator of Apache Lucene, the
widely used text-search library. Hadoop has its origins in Apache Nutch, an open-source
web search engine that is itself a part of the Lucene project.

Why Apache Hadoop?

Most database management systems are not up to the mark for operating at such lofty
levels of Big Data requirements either due to the sheer technical inefficiency or the
insurmountable financial challenges posed. When the type of data is unstructured, the
volume of data is huge, and the results needed are at uncompromisable speeds, then the
only platform that can effectively stand up to the challenge is Apache Hadoop.

Hadoop owes its runaway success to a processing framework, MapReduce, that is central to
its existence. MapReduce technology lets ordinary programmers contribute their part where
large datasets are divided and are independently processed in parallel. These coders need
not know the nuances of high-performance computing. With MapReduce, they can work
efficiently without having to worry about intra-cluster complexities, monitoring of tasks,
node failure management, and so on.

How did Big Data help in driving Walmart’s performance?

Walmart, one of the Big Data companies, is currently the biggest retailer in the world with
maximum revenue. Consisting of 2 million employees and 20,000 stores, Walmart is building
its own private cloud in order to incorporate 2.5 petabytes of data every hour.  

Walmart has been collecting data on the products that have the maximum sales in a particular
season or because of some specific reason. For example, if people are buying candies along with
costumes during the Halloween season, you'd see a lot of candies and costumes all
around Walmart, but only during the Halloween season. Walmart does this based on the Big Data
Analytics it had performed on the previous years' Halloween seasons.

Again, when Hurricane Sandy hit the US in 2012, Walmart analyzed the data it
had collected from such previous instances and found that people generally buy
emergency equipment and strawberry pop-tarts when a warning for an approaching
hurricane is declared. So, this time too, Walmart quickly filled its racks with the emergency
equipment people would require during the hurricane in the red-alert areas. These products
sold very quickly, and Walmart gained a lot of profit.

Hadoop Architecture

Apache Hadoop was developed with the goal of having an inexpensive, redundant data store
that would enable organizations to leverage Big Data Analytics economically and increase
the profitability of the business.
A Hadoop architectural design needs to have several design factors in terms of networking,
computing power, and storage. Hadoop provides a reliable, scalable, flexible, and distributed
computing Big Data framework.

Hadoop follows a master–slave architecture for storing data and data processing. This
master–slave architecture has master nodes and slave nodes. Let’s first look at each
terminology before we start with understanding the architecture:
1. NameNode: NameNode is basically a master node that acts like a monitor
and supervises operations performed by DataNodes.
2. Secondary NameNode: A Secondary NameNode plays a vital role in case if
there is some technical issue in the NameNode.
3. DataNode: DataNode is the slave node that stores all files and processes.
4. Mapper: Mapper maps data or files in the DataNodes. It will go to every
DataNode and run a particular set of codes or operations in order to get the work
done.
5. Reducer: While a Mapper runs a code, Reducer is required for getting the
result from each Mapper.
6. JobTracker: JobTracker is a master node used for getting the location of a file
in different DataNodes. It is a very important service in Hadoop as if it goes down,
all the running jobs will get halted.
7. TaskTracker: TaskTracker is a reference for the JobTracker present in the
DataNodes. It accepts different tasks, such as map, reduce, and shuffle operations,
from the JobTracker. It is a key player performing the main MapReduce functions.

8. Block: Block is a small unit wherein the files are split. It has a default size of 64
MB and can be increased as needed.
9. Cluster: Cluster is a set of machines such as DataNodes, NameNodes,
Secondary NameNodes, etc.

There are two layers in the Hadoop architecture. First, we will see how data is stored in
Hadoop, and then we will move on to how it is processed. When talking about the storage of
files in Hadoop, HDFS comes into the picture.

Hadoop Distributed File System (HDFS)

HDFS is based on Google File System (GFS) that provides a distributed system particularly
designed to run on commodity hardware. The file system has several similarities with the
existing distributed file systems. However, HDFS does stand out among all of them. This is
because it is fault-tolerant and is specifically designed for deploying on low-cost hardware.

HDFS is mainly responsible for taking care of the storage part of Hadoop applications. So, if
you have a 100 MB file that needs to be stored in the file system, then in HDFS this file will
be split into chunks, called blocks. The default size of each block is 64 MB in Hadoop 1 and
128 MB in Hadoop 2. For example, in Hadoop version 1, a 100 MB file will be divided into two
blocks: 64 MB stored in one block and 36 MB in another. Also, each block is given a unique
name, i.e., blk_n (n = any number). Each block is uploaded to one DataNode in the cluster. On
each of the machines in the cluster, there is something called a daemon, a piece of software
that runs in the background.

The daemons of HDFS are as follows:

NameNode: It is the master node that maintains or manages all data. It points to DataNodes
and retrieves data from them. The file system metadata is stored on the NameNode.
Secondary NameNode: It is the master node and is responsible for keeping the checkpoints
of the file system metadata that is present on the NameNode.
DataNode: DataNodes have the application data that is stored on the servers. It is the slave
node that basically has all the data of the files in the form of blocks.
As we know, HDFS stores the application data and the file system metadata separately on dedicated servers.
The file content is replicated by HDFS on various DataNodes, based on the replication factor,
to assure the reliability of the data. The DataNode and the NameNode communicate with
each other using TCP protocols.
The following prerequisites are required to be satisfied by HDFS for the Hadoop architecture
to perform efficiently:
o There must be good network speed in order to manage data transfer.
o Hard drives should have a high throughput.

MapReduce Layer

MapReduce is a patented software framework introduced by Google to support distributed
computing on large datasets on clusters of computers.

It is basically an operative programming model that runs in the Hadoop background
providing simplicity, scalability, recovery, and speed, including easy solutions for data
processing. This MapReduce framework is proficient in processing a tremendous amount of
data parallelly on large clusters of computational nodes.

MapReduce is a programming model that allows you to process your data across an entire
cluster. It basically consists of Mappers and Reducers that are different scripts you write or
different functions you might use when writing a MapReduce program. Mappers have the
ability to transform your data in parallel across your computing cluster in a very efficient
manner; whereas, Reducers are responsible for aggregating your data together.

Mappers and Reducers put together can be used to solve complex problems.
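If you want to see Mappers and Reducers in action without writing any code, you can try the word-count example that ships with Hadoop. This is only a sketch, assuming Hadoop 2.7.3 is unpacked under /home/intell/hadoop as in the installation steps later in this tutorial, and assuming a hypothetical input directory /user/intell/input already exists in HDFS:

# The Mappers emit (word, 1) pairs in parallel; the Reducers sum the counts per word
hadoop jar /home/intell/hadoop/hadoop-2.7.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar \
  wordcount /user/intell/input /user/intell/output

# Inspect the aggregated output written by the Reducers
hadoop fs -cat /user/intell/output/part-r-00000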

Working of the MapReduce Architecture

The job of MapReduce starts when a client submits a file. The file first goes to the JobTracker,
along with the Map and Reduce functions and the locations of the input and output data. When a
file is received, the JobTracker sends a request to the NameNode, which holds the locations of
the DataNodes storing the file. The NameNode sends those locations back to the JobTracker.
Next, the JobTracker sends a request to the selected TaskTrackers present on those DataNodes.
Next, the processing of the map phase begins. In this phase, the TaskTracker retrieves all the
input data. For each record, which has been parsed by the 'InputFormat', a map function is
invoked, producing key–value pairs in a memory buffer.
The memory buffer is then sorted and partitioned for the different reducer nodes, optionally
invoking a function called the combiner. When the map task is completed, the JobTracker gets a
notification from the TaskTracker. Once all the TaskTrackers notify the
JobTracker, the JobTracker tells the selected TaskTrackers to begin the reduce phase.
The TaskTracker's work now is to read the region files and sort the key–value pairs for each
and every key. Lastly, the reduce function is invoked, which collects the combined values into
an output file.

How does Hadoop work?

Hadoop runs code across a cluster of computers and performs the following tasks:
 Data is initially divided into files and directories. Files are then divided into
consistently sized blocks: 128 MB in Hadoop 2 and 64 MB in Hadoop 1.
 Then, the files are distributed across various cluster nodes for further processing of
data.
 The JobTracker starts its scheduling programs on individual nodes.
 Once all the nodes are done with scheduling, the output is returned.

Data from HDFS is consumed through MapReduce applications. HDFS is also responsible for
multiple replicas of data blocks that are created along with the distribution of nodes in a
cluster, which enables reliable and extremely quick computations.

So, in the first step, the file is divided into blocks and is stored in different DataNodes. If a
job request is generated, it is directed to the JobTracker.
The JobTracker doesn't really know the location of the file, so it contacts the
NameNode for this.

The NameNode will now find the location and give it to the JobTracker for further
processing. Now, since the JobTracker knows the location of the blocks of the requested file,
it will contact the TaskTracker present on a particular DataNode for the data file. The
TaskTracker will now send the data it has to the JobTracker.
Finally, the JobTracker will collect the data and send it back to the requested source.

How does Yahoo! use Hadoop Architecture?

In Yahoo!, there are 36 different Hadoop clusters that are spread across Apache HBase,
Storm, and YARN, i.e., there are 60,000 servers in total made from 100s of distinct hardware
configurations. Yahoo! runs the largest multi-tenant Hadoop installation in the world. There
are approximately 850,000 Hadoop jobs daily, which are run by Yahoo!.

The cost of storing and processing data using Hadoop is the best way to determine whether
Hadoop is the right choice for your company. When comparing on the basis of the expense
for managing data, Hadoop is much cheaper than any legacy systems.

Hadoop Installation

Hadoop is basically supported by the Linux platform and its facilities. If you are working on
Windows, you can use Cloudera VMware that has preinstalled Hadoop, or you can use
Oracle VirtualBox or the VMware Workstation. In this tutorial, I will be demonstrating the
installation process for Hadoop using the VMware Workstation 12. You can use any of the
above to perform the installation. I will do this by installing CentOS on my VMware.

Prerequisites

 VirtualBox/VMware/Cloudera: Any of these can be used for installing the
operating system on.
 Operating System: You can install Hadoop on Linux-based operating systems.
Ubuntu and CentOS are very commonly used among them. In this tutorial, we are using
CentOS.
 Java: You need to install the Java 8 package on your system.
 Hadoop: You require the Hadoop 2.7.3 package.

Hadoop Installation on Windows

Note: If you are working on Linux, then skip to Step 9.

Step 1: Installing VMware Workstation


 Download VMware Workstation
 Once downloaded, open the .exe file and set the location as required

 Follow the required steps of installation

Step 2: Installing CentOS


 Install CentOS
 Save the file in any desired location

Step 3: Setting up CentOS in VMware 12

When you open VMware, the following window pops up:

Click on Create a New Virtual Machine

1. As seen in the screenshot above, browse the location of your CentOS file you
downloaded. Note that it should be a disc image file
2. Click on Next

1. Choose the name of your machine. Here, I have given the name as CentOS 64-bit
2. Then, click Next

1. Specify the disk capacity. Here, I have specified it to be 20 GB
2. Click Next

 Click on Finish

 After this, you should be able to see a window as shown below. This screen indicates
that you are booting the system and getting it ready for installation. You will be given
60 seconds to change the selection from Install CentOS to another option; if Install CentOS
is the option you want, simply wait for the 60 seconds to elapse.

Note: In the image above, you can see three options, such as, I Finished Installing, Change
Disc, and Help. You don’t need to touch any of these until your CentOS is successfully
installed.

 At the moment, your system is being checked and is getting ready for installation

 Once the checking percentage reaches 100%, you will be taken to a screen as shown
below:

Step 4: Here, you can choose your language. The default language is English, and that is
what I have selected

1. If you want any other language to be selected, specify it
2. Click on Continue
Step 5: Setting up the Installation Processes
 From Step 4, you will be directed to a window with various options as shown below:

 First, to select the software type, click on the SOFTWARE SELECTION option

o Now, you will see the following window:
1. Select the Server with GUI option to give your server a graphical appeal
2. Click on Done
 After clicking on Done, you will be taken to the main menu where you had previously
selected SOFTWARE SELECTION
 Next, you need to click on INSTALLATION DESTINATION

 On clicking this, you will see the following window:

1. Under Other Storage Options, select I would like to make additional space available
2. Then, select the radio button that says I will configure partitioning
3. Then, click on Done

o Next, you’ll be taken to another window as shown below:

1. Select the partition scheme here as Standard Partition


2. Now, you need to add three mount points here. For doing that, click on ‘+’

a) Select the Mount Point /boot as shown above
b) Next, select the Desired Capacity as 500 MiB as shown below:

c) Click on Add mount point
d) Again, click on ‘+’ to add another Mount Point

e) This time, select the Mount Point as swap and Desired Capacity as 2 GiB

f) Click on Add Mount Point
g) Now, to add the last Mount Point, click on + again

 h) Add another Mount Point ‘/’ and click on Add Mount Point

i) Click on Done, and you will see the following window:

This is just to make you aware of all the changes you had made in the partition of your drive
o Now, click on Accept Changes if you’re sure about the partitions you have
made
o Next, select NETWORK & HOST NAME

o You’ll be taken to a window as shown below:

1. Set the Ethernet settings as ON
2. Change the HOST name if required
3. Apply the settings
4. Finally, click on Done

o Next, click on Begin Installation

Step 6: Configuration
 Once you complete Step 5, you will see the following window where the final
installation process will be completed.
 But before that, you need to set the ROOT PASSWORD and create a user

 Click on ROOT PASSWORD, which will direct you to the following window:

1. Enter your root password here
2. Confirm the password
3. Click on Done

o Now, click on USER CREATION, and you will be directed to the following
window:

1. Enter your Full name. Here, I have entered Intell
2. Next, enter your User name; here, intell (This generally comes up automatically)
3. You can either make this password-based or make this a user administrator
4. Enter the password
5. Confirm your password
6. Finally, click on Done

 You’ll see the Reboot button, as seen below, when your installation is done, which
takes up to 20–30 minutes

 In the next screen, you will see the installation process in progress

 Wait until a window pops up to accept your license info
Step 7: Setting up the License Information
 Accept the License Information

Step 8: Logging into CentOS

 You will see the login screen as below:

Enter the user ID and password you had set up in Step 6
Your CentOS installation is now complete!

Now, you need to start working on CentOS, and not on your local operating system. If you
have jumped to this step because you are already working on Linux/Ubuntu, then continue
with the following steps.

Step 9: Downloading and Installing Java 8


 Save this file in your home directory
  Extract the Java tar file using the following command:

tar -xvf jdk-8u101-linux-i586.tar.gz
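Once the archive is extracted, you can quickly confirm that the JDK works. This assumes the tar file unpacked into a directory named jdk1.8.0_101 in your home directory (the exact directory name may differ on your system):

# Print the version reported by the freshly extracted JDK
~/jdk1.8.0_101/bin/java -version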


Step 10: Downloading and Installing Hadoop
o Download a stable release packed as a zipped file and unpack it somewhere
on your file system

 Extract the Hadoop file using the following command on the terminal:

tar -xvf hadoop-2.7.3.tar.gz


 You will be directed to the following window:

Step 11: Moving Hadoop to a Location

 Use the following command to move your file to a particular location, here hadoop:

mv hadoop-2.7.3 /home/intell/hadoop
Note: The location of the file you want to move may differ. For demonstration
purposes, I have used this location, and this will be the same throughout this tutorial.
You can change it according to your choice.
 Here, home will remain the same.
 intell is the user name I have used. You can change it according to your user
name.
 hadoop is the location where I want to save this file. You can change it as well if you
want.

mv hadoop-2.7.3 /home/intell/hadoop
Step 12: Editing and Setting up Hadoop
First, you need to set the path in the ~/.bashrc file. You can set the path from the root user by
editing the ~/.bashrc file. Before you edit ~/.bashrc, you need to check your Java configurations.
Enter the command:
update-alternatives --config java

You will now see all the Java versions available in the machine. Here, since I have only one
version of Java which is the latest one, it is shown below:

You can have multiple versions as well.


 Next, you need to select the version you want to work on. As you can see, there is a
highlighted path in the above screenshot. Copy this path and place it in a gedit file. This
path is just for being used in the upcoming steps
 Enter the selection number you have chosen. Here, I have chosen the number 1
 Now, open ~/.bashrc with the vi editor (the screen-oriented text editor in Linux)

Note: You have to become a root user first to be able to edit ~/.bashrc.


 Enter the command: su
 You will be prompted for the password. Enter your root password

 When you get logged into your root user, enter the command: vi ~/.bashrc

 The above command takes you to the vi editor, and you should be able to see the
following screen:

 To access this, press Insert on your keyboard, and then start writing the following set
of lines for setting the Java and Hadoop paths (add them right after the closing fi of the
existing block in ~/.bashrc):
#HADOOP VARIABLES START
export JAVA_HOME= (path you copied in the previous step)
export HADOOP_HOME=/home/(your username)/hadoop
export HADOOP_INSTALL=$HADOOP_HOME   # so the variables below resolve
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END

After writing these lines, press Esc on your keyboard and type the command :wq!
This will save the file and exit the vi editor. The path has now been set, as can be
seen in the image below:
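To make the new variables take effect in your current shell and verify them, you can run a quick check like the following (assuming the entries above were typed correctly):

# Reload ~/.bashrc and confirm the variables resolve
source ~/.bashrc
echo $HADOOP_HOME
hadoop version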

Step 13: Adding Configuration Files
 Open hadoop-env.sh with the vi editor using the following command:

vi /home/intell/hadoop/etc/hadoop/hadoop-env.sh

 Replace this path with the Java path to tell Hadoop which path to use. You will see
the following window coming up:

 Change the JAVA_HOME variable to the path you had copied in the previous step

Step 14:
Now, there are several XML files that need to be edited, and you need to set the
property and the path for them.

 Editing core-site.xml
o Use the same command as in the previous step and just change the last part
to core-site.xml as given below:

vi /home/intell/hadoop/etc/hadoop/core-site.xml
Next, you will see the following window:

o Enter the following code in between the configuration tags as below:

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://(your localhost):9000</value>
</property>
</configuration>

o Now, exit from this window by entering the command :wq!


 Editing yarn-site.xml
o Enter the command:

vi /home/intell/hadoop/etc/hadoop/yarn-site.xml

You will see the following window:

o Enter the code in between the configuration tags as shown below:

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>

o Exit from this window by pressing Esc and then writing the command :wq!
 Editing mapred-site.xml
o Copy or rename the file mapred-site.xml.template to mapred-site.xml.
Note: If you go to the following path, you will see that there is no file
named mapred-site.xml:
Home > intell > hadoop > hadoop-2.7.3 > etc > hadoop
So, we will copy the contents of mapred-site.xml.template to mapred-site.xml.
o Use the following command to copy the contents:

cp /home/intell/hadoop/hadoop-2.7.3/etc/hadoop/mapred-site.xml.template
/home/intell/hadoop/hadoop-2.7.3/etc/hadoop/mapred-site.xml
Once the contents have been copied to a new file named mapred-site.xml, you can
verify it by going to the following path:
Home > intell > hadoop > hadoop-2.7.3 > etc > hadoop
o Now, use the following command to add configurations:

vi /home/intell/hadoop/etc/hadoop/mapred-site.xml
o In the new window, enter the following code in between the configuration
tags as below:

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

o Exit using Esc and the command :wq!


 Editing hdfs-site.xml

Before editing hdfs-site.xml, two directories have to be created, which will contain the
namenode and the datanode.
o Enter the following code for creating a directory, namenode:

mkdir -p /home/intell/hadoop_store/hdfs/namenode

Note: Here, mkdir creates a new directory, and the -p flag also creates any missing parent directories.
o Similarly, to create the datanode directory, enter the following command:

mkdir -p /home/intell/hadoop_store/hdfs/datanode
 Now, go to the following path to check both the directories:
Home > intell > hadoop_store > hdfs
You can find both directories in the specified path as in the images below:
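If you prefer the terminal to the file browser, a simple listing (using the same /home/intell/hadoop_store path as above) also confirms that both directories exist:

# Recursively list the storage directories created for the NameNode and DataNode
ls -R /home/intell/hadoop_store/hdfs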

o Now, to configure hdfs-site.xml, use the following command:

vi /home/intell/hadoop/etc/hadoop/hdfs-site.xml 

o Enter the following code in between the configuration tags:

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/intell/hadoop_store/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/intell/hadoop_store/hdfs/datanode</value>
</property>
</configuration>

o Exit using Esc and the command :wq!

All your configurations are done. And Hadoop Installation is done now!

Step 15: Checking Hadoop

You will now need to check whether the Hadoop installation is successfully done on your
system or not.
 Go to the location where you had extracted the Hadoop tar file, right-click on
the bin, and open it in the terminal
 Now, write the command, ls
Next, if you see a window as below, then it means that Hadoop is successfully installed!
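Before running any jobs, you will also need to format the NameNode once and start the daemons. The following is only a minimal sketch, assuming the configuration above is in place and the Hadoop bin and sbin directories are on your PATH:

# Format the NameNode (first time only; this wipes existing HDFS metadata)
hdfs namenode -format

# Start the HDFS and YARN daemons
start-dfs.sh
start-yarn.sh

# jps should now list NameNode, DataNode, SecondaryNameNode,
# ResourceManager, and NodeManager
jps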

The Hadoop High-level Architecture

The Hadoop architecture is based on two main components, namely MapReduce and HDFS.

Different Hadoop Architectures based on the Parameters chosen

The Apache Hadoop Module

Hadoop Common: Includes the common utilities which support the other Hadoop
modules

HDFS: The Hadoop Distributed File System provides high-throughput access to application
data.
Hadoop YARN: This technology is basically used for job scheduling and efficient
management of cluster resources.
MapReduce: This is a highly efficient methodology for parallel processing of huge volumes
of data.

Then there are other projects included in the Hadoop module, like:

Apache Ambari, Apache Spark, Sqoop, Cassandra, Hive, Oozie, HBase, Pig, and Zookeeper.

How does Hadoop Work?

Hadoop helps to execute a large amount of processing by letting the user connect multiple
commodity computers together so that they work as a single functional distributed system.
The clustered machines read the dataset in parallel, produce intermediate results, and,
after integration, deliver the desired output.

Hadoop runs code across a cluster of computers and performs the following tasks:

 Data is initially divided into files and directories. Files are divided into consistently
sized blocks of 128 MB (Hadoop 2) or 64 MB (Hadoop 1).
 Then, the files are distributed across various cluster nodes for further processing of
data.
 The JobTracker starts its scheduling programs on individual nodes.
 Once all the nodes are done with scheduling, the output is returned.

The Challenges facing Data at Scale and the Scope of Hadoop

Big Data is categorized into:

 Structured – data stored in rows and columns, like relational datasets
 Unstructured – data that cannot be stored in rows and columns, like videos, images,
etc.
 Semi-structured – data in formats such as XML, readable by both machines and humans

There is a standardized methodology that Big Data processing follows, highlighting the usage
of ETL.
ETL stands for Extract, Transform, and Load.
Extract – fetching the data from multiple sources
Transform – converting the existing data to fit the analytical needs
Load – loading the processed data into the right systems to derive value from it

Comparison to Existing Database Technologies

Most database management systems are not up to scratch for operating at such lofty levels
of Big Data exigencies, either due to sheer technical inefficiency or the insurmountable
financial challenges posed. When the data is totally unstructured, the volume of data is
humongous, and the results are needed at high speed, the only platform that can effectively
stand up to the challenge is Apache Hadoop.
Hadoop majorly owes its success to a processing framework called MapReduce that is
central to its existence. The MapReduce technology gives an opportunity to all programmers
to contribute their part, where large datasets are divided and independently processed in
parallel. These coders need not know high-performance computing and can work
efficiently without worrying about intra-cluster complexities, monitoring of tasks, node
failure management, and so on.

Hadoop also contributes another platform, namely the Hadoop Distributed File
System (HDFS). The main strength of HDFS is its ability to rapidly scale and work without a
hitch irrespective of any fault with the nodes. HDFS, in essence, divides large files into smaller
blocks or units of 64 to 128 MB that are later copied onto a couple of nodes of the
cluster. Through this, HDFS ensures that no work stops even when some nodes go out of
service. HDFS provides APIs so that MapReduce programs can read and write
data (contents) simultaneously at high speed. When there is a need to speed up
performance, extra nodes can be added in parallel to the cluster, and the increased demand
can be met immediately.

Advantages of Hadoop

 It allows the user to rapidly write and test distributed systems. It automatically
distributes the data and work across the machines and, in turn, utilizes the underlying
parallelism of the CPU cores.
 The Hadoop library is developed to detect and handle failures at the
application layer.
 Servers can be added to or removed from the cluster dynamically at any point of time.
 It is open source and based on Java, and hence it is compatible with all
platforms.

Hadoop Features and Characteristics

Apache Hadoop is the most popular and powerful big data tool. It provides the world's most
reliable storage layer, HDFS (Hadoop Distributed File System), a batch processing engine,
namely MapReduce, and a resource management layer, YARN.
 Open Source – Apache Hadoop is an open-source project. It means its code can be
modified according to business requirements.

 Distributed Processing – Data storage is maintained in a distributed manner
in HDFS across the cluster, and data is processed in parallel on a cluster of nodes.
 Fault Tolerance – By default, three replicas of each block are stored across the
cluster in Hadoop, and this is changed only when required. Hadoop's fault tolerance can be
seen when any node goes down: the data on that node can be
recovered easily from other nodes. Failures of a particular node or task are recovered
automatically by the framework.
 Reliability – Due to the replication of data in the cluster, data is stored reliably
on the cluster of machines despite machine failures. Even if your machine goes
down, your data will still be stored reliably.
 High Availability – Data is available and accessible even if a hardware
failure occurs, thanks to multiple copies of the data. If any incident occurs, such as your machine
or a few pieces of hardware crashing, the data will be accessed from another path.
 Scalability – Hadoop is highly scalable, and hardware can be easily
added to the nodes. It also provides horizontal scalability, which means new nodes can
be added on the fly without any downtime.
 Economic – Hadoop is not very expensive as it runs on a cluster of commodity
hardware. We do not require any specialized machines for it. Hadoop provides a huge cost
reduction since it is very easy to add more nodes. So if the requirement
increases, the number of nodes can be increased, without any downtime and without
much pre-planning.
 Easy to Use – The client does not need to deal with distributed computing; the framework
takes care of all these things. So it is easy to use.
 Data Locality – Hadoop works on the data locality principle, which states that
computation is moved to the data instead of data to the computation. When a client submits
an algorithm, the algorithm is moved to the data in the cluster instead of bringing the
data to the location where the algorithm is submitted and then processing it.

Hadoop Assumptions

Hadoop is written with large clusters of computers in mind and is built upon the
following assumptions:
 Hardware may fail due to any external or technical malfunction; hence, inexpensive
commodity hardware can be used and failures are expected.
 Processing will be run in batches, and there exists an emphasis on high throughput as
opposed to low latency.
 Applications that run on HDFS have large datasets. A typical file in HDFS may be
gigabytes to terabytes in size.
 Applications require a write-once-read-many access model.
 Moving computation is cheaper than moving data.

Hadoop Design Principles

The following are the design principles on which Hadoop works:


 The system shall manage and heal itself as per the requirements that occur.
 Fault tolerance is automatic and transparent: work is routed around failures, and
redundant tasks are speculatively executed if certain nodes are detected to be running
slower.
 Performance scales linearly.
 Capacity changes proportionally as resources are changed (scalability).
 Compute must be moved to data.
 Data locality means lower latency and lower bandwidth usage.
 It is based on a simple core that is modular and extensible (economical).

Hadoop Ecosystem

Core Hadoop ecosystem is nothing but the different components that are built on the
Hadoop platform directly. However, there are a lot of complex interdependencies between
these systems.

HDFS

Starting from the base of the Hadoop ecosystem, there is HDFS or Hadoop Distributed File
System. It is a system that allows you to distribute the storage of big data across a cluster of
computers. That means, all of your hard drives look like a single giant cluster on your system.
That’s not all; it also maintains the redundant copies of data. So, if one of your computers
happen to randomly burst into flames or if some technical issues occur, HDFS can actually
recover from that by creating a backup from a copy of the data that it had saved
automatically, and you won’t even know if anything happened. So, that’s the power of HDFS,
i.e., the data storage is in a distributed manner having redundant copies.
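The redundant copies mentioned above are visible from the command line. As a small sketch, assuming a hypothetical file /user/intell/sample.txt is already stored in HDFS:

# The second column of the listing is the file's replication factor
hdfs dfs -ls /user/intell/sample.txt

# Ask HDFS to keep three redundant copies of this file and wait until done
hdfs dfs -setrep -w 3 /user/intell/sample.txt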

YARN

Next in the Hadoop ecosystem is YARN (Yet Another Resource Negotiator). It is the place
where the data processing of Hadoop comes into play. YARN is a system that manages the
resources on your computing cluster. It is the one that decides who gets to run the tasks,
when and what nodes are available for extra work, and which nodes are not available to do
so. So, it’s like the heartbeat of Hadoop that keeps your cluster going.
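A couple of commands make YARN's bookkeeping visible. This is just a sketch, assuming the YARN daemons are already running on your cluster:

# List the nodes YARN can schedule work on, with their current state
yarn node -list

# List the applications currently running under YARN's management
yarn application -list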

MapReduce

One interesting application that can be built on top of YARN is MapReduce. MapReduce, the
next component of the Hadoop ecosystem, is just a programming model that allows you to
process your data across an entire cluster. It basically consists of Mappers and Reducers that
are different scripts, which you might write, or different functions you might use when
writing a MapReduce program. Mappers have the ability to transform your data in parallel
across your computing cluster in a very efficient manner; whereas, Reducers are responsible
for aggregating your data together. This may sound like a simple model, but MapReduce is
very versatile. Mappers and Reducers put together can be used to solve complex problems.
We will talk about MapReduce in one of the upcoming sections of this Hadoop tutorial.

Apache Pig

Next up in the Hadoop ecosystem, we have a technology called Apache Pig. It is just a high-
level scripting language that sits on top of MapReduce. If you don’t want to write Java or
Python MapReduce codes and are more familiar with a scripting language that has
somewhat SQL-style syntax, Pig is for you. It is a very high-level programming API that allows
you to write simple scripts. You can get complex answers without actually writing Java code
in the process. Pig Latin will transform that script into something that will run on
MapReduce. So, in simpler terms, instead of writing your code in Java for MapReduce, you

can go ahead and write your code in Pig Latin which is similar to SQL. By doing so, you won’t
have to perform MapReduce jobs. Rather, just writing a Pig Latin code will perform
MapReduce functions.

Hive

Now, in the Hadoop ecosystem, there comes Hive. It also sits on top of MapReduce and
solves a similar type of problem as Pig, but it looks more like SQL. So, Hive is a way of taking
SQL queries and making the distributed data sitting on your file system look like a SQL
database. It has its own language, known as HiveQL (HQL). You can connect to Hive through
a shell client or ODBC (Open Database Connectivity) and execute SQL queries on the data
that is stored on your Hadoop cluster, even though it’s not really a relational database under
the hood. If you’re familiar with SQL, Hive might be a very useful API or interface for you to use.
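For instance, as a hedged sketch (the census table and its columns are made up for
illustration; hive -e simply runs a query string from the shell client), you could submit queries
like:

$ hive -e 'SHOW TABLES;'
$ hive -e 'SELECT city, SUM(population) FROM census GROUP BY city;'

Behind the scenes, each query is compiled into MapReduce jobs that run on the cluster.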

Apache Ambari

Apache Ambari is the next component in the Hadoop ecosystem; it sits on top of everything
and gives you a view of your cluster. It is basically an open-source administration tool
responsible for tracking applications and keeping their status. It lets you visualize what runs
on your cluster, which systems you’re using, and how many resources are being used. So,
Ambari gives you a view into the actual state of your cluster in terms of the applications that
are running on it. It can be considered a management tool that monitors the health of several
Hadoop clusters.

Mesos

Mesos isn’t really a part of Hadoop, but it’s included in the Hadoop ecosystem as it is an
alternative to YARN. It is also a resource negotiator just like YARN. Mesos and YARN solve
the same problem in different ways. The main difference between Mesos and YARN is in
their scheduler. In Mesos, when a job comes in, a job request is sent to the Mesos master,
and what Mesos does is it determines the resources that are available and it makes offers
back. These offers can be accepted or rejected. So, Mesos is another way of managing your
resources in the cluster.

Apache Spark

Spark is the most interesting technology of this Hadoop ecosystem. It sits on the same level
as MapReduce and right above Mesos to run queries on your data. It is mainly a real-time
data processing engine developed in order to provide faster and easy-to-use analytics than
MapReduce. Spark is extremely fast and is under a lot of active development. It is a very
powerful technology as it uses the in-memory processing of data. If you want to efficiently
and reliably process your data on the Hadoop cluster, you can use Spark for that. It can
handle SQL queries, do Machine Learning across an entire cluster of information, handle
streaming data, etc.

Tez

Tez, which is next in the Hadoop ecosystem, is similar to Spark and uses some of the same
techniques as Spark. It does what MapReduce does, but it produces a more optimal plan for
executing your queries. Tez, when used in conjunction with Hive, tends to accelerate Hive’s
performance. Hive is placed on top of MapReduce by default, but you can place it on top of
Tez instead, as Hive through Tez can be a lot faster than Hive through MapReduce. They are
simply two different means of executing the same queries.

Apache HBase

Next up in the Hadoop ecosystem is HBase. It sits off to the side, and it is a way of exposing
the data on your cluster to transactional platforms. It is a NoSQL database, i.e., a columnar
data store that is very fast and is meant for large transaction rates. It can expose data stored
in your cluster which might be transformed in some way by Spark or MapReduce, and it
provides a very fast way of exposing those results to other systems.
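As a hedged illustration (the table name, column family, and values are made up, and the
prompt is abbreviated), the HBase shell shows how quickly individual cells can be written and
read back:

$ hbase shell
hbase> create 'results', 'stats'
hbase> put 'results', 'row1', 'stats:count', '42'
hbase> get 'results', 'row1'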

Apache Storm

Apache Storm is basically a way of processing streaming data. So, if you have streaming data
from sensors or weblogs, you can actually process it in real time using Storm. Processing
data doesn’t have to be a batch thing anymore; you can update your Machine Learning
models or transform data into the database, all in real time, as the data comes in.

Oozie

Next up in the Hadoop ecosystem, there is Oozie. Oozie is just a way of scheduling jobs on
your cluster. So, if you have a task that needs to be performed on your Hadoop cluster
involving different steps and maybe different systems, Oozie is a way of scheduling all
these things together into jobs that can be run in some order. So, when you have more
complicated operations that require loading data into Hive, integrating that with Pig, and
maybe querying it with Spark, and then transforming the results into HBase, Oozie can
manage all that for you and make sure that it runs reliably on a consistent basis.
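As a hedged sketch (the Oozie server URL assumes the default port 11000, and job.properties
is a hypothetical file pointing at your workflow definition), a workflow is typically submitted
and checked from the command line like this:

$ oozie job -oozie http://localhost:11000/oozie -config job.properties -run
$ oozie job -oozie http://localhost:11000/oozie -info <job-id>

The second command prints the status of the workflow, where <job-id> is the ID returned by
the first command.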

ZooKeeper

ZooKeeper is basically a technology for coordinating everything on your cluster. So, it is a


technology that can be used for keeping track of the nodes that are up and the ones that are
down.  It is a very reliable way of keeping track of shared states across your cluster that
different applications can use. Many of these applications rely on ZooKeeper to maintain
reliable and consistent performance across a cluster even when a node randomly goes down.
Therefore, ZooKeeper can be used for keeping track of which node is the master, which nodes
are up, and which nodes are down. Actually, it’s even more extensible than that.
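As a hedged sketch (the /demo znode and its value are made up for illustration), the
ZooKeeper command-line client gives a feel for the kind of shared state it keeps track of:

$ zkCli.sh -server localhost:2181
ls /
create /demo shared-state
get /demo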

Data Ingestion

The below-listed systems in the Hadoop ecosystem are focused mainly on the problem of
data ingestion, i.e., how to get data into your cluster and into HDFS from external sources.
Let’s have a look at them.
 Sqoop: Sqoop is a tool used for transferring data between relational database servers
and Hadoop. Sqoop is used to import data from various relational databases, such as Oracle
and MySQL, into Hadoop HDFS and to export data from HDFS back to relational databases
(a hedged example command is sketched after this list).
 Flume: Flume is a service for aggregating, collecting, and moving large amounts of
log data. Flume has a flexible and simple architecture that is based on streaming data
flows. Its architecture is robust and fault-tolerant with reliable and recovery mechanisms.
It uses the extensible data model that allows for online analytic applications. Flume is
used to move the log data generated by application servers into HDFS at a higher
speed.
 Kafka: Kafka is also an open-source streaming data processing software that solves a
similar problem as Flume. It is used for building real-time data pipelines and streaming
apps reducing complexity. It is horizontally scalable and fault-tolerant. Kafka aims to
provide a unified, low-latency platform to handle real-time data feeds. Asynchronous
communication and messages can be established with the help of Kafka. This ensures
reliable communication.
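Here is the hedged Sqoop example referred to in the Sqoop item above (the JDBC URL,
database, table names, and HDFS paths are placeholders, not values from this tutorial):

$ sqoop import --connect jdbc:mysql://dbserver/sales --username dbuser -P \
  --table customers --target-dir /user/input/customers -m 1
$ sqoop export --connect jdbc:mysql://dbserver/sales --username dbuser -P \
  --table daily_summary --export-dir /user/output/daily_summary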

In the next section, we will be learning about HDFS in detail.

HDFS in Hadoop

So, what is HDFS? HDFS or Hadoop Distributed File System, which is completely written in
Java programming language, is based on the Google File System (GFS). Google had only
presented a white paper on this, without providing any particular implementation. It is
interesting that around 90 percent of the GFS architecture has been implemented in HDFS.
HDFS was formerly developed as a storage infrastructure for the Apache Nutch web search
engine project, and hence it was initially known as the Nutch Distributed File System (NDFS).
Later on, the HDFS design was developed essentially for using it as a distributed file system.
HDFS is extremely fault-tolerant and can hold a large number of datasets, along with
providing ease of access. The files in HDFS are stored across multiple machines in a
systematic order. This is to eliminate all feasible data losses in the case of any crash, and it
helps in making applications available for parallel processing. This file system is designed
for storing a very large number of files with streaming data access.
To know ‘What is HDFS in Hadoop?’ in detail, let’s first see what a file system is. Well, a file
system is one of the fundamental parts of every operating system. It basically administers the
storage on the hard disk.

Why do you need another file system?

Now, you know ‘What is HDFS in Hadoop?’ It is basically a file system. But, the question here
is, why do you need another file system?
Have you ever used a file system before?
The answer would be yes!

Let’s say, a person has a book and another has a pile of unordered papers from the same
book and both of them need to open Chapter 3 of the book. Who do you think would get to
Chapter 3 faster? The one with the book, right? Because, that person can simply go to the
index, look for Chapter 3, check out the page number, and go to the page. Meanwhile, the
one with the pile of papers has to go through the entire pile and if he is lucky enough, he
might find Chapter 3. Just like a well-organized book, a file system helps navigate data that is
stored in your storage.

Without a file system, the information stored in your hard disk will be a large body of data in
which there would be no way to tell where one piece of information stops and the next
begins.

The file system manages how a dataset is saved and retrieved. So, when reading and writing
of files is done on your hard disk, the request goes through a distinct file system. The file
system has some amount of metadata of your files such as size, filename, created time,
owner, modified time, etc.

When you want to write a file to a hard disk, the file system helps in figuring out where in the
hard disk the file should be written and how efficiently it can do so. How do you think the file
system manages to do that? Since it has all the details about the hard disk, including the
empty spaces available in it, it can directly write that particular file there.

Now, we will talk about HDFS, by working with Example.txt which is a 514 MB file.


When you upload a file into HDFS, it will automatically be split into 128 MB fixed-size blocks
(in the older versions of Hadoop, the file used to be divided into 64 MB fixed-size blocks). So,
the 514 MB Example.txt will be stored as four 128 MB blocks plus one 2 MB block, i.e., five
blocks in total. HDFS then takes care of placing each block on three different DataNodes by
replicating each block three times.
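If you want to see this for yourself, a hedged way to do it (assuming Example.txt has been
uploaded to the /user/input directory) is to run HDFS’s file system checker, which reports the
number of blocks and where they live:

$ hadoop fs -put Example.txt /user/input/
$ hdfs fsck /user/input/Example.txt -files -blocks -locations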

Now that you have understood why you need HDFS, next in this section on ‘What is HDFS?’
let’s see why it is a perfect match for big data.

HDFS is a perfect tool for working with big data. The following list of facts proves it.
 HDFS supports MapReduce-style processing, in which the computation is moved to where
the data is stored, which makes data access very fast.
 HDFS follows the data coherency model, in which the data is synchronized across the
cluster. It is very simple to implement and is highly robust and scalable.
 HDFS is compatible with any kind of commodity hardware and operating system.
 As data is saved in multiple locations, it is safe enough.
 It is conveniently accessible through a web browser, which makes it highly utilitarian.

Let’s see the architecture of HDFS.

HDFS Architecture

The following image gives the most important components present in the HDFS architecture.
It has a Master-Slave architecture and has several components in it.
Let’s start with the basic two nodes in the HDFS architecture, i.e., the DataNode and the
NameNode.

DataNode

Nodes wherein the blocks are physically stored are known as DataNodes. Since these nodes
hold the actual data of the cluster, they are termed DataNodes. Every DataNode knows the
blocks it is responsible for, but it is missing one major piece of information: which file a block
belongs to.

Although the DataNode knows about the block it is responsible for, it doesn’t care to know
about the other blocks and the other DataNodes. This is a problem for you as a user because
you don’t know anything about the blocks other than the file name, and you should be able
to work only with the file name in the Hadoop cluster.
So the question here is: if the DataNodes do not know which block belongs to which file,
then who has the key information? The key information is maintained by a node called the
NameNode.

NameNode

A NameNode keeps track of all the files or datasets in HDFS. It knows the list of blocks that
each file in HDFS is made up of, and not only the list of blocks but also their locations.

Why is a NameNode so important? Imagine that a NameNode is down in your Hadoop


cluster. In this scenario, there would be no way you could look up the files in the cluster
because you won’t be able to figure out the list of blocks that make up the files. Also, you
won’t be able to figure out the locations of the blocks. Apart from the block locations, a
NameNode also holds the metadata of the files and folders in HDFS, which includes
information such as the size, replication factor, created by, created on, last modified by, last
modified on, etc.
Due to the significance of the NameNode, it is also called the master node, and the
DataNodes are called slave nodes, and hence the master–slave architecture.
The NameNode persists all the metadata information about the files and folders to the hard
disk, except for the block locations.

The NameNode and the DataNodes are in constant communication. When a NameNode starts
up, the DataNodes will try to connect with it and broadcast the list of blocks that each of them
is responsible for. The NameNode will hold the block locations in memory and never persist
this information to the hard disk. This is because, in a busy cluster, HDFS is constantly changing
with new data files coming into the cluster, and if the NameNode had to persist every change
to a block by writing the information to the hard disk, it would be a bottleneck. Hence, with
performance in mind, the NameNode holds the block locations in memory so that it can give a
faster response to the clients. Therefore, it is clear that the NameNode is the most powerful
node in the cluster in terms of capacity. A NameNode failure is clearly not an option.

Other than the NameNode and the DataNodes, there is another component called the
secondary NameNode. It works alongside the primary NameNode as a helper. However, the
secondary NameNode is not a backup NameNode.

The functions of a secondary NameNode are listed below:


 The secondary NameNode reads all files, along with the metadata, from the RAM of
the NameNode. It also writes the metadata into the file system or to the hard disk.
 The secondary NameNode is also responsible for combining EditLogs with fsImage
present in the NameNode.
 At regular intervals, the EditLogs are downloaded from the NameNode and are
applied to fsImage by the secondary NameNode.
 The secondary NameNode has periodic checkpoints in HDFS, and hence it is also
called the checkpoint node.

Blocks

The data in HDFS is stored in the form of multiple files. These files are divided into one or
more segments and are further stored in individual DataNodes. These file segments are
known as blocks. The default block size is 128 MB in Apache Hadoop 2.x and 64 MB in
Apache Hadoop 1.x, which can be modified as per the requirements from the HDFS
configuration.
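A quick, hedged way to confirm the block size your cluster is actually using is to query the
dfs.blocksize configuration key; on a default Apache Hadoop 2.x setup this typically prints
134217728, i.e., 128 MB:

$ hdfs getconf -confKey dfs.blocksize
134217728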

HDFS blocks are huge compared to disk blocks, and they are designed this way to reduce the
cost of seeks. By making a block large enough, the time spent seeking on the disk becomes
small compared to the time spent transferring the data it holds. Therefore, a huge file made
up of multiple blocks can be transferred at close to the full disk transfer rate.
In the next part of this ‘What is HDFS?’ tutorial, let’s look at the benefits of HDFS.

Benefits of HDFS

 HDFS supports the concept of blocks: When uploading a file into HDFS, the file is
divided into fixed-size blocks to support distributed computation. HDFS keeps track of
all the blocks in the cluster.
 HDFS maintains data integrity: Data failures or data corruption are inevitable in any
big data environment. So, HDFS maintains data integrity and helps recover from data loss
by replicating each block on more than one node.
 HDFS supports scaling: If you like to expand your cluster by adding more nodes, it’s
very easy to do with HDFS.
 No particular hardware required: There is no need for any specialized hardware to
run or operate HDFS. It is basically built up to work with commodity computers.

Now, we come to the end of this section on ‘What is HDFS?’ of the Hadoop tutorial. We
learned ‘What is HDFS?’, the need for HDFS, and its architecture. In the next section of this
tutorial, we shall be learning about HDFS Commands.

Starting HDFS

Format the configured HDFS file system and then open the namenode (HDFS server) and
execute the following command.

$ hadoop namenode -format


Start the distributed file system and follow the command listed below to start the namenode
as well as the data nodes in cluster.

$ start-dfs.sh

Listing Files in HDFS

You can find the list of files in a directory and the status of a file using the ‘ls’ command in
the terminal. The ls command can be passed a directory or a filename as an argument, and the
contents are displayed as follows:

$ $HADOOP_HOME/bin/hadoop fs -ls <args>

Inserting Data into HDFS

Below mentioned steps are followed to insert the required file in the Hadoop file system.
Step1: Create an input directory

$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/input


Step 2: Use the put command to transfer and store the data file from the local system to
HDFS using the following command in the terminal.

$ $HADOOP_HOME/bin/hadoop fs -put /home/intellipaat.txt /user/input


Step3: Verify the file using ls command.

$ $HADOOP_HOME/bin/hadoop fs -ls /user/input

Retrieving Data from HDFS

For instance, if you have a file in HDFS called intellipaat, you can retrieve it from the Hadoop
file system by carrying out the following steps.
Step 1: View the data from HDFS using the cat command.

$ $HADOOP_HOME/bin/hadoop fs -cat /user/output/intellipaat


Step 2: Get the file from HDFS to the local file system using the get command, as shown below.

$ $HADOOP_HOME/bin/hadoop fs -get /user/output/ /home/hadoop_tp/

Shutting Down the HDFS

Shut down HDFS by running the following command:

$ stop-dfs.sh

Multi-Node Cluster
Installing Java
Check the installed Java version with the following command:

$ java -version
The following output is presented:

java version "1.7.0_71"


Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

Creating User Account

A system user account should be created on both the master and slave systems for the Hadoop installation.

# useradd hadoop
# passwd hadoop

Mapping the nodes

The hosts file in the /etc/ folder should be edited on every node, and the IP address of each
system followed by its hostname must be specified.

# vi /etc/hosts
Enter the following lines in the /etc/hosts file.

192.168.1.109 hadoop-master
192.168.1.145 hadoop-slave-1
192.168.56.1 hadoop-slave-2

Configuring Key Based Login

SSH should be set up on each node so that the nodes can communicate with one another
without being prompted for a password.

# su hadoop
$ ssh-keygen -t rsa
$ ssh-copy-id -i ~/.ssh/id_rsa.pub tutorialspoint@hadoop-master
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop_tp1@hadoop-slave-1
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop_tp2@hadoop-slave-2
$ chmod 0600 ~/.ssh/authorized_keys
$ exit

Installation of Hadoop

Hadoop should be downloaded in the master server using the following procedure.

# mkdir /opt/hadoop
# cd /opt/hadoop/
# wget http://apache.mesi.com.ar/hadoop/common/hadoop-1.2.0/hadoop-1.2.0.tar.gz
# tar -xzf hadoop-1.2.0.tar.gz
# mv hadoop-1.2.0 hadoop
# chown -R hadoop /opt/hadoop
# cd /opt/hadoop/hadoop/

Configuring Hadoop

The Hadoop server must be configured by editing the core-site.xml file as shown below,
adjusting the values wherever required.

<configuration>
<property>
<name>fs.default.name</name><value>hdfs://hadoop-master:9000/</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
The hdfs-site.xml file should be edited as follows.
<configuration>
<property>
<name>dfs.data.dir</name>
<value>/opt/hadoop/hadoop/dfs/name/data</value>
<final>true</final>

</property>
<property>
<name>dfs.name.dir</name>
<value>/opt/hadoop/hadoop/dfs/name</value>
<final>true</final>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
The mapred-site.xml file should be edited as per the requirement; an example is shown below.

<configuration>
<property>
<name>mapred.job.tracker</name><value>hadoop-master:9001</value>
</property>
</configuration>
JAVA_HOME, HADOOP_CONF_DIR, and HADOOP_OPTS should be edited in hadoop-env.sh as follows:

export JAVA_HOME=/opt/jdk1.7.0_17
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
export HADOOP_CONF_DIR=/opt/hadoop/hadoop/conf

Installing Hadoop on Slave Servers

Hadoop should be installed on all the slave servers

# su hadoop
$ cd /opt/hadoop
$ scp -r hadoop hadoop-slave-1:/opt/hadoop
$ scp -r hadoop hadoop-slave-2:/opt/hadoop

Configuring Hadoop on Master Server

Master server configuration

# su hadoop
$ cd /opt/hadoop/hadoop
Master Node Configuration
$ vi etc/hadoop/masters

hadoop-master
Slave Node Configuration

$ vi etc/hadoop/slaves
hadoop-slave-1
hadoop-slave-2
Name Node format on Hadoop Master

# su hadoop
$ cd /opt/hadoop/hadoop
$ bin/hadoop namenode -format
11/10/14 10:58:07 INFO namenode.NameNode: STARTUP_MSG:
************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = hadoop-master/192.168.1.109
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 1.2.0
STARTUP_MSG: build =
https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 -r 1479473;
compiled by 'hortonfo' on Monday May 6 06:59:37 UTC 2013
STARTUP_MSG: java = 1.7.0_71
************************************************************
11/10/14 10:58:08 INFO util.GSet: Computing capacity for map BlocksMap
editlog=/opt/hadoop/hadoop/dfs/name/current/edits
………………………………………………….
11/10/14 10:58:08 INFO common.Storage: Storage directory /opt/hadoop/hadoop/dfs/name
has been successfully formatted.
11/10/14 10:58:08 INFO namenode.NameNode: SHUTDOWN_MSG:
************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop-master/192.168.1.15
************************************************************

Hadoop Services

The following procedure explains how to start the Hadoop services on the Hadoop master.

$ cd $HADOOP_HOME/sbin
$ start-all.sh
Addition of a New DataNode in the Hadoop Cluster is as follows:

Networking

Add new nodes to an existing Hadoop cluster with a suitable network configuration.
Consider the following network configuration for the new node:

IP address : 192.168.1.103
netmask : 255.255.255.0
hostname : slave3.in

Adding a User and SSH Access


Add a new user under the “hadoop” group on the new node, grant it the required access, and
set the password of the hadoop user to anything you want.

useradd hadoop
passwd hadoop
To be executed on master

mkdir -p $HOME/.ssh
chmod 700 $HOME/.ssh
ssh-keygen -t rsa -P '' -f $HOME/.ssh/id_rsa
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
chmod 644 $HOME/.ssh/authorized_keys
Copy the public key to the new slave node, into the hadoop user’s $HOME directory:
scp $HOME/.ssh/id_rsa.pub hadoop@192.168.1.103:/home/hadoop/

Execution done on slaves

su hadoop or ssh -X hadoop@192.168.1.103


Content of public key must be copied into file “$HOME/.ssh/authorized_keys” and then the
permission for the same must be changed as per the requirement.

cd $HOME
mkdir -p $HOME/.ssh
chmod 700 $HOME/.ssh
cat id_rsa.pub >>$HOME/.ssh/authorized_keys
chmod 644 $HOME/.ssh/authorized_keys
Finally, from the master machine, verify that you can ssh to the new node without being
prompted for a password.

ssh hadoop@192.168.1.103 or hadoop@slave3


Setting  Hostname for New Node

Hostname is setup in the file directory  /etc/sysconfig/network


On new slave3 machine

NETWORKING=yes
HOSTNAME=slave3.in
The machine must be restarted, or the hostname command should be run on the new machine
with the corresponding hostname, to make the change effective.

On slave3 node machine:

hostname slave3.in
/etc/hosts must be updated on all machines of the cluster:
192.168.1.103 slave3.in slave3
Ping the machine with its hostname to check whether it resolves to the IP address.

ping master.in

Start the DataNode on New Node

The DataNode daemon should be started manually using the $HADOOP_HOME/bin/hadoop-
daemon.sh script. The new DataNode will automatically contact the master (NameNode) and
join the cluster. The new node should also be added to the slaves file in the master server so
that script-based commands will recognize it.
Login to new node

su hadoop or ssh -X hadoop@192.168.1.103


Start HDFS on the newly added slave node by using the following command:

./bin/hadoop-daemon.sh start datanode


 jps command output must be checked on a new node.

$ jps
7141 DataNode
10312 Jps

Removing a DataNode
A node can be removed from a cluster while it is running, without any worry of data loss. HDFS
provides a decommissioning feature, which ensures that removing a node is performed safely.
Step 1
Log in to the master machine as the user under which Hadoop is installed.

$ su hadoop

Step 2
Before starting the cluster, an exclude file must be configured: a key named
dfs.hosts.exclude should be added to our $HADOOP_HOME/etc/hadoop/hdfs-site.xml file.
The value associated with this key is the full path to a file on the NameNode’s local file
system that contains the list of machines which are not permitted to connect to HDFS, as
follows.

<property>
<name>dfs.hosts.exclude</name><value>/home/hadoop/hadoop-
1.2.1/hdfs_exclude.txt</value><description>>DFS exclude</description>
</property>

Step 3
Determine the hosts to decommission. Add each machine to be decommissioned to the
hdfs_exclude.txt file, one domain name per line; this will prevent them from connecting to
the NameNode.

slave2.in

Step 4
Force configuration reloads.

“$HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes” should be run


$ $HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes
The NameNode will be forced to re-read its configuration, including the newly updated
‘excludes’ file. Nodes will be decommissioned over a period of time, allowing each node’s
blocks to be replicated onto machines which are scheduled to remain active. The jps command
output should be checked on slave2.in. Once the work is done, the DataNode process will shut
down automatically.

Step 5
Shut down the nodes.
After the decommission process has finished, the decommissioned hardware can be safely shut
down for maintenance. The dfsadmin -report command can be used to check the status of the
decommissioned nodes.

$ $HADOOP_HOME/bin/hadoop dfsadmin -report


Step 6
Once the machines have been decommissioned, they can be removed from the ‘excludes’ file.
Running “$HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes” again will read the excludes
file back into the NameNode, allowing the DataNodes to rejoin the cluster after the
maintenance has been completed or whenever additional capacity is needed in the cluster.
To stop or start the TaskTracker:

$ $HADOOP_HOME/bin/hadoop-daemon.sh stop tasktracker


$ $HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker

Add a new node with the following steps

1) Take a new system; create a new username and password for it
2) Install SSH and set up SSH connections with the master node
3) Add the SSH public RSA key to the authorized_keys file
4) Add the new DataNode’s hostname, IP address, and other details to /etc/hosts and to the
slaves file: 192.168.1.103 slave3.in slave3
5) Start the DataNode on the new node
6) Log in to the new node with a command like su hadoop or ssh -X hadoop@192.168.1.103
7) Start HDFS on the newly added slave node by using the following command:
./bin/hadoop-daemon.sh start datanode
8) Check the output of the jps command on the new node.

MapReduce in Hadoop

Now that you know about HDFS, it is time to talk about MapReduce. So, in this section, we’re
going to learn the basic concepts of MapReduce.
We will learn MapReduce in Hadoop using a fun example!
MapReduce in Hadoop is nothing but the processing model in Hadoop. The programming
model of MapReduce is designed to process huge volumes of data in parallel by dividing the
work into a set of independent tasks. As we learned in the Hadoop architecture, the complete
job or work is submitted by the user to the master node, which divides it into smaller tasks
and distributes them to the slave nodes.

How does MapReduce in Hadoop work?

Now, how does MapReduce in Hadoop work? Let’s explore a scenario.


Imagine that the governor of the state assigns you to manage the upcoming census drive for
State A. Your task is to find the population of all cities in State A. You are given four months
to do this with all the resources you may require.
So, how would you complete this task? Estimating the population of all cities in a big state is
not a simple task for a single person. A practical thing to do would be to divide the state by
cities and assign each city to a separate individual, who would calculate the population of the
respective city.

Here is an illustration for this scenario considering only three cities, X, Y, and Z:
Person 1, Person 2, and Person 3 will be in charge of the X, Y, and Z cities, respectively.

So, you have broken down State A into cities where each city is allocated to different people,
and they are solely responsible for figuring out the population of their respective cities. Now,
you need to provide specific instructions to each of them. You will ask them to go to each
home and find out how many people live there.

Assume there are five people in the home Person 1 first visits in City X. He notes down: X, 5.
It’s a proper divide-and-conquer approach, right?
The same set of instructions would be carried out by everyone associated. That means that
Person 2 will go to City Y and Person 3 will go to City Z and will do the same. In the end, they
would need to submit their results to the state’s headquarters where someone would

aggregate the results. Hence, in this strategy, you will be able to calculate the population of
State A in four months.

Next year, you’re asked to do the same job but in two months. What would you do? Won’t
you just double the number of people performing the task? So, you would divide City X into
two parts and would assign one part to Person 1 and have one more person to take charge
of the other part. Then, the same would be done for Cities Y and Z, and the same set of
instructions would be given again.

Also there would be two headquarters HQ1 and HQ2. So, you will ask the census takers at X1
and X2 to send their results to HQ1 or HQ2. Similarly, you would instruct census takers for
Cities Y and Z and tell them that they should either send the results to HQ1 or HQ2. Problem
solved!

So, with twice the force, you would be able to achieve the same in two months.
Now, you have a good enough model. This model is called MapReduce.

MapReduce in Hadoop is a distributed programming model for processing large datasets.


This concept was conceived at Google and Hadoop adopted it. It can be implemented in any
programming language, and Hadoop supports a lot of programming languages to write
MapReduce programs.

You can write a MapReduce program in Scala, Python, C++, or Java. MapReduce is not a
programming language; rather, it is a programming model. So, the MapReduce system in
Hadoop manages data transfer for parallel execution across distributed servers or nodes.
Now let’s look at the phases involved in MapReduce.

Phases in MapReduce

There are mainly three phases in MapReduce: the map phase, the shuffle phase, and the
reduce phase. Let’s understand them in detail with the help of the same scenario.

Map Phase

The phase wherein individuals count the population of their assigned cities is called the map
phase. There are some terms related to this phase.

 Mapper: The individual (census taker) involved in the actual calculation is called a


mapper.
 Input split:  The part of the city each census taker is assigned with is known as the
input split.
 Key–Value pair: The output from each mapper is a key–value pair. As in the example
above, the key is X and the value is, say, 5.

Reduce Phase

By now, the large dataset has been broken down into various input splits and the instances
of the tasks have been processed. This is when the reduce phase comes into place. Similar to
the map phase, the reduce phase processes each key separately.
 Reducers: The individuals who work in the headquarters are known as the reducers.
This is because they reduce or consolidate the outputs from many different mappers.
After the reducer has finally finished the task, a results file is generated, which is stored in
HDFS. Then, HDFS replicates these results.

Shuffle Phase

The phase in which the values from different mappers are copied or transferred to reducers
is known as the shuffle phase.

The shuffle phase comes in-between the map phase and the reduce phase.
Now, let’s see the architectural view of how map and reduce work together:
When the input data is given to a mapper, it is processed through some user-defined
functions that are written in the mapper. The output is generated by the mapper, i.e., the
intermediate data. This intermediate data is the input for a reducer.

This intermediate data is then processed by some user-defined functions that are written in
the reducer, and the final output is produced. Further, this output is saved in HDFS, and
replication is done. So far, we have learned how Hadoop MapReduce works.
Moving ahead, let’s discuss some features of MapReduce.

Features of the MapReduce System

The important features of MapReduce are illustrated as follows:


 Abstracts developers from the complexity of distributed programming languages
 In-built redundancy and fault tolerance is available
 The MapReduce programming model is language independent
 Automatic parallelization and distribution are available
 Enables the local processing of data
 Manages all the inter-process communication
 Parallelly manages distributed servers running across various tasks
 Manages all communications and data transfers between various parts of the system
module
 Redundancy and failure handling are provided for the overall management of the whole
process.

YARN in Hadoop

So, what is YARN in Hadoop? Apache YARN (Yet Another Resource Negotiator) is a resource
management layer in Hadoop. YARN came into the picture with the introduction of Hadoop
2.x. It allows various data processing engines such as interactive processing, graph
processing, batch processing, and stream processing to run and process data stored
in HDFS (Hadoop Distributed File System).

YARN was introduced to make the most out of HDFS, and job scheduling is also handled by
YARN.

Now that YARN has been introduced, the architecture of Hadoop 2.x provides a data
processing platform that is not only limited to MapReduce. It lets Hadoop process other-
purpose-built data processing systems as well, i.e., other frameworks can run on the same
hardware on which Hadoop is installed.

Why is YARN in Hadoop used?

In spite of being thoroughly proficient at data processing and computations, Hadoop 1.x had
some shortcomings like delays in batch processing, scalability issues, etc. as it relied on
MapReduce for processing big datasets. With YARN, Hadoop is now able to support a variety
of processing approaches and has a larger array of applications. Hadoop YARN clusters are
now able to run stream data processing and interactive querying side by side with
MapReduce batch jobs. YARN framework runs even the non-MapReduce applications, thus
overcoming the shortcomings of Hadoop 1.x.
Next, let’s discuss the Hadoop YARN architecture.

Hadoop YARN Architecture

Now, we will discuss the architecture of YARN. Apache YARN framework contains a Resource
Manager (master daemon), Node Manager (slave daemon), and an Application Master. Let’s
now discuss each component of Apache Hadoop YARN one by one in detail.

Resource Manager

Resource Manager is the master daemon of YARN. It is responsible for managing several
other applications, along with the global assignments of resources such as CPU and memory.
It is basically used for job scheduling. Resource Manager has two components:
 Scheduler: The Scheduler’s task is to distribute resources to the running applications. It
only deals with the scheduling of tasks and hence it performs no tracking and no
monitoring of applications.
 Application Manager: Application Manager manages applications running in the
cluster. Tasks, such as the starting of Application Master or monitoring, are done by the
Application Manager.

Node Manager

Node Manager is the slave daemon of YARN. It has the following responsibilities:
 Node Manager has to monitor the container’s resource usage, along with reporting it
to the Resource Manager.
 The health of the node on which YARN is running is tracked by the Node Manager.
 It takes care of each node in the cluster while managing the workflow, along with
user jobs on a particular node.
 It keeps the data in the Resource Manager updated

 Node Manager can also destroy or kill the container if it gets an order from the
Resource Manager to do so.

The third component of Apache Hadoop YARN is the Application Master.

Application Master

Every job submitted to the framework is an application, and every application has a specific
Application Master associated with it. Application Master performs the following tasks:
 It coordinates the execution of the application in the cluster, along with managing
the faults.
 It negotiates resources from the Resource Manager.
 It works with the Node Manager for executing and monitoring other components’
tasks.
 At regular intervals, heartbeats are sent to the Resource Manager for checking its
health, along with updating records according to its resource demands.

Container

A container is a set of physical resources (CPU cores, RAM, disks, etc.) on a single node. The
tasks of a container are listed below:
 It grants the right to an application to use a specific amount of resources (memory,
CPU, etc.) on a specific host.
 YARN containers are managed by a Container Launch Context (CLC). This record
contains a map of environment variables, dependencies stored in remotely accessible
storage, security tokens, payload for Node Manager services, and the command necessary
to create the process.

How does Apache Hadoop YARN work?

YARN decouples resource management from MapReduce, making the Hadoop environment
more suitable for applications that can’t wait for the batch processing jobs to get finished. So,
no more batch processing delays with YARN! This architecture lets you process data with multiple
processing engines using real-time streaming, interactive SQL, batch processing, handling of
data stored in a single platform, and working with analytics in a completely different manner.
It can be considered as the basis of the next generation of Hadoop ecosystem, ensuring that
the forward-thinking organizations are realizing the modern data architecture.
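As a hedged illustration of YARN acting as this central resource manager (the application ID
shown is a made-up placeholder), the yarn command-line client can be used to inspect what is
running on the cluster:

$ yarn node -list
$ yarn application -list
$ yarn logs -applicationId application_1510000000000_0001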

How is an application submitted in YARN?

1. Submit the job


2. Get an application ID
3. Retrieval of the context of application submission
 Start Container Launch
 Launch Application Master

4. Allocate Resources.
 Container

 Launching

5. Executing

Workflow of an Application in YARN

1. Submission of the application by the client

2. Container allocation for starting the Application Master
3. Registering the Application Master with the Resource Manager
4. The Application Master asks for containers from the Resource Manager
5. The Application Master notifies the Node Manager to launch containers
6. Application code gets executed in the container
7. The client contacts the Resource Manager/Application Master to monitor the status of the
application
8. The Application Master unregisters from the Resource Manager

Features of YARN

 High-degree compatibility: Applications created using the MapReduce framework
can be run easily on YARN.
 Better cluster utilization: YARN allocates all cluster resources in an efficient and
dynamic manner, which leads to better utilization of Hadoop as compared to the
previous version of it.
 Utmost scalability: Whenever there is an increase in the number of nodes in the
Hadoop cluster, the YARN Resource Manager assures that it meets the user
requirements.
 Multi-tenancy: Various engines that access data on the Hadoop cluster can
efficiently work together all because of YARN as it is a highly versatile technology.

YARN vs MapReduce

In Hadoop 1.x, the batch processing framework MapReduce was closely paired with HDFS.
With the addition of YARN to these two components, giving birth to Hadoop 2.x, came a lot
of differences in the ways in which Hadoop worked. Let’s go through these differences.

Type of processing: YARN supports real-time, batch, and interactive processing with multiple
engines, whereas MapReduce supports silo and batch processing with a single engine.
Cluster resource optimization: Excellent with YARN due to central resource management;
average with MapReduce due to fixed Map and Reduce slots.
Suitable for: YARN suits MapReduce and non-MapReduce applications, whereas MapReduce
suits only MapReduce applications.
Managing cluster resources: Done by YARN itself; in MapReduce it is done by the JobTracker.
Namespace: With YARN, Hadoop supports multiple namespaces, whereas MapReduce supports
only one namespace, i.e., HDFS.

Pig in Hadoop

Pig Hadoop is basically a high-level programming language that is helpful for the analysis of
huge datasets. Pig Hadoop was developed by Yahoo! and is generally used with Hadoop to
perform a lot of data administration operations.

For writing data analysis programs, Pig renders a high-level programming language called
Pig Latin. Several operators are provided by Pig Latin using which personalized functions for
writing, reading, and processing of data can be developed by programmers.

For analyzing data through Apache Pig, we need to write scripts using Pig Latin. Then, these
scripts need to be transformed into MapReduce tasks. This is achieved with the help of Pig
Engine.

Why Apache Pig?

By now, we know that Apache Pig is used with Hadoop, and Hadoop is based on the Java
programming language. Now, the question that arises in our minds is ‘Why Pig?’ The need
for Apache Pig came up when many programmers weren’t comfortable with Java and were
facing a lot of struggle working with Hadoop, especially, when MapReduce tasks had to be
performed. Apache Pig came into the Hadoop world as a boon for all such programmers.
 After the introduction of Pig Latin, now, programmers are able to work on
MapReduce tasks without the use of complicated codes as in Java.
 To reduce the length of code, Apache Pig uses a multi-query approach, which
reduces development time by 16 times.
 Since Pig Latin is very similar to SQL, it is comparatively easy to learn Apache Pig if we
have little knowledge of SQL.

Features of Pig Hadoop

There are several features of Apache Pig:


1. In-built operators: Apache Pig provides a very good set of operators for performing
several data operations like sort, join, filter, etc.
2. Ease of programming: Since Pig Latin has similarities with SQL, it is very easy to
write a Pig script.
3. Automatic optimization: The tasks in Apache Pig are automatically optimized. This
makes the programmers concentrate only on the semantics of the language.
4. Handles all kinds of data: Apache Pig can analyze both structured and unstructured
data and store the results in HDFS.

Apache Pig Architecture

The main reason why programmers have started using Hadoop Pig is that it converts the
scripts into a series of MapReduce tasks making their job easy.

Pig Hadoop framework has four main components:
1. Parser: When a Pig Latin script is sent to Hadoop Pig, it is first handled by the parser.
The parser is responsible for checking the syntax of the script, along with other
miscellaneous checks. Parser gives an output in the form of a Directed Acyclic Graph
(DAG) that contains Pig Latin statements, together with other logical operators
represented as nodes.
2. Optimizer: After the output from the parser is retrieved, a logical plan for DAG is
passed to a logical optimizer. The optimizer is responsible for carrying out the logical
optimizations.
3. Compiler: The role of the compiler comes in when the output from the optimizer is
received. The compiler compiles the logical plan sent by the optimizer. The logical plan is
then converted into a series of MapReduce tasks or jobs.
4. Execution Engine: After the logical plan is converted to MapReduce jobs, these jobs
are sent to Hadoop in a properly sorted order, and these jobs are executed on Hadoop
for yielding the desired result.

Downloading and Installing Pig Hadoop

Follow the below steps for the Apache Pig installation. These steps are for
Linux/CentOS/Windows (using VM/Ubuntu/Cloudera). In this tutorial section on ‘Pig
Hadoop’, we are using CentOS.

Step 1: Download the Pig.tar file by writing the following command on your Terminal:
wget http://www-us.apache.org/dist/pig/pig-0.16.0/pig-0.16.0.tar.gz

Step 2: Extract the tar file (you downloaded in the previous step) using the following
command:
 tar -xzf pig-0.16.0.tar.gz

Your tar file gets extracted automatically from this command. To check whether your file is
extracted, write the command ls for displaying the contents of the file. If you see the below
output, the Pig file has been successfully extracted.

Step 3: Edit the .bashrc file to update the Apache Pig environment variables. This is required
so that you can access Pig from any directory instead of going to the Pig directory to execute
Pig commands. Other applications can also access Pig using the path of Apache Pig set in this
file.

A new window will open up wherein you need to add a few commands.

When the above window pops up, write down the following commands at the end of the file:
# Set PIG_HOME
export PIG_HOME=/home/training/pig-0.16.0
export PATH=$PATH:/home/training/pig-0.16.0/bin
export PIG_CLASSPATH=$HADOOP_CONF_DIR

You then need to save this file in:
File > Save
You have to close this window and then, on your terminal, enter the following command for
getting the changes updated:
source .bashrc

Step 4: Check the Pig Version. To check whether Pig is successfully installed, you can run the
following command:
pig -version

Step 5: Start the Grunt shell (used to run Pig Latin scripts) by running the command: pig

By default, Pig Hadoop chooses to run MapReduce jobs, which requires access to the Hadoop
cluster and the HDFS installation. But there is another mode, i.e., the local mode, in which all
the files are installed and run using the localhost and the local file system. You can run the
local mode using the command: pig -x local
I hope you were able to successfully install Apache Pig. In this section on Apache Pig, we
learned ‘What is Apache Pig?’, why we need Pig, its features, architecture, and finally the
installation of Apache Pig.

Apache Pig:

It is a high-level platform for creating programs that run on Hadoop; its language is known
as Pig Latin. Pig can execute its Hadoop jobs in MapReduce.
Data types:

A particular kind of data defined by the values it can take

 Simple data types:
o Int – It is a signed 32 bit integer
o Long- It is a signed 64 bit integer
o Float- 32 bit floating point
o Double- 64 bit floating point
o Chararray- Character array in UTF 8 format
o Bytearray- byte array (blob)
o Boolean: True or False
 Complex data types:
o Tuple: It is an ordered set of fields
o Bag: It is a collection of tuples
o Map: A set of key value pairs

Apache Pig Components:

 Parser: Parser is used to check the syntax of the scripts.


 Optimizer: It is used for the logical optimizations such as projection and push down
 Compiler: Compiler is used to compile the optimized logical plan into a series of
MapReduce jobs
 Execution engine: The MapReduced jobs are executed on Hadoop, and the desired
results are obtained

Pig execution modes:

 Grunt mode: This is a very interactive mode, useful for testing, syntax checking, and
ad hoc data exploration
 Script mode: It is used to run set of instructions from a file
 Embedded mode: It is useful to execute pig programs from a java program
 Local mode: In this mode the entire pig job runs as a single JVM process
 MapReduce Mode: In this mode, pig runs the jobs as a series of map reduce jobs
 Tez: In this mode, pig jobs run as a series of tez jobs
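The local, MapReduce, and Tez modes above correspond to the -x (exectype) flag used when
launching Pig; as a hedged sketch (wordcount.pig is a hypothetical script file):

$ pig -x local wordcount.pig
$ pig -x mapreduce wordcount.pig
$ pig -x tez wordcount.pig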

Apache Pig Architecture

Pig commands equivalent to the SQL functions:

SELECT: FOREACH alias GENERATE column_name, column_name;
SELECT *: FOREACH alias GENERATE *;
DISTINCT: DISTINCT(FOREACH alias GENERATE column_name, column_name);
WHERE: FOREACH (FILTER alias BY column_name operator value) GENERATE column_name, column_name;
AND/OR: FILTER alias BY (column_name operator value1 AND column_name operator value2) OR column_name operator value3;
ORDER BY: ORDER alias BY column_name ASC|DESC, column_name ASC|DESC;
TOP/LIMIT: FOREACH (GROUP alias BY column_name) GENERATE LIMIT alias number; TOP(number, column_index, alias);
GROUP BY: FOREACH (GROUP alias BY column_name) GENERATE function(alias.column_name);
LIKE: FILTER alias BY REGEX_EXTRACT(column_name, pattern, 1) IS NOT NULL;
IN: FILTER alias BY column_name IN (value1, value2, …);
JOIN: FOREACH (JOIN alias1 BY column_name, alias2 BY column_name) GENERATE column_name(s);
LEFT/RIGHT/FULL OUTER JOIN: FOREACH (JOIN alias1 BY column_name LEFT|RIGHT|FULL, alias2 BY column_name) GENERATE column_name(s);
UNION ALL: UNION alias1, alias2;
AVG: FOREACH (GROUP alias ALL) GENERATE AVG(alias.column_name);
COUNT: FOREACH (GROUP alias ALL) GENERATE COUNT(alias);
COUNT DISTINCT: FOREACH alias { unique_column = DISTINCT column_name; };
MAX: FOREACH (GROUP alias ALL) GENERATE MAX(alias.column_name);
MIN: FOREACH (GROUP alias ALL) GENERATE MIN(alias.column_name);
SUM: FOREACH (GROUP alias ALL) GENERATE SUM(alias.column_name);
HAVING: FILTER alias BY aggregate_function(column_name) operator value;
UCASE/UPPER: FOREACH alias GENERATE UPPER(column_name);
LCASE/LOWER: FOREACH alias GENERATE LOWER(column_name);
SUBSTRING: FOREACH alias GENERATE SUBSTRING(column_name, start, start + length) AS some_name;
LEN: FOREACH alias GENERATE SIZE(column_name);
ROUND: FOREACH alias GENERATE ROUND(column_name);

Pig Operators:

Loading and storing:
LOAD – It is used to load data into a relation
DUMP – Dumps the data to the console
STORE – Stores data in a given location
Grouping and joining:
GROUP, COGROUP – Group the data from one or more relations based on the key
CROSS, JOIN – Used to join two or more relations
Sorting and limiting:
LIMIT – It is used for limiting the results
ORDER – It is used for sorting by categories or fields
Data sets:
UNION – It is used for combining multiple relations
SPLIT – It is used for splitting a relation

Basic Operators:

Operators Description
Arithmetic operators +, -, *, /, %, ?, :
Boolean operators And, or, not
Casting operators Casting from one datatype to another
Comparison Operators ==, !=, >, <, >=, <=, matches
Construction operators Used to construct tuple(), bag{}, map[]

Dereference operators Used for dereferencing tuples (tuple.id or tuple.(id, …)),
bags (bag.id or bag.(id, …)), and maps (map#'key')
Disambiguate operator (::) Used to identify field names after JOIN, COGROUP, CROSS, or
FLATTEN operators
Flatten operator It is used to flatten (un-nest) tuples as well as bags
Null operators Is null, is not null
Sign operators + has no effect; - changes the sign of a positive/negative number

Relational Operators:

Operators Description
COGROUP/ GROUP It is used to group the data in one or more relations
COGROUP operator groups together the tuples that has the
same group key
CROSS This operator is used to compute the cross product of two or
more relations
DEFINE This operator assigns an alias to an UDF or a streaming
command
DISTINCT This operator will remove the duplicate tuples from a relation
FILTER It selects the tuples from a relation based on the specified
condition
FOREACH It generates data transformations for each row of a relation, as
specified
IMPORT This operator imports macros defined in a separate file
JOIN This operator performs inner join of two or more relations based
on common field values
LOAD This operator loads the data from a file system
MAPREDUCE This operator executes the native MapReduce jobs in a Pig script
ORDER BY This will sort the relation based on two or more fields
SAMPLE Divides the relation into two or more relations, and selects a
random data sample based on a specified size
SPLIT This will partition the relation based on some conditions or
expressions as specified
STORE This will store or save the result in a file system
STREAM This operator sends the data to an external script or program
UNION This operator is used to compute the unions of two or more
relations

Diagnostic Operators:

Operator Description
Describe Returns the schema of the relation
Dump It will dump or display the result on screen
Explain Displays execution plans
Illustrate It displays the step by step execution for the sequence of
statements

Differentiation between Operational vs. Analytical Systems

                Operational             Analytical
Latency         1 ms to 100 ms          1 min to 100 min
Concurrency     1,000 to 100,000        1 to 10
Access Pattern  Writes and reads        Reads
Queries         Selective               Unselective
Data Scope      Operational             Retrospective
End User        Customer                Data Scientist
Technology      NoSQL Database          MapReduce, MPP Database

Traditional Enterprise Approach

In this approach, an enterprise uses a single computer to store and process big data. For
storage, it relies on the database vendor of its choice, such as Oracle, IBM, etc. The user
interacts with the application, which handles the data storage and analysis.

Limitation
This approach works well for applications that require modest storage, processing, and
database capabilities, but when it comes to dealing with large amounts of scalable data, it
becomes a bottleneck.
Solution
Google solved this problem using an algorithm based on MapReduce. This algorithm divides
the task into small parts or units, assigns them to multiple computers, and then integrates the
intermediate results to produce the desired final result. Intellipaat’s Big Data Hadoop training
will really help you get a better understanding of the concepts of Big Data solutions in the
Open Data Platform!

Pig built-in functions:

Type Examples
EVAL functions AVG, COUNT, COUNT_STAR, SUM, TOKENIZE, MAX, MIN, SIZE
etc
LOAD or STORE functions Pigstorage(), Textloader, HbaseStorage, JsonLoader, JsonStorage
etc
Math functions ABS, COS, SIN, TAN, CEIL, FLOOR, ROUND, RANDOM etc
String functions TRIM, RTRIM, SUBSTRING, LOWER, UPPER etc
DateTime function GetDay, GetHour, GetYear, ToUnixTime, ToString etc

Eval functions:

 AVG(col): computes the average of the numerical values in a single column of a bag
 CONCAT(string expression1, string expression2) : Concatenates two expressions of
identical type
 COUNT(DataBag bag): Computes the number of elements in a bag excluding null
values
 COUNT_STAR(DataBag bag): Computes the number of elements in a bag, including null
values
 DIFF(DataBag bag1, DataBag bag2): It is used to compare two bags; any elements that are
in one bag but not in the other are returned in a bag
 IsEmpty(DataBag bag), IsEmpty(Map map): It is used to check if the bag or map is
empty
 Max(col): Computes the maximum of the numeric values or character in a single
column bag
 MIN(col): Computes the minimum of the numeric values or character in a single
column bag
 DEFINE pluck pluckTuple(expression1): It allows the user to specify a string prefix and
filters the columns which begin with that prefix
 SIZE(expression): Computes the number of elements based on any pig data
 SUBTRACT(DataBag bag1, DataBag bag2): It returns a bag containing the elements of
bag1 that are not present in bag2
 SUM: Computes the sum of the values in a single-column bag
 TOKENIZE(String expression [, ‘field delimiter’]): It splits the string and outputs a bag of
words

Load or Store Functions:

 PigStorage ():

Syntax: PigStorage(field_delimiter)
A = LOAD ‘Employee’ USING PigStorage(‘\t’) AS (name: chararray, age:int, gpa: float);
Loads and stores data as structured text file
 TextLoader():

Syntax: A = LOAD ‘data’ USING TextLoader();
Loads unstructured data in UTF 8 format
 BinStorage():

Syntax: A = LOAD ‘data’ USING BinStorage();


Loads and stores data in machine readable format
 Handling compression:

It loads and stores compressed data in Pig


 JsonLoader, JsonStorage:

Syntax: A = load ‘a.json’ using JsonLoader();


It loads and stores JSON data
 Pig dump:

Syntax: STORE X INTO ‘output’ USING PigDump ();


Stores data in UTF 8 format

Math functions:

 ABS:

Syntax: ABS(expression)
It returns the absolute value of an expression
 COS:

Syntax: COS(expression)
It Returns the trigonometric cosine of an expression.
 SIN:

Syntax: SIN (expression)


It returns the sine of an expression.
 CEIL:

Syntax: CEIL(expression)
It is used to return the value of an expression rounded up to the nearest integer
 TAN:

Syntax: TAN(expression)
It is used to return the trigonometric tangent of an angle.
 ROUND:

Syntax: ROUND(expression)
It returns the value of an expression rounded to an integer (if the result type is float) or long
(if the result type is double)
 RANDOM:

Syntax: RANDOM()
It returns a pseudo random number (type double) greater than or equal to 0.0 and less than
1.0
 Floor:

Syntax: FLOOR(expression)
Returns the value of an expression rounded down to the nearest integer.
 CBRT:

Syntax: CBRT(expression)
It returns the cube root of an expression
 EXP:

Syntax: EXP(expression)
Returns Euler’s number e raised to the power of x.
String Functions:

 INDEXOF:

Syntax: INDEXOF (string, ‘character’, startIndex)
It returns an index of the first occurrence of a character in a string
 LAST_INDEX_OF:

Syntax: LAST_INDEX_OF(string, ‘character’)


It returns the index of the last occurrence of a character in a string
 TRIM:

Syntax: TRIM(expression)
It returns a copy of the string with leading and trailing whitespaces removed
 SUBSTRING:

Syntax: SUBSTRING (string, startIndex, stopIndex)


It will return a substring from a given string
 UCFIRST:

Syntax: UCFIRST(expression)
It will return a string with the first character changed to the upper case
 LOWER:

Syntax: LOWER(expression)
Converts all characters in a string to lowercase
 UPPER:

Syntax: UPPER(expression)
Converts all characters in a string to uppercase

Tuple, Bag and Map functions:

Function   Syntax                                                                               Description
TOTUPLE    TOTUPLE(expression [, expression …])                                                 Converts one or more expressions to the type tuple
TOBAG      TOBAG(expression [, expression …])                                                   Converts one or more expressions to individual tuples, which are then placed in a bag
TOMAP      TOMAP(key-expression, value-expression [, key-expression, value-expression …])      Converts key/value expression pairs to a map
TOP        TOP(topN, column, relation)                                                          Returns the top-n tuples from a bag of tuples

A short example of these functions is sketched below.
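A minimal Pig Latin sketch; the input file, field names, and grouping column are assumptions, and note that TOP's column argument is a 0-based field index:

A = LOAD 'scores.txt' USING PigStorage(',') AS (name:chararray, subject:chararray, marks:int);
B = FOREACH A GENERATE TOTUPLE(name, subject) AS pair, TOMAP(subject, marks) AS score_map;
G = GROUP A BY subject;
T = FOREACH G GENERATE group, TOP(3, 2, A);   -- top 3 tuples per subject, ranked by field 2 (marks)
DUMP T;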

In the next section of this tutorial, we will learn about Apache Hive.

Hadoop Hive

Apache Hive is an open-source data warehouse system that has been built on top
of Hadoop. You can use Hive for analyzing and querying large datasets that are stored in
Hadoop files. Processing structured and semi-structured data can be done by using Hive.

What is Hive in Hadoop?

Don’t you think writing MapReduce jobs is tedious work? Well, with Hadoop Hive, you can
just go ahead and submit SQL queries and perform MapReduce jobs. So, if you are
comfortable with SQL, then Hive is the right tool for you as you will be able to work on
MapReduce tasks efficiently. Similar to Pig, Hive has its own language, called HiveQL (HQL). It
is similar to SQL. HQL translates SQL-like queries into MapReduce jobs, like what Pig Latin
does. The best part is that you don’t need to learn Java to work with Hadoop Hive.
Hadoop Hive runs on our system and converts SQL queries into a set of jobs for execution
on a Hadoop cluster. Basically, Hadoop Hive organizes data into tables, providing a method
for attaching structure to data stored in HDFS.
Facebook uses Hive to address its various requirements, like running thousands of tasks on
the cluster, along with thousands of users for a huge variety of applications. Since Facebook
has a huge amount of raw data, i.e., 2 PB, Hadoop Hive is used for storing this voluminous
data. It regularly loads around 15 TB of data on a daily basis. Now, many companies, such as
IBM, Amazon, Yahoo!, and many others, are also using and developing Hive.

Why do we need Hadoop Hive?

Let’s now talk about the need for Hive. To understand that, let’s see what Facebook did with
its big data.

Basically, there were a lot of challenges faced by Facebook before they had finally
implemented Apache Hive. One of those challenges was the size of data that has been
generated on a daily basis. Traditional databases, such as RDBMS and SQL, weren’t able to
handle the pressure of such a huge amount of data. Because of this, Facebook was looking
for better options. It started using MapReduce in the beginning to overcome this problem.
But, it was very difficult to work with MapReduce as it required mandatory programming
expertise in Java. Later on, Facebook realized that Hadoop Hive had the potential to actually
overcome the challenges it faced.

Apache Hive helps developers get away with writing complex MapReduce tasks. Hadoop
Hive is extremely fast, scalable, and extensible. Since Apache Hive is comparable to SQL, it is
easy for the SQL developers as well to implement Hive queries.
Additionally, Hive decreases the complexity of MapReduce by providing an
interface wherein a user can submit various SQL queries. So, technically, you don’t need to
learn Java for working with Apache Hive.

Hive Architecture

Let’s now talk about the Hadoop Hive architecture and the major working force behind
Apache Hive. The components of Apache Hive are as follows:

o Driver: The driver acts as a controller receiving HiveQL statements. It begins
the execution of statements by creating sessions. It is responsible for monitoring
the life cycle and the progress of the execution. Along with that, it also saves the
important metadata that has been generated during the execution of the HiveQL
statement.
o Metastore: A metastore stores metadata of all tables. Since Hive includes
partition metadata, it helps the driver in tracking the progress of various datasets
that have been distributed across a cluster, hence keeping track of data. In a
metastore, the data is saved in an RDBMS format.
o Compiler: The compiler performs the compilation of a HiveQL query. It
transforms the query into an execution plan that contains tasks.
o Optimizer: An optimizer performs many transformations on the execution
plan for providing an optimized DAG. An optimizer aggregates several
transformations together like converting a pipeline of joins to a single join. It can
also split the tasks for providing better performance.
o Executor: After the processes of compilation and optimization are completed,
the execution of the task is done by the executor. It is responsible for pipelining the
tasks.

Differences Between Hive and Pig

Hive                                        Pig
Used for data analysis                      Used for programming data flows
Used for processing structured data         Used for semi-structured data
Has HiveQL                                  Has Pig Latin
Used for creating reports                   Used for programming
Works on the server side                    Works on the client side
Does not support Avro                       Supports Avro

Features of Apache Hive

Let’s now look at the features of Apache Hive:


 Hive provides easy data summarization and analysis and query support.
 Hive supports external tables, making it feasible to process data without having to
store it into HDFS.
 Since Hadoop has a low-level interface, Hive fits in here properly.
 Hive supports the partitioning of data at the data level for better performance.
 There is a rule-based optimizer present in Hive responsible for optimizing logical
plans.
 Hadoop can process external data using Hive.

Limitations of Apache Hive

Though Hive is a progressive tool, it has some limitations as well.


 Apache Hive doesn’t offer any real-time queries.

 Online transaction processing is not well-supported by Apache Hive.
 There can be a delay while performing Hive queries.

That is all for this Apache Hive tutorial. In this section about Apache Hive, you learned about
Hive that is present on top of Hadoop and is used for data analysis. It uses a language called
HiveQL that translates SQL-like queries into relevant MapReduce jobs. In the upcoming
section of this Hadoop tutorial, you will be learning about Hadoop clusters.

Overview to Hive: 

Every industry deals with Big Data, that is, very large amounts of data, and Hive is a tool used
for the analysis of this Big Data. Apache Hive is a tool in which data is stored for analysis
and querying.

Apache Hive: It is a data warehouse infrastructure based on the Hadoop framework that is
perfectly suitable for data summarization, analysis, and querying. It uses an SQL-like language
called HQL (Hive Query Language)
HQL: It is a query language used to write custom MapReduce logic in Hive to
perform more sophisticated analysis of the data
Table: A table in Hive is a set of data that is logically stored and organized under a table definition
Components of Hive:

 Metastore: The metastore is where the schemas of the Hive tables are stored; it stores
the information about the tables and partitions that are in the warehouse.
 SerDe: Serializer/Deserializer, which gives instructions to Hive on how to process
records

Hive interfaces:

 Web UI
 Hive command line (CLI)
 HDInsight (Windows Server)

Hive Function Meta commands:

 Show functions: Lists Hive functions and operators


 Describe function [function name]: Displays short description of the particular
function
 Describe function extended [function name]: Displays extended description of the
particular function

Types of Hive Functions:

 UDF (User defined Functions): It is a function that fetches one or more columns
from a row as arguments and returns a single value
 UDTF (User defined Tabular Functions): This function is used to produce multiple
columns or rows of output by taking zero or more inputs

 Macros: It is a function that uses other Hive functions

User defined aggregate functions: A user defined function that takes multiple rows or
columns and returns the aggregation of the data
User-defined table generating functions: A function that takes a column from a single
record and splits it into multiple rows
Indexes: Indexes are created to speed up access to columns in the database

 Syntax: Create index <INDEX_NAME> on table <TABLE_NAME>

Thrift: A Thrift service is used to provide remote access from other processes
Metastore: This is a service that stores metadata information such as table schemas
Hcatalog: It is a metadata and table management system for Hadoop platform which
enables storage of data in any format.

Hive SELECT statement syntax using HQL:

SELECT [ALL | DISTINCT] select_expr, select_expr, ...


FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];
 Select: SELECT is the projection operator in HiveQL; it scans the table specified by
the FROM clause
 Where: The WHERE condition specifies what to filter
 Group by: It uses a list of columns, which specifies how to aggregate the records
 Cluster by, Distribute by, Sort by: Specify how to sort, distribute, and cluster the output,
and the order used for sorting
 Limit: This specifies how many records to retrieve
A concrete example of this syntax is sketched below.
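For illustration, a minimal HiveQL query following this syntax; the transactions table and its columns are assumptions:

SELECT category, SUM(amount) AS total_amount
FROM transactions
WHERE amount > 0
GROUP BY category
HAVING SUM(amount) > 1000
LIMIT 10;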

Partitioner: The partitioner controls the partitioning of the keys of the intermediate map outputs,
typically by a hash function; the number of partitions equals the number of reduce tasks for the job
Partitioning: It is used for distributing load horizontally. It is a way of dividing tables into
related parts based on values such as date, city, department, etc.
Bucketing: It is a technique to decompose datasets into more manageable parts. A table
definition using both partitioning and bucketing is sketched below.
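A minimal Hive DDL sketch; the table name, columns, partition keys, and bucket count are assumptions:

CREATE TABLE sales (id INT, amount DOUBLE, customer STRING)
PARTITIONED BY (sale_date STRING, city STRING)      -- horizontal partitioning by value
CLUSTERED BY (id) INTO 16 BUCKETS                   -- bucketing within each partition
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';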
Hive commands in HQL:

Data Definition Language (DDL): It is used to build or modify tables and objects stored in a
database
Some of the DDL commands are as follows:
 To create a database in Hive: create database <database name>
 To list the databases created in the Hive warehouse: show databases
 To use a database: USE <database name>
 To describe the associated database in the metastore: describe database <database name>
 To alter a database: alter database <database name>

Data Manipulation Language (DML): These statements are used to retrieve, store, modify,
delete, insert and update data in a database
 Inserting data in a database: The Load function is used to move the data into a
particular Hive table.

LOAD data <LOCAL> inpath <file path> into table [tablename]


 Drop table: The drop table statements deletes the data and metadata from the table:

drop table<table name>


 Aggregation: It is used to count distinct categories in a table:

Select count(DISTINCT category) from tablename;


 Grouping: The GROUP BY command is used to group the result set by a column, typically together with an aggregate:

Select <category>, sum(amount) from <txt records> group by <category>;


 To exit from the Hive shell: Use the command quit

Hive data types:

 Integral data types:
o TINYINT
o SMALLINT
o INT
o BIGINT
 String types:
o VARCHAR (length 1 to 65535)
o CHAR (maximum length 255)
 Timestamp: It supports the traditional Unix timestamp with optional nanosecond
precision
 Date: DATE values describe a particular year/month/day
 Decimal: DECIMAL values are used for numbers that require exact precision
 Union type: It is a collection of heterogeneous data types
o Syntax: UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>
 Complex types:
o Arrays: Syntax: ARRAY<data_type>
o Maps: Syntax: MAP<primitive_type, data_type>
o Structs: Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
A table definition using these complex types is sketched below.
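A minimal sketch of a Hive table definition using these complex types; the table and column names are assumptions:

CREATE TABLE employee (
  name    STRING,
  salary  DECIMAL(10,2),
  skills  ARRAY<STRING>,
  phones  MAP<STRING, STRING>,
  address STRUCT<street:STRING, city:STRING, zip:INT>
);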

Operations that can be performed on Hive:

Function                                    SQL Query
To retrieve information                     SELECT from_columns FROM table WHERE conditions;
To select all values                        SELECT * FROM table;
To select values of a particular category   SELECT * FROM table WHERE rec_name = "value";
To select on multiple criteria              SELECT * FROM table WHERE rec1 = "value1" AND rec2 = "value2";
To select specific columns                  SELECT column_name FROM table;
To retrieve unique output records           SELECT DISTINCT column_name FROM table;
For sorting                                 SELECT col1, col2 FROM table ORDER BY col2;
For sorting in descending order             SELECT col1, col2 FROM table ORDER BY col2 DESC;
For counting rows in the table              SELECT COUNT(*) FROM table;
For grouping along with counting            SELECT owner, COUNT(*) FROM table GROUP BY owner;
For selecting the maximum value             SELECT MAX(col_name) FROM table;
Selecting from multiple tables (join)       SELECT pet.name, comment FROM pet JOIN event ON (pet.name = event.name);

Metadata functions and query used for operations:

Function                            Hive query or command
Selecting a database                USE database;
Listing databases                   SHOW DATABASES;
Listing tables in a database        SHOW TABLES;
Describing the format of a table    DESCRIBE (FORMATTED|EXTENDED) table;
Creating a database                 CREATE DATABASE db_name;
Dropping a database                 DROP DATABASE db_name (CASCADE);

Command Line statements:

Function                                            Hive command
To run a query                                      hive -e 'select a.col from tab1 a'
To run a query in silent mode                       hive -S -e 'select a.col from tab1 a'
To set Hive configuration variables                 hive -e 'select a.col from tab1 a' -hiveconf hive.root.logger=DEBUG,console
To use an initialization script                     hive -i initialize.sql
To run a non-interactive script                     hive -f script.sql
To run a script inside the shell                    source file_name
To run an HDFS list command                         dfs -ls /user
To run ls (bash command) from the shell             !ls
To set configuration variables                      set mapred.reduce.tasks=32
Tab auto-completion                                 set hive.<TAB>
To display all variables starting with hive         set
To reset all variables                              reset
To add JAR files to the distributed cache           add jar jar_path
To display all the JARs in the distributed cache    list jars
To delete JARs from the distributed cache           delete jar jar_name

Hadoop Streaming

Hadoop Streaming uses UNIX standard streams as the interface between Hadoop and your
program, so you can write MapReduce programs in any language that can read from standard
input and write to standard output. Hadoop offers a lot of methods to help non-Java
development.

The primary mechanisms are Hadoop Pipes which gives a native C++ interface to Hadoop
and Hadoop Streaming which permits any program that uses standard input and output to
be used for map tasks and reduce tasks.

With this utility one can create and run Map/Reduce jobs with any executable or script as the
mapper and/or the reducer.

Hadoop Streaming Example using Python

Hadoop Streaming supports any programming language that can read from standard input
and write to standard output. To illustrate it, consider the classic word-count problem. The
mapper and the reducer are written as Python scripts to be run under Hadoop.
Mapper Code

#!/usr/bin/python
import sys

for intellipaatline in sys.stdin:                  # Input comes from standard input
    intellipaatline = intellipaatline.strip()      # Remove whitespace on either side
    words = intellipaatline.split()                # Break the line into words
    for myword in words:                           # Iterate over the word list
        print('%s\t%s' % (myword, 1))              # Write the result to standard output

Reducer Code

#!/usr/bin/python
from operator import itemgetter
import sys

current_word = ""
current_count = 0
word = ""

for intellipaatline in sys.stdin:                      # Input comes from standard input (the sorted mapper output)
    intellipaatline = intellipaatline.strip()          # Remove whitespace on either side
    word, count = intellipaatline.split('\t', 1)       # Split the input we got from mapper.py
    try:
        count = int(count)                             # Convert the count variable to an integer
    except ValueError:
        continue                                       # Count was not a number, so silently ignore this line
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print('%s\t%s' % (current_word, current_count))   # Write the result to standard output
        current_count = count
        current_word = word

if current_word == word:                               # Do not forget to output the last word if needed!
    print('%s\t%s' % (current_word, current_count))
Mapper and Reducer codes should be saved in mapper.py and reducer.py in Hadoop home
directory.
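Because both scripts use only standard input and standard output, they can be sanity-checked locally with a shell pipeline before submitting the streaming job; the sample input file below is an assumption:

$ echo "deer bear river car car river deer car bear" > input.txt
$ cat input.txt | python mapper.py | sort -k1,1 | python reducer.py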

WordCount Execution

$ $HADOOP_HOME/bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar \
-input input_dirs \
-output output_dir \
-mapper <path>/mapper.py \
-reducer <path>/reducer.py
Here, “\” is used for line continuation to keep the command readable.

How Does Hadoop Streaming Work?

 Input is read from standard input, and output is emitted to standard output by the
Mapper and the Reducer. The utility creates a Map/Reduce job, submits the job to an
appropriate cluster, and monitors the progress of the job until completion.
 When a script is specified for mappers, every mapper task launches the script as a
separate process when the mapper is initialized. Mapper task inputs are converted into
lines and fed to the process's standard input; line-oriented outputs are collected from its
standard output, and every line is changed into a key/value pair, which is collected as
the output of the mapper.
 When a script is specified for reducers, each reducer task launches the script as a
separate process when the reducer is initialized. As the reducer task runs, its input
key/value pairs are converted into lines and fed to the standard input (STDIN) of
the process.
 Each line of the line-oriented output is converted into a key/value pair after it is
collected from the standard output (STDOUT) of the process, which is then collected as
the output of the reducer.

Important Commands

Parameter                                        Description
-input directory/file-name                       Input location for the mapper. (Required)
-output directory-name                           Output location for the reducer. (Required)
-mapper executable or script or JavaClassName    Mapper executable. (Required)
-reducer executable or script or JavaClassName   Reducer executable. (Required)
-file file-name                                  Makes the mapper, reducer, or combiner executable available locally on the compute nodes.
-inputformat JavaClassName                       The class you supply should return key/value pairs of Text class. If not specified, TextInputFormat is used as the default.
-outputformat JavaClassName                      The class you supply should take key/value pairs of Text class. If not specified, TextOutputFormat is used as the default.
-partitioner JavaClassName                       Class that determines which reduce a key is sent to.
-combiner streamingCommand or JavaClassName      Combiner executable for map output.
-inputreader                                     For backwards compatibility: specifies a record reader class instead of an input format class.
-verbose                                         Verbose output.
-lazyOutput                                      Creates output lazily. For example, if the output format is based on FileOutputFormat, the output file is created only on the first call to output.collect or Context.write.
-numReduceTasks                                  Specifies the number of reducers.
-mapdebug                                        Script to call when a map task fails.
-reducedebug                                     Script to call when a reduce task fails.
-cmdenv name=value                               Passes the environment variable to streaming commands.
 
Hadoop Pipes

It is the name of the C++ interface to Hadoop MapReduce. Unlike Hadoop Streaming, which
uses standard I/O to communicate with the map and reduce code, Pipes uses sockets as the
channel over which the TaskTracker communicates with the process running the C++ map or
reduce function. JNI is not used.
That's all for this section of the Hadoop tutorial. Let's move on to the next one, on setting up a multi-node Hadoop cluster.

Setting Up A Multi Node Cluster In Hadoop

Installing Java
Syntax of java version command

$ java -version
Following output is presented.

java version "1.7.0_71"


Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

Creating User Account


System user account on both master and slave systems should be created to use the Hadoop
installation.

# useradd hadoop 
# passwd hadoop

Mapping the nodes


hosts file should be edited in /etc/ folder on all nodes and IP address of each system
followed by their host names must be specified.

# vi /etc/hosts
Enter the following lines in the /etc/hosts file.

192.168.1.109 hadoop-master
192.168.1.145 hadoop-slave-1
192.168.56.1 hadoop-slave-2

Configuring Key Based Login


Ssh should be setup in each node such that they can converse with one another without any
prompt for password.

# su hadoop 
$ ssh-keygen -t rsa
$ ssh-copy-id -i ~/.ssh/id_rsa.pub tutorialspoint@hadoop-master
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop_tp1@hadoop-slave-1
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop_tp2@hadoop-slave-2
$ chmod 0600 ~/.ssh/authorized_keys
$ exit

Installing Hadoop
Hadoop should be downloaded in the master server.

# mkdir /opt/hadoop
# cd /opt/hadoop/
# wget http://apache.mesi.com.ar/hadoop/common/hadoop-1.2.1/hadoop-1.2.0.tar.gz 
# tar -xzf hadoop-1.2.0.tar.gz 
# mv hadoop-1.2.0 hadoop 
# chown -R hadoop /opt/hadoop 
# cd /opt/hadoop/hadoop/

Configuring Hadoop
Hadoop server must be configured
core-site.xml should be edited.

<configuration> 
<property> 
<name>fs.default.name</name><value>hdfs://hadoop-master:9000/</value> 
</property> 
<property> 
<name>dfs.permissions</name> 
<value>false</value> 
</property> 
</configuration>

hdfs-site.xml file should be edited.

<configuration> 
<property> 
<name>dfs.data.dir</name> 
<value>/opt/hadoop/hadoop/dfs/name/data</value> 
<final>true</final> 
</property> 
<property> 
<name>dfs.name.dir</name> 
<value>/opt/hadoop/hadoop/dfs/name</value> 
<final>true</final> 
</property> 
<property> 
<name>dfs.replication</name> 
<value>1</value> 
</property> 
</configuration>
 
mapred-site.xml file should be edited.

<configuration> 
<property> 
<name>mapred.job.tracker</name><value>hadoop-master:9001</value> 
</property> 
</configuration> 
JAVA_HOME, HADOOP_CONF_DIR, and HADOOP_OPTS should be edited. 

export JAVA_HOME=/opt/jdk1.7.0_17 
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
export HADOOP_CONF_DIR=/opt/hadoop/hadoop/conf 
 
Installing Hadoop on Slave Servers
Hadoop should be installed on all the slave servers

# su hadoop 
$ cd /opt/hadoop 
$ scp -r hadoop hadoop-slave-1:/opt/hadoop
$ scp -r hadoop hadoop-slave-2:/opt/hadoop

 
Configuring Hadoop on Master Server
Master server should be  configured

# su hadoop 
$ cd /opt/hadoop/hadoop
 
Master Node Configuration

$ vi etc/hadoop/masters  
hadoop-master
 
Slave Node Configuration

$ vi etc/hadoop/slaves 
hadoop-slave-1
hadoop-slave-2
 
Name Node format on Hadoop Master

# su hadoop 
$ cd /opt/hadoop/hadoop 
$ bin/hadoop namenode –format
 
11/10/14 10:58:07 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = hadoop-master/192.168.1.109
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 1.2.0
STARTUP_MSG: build =
https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 -r 1479473;
compiled by 'hortonfo' on Mon May 6 06:59:37 UTC 2013
STARTUP_MSG: java = 1.7.0_71
************************************************************/
11/10/14 10:58:08 INFO util.GSet: Computing capacity for map BlocksMap
editlog=/opt/hadoop/hadoop/dfs/name/current/edits
………………………………………………….
………………………………………………….
………………………………………………….

111
11/10/14 10:58:08 INFO common.Storage: Storage directory /opt/hadoop/hadoop/dfs/name
has been successfully formatted.
11/10/14 10:58:08 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop-master/192.168.1.15
************************************************************/

Hadoop Services
Starting Hadoop services on the Hadoop-Master.

$ cd $HADOOP_HOME/sbin 
$ start-all.sh

Addition of a New DataNode in the Hadoop Cluster


Networking
Add new nodes to an existing Hadoop cluster with a suitable network configuration.
Suppose the following network configuration for the new node:

IP address : 192.168.1.103
netmask : 255.255.255.0 
hostname : slave3.in

Adding a User and SSH Access


Add a User
“hadoop” user must be added and password of Hadoop user can be set to anything one
wants.
useradd hadoop
passwd hadoop
 
To be executed on master
mkdir -p $HOME/.ssh
chmod 700 $HOME/.ssh 
ssh-keygen -t rsa -P '' -f $HOME/.ssh/id_rsa 
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
chmod 644 $HOME/.ssh/authorized_keys 
Copy the public key to new slave node in hadoop user $HOME directory 
scp $HOME/.ssh/id_rsa.pub hadoop@192.168.1.103:/home/hadoop/
 
To be executed on slaves
su hadoop ssh -X hadoop@192.168.1.103

Content of public key must be copied into file “$HOME/.ssh/authorized_keys” and then
the permission for the same must be changed.

cd $HOME 
mkdir -p $HOME/.ssh
chmod 700 $HOME/.ssh  
cat id_rsa.pub >>$HOME/.ssh/authorized_keys
chmod 644 $HOME/.ssh/authorized_keys
 Now log in to the new node from the master machine over ssh. The ability to ssh to the new node
without a password from the master must be verified.

ssh hadoop@192.168.1.103 or hadoop@slave3 


Set Hostname of New Node
Hostname is set in file /etc/sysconfig/network

On new slave3 machine 


NETWORKING=yes 
HOSTNAME=slave3.in
Machine must be restarted or hostname command should be run to a new machine with the
respective hostname to make changes effective.
On slave3 node machine:
hostname slave3.in
/etc/hosts must be updated on all machines of the cluster

192.168.1.102 slave3.in slave3
 ping the machine with hostnames to check whether it is resolving to IP.

ping master.in 
 
Start the DataNode on New Node

Datanode daemon should be started manually using the $HADOOP_HOME/bin/hadoop-


daemon.sh script. The new node will automatically contact the master (NameNode) and join
the cluster. The new node should also be added to the conf/slaves file in the master server so
that it is recognized by the script-based start/stop commands.
Login to new node

su hadoop or ssh -X hadoop@192.168.1.103

HDFS is started on a newly added slave node

./bin/hadoop-daemon.sh start datanode

jps command output must be checked on a new node.

$ jps 
7141 DataNode 
10312 Jps

Removing a DataNode

Node can be removed from a cluster as it is running, without any data loss. A
decommissioning feature is made available by HDFS which ensures that removing a node is
performed securely.

Step 1
Login to master machine user where Hadoop is installed.
$ su hadoop
 
Step 2
Before starting the cluster, an exclude file must be configured. A key named dfs.hosts.exclude
should be added to our $HADOOP_HOME/etc/hadoop/hdfs-site.xml file.
The value associated with this key is the full path to a file on the NameNode's local file system
that contains a list of machines which are not permitted to connect to HDFS.

<property>
<name>dfs.hosts.exclude</name><value>/home/hadoop/hadoop-
1.2.1/hdfs_exclude.txt</value><description>>DFS exclude</description>
</property> 
 
Step 3
Hosts to decommission are determined.
Additions should be made to file recognized by the hdfs_exclude.txt for every machine to be
decommissioned which will prevent them from connecting to the NameNode.

slave2.in
 
Step 4
Force configuration reload.
“$HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes” should be run

$ $HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes


 
NameNode will be forced to re-read its configuration, this is inclusive of the newly updated
‘excludes’ file. Nodes will be decommissioned over a period of time, allowing time for each
node’s blocks to be replicated onto machines which are scheduled to remain active.
jps command output should be checked on slave2.in. DataNode process will shutdown
automatically.
 

Step 5
Shutdown nodes.
The decommissioned hardware can be carefully shut down for maintenance after the
decommission process has been finished.

$ $HADOOP_HOME/bin/hadoop dfsadmin -report


 
Step 6
Edit the excludes file again. Once the machines have been decommissioned, they can be
removed from the ‘excludes’ file. Running “$HADOOP_HOME/bin/hadoop dfsadmin
-refreshNodes” again will read the excludes file back into the NameNode; the DataNodes will rejoin
the cluster after the maintenance has been completed, or if additional capacity is needed in
the cluster again.
To run/shutdown tasktracker

$ $HADOOP_HOME/bin/hadoop-daemon.sh stop tasktracker


$ $HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker

Sqoop

Sqoop is a tool for automated bulk data transfer. It allows simple import and
export of data between structured data stores (NoSQL systems, relational databases,
and enterprise data warehouses) and the Hadoop ecosystem.
Key features of Sqoop

It has following features:


 JDBC-based implementation is used
 Auto-generation of tedious user-side code
 Integration with Hive
 Extensible backend

Why Sqoop

 Forcing MapReduce to access data from an RDBMS directly is repetitive, error-prone, and
costlier than expected.
 Data needs to be prepared for effective MapReduce consumption.

Important Sqoop control commands to import RDBMS data are as follows:
 Append: Appends data to an existing dataset in HDFS by using the --append option
 Columns: Specifies the columns to import from the table: --columns <col,col,...>
 Where: Specifies the WHERE clause to use during the import from the table: --where <where
clause>

The common large-object types handled by Sqoop are BLOB and CLOB. If the object is less than
16 MB, it is stored inline with the rest of the data. Bigger objects are temporarily stored in a
subdirectory named _lobs, and the data is then materialized in memory for further processing.
If we set the lob limit to zero (0), the large objects are stored in external storage.
Sqoop allows you to export and import data from a database table based on a WHERE clause.
The relevant options are as follows:

--columns <col1,col2,...>
--where <condition>
--query <SQL query>

Example:
sqoop import --connect jdbc:mysql://db.one.com/corp --table INTELLIPAAT_EMP --where "start_date > '2016-07-20'"
sqoop eval --connect jdbc:mysql://db.test.com/corp --query "SELECT * FROM intellipaat_emp LIMIT 20"
sqoop import --connect jdbc:mysql://localhost/database --username root --password aaaaa --columns "name,emp_id,jobtitle"

Sqoop supports data imports into the following services:
 HDFS
 Hive
 HBase
 HCatalog
 Accumulo

Sqoop needs a connector to connect to different relational databases. Almost all
database vendors provide a JDBC connector specific to their database, and Sqoop needs the
JDBC driver of that database for its interaction. In other words, Sqoop requires both the JDBC
driver and a connector to connect to a database.

Sqoop command to control the number of mappers


We can control the number of mappers by passing the --num-mappers parameter in the Sqoop
command. The --num-mappers argument controls the number of map tasks, which determines
the degree of parallelism used. Start with a small number of map tasks and then increase the
number of mappers gradually, since too high a value can degrade performance on the
database side.

Syntax: -m, --num-mappers <n>
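For example, reusing the connection string from the earlier examples (the target directory is an assumption):

sqoop import --connect jdbc:mysql://db.one.com/corp --table INTELLIPAAT_EMP --num-mappers 4 --target-dir /user/hadoop/intellipaat_emp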


The following Sqoop command lists all the databases in a MySQL server:
$ sqoop list-databases --connect jdbc:mysql://database.test.com/
Sqoop metastore
It is a tool used for hosting a shared metadata repository. Multiple and remote
users can define and execute saved jobs that are defined in the metastore. End users are
configured to connect to the metastore via sqoop-site.xml or with the

--meta-connect argument.
The purpose and usage of sqoop-merge:
This tool combines two datasets: entries in the newer dataset overwrite entries of the older
dataset, preserving only the newest version of the records between both datasets.

Impala

It is an open-source, massively parallel processing (MPP) SQL query engine for data
stored in a computer cluster running Apache Hadoop.

Goals of Impala

 General-purpose SQL query engine:


• Must work for both transactional and analytical workloads
• Supports queries that take anywhere from milliseconds to hours
 Runs directly within Hadoop:
• Reads the Hadoop file formats that are broadly used
• Talks to the Hadoop storage managers that are extensively used
• Runs on the same nodes that run Hadoop processes
 High performance:
• Runtime code generation
• Usage of C++ in place of Java
• Completely new execution engine that is not built on MapReduce

Oozie in Hadoop

Apache Oozie is a scheduler system used to run and manage Hadoop jobs in a distributed
environment. Oozie supports combining multiple complex jobs that run in a particular order
for accomplishing a more significant task. With Oozie, within a particular set of tasks, two or
more jobs can be programmed to run in parallel.
The reason why Oozie is being used so much is that it is nicely integrated with the Hadoop
stack that supports several Hadoop jobs, such as, Pig, Hive, and Sqoop, along with other
system-specific tasks, such as Shell and Java.
Oozie is used for triggering the workflow actions that use the Hadoop execution engine for
executing various tasks. Oozie leverages the present-day Hadoop machinery for failover,
load balancing, etc.
Oozie is responsible for detecting the completion of tasks by polling and callback. When
starting a task, Oozie provides a unique callback HTTP URL to the task, and it notifies the URL
when the task is complete. If the task doesn’t invoke the callback URL, Oozie polls the task
for completion.
Let’s now look at the types of jobs in Oozie.
Types of Oozie Jobs

Oozie Workflow Jobs

A workflow is just a sequence of jobs arranged to be represented as a DAG (Directed
Acyclic Graph). The jobs depend on each other. This is because an action is executed only
after the output from the previous action is retrieved. Decision trees can be used for figuring
out how and on what conditions some jobs should run.
There can be various types of actions that are directly based on a particular job and each
type of action can have its own tags as well. Job scripts must be placed in HDFS before the
execution of the workflow.
Oozie Coordinator Jobs

These jobs are made up of workflow jobs that are triggered by data availability and time.
Workflows present in the job coordinator start when some given condition is satisfied.
Processes in coordinator jobs:
 Start: The start datetime for the job
 End: The end datetime for the job
 TimeZone: The timezone of the coordinator application
 Frequency: The frequency, in minutes, of the execution of the jobs

Oozie Bundle

Coordinator and workflow jobs are present as packages in an Oozie bundle. An Oozie bundle lets
you execute a particular set of coordinator applications, called a data pipeline. There is no
explicit dependency here, but data dependency can be used to create an implicit data
application pipeline.
You can start/stop/suspend/resume/rerun an Oozie bundle. It gives better and easier
operational control.
Advancing in this Apache Oozie tutorial, we will understand how to create a Workflow Job.

How does Oozie work?

Now that you know ‘What is Oozie?’, let’s see how exactly Oozie works. Basically, Oozie is a
service that runs in the cluster. Workflow definitions are submitted by the clients for
immediate processing. There are two nodes, namely, control-flow nodes and action nodes.

The action node is the one representing workflow tasks such as running a MapReduce task,
importing data, running a Shell script, etc.
Next, the control-flow node is responsible for controlling the workflow execution in
between actions. This is done by allowing constructs like conditional logic. The control-flow
node includes a start node (used for starting a workflow job), an end node (designating the
end of a job), and an error node (pointing to an error if any).
At the end of the workflow, HTTP callback is used by Oozie for updating the client with the
workflow status.
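To make the start, action, and end nodes concrete, here is a minimal workflow.xml sketch; the workflow name, the choice of a shell action, and the hello.sh script are assumptions rather than part of the original tutorial:

<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="shell-node"/>
    <action name="shell-node">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>hello.sh</exec>
            <file>hello.sh</file>
        </shell>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Shell action failed</message>
    </kill>
    <end name="end"/>
</workflow-app>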

Why Oozie?

The actual motive of using Oozie is for managing several types of jobs that are being
processed in the Hadoop system.
In the form of DAG, several dependencies in-between jobs are specified by the user. This
information is consumed by Oozie and is taken care of in a particular order as present in the
workflow. By doing this, the user’s time for managing the entire workflow is saved. Along
with that Oozie specifies the frequency of the execution of a job.

Features of Oozie

 Client API, as well as a command-line interface, is present in Oozie that can be used
for launching, controlling, and monitoring a job from the Java application.
 Using Web Service APIs, jobs can be controlled from anywhere.
 The execution of jobs, which are scheduled for running periodically, is possible with
Oozie.
 Email notifications can be sent after the completion of jobs

That is all for the Oozie tutorial. So far, we learned ‘What is Oozie in Hadoop?’, how it works,
why we need Oozie, and the features of Oozie. In the next section of this tutorial, we will
learn about Apache Flume.

Apache Flume in Hadoop

Apache Flume is basically a tool or a data ingestion mechanism responsible for collecting
and transporting huge amounts of data such as events, log files, etc. from several sources to
one central data store. Apache Flume is a unique tool designed to copy log data or
streaming data from various different web servers to HDFS.
Apache Flume supports several sources as follows:
 ‘Tail’: The data is piped from the local files and is written into the HDFS via Flume. It
is somewhat similar to a Unix command, ‘tail’.
 System logs
 Apache logs: This enables Java applications for writing events to files in HDFS via
Flume

Features of Flume

Before going further, let’s look at the features of Flume:


 Log data from different web servers is ingested by Flume into HDFS and HBase very
efficiently. Along with that, huge volumes of event data from social networking sites can
also be retrieved.
 Data can be retrieved from multiple servers immediately into Hadoop by using
Flume.
 Huge source of destination types is supported by Flume.
 Based on streaming data flows, Flume has a flexible design. This design stands out to
be robust and fault-tolerant with different recovery mechanisms.
 Data is carried between sources and sinks by Apache Flume which can either be
event-driven or can be scheduled.

Flume Architecture

In the Flume architecture, there are data generators that generate data. This generated data
gets collected by Flume agents. The data collector is another agent that collects data from
various other agents and aggregates it. Then, it is pushed to a centralized store, i.e., HDFS.
Let’s now talk about each of the components present in the Flume architecture:
 Flume Events
The basic unit of the data transported inside Flume is called an event. Generally, it contains
a payload of a byte array that can be transported from the source to the destination,
accompanied by optional headers.
 Flume Agents
In Apache Flume, an independent daemon process (JVM) is what we call an agent. At first,
it receives events from clients or other agents. Afterward, it forwards them to its next
destination, which is a sink or another agent. Note that Flume can have more than one agent.
o Flume Source: A Flume source receives data from the data generators and
transfers it to one or more channels as Flume events. There are various types of
sources that Apache Flume supports, and each source receives events from a
specified data generator.
o Flume Channel: A transient store that receives the events from the source
and buffers them until they are consumed by sinks is what we call a Flume channel.
To be very specific, it acts as a bridge between the sources and the sinks in Flume.
These channels can work with any number of sources and sinks, and they are fully
transactional.
o Flume Sink: To store data into centralized stores like HBase and HDFS, we
use the Flume sink component. It consumes events from the channels and then
delivers them to the destination. The sink’s destination may be another agent or the
central stores.
 Flume Clients
Those who generate events and then send them to one or more agents are what we call
Flume clients. A minimal agent configuration is sketched below.
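To tie the source, channel, and sink together, here is a minimal agent configuration sketch in Flume's properties-file format; the agent name a1, the log file path, and the HDFS URL are assumptions:

# flume-conf.properties
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Source: tail a local log file
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1

# Channel: in-memory buffer between the source and the sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: write events to HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop-master:9000/flume/events
a1.sinks.k1.channel = c1

The agent can then be started with: bin/flume-ng agent --conf conf --conf-file flume-conf.properties --name a1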

Now, that we have seen in-depth the architecture of Flume, let’s look at the advantages of
Flume as well.

Advantages of Flume

Here are the advantages of using Flume −


 Using Apache Flume we can store the data into any of the centralized stores (HBase,
HDFS).
 When the rate of incoming data exceeds the rate at which data can be written to the
destination, Flume acts as a mediator between data producers and the centralized stores
and provides a steady flow of data between them.
 Flume provides the feature of contextual routing.
 The transactions in Flume are channel-based where two transactions (one sender and
one receiver) are maintained for each message. It guarantees reliable message delivery.

 Flume is reliable, fault-tolerant, scalable, manageable, and customizable.

Now, we come to the end of this tutorial on Flume. We learned about Apache Flume in
depth along with that we saw the architecture of Flume.

Zookeeper

It allows distributed processes to coordinate with each other through a shared
hierarchical namespace of data registers.
 The ZooKeeper service is replicated over a set of machines.
 All machines store a copy of the data in memory.
 A leader is chosen at service startup.
 Clients connect to a single ZooKeeper server and maintain a persistent TCP connection.
 Clients can read from any ZooKeeper server; writes go through the leader and
require majority consensus.

Hue

It is an open-source, web-based interface for analyzing data with Hadoop
and Spark. It is a suite of applications for executing queries, copying files, and building
workflows.
Features of Hue

Its features are as follows:


 Spark notebooks
 Wizards to import data into Hadoop
 Dynamic search dashboards for Solr
 Browsers for YARN, HDFS, the Hive table metastore, HBase, and ZooKeeper
 SQL editors for Impala, Hive, MySQL, SQLite, PostgreSQL, and Oracle
 Pig editor, Sqoop2 and Oozie workflow editors, and dashboards

Kafka overview

Apache Kafka provides fault-tolerant, scalable messaging built around the following concepts:

 Topics
 Producers
 Consumers
 Brokers

Topics

Kafka maintains feeds of messages in categories called topics. Each topic has a user-defined
category (or feed name), to which messages are published.

For each topic, the Kafka cluster maintains a structured commit log with one or more
partitions:

Kafka appends new messages to a partition in an ordered, immutable sequence. Each
message in a topic is assigned a sequential number that uniquely identifies the message
within a partition. This number is called an offset (for example, 0 through 12 in partition 0).

Partition support for topics provides parallelism. In addition, because writes to a partition are
sequential, the number of hard disk seeks is minimized. This reduces latency and increases
performance.

Producers

Producers are processes that publish messages to one or more Kafka topics. The producer is
responsible for choosing which message to assign to which partition within a topic.
Assignment can be done in a round-robin fashion to balance load, or it can be based on a
semantic partition function.

Consumers

Consumers are processes that subscribe to one or more topics and process the feeds of
published messages from those topics. Kafka consumers keep track of which messages have
already been consumed by storing the current offset. Because Kafka retains all messages on
disk for a configurable amount of time, consumers can use the offset to rewind or skip to any
point in a partition.

Brokers

A Kafka cluster consists of one or more servers, each of which is called a broker. Producers
send messages to the Kafka cluster, which in turn serves them to consumers. Each broker
manages the persistence and replication of message data.

Kafka Brokers scale and perform well in part because Brokers are not responsible for keeping
track of which messages have been consumed. Instead, the message consumer is responsible
for this. This design feature eliminates the potential for back-pressure when consumers
process messages at different rates.
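As a quick illustration of topics, producers, and consumers, the standard Kafka shell tools can be used; the topic name and hostnames below are assumptions, and the exact flags vary slightly between Kafka versions:

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 3 --topic test
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning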

What is new in Apache Kafka 2.0

Apache Kafka 2.0 introduces some important enhancements and new features.

 The replication protocol has been improved to avoid log divergence between leader
and follower during fast leader failover. We have also improved resilience of brokers by
reducing the memory footprint of message down-conversions. By using message
chunking, both memory usage and memory reference time have been reduced to avoid
OutOfMemory errors in brokers.
 KIP-255 adds a framework for authenticating to Kafka brokers using OAuth2 bearer
tokens. The SASL/OAUTHBEARER implementation is customizable using callbacks for
token retrieval and validation.
 Host name verification is now enabled by default for SSL connections to ensure that
the default SSL configuration is not susceptible to man-in-the-middle attacks. You can
disable this verification if required.
 You can now dynamically update SSL truststores without broker restart. You can also
configure security for broker listeners in ZooKeeper before starting brokers, including SSL
keystore and truststore passwords and JAAS configuration for SASL. With this new
feature, you can store sensitive password configs in encrypted form in ZooKeeper rather
than in cleartext in the broker properties file.
 Kafka clients are now notified of throttling before any throttling is applied when
quotas are enabled. This enables clients to distinguish between network errors and large
throttle times when quotas are exceeded.
 We have added a configuration option for Kafka consumer to avoid indefinite
blocking in the consumer.
 We have dropped support for Java 7 and removed the previously deprecated Scala
producer and consumer.

Building a High-Throughput Messaging System with Apache Kafka

Apache Kafka is a fast, scalable, durable, fault-tolerant publish-subscribe messaging system.


Common use cases include:

 Stream processing
 Messaging
 Website activity tracking
 Metrics collection and monitoring
 Log aggregation
 Event sourcing
 Distributed commit logging

Kafka works with Apache Storm and Apache Spark for real-time analysis and rendering of
streaming data. The combination of messaging and processing technologies enables stream
processing at linear scale.

For example, Apache Storm ships with support for Kafka as a data source using Storm’s core
API or the higher-level, micro-batching Trident API. Storm’s Kafka integration also includes
support for writing data to Kafka, which enables complex data flows between components in
a Hadoop-based architecture.

Apache Atlas Overview

Apache Atlas provides governance capabilities for Hadoop.

Apache Atlas uses both prescriptive and forensic models enriched by business taxonomical
metadata. Atlas is designed to exchange metadata with other tools and processes within and
outside of the Hadoop stack, thereby enabling platform-agnostic governance controls that
effectively address compliance requirements.

Apache Atlas enables enterprises to effectively and efficiently address their compliance
requirements through a scalable set of core governance services. These services include:

 Search and Proscriptive Lineage – facilitates pre-defined and ad hoc exploration of


data and metadata, while maintaining a history of data sources and how specific data was
generated.
 Metadata-driven data access control.
 Flexible modeling of both business and operational data.
 Data Classification – helps you to understand the nature of the data within Hadoop
and classify it based on external and internal sources.
 Metadata interchange with other metadata tools.

Apache Atlas features

Apache Atlas is a low-level service in the Hadoop stack that provides core metadata services.

Atlas currently provides metadata services for the following components:

 Hive
 Ranger
 Sqoop
 Storm/Kafka (limited support)
 Falcon (limited support)

Apache Atlas provides the following features:

 Knowledge store that leverages existing Hadoop metastores: Categorized into a


business-oriented taxonomy of data sets, objects, tables, and columns. Supports the
exchange of metadata between HDP foundation components and third-party applications
or governance tools.
 Data lifecycle management: Leverages existing investment in Apache Falcon with a
focus on provenance, multi-cluster replication, data set retention and eviction, late data
handling, and automation.
 Audit store: Historical repository for all governance events, including security events
(access, grant, deny), operational events related to data provenance and metrics. The
Atlas audit store is indexed and searchable for access to governance events.
 Security: Integration with HDP security that enables you to establish global security
policies based on data classifications and that leverages Apache Ranger plug-in
architecture for security policy enforcement.
 Policy engine: Fully extensible policy engine that supports metadata-based, geo-
based, and time-based rules that rationalize at runtime.
 RESTful interface: Supports extensibility by way of REST APIs to third-party
applications so you can use your existing tools to view and manipulate metadata in the
HDP foundation components.

Atlas Architecture Overview

Atlas-Ranger integration

You can use Apache Ranger with Apache Atlas to implement dynamic classification-based
security policies.

Atlas provides data governance capabilities and serves as a common metadata store that is
designed to exchange metadata both within and outside of the Hadoop stack. Ranger
provides a centralized user interface that can be used to define, administer and manage
security policies consistently across all the components of the Hadoop stack. The Atlas-
Ranger integration unites the data classification and metadata store capabilities of Atlas with
security enforcement in Ranger.

You can use Atlas and Ranger to implement dynamic classification-based security policies, in
addition to role-based security policies. Ranger’s centralized platform empowers data
administrators to define security policy based on Atlas metadata tags or attributes and apply
this policy in real-time to the entire hierarchy of entities including databases, tables, and
columns, thereby preventing security violations.

Ranger-Atlas Access Policies

 Classification-based access controls: A data entity such as a table or column can be


marked with the metadata tag related to compliance or business taxonomy (such as
“PCI”). This tag is then used to assign permissions to a user or group. This represents an
evolution from role-based entitlements, which require discrete and static one-to-one
mapping between user/group and resources such as tables or files. As an example, a data
steward can create a classification tag “PII” (Personally Identifiable Information) and
assign certain Hive table or columns to the tag “PII”. By doing this, the data steward is
denoting that any data stored in the column or the table has to be treated as “PII”. The
data steward now has the ability to build a security policy in Ranger for this classification
and allow certain groups or users to access the data associated with this classification,
while denying access to other groups or users. Users accessing any data classified as “PII”
by Atlas would be automatically enforced by the Ranger policy already defined.
 Data Expiry-based access policy: For certain business use cases, data can be toxic
and have an expiration date for business usage. This use case can be achieved with Atlas
and Ranger. Apache Atlas can assign expiration dates to a data tag. Ranger inherits the
expiration date and automatically denies access to the tagged data after the expiration
date.
 Location-specific access policies: Similar to time-based access policies,
administrators can now customize entitlements based on geography. For example, a US-
based user might be granted access to data while she is in a domestic office, but not
while she is in Europe. Although the same user may be trying to access the same data, the
different geographical context would apply, triggering a different set of privacy rules to
be evaluated.
 Prohibition against dataset combinations: With Atlas-Ranger integration, it is now
possible to define a security policy that restricts combining two data sets. For example,
consider a scenario in which one column consists of customer account numbers, and
another column contains customer names. These columns may be in compliance
individually, but pose a violation if combined as part of a query. Administrators can now
apply a metadata tag to both data sets to prevent them from being combined.
Cross-component Lineage

Apache Atlas now provides the ability to visualize cross-component lineage, delivering a
complete view of data movement across a number of analytic engines such as Apache
Storm, Kafka, Falcon, and Hive.

This functionality offers important benefits to data stewards and auditors. For example, data
that starts as event data coming through a Kafka bolt or a Storm topology, is then analyzed as
an aggregated dataset through Hive, and is finally combined with reference data from an RDBMS
via Sqoop can be governed by Atlas at every stage of its lifecycle. Data stewards, operations,
and compliance teams now have the ability to visualize a data set's lineage and then drill down
into operational, security, and provenance-related details. As this tracking is done at the
platform level, any application that uses these engines will be natively tracked. This allows for
extended visibility beyond a single application view.

Spark vs MapReduce

Both technologies are equipped with amazing features. However, with the increased need of
real-time analytics, these two are giving tough competition to each other. Read a
comparative analysis of Hadoop MapReduce and Apache Spark.

Big data is everywhere. Wait till 2020 and you will have over 50 billion Internet-connected
devices, thanks to Internet of Things (IoT). All this relates to one thing—data is on a scale
that is unprecedented in the history of humankind. For instance, 90 percent of the data that
is in existence today was created in the last two years alone.

All this means that there needs to be a radical new way to handle all that data, process it in
hitherto unheard volumes, and derive meaningful insights from it to help businesses leap
forward in this cut-throat corporate scenario. This is where the argument comes into the
picture: whether Apache MapReduce has run its course and is being taken over by a nimbler
rival technology, Apache Spark.

Some of the interesting facts about these two technologies are as follows:

 Spark's Machine Learning abilities are provided by MLlib.


 Apache Spark can run on a variety of operating systems.
 In MapReduce, the execution of a Map task is followed by a Reduce task to produce the final output.
 Output from the Map task is written to a local disk, while the output from the Reduce
task is written to HDFS.

Spark Vs. MapReduce

Check out the detailed comparison between these two technologies.

Key Feature          Apache Spark                                                      Hadoop MapReduce
Speed                10–100 times faster than MapReduce                                Slower
Analytics            Supports streaming, Machine Learning, complex analytics, etc.     Comprises simple Map and Reduce tasks
Suitable for         Real-time streaming                                               Batch processing
Coding               Fewer lines of code                                               More lines of code
Processing location  In-memory                                                         Local disk
 
What Are MapReduce and Spark?

The above table clearly points out that Apache Spark is way ahead of Hadoop MapReduce or, in
other words, more suitable for real-time analytics. However, it would be interesting
to know what makes Spark better than MapReduce. But before that, you should know what
exactly these technologies are. Read on:

MapReduce is a programming model for processing huge amounts of data in a parallel and
distributed setting. The two tasks undertaken in MapReduce programming are the Mapper
and the Reducer. The Mapper takes up the job of filtering and sorting the available data, and the
Reducer is entrusted with the task of combining that data and condensing it into a smaller,
summarized result. MapReduce, HDFS, and YARN are the three important components of Hadoop systems.

Spark is a rapidly growing open-source technology that works well on a cluster of
computer nodes. Speed is one of the hallmarks of Apache Spark. Developers working in this
environment get an application programming interface based on the concept of the
RDD (Resilient Distributed Dataset). An RDD is the core abstraction provided by Spark: a
collection of data partitioned across the nodes of the cluster so that each partition can be
processed independently and in parallel, as the short sketch below illustrates.
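
To make the RDD idea concrete, here is a minimal PySpark sketch (assuming a local Spark installation; the numbers and partition count are chosen only for illustration) that creates an RDD split into partitions and processes them in parallel:

from pyspark import SparkContext

# Start a local Spark context; on a cluster this would point at YARN or a Spark master.
sc = SparkContext("local[*]", "rdd-demo")

# Create an RDD from an in-memory collection, split into 4 partitions across the workers.
numbers = sc.parallelize(range(1, 1001), numSlices=4)

# Transformations are lazy; each partition is processed independently and in parallel.
squares = numbers.map(lambda x: x * x)
total = squares.reduce(lambda a, b: a + b)   # the action triggers the computation

print("Sum of squares:", total)
sc.stop()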

What Makes MapReduce Lag Behind in the Race

By now, you should have a clear picture of the Spark and MapReduce workflows. It is
clear that MapReduce is not well suited to evolving real-time Big Data needs.
Following are the reasons behind this fact:

 Response times today have to be extremely fast.
 There are scenarios where data from a graph has to be extracted and processed.
 Sometimes, mapping generates a large number of keys, which take time to sort.
 There are times when diverse sets of data need to be combined.
 When Machine Learning is involved, this technology struggles, because iterative algorithms must repeatedly re-read the same data from disk.
 For repeated processing of the same data, the iterations take too long.
 For tasks that have to be cascaded, there are a lot of inefficiencies involved.

How Does Spark Have an Edge over MapReduce

Some of the benefits of Apache Spark over Hadoop MapReduce are given below:

 Processing at high speeds: Spark execution can be up to 100 times
faster due to its inherent ability to exploit memory rather than disk
storage. MapReduce has a big drawback here, since it has to read the entire data set from
and write its results back to the Hadoop Distributed File System at the boundary of each job, which
increases the time and cost of processing data.

 Powerful caching: When dealing with Big Data, there is a lot of caching involved, which
adds to the workload when using MapReduce, whereas Spark keeps cached data in memory (see the sketch after this list).

 Faster iteration cycles: There is often a need to work on the same data again and
again, especially in Machine Learning scenarios, and Spark's in-memory caching makes it
perfectly suitable for such applications.

 Multiple operations using in-built libraries: MapReduce offers built-in support
only for batch processing tasks, whereas Spark provides built-in libraries for
interactive SQL queries, Machine Learning, streaming,
and batch processing, among other things.
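
As a rough illustration of the caching and iteration points above, the following PySpark sketch (the file name and the per-pass logic are hypothetical) caches a dataset in memory so that repeated, iteration-style passes over it do not re-read it from disk each time:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()
sc = spark.sparkContext

# Hypothetical input file; replace with a real path on HDFS or the local filesystem.
lines = sc.textFile("data/events.txt")
events = lines.map(lambda line: line.split(","))

# Keep the parsed RDD in memory so the loop below does not re-read and re-parse the file.
events.cache()

for i in range(5):   # iterative, Machine Learning-style passes over the same data
    count = events.filter(lambda fields: len(fields) > i).count()
    print("pass", i, "count", count)

spark.stop()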

Some Other Obvious Benefits of Spark over MapReduce

Spark is not tied to Hadoop, unlike MapReduce, which cannot work outside of Hadoop. As a result,
some subject matter experts claim that Spark might one day even phase out Hadoop, but there is
still a long way to go. Spark lets you write an application in a language of your choice, such as
Java, Scala, Python, or R. It supports streaming data and SQL queries along with extensive data
analytics to make sense of the data, and it also supports machine learning, an area often
associated with cognitive computing technologies such as IBM Watson.

Bottom Line

Spark is able to access diverse data sources and make sense of them all. This is especially
important in a world where IoT is gaining a steady groundswell and machine-to-machine
communications account for a bulk of the data. It also means that MapReduce is not up to the
challenge of taking on the Big Data exigencies of the future.

In the race to achieve the fastest way of doing things using the least amount of resources,
there will always be a clash of the Titans. The future belongs to those technologies that are
nimble, adaptable, resourceful, and, above all, able to cater to the diverse needs of
enterprises without a hitch. Apache Spark seems to be ticking all of those checkboxes, and the
future may well belong to it.

Apache Spark Use Cases

Known as one of the fastest Big Data processing engines, Apache Spark is widely used across
organizations in a myriad of ways.

Apache Spark has gained immense popularity over the years and is being implemented by
many competing companies across the world. Many organizations such as eBay, Yahoo, and
Amazon are running this technology on their big data clusters.

Spark, currently among the most active Apache projects in the world, with a flourishing
open-source community and a reputation for 'lightning-fast cluster computing,' has surpassed
Hadoop MapReduce by running up to 100 times faster in memory and about 10 times faster on
disk.

Spark has emerged as one of the strongest Big Data technologies in a very short span of
time, serving as an open-source alternative to MapReduce for building and running fast and
secure apps on Hadoop. Spark comes with a Machine Learning library, graph algorithms, and
real-time streaming and SQL support, through Spark Streaming and Spark SQL (formerly Shark), respectively.

For instance, a simple word-count program requires many lines of code in
MapReduce but only a few in Spark. Here's the Spark (Scala) example:

sparkContext.textFile("hdfs://...")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .saveAsTextFile("hdfs://...")

Further Use Cases of Apache Spark

For every newly arrived technology, its value should be demonstrated by clear use cases in
the marketplace. There must be a proper approach to, and analysis of, how the new product
will hit the market, at what time, and what it offers that its alternatives do not.

Now, when you think about Spark, you should know why it is deployed, where it would stand
in the crowded marketplace, and whether it would be able to differentiate itself from its
competitors.

With these questions in mind, read on about the chief deployment modules that illustrate the
use cases of Apache Spark.

Data Streaming

Apache Spark is easy to use and offers a language-integrated API for stream processing.
It is also fault-tolerant, i.e., it provides consistent processing semantics without extra work and recovers lost data easily.

Spark Streaming is used to process live streaming data and has the potential to
handle additional workloads. The most common uses in businesses are the following (a minimal streaming sketch appears after this list):

 Streaming ETL
 Data enrichment
 Trigger event detection
 Complex session analysis
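
As a minimal sketch of streaming with Spark, the following Structured Streaming word count (the socket source on localhost:9999 is an assumption made only for this example, e.g. fed by `nc -lk 9999`) keeps a running count over a live text stream:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

# Read a live text stream from a socket source.
lines = (spark.readStream
              .format("socket")
              .option("host", "localhost")
              .option("port", 9999)
              .load())

# Split each line into words and keep a running count per word.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously print the updated counts to the console.
query = (counts.writeStream
               .outputMode("complete")
               .format("console")
               .start())
query.awaitTermination()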

Interactive Analysis

 Spark provides easy-to-learn APIs and is a strong tool for interactive
data analysis. Interactive shells are available in Python and Scala (see the sketch after this list).
 MapReduce was built for batch processing, and SQL-on-Hadoop engines layered on it
are usually considered slow. With Spark, by contrast, it is fast to run
exploratory queries against live data without sampling.
 Structured Streaming is a newer feature that helps in web analytics, for example by allowing
analysts to run user-friendly queries against live web-visitor data.
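
As a small illustration of interactive analysis (the file path and column names are invented for this sketch), data can be loaded into a DataFrame and queried ad hoc with Spark SQL from the PySpark shell or a notebook:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("interactive-demo").getOrCreate()

# Hypothetical click-stream file; in the pyspark shell, a `spark` session already exists.
clicks = spark.read.json("data/clicks.json")
clicks.createOrReplaceTempView("clicks")

# Ad hoc SQL over the live data, no sampling required.
top_pages = spark.sql("""
    SELECT page, COUNT(*) AS hits
    FROM clicks
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""")
top_pages.show()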

Fog Computing

 Fog-computing workloads built on Spark can run up to 100 times faster in memory and 10 times
faster on disk than Hadoop MapReduce, and apps can be written quickly in Java, Scala, Python, and R.
 Spark includes SQL, streaming, and complex analytics and can run almost anywhere
(standalone, in the cloud, etc.).
 With the rise of Big Data analytics comes the related concept of IoT (Internet of Things).
IoT embeds objects and devices with small sensors that interact with each other, and
users are making use of it in revolutionary ways.

 Fog computing itself is a decentralized computing infrastructure in which data, compute, storage, and
applications are located somewhere between the data source and the cloud. It brings
the advantages of the cloud closer to where data is created and acted upon, much as
edge computing does.

To summarize, Apache Spark helps process large amounts of real-time or
archived data, both structured and unstructured, seamlessly. It links easily with complex
capabilities such as graph algorithms and Machine Learning. In short, Spark brings large-scale
Big Data processing within broad reach.

Conclusion

Apache Spark is used in real time by many notable businesses such as Uber and
Pinterest. These companies gather terabytes of event data from users and engage them
in real-time interactions such as video streaming and many other user-facing features, thus
maintaining a consistently smooth and high-quality customer experience.

Hadoop NoSQL

Relational database (RDBMS) technology is used on a large scale in commercial systems,
banking, flight reservations, and other applications that use structured data. SQL (Structured Query
Language) is the query language oriented to these applications.
Relational database applications stand out for the consistency of their data schemas. We can scale
them, but not infinitely.

The need to analyze data in large volumes, from different sources and formats, has given rise
to NoSQL (Not Only SQL) technology. NoSQL databases are not relational and are not based on rigid
schemas (rules governing data or objects). All NoSQL implementations aim for the scalable
handling of large volumes of unstructured data.

NoSQL databases can grow and focus more on performance, allowing replication of data
across multiple network nodes and reading, writing, and processing data at incredible speed
using distributed parallel processing paradigms. We can use NoSQL in real-time data analysis,
such as personalizing sites based on user behavior tracking, and in IoT (Internet of Things) scenarios
such as vehicle telematics or mobile device telemetry.

NoSQL Types

The three main types of NoSQL databases are:


 Column Database (column-oriented)
 Key-Value Database (key/value oriented)
 Document Database (document-oriented)

1. Column Database

A NoSQL database that stores data in tables and manages them by columns instead of rows,
also called a columnar database management system (CDBMS).
It stores columns as separate data files.

One benefit is that it can compress data, allowing fast operations such as minimum,
maximum, sum, count, and average. Columns can be auto-indexed, so such a system uses less
disk space than a relational database system containing the same data.

Apache HBase is a column-oriented NoSQL database, developed to run on top of Hadoop with
HDFS.

It is designed around the concepts of the original columnar database developed by Google,
called Bigtable. It is excellent for real-time search and for reading and accessing large volumes of
data.

2. Key-Value Database

A key/value oriented NoSQL stores data in collections of key/value pairs. For example, a
student id number may be the key, and the student’s name may be the value.

It works like a dictionary, storing a value, such as an integer, a string, or a JSON or matrix
structure, along with a key used to reference that value.
Apache Cassandra is a powerful NoSQL database based on a key/value model. Facebook
developed it and released it in 2008; it is scalable and fault tolerant.

It was developed to solve real-time Big Data analytical problems involving petabytes of data, often
analyzed with MapReduce. Cassandra can run without Hadoop, but it becomes more powerful when
connected to Hadoop and HDFS.

3. Document Database (document-oriented)

Document-oriented NoSQL databases are an extension of the key/value model in which the value is a
structured document.

Documents are organized into collections analogous to relational tables, and we can query
based on values, not just on keys.

MongoDB is a document-oriented NoSQL database, developed by MongoDB Inc. and distributed
free of charge as open source.

MongoDB stores data as JSON-like documents without enforcing a fixed schema, meaning fields may
differ from one document to another, and the data structure may change over time.
We can execute it without Hadoop, but it becomes more powerful when connected to Hadoop and
HDFS.
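
To illustrate the schema-less document model, here is a small sketch using the PyMongo driver (the database, collection, and field names are invented for this example, and a local MongoDB server is assumed):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["demo_db"]
students = db["students"]

# Two documents in the same collection with different fields: no fixed schema is enforced.
students.insert_one({"student_id": 1, "name": "Asha", "age": 21})
students.insert_one({"student_id": 2, "name": "Ravi", "courses": ["Hadoop", "Spark"]})

# Query by value, not just by key.
for doc in students.find({"age": {"$gt": 18}}):
    print(doc)

client.close()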

Let’s look into HBase, Cassandra and MongoDB in more detail:

Apache HBase

Apache HBase is a popular and highly efficient column-oriented NoSQL database built on
top of the Hadoop Distributed File System that allows performing read/write operations on large
datasets in real time using key/value data.

It is an open-source platform and is horizontally scalable. It is a distributed, column-oriented
database built on top of the Hadoop file system, and it is a non-relational (NoSQL) system.
HBase is a true and faithful open-source implementation modeled on Google's Bigtable.
Column-oriented databases are those that store data tables as sections of columns of data
instead of rows of data. HBase is a distributed, persistent, strictly consistent storage system
with near-optimal writes in terms of input/output channel saturation and excellent read
performance. It makes efficient use of disk space by supporting pluggable compression
algorithms that can be chosen based on the nature of the data in a particular set of column
families.

HBase handles load shifting and failures elegantly and transparently to the client.
Scalability is built in, and clusters can be grown or shrunk while the system is still in
production. Changing the cluster does not involve any difficult rebalancing or resharding
procedure; it is fully automated to match requirements.

Apache HBase is an Apache Hadoop project: an open-source, non-relational,
distributed database that had its genesis in Google's Bigtable. HBase is written in Java.
Today it is an integral part of the Apache Software Foundation
and the Hadoop ecosystem. It is a high-availability database that runs exclusively on top of
HDFS and provides the capabilities of Google Bigtable to the Hadoop framework for
storing huge volumes of unstructured data at breakneck speed in order to derive valuable
insights from it.

It has an extremely fault-tolerant way of storing data and is especially good for storing
sparse data. Working with sparse data is something like looking for a needle in a haystack: a
real-life example would be looking for someone who has spent over $100,000
in a single transaction on Amazon among the tens of millions of transactions that
happen in any given week.

Criteria      | HBase
Cluster basis | Hadoop
Deployed for  | Batch jobs
API           | Thrift or REST

The Architecture of Apache HBase

Apache HBase carries all the features of the original Google Bigtable paper, such as
Bloom filters, in-memory operations and compression. The tables of this database can serve
as the input for MapReduce jobs on the Hadoop ecosystem and it can also serve as output
after the data is processed by MapReduce. The data can be accessed via the Java API or
through the REST API or even the Thrift and AVRO gateways.

HBase is basically a column-oriented key-value data store, and since it
works extremely well with the kind of data that Hadoop processes, it is a natural fit for deployment
as a layer on top of HDFS. It is extremely fast when it comes to both read and write operations
and does not lose this important quality even when the datasets are humongous.
Therefore, it is widely used by corporations for its high throughput and low
input/output latency. It cannot work as a replacement for an SQL database, but it is perfectly
possible to put an SQL layer on top of HBase to integrate it with various business
intelligence and analytics tools.

HBase does not support SQL scripting; instead, access code is written in Java, much like
a MapReduce application.
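
Besides the native Java API, HBase can also be reached over its Thrift gateway. The sketch below uses the third-party Python happybase client (the table and column-family names are invented, and an HBase Thrift server on localhost with an existing 'user_events' table is assumed) to show the basic key/value, column-family access pattern:

import happybase

# Connect to the HBase Thrift gateway (assumed to be running on the default port 9090).
connection = happybase.Connection("localhost")
table = connection.table("user_events")   # hypothetical table with column family 'cf'

# Write a row: a row key plus column-family:qualifier -> value pairs (all bytes).
table.put(b"user-42", {b"cf:name": b"Asha", b"cf:last_login": b"2021-06-01"})

# Random read by row key.
row = table.row(b"user-42")
print(row[b"cf:name"])

# Scan a range of row keys.
for key, data in table.scan(row_prefix=b"user-"):
    print(key, data)

connection.close()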

Why should you use the HBase technology?

HBase is one of the core components of the Hadoop ecosystem, the other two
being HDFS and MapReduce. As part of the Hortonworks Data Platform, the Apache Hadoop
ecosystem is available as a highly secure, enterprise-ready Big Data framework. HBase is
regularly deployed by some of the biggest companies, for example in the Facebook messaging
system. Some of the salient features that make HBase one of the most sought-after data stores
are as follows:

 It has a completely distributed architecture and can work on extremely large scale
data
 It works for extremely random read and write operations
 It has high security and easy management of data
 It provides an unprecedented high write throughput
 Scaling to meet additional requirements is seamless and quick
 Can be used for both structured and semi-structured data types
 It is good when you don’t need full RDBMS capabilities
 It has a perfectly modular and linear scalability feature
 The data reads and writes are strictly consistent
 The table sharding can be easily configured and automatized
 The various servers are provided automatic failover support
 The MapReduce jobs can be backed with HBase Tables
 Client access is seamless with Java APIs.

What is the scope of Apache HBase?

One of the most important features of HBase is that it can handle data sets which number in
billions of rows and millions of columns. It can combine data sources of a wide variety of
types, structures, and schemas extremely well. The best part
is that it can be integrated natively with Hadoop to provide a seamless fit, and it
works extremely well with YARN. HBase provides very low-latency access over fast-changing
and humongous amounts of data.

Why do we need this technology and what is the problem that it is solving?

HBase is a very progressive NoSQL database that is seeing increased use in today's world,
which is overwhelmed with Big Data. It has simple Java programming roots and can be
deployed and scaled on a very large scale. There are a lot of business scenarios in which
we work exclusively with sparse data, that is, looking for a handful of data fields
matching certain criteria among fields that number in the billions. It is extremely
fault-tolerant and resilient and can work on multiple types of data, making it useful for varied
business scenarios.

It is a column-oriented store, making it very easy to look for the right data among billions of
data fields. Data can easily be sharded into tables with the right configuration and
automation. HBase is well suited for analytical processing of data. Since analytical
processing involves huge amounts of data, queries can exceed what is possible on a single
server. This is when distributed storage comes into the picture.

There is also a need to handle large volumes of reads and writes, which is just not possible
with an RDBMS, so HBase is the perfect candidate for such applications. The
read/write capacity of this technology can be scaled to millions of operations per second, giving it
an unprecedented advantage. Facebook uses it extensively for real-time messaging
applications, and Pinterest uses it for multiple tasks, running up to 5 million operations per
second.

Using Apache HBase to store and access data

Configuring HBase and Hive

Follow this step to complete the configuration:

Modify the hive-site.xml configuration file and add the required paths to the JARs. The JARs will be
used by Hive to write data into HBase. The full list of JARs to add can be seen by running
the command hbase mapredcp on the command line.

<property>
<name>hive.aux.jars.path</name>
<value>
file:///usr/hdp/3.0.1.0-61/hbase/lib/commons-lang3-3.6.jar,
file:///usr/hdp/3.0.1.0-61/hbase/lib/hbase-zookeeper-2.0.0.3.0.1.0-61.jar,

file:///usr/hdp/3.0.1.0-61/hbase/lib/hbase-mapreduce-2.0.0.3.0.1.0-61.jar,
file:///usr/hdp/3.0.1.0-61/hbase/lib/jackson-annotations-2.9.5.jar,
file:///usr/hdp/3.0.1.0-61/hbase/lib/hbase-shaded-miscellaneous-2.1.0.jar,
file:///usr/hdp/3.0.1.0-61/hbase/lib/jackson-databind-2.9.5.jar,
file:///usr/hdp/3.0.1.0-61/hbase/lib/hbase-hadoop-compat-2.0.0.3.0.1.0-61.jar,
file:///usr/hdp/3.0.1.0-61/hbase/lib/hbase-metrics-2.0.0.3.0.1.0-61.jar,
file:///usr/hdp/3.0.1.0-61/hbase/lib/hbase-client-2.0.0.3.0.1.0-61.jar,
file:///usr/hdp/3.0.1.0-61/hbase/lib/hbase-protocol-shaded-2.0.0.3.0.1.0-61.jar,
file:///usr/hdp/3.0.1.0-61/hbase/lib/jackson-core-2.9.5.jar,
file:///usr/hdp/3.0.1.0-61/hbase/lib/protobuf-java-2.5.0.jar,
file:///usr/hdp/3.0.1.0-61/hbase/lib/hbase-shaded-netty-2.1.0.jar,
file:///usr/hdp/3.0.1.0-61/hbase/lib/metrics-core-3.2.1.jar,
file:///usr/hdp/3.0.1.0-61/hbase/lib/hbase-server-2.0.0.3.0.1.0-61.jar,
file:///usr/hdp/3.0.1.0-61/hbase/lib/hbase-hadoop2-compat-2.0.0.3.0.1.0-61.jar,
file:///usr/hdp/3.0.1.0-61/hbase/lib/hbase-metrics-api-2.0.0.3.0.1.0-61.jar,
file:///usr/hdp/3.0.1.0-61/hbase/lib/hbase-common-2.0.0.3.0.1.0-61.jar,
file:///usr/hdp/3.0.1.0-61/hbase/lib/hbase-protocol-2.0.0.3.0.1.0-61.jar,
file:///usr/hdp/3.0.1.0-61/hbase/lib/hbase-shaded-protobuf-2.1.0.jar,
file:///usr/hdp/3.0.1.0-61/hbase/lib/htrace-core4-4.2.0-incubating.jar,
file:///usr/hdp/3.0.1.0-61/zookeeper/zookeeper-3.4.6.3.0.1.0-61.jar
</value>
</property>

Using HBase Hive integration

Before you begin to use the Hive HBase integration, complete the following steps:

 Use the HBaseStorageHandler to register the HBase tables with the Hive metastore.
You can also register the Hbase tables directly in Hive using
the HiveHBaseTableInputFormat and HiveHBaseTableOutputFormat classes.

 As part of the registration process, specify a column mapping. There are two
SERDEPROPERTIES that control the HBase column mapping to Hive:

o hbase.columns.mapping
o hbase.table.default.storage.type

HBase Hive integration example

A change to Hive in HDP 3.0 is that all StorageHandlers must be marked as "external"; there
is no such thing as a non-external table created by a StorageHandler. If the corresponding
HBase table exists when the Hive table is created, it will mimic the HDP 2.x semantics of an
"external" table. If the corresponding HBase table does not exist when the Hive table is
created, it will mimic the HDP 2.x semantics of a non-external table (e.g., the HBase table is
dropped when the Hive table is dropped).

From the Hive shell, create a HBase table:

CREATE EXTERNAL TABLE hbase_hive_table (key int, value string)

STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'

WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")

TBLPROPERTIES ("hbase.table.name" = "hbase_hive_table",


"hbase.mapred.output.outputtable" = "hbase_hive_table");

The hbase.columns.mapping property is mandatory. The hbase.table.name property is
optional. The hbase.mapred.output.outputtable property is also optional; it is needed if you plan
to insert data into the table.

From the HBase shell, access the hbase_hive_table:

$ hbase shell

HBase Shell; enter 'help<RETURN>' for list of supported commands.

Version: 0.20.3, r902334, Mon Jan 25 13:13:08 PST 2010

hbase(main):001:0> list hbase_hive_table

1 row(s) in 0.0530 seconds

hbase(main):002:0> describe hbase_hive_table

Table hbase_hive_table is ENABLED

hbase_hive_table COLUMN FAMILIES DESCRIPTION{NAME => 'cf', DATA_BLOCK_ENCODING


=> 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1',
COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS
=> 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'} 1 row(s)
in 0.2860 seconds

hbase(main):003:0> scan "hbase_hive_table"

ROW COLUMN+CELL

0 row(s) in 0.0060 seconds

Insert the data into the HBase table through Hive:

INSERT OVERWRITE TABLE HBASE_HIVE_TABLE SELECT * FROM pokes WHERE foo=98;

From the HBase shell, verify that the data got loaded:

hbase(main):009:0> scan "hbase_hive_table"

ROW COLUMN+CELL

98 column=cf1:val, timestamp=1267737987733, value=val_98

1 row(s) in 0.0110 seconds

From Hive, query the HBase data to view the data that is inserted in the hbase_hive_table:

hive> select * from HBASE_HIVE_TABLE;

Total MapReduce jobs = 1

Launching Job 1 out of 1

...

OK

98 val_98

Time taken: 4.582 seconds

Using Hive to access an existing HBase table example

Use the following steps to access the existing HBase table through Hive.

 You can access the existing HBase table through Hive using the CREATE EXTERNAL
TABLE:

CREATE EXTERNAL TABLE hbase_table_2 (key int, value string)

STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'

WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")

TBLPROPERTIES ("hbase.table.name" = "some_existing_table",
"hbase.mapred.output.outputtable" = "some_existing_table");

 You can use different types of column mapping to map the HBase columns to Hive:

o Multiple Columns and Families

To define four columns, the first being the rowkey: “:key,cf:a,cf:b,cf:c”


o Hive MAP to HBase Column Family

When the Hive datatype is a Map, a column family with no qualifier might be used.
This will use the keys of the Map as the column qualifier in HBase: “cf:”

o Hive MAP to HBase Column Prefix

When the Hive datatype is a Map, a prefix for the column qualifier can be provided
which will be prepended to the Map keys: “cf:prefix_.*”
Note: The prefix is removed from the column qualifier as compared to the key in the
Hive Map. For example, for the above column mapping, a column of “cf:prefix_a”
would result in a key in the Map of “a”.
 You can also define composite row keys. Composite row keys use multiple Hive
columns to generate the HBase row key.

o Simple Composite Row Keys

A Hive column with a datatype of Struct will automatically concatenate all elements in
the struct with the termination character specified in the DDL.
o Complex Composite Row Keys and HBaseKeyFactory

Custom logic can be implemented by writing Java code to implement a KeyFactory


and provide it to the DDL using the table property key “hbase.composite.key.factory”.

Understanding Bulk Loading

A common pattern in HBase to obtain high rates of data throughput on the write path is to
use "bulk loading". This generates HBase files (HFiles) in a specific format instead of
shipping edits to HBase RegionServers. The Hive integration can generate
HFiles, which can be enabled by setting the property hive.hbase.generatehfiles to true, for
example, `set hive.hbase.generatehfiles=true`. Additionally, the path to a directory to which the
HFiles will be written must also be provided, for example, `set hfile.family.path=/tmp/hfiles`.

After the Hive query finishes, you must execute the “completebulkload” action in HBase to
bring the files “online” in your HBase table. For example, to finish the bulk load for files in
“/tmp/hfiles” for the table “hive_data”, you might run on the command-line:

$ hbase completebulkload /tmp/hfiles hive_data

Understanding HBase Snapshots

When an HBase snapshot exists for an HBase table which a Hive table references, you can
choose to execute queries over the “offline” snapshot for that table instead of the table itself.

First, set the property to the name of the HBase snapshot in your Hive script: `set
hive.hbase.snapshot.name=my_snapshot`. A temporary directory is required to run the query
over the snapshot. By default, a directory is chosen inside of “/tmp” in HDFS, but this can be
overridden by using the property “hive.hbase.snapshot.restoredir”.

Apache Cassandra

Apache Cassandra is an open-source distributed database management system
designed to handle large amounts of Big Data across many commodity servers, providing
high availability with no single point of failure. Cassandra is a non-relational, highly scalable,
eventually consistent, distributed, structured, column-family-based data store built on a
peer-to-peer architecture.

 Originally developed at Facebook
 Written in Java
 Open source
 Its name comes from Greek mythology
 Cassandra uses a mixture of concepts from Google's Bigtable and the Distributed Hash
Table (DHT) of Amazon's Dynamo

Now, let's discuss what has changed with the introduction of NoSQL:

 Massive data volumes
 Extreme query load
 Flexible schema evolution: the schema is never fixed and keeps evolving
 Schema changes can be gradually introduced into the system

Cassandra falls under the columnar, or extensible-record, category, where each key is
associated with many attributes. It still uses tables but has no joins. Cassandra does not
support joins or sub-queries, except for batch analysis via Hadoop. Rather, Cassandra
emphasizes denormalization through features like collections. Cassandra stores data by
columns, unlike traditional row-oriented databases.

What is CAP Theorem?

The CAP theorem, also known as Brewer's theorem, states that it is impossible for a distributed
computer system to simultaneously provide all three of the following guarantees:

 Consistency (all nodes see the same data at the same time)
 Availability (a guarantee that every request receives a response about whether it was
successful or failed)
 Partition Tolerance (the system continues to operate despite arbitrary message loss)

According to this theorem, a distributed system can satisfy any two of these guarantee at the
same time, but not all three.

So, we need to understand that network partitions are unavoidable. When a partition occurs, i.e.,
parts of the system can't communicate with each other, the system should remain
operational; but while it is operational, whether it holds on to availability or holds on to
consistency is what each distributed system has to decide.

Cassandra has a concept called tunable consistency, which is rare among databases: you can set
up Cassandra to favor either availability or consistency, and Cassandra can work in both modes,
unlike most other databases.

Cassandra is generally not suited for building banking or financial systems; instead it is widely
used in areas such as social media. Strict OLTP or payment workloads are not a good fit for Cassandra.

Cassandra Compare to HBase?

HBase is a NoSQL, distributed database model that is included in the Apache Hadoop
Project. It runs on top of the Hadoop Distributed File System (HDFS). HBase is designed for
data lake use cases and is not typically used for web and mobile applications. Cassandra, by
contrast, offers the availability and performance necessary for developing always-on
applications.

Combining Cassandra and Hadoop

Today's organizations have two data needs: the need for a database devoted to online
operations and the analysis of "hot" data generated by Web, mobile, and IoT applications,
and the need for a batch-oriented Big Data platform that supports the processing of vast
amounts of "cold" unstructured historical data. By tightly integrating Cassandra and Hadoop
to work together, both needs can be served.

While Cassandra works very well as a highly fault tolerant backend for online systems,
Cassandra is not as analytics friendly as Hadoop. Deploying Hadoop on top of Cassandra
creates the ability to analyze data in Cassandra without having to first move that data into
Hadoop. Moving data off Cassandra into Hadoop and HDFS is a complicated and time-
consuming process. Thus Hadoop on Cassandra gives organizations a convenient way to get
specific operational analytics and reporting from relatively large amounts of data residing in
Cassandra in real time fashion. Armed with faster and deeper big data insights, organizations
that leverage both Hadoop and Cassandra can better meet the needs of their customers and
gain a stronger edge over their competitors.

Architecture in brief

Cassandra is designed to handle big data workloads across multiple nodes with no single
point of failure. Its architecture is based on the understanding that system and hardware
failures can and do occur. Cassandra addresses the problem of failures by employing a peer-
to-peer distributed system across homogeneous nodes where data is distributed among all
nodes in the cluster. Each node frequently exchanges state information about itself and other
nodes across the cluster using peer-to-peer gossip communication protocol. A sequentially
written commit log on each node captures write activity to ensure data durability. Data is
then indexed and written to an in-memory structure, called a memtable, which resembles a
write-back cache. Each time the memory structure is full, the data is written to disk in
an SSTable data file. All writes are automatically partitioned and replicated throughout the
cluster. Cassandra periodically consolidates SSTables using a process called compaction,

discarding obsolete data marked for deletion with a tombstone. To ensure all data across the
cluster stays consistent, various repair mechanisms are employed.

Cassandra is a partitioned row store database, where rows are organized into tables with a
required primary key. Cassandra's architecture allows any authorized user to connect to any
node in any datacenter and access data using the CQL language. For ease of use, CQL uses a
similar syntax to SQL and works with table data. Developers can access CQL
through cqlsh, DevCenter, and via drivers for application languages. Typically, a cluster has
one keyspace per application composed of many different tables.
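
As a small sketch of accessing Cassandra through a language driver (here the DataStax Python driver; the contact point, keyspace, table, and replication settings are invented for illustration), CQL statements can be issued from application code much like SQL:

from cassandra.cluster import Cluster

# Connect to any node in the cluster; that node acts as the coordinator for our requests.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# One keyspace per application, with the replication factor defined per datacenter.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo_app
    WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': 3}
""")
session.set_keyspace("demo_app")

session.execute("""
    CREATE TABLE IF NOT EXISTS cyclists (
        last_name text,
        first_name text,
        age int,
        PRIMARY KEY (last_name, first_name)
    )
""")

session.execute(
    "INSERT INTO cyclists (last_name, first_name, age) VALUES (%s, %s, %s)",
    ("Matthews", "Anna", 28),
)

for row in session.execute("SELECT * FROM cyclists WHERE last_name = 'Matthews'"):
    print(row.first_name, row.age)

cluster.shutdown()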

Client read or write requests can be sent to any node in the cluster. When a client connects
to a node with a request, that node serves as the coordinator for that particular client
operation. The coordinator acts as a proxy between the client application and the nodes that
own the data being requested. The coordinator determines which nodes in the ring should
get the request based on how the cluster is configured.

Key structures

 Node: Where you store your data. It is the basic infrastructure component of
Cassandra.
 Datacenter: A collection of related nodes. A datacenter can be a physical datacenter
or virtual datacenter. Different workloads should use separate datacenters, either
physical or virtual. Replication is set by datacenter. Using separate datacenters
prevents Cassandra transactions from being impacted by other workloads and keeps
requests close to each other for lower latency. Depending on the replication factor,
data can be written to multiple datacenters. Datacenters must never span physical
locations.
 Cluster: A cluster contains one or more datacenters. It can span physical locations.
 Commit log: All data is written first to the commit log for durability. After all its data
has been flushed to SSTables, it can be archived, deleted, or recycled.
 SSTable: A sorted string table (SSTable) is an immutable data file to which Cassandra
writes memtables periodically. SSTables are append only and stored on disk
sequentially and maintained for each Cassandra table.
 CQL Table: A collection of ordered columns fetched by table row. A table consists of
columns and has a primary key.

Key components for configuring Cassandra

 Gossip
A peer-to-peer communication protocol to discover and share location and state
information about the other nodes in a Cassandra cluster. Gossip information is also
persisted locally by each node to use immediately when a node restarts.

 Partitioner

A partitioner determines which node will receive the first replica of a piece of data,
and how to distribute other replicas across other nodes in the cluster. Each row of
data is uniquely identified by a primary key, which may be the same as its partition
key, but which may also include other clustering columns. A partitioner is a hash
function that derives a token from the primary key of a row. The partitioner uses the
token value to determine which nodes in the cluster receive the replicas of that row.
The Murmur3Partitioner is the default partitioning strategy for new Cassandra
clusters and the right choice for new clusters in almost all cases.

You must set the partitioner and assign the node a num_tokens value for each node.
The number of tokens you assign depends on the hardware capabilities of the
system. If not using virtual nodes (vnodes), use the initial_token setting instead.

 Replication factor
The total number of replicas across the cluster. A replication factor of 1 means that
there is only one copy of each row on one node. A replication factor of 2 means two
copies of each row, where each copy is on a different node. All replicas are equally
important; there is no primary or master replica. You define the replication factor for
each datacenter. Generally, you should set the replication factor to greater than one,
but to no more than the number of nodes in the cluster.

 Replica placement strategy


Cassandra stores copies (replicas) of data on multiple nodes to ensure reliability and
fault tolerance. A replication strategy determines which nodes to place replicas on.
The first replica of data is simply the first copy; it is not unique in any sense.
The NetworkTopologyStrategy is highly recommended for most deployments
because it is much easier to expand to multiple datacenters when required by future
expansion.

When creating a keyspace, you must define the replica placement strategy and the
number of replicas you want.

 Snitch
A snitch defines groups of machines into datacenters and racks (the topology) that
the replication strategy uses to place replicas.

You must configure a snitch when you create a cluster. All snitches use a dynamic
snitch layer, which monitors performance and chooses the best replica for reading. It
is enabled by default and recommended for use in most deployments. Configure
dynamic snitch thresholds for each node in the cassandra.yaml configuration file.

The default SimpleSnitch does not recognize datacenter or rack information. Use it


for single-datacenter deployments or single-zone in public clouds.
The GossipingPropertyFileSnitch is recommended for production. It defines a node's
datacenter and rack and uses gossip for propagating this information to other nodes.

 The cassandra.yaml configuration file
The main configuration file for setting the initialization properties for a cluster,
caching parameters for tables, properties for tuning and resource utilization, timeout
settings, client connections, backups, and security.

By default, a node is configured to store the data it manages in a directory set in


the cassandra.yaml file.

In a production cluster deployment, you can change the commitlog-directory to a


different disk drive from the data_file_directories.

 System keyspace table properties


You set storage configuration attributes on a per-keyspace or per-table basis
programmatically or using a client application, such as CQL.

Storage engine

Cassandra uses a storage structure similar to a Log-Structured Merge Tree, unlike a typical
relational database that uses a B-Tree. Cassandra avoids reading before writing. Read-
before-write, especially in a large distributed system, can result in large latencies in read
performance and other problems. For example, two clients read at the same time; one
overwrites the row to make update A, and the other overwrites the row to make update B,
removing update A. This race condition will result in ambiguous query results - which update
is correct?

To avoid using read-before-write for most writes in Cassandra, the storage engine groups
inserts and updates in memory, and at intervals, sequentially writes the data to disk in
append mode. Once written to disk, the data is immutable and is never overwritten. Reading
data involves combining this immutable sequentially-written data to discover the correct
query results. You can use Lightweight transactions (LWT) to check the state of the data
before writing. However, this feature is recommended only for limited use.

A log-structured engine that avoids overwrites and uses sequential I/O to update data is
essential for writing to solid-state disks (SSD) and hard disks (HDD). On HDD, writing
randomly involves a higher number of seek operations than sequential writing. The seek
penalty incurred can be substantial. Because Cassandra sequentially writes immutable files,
thereby avoiding write amplification and disk failure, the database accommodates
inexpensive, consumer SSDs extremely well. For many other databases, write amplification is
a problem on SSDs.

How Cassandra reads and writes data

To manage and access data in Cassandra, it is important to understand how Cassandra stores
data. The hinted handoff feature plus Cassandra conformance and non-conformance to the
ACID (atomic, consistent, isolated, durable) database properties are key concepts to
understand reads and writes. In Cassandra, consistency refers to how up-to-date and
synchronized a row of data is on all of its replicas.

Client utilities and application programming interfaces (APIs) for developing applications for
data storage and retrieval are available.

How is data written?

Cassandra processes data at several stages on the write path, starting with the immediate
logging of a write and ending with a write of the data to disk:

 Logging data in the commit log


 Writing data to the memtable
 Flushing data from the memtable
 Storing data on disk in SSTables

Logging writes and memtable storage

When a write occurs, Cassandra stores the data in a memory structure called memtable, and
to provide configurable durability, it also appends writes to the commit log on disk. The
commit log receives every write made to a Cassandra node, and these durable writes survive
permanently even if power fails on a node. The memtable is a write-back cache of data
partitions that Cassandra looks up by key. The memtable stores writes in sorted order until
reaching a configurable limit, and then is flushed.

Flushing data from the memtable

To flush the data, Cassandra writes the data to disk in the memtable-sorted order. A
partition index is also created on the disk that maps the tokens to a location on disk. When
the memtable content exceeds the configurable threshold or the commitlog space exceeds
the commitlog_total_space_in_mb, the memtable is put in a queue that is flushed to disk. The
queue can be configured with
the memtable_heap_space_in_mb or memtable_offheap_space_in_mb setting in
the cassandra.yaml file. If the data to be flushed exceeds the memtable_cleanup_threshold,
Cassandra blocks writes until the next flush succeeds. You can manually flush a table
using nodetool flush or nodetool drain (which flushes memtables without listening for connections
to other nodes). To reduce the commit log replay time, the recommended best practice is to
flush the memtable before you restart the nodes. If a node stops working, replaying the
commit log restores to the memtable the writes that were there before it stopped.

Data in the commit log is purged after its corresponding data in the memtable is flushed to
an SSTable on disk.

Storing data on disk in SSTables

Memtables and SSTables are maintained per table. The commit log is shared among tables.
SSTables are immutable, not written to again after the memtable is flushed. Consequently, a
partition is typically stored across multiple SSTable files. A number of other SSTable
structures exist to assist read operations:

For each SSTable, Cassandra creates these structures:

Data (Data.db)

The SSTable data


Primary Index (Index.db)
Index of the row keys with pointers to their positions in the data file
Bloom filter (Filter.db)
A structure stored in memory that checks if row data exists in the memtable before
accessing SSTables on disk
Compression Information (CompressionInfo.db)
A file holding information about uncompressed data length, chunk offsets and other
compression information
Statistics (Statistics.db)
Statistical metadata about the content of the SSTable
Digest (Digest.crc32, Digest.adler32, Digest.sha1)
A file holding adler32 checksum of the data file
CRC (CRC.db)
A file holding the CRC32 for chunks in an uncompressed file.
SSTable Index Summary (SUMMARY.db)
A sample of the partition index stored in memory
SSTable Table of Contents (TOC.txt)
A file that stores the list of all components for the SSTable TOC
Secondary Index (SI_.*.db)
Built-in secondary index. Multiple SIs may exist per SSTable
The SSTables are files stored on disk. The naming convention for SSTable files has changed
with Cassandra 2.2 and later to shorten the file path. The data files are stored in a data
directory that varies with installation. For each keyspace, a directory within the data directory
stores each table. For example, /data/data/ks1/cf1-5be396077b811e3a3ab9dc4b9ac088d/la-
1-big-Data.db represents a data file. ks1 represents the keyspace name to distinguish the
keyspace for streaming or bulk loading data. A hexadecimal string,
5be396077b811e3a3ab9dc4b9ac088d in this example, is appended to table names to
represent unique table IDs.

Cassandra creates a subdirectory for each table, which allows you to symlink a table to a
chosen physical drive or data volume. This provides the capability to move very active tables
to faster media, such as SSDs for better performance, and also divides tables across all
attached storage devices for better I/O balance at the storage layer.

How is data maintained?

The Cassandra write process stores data in files called SSTables. SSTables are immutable.
Instead of overwriting existing rows with inserts or updates, Cassandra writes new
timestamped versions of the inserted or updated data in new SSTables. Cassandra does not
perform deletes by removing the deleted data: instead, Cassandra marks it with tombstones.

Over time, Cassandra may write many versions of a row in different SSTables. Each version
may have a unique set of columns stored with a different timestamp. As SSTables
accumulate, the distribution of data can require accessing more and more SSTables to
retrieve a complete row.

To keep the database healthy, Cassandra periodically merges SSTables and discards old data.
This process is called compaction.

Compaction

Compaction works on a collection of SSTables. From these SSTables, compaction collects all
versions of each unique row and assembles one complete row, using the most up-to-date
version (by timestamp) of each of the row's columns. The merge process is performant,
because rows are sorted by partition key within each SSTable, and the merge process does
not use random I/O. The new version of each row is written to a new SSTable. The old
versions, along with any rows that are ready for deletion, are left in the old SSTables, and are
deleted as soon as pending reads are completed.

Compaction causes a temporary spike in disk space usage and disk I/O while old and new
SSTables co-exist. As it completes, compaction frees up disk space occupied by old SSTables.
It improves read performance by incrementally replacing old SSTables with compacted
SSTables. Cassandra can read data directly from the new SSTable even before it finishes
writing, instead of waiting for the entire compaction process to finish.

As Cassandra processes writes and reads, it replaces the old SSTables with new SSTables in
the page cache. The process of caching the new SSTable, while directing reads away from the
old one, is incremental; it does not cause dramatic cache misses. Cassandra provides
predictable high performance even under heavy load.

Compaction strategies

Cassandra supports different compaction strategies, which control which SSTables are
chosen for compaction and how the compacted rows are sorted into new SSTables. Each
strategy has its own strengths. The sections that follow explain each of Cassandra's
compaction strategies.

Although each of the following sections starts with a generalized recommendation, many
factors complicate the choice of a compaction strategy.

SizeTieredCompactionStrategy (STCS)

Recommended for write-intensive workloads.

The SizeTieredCompactionStrategy (STCS) initiates compaction when Cassandra has


accumulated a set number (default: 4) of similar-sized SSTables. STCS merges these
SSTables into one larger SSTable. As these larger SSTables accumulate, STCS merges
these into even larger SSTables. At any given time, several SSTables of varying sizes
are present.

(Figure: Size-tiered compaction after many inserts)

While STCS works well to compact a write-intensive workload, it makes reads slower
because the merge-by-size process does not group data by rows. This makes it more
likely that versions of a particular row may be spread over many SSTables. Also, STCS
does not evict deleted data predictably because its trigger for compaction is SSTable
size, and SSTables might not grow quickly enough to merge and evict old data. As
the largest SSTables grow in size, the amount of disk space needed for both the new
and old SSTables simultaneously during STCS compaction can outstrip a typical
amount of disk space on a node.

 Pros: Compacts write-intensive workload very well.


 Cons: Can hold onto stale data too long. Amount of memory needed
increases over time.
To implement the best compaction strategy:

1. Review your application's requirements.


2. Configure the table to use the most appropriate strategy.
3. Test the compaction strategies against your data.

The following questions are based on the experiences of Cassandra developers and users
with the strategies described above.

Does your table process time series data?

If so, your best choices are TWCS or DTCS. For details, read the descriptions on this page. If
your table is not focused on time series data, the choice becomes more complicated. The
following questions introduce other considerations that may guide your choice.
Does your table handle more reads than writes, or more writes than reads?

LCS is a good choice if your table processes twice as many reads as writes or more –
especially randomized reads. If the proportion of reads to writes is closer, the performance
hit exacted by LCS may not be worth the benefit. Be aware that LCS can be quickly
overwhelmed by a high volume of writes.
Does the data in your table change often?

One advantage of LCS is that it keeps related data in a small set of SSTables. If your data
is immutable or not subject to frequent upserts, STCS accomplishes the same type of
grouping without the LCS performance hit.
Do you require predictable levels of read and write activity?

LCS keeps the SSTables within predictable sizes and numbers. For example, if your table's
read/write ratio is small, and it is expected to conform to a Service Level Agreements (SLAs)
for reads, it may be worth taking the write performance penalty of LCS in order to keep read
rates and latency at predictable levels. And you may be able to overcome this write penalty
through horizontal scaling (adding more nodes).

Will your table be populated by a batch process?

On both batch reads and batch writes, STCS performs better than LCS. The batch process
causes little or no fragmentation, so the benefits of LCS are not realized; batch processes can
overwhelm LCS-configured tables.
Does your system have limited disk space?

LCS handles disk space more efficiently than STCS: it requires only about 10% headroom in
addition to the space occupied by the data it handles. STCS and DTCS generally require more, in
some cases as much as 50% more than the size of the data.
Is your system reaching its limits for I/O?

LCS is significantly more I/O intensive than DTCS or STCS. Switching to LCS may introduce
extra I/O load that offsets the advantages.
Testing compaction strategies

Suggestions for determining which compaction strategy is best for your system:

 Create a three-node cluster using one of the compaction strategies, stress test the
cluster using cassandra-stress, and measure the results.
 Set up a node on your existing cluster and use Cassandra's write survey mode to
sample live data.
Configuring and running compaction

Set the compaction strategy for a table in the parameters for the CREATE TABLE or ALTER
TABLE command.

You can start compaction manually using the nodetool compact command.
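
For example, the compaction strategy for an existing table could be changed with an ALTER TABLE statement like the one below, shown here through the Python driver and reusing the hypothetical demo_app.cyclists table from the earlier sketch:

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("demo_app")   # hypothetical cluster and keyspace

# Switch the table to size-tiered compaction, triggering a compaction once
# 4 similar-sized SSTables have accumulated.
session.execute("""
    ALTER TABLE demo_app.cyclists
    WITH compaction = {'class': 'SizeTieredCompactionStrategy', 'min_threshold': 4}
""")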

How is data updated?

Cassandra treats each new row as an upsert: if the new row has the same primary key as that
of an existing row, Cassandra processes it as an update to the existing row.

During a write, Cassandra adds each new row to the database without checking on whether a
duplicate record exists. This policy makes it possible that many versions of the same row may
exist in the database. For more details about writes, see How is data written?

Periodically, the rows stored in memory are streamed to disk into structures called SSTables.
At certain intervals, Cassandra compacts smaller SSTables into larger SSTables. If Cassandra
encounters two or more versions of the same row during this process, Cassandra only writes
the most recent version to the new SSTable. After compaction, Cassandra drops the original
SSTables, deleting the outdated rows.

Most Cassandra installations store replicas of each row on two or more nodes. Each node
performs compaction independently. This means that even though out-of-date versions of a
row have been dropped from one node, they may still exist on another node.

This is why Cassandra performs another round of comparisons during a read process. When
a client requests data with a particular primary key, Cassandra retrieves many versions of the
row from one or more replicas. The version with the most recent timestamp is the only one
returned to the client ("last-write-wins").

How is data deleted?

Cassandra's processes for deleting data are designed to improve performance, and to work
with Cassandra's built-in properties for data distribution and fault-tolerance.

Cassandra treats a delete as an insert or upsert. The data being added to the partition in
the DELETE command is a deletion marker called a tombstone. The tombstones go through
Cassandra's write path, and are written to SSTables on one or more nodes. The key
differentiating feature of a tombstone is that it has a built-in expiration date/time. At the end of its
expiration period (for details see below) the tombstone is deleted as part of Cassandra's
normal compaction process.

You can also mark a Cassandra record (row or column) with a time-to-live value. After this
amount of time has ended, Cassandra marks the record with a tombstone, and handles it like
other tombstoned records.
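
For instance, a time-to-live can be attached to a write with the USING TTL clause, as in this small sketch that reuses the hypothetical table and driver setup from the earlier Cassandra examples:

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("demo_app")   # hypothetical cluster and keyspace

# This row expires 86400 seconds (one day) after the write; Cassandra then
# tombstones it and removes it during a later compaction.
session.execute("""
    INSERT INTO cyclists (last_name, first_name, age)
    VALUES ('Brown', 'Lena', 31)
    USING TTL 86400
""")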

Deletion in a distributed system

In a multi-node cluster, Cassandra can store replicas of the same data on two or more nodes.
This helps prevent data loss, but it complicates the delete process. If a node receives a delete
for data it stores locally, the node tombstones the specified record and tries to pass the
tombstone to other nodes containing replicas of that record. But if one replica node is
unresponsive at that time, it does not receive the tombstone immediately, so it still contains
the pre-delete version of the record. If the tombstoned record has already been deleted from
the rest of the cluster before that node recovers, Cassandra treats the record on the recovered
node as new data, and propagates it to the rest of the cluster. This kind of deleted but
persistent record is called a zombie.

To prevent the reappearance of zombies, the database gives each tombstone a grace period.
The purpose of the grace period is to give unresponsive nodes time to recover and process
tombstones normally. When multiple replica answers are part of a read request, and those
responses differ, then whichever values are most recent take precedence. For example, if a
node has a tombstone but another node has a more recent change, then the final result
includes the more recent change.

If a node has a tombstone and another node has only an older value for the record, then the
final record will have the tombstone. If a client writes a new update to the tombstone during
the grace period, the database overwrites the tombstone.

When an unresponsive node recovers, Cassandra uses hinted handoff to replay the


database mutations the node missed while it was down. Cassandra does not replay a
mutation for a tombstoned record during its grace period. But if the node does not recover
until after the grace period ends, Cassandra may miss the deletion.

After the tombstone's grace period ends, Cassandra deletes the tombstone during
compaction.

The grace period for a tombstone is set by the property gc_grace_seconds. Its default value
is 864000 seconds (ten days). Each table can have its own value for this property.
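
As an illustration, the grace period can be tuned per table, for example shortening it from the ten-day default to one day (again reusing the hypothetical demo_app.cyclists table from the earlier sketches):

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("demo_app")   # hypothetical cluster and keyspace

# Reduce the tombstone grace period to 86400 seconds (one day) for this table.
# This is safe only if repairs run more frequently than the grace period.
session.execute("""
    ALTER TABLE demo_app.cyclists
    WITH gc_grace_seconds = 86400
""")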

How are indexes stored and updated?

Secondary indexes are used to filter a table for data stored in non-primary key columns. For
example, a table storing cyclist names and ages using the last name of the cyclist as the
primary key might have a secondary index on the age to allow queries by age. Querying to
match a non-primary key column is an anti-pattern, as querying should always result in a
continuous slice of data retrieved from the table.

If the table rows are stored based on last names, the table may be spread across several
partitions stored on different nodes. Queries based on a particular range of last names, such
as all cyclists with the last name Matthews, will retrieve sequential rows from the table, but a
query based on the age, such as all cyclists who are 28, will require all nodes to be queried
for a value. Non-primary keys play no role in ordering the data in storage, thus querying for
a particular value of a non-primary key column results in scanning all partitions. Scanning all
partitions generally results in a prohibitive read latency, and is not allowed.

Secondary indexes can be built for a column in a table. These indexes are stored locally on
each node in a hidden table and built in a background process. If a secondary index is used
in a query that is not restricted to a particular partition key, the query will have prohibitive
read latency because all nodes will be queried. A query with these parameters is only allowed
if the query option ALLOW FILTERING is used. This option is not appropriate for production
environments. If a query includes both a partition key condition and a secondary index
column condition, the query will be successful because the query can be directed to a single
node partition.

This technique, however, does not guarantee trouble-free indexing, so know when and when
not to use an index. In the example shown above, an index on the age could be used, but a
better solution is to create a materialized view or additional table that is ordered by age.
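As a hedged illustration of these rules, the following sketch (DataStax Java driver; keyspace, table, and column names are hypothetical) creates the age index and contrasts a query that is allowed because it also restricts the partition key with one that requires ALLOW FILTERING.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class SecondaryIndexSketch {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("cycling")) {

            // Secondary index on a non-primary-key column; it is stored locally on each node.
            session.execute("CREATE INDEX IF NOT EXISTS ON cyclist_name (age)");

            // Allowed: the partition key (lastname) routes the query to one node,
            // and the index narrows the result within that partition.
            session.execute("SELECT * FROM cyclist_name WHERE lastname = 'Matthews' AND age = 28");

            // Discouraged: no partition key, so every node must be queried; Cassandra
            // only accepts it with ALLOW FILTERING, which is not suitable for production.
            session.execute("SELECT * FROM cyclist_name WHERE age = 28 ALLOW FILTERING");
        }
    }
}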

As with relational databases, keeping indexes up to date uses processing time and resources,
so unnecessary indexes should be avoided. When a column is updated, the index is updated
as well. If the old column value still exists in the memtable, which typically occurs when
updating a small set of rows repeatedly, Cassandra removes the corresponding obsolete
index entry; otherwise, the old entry remains to be purged by compaction. If a read sees a
stale index entry before compaction purges it, the reader thread invalidates it.

How is data read?

To satisfy a read, Cassandra must combine results from the active memtable and potentially
multiple SSTables.

Cassandra processes data at several stages on the read path to discover where the data is
stored, starting with the data in the memtable and finishing with SSTables:

 Check the memtable
 Check the row cache, if enabled
 Check the Bloom filter
 Check the partition key cache, if enabled
 Go directly to the compression offset map if a partition key is found in the partition
key cache, or check the partition summary if not
 If the partition summary is checked, access the partition index
 Locate the data on disk using the compression offset map
 Fetch the data from the SSTable on disk
Read request flow

Row cache and Key cache request flow

Memtable

If the memtable has the desired partition data, then the data is read and then merged with
the data from the SSTables. The SSTable data is accessed as shown in the following steps.

Row Cache

Typical of any database, reads are fastest when the most in-demand data fits into memory.
The operating system page cache is best at improving performance, although the row cache
can provide some improvement for very read-intensive operations, where read operations
are 95% of the load. Row cache is contra-indicated for write-intensive operations. The row
cache, if enabled, stores a subset of the partition data stored on disk in the SSTables in
memory. In Cassandra 2.2 and later, it is stored in fully off-heap memory using a new
implementation that relieves garbage collection pressure in the JVM. The subset stored in
the row cache uses a configurable amount of memory for a specified period of time. The row
cache uses LRU (least-recently-used) eviction to reclaim memory when the cache has filled
up.

The row cache size is configurable, as is the number of rows to store. Configuring the
number of rows to be stored is a useful feature, making a "Last 10 Items" query very fast to
read. If row cache is enabled, desired partition data is read from the row cache, potentially
saving two seeks to disk for the data. The rows stored in row cache are frequently accessed
rows that are merged and saved to the row cache from the SSTables as they are accessed.
After storage, the data is available to later queries. The row cache is not write-through. If a
write comes in for the row, the cache for that row is invalidated and is not cached again until
the row is read. Similarly, if a partition is updated, the entire partition is evicted from the
cache. When the desired partition data is not found in the row cache, then the Bloom filter is
checked.
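The row cache is enabled per table through the caching option (together with a non-zero row_cache_size_in_mb in cassandra.yaml). A minimal, hypothetical example of keeping the ten most recently read rows of each partition cached might look like this:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class RowCacheTuning {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("cycling")) {

            // Cache all partition keys and up to 10 rows per partition, which keeps
            // a "Last 10 Items" style read in memory after the first access.
            session.execute("ALTER TABLE cyclist_name "
                    + "WITH caching = {'keys': 'ALL', 'rows_per_partition': '10'}");
        }
    }
}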

Bloom Filter

First, Cassandra checks the Bloom filter to discover which SSTables are likely to have the
requested partition data. The Bloom filter is stored in off-heap memory. Each SSTable has a
Bloom filter associated with it. A Bloom filter can establish that a SSTable does not contain
certain partition data. A Bloom filter can also find the likelihood that partition data is stored
in a SSTable. It speeds up the process of partition key lookup by narrowing the pool of keys.
However, because the Bloom filter is a probabilistic function, it can result in false positives.
Not all SSTables identified by the Bloom filter will have data. If the Bloom filter does not rule
out an SSTable, Cassandra checks the partition key cache.

The Bloom filter grows to approximately 1-2 GB per billion partitions. In the extreme case,
you can have one partition per row, so you can easily have billions of these entries on a
single machine. The Bloom filter is tunable if you want to trade memory for performance.
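That memory-versus-accuracy trade-off is exposed per table as the bloom_filter_fp_chance option; a small, hypothetical tuning example follows (table name is an assumption).

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class BloomFilterTuning {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("cycling")) {

            // A lower false-positive chance means a larger Bloom filter in memory but
            // fewer wasted SSTable reads; the default depends on the compaction strategy.
            session.execute("ALTER TABLE cyclist_name WITH bloom_filter_fp_chance = 0.01");
        }
    }
}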

Partition Key Cache

The partition key cache, if enabled, stores a cache of the partition index in off-heap memory.
The key cache uses a small, configurable amount of memory, and each "hit" saves one seek
during the read operation. If a partition key is found in the key cache, Cassandra can go
directly to the compression offset map to find the compressed block on disk that has the
data. The partition key cache functions better once warmed, and can greatly improve the
performance of cold-start reads, where the key cache does not yet contain (or has purged)
the keys. It is possible to limit the number of partition keys saved in the
key cache, if memory is very limited on a node. If a partition key is not found in the key
cache, then the partition summary is searched.

The partition key cache size is configurable, as are the number of partition keys to store in
the key cache.

Partition Summary

The partition summary is an off-heap in-memory structure that stores a sampling of the
partition index. A partition index contains all partition keys, whereas a partition summary
samples every X keys and maps the location of every Xth key in the index file. For
example, if the partition summary is set to sample every 20 keys, it will store the location of
the first key at the beginning of the SSTable file, the 20th key and its location in the file, and
so on. While not as exact as knowing the location of the partition key, the partition summary
can shorten the scan to find the partition data location. After finding the range of possible
partition key values, the partition index is searched.

By configuring the sample frequency, you can trade memory for performance, as the more
granularity the partition summary has, the more memory it will use. The sample frequency is
changed using the index interval property in the table definition. A fixed amount of memory
is configurable using the index_summary_capacity_in_mb property, and defaults to 5% of the
heap size.

Partition Index

The partition index resides on disk and stores an index of all partition keys mapped to their
offset. If the partition summary has been checked for a range of partition keys, now the
search passes to the partition index to seek the location of the desired partition key. A single
seek and sequential read of the columns over the passed-in range is performed. Using the
information found, the partition index now goes to the compression offset map to find the
compressed block on disk that has the data. If the partition index must be searched, two
seeks to disk will be required to find the desired data.

How are Cassandra transactions different from RDBMS transactions?

Cassandra does not use RDBMS ACID transactions with rollback or locking mechanisms, but
instead offers atomic, isolated, and durable transactions with eventual/tunable consistency
that lets the user decide how strong or eventual they want each transaction’s consistency to
be.

As a non-relational database, Cassandra does not support joins or foreign keys, and
consequently does not offer consistency in the ACID sense. (ACID consistency would, for
example, guarantee that when money is moved from account A to account B, the total across
the two accounts does not change.) Cassandra supports
atomicity and isolation at the row-level, but trades transactional isolation and atomicity for
high availability and fast write performance. Cassandra writes are durable.

Atomicity

In Cassandra, a write operation is atomic at the partition level, meaning the insertions or
updates of two or more rows in the same partition are treated as one write operation. A
delete operation is also atomic at the partition level.

For example, if using a write consistency level of QUORUM with a replication factor of 3,
Cassandra replicates the write to all three replica nodes and waits for acknowledgement
from two nodes. If the write fails on one of the nodes but succeeds on the other, Cassandra

reports a failure to replicate the write on that node. However, the replicated write that
succeeds on the other node is not automatically rolled back.
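As a sketch of this tunable behavior (DataStax Java driver; the table is hypothetical), the client chooses the consistency level per statement; with a replication factor of 3, QUORUM waits for two replica acknowledgements.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class QuorumWriteSketch {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("cycling")) {

            SimpleStatement insert = new SimpleStatement(
                    "INSERT INTO cyclist_name (lastname, firstname, age) VALUES ('Frame', 'Alex', 22)");
            // QUORUM: 2 of 3 replicas must acknowledge before the write is reported successful.
            insert.setConsistencyLevel(ConsistencyLevel.QUORUM);
            session.execute(insert);
        }
    }
}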

Cassandra uses client-side timestamps to determine the most recent update to a column.
The latest timestamp always wins when requesting data, so if multiple client sessions update
the same columns in a row concurrently, the most recent update is the one seen by readers.

Isolation

Cassandra write and delete operations are performed with full row-level isolation. This means
that a write to a row within a single partition on a single node is only visible to the client
performing the operation – the operation is restricted to this scope until it is complete. All
updates in a batch operation belonging to a given partition key have the same restriction.
However, a Batch operation is not isolated if it includes changes to more than one partition.
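A hedged sketch of a single-partition batch follows (the cyclist_expenses table, partitioned by cyclist_name, is made up): because both statements target the same partition, the batch is applied atomically and with row-level isolation, as described above.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class SinglePartitionBatch {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("cycling")) {

            // Both statements share the partition key cyclist_name = 'Vera', so the batch
            // is atomic and isolated; spanning several partitions would lose isolation.
            session.execute("BEGIN BATCH "
                    + "INSERT INTO cyclist_expenses (cyclist_name, expense_id, amount) VALUES ('Vera', 1, 7.95); "
                    + "UPDATE cyclist_expenses SET amount = 9.95 WHERE cyclist_name = 'Vera' AND expense_id = 1; "
                    + "APPLY BATCH");
        }
    }
}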

Durability

Writes in Cassandra are durable. All writes to a replica node are recorded both in memory
and in a commit log on disk before they are acknowledged as a success. If a crash or server
failure occurs before the memtables are flushed to disk, the commit log is replayed on restart
to recover any lost writes. In addition to the local durability (data immediately written to
disk), the replication of data on other nodes strengthens durability.

You can manage the local durability to suit your needs for consistency using
the commitlog_sync option in the cassandra.yaml file. Set the option to
either periodic or batch.

Cassandra support for integrating Hadoop includes:

 MapReduce
 You must run separate datacenters: one or more datacenters with nodes running just
Cassandra (for Online Transaction Processing) and others with nodes running Cassandra
with Hadoop installed. See Isolate Cassandra and Hadoop for details.
 Before starting the datacenters of Cassandra/Hadoop nodes, disable virtual nodes
(vnodes).
To disable virtual nodes:
1. In the cassandra.yaml file, set num_tokens to 1.
2. Uncomment the initial_token property and set it to 1 or to the value of a generated
token for a multi-node cluster.
3. Start the cluster for the first time.
Setup and configuration involve overlaying a Hadoop cluster on Cassandra nodes,
configuring a separate server for the Hadoop NameNode/JobTracker, and installing a
Hadoop TaskTracker and Data Node on each Cassandra node. The nodes in the Cassandra
datacenter can draw from data in the HDFS Data Node as well as from Cassandra. The Job
Tracker/Resource Manager (JT/RM) receives MapReduce input from the client application.
The JT/RM sends a MapReduce job request to the Task Trackers/Node Managers (TT/NM).
The data is written to Cassandra and the results are sent back to
the client.

The Apache docs also cover how to get configuration and integration support.

Input and Output Formats

Hadoop jobs can receive data from CQL tables and indexes and can write their output to
Cassandra tables as well as to the Hadoop FileSystem. Cassandra 3.0 supports the following
formats for these tasks:

 CqlInputFormat class: for importing job input into the Hadoop filesystem from CQL
tables
 CqlOutputFormat class: for writing job output from the Hadoop filesystem to CQL
tables
 CqlBulkOutputFormat class: generates Cassandra SSTables from the output of
Hadoop jobs, then loads them into the cluster using
the SSTableLoaderBulkOutputFormat class
Reduce tasks can store keys (and corresponding bound variable values) as CQL rows (and
respective columns) in a given CQL table.
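A hedged job-setup sketch follows; the class and helper names mirror the hadoop_cql3_word_count example shipped with Cassandra, while the keyspace, tables, and output query are hypothetical, so exact method names may vary between releases.

import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlInputFormat;
import org.apache.cassandra.hadoop.cql3.CqlOutputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CqlJobSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Input side: contact point, keyspace/table, and partitioner (names are placeholders).
        ConfigHelper.setInputInitialAddress(conf, "127.0.0.1");
        ConfigHelper.setInputColumnFamily(conf, "cycling", "cyclist_name");
        ConfigHelper.setInputPartitioner(conf, "Murmur3Partitioner");

        // Output side: write reducer results back into a CQL table.
        ConfigHelper.setOutputInitialAddress(conf, "127.0.0.1");
        ConfigHelper.setOutputColumnFamily(conf, "cycling", "cyclist_stats");
        ConfigHelper.setOutputPartitioner(conf, "Murmur3Partitioner");
        CqlConfigHelper.setOutputCql(conf, "UPDATE cycling.cyclist_stats SET riders = ?");

        Job job = Job.getInstance(conf, "cql-io-example");
        job.setJarByClass(CqlJobSetup.class);
        job.setInputFormatClass(CqlInputFormat.class);   // CQL rows become job input
        job.setOutputFormatClass(CqlOutputFormat.class); // reducer output becomes CQL rows
        // job.setMapperClass(...), job.setReducerClass(...) and key/value classes go here.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}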

Running the wordcount example

Wordcount example JARs are located in the examples directory of the Cassandra source
code installation. There are CQL and legacy examples in
the hadoop_cql3_word_count and hadoop_word_count subdirectories, respectively. Follow
instructions in the readme files.

Isolating Hadoop and Cassandra workloads

When you create a keyspace using CQL, Cassandra creates a virtual datacenter for a cluster,
even a one-node cluster, automatically. You assign nodes that run the same type of
workload to the same datacenter. The separate, virtual datacenters for different types of
nodes segregate workloads running Hadoop from those running Cassandra. Segregating
workloads ensures that only one type of workload is active per datacenter. Separating nodes
running a sequential data load, from nodes running any other type of workload, such as
Cassandra real-time OLTP queries is a best practice.
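For example, with NetworkTopologyStrategy a keyspace can keep replicas in both the OLTP and the analytics datacenters. The sketch below assumes datacenter names "Cassandra" and "Analytics", which must match the names reported by the cluster's snitch; the keyspace name is also an assumption.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class WorkloadSegregation {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

            // Three replicas in the OLTP datacenter, one in the Hadoop/analytics datacenter.
            session.execute("CREATE KEYSPACE IF NOT EXISTS cycling WITH replication = "
                    + "{'class': 'NetworkTopologyStrategy', 'Cassandra': 3, 'Analytics': 1}");
        }
    }
}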

The cassandra utility

You can run Cassandra 3.0 with start-up parameters by adding them to the cassandra-
env.sh file (package or tarball installations). You can also enter parameters at the command
line when starting up tarball installations.

Usage

Add a parameter to the cassandra-env.sh file as follows:

JVM_OPTS="$JVM_OPTS -D[PARAMETER]"
When starting up a tarball installation, you can add parameters at the command line:

cassandra [PARAMETERS]
Examples:

 Command line: $ bin/cassandra -Dcassandra.load_ring_state=false


 cassandra-env.sh: JVM_OPTS="$JVM_OPTS -Dcassandra.load_ring_state=false"

Command line options

Option Description

-f Start the cassandra process in the foreground. The default is to start as a
background process.

-h Help.

-p filename Log the process ID in the named file. Useful for stopping Cassandra by
killing its PID.

-v Print the version and exit.

Start-up parameters

The -D option specifies start-up parameters at the command line and in the cassandra-
env.sh file.

cassandra.auto_bootstrap=false

Sets auto_bootstrap to false on initial set-up of the cluster. The next time you start the
cluster, you do not need to change the cassandra.yaml file on each node to revert to true.

cassandra.available_processors=number_of_processors

In a multi-instance deployment, each Cassandra instance independently assumes that all CPU
processors are available to it. Use this setting to specify a smaller set of processors.

cassandra.boot_without_jna=true

Configures Cassandra to boot without JNA. If you do not set this parameter to true, and JNA
does not initialize, Cassandra does not boot.

cassandra.config=directory

Sets the directory location of the cassandra.yaml file. The default location depends on the
type of installation.

cassandra.expiration_date_overflow_policy=POLICY

Set the policy for TTL (time to live) timestamps that exceed the maximum value supported by
the storage engine, 2038-01-19T03:14:06+00:00. The database storage engine can only
encode TTL timestamps through January 19 2038 03:14:07 UTC due to the Year 2038
problem.

REJECT: Reject requests that contain an expiration timestamp later than 2038-01-
19T03:14:06+00:00.

CAP: Allow requests and insert expiration timestamps later than 2038-01-19T03:14:06+00:00
as 2038-01-19T03:14:06+00:00.

Default: REJECT.

cassandra.ignore_dynamic_snitch_severity=true|false (Default: false)

Setting this property to true causes the dynamic snitch to ignore the severity indicator from
gossip when scoring nodes. Severity is a numeric representation of a node based on
compaction events occurring on it, which it broadcasts via gossip. This factors into the
dynamic snitch's formula, unless overridden.

Future versions will default to true and this setting will be removed. See Failure detection
and recovery and Dynamic snitching in Cassandra: past, present, and future.

cassandra.initial_token=token

Use when Cassandra is not using virtual nodes (vnodes). Sets the initial partitioner token for
a node the first time the node is started. (Default: disabled)

Note: Vnodes automatically select tokens.

cassandra.join_ring=true|false

When set to false, prevents the Cassandra node from joining a ring on startup. (Default: true)
You can add the node to the ring afterwards using nodetool join and a JMX call.

cassandra.load_ring_state=true|false

When set to false, clears all gossip state for the node on restart. (Default: true)

cassandra.metricsReporterConfigFile=file

Enables pluggable metrics reporter. See Pluggable metrics reporting in Cassandra 2.0.2.

cassandra.native_transport_startup_delay_second=seconds

Delays the startup of native transport for the number of seconds. (Default: 0)

cassandra.native_transport_port=port

Sets the port on which the CQL native transport listens for clients. (Default: 9042)

cassandra.partitioner=partitioner

Sets the partitioner. (Default: org.apache.cassandra.dht.Murmur3Partitioner)

cassandra.replace_address=listen_address or broadcast_address of dead node

To replace a node that has died, restart a new node in its place specifying the listen_address
or broadcast_address that the new node is assuming. The new node must be in the same
state as before bootstrapping, without any data in its data directory.

cassandra.replayList=table

Allows restoring specific tables from an archived commit log.

cassandra.ring_delay_ms=ms

Defines the amount of time a node waits to hear from other nodes before formally joining
the ring. (Default: 30000ms)

cassandra.rpc_port=port

Sets the port for the Thrift RPC service, which is used for client connections. (Default: 9160).

cassandra.ssl_storage_port=port

Sets the SSL port for encrypted communication. (Default: 7001)

cassandra.start_native_transport=true | false

Enables or disables the native transport server. See start_native_transport in cassandra.yaml.


(Default: true)

cassandra.start_rpc=true | false

Enables or disables the Thrift RPC server. (Default: true)

cassandra.storage_port=port

Sets the port for inter-node communication. (Default: 7000)

cassandra.triggers_dir=directory

Sets the default location for the triggers JARs.

cassandra.write_survey=true

Enables a tool for testing new compaction and compression strategies to experiment with
different strategies and benchmark write performance differences without affecting the
production workload. See Testing compaction and compression.

consistent.rangemovement=true

When set to true, enables consistent range movement during bootstrap.

Tip: You can also add options such as maximum and minimum heap size to the cassandra-
env.sh file to pass them to the Java virtual machine at startup, rather than setting them in the
environment.

Example

Clearing gossip state when starting a node:

Command line: $ bin/cassandra -Dcassandra.load_ring_state=false

cassandra-env.sh: JVM_OPTS="$JVM_OPTS -Dcassandra.load_ring_state=false"

Example

Starting a Cassandra node without joining the ring:

Command line: bin/dse cassandra -Dcassandra.join_ring=false #Starts Cassandra

cassandra-env.sh: JVM_OPTS="$JVM_OPTS -Dcassandra.join_ring=false"

Example

Replacing a dead node:

Command line:

bin/dse cassandra -Dcassandra.replace_address=10.91.176.160 #Starts Cassandra

cassandra-env.sh: JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=10.91.176.160"

MongoDB

MongoDB is an open-source document-based database management tool that stores data
in JSON-like formats. It is a highly scalable, flexible, and distributed NoSQL database.

With the rise in data all around the world, there has been growing interest in non-relational
databases, also known as 'NoSQL'. Businesses and organisations are seeking new methods to
manage the flood of data and are drawn toward alternative database management tools and
systems that differ from traditional relational database systems. This is where MongoDB
comes into the picture.

Being a NoSQL tool means that it does not use the rows and columns associated with
relational database management. Its architecture is built on collections and documents. The
basic unit of data in this database is a set of key–value pairs. It allows documents to have
different fields and structures. MongoDB uses a document storage format called BSON,
which is a binary representation of JSON documents. The data model that MongoDB follows
is highly elastic, letting you combine and store data of multivariate types without having to
compromise on the powerful indexing options, data access, and validation rules. There is no
downtime when you want to dynamically modify the schemas. This means you can
concentrate more on making your data work harder rather than spending time preparing
the data for the database.

Architecture of MongoDB NoSQL Database

Database: In simple words, it can be called the physical container for data. Each of the
databases has its own set of files on the file system with multiple databases existing on a
single MongoDB server.

Collection: A group of database documents can be called a collection. The RDBMS
equivalent to a collection is a table. The entire collection exists within a single database.
There are no schemas when it comes to collections. Inside the collection, various documents
can have varied fields, but mostly the documents within a collection are meant for the same
purpose or for serving the same end goal.

Document: A set of key–value pairs can be designated as a document. Documents are
associated with dynamic schemas. The benefit of having dynamic schemas is that a
document in a single collection does not have to possess the same structure or fields. Also,
the common fields in a collection’s document can have varied types of data.
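As a minimal sketch of this document model (MongoDB Java driver; the database name, collection name, and fields are made up), two documents with different fields can live in the same collection:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

public class DocumentModelSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://127.0.0.1:27017")) {
            MongoDatabase db = client.getDatabase("test");                      // database
            MongoCollection<Document> cyclists = db.getCollection("cyclists");  // collection

            // Documents in the same collection need not share the same fields (dynamic schema).
            cyclists.insertOne(new Document("lastname", "Matthews").append("age", 28));
            cyclists.insertOne(new Document("lastname", "Frame")
                    .append("team", "Alpha")
                    .append("wins", 3));
        }
    }
}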

What makes it different from RDBMS?

You can directly compare the MongoDB NoSQL with the RDBMS and map the varied
terminologies in the two systems: The RDBMS table is a MongoDB collection, the column is a
field, the tuple/row is a document, and the table join is an embedded document. The typical
schema of a relational database shows the number of tables and the relationship between
the tables, but MongoDB does not follow the concept of relationship.

Go through the following table to understand how exactly an expert NoSQL database like
MongoDB differs from RDBMS.

MongoDB | RDBMS
Document-oriented and non-relational database | Relational database
Document based | Row based
Field based | Column based
Collection based and key–value pair | Table based
Gives a JavaScript client for querying | Doesn't give a JavaScript client for querying
Relatively easy to set up | Comparatively not that easy to set up
Unaffected by SQL injection | Quite vulnerable to SQL injection
Has a dynamic schema, ideal for hierarchical data storage | Has a predefined schema, not good for hierarchical data storage
100 times faster and horizontally scalable through sharding | Scales vertically by increasing RAM

Important Features of MongoDB

 Queries: It supports ad-hoc queries and document-based queries.


 Index Support: Any field in the document can be indexed.
 Replication: It supports master–slave replication. MongoDB uses native replication
to maintain multiple copies of data. Preventing database downtime is one of the
replica set's features, as a replica set is self-healing.

 Multiple Servers: The database can run over multiple servers. Data is duplicated to
foolproof the system in the case of hardware failure.
 Auto-sharding: This process distributes data across multiple physical partitions
called shards. Due to sharding, MongoDB has an automatic load balancing feature.
 MapReduce: It supports MapReduce and flexible aggregation tools.
 Failure Handling: In MongoDB, it’s easy to cope with cases of failures. Huge
numbers of replicas give out increased protection and data availability against
database downtime like rack failures, multiple machine failures, and data center
failures, or even network partitions.
 GridFS: Without complicating your stack, any sizes of files can be stored. GridFS
feature divides files into smaller parts and stores them as separate documents.
 Schema-less Database: It is a schema-less database written in C++.
 Document-oriented Storage: It uses BSON format which is a JSON-like format.
 Procedures: Instead of stored procedures, MongoDB lets you write JavaScript
functions that the database executes.

Why do you need MongoDB technology?

This technology overcame one of the biggest pitfalls of the traditional database systems, that
is, scalability. With the ever evolving needs of businesses, their database systems also
needed to be upgraded. MongoDB has exceptional scalability. It makes it easy to fetch the
data and provides continuous and automatic integration. Along with these benefits, there are
multiple reasons why you need MongoDB:

 No downtime while the application is being scaled


 Performs in-memory processing
 Text search
 Graph processing
 Global replication
 Economical

MongoDB is meeting the business requirements. Here is how:

 MongoDB provides the right mix of technology and data for competitive advantage.
 It is most suited for mission-critical applications since it considerably reduces risks.
 It accelerates the time to value (TTV) and lowers the total cost of ownership.
 It enables applications that are just not possible with traditional relational databases.

Benefits of MongoDB:

Distributed Data Platform: MongoDB can be run across geographically distributed data
centers and cloud regions, ensuring new levels of availability and scalability. With no
downtime and without changing your application, MongoDB scales elastically in terms of
data volume and throughput. The technology gives you enough flexibility across various
data centers with good consistency.

Fast and Iterative Development: Changing business requirements will no longer affect
successful project delivery in your enterprise. A flexible data model with dynamic schema,
and with powerful GUI and command line tools, makes it fast for developers to build and
evolve applications. Automated provisioning enables continuous integration and delivery for
productive operations. Static relational schemas and complex operations of RDBMS are now
something from the past.

Flexible Data Model: MongoDB stores data in flexible JSON-like documents, which makes
data persistence and combining easy. The objects in your application code are mapped to the
document model, due to which working with data becomes easy. Needless to say that
schema governance controls, data access, complex aggregations, and rich indexing
functionality are not compromised in any way. Without downtime, one can modify the
schema dynamically. Due to this flexibility, a developer needs to worry less about data
manipulation.

Reduced TCO (Total Cost of Ownership): Application developers can do their job way
better when MongoDB is used. The operations team also can perform their job well, thanks

to the Atlas Cloud service. Costs are significantly lowered as MongoDB runs on commodity
hardware. The technology gives out on-demand, pay-as-you-go pricing with annual
subscriptions, along with 24/7 global support.

Integrated Feature Set: Analytics and data visualization, event-driven streaming data
pipelines, text and geospatial search, graph processing, in-memory performance, and global
replication let you build a variety of real-time applications reliably and securely. For an
RDBMS to accomplish the same, additional complex technologies are required, along with
separate integration requirements.

Long-term Commitment: You would be staggered to know about the development of this
technology. It has garnered over 30 million downloads, 4,900 customers, and over 1,000
partners. If you include this technology in your firm, then you can be sure that your
investment is in the right place.

MongoDB does not support the SQL language for obvious reasons. MongoDB's querying style
is dynamic on documents: it is a document-based query language that can be as utilitarian
as SQL. MongoDB is easy to scale, and there is no need to convert or map application
objects to database objects. It uses internal memory to store the working set, providing
faster access to data.

Frequently Used Commands in MongoDB

Database Creation

 MongoDB doesn't have a separate command to create a database. It automatically creates a
database when you save values into the defined collection for the first time. The
following command will create a database named 'database_name' if it doesn't exist.
If it does exist, then it will be selected.
 Command: use database_name

Dropping Databases

 The following command is used to drop a database, along with its associated files.
This command acts on the current database.
 Command: db.dropDatabase()

Creating a Collection

 MongoDB uses the following command to create a collection. Normally, this is not
required as MongoDB automatically creates collections when some documents are
inserted.
 Command: db.createCollection(name, options)
 Name: The string type which specifies the name of the collection to be created
 Options: The document type which specifies the memory size and the indexing of
the collection. It is an optional parameter.

Showing Collections

 When MongoDB runs the following command, it will display all the collections in the
server.
 Command: In shell you can type: db.getCollectionNames()

$in Operator

 The $in operator selects those documents where the value of a field is equal to the
value in the specified array. To use the $in expression, use the following prototype:
 Command: { field: { $in: [<value1>, <value2>, … <valueN> ] } }

Projection

 Often you need only specific parts of the database rather than the whole
database. The find() method displays all fields of a document. To restrict the output,
you set a list of fields with value 1 or 0: 1 is used to show a field and 0 is used to hide
it. This ensures that only the fields with value 1 are selected. Among MongoDB query
examples, projection is defined by the following query.
 Command: db.COLLECTION_NAME.find({},{KEY:1})

Date Operator

 This command is used to denote time.


 Command:
Date() – It returns the current date as a string.
new Date() – It returns the current date as a date object.

$not Operator

 $not does a logical NOT operation on the specified <operator-expression> and


selects only those documents that don’t match the <operator-expression>. This
includes documents that do not contain the field.
 Command: { field: { $not: { <operator-expression> } } }

Delete Commands


 Following are commands which explain MongoDB's delete capabilities.
 Commands:
db.collection.remove() – It deletes documents that match the specified filter.
db.collection.deleteOne() – It deletes at most a single document, even if the
filter matches more than one document.
db.collection.deleteMany() – It deletes all the documents that match the
specified filter.

Where Command

 To pass either a string which has a JavaScript expression or a full JavaScript function
to the query system, the following operator can be used.

 Command: $where

The forEach Command

 A JavaScript function is applied to each document from the cursor while iterating the
cursor (see the sketch after this list for Java driver equivalents of several of these commands).
 Command: cursor.forEach(function)
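The commands above are mongo shell syntax. A hedged sketch of a few equivalents with the MongoDB Java driver follows; the database, collection, and field names are hypothetical.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Projections;
import org.bson.Document;

public class ShellEquivalents {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://127.0.0.1:27017")) {
            MongoCollection<Document> coll =
                    client.getDatabase("test").getCollection("cyclists");

            // $in plus projection: match age 25, 28 or 30 and return only lastname,
            // then iterate the cursor (the shell's cursor.forEach equivalent).
            for (Document doc : coll.find(Filters.in("age", 25, 28, 30))
                                    .projection(Projections.include("lastname"))) {
                System.out.println(doc.toJson());
            }

            // deleteMany: remove every document that matches the filter.
            coll.deleteMany(Filters.eq("lastname", "Frame"));
        }
    }
}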

Where can you use MongoDB NoSQL database?

The MongoDB NoSQL database can be extensively used for Big Data and Hadoop
applications for working with humongous amounts of NoSQL data that is a major portion of
Big Data. MongoDB and SQL databases are all database systems, but what sets them apart is
their efficiency in today's world. MongoDB can also be successfully deployed for social media
and mobile applications to parse streaming information that arrives in unstructured form.
Content management and delivery also make extensive use of the MongoDB NoSQL
database. Other domains are user data management and data hubs.

Some of the biggest companies on earth are successfully deploying MongoDB, with over half
of the Fortune 100 companies already being customers of this incredible NoSQL database
system. It has a very vibrant ecosystem with over 100 partners and huge investor interest,
with money being poured into the technology relentlessly.

One of the biggest insurance companies on earth, MetLife, is extensively using MongoDB for
its customer service applications; the online classifieds search portal Craigslist is deeply
involved in archiving its data using MongoDB. One of the most hailed brands in the media
industry, The New York Times, is using MongoDB for its photo submissions and the
application that is deployed for form-building. Finally, the extent of MongoDB's
dominance can be gauged by the fact that the world's premier scientific endeavor,
spearheaded by the CERN physics laboratory, is extensively using MongoDB for its data
aggregation and data discovery applications.

How will this technology help you in your career growth?

 MongoDB is the most widely used NoSQL database application – InfoWorld
 A MongoDB Database Administrator in the United States can earn up to $129,000 per
annum – Indeed
 Hadoop and NoSQL markets are expected to reach $3.3 billion within the next two
years – Wikibon

MongoDB is a very useful NoSQL database that is being used by some of the biggest
corporations in the world. Its powerful features give enterprises a never-before-seen ability
to parse all their unstructured data. Because of this, professionals who are qualified and
certified in working with the basics and the advanced levels of the MongoDB tool can expect
to see their careers soar at a tremendous pace. Due to its versatile and scalable nature,
MongoDB can be used for datasets like social media, videos, and so on. MongoDB clients
and users rarely feel the need for any other kind of database.

Cassandra Versus MongoDB

Today, in a world where your existence is judged by your online presence, the amount of
data is increasing every day, and its storage and management have become major issues.
Data scientists are working hard and inventing newer techniques for handling such big data
every single day. Social media, IT industries, and every other use of the internet collectively
add to the volume of data. The traditional ways of storing data cannot keep pace with the
amount being generated, and rows and columns alone are not enough to take care of the
continuously growing data, because most of the data generated is unstructured.

As the size of data keeps growing, conventional databases have to be replaced with newer
and more advanced storage tools. NoSQL and Hadoop are fast-growing technologies that
companies use for the storage and management of their data. Although Hadoop gets more
recognition for data storage, various surveys show that in practice NoSQL is often the better
and more advanced choice. In this article, I will discuss two NoSQL databases, MongoDB and
Cassandra.

MongoDB

MongoDB is a reasonable choice for a large number of applications. Its behavior and
performance are similar to those of conventional storage systems, so it is quite easy and
comfortable to use. This database is quite elastic and expandable, which makes it
user-friendly and helpful to users on the network. Because of this ease of use, MongoDB is
popular among engineers, who take very little time to start working with it. It has a
master and slave architecture.

When we use MongoDB, we use the same data model in both the database and the
application code, so it requires no complicated mapping layer. As a result, it is very
simple to use, which makes it immensely popular among users.

Going with MongoDB is rarely difficult, because companies that know this tool can get a
return on their investment while staying reliant on only a few databases. It is prepared for
use in online transaction processing. It performs well and solves complicated situations, but
it still cannot be regarded as perfect: it does not help with complicated transactions.

Advantages of MongoDB

 Scalability
 Flexibility
 User-friendly
 No concept of rows and columns
 No re-establishment of indexing

Disadvantages of MongoDB

 Memory is not expandable


 Joins can be done only through multiple queries
 No transactions can be done

Cassandra

MongoDB is popular due to its ease of use, but Cassandra is popular for its ease of
management, even at scale. When users need to make conventional data more dependable
and faster, they move closer towards Cassandra. Cassandra has a structural design in which
the total amount of space can be stretched by adding nodes, each contributing its own
resources. It supports multiple data centers working together. With a masterless
architecture, Cassandra offers great performance through its scalability, fast writes, and
efficient handling of queries.

Deploying newer technologies becomes very simple and comfortable once you know the
internals of Cassandra. Training in Cassandra is a question of only a few hours, and proper
training and certification lead to deep understanding and an ocean of opportunities. Once
you are completely familiar with the Cassandra data model and how it works, you can
successfully develop Cassandra applications.

Advantages of Cassandra

 Free of cost
 Peer-to-peer structural design
 Elasticity
 Fault tolerance
 High performance
 Column based
 Tunable consistency

Disadvantages of Cassandra

 No support for data integration
 No streaming of blob values
 No cursor support
 Large result sets must be manually paged

This was about a few of the differences between the two databases – Cassandra and
MongoDB. If you have any other points that I may have left out, please do share them by
writing them down in the comment box.

Hadoop Connector

The MongoDB Hadoop Adapter is a plugin for Hadoop that provides Hadoop the ability to
use MongoDB as an input source and/or an output source.

Installation

The MongoDB Hadoop Adapter uses the SBT build tool for compilation. SBT provides
superior support for discrete configurations targeting multiple Hadoop versions. The
distribution includes a self-bootstrapping copy of SBT as sbt. Build the jar files using the
following command:

./sbt package

The MongoDB Hadoop Adapter supports a number of Hadoop releases. You can change the
Hadoop version supported by the build by modifying the value of hadoopRelease in
the build.sbt file. For instance, setting this value to:

hadoopRelease in ThisBuild := "cdh3"

configures a build against Cloudera CDH3u3.

While:

hadoopRelease in ThisBuild := "0.21"

configures a build against Hadoop 0.21 from the mainline Apache distribution.

After building, you will need to place the “core” jar and the mongo-java-driver in
the lib directory of each Hadoop server.

Getting Started with Hadoop

MongoDB and Hadoop are a powerful combination and can be used together to deliver
complex analytics and data processing for data stored in MongoDB. The following guide
shows how you can start working with the MongoDB-Hadoop adapter. Once you become
familiar with the adapter, you can use it to pull your MongoDB data into Hadoop Map-
Reduce jobs, process the data and return results back to a MongoDB collection.

MongoDB

The latest version of MongoDB should be installed and running. In addition, the MongoDB
commands should be in your $PATH.

Miscellaneous
In addition to Hadoop, you should also have git and JDK 1.6 installed.

Building MongoDB Adapter

The MongoDB-Hadoop adapter source is available on github. First, clone the repository and
get the release-1.0 branch:

git clone https://github.com/mongodb/mongo-hadoop.git

git checkout release-1.0

Now, edit build.sbt and update the build target in hadoopRelease in ThisBuild. In this
example, we're using the CDH3 Hadoop distribution from Cloudera so I'll set it as follows:

hadoopRelease in ThisBuild := "cdh3"

To build the adapter, use the self-bootstrapping version of sbt that ships with the
MongoDB-Hadoop adapter:

./sbt package

Once the adapter is built, you will need to copy it and the latest stable version of
the MongoDB Java driver to your $HADOOP_HOME/lib directory. For example, if you have
Hadoop installed in /usr/lib/hadoop:

wget --no-check-certificate https://github.com/downloads/mongodb/mongo-java-driver/mongo-2.7.3.jar

cp mongo-2.7.3.jar /usr/lib/hadoop/lib/

cp core/target/mongo-hadoop-core_cdh3u3-1.0.0.jar /usr/lib/hadoop/lib/

Examples

Load Sample Data


The MongoDB-Hadoop adapter ships with a few examples of how to use the adapter in your
own setup. In this guide, we’ll focus on the UFO Sightings and Treasury Yield examples. To
get started, first load the sample data for these examples:

./sbt load-sample-data

To confirm that the sample data was loaded, start the mongo client and look for
the mongo_hadoop database and be sure that it contains
the ufo_sightings.in and yield_historical.in collections:

$ mongo
MongoDB shell version: 2.0.5
connecting to: test
> show dbs
mongo_hadoop 0.453125GB
> use mongo_hadoop
switched to db mongo_hadoop
> show collections
system.indexes
ufo_sightings.in
yield_historical.in

Treasury Yield
To build the Treasury Yield example, we'll need to first edit one of the configuration files used
by the example code:

emacs examples/treasury_yield/src/main/resources/mongo-treasury_yield.xml

and set the MongoDB location for the input (mongo.input.uri) and output
(mongo.output.uri ) collections (in this example, Hadoop is running on a single node
alongside MongoDB):

...
<property>
<!-- If you are reading from mongo, the URI -->
<name>mongo.input.uri</name>
<value>mongodb://127.0.0.1/mongo_hadoop.yield_historical.in</value>
</property>
<property>
<!-- If you are writing to mongo, the URI -->
<name>mongo.output.uri</name>
<value>mongodb://127.0.0.1/mongo_hadoop.yield_historical.out</value>
</property>
...
Next, edit the main class that we’ll use for our MapReduce job
(TreasuryYieldXMLConfig.java):

emacs
examples/treasury_yield/src/main/java/com/mongodb/hadoop/examples/treasury/Tre
asuryYieldXMLConfig.java

and update the class definition as follows:

...
public class TreasuryYieldXMLConfig extends MongoTool {

static{

// Load the XML config defined in hadoop-local.xml
// Configuration.addDefaultResource( "hadoop-local.xml" );
Configuration.addDefaultResource( "mongo-defaults.xml" );
Configuration.addDefaultResource( "mongo-treasury_yield.xml" );
}

public static void main( final String[] pArgs ) throws Exception{


System.exit( ToolRunner.run( new TreasuryYieldXMLConfig(), pArgs ) );
}
}
...
Now let’s build the Treasury Yield example:

./sbt treasury-example/package

Once the example is done building we can submit our MapReduce job:

hadoop jar examples/treasury_yield/target/treasury-example_cdh3u3-1.0.0.jar com.mongodb.hadoop.examples.treasury.TreasuryYieldXMLConfig

This job should only take a few moments as it’s a relatively small amount of data. Now check
the output collection data in MongoDB to confirm that the MapReduce job was successful:

$ mongo
MongoDB shell version: 2.0.5
connecting to: test
> use mongo_hadoop
switched to db mongo_hadoop
> db.yield_historical.out.find()
{ "_id" : 1990, "value" : 8.552400000000002 }
{ "_id" : 1991, "value" : 7.8623600000000025 }
{ "_id" : 1992, "value" : 7.008844621513946 }
{ "_id" : 1993, "value" : 5.866279999999999 }
{ "_id" : 1994, "value" : 7.085180722891565 }
{ "_id" : 1995, "value" : 6.573920000000002 }
{ "_id" : 1996, "value" : 6.443531746031742 }
{ "_id" : 1997, "value" : 6.353959999999992 }
{ "_id" : 1998, "value" : 5.262879999999994 }
{ "_id" : 1999, "value" : 5.646135458167332 }
{ "_id" : 2000, "value" : 6.030278884462145 }
{ "_id" : 2001, "value" : 5.02068548387097 }
{ "_id" : 2002, "value" : 4.61308 }
{ "_id" : 2003, "value" : 4.013879999999999 }
{ "_id" : 2004, "value" : 4.271320000000004 }
{ "_id" : 2005, "value" : 4.288880000000001 }
{ "_id" : 2006, "value" : 4.7949999999999955 }
{ "_id" : 2007, "value" : 4.634661354581674 }

{ "_id" : 2008, "value" : 3.6642629482071714 }
{ "_id" : 2009, "value" : 3.2641200000000037 }
has more
>

UFO Sightings

This will follow much of the same process as with the Treasury Yield example with one extra
step; we’ll need to add an entry into the build file to compile this example. First, open the file
for editing:

emacs project/MongoHadoopBuild.scala

Next, add the following lines starting at line 72 in the build file:

...
lazy val ufoExample = Project( id = "ufo-sightings",
base = file("examples/ufo_sightings"),
settings = exampleSettings ) dependsOn ( core )
...
Now edit the UFO Sightings config file:

emacs examples/ufo_sightings/src/main/resources/mongo-ufo_sightings.xml

and update the mongo.input.uri and mongo.output.uri properties:

...
<property>
<!-- If you are reading from mongo, the URI -->
<name>mongo.input.uri</name>
<value>mongodb://127.0.0.1/mongo_hadoop.ufo_sightings.in</value>
</property>
<property>
<!-- If you are writing to mongo, the URI -->
<name>mongo.output.uri</name>
<value>mongodb://127.0.0.1/mongo_hadoop.ufo_sightings.out</value>
</property>
...
Next edit the main class for the MapReduce job in UfoSightingsXMLConfig.java to use the
configuration file:

emacs
examples/ufo_sightings/src/main/java/com/mongodb/hadoop/examples/ufos/UfoSigh
tingsXMLConfig.java
...
public class UfoSightingsXMLConfig extends MongoTool {

static{
// Load the XML config defined in hadoop-local.xml
// Configuration.addDefaultResource( "hadoop-local.xml" );
Configuration.addDefaultResource( "mongo-defaults.xml" );
Configuration.addDefaultResource( "mongo-ufo_sightings.xml" );
}

public static void main( final String[] pArgs ) throws Exception{


System.exit( ToolRunner.run( new UfoSightingsXMLConfig(), pArgs ) );
}
}
...
Now build the UFO Sightings example:

./sbt ufo-sightings/package

Once the example is built, execute the MapReduce job:

hadoop jar examples/ufo_sightings/target/ufo-sightings_cdh3u3-1.0.0.jar com.mongodb.hadoop.examples.UfoSightingsXMLConfig

This MapReduce job will take just a bit longer than the Treasury Yield example. Once it’s
complete, check the output collection in MongoDB to see that the job was successful:

$ mongo
MongoDB shell version: 2.0.5
connecting to: test
> use mongo_hadoop
switched to db mongo_hadoop
> db.ufo_sightings.out.find().count()
21850

Hadoop and MongoDB Use Cases

The following are some example deployments with MongoDB and Hadoop. The goal is to
provide a high-level description of how MongoDB and Hadoop can fit together in a typical
Big Data stack. In each of the following examples MongoDB is used as the “operational” real-
time data store and Hadoop is used for offline batch data processing and analysis.

Batch Aggregation

In several scenarios the built-in aggregation functionality provided by MongoDB is sufficient
for analyzing your data. However, in certain cases, significantly more complex data
aggregation may be necessary. This is where Hadoop can provide a powerful framework for
complex analytics.

In this scenario data is pulled from MongoDB and processed within Hadoop via one or more
Map-Reduce jobs. Data may also be brought in from additional sources within these Map-
Reduce jobs to develop a multi-datasource solution. Output from these Map-Reduce jobs
can then be written back to MongoDB for later querying and ad-hoc analysis. Applications
built on top of MongoDB can now use the information from the batch analytics to present to
the end user or to drive other downstream features.

Data Warehouse

In a typical production scenario, your application’s data may live in multiple datastores, each
with their own query language and functionality. To reduce complexity in these scenarios,
Hadoop can be used as a data warehouse and act as a centralized repository for data from
the various sources.

In this situation, you could have periodic Map-Reduce jobs that load data from MongoDB
into Hadoop. This could be in the form of “daily” or “weekly” data loads pulled from
MongoDB via Map-Reduce. Once the data from MongoDB is available from within Hadoop,
and data from other sources is also available, the larger dataset can be queried
against. Data analysts now have the option of using either Map-Reduce or Pig to create jobs
that query the larger datasets that incorporate data from MongoDB.

ETL Data

MongoDB may be the operational datastore for your application but there may also be other
datastores that are holding your organization’s data. In this scenario it is useful to be able to

move data from one datastore to another, either from your application’s data to another
database or vice versa. Moving the data is much more complex than simply piping it from
one mechanism to another, which is where Hadoop can be used.

In this scenario, Map-Reduce jobs are used to extract, transform and load data from one
store to another. Hadoop can act as a complex ETL mechanism to migrate data in various
forms via one or more Map-Reduce jobs that pull the data from one store, apply multiple
transformations (applying new data layouts or other aggregation) and loading the data to
another store. This approach can be used to move data from or to MongoDB, depending on
the desired result.

Hadoop Security Overview

Security is essential for organizations that store and process sensitive data in the Hadoop
ecosystem. Many organizations must adhere to strict corporate security policies. Hadoop is a
distributed framework used for data storage and large-scale processing on clusters using
commodity servers. Adding security to Hadoop is challenging because not all of the
interactions follow the classic client-server pattern.

 In Hadoop, the file system is partitioned and distributed, requiring authorization
checks at multiple points.
 A submitted job is executed at a later time on nodes different than the node on
which the client authenticated and submitted the job.
 Secondary services such as a workflow system access Hadoop on behalf of users.
 A Hadoop cluster scales to thousands of servers and tens of thousands of concurrent
tasks.
A Hadoop-powered "Data Lake" can provide a robust foundation for a new generation of Big
Data analytics and insight, but can also increase the number of access points to an
organization's data. As diverse types of enterprise data are pulled together into a central
repository, the inherent security risks can increase.

Hortonworks understands the importance of security and governance for every business. To
ensure effective protection for its customers, Hortonworks uses a holistic approach based on
five core security features:

 Administration
 Authentication and perimeter security
 Authorization
 Audit
 Data protection

This chapter provides an overview of the security features implemented in the Hortonworks
Data Platform (HDP). Subsequent chapters in this guide provide more details on each of
these security features.

Understanding Data Lake Security

The successful Hadoop journey typically starts with data architecture optimization or new
advanced analytic applications, which leads to the formation of what is known as a Data
Lake. To prevent damage to the company’s business, customers, finances, and reputation, a
Data Lake should meet the same high standards of security as any legacy data environment.

The general consensus in nearly every industry is that data is an essential new driver of
competitive advantage. Hadoop plays a critical role in the modern data architecture by
providing low-cost, large-scale data storage and processing. The successful Hadoop journey
typically starts with data architecture optimization or new advanced analytic applications,
which leads to the formation of what is known as a Data Lake. As new and existing types of
data from machine sensors, server logs, clickstream data, and other sources flow into the

Data Lake, it serves as a central repository based on shared Hadoop services that power
deep organizational insights across a broad and diverse set of data.

The need to protect the Data Lake with comprehensive security is clear. As large and growing
volumes of diverse data are channeled into the Data Lake, it will store vital and often highly
sensitive business data. However, the external ecosystem of data and operational systems
feeding the Data Lake is highly dynamic and can introduce new security threats on a regular
basis. Users across multiple business units can access the Data Lake freely and refine,
explore, and enrich its data, using methods of their own choosing, further increasing the risk
of a breach. Any breach of this enterprise-wide data can result in catastrophic consequences:
privacy violations, regulatory infractions, or the compromise of vital corporate intelligence.
To prevent damage to the company’s business, customers, finances, and reputation, a Data
Lake should meet the same high standards of security as any legacy data environment.

Piecemeal protections are no more effective for a Data Lake than they would be in a
traditional repository. Effective Hadoop security depends on a holistic approach that revolves
around five pillars of security: administration, authentication and perimeter security,
authorization, auditing, and data protection.

Security administrators must address questions and provide enterprise-grade coverage
across each of these areas as they design the infrastructure to secure data in Hadoop. If any
of these pillars is vulnerable, it becomes a risk factor in the company’s Big Data environment.
A Hadoop security strategy must address all five pillars, with a consistent implementation
approach to ensure effectiveness.

You cannot achieve comprehensive protection across the Hadoop stack by using an
assortment of point solutions. Security must be an integral part of the platform on which
your Data Lake is built. This bottom-up approach makes it possible to enforce and manage
security across the stack through a central point of administration, thereby preventing gaps
and inconsistencies. This approach is especially important for Hadoop implementations in
which new applications or data engines are always emerging in the form of new Open
Source projects — a dynamic scenario that can quickly exacerbate any vulnerability.

Hortonworks helps customers maintain high levels of protection for enterprise data by
building centralized security administration and management into the infrastructure of the
Hortonworks Data Platform. HDP provides an enterprise-ready data platform with rich

capabilities spanning security, governance, and operations. HDP includes powerful data
security functionality that works across component technologies and integrates with
preexisting EDW, RDBMS, and MPP systems. By implementing security at the platform level,
Hortonworks ensures that security is consistently administered to all of the applications
across the stack, simplifying the process of adding or removing Hadoop applications.

Hadoop Security Features

HDP uses Apache Ranger to provide centralized security administration and management.
The Ranger Administration Portal is the central interface for security administration. You can
use Ranger to create and update policies, which are then stored in a policy database.

Ranger plug-ins (lightweight Java programs) are embedded within the processes of each
cluster component. For example, the Ranger plug-in for Apache Hive is embedded within
HiveServer2:

These plug-ins pull policies from a central server and store them locally in a file. When a user
request comes through the component, these plug-ins intercept the request and evaluate it
against the security policy. Plug-ins also collect data from the user request and follow a
separate thread to send this data back to the audit server.
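As a hedged illustration of what the Ranger Administration Portal does behind the scenes, the sketch below creates a simple Hive table policy through Ranger's public REST API from Python. The host name, credentials, service name, and resource names are placeholders, and the exact payload fields can vary between Ranger versions, so treat this as a minimal sketch rather than a definitive recipe.

Python

# Minimal sketch: creating a Hive table policy through the Ranger Admin REST API.
# Host, credentials, service name, and resource names below are placeholders.
import json
import requests

RANGER_URL = "http://ranger-admin.example.com:6080"   # assumed Ranger Admin address
AUTH = ("admin", "admin-password")                     # assumed credentials

policy = {
    "service": "cl1_hive",                 # name of the Hive service repository in Ranger
    "name": "sales_read_only",
    "isEnabled": True,
    "resources": {
        "database": {"values": ["sales"]},
        "table":    {"values": ["transactions"]},
        "column":   {"values": ["*"]},
    },
    "policyItems": [{
        "groups":   ["analysts"],                           # LDAP group granted access
        "accesses": [{"type": "select", "isAllowed": True}]
    }],
}

resp = requests.post(
    f"{RANGER_URL}/service/public/v2/api/policy",
    auth=AUTH,
    headers={"Content-Type": "application/json"},
    data=json.dumps(policy),
)
resp.raise_for_status()
print("Created policy id:", resp.json().get("id"))

Once the policy is stored, the Ranger plug-ins described above pull it from the policy database and enforce it inside each component's process.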

Security Administration

To deliver consistent security administration and management, Hadoop administrators


require a centralized user interface they can use to define, administer and manage security
policies consistently across all of the Hadoop stack components.
The Apache Ranger administration console provides a central point of administration for the
other four pillars of Hadoop security.

Authentication and Secure Gateway

Establishing user identity with strong authentication is the basis for secure access in Hadoop.
Users need to reliably identify themselves and then have that identity propagated
throughout the Hadoop cluster to access cluster resources. Hortonworks uses Kerberos for
authentication. Kerberos is an industry standard used to authenticate users and resources
within a Hadoop cluster. HDP also includes Ambari, which simplifies Kerberos setup,
configuration, and maintenance.

Apache Knox Gateway is used to help ensure perimeter security for Hortonworks customers.
With Knox, enterprises can confidently extend the Hadoop REST API to new users without
Kerberos complexities, while also maintaining compliance with enterprise security policies.
Knox provides a central gateway for Hadoop REST APIs that have varying degrees of
authorization, authentication, SSL, and SSO capabilities to enable a single access point for
Hadoop.
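To make the idea of a single REST access point concrete, here is a minimal sketch of calling WebHDFS through a Knox gateway from Python. The gateway address, topology name ("default"), path, and credentials are assumptions for illustration; in a real deployment you would use the topology and LDAP accounts configured for your cluster.

Python

# Minimal sketch: listing an HDFS directory through the Knox gateway's WebHDFS proxy.
# The gateway host, topology name, and credentials are placeholders.
import requests

KNOX_URL = "https://knox.example.com:8443/gateway/default"  # assumed Knox address and topology
AUTH = ("guest", "guest-password")                           # LDAP credentials checked by Knox

resp = requests.get(
    f"{KNOX_URL}/webhdfs/v1/tmp",
    params={"op": "LISTSTATUS"},
    auth=AUTH,
    verify=False,   # for demo only; use the gateway's TLS certificate in production
)
resp.raise_for_status()
for entry in resp.json()["FileStatuses"]["FileStatus"]:
    print(entry["pathSuffix"], entry["type"])

Because the request goes through Knox, the client needs no Kerberos configuration of its own; Knox authenticates the caller and forwards the request into the cluster.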

Authorization

Ranger manages access control through a user interface that ensures consistent policy
administration across Hadoop data access components. Security administrators can define
security policies at the database, table, column, and file levels, and can administer
permissions for specific LDAP-based groups or individual users. Rules based on dynamic
conditions such as time or geolocation can also be added to an existing policy rule. The
Ranger authorization model is pluggable and can be easily extended to any data source
using a service-based definition.

Administrators can use Ranger to define a centralized security policy for the following
Hadoop components:

 HDFS
 YARN
 Hive
 HBase
 Storm
 Knox
 Solr
 Kafka

Ranger works with standard authorization APIs in each Hadoop component and can enforce
centrally administered policies for any method used to access the Data Lake.

Ranger provides administrators with the deep visibility into the security administration
process that is required for auditing. The combination of a rich user interface and deep audit
visibility makes Ranger highly intuitive to use, enhancing productivity for security
administrators.

Audit

As customers deploy Hadoop into corporate data and processing environments, metadata
and data governance must be vital parts of any enterprise-ready data lake. For this reason,
Hortonworks established the Data Governance Initiative (DGI) with Aetna, Merck, Target, and
SAS to introduce a common approach to Hadoop data governance into the open source
community. This initiative has since evolved into a new open source project named Apache
Atlas. Apache Atlas is a set of core governance services that enables enterprises to meet their
compliance requirements within Hadoop, while also enabling integration with the complete
enterprise data ecosystem. These services include:

 Dataset search and lineage operations


 Metadata-driven data access control
 Indexed and searchable centralized auditing
 Data lifecycle management from ingestion to disposition
 Metadata interchange with other tools
Ranger also provides a centralized framework for collecting access audit history and
reporting this data, including filtering on various parameters. HDP enhances audit
information that is captured within different components within Hadoop and provides
insights through this centralized reporting capability.

Data Protection

The data protection feature makes data unreadable both in transit over the network and at
rest on a disk. HDP satisfies security and compliance requirements by combining transparent data encryption (TDE) for HDFS files with a Ranger-embedded open source Hadoop key management store (KMS). Ranger enables security administrators to
manage keys and authorization policies for KMS. Hortonworks is also working extensively
with its encryption partners to integrate HDFS encryption with enterprise-grade key
management frameworks.

Encryption in HDFS, combined with KMS access policies maintained by Ranger, prevents
rogue Linux or Hadoop administrators from accessing data, and supports segregation of
duties for both data access and encryption.
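As a rough sketch of how TDE and the KMS fit together operationally, the script below creates a key in the Hadoop KMS and then marks an HDFS directory as an encryption zone using the standard hadoop key and hdfs crypto commands. The key name and path are examples only, and the commands must be run by a user that the Ranger KMS policies allow to create keys.

Python

# Minimal sketch: creating an HDFS encryption zone backed by a KMS-managed key.
# Key name and path are examples; run as a user with the appropriate KMS and HDFS privileges.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Create an encryption key in the Hadoop KMS (its access policies can then be managed in Ranger).
run(["hadoop", "key", "create", "finance_key"])

# 2. Create the directory that will become the encryption zone.
run(["hdfs", "dfs", "-mkdir", "-p", "/secure/finance"])

# 3. Mark the directory as an encryption zone using the key.
run(["hdfs", "crypto", "-createZone", "-keyName", "finance_key", "-path", "/secure/finance"])

# Files written under /secure/finance are now transparently encrypted at rest.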

Dynamically Generating Knox Topology Files

Topology files can be dynamically generated from combinations of Provider Configurations


and Descriptors, which can be defined using the Knox Admin UI.

Prior to HDP 3.0, you set up Knox proxy by editing topology files manually. Topology files consisted of three things:

 Provider configurations: e.g., authentication, federation, authorization, identity assertion, etc.

 HA provider

 Services: component URLs you want to proxy

You configured each of these things in every topology file.

As of HDP 3.0, topology files are dynamically generated from combinations of Provider Configurations and Descriptors, defined using the Knox Admin UI. Additionally, these provider configurations and descriptors are now shared: you no longer have to specify configurations (e.g., authentication provider, identity assertion provider, or authorization provider) for each topology file. Instead, you define a Provider Configuration or Descriptor once and share it across all the topologies you choose. The Admin UI consists of three sections:

 Provider Configurations: A named set of providers, e.g., authentication, federation, authorization, identity assertion, etc. Provider configurations can be shared across descriptors/topologies.

 Descriptors: Reference the Provider Configurations to declare the policy (authentication, authorization, identity assertion, etc.) that goes along with proxying that cluster. Descriptors cannot be shared across topologies; Descriptors and topologies are 1-to-1.

 Topologies: Dynamically generated based on the Provider Configurations and


Descriptors you define.

However, the same topologies that were manageable in Ambari previously still are. Within the Knox Admin UI, the topologies that are managed by Ambari should be read-only. Within an Ambari-managed cluster, the Knox Admin UI is to be used for creating additional topologies. When a Knox instance is not managed by Ambari, all topology management is done via the Knox Admin UI.

Securing Access to Hadoop Cluster: Apache Knox

The Apache Knox Gateway (“Knox”) is a system to extend the reach of Apache™ Hadoop®
services to users outside of a Hadoop cluster without reducing Hadoop Security. Knox also
simplifies Hadoop security for users who access the cluster data and execute jobs. The Knox
Gateway is designed as a reverse proxy.

Establishing user identity with strong authentication is the basis for secure access in Hadoop.
Users need to reliably identify themselves and then have that identity propagated
throughout the Hadoop cluster to access cluster resources.

Layers of Defense for a Hadoop Cluster

 Authentication: Kerberos
Hortonworks uses Kerberos for authentication. Kerberos is an industry standard used to
authenticate users and resources within a Hadoop cluster. HDP also includes Ambari,
which simplifies Kerberos setup, configuration, and maintenance.
 Perimeter Level Security: Apache Knox
Apache Knox Gateway is used to help ensure perimeter security for Hortonworks
customers. With Knox, enterprises can confidently extend the Hadoop REST API to new
users without Kerberos complexities, while also maintaining compliance with enterprise
security policies. Knox provides a central gateway for Hadoop REST APIs that have varying
degrees of authorization, authentication, SSL, and SSO capabilities to enable a single
access point for Hadoop.
 Authorization: Ranger
 OS Security: Data Encryption and HDFS

Top 5 Big Data Vendors

Top five vendors offering Big Data Hadoop solutions are:

 Cloudera

 Amazon Web Services Elastic MapReduce Hadoop Distribution

 Microsoft

 MapR

 IBM InfoSphere Insights

Let’s get a fair idea about all these vendors.

Cloudera & Hortonworks

Cloudera ranks at the top of the Big Data vendors for making Hadoop a reliable Big Data platform. Cloudera has around 350+ paying customers, including the US Army, Allstate, and Monsanto.

Cloudera occupies about 53 percent of the Hadoop market, followed by Hortonworks at 16 percent and MapR at 11 percent. Cloudera's customers value its marketable add-on tools such as Cloudera Manager, Navigator, and Impala.

CDH Overview

CDH is the most complete, tested, and popular distribution of Apache Hadoop and related
projects. CDH delivers the core elements of Hadoop – scalable storage and distributed
computing – along with a Web-based user interface and vital enterprise capabilities. CDH is
Apache-licensed open source and is the only Hadoop solution to offer unified batch
processing, interactive SQL and interactive search, and role-based access controls.

CDH provides:

 Flexibility—Store any type of data and manipulate it with a variety of different


computation frameworks including batch processing, interactive SQL, free text search,
machine learning and statistical computation.

 Integration—Get up and running quickly on a complete Hadoop platform that works


with a broad range of hardware and software solutions.

 Security—Process and control sensitive data.

 Scalability—Enable a broad range of applications and scale and extend them to suit
your requirements.

 High availability—Perform mission-critical business tasks with confidence.

 Compatibility—Leverage your existing IT infrastructure and investment.

Apache Hive Overview in CDH

Hive data warehouse software enables reading, writing, and managing large datasets in
distributed storage. Using the Hive query language (HiveQL), which is very similar to SQL,
queries are converted into a series of jobs that execute on a Hadoop cluster through
MapReduce or Apache Spark.

Users can run batch processing workloads with Hive while also analyzing the same data for
interactive SQL or machine-learning workloads using tools like Apache Impala or Apache
Spark—all within a single platform.

As part of CDH, Hive also benefits from:

 Unified resource management provided by YARN

 Simplified deployment and administration provided by Cloudera Manager

 Shared security and governance to meet compliance requirements provided by
Apache Sentry and Cloudera Navigator
Use Cases for Hive

Because Hive is a petabyte-scale data warehouse system built on the Hadoop platform, it is a
good choice for environments experiencing phenomenal growth in data volume. The
underlying MapReduce interface with HDFS is hard to program directly, but Hive provides an
SQL interface, making it possible to use existing programming skills to perform data
preparation.

Hive on MapReduce or Spark is best-suited for batch data preparation or ETL:

 You must run scheduled batch jobs with very large ETL sorts with joins to prepare
data for Hadoop. Most data served to BI users in Impala is prepared by ETL developers
using Hive.
 You run data transfer or conversion jobs that take many hours. With Hive, if a
problem occurs partway through such a job, it recovers and continues.
 You receive or provide data in diverse formats, where the Hive SerDes and variety of
UDFs make it convenient to ingest and convert the data. Typically, the final stage of the
ETL process with Hive might be to a high-performance, widely supported format such as
Parquet.

Hive Components

Hive consists of the following components:

 The Metastore Database

 HiveServer2
The Metastore Database
The metastore database is an important aspect of the Hive infrastructure. It is a separate
database, relying on a traditional RDBMS such as MySQL or PostgreSQL, that holds metadata
about Hive databases, tables, columns, partitions, and Hadoop-specific information such as
the underlying data files and HDFS block locations.

The metastore database is shared by other components. For example, the same tables can
be inserted into, queried, altered, and so on by both Hive and Impala. Although you might
see references to the "Hive metastore", be aware that the metastore database is used
broadly across the Hadoop ecosystem, even in cases where you are not using Hive itself.

The metastore database is relatively compact, with fast-changing data. Backup, replication,
and other kinds of management operations affect this database. See Configuring the Hive
Metastore for CDH for details about configuring the Hive metastore.

Cloudera recommends that you deploy the Hive metastore, which stores the metadata for
Hive tables and partitions, in "remote mode." In this mode the metastore service runs in its
own JVM process and other services, such as HiveServer2, HCatalog, and Apache Impala
communicate with the metastore using the Thrift network API.

See Starting the Hive Metastore in CDH for details about starting the Hive metastore service.

HiveServer2
HiveServer2 is a server interface that enables remote clients to submit queries to Hive and
retrieve the results. HiveServer2 supports multi-client concurrency, capacity planning
controls, Sentry authorization, Kerberos authentication, LDAP, SSL, and provides support for
JDBC and ODBC clients.

HiveServer2 is a container for the Hive execution engine. For each client connection, it
creates a new execution context that serves Hive SQL requests from the client. It supports
JDBC clients, such as the Beeline CLI, and ODBC clients. Clients connect to HiveServer2
through the Thrift API-based Hive service.
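For illustration, the following sketch opens a Thrift connection to HiveServer2 from Python using the third-party PyHive package (not part of CDH itself). The host, port (10000 is the usual HiveServer2 default), user, and database are placeholders; in a secured cluster you would also supply the appropriate Kerberos or LDAP options.

Python

# Minimal sketch: connecting to HiveServer2 from Python using the PyHive library.
# Host, port, and credentials are placeholders.
from pyhive import hive

conn = hive.connect(
    host="hiveserver2.example.com",  # HiveServer2 host
    port=10000,                      # default HiveServer2 Thrift port
    username="etl_user",
    database="default",
)
cursor = conn.cursor()
cursor.execute("SHOW TABLES")
for (table_name,) in cursor.fetchall():
    print(table_name)
cursor.close()
conn.close()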

See Configuring HiveServer2 for CDH for details on configuring HiveServer2 and see Starting,


Stopping, and Using HiveServer2 in CDH for details on starting/stopping the HiveServer2
service and information about using the Beeline CLI to connect to HiveServer2. For details
about managing HiveServer2 with its native web user interface (UI), see Using HiveServer2
Web UI in CDH.

How Hive Works with Other Components

Hive integrates with other components, which serve as query execution engines or as data
stores:

 Hive on Spark

 Hive and HBase

 Hive on Amazon S3

 Hive on Microsoft Azure Data Lake Store


Hive on Spark
Hive traditionally uses MapReduce behind the scenes to parallelize the work, and perform
the low-level steps of processing a SQL statement such as sorting and filtering. Hive can also
use Spark as the underlying computation and parallelization engine. See Running Apache
Hive on Spark in CDH for details about configuring Hive to use Spark as its execution engine
and see Tuning Apache Hive on Spark in CDH for details about tuning Hive on Spark.

Hive and HBase


Apache HBase is a NoSQL database that supports real-time read/write access to large
datasets in HDFS. See Using Apache Hive with HBase in CDH for details about configuring
Hive to use HBase. For information about running Hive queries on a secure HBase server,
see Using Hive to Run Queries on a Secure HBase Server.

Hive on Amazon S3
Use the Amazon S3 filesystem to efficiently manage transient Hive ETL (extract-transform-
load) jobs. For step-by-step instructions to configure Hive to use S3 and multiple scripting
examples, see Configuring Transient Hive ETL Jobs to Use the Amazon S3 Filesystem. To
optimize how Hive writes data to and reads data from S3-backed tables and partitions,
see Tuning Hive Performance on the Amazon S3 Filesystem. For information about setting up
a shared Amazon Relational Database Service (RDS) as your Hive metastore, see Configuring
a Shared Amazon RDS as an HMS for CDH.

Hive on Microsoft Azure Data Lake Store


In CDH 5.12 and higher, both Hive on MapReduce2 and Hive on Spark can access tables on
Microsoft Azure Data Lake store (ADLS). In contrast to Amazon S3, ADLS more closely
resembles native HDFS behavior, providing consistency, file directory structure, and POSIX-
compliant ACLs. See Configuring ADLS Gen1 Connectivity for information about configuring
and using ADLS with Hive on MapReduce2.

How Impala Works with CDH

Impala is positioned in the broader Cloudera environment alongside the other CDH services described below. The Impala solution is composed of the following components:

 Clients - Entities including Hue, ODBC clients, JDBC clients, and the Impala Shell can
all interact with Impala. These interfaces are typically used to issue queries or complete
administrative tasks such as connecting to Impala.

 Hive Metastore - Stores information about the data available to Impala. For example,
the metastore lets Impala know what databases are available and what the structure of
those databases is. As you create, drop, and alter schema objects, load data into tables,
and so on through Impala SQL statements, the relevant metadata changes are
automatically broadcast to all Impala nodes by the dedicated catalog service introduced
in Impala 1.2.

 Impala - This process, which runs on DataNodes, coordinates and executes queries.
Each instance of Impala can receive, plan, and coordinate queries from Impala clients.
Queries are distributed among Impala nodes, and these nodes then act as workers,
executing parallel query fragments.

 HBase and HDFS - Storage for data to be queried.

Queries executed using Impala are handled as follows:

1. User applications send SQL queries to Impala through ODBC or JDBC, which provide
standardized querying interfaces. The user application may connect to any  impalad  in the
cluster. This  impalad  becomes the coordinator for the query.

2. Impala parses the query and analyzes it to determine what tasks need to be
performed by  impalad  instances across the cluster. Execution is planned for optimal
efficiency.

3. Services such as HDFS and HBase are accessed by local  impalad  instances to provide
data.

4. Each  impalad  returns data to the coordinating  impalad , which sends these results to
the client.
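The query flow above can be exercised from any client; as a small sketch, the snippet below uses the third-party impyla package to connect to an impalad (which then acts as the query coordinator) over the HiveServer2 protocol on port 21050. The host, table, and column names are placeholders.

Python

# Minimal sketch: submitting a query to an impalad coordinator from Python using the
# impyla package (an external client library). Host and table names are placeholders.
from impala.dbapi import connect

# Any impalad in the cluster can serve as the coordinator for the query.
conn = connect(host="impalad-1.example.com", port=21050)  # default HS2-protocol port
cursor = conn.cursor()

cursor.execute("SELECT product, SUM(amount) AS total FROM sales GROUP BY product LIMIT 10")
for product, total in cursor.fetchall():
    print(product, total)

cursor.close()
conn.close()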
Primary Impala Features

Impala provides support for:

 Most common SQL-92 features of Hive Query Language (HiveQL)


including SELECT, joins, and aggregate functions.

 HDFS, HBase, and Amazon Simple Storage Service (S3) storage, including:

o HDFS file formats: delimited text files, Parquet, Avro, SequenceFile, and RCFile.

o Compression codecs: Snappy, GZIP, Deflate, BZIP.

 Common data access interfaces including:

o JDBC driver.

o ODBC driver.

o Hue Beeswax and the Impala Query UI.

 impala-shell command-line interface.

 Kerberos authentication.

Apache Kudu Overview

Apache Kudu is a columnar storage manager developed for the Hadoop platform. Kudu
shares the common technical properties of Hadoop ecosystem applications: It runs on
commodity hardware, is horizontally scalable, and supports highly available operation.

Apache Kudu is a top-level project in the Apache Software Foundation.

Kudu's benefits include:

 Fast processing of OLAP workloads.

 Integration with MapReduce, Spark, Flume, and other Hadoop ecosystem


components.

 Tight integration with Apache Impala, making it a good, mutable alternative to using
HDFS with Apache Parquet.

 Strong but flexible consistency model, allowing you to choose consistency


requirements on a per-request basis, including the option for strict serialized consistency.

 Strong performance for running sequential and random workloads simultaneously.

 Easy administration and management through Cloudera Manager.

 High availability. Tablet Servers and Master use the Raft consensus algorithm, which
ensures availability as long as more replicas are available than unavailable. Reads can be
serviced by read-only follower tablets, even in the event of a leader tablet failure.

 Structured data model.

By combining all of these properties, Kudu targets support for applications that are difficult or impossible to implement on currently available Hadoop storage technologies. Applications for which Kudu is a viable solution include:

 Reporting applications where new data must be immediately available for end users

 Time-series applications that must support queries across large amounts of historic
data while simultaneously returning granular queries about an individual entity

 Applications that use predictive models to make real-time decisions, with periodic
refreshes of the predictive model based on all historical data
Kudu-Impala Integration

Apache Kudu has tight integration with Apache Impala, allowing you to use Impala to insert, query, update, and delete data from Kudu tablets using Impala's SQL syntax, as an alternative to using the Kudu APIs to build a custom Kudu application. In addition, you can use JDBC or ODBC to connect existing or new applications written in any language, framework, or business intelligence tool to your Kudu data, using Impala as the broker. A short scripted example follows the list below.

 CREATE/ALTER/DROP TABLE  - Impala supports creating, altering, and dropping


tables using Kudu as the persistence layer. The tables follow the same internal/external
approach as other tables in Impala, allowing for flexible data ingestion and querying.

 INSERT  - Data can be inserted into Kudu tables from Impala using the same
mechanisms as any other table with HDFS or HBase persistence.

 UPDATE/DELETE  - Impala supports the  UPDATE  and  DELETE  SQL commands to


modify existing data in a Kudu table row-by-row or as a batch. The syntax of the SQL
commands is designed to be as compatible as possible with existing solutions. In addition
to simple  DELETE  or  UPDATE  commands, you can specify complex joins in
the  FROM  clause of the query, using the same syntax as a regular  SELECT  statement.

 Flexible Partitioning - Similar to partitioning of tables in Hive, Kudu allows you to


dynamically pre-split tables by hash or range into a predefined number of tablets, in
order to distribute writes and queries evenly across your cluster. You can partition by any
number of primary key columns, with any number of hashes, a list of split rows, or a
combination of these. A partition scheme is required.

 Parallel Scan - To achieve the highest possible performance on modern hardware,


the Kudu client used by Impala parallelizes scans across multiple tablets.

 High-efficiency queries - Where possible, Impala pushes down predicate evaluation


to Kudu, so that predicates are evaluated as close as possible to the data. Query
performance is comparable to Parquet in many workloads.
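The following sketch shows these operations expressed as Impala SQL statements run through the same kind of impyla connection used earlier; the table definition, partitioning, and host name are illustrative assumptions rather than a prescribed schema.

Python

# Minimal sketch: managing a Kudu table through Impala SQL, using an impyla connection
# as in the earlier example. Table and column names are placeholders.
from impala.dbapi import connect

conn = connect(host="impalad-1.example.com", port=21050)
cur = conn.cursor()

# Create a Kudu-backed table, hash-partitioned on part of the primary key.
cur.execute("""
    CREATE TABLE IF NOT EXISTS metrics (
        host STRING,
        ts BIGINT,
        value DOUBLE,
        PRIMARY KEY (host, ts)
    )
    PARTITION BY HASH (host) PARTITIONS 4
    STORED AS KUDU
""")

# Row-level INSERT, UPDATE, and DELETE are pushed down to Kudu.
cur.execute("INSERT INTO metrics VALUES ('node1', 1700000000, 0.42)")
cur.execute("UPDATE metrics SET value = 0.57 WHERE host = 'node1' AND ts = 1700000000")
cur.execute("DELETE FROM metrics WHERE host = 'node1'")

cur.close()
conn.close()
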
Example Use Cases

Streaming Input with Near Real Time Availability


A common business challenge is one where new data arrives rapidly and constantly, and the
same data needs to be available in near real time for reads, scans, and updates. Kudu offers
the powerful combination of fast inserts and updates with efficient columnar scans to enable
real-time analytics use cases on a single storage layer.

Time-Series Application with Widely Varying Access Patterns


A time-series schema is one in which data points are organized and keyed according to the
time at which they occurred. This can be useful for investigating the performance of metrics
over time or attempting to predict future behavior based on past data. For instance, time-
series customer data might be used both to store purchase click-stream history and to
predict future purchases, or for use by a customer support representative. While these
different types of analysis are occurring, inserts and mutations might also be occurring
individually and in bulk, and become available immediately to read workloads. Kudu can
handle all of these access patterns simultaneously in a scalable and efficient manner.

Kudu is a good fit for time-series workloads for several reasons. With Kudu's support for
hash-based partitioning, combined with its native support for compound row keys, it is
simple to set up a table spread across many servers without the risk of "hotspotting" that is
commonly observed when range partitioning is used. Kudu's columnar storage engine is also
beneficial in this context, because many time-series workloads read only a few columns, as
opposed to the whole row.

In the past, you might have needed to use multiple datastores to handle different data
access patterns. This practice adds complexity to your application and operations, and
duplicates your data, doubling (or worse) the amount of storage required. Kudu can handle
all of these access patterns natively and efficiently, without the need to off-load work to
other datastores.

Predictive Modeling
Data scientists often develop predictive learning models from large sets of data. The model
and the data might need to be updated or modified often as the learning takes place or as
the situation being modeled changes. In addition, the scientist might want to change one or
more factors in the model to see what happens over time. Updating a large set of data
stored in files in HDFS is resource-intensive, as each file needs to be completely rewritten. In
Kudu, updates happen in near real time. The scientist can tweak the value, re-run the query,
and refresh the graph in seconds or minutes, rather than hours or days. In addition, batch or
incremental algorithms can be run across the data at any time, with near-real-time results.

Combining Data In Kudu With Legacy Systems


Companies generate data from multiple sources and store it in a variety of systems and
formats. For instance, some of your data might be stored in Kudu, some in a traditional
RDBMS, and some in files in HDFS. You can access and query all of these sources and
formats using Impala, without the need to change your legacy systems.

Related Information

 Apache Kudu Concepts and Architecture

 Apache Kudu Installation and Upgrade

 Kudu Security Overview

 More Resources for Apache Kudu

Apache Sentry Overview

Apache Sentry is a granular, role-based authorization module for Hadoop. Sentry provides
the ability to control and enforce precise levels of privileges on data for authenticated users
and applications on a Hadoop cluster. Sentry currently works out of the box with Apache
Hive, Hive Metastore/HCatalog, Apache Solr, Impala, and HDFS (limited to Hive table data).

Sentry is designed to be a pluggable authorization engine for Hadoop components. It allows


you to define authorization rules to validate a user or application’s access requests for
Hadoop resources. Sentry is highly modular and can support authorization for a wide variety
of data models in Hadoop.

Apache Spark Overview

Apache Spark is a general framework for distributed computing that offers high performance
for both batch and interactive processing. It exposes APIs for Java, Python, and Scala and
consists of Spark core and several related projects.

You can run Spark applications locally or distributed across a cluster, either by using
an interactive shell or by submitting an application. Running Spark applications interactively
is commonly performed during the data-exploration phase and for ad hoc analysis.

To run applications distributed across a cluster, Spark requires a cluster manager. In CDH 6,
Cloudera supports only the YARN cluster manager. When run on YARN, Spark application
processes are managed by the YARN ResourceManager and NodeManager roles. Spark
Standalone is no longer supported.

The Apache Spark 2 service in CDH 6 consists of Spark core and several related projects:

Spark SQL

Module for working with structured data. Allows you to seamlessly mix SQL queries
with Spark programs.

Spark Streaming

API that allows you to build scalable fault-tolerant streaming applications.

MLlib

API that implements common machine learning algorithms.
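As a brief sketch of the Spark SQL module in action, the snippet below mixes a SQL query with ordinary DataFrame code in a PySpark application; the input path is a placeholder, and the script would normally be submitted to YARN with spark-submit or run from an interactive shell.

Python

# Minimal sketch: mixing Spark SQL with regular PySpark code, as described above.
# The input path is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

# Load structured data into a DataFrame and expose it to SQL.
df = spark.read.json("/data/events.json")
df.createOrReplaceTempView("events")

# A SQL query and a DataFrame action over the same data.
top_users = spark.sql(
    "SELECT user, COUNT(*) AS n FROM events GROUP BY user ORDER BY n DESC LIMIT 10")
top_users.show()

spark.stop()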


The Cloudera Enterprise product includes the Spark features roughly corresponding to
the feature set and bug fixes of Apache Spark 2.4. The Spark 2.x service was previously
shipped as its own parcel, separate from CDH.

In CDH 6, the Spark 1.6 service does not exist. The port of the Spark History Server is
18088, which is the same as formerly with Spark 1.6, and a change from port 18089
formerly used for the Spark 2 parcel.

Unsupported Features

The following Spark features are not supported:

 Apache Spark experimental features/APIs are not supported unless stated otherwise.

 Using the JDBC Datasource API to access Hive or Impala is not supported

 ADLS not Supported for All Spark Components. Microsoft Azure Data Lake Store
(ADLS) is a cloud-based filesystem that you can access through Spark applications.

Spark with Kudu is not currently supported for ADLS data. (Hive on Spark is available
for ADLS in CDH 5.12 and higher.)

 IPython / Jupyter notebooks is not supported. The IPython notebook system


(renamed to Jupyter as of IPython 4.0) is not supported.

 Certain Spark Streaming features not supported. The  mapWithState  method is


unsupported because it is a nascent unstable API.

 Thrift JDBC/ODBC server is not supported

 Spark SQL CLI is not supported

 GraphX is not supported

 SparkR is not supported


 Structured Streaming is supported, but the following features of it are not:

o Continuous processing, which is still experimental, is not supported.

o Stream static joins with HBase have not been tested and therefore are not
supported.

 Spark cost-based optimizer (CBO) not supported.

Amazon Web Services Elastic MapReduce Hadoop Distribution

Amazon Elastic MapReduce is part of Amazon Web Services (AWS) and has existed since the early days of Hadoop. AWS offers a simple-to-use, well-organized data analytics platform built on the powerful HDFS architecture, and it is one of the highest-ranking vendors with the largest market share across the globe.

DynamoDB is another major NoSQL database offered by this AWS Hadoop vendor, built to run large consumer websites.

Amazon Elastic MapReduce (Amazon EMR)

Amazon Elastic MapReduce (EMR) is an Amazon Web Services (AWS) tool for big data processing and analysis. Amazon EMR offers an expandable, low-configuration service as an easier alternative to running in-house cluster computing.

Amazon EMR is based on Apache Hadoop, a Java-based programming framework that


supports the processing of large data sets in a distributed
computing environment. MapReduce is a software framework that allows developers to write
programs that process massive amounts of unstructured data in parallel across a
distributed cluster of processors or stand-alone computers. It was developed at Google for
indexing web pages and replaced their original indexing algorithms and heuristics in 2004.

Amazon EMR processes big data across a Hadoop cluster of virtual servers on Amazon
Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3). The elastic in EMR's
name refers to its dynamic resizing ability, which allows it to ramp up or reduce resource use
depending on the demand at any given time.

Processing big data with Amazon EMR


Amazon EMR is used for data analysis in log analysis, web indexing, data
warehousing, machine learning, financial analysis, scientific simulation, bioinformatics and
more. EMR also supports workloads based on Apache Spark, Presto and Apache HBase -- the
latter of which integrates with Hive and Pig for additional functionality.

Introduction to AWS EMR

Amazon EMR is a leading cloud-native big data platform. It can process vast amounts of data quickly and cost-effectively using open source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto. Combined with the auto-scaling capability of Amazon EC2 and the storage scalability of Amazon S3, EMR gives you the flexibility to run short-lived clusters that automatically scale to meet demand, or long-running, highly available clusters.

AWS provides many services that make big data work easier; some of these technologies are:
1. Amazon EC2

2. Amazon RDS

3. Amazon S3

4. Amazon CloudFront

5. Amazon Auto Scaling

6. Amazon Lambda

7. Amazon Redshift

8. Amazon Elastic MapReduce (EMR)

One of the major services provided by AWS, and the one we are going to deal with here, is Amazon EMR.

EMR, commonly called Elastic MapReduce, offers an easy and approachable way to process large chunks of data. Imagine a big data scenario in which we have a huge amount of data and are performing a set of operations over it, say a running MapReduce job. One of the major issues a big data application faces is tuning: we often find it difficult to fine-tune a program so that all of the allocated resources are consumed properly, and because of this the processing time gradually increases.

Elastic MapReduce, the service by Amazon, is a web service that provides a framework to manage all the features needed for big data processing in a cost-effective, fast, and secure manner. From cluster creation to data distribution over the various instances, everything is managed by Amazon EMR. The services are on demand, which means we can control the numbers based on the data we have, making EMR cost-efficient and scalable.

Reasons for Using AWS EMR

So why use EMR, and what makes it better than the alternatives? We often encounter a basic problem: we are unable to allocate all of the resources available in the cluster to any one application. Amazon EMR takes care of this; based on the size of the data and the demands of the application, it allocates the necessary resources, and because it is elastic in nature we can change the allocation accordingly. EMR has broad application support, including Hadoop, Spark, and HBase, which makes data processing easier. It supports various ETL operations quickly and cost-effectively, and it can also be used for MLlib in Spark to run various machine learning algorithms. Whether the data is batch or real-time streaming, EMR is capable of organizing and processing both types.

Working of AWS EMR

1. Clusters are the central component in the Amazon EMR architecture. They are a collection of EC2 instances called nodes. Each node has a specific role within the cluster, termed its node type, and based on these roles nodes are classified into three types:

 Master Node

 Core Node

 Task Node

2. The Master Node, as the name suggests, is responsible for managing the cluster: it runs the master components and distributes data over the nodes for processing. It keeps track of whether everything is properly managed and running fine, and takes action in the case of failure.

3. The Core Node is responsible for running tasks and storing the data in HDFS in the cluster. The processing work is handled by the core nodes, and the data produced by that processing is written to the desired HDFS location.

4. The Task Node is optional; it only runs tasks and does not store data in HDFS.

5. After submitting a job, we have several ways to choose how the work gets completed, from terminating the cluster after job completion to keeping a long-running cluster, and we can use the EMR console or the CLI to submit steps (a scripted example using the AWS SDK follows this list).

6. We can run jobs directly on EMR by connecting to the master node through the available interfaces and tools that run jobs directly on the cluster.

7. We can also process data in a series of steps: all we have to do is submit one or more ordered steps to the EMR cluster. The data is stored as files and is processed sequentially. Each step can be traced from the Pending state to the Completed state, and errors can be found by tracing steps that end in the Failed or Cancelled state.

8. Once all the instances are terminated, the cluster reaches the Completed state.
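To make the step-submission workflow concrete, here is a minimal sketch using the AWS SDK for Python (boto3) to start a transient cluster that runs a single Spark step and then terminates. The release label, instance types, IAM roles, and S3 paths are placeholders and would need to match your own account and job.

Python

# Minimal sketch: creating a transient EMR cluster and submitting one Spark step with
# the AWS SDK for Python (boto3). Release label, roles, and S3 paths are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="wordcount-cluster",
    ReleaseLabel="emr-6.10.0",                      # assumed EMR release
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE",   "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,       # terminate after the steps finish
    },
    Steps=[{
        "Name": "wordcount",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/wordcount.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Started cluster:", response["JobFlowId"])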

Architecture for AWS EMR

The architecture of EMR is organized in layers, from the storage layer up to the application layer.

 The first layer is the storage layer, which includes the different file systems used with the cluster, from HDFS to EMRFS to the local file system; these are all used for data storage across the application. Caching of intermediate results during MapReduce processing can also be achieved with these technologies that come with EMR.

 The second layer is resource management for the cluster; this layer is responsible for managing resources for the clusters and nodes across the application. It acts as the management tool that helps distribute data evenly over the cluster and keeps it properly managed. The default resource management tool that EMR uses is YARN, which was introduced in Apache Hadoop 2.0. It centrally manages the resources for multiple data processing frameworks and takes care of all the information needed to keep the cluster running well, from node health to resource distribution and memory management.

 The third layer is the data processing framework, which is responsible for the analysis and processing of data. Many frameworks supported by EMR play an important role in parallel and efficient data processing; among those we are familiar with are Apache Hadoop, Spark, and Spark Streaming.

 The fourth layer consists of the applications and programs, such as Hive, Pig, streaming libraries, and ML algorithms, that are helpful for processing and managing large data sets.

Advantages of AWS EMR

Let us now check some of the benefits of using EMR:

1. High speed: Because the resources are utilized properly, query processing time is comparatively faster than with other data processing tools.

2. Bulk data processing: However large the data, EMR has the capability to process huge amounts of it in ample time.

3. Minimal data loss: Because data is distributed over the cluster and processed in parallel across the network, there is minimal chance of data loss, and the accuracy of the processed data is better.

4. Cost-effective: EMR is cheaper than the other available alternatives, which makes it strong for industry usage. Since the pricing is low, we can accommodate large amounts of data and process them within budget.

5. AWS integrated: It is integrated with the other AWS services, so security, storage, and networking are all available under one roof.

6. Security: It comes with security groups to control inbound and outbound traffic, and the use of IAM roles with fine-grained permissions makes data more secure.

7. Monitoring and deployment: There are proper monitoring tools for all the applications running on EMR clusters, which makes analysis transparent and easy, and EMR also offers an auto-deployment feature in which applications are configured and deployed automatically.

There are many more advantages that make EMR a better choice than other cluster computation methods.

Microsoft Hadoop Distribution

Microsoft is an IT business not known for free, open source software solutions, yet it has worked to make the Hadoop platform run on Windows. Its Hadoop distribution is offered as the public cloud product Microsoft Azure HDInsight, built mainly to work with Azure.

An additional Microsoft specialty is its PolyBase feature, which helps customers search for data in SQL Server while executing their queries.

Apache Hadoop architecture in HDInsight

Apache Hadoop includes two core components: the Apache Hadoop Distributed File System
(HDFS) that provides storage, and Apache Hadoop Yet Another Resource Negotiator
(YARN) that provides processing. With storage and processing capabilities, a cluster becomes
capable of running MapReduce programs to perform the desired data processing.

An HDFS is not typically deployed within the HDInsight cluster to provide storage. Instead,
an HDFS-compatible interface layer is used by Hadoop components. The actual storage
capability is provided by either Azure Storage or Azure Data Lake Storage. For Hadoop,
MapReduce jobs executing on the HDInsight cluster run as if an HDFS were present and so
require no changes to support their storage needs. In Hadoop on HDInsight, storage is
outsourced, but YARN processing remains a core component. For more information,
see Introduction to Azure HDInsight.

This article introduces YARN and how it coordinates the execution of applications on
HDInsight.

Apache Hadoop YARN basics

YARN governs and orchestrates data processing in Hadoop. YARN has two core services that
run as processes on nodes in the cluster:

 ResourceManager

 NodeManager
The ResourceManager grants cluster compute resources to applications like MapReduce
jobs. The ResourceManager grants these resources as containers, where each container
consists of an allocation of CPU cores and RAM memory. If you combined all the resources
available in a cluster and then distributed the cores and memory in blocks, each block of
resources is a container. Each node in the cluster has a capacity for a certain number of
containers, therefore the cluster has a fixed limit on the number of containers available. The
allotment of resources in a container is configurable.

When a MapReduce application runs on a cluster, the ResourceManager provides the


application the containers in which to execute. The ResourceManager tracks the status of
running applications, available cluster capacity, and tracks applications as they complete and
release their resources.

The ResourceManager also runs a web server process that provides a web user interface to
monitor the status of applications.

When a user submits a MapReduce application to run on the cluster, the application is
submitted to the ResourceManager. In turn, the ResourceManager allocates a container on
available NodeManager nodes. The NodeManager nodes are where the application actually
executes. The first container allocated runs a special application called the ApplicationMaster.
This ApplicationMaster is responsible for acquiring resources, in the form of subsequent
containers, needed to run the submitted application. The ApplicationMaster examines the
stages of the application, such as the map stage and reduce stage, and factors in how much
data needs to be processed. The ApplicationMaster then requests (negotiates) the resources
from the ResourceManager on behalf of the application. The ResourceManager in turn
grants resources from the NodeManagers in the cluster to the ApplicationMaster for it to use
in executing the application.

The NodeManagers run the tasks that make up the application, then report their progress
and status back to the ApplicationMaster. The ApplicationMaster in turn reports the status of
the application back to the ResourceManager. The ResourceManager returns any results to
the client.
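In addition to the web user interface, the ResourceManager exposes a REST API that can be used to monitor applications programmatically. The sketch below polls that API for running applications; the ResourceManager host is a placeholder, and 8088 is the usual default web port.

Python

# Minimal sketch: polling the ResourceManager's REST API to monitor running applications.
# The ResourceManager host and port (8088 by default) are placeholders.
import requests

RM = "http://resourcemanager.example.com:8088"

resp = requests.get(f"{RM}/ws/v1/cluster/apps", params={"states": "RUNNING"})
resp.raise_for_status()

apps = resp.json().get("apps") or {}
for app in apps.get("app", []):
    print(app["id"], app["name"], app["state"], f'{app["progress"]:.0f}%')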

YARN on HDInsight

All HDInsight cluster types deploy YARN. The ResourceManager is deployed for high
availability with a primary and secondary instance, which runs on the first and second head
nodes within the cluster respectively. Only the one instance of the ResourceManager is active
at a time. The NodeManager instances run across the available worker nodes in the cluster.

Hadoop in Azure HDInsight

Azure HDInsight is a fully managed, full-spectrum, open-source analytics service in the cloud
for enterprises. The Apache Hadoop cluster type in Azure HDInsight allows you to use
the Apache Hadoop Distributed File System (HDFS), Apache Hadoop YARN resource
management, and a simple MapReduce programming model to process and analyze batch
data in parallel. Hadoop clusters in HDInsight are compatible with Azure Blob storage, Azure
Data Lake Storage Gen1, or Azure Data Lake Storage Gen2.

To see available Hadoop technology stack components on HDInsight, see Components and


versions available with HDInsight. To read more about Hadoop in HDInsight, see the Azure
features page for HDInsight.

Consider a basic word count MapReduce job. The output of this job is a count of how many times each word occurred in the input text.

 The mapper takes each line from the input text as input and breaks it into words. It emits a key/value pair for each occurrence of a word, consisting of the word followed by a 1. The output is sorted before it is sent to the reducer.

 The reducer sums these individual counts for each word and emits a single key/value pair that contains the word followed by the sum of its occurrences.

MapReduce can be implemented in various languages. Java is the most common implementation and is used for demonstration purposes in this document; a rough Python equivalent is sketched below.
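The sketch below is a rough Python equivalent written in the Hadoop Streaming style, where the mapper and reducer read lines from standard input and emit tab-separated key/value pairs; the script name and invocation convention are assumptions for illustration.

Python

# Minimal sketch: word count written for Hadoop Streaming in Python.
# mapper() and reducer() read from stdin and write tab-separated key/value pairs,
# matching the mapper/reducer behavior described above.
import sys

def mapper():
    # Emit "<word>\t1" for every word on every input line.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by key, so counts for a word are contiguous.
    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")

if __name__ == "__main__":
    # Run as "word_count.py map" for the map phase or "word_count.py reduce" for the reduce phase.
    mapper() if sys.argv[1] == "map" else reducer()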

Use Spark & Hive Tools for Visual Studio Code

Learn how to use Apache Spark & Hive Tools for Visual Studio Code. Use the tools to create
and submit Apache Hive batch jobs, interactive Hive queries, and PySpark scripts for Apache
Spark. First we'll describe how to install Spark & Hive Tools in Visual Studio Code. Then we'll
walk through how to submit jobs to Spark & Hive Tools.

Spark & Hive Tools can be installed on platforms that are supported by Visual Studio Code.
Note the following prerequisites for different platforms.

Prerequisites

The following items are required for completing the steps in this article:

 An Azure HDInsight cluster. To create a cluster, see Get started with HDInsight. Or use
a Spark and Hive cluster that supports an Apache Livy endpoint.

 Visual Studio Code.

 Mono. Mono is required only for Linux and macOS.

 A PySpark interactive environment for Visual Studio Code.

 A local directory. This article uses C:\HD\HDexample.

Install Spark & Hive Tools

After you meet the prerequisites, you can install Spark & Hive Tools for Visual Studio Code
by following these steps:

1. Open Visual Studio Code.

2. From the menu bar, navigate to View > Extensions.

3. In the search box, enter Spark & Hive.

4. Select Spark & Hive Tools from the search results, and then select Install:

5. Select Reload when necessary.

Open a work folder

To open a work folder and to create a file in Visual Studio Code, follow these steps:

1. From the menu bar, navigate to File > Open Folder... > C:\HD\HDexample, and
then select the Select Folder button. The folder appears in the Explorer view on the
left.

2. In Explorer view, select the HDexample folder, and then select the New File icon


next to the work folder:

3. Name the new file by using either the .hql (Hive queries) or the .py (Spark script) file
extension. This example uses HelloWorld.hql.

Set the Azure environment

For a national cloud user, follow these steps to set the Azure environment first, and then use
the Azure: Sign In command to sign in to Azure:

1. Navigate to File > Preferences > Settings.

2. Search on the following string: Azure: Cloud.

3. Select the national cloud from the list:

Connect to an Azure account

Before you can submit scripts to your clusters from Visual Studio Code, you can either sign in to your Azure subscription or link an HDInsight cluster. Use the Ambari username/password or domain-joined credentials for an ESP cluster to connect to your HDInsight cluster. Follow these steps to connect to Azure:

1. From the menu bar, navigate to View > Command Palette..., and enter Azure: Sign
In:

2. Follow the sign-in instructions to sign in to Azure. After you're connected, your Azure
account name shows on the status bar at the bottom of the Visual Studio Code
window.

Link a cluster

Link: Azure HDInsight


You can link a normal cluster by using an Apache Ambari-managed username, or you can
link an Enterprise Security Pack secure Hadoop cluster by using a domain username (such
as: user1@contoso.com).

1. From the menu bar, navigate to View > Command Palette..., and enter Spark /


Hive: Link a Cluster.

2. Select linked cluster type Azure HDInsight.

3. Enter the HDInsight cluster URL.

4. Enter your Ambari username; the default is admin.

5. Enter your Ambari password.

6. Select the cluster type.

7. Set the display name of the cluster (optional).

8. Review OUTPUT view for verification.

If you have both signed in to an Azure subscription and linked a cluster, the linked username and password are used.
Link: Generic Livy endpoint
1. From the menu bar, navigate to View > Command Palette..., and enter Spark /
Hive: Link a Cluster.

2. Select linked cluster type Generic Livy Endpoint.

3. Enter the generic Livy endpoint. For example: http://10.172.41.42:18080.

4. Select authorization type Basic or None. If you select Basic:

1. Enter your Ambari username; the default is admin.

2. Enter your Ambari password.

5. Review OUTPUT view for verification.

List clusters

1. From the menu bar, navigate to View > Command Palette..., and enter Spark /


Hive: List Cluster.

2. Select the subscription that you want.

3. Review the OUTPUT view. This view shows your linked cluster (or clusters) and all the
clusters under your Azure subscription:

Set the default cluster

1. Reopen the HDexample folder that was discussed earlier, if closed.

2. Select the HelloWorld.hql file that was created earlier. It opens in the script editor.

3. Right-click the script editor, and then select Spark / Hive: Set Default Cluster.

4. Connect to your Azure account, or link a cluster if you haven't yet done so.

5. Select a cluster as the default cluster for the current script file. The tools automatically
update the .VSCode\settings.json configuration file:

Submit interactive Hive queries and Hive batch scripts

With Spark & Hive Tools for Visual Studio Code, you can submit interactive Hive queries and
Hive batch scripts to your clusters.

1. Reopen the HDexample folder that was discussed earlier, if closed.

2. Select the HelloWorld.hql file that was created earlier. It opens in the script editor.

3. Copy and paste the following code into your Hive file, and then save it:

HiveQL

SELECT * FROM hivesampletable;


4. Connect to your Azure account, or link a cluster if you haven't yet done so.

5. Right-click the script editor and select Hive: Interactive to submit the query, or use
the Ctrl+Alt+I keyboard shortcut. Select Hive: Batch to submit the script, or use the
Ctrl+Alt+H keyboard shortcut.

6. If you haven't specified a default cluster, select a cluster. The tools also let you submit
a block of code instead of the whole script file by using the context menu. After a few
moments, the query results appear in a new tab:

o RESULTS panel: You can save the whole result as a CSV, JSON, or Excel file to
a local path or just select multiple lines.

o MESSAGES panel: When you select a Line number, it jumps to the first line of


the running script.

Submit interactive PySpark queries

Users can run interactive PySpark queries in the following ways:

Using the PySpark interactive command in a PY file

To use the PySpark interactive command to submit the queries, follow these steps:

1. Reopen the HDexample folder that was discussed earlier, if closed.

2. Create a new HelloWorld.py file, following the earlier steps.

3. Copy and paste the following code into the script file:

Python

from operator import add

# Read the sample text file from cluster storage and keep only the line text.
lines = spark.read.text("/HdiSamples/HdiSamples/FoodInspectionData/README").rdd.map(lambda r: r[0])

# Split lines into words, pair each word with 1, and sum the counts per word.
counters = lines.flatMap(lambda x: x.split(' ')) \
    .map(lambda x: (x, 1)) \
    .reduceByKey(add)

coll = counters.collect()
sortedCollection = sorted(coll, key=lambda r: r[1], reverse=True)

# Print the five most frequent words.
for i in range(0, 5):
    print(sortedCollection[i])
4. The prompt to install PySpark/Synapse Pyspark kernel is displayed in the lower right
corner of the window. You can click on Install button to proceed for the
PySpark/Synapse Pyspark installations; or click on Skip button to skip this step.

5. If you need to install it later, you can navigate to File > Preference > Settings, then


uncheck HDInsight: Enable Skip Pyspark Installation in the settings.

6. If the installation is successful in step 4, the "PySpark installed successfully" message


box is displayed in the lower right corner of the window. Click on Reload button to
reload the window.

7. From the menu bar, navigate to View > Command Palette... or use the Shift + Ctrl
+ P keyboard shortcut, and enter Python: Select Interpreter to start Jupyter Server.

8. Select the python option below.

9. From the menu bar, navigate to View > Command Palette... or use the Shift + Ctrl
+ P keyboard shortcut, and enter Developer: Reload Window.

10. Connect to your Azure account, or link a cluster if you haven't yet done so.

11. Select all the code, right-click the script editor, and select Spark: PySpark
Interactive / Synapse: Pyspark Interactive to submit the query.

12. Select the cluster, if you haven't specified a default cluster. After a few moments, the Python Interactive results appear in a new tab. Click on PySpark to switch the kernel to PySpark / Synapse Pyspark, and the code will run successfully. If you want to switch to the Synapse Pyspark kernel, disabling auto-settings in the Azure portal is encouraged; otherwise it may take a long while to wake up the cluster and set the Synapse kernel for the first-time use. The tools also let you submit a block of code instead of the whole script file by using the context menu.

13. Enter %%info, and then press Shift+Enter to view the job information (optional):

The tool also supports the Spark SQL query:

Perform interactive query in PY file using a #%% comment


1. Add #%% before the Py code to get notebook experience.

2. Click on Run Cell. After a few moments, the Python Interactive results appear in a
new tab. Click on PySpark to switch the kernel to PySpark/Synapse PySpark, then, click
on Run Cell again, and the code will run successfully.

Leverage IPYNB support from Python extension

1. You can create a Jupyter Notebook by command from the Command Palette or by
creating a new .ipynb file in your workspace. For more information, see Working with
Jupyter Notebooks in Visual Studio Code

2. Click on the Run Cell button and follow the prompts to set the default Spark pool (it is strongly encouraged to set the default cluster/pool every time before opening a notebook), and then Reload the window.

3. Click on PySpark to switch kernel to PySpark / Synapse Pyspark, and then click
on Run Cell, after a while, the result will be displayed.

Submit PySpark batch job

1. Reopen the HDexample folder that you discussed earlier, if closed.

2. Create a new BatchFile.py file by following the earlier steps.

3. Copy and paste the following code into the script file:

Python

from __future__ import print_function
import sys
from operator import add
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("PythonWordCount")\
        .getOrCreate()

    # Read the sample CSV from cluster storage and keep only the line text.
    lines = spark.read.text('/HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv').rdd.map(lambda r: r[0])

    # Classic word count: split, pair each word with 1, and sum per word.
    counts = lines.flatMap(lambda x: x.split(' '))\
        .map(lambda x: (x, 1))\
        .reduceByKey(add)

    output = counts.collect()
    for (word, count) in output:
        print("%s: %i" % (word, count))

    spark.stop()
4. Connect to your Azure account, or link a cluster if you haven't yet done so.

5. Right-click the script editor, and then select Spark: PySpark Batch or Synapse: PySpark Batch.

6. Select a cluster/spark pool to submit your PySpark job to:

After you submit a Python job, submission logs appear in the OUTPUT window in Visual
Studio Code. The Spark UI URL and Yarn UI URL are also shown. If you submit the batch job
to an Apache Spark pool, the Spark history UI URL and the Spark Job Application UI URL are
also shown. You can open the URL in a web browser to track the job status.

Integrate with HDInsight Identity Broker (HIB)

Connect to your HDInsight ESP cluster with ID Broker (HIB)


You can follow the normal steps to sign in to Azure subscription to connect to your
HDInsight ESP cluster with ID Broker (HIB). After sign-in, you'll see the cluster list in Azure
Explorer. For more instructions, see Connect to your HDInsight cluster.

Run a Hive/PySpark job on an HDInsight ESP cluster with ID Broker (HIB)


To run a Hive job, you can follow the normal steps to submit a job to an HDInsight ESP cluster with ID Broker (HIB). Refer to Submit interactive Hive queries and Hive batch scripts for more instructions.

To run an interactive PySpark job, you can follow the normal steps to submit a job to an HDInsight ESP cluster with ID Broker (HIB). Refer to Submit interactive PySpark queries for more instructions.

To run a PySpark batch job, you can follow the normal steps to submit a job to an HDInsight ESP cluster with ID Broker (HIB). Refer to Submit PySpark batch job for more instructions.

Apache Livy configuration

Apache Livy configuration is supported. You can configure it in the .vscode\settings.json file in the workspace folder. Currently, Livy configuration only supports Python scripts. For more information, see the Livy README.

How to trigger Livy configuration

Method 1

1. From the menu bar, navigate to File > Preferences > Settings.

2. In the Search settings box, enter HDInsight Job Submission: Livy Conf.

3. Select Edit in settings.json for the relevant search result.

Method 2

Submit a file, and notice that the .vscode folder is automatically added to the work folder.
You can see the Livy configuration by selecting .vscode\settings.json.

 The project settings:

Integrate with Azure HDInsight from Explorer

You can preview Hive tables in your clusters directly through the Azure HDInsight explorer:

1. Connect to your Azure account if you haven't yet done so.

2. Select the Azure icon from the leftmost column.

3. From the left pane, expand AZURE: HDINSIGHT. The available subscriptions and
clusters are listed.

4. Expand the cluster to view the Hive metadata database and table schema.

5. Right-click the Hive table. For example: hivesampletable. Select Preview.

6. The Preview Results window opens:

 RESULTS panel

You can save the whole result as a CSV, JSON, or Excel file to a local path, or just select
multiple lines.

 MESSAGES panel

1. When the number of rows in the table is greater than 100, you see the
following message: "The first 100 rows are displayed for Hive table."

2. When the number of rows in the table is less than or equal to 100, you see a message with the actual row count, for example: "60 rows are displayed for Hive table."

3. When there's no content in the table, you see the following message: "0 rows
are displayed for Hive table."

On Linux, install xclip to enable copying of table data.

Additional features

Spark & Hive for Visual Studio Code also supports the following features:

 IntelliSense autocomplete. Suggestions pop up for keywords, methods, variables, and other programming elements. Different icons represent different types of objects:

 IntelliSense error marker. The language service underlines editing errors in the Hive
script.

 Syntax highlights. The language service uses different colors to differentiate variables, keywords, data types, functions, and other programming elements:

Reader-only role

Users who are assigned the reader-only role for the cluster can't submit jobs to the
HDInsight cluster, nor view the Hive database. Contact the cluster administrator to upgrade
your role to HDInsight Cluster Operator in the Azure portal. If you have valid Ambari
credentials, you can manually link the cluster by using the following guidance.

Browse the HDInsight cluster


When you select the Azure HDInsight explorer to expand an HDInsight cluster, you're
prompted to link the cluster if you have the reader-only role for the cluster. Use the
following method to link to the cluster by using your Ambari credentials.

Submit the job to the HDInsight cluster


When submitting a job to an HDInsight cluster, you're prompted to link the cluster if you're in the reader-only role for the cluster. Use the following steps to link to the cluster by using Ambari credentials.

Link to the cluster


1. Enter a valid Ambari username.

2. Enter a valid password.

You can use Spark / Hive: List Cluster to check the linked cluster:

HPE Ezmeral Data Fabric (formerly MapR Data Platform)

MapR, the company, was commonly seen as the third horse – or should that be elephant? –
in a race with Cloudera and Hortonworks. The latter two have merged, while MapR has,
essentially, gone out of business.

It was announced on 5 August 2019 that Hewlett Packard Enterprise (HPE) had acquired all MapR assets for an undisclosed sum.

MapR's technologies were designed to let Hadoop perform to its full potential with minimal effort. Its linchpin, the MapR filesystem, implements the HDFS API, is fully read/write, and can store trillions of files.

MapR has done more than any other vendor to deliver a reliable and efficient distribution for huge cluster implementations.

HPE Ezmeral Data Fabric XD Distributed File and Object Store

XD Cloud-Scale Data Store provides an exabyte-scale data store for building intelligent applications with the Data-fabric Converged Data Platform. XD includes all the functionality you need to manage large amounts of conventional data.

Why XD?

XD can be installed on SSD- and HDD-based servers. It includes the filesystem for data storage, data management, and data protection; support for mounting and accessing the clusters using NFS and the FUSE-based POSIX (basic, platinum, or PACC) clients; and support for accessing and managing data using HDFS APIs. The cluster can be managed using the Control System and monitored using Data-fabric Monitoring (the Spyglass initiative). XD is the only cloud-scale data store that enables you to build a fabric of exabyte scale. XD supports trillions of files and hundreds of thousands of client nodes, and can run on edge clusters, on-premises data centers, and the public cloud.

Accessing filesystem with C Applications

MapR provides a modified version of libhdfs that supports access to the MapR filesystem.
You can develop applications with C that read files, write to files, change file permissions and
file ownership, create and delete files and directories, rename files, and change the access
and modification times of files and directories.

libMapRClient supports, and makes modifications to, the hadoop-2.x version of libhdfs. The API reference notes which APIs are supported by hadoop-2.x.

libMapRClient’s version of libhdfs contains the following changes and additions:

 There are no calls to a JVM, so applications run faster and more efficiently.

 Changes to APIs

o hadoop-2.x: Support for hdfsBuilder structures for connections to HDFS is limited. Some of the parameters are ignored.

o hadoop-2.x: hdfsGetDefaultBlockSize(): If the filesystem that the client is connected to is an instance of the MapR filesystem, the returned value is 256 MB, regardless of the actual setting.

o hadoop-2.x: hdfsCreateDirectory(): The parameters for buffer size, replication, and block size are ignored for connections to the MapR filesystem.

o hadoop-2.x: hdfsGetDefaultBlockSizeAtPath(): If the filesystem that the client is connected to is an instance of the MapR filesystem, the returned value is 256 MB, regardless of the actual setting.

o hadoop-2.x: hdfsOpenFile(): The parameters for buffer size and replication are ignored for connections to the MapR filesystem.

 APIs that are unique to libMapRClient for hadoop-2.x

o hdfsCreateDirectory2()

o hdfsGetNameContainerSizeBytes()

o hdfsOpenFile2()

o hdfsSetRpcTimeout()

o hdfsSetThreads()

Compiling and Running a Java Application

You can compile and run the Java application using JAR files from the MapR Maven
repository or from the MapR installation.

Using JARs from the MapR Maven Repository

MapR Development publishes Maven artifacts from version 2.1.2 onward at https://repository.mapr.com/maven/. When compiling for MapR 6.1, add the following dependency to the pom.xml file for your project:

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>2.7.0-mapr-1808</version>
</dependency>

This dependency will pull the rest of the dependencies from the MapR Maven repository the next time you run mvn clean install. The JAR that includes the maprfs library is a dependency of the hadoop-common artifact.

For a complete list of MapR-provided artifacts and further details, see Maven Artifacts for the
HPE Ezmeral Data Fabric.

Using JARs from the MapR Installation

The maprfs library is included in the hadoop classpath. Add the hadoop classpath to the Java classpath when you compile and run the Java application.

 To compile the sample code, use the following command:

javac -cp $(hadoop classpath) MapRTest.java

 To run the sample code, use the following command:

java -cp .:$(hadoop classpath) MapRTest /test

Copy Data Using the hdfs:// Protocol

Describes the procedure to copy data from an HDFS cluster to a data-fabric cluster using the hdfs:// protocol.

Before you can copy data from an HDFS cluster to a data-fabric cluster using the hdfs://
protocol, you must configure the data-fabric cluster to access the HDFS cluster. To do this,
complete the steps listed in Configuring a MapR Cluster to Access an HDFS Cluster for the
security scenario that best describes your HDFS and data-fabric clusters and then complete
the steps listed under Verifying Access to an HDFS Cluster.

You also need the following information:

 <NameNode> - the IP address or hostname of the NameNode in the HDFS cluster

 <NameNode Port> - the port for connecting to the NameNode in the HDFS cluster

 <HDFS path> - the path to the HDFS directory from which you plan to copy data

 <MapRFilesystem path> - the path in the data-fabric cluster to which you plan to copy HDFS data

 <file> - a file in the HDFS path

To copy data from HDFS to the data-fabric filesystem using the hdfs:// protocol, complete the following steps:

1. Run the following hadoop command to determine if the data-fabric cluster can read the contents of a file in a specified directory on the HDFS cluster:

hadoop fs -cat <NameNode>:<NameNode port>/<HDFS path>/<file>

Example

hadoop fs -cat hdfs://nn1:8020/user/sara/contents.xml

2. If the data-fabric cluster can read the contents of the file, run the distcp command to copy the data from the HDFS cluster to the data-fabric cluster:

hadoop distcp hdfs://<NameNode>:<NameNode Port>/<HDFS path> maprfs://<MapRFilesystem path>

Example

hadoop distcp hdfs://nn1:8020/user/sara maprfs:///user/sara

Copying Data Using NFS for the HPE Ezmeral Data Fabric

Describes how to copy files from one data-fabric cluster to another using NFS for the HPE Ezmeral Data Fabric.

If NFS for the HPE Ezmeral Data Fabric is installed on the data-fabric cluster, you can mount
the data-fabric cluster to the HDFS cluster and then copy files from one cluster to the other
using hadoop distcp. If you do not have NFS for the HPE Ezmeral Data Fabric installed and a
mount point configured, see Accessing Data with NFS v3 and Managing the Data Fabric NFS
Service.

To perform a copy using distcp via NFS for the HPE Ezmeral Data Fabric, you need the
following information:

 <MapR NFS Server> - the IP address or hostname of the NFS server in the data-fabric cluster

 <maprfs_nfs_mount> - the NFS export mount point configured on the data-fabric cluster; the default is /mapr

 <hdfs_nfs_mount> - the NFS for the HPE Ezmeral Data Fabric mount point configured on the HDFS cluster

 <NameNode> - the IP address or hostname of the NameNode in the HDFS cluster

 <NameNode Port> - the port on the NameNode in the HDFS cluster

 <HDFS path> - the path to the HDFS directory from which you plan to copy data

 <MapR filesystem path> - the path in the data-fabric cluster to which you plan to copy HDFS data

To copy data from HDFS to the data-fabric filesystem using NFS for the HPE Ezmeral Data
Fabric, complete the following steps:

1. Mount HDFS.

Issue the following command to mount the data-fabric cluster to the HDFS NFS for the HPE Ezmeral Data Fabric mount point:

mount <Data Fabric NFS Server>:/<maprfs_nfs_mount> /<hdfs_nfs_mount>

Example

mount 10.10.100.175:/mapr /hdfsmount

2. Copy data.

a. Issue the following command to copy data from the HDFS cluster to the data-
fabric cluster:

hadoop distcp hdfs://<NameNode>:<NameNode Port>/<HDFS path> file:///<hdfs_nfs_mount>/<MapR filesystem path>

Example

hadoop distcp hdfs://nn1:8020/user/sara/file.txt file:///hdfsmount/user/sara

b. Issue the following command from the data-fabric cluster to verify that the file was copied to the data-fabric cluster:

hadoop fs -ls /<MapR filesystem path>

Example

hadoop fs -ls /user/sara

HPE Ezmeral Data Fabric Database

HPE Ezmeral Data Fabric Database is an enterprise-grade, high-performance NoSQL database management system. You can use it for real-time, operational analytics.

Why HPE Ezmeral Data Fabric Database?

HPE Ezmeral Data Fabric Database is built into the data-fabric platform. It requires no additional process to manage, leverages the same architecture as the rest of the platform, and requires minimal additional management.

How Do I Get Started?

Based on your role, review the HPE Ezmeral Data Fabric Database documentation. The
following table identifies useful resources based on your role.

IBM InfoSphere Insights

IBM brings a wealth of key data management components and analytics assets into its open-source distribution. The company has also launched an ambitious open-source project, Apache SystemML, for machine learning.

With IBM BigInsights, customers can get to market much faster with applications that integrate advanced Big Data analytics.

Hadoop vendors continue to evolve as Big Data technologies see ever wider adoption and vendor revenues grow. However, these vendors face stiff competition in the Big Data world, and it is difficult for firms to select the best-suited tool for their organization from such a wide range of players.

IBM InfoSphere BigInsights

InfoSphere® BigInsights™ is a software platform for discovering, analyzing, and visualizing data from disparate sources. You use this software to help process and analyze the volume, variety, and velocity of data that continually enters your organization every day.

InfoSphere BigInsights helps your organization to understand and analyze massive volumes of unstructured information as easily as smaller volumes of information. The flexible platform is built on an Apache Hadoop open source framework that runs in parallel on commonly available, low-cost hardware. You can easily scale the platform to analyze hundreds of terabytes, petabytes, or more of raw data that is derived from various sources. As information grows, you add more hardware to support the influx of data.

InfoSphere BigInsights helps application developers, data scientists, and administrators in your organization quickly build and deploy custom analytics to capture insight from data. This data is often integrated into existing databases, data warehouses, and business intelligence infrastructure. By using InfoSphere BigInsights, users can extract new insights from this data to enhance knowledge of your business.

InfoSphere BigInsights incorporates tooling for numerous users, speeding time to value and
simplifying development and maintenance:

 Software developers can use the Eclipse-based plug-in to develop custom text
analytic functions to analyze loosely structured or largely unstructured text data.

 Administrators can use the web-based management console to inspect the status of
the software environment, review log records, assess the overall health of the system, and
more.

 Data scientists and business analysts can use the data analysis tool to explore and
work with unstructured data in a familiar spreadsheet-like environment.

Analytic Applications

InfoSphere® BigInsights™ provides distinct capabilities for discovering and analyzing business insights that are hidden in large volumes of data. These technologies and features combine to help your organization manage data from the moment that it enters your enterprise.

By combining these technologies, InfoSphere BigInsights extends the Hadoop open source
framework with enterprise-grade security, governance, availability, integration into existing
data stores, tools that simplify developer productivity, and more.

Hadoop is a computing environment built on top of a distributed, clustered file system that
is designed specifically for large-scale data operations. Hadoop is designed to scan through
large data sets to produce its results through a highly scalable, distributed batch processing
system. Hadoop comprises two main components: a file system, known as the Hadoop
Distributed File System (HDFS), and a programming paradigm, known as Hadoop MapReduce. To develop applications for Hadoop and interact with HDFS, you use additional technologies and programming languages such as Pig, Hive, Jaql, Flume, and many others.

Apache Hadoop helps enterprises harness data that was previously difficult to manage and
analyze. InfoSphere BigInsights features Hadoop and its related technologies as a core
component.

MapReduce

MapReduce applications can process large data sets in parallel by using a large number of
computers, known as clusters.

In this programming paradigm, applications are divided into self-contained units of work.
Each of these units of work can be run on any node in the cluster. In a Hadoop cluster, a
MapReduce program is known as a job. A job is run by being broken down into pieces,
known as tasks. These tasks are scheduled to run on the nodes in the cluster where the data
exists.

Applications submit jobs to a specific node in a Hadoop cluster, which is running a program
known as the JobTracker. The JobTracker program communicates with the NameNode to
determine where all of the data required for the job exists across the cluster. The job is then
broken into map tasks and reduce tasks for each node in the cluster to work on. The
JobTracker program attempts to schedule tasks on the cluster where the data is stored,
rather than sending data across the network to complete a task. The MapReduce framework
and the Hadoop Distributed File System (HDFS) typically exist on the same set of nodes,
which enables the JobTracker program to schedule tasks on nodes where the data is stored.

As the name MapReduce implies, the reduce task is always completed after the map task. A MapReduce job splits the input data set into independent chunks that are processed by map tasks, which run in parallel. These bits, known as tuples, are key/value pairs. The reduce task takes the output from the map task as input, and combines the tuples into a smaller set of tuples.
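
To make the paradigm concrete, the following is a minimal, purely local Python sketch of the map, shuffle, and reduce phases for a word count (an illustration of the idea only, not an actual Hadoop job):

Python

from collections import defaultdict

lines = ["big data needs hadoop", "hadoop processes big data"]

# Map phase: each input line is split into (word, 1) key/value tuples.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: tuples are grouped by key so that each word's counts land together.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: the counts for each key are combined into a smaller set of tuples.
reduced = {word: sum(counts) for word, counts in grouped.items()}
print(reduced)  # {'big': 2, 'data': 2, 'needs': 1, 'hadoop': 2, 'processes': 1}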

A set of programs that run continuously, known as TaskTracker agents, monitor the status of
each task. If a task fails to complete, the status of that failure is reported to the JobTracker
program, which reschedules the task on another node in the cluster.

This distribution of work enables map tasks and reduce tasks to run on smaller subsets of larger data sets, which ultimately provides maximum scalability. The MapReduce framework also maximizes parallelism by manipulating data stored across multiple clusters. MapReduce applications do not have to be written in Java™, though most MapReduce programs that run natively under Hadoop are written in Java.

Hadoop MapReduce runs on the JobTracker and TaskTracker framework from Hadoop
version 1.1.1, and is integrated with the new Common and HDFS features from Hadoop version 2.2.0. The MapReduce APIs used in InfoSphere® BigInsights™ are compatible with Hadoop version 2.2.0.

InfoSphere BigInsights supports various scenarios

Predictive modeling

Predictive modeling refers to uncovering patterns to help make business decisions, such as
forecasting a propensity for fraud, or determining how pricing affects holiday candy sales
online. Finding patterns has traditionally been at the core of many businesses,
but InfoSphere® BigInsights™ provides new methods for developing predictive models.

Banking: Fraud reduction

A major banking institution developed fraud models to help mitigate risk of credit card
fraud. However, the models took up to 20 days to develop by using traditional methods. By
understanding fraud patterns nearly one month after an incident occurred, the bank was
only partially able to detect fraud.
The bank used InfoSphere BigInsights to create models to help uncover patterns of events
within a customer's life that correlate to fraud. Events such as divorce, home foreclosure, and
job loss could be tracked and incorporated into fraud models. These additional insights
helped the bank to develop more accurate models in a fraction of the time. More robust and
current fraud models helped the bank to detect existing fraud patterns and stop additional
fraud before incurring losses.

Healthcare: Improved patient care

A healthcare insurance provider completed analysis on more than 400 million insurance
claims to determine the potential dangers of interacting drugs. The IT organization
developed a system to analyze large sets of patient data against a database of drugs,
including their interactions with other drugs. Because the patient reference and treatment
data is complex and intricately nested, the analysis could take more than 100 hours per data
set.

The insurance provider used InfoSphere BigInsights to develop a solution that reduced analysis from 100 hours to 10 hours. By cross-referencing a list of prescriptions for each
patient with an external service for known interaction problems with different drugs, the
service can flag potential conflicts of drug to drug, drug to disease, and drug to allergy
interactions. The insurance provider can provide more quality recommendations for each
patient and lower the overall cost of care.

Retail: Targeted marketing

A large retailer wanted to better understand customer behavior in physical stores versus
behavior in the online marketplace to improve marketing in both spaces. Gaining insight
required exploring massive amounts of web logs and data from physical stores. All of this information was available in the data warehouse, but the retailer could not efficiently map
the disparate data.

By using InfoSphere BigInsights, the retailer parsed varying formats of web log files and
mapped that information in the data warehouse, linking buying behavior online with
behavior in physical stores. Predictive analytics helped to distinguish patterns across these
dimensions. BigSheets was used to visualize and interact with the results to determine which
elements to track. The data from these elements was cleansed in extract-transform-load
(ETL) jobs and loaded into a data warehouse, enabling the retailer to act on information
about buying habits. The retailer used the results to develop more targeted marketing, which
led to increased sales.

Consumer sentiment insight

Developing consumer insight involves uncovering consumer sentiments for brand, campaign,
and promotions management. InfoSphere® BigInsights™ helps to derive customer
sentiment from social media messages, product forums, online reviews, blogs, and other
customer-driven content.

Retail: Consumer goods

A large company specializing in drinks spent a substantial portion of its budget to market its
brands. The company had competitive products in the soft drink, bottled water, and sports
drink markets. To better understand customer sentiment and brand perception, the company
wanted to track information in online social media forums. Positive and negative
commentary, discussion around products, and perception of spokespersons needed to be
gauged to help the company better target its marketing campaigns and promotions.

The company used a third-party tool to provide a view of this information, but concerns
arose about the validity of the results. The tool could not analyze large volumes of
information, so the overall view was only a portion of the data that could be incorporated. In
addition, the software could not understand inherent meanings within text, or focus on
relevant topics without stumbling over literal translations of misspellings, syntax, or jargon.

By using InfoSphere BigInsights, the company aggregated massive amounts of information from social media sites such as blogs, message boards, forums, and news feeds. The
company used the InfoSphere BigInsights text analytics capabilities to sift through the
collected information and find relevant discussions about their products. The analysis
determined which discussions were favorable or not favorable, including whether the
conversation was about one of the company spokespersons, or about another person with
the same name. Having the right granularity and relevance from a large sample of data
helped the company to obtain useful insights that were used to improve customer sentiment
and enhance brand perception across their line of products.

Research and business development

Research and development involves designing and developing products that adapt to
competitor offerings, customer demand, and innovation. InfoSphere® BigInsights™ helps to
collect and analyze information that is critical to implementing future projects, and turning
project visions into reality.

Government: Data archiving

A government library wanted to preserve the digital culture of the nation as related websites
are published, modified, or removed daily. The information would be incorporated into a
web archive portal that researchers and historians would use to explore preserved web
content. Research analysts classified several thousand websites with the country extension by
using manual methods and customized tools. The estimate showed that manually archiving the entire web domain, comprising more than 4 million websites, would be costly.

The government library used InfoSphere BigInsights and a customized classification module to electronically classify and tag web content and create visualizations across numerous
commodity computers in parallel. The solution drastically reduces the cost of archiving
websites and helps to archive and preserve massive numbers of websites. As websites are
added or modified, the content is automatically updated and archived so that researchers
and historians can explore and generate new data insights.

Financial services: Acquisition management

A major credit card company wanted to automate the process of measuring the value of
potential acquisitions. This comprehensive analysis needed to include public and private
information, including patents and trademarks, company press releases, annual and quarterly
reports, corporate genealogies, and IP ownership and patents ranked by citation. Because
mergers and acquisitions take considerable time to complete, the company needed ongoing
tracking capabilities. In addition, the solution had to include the ability to present
information in a graphical format that business professionals can comprehend easily.

The company used InfoSphere BigInsights to improve intellectual property analysis for mergers and acquisitions by gathering information about millions of patents from the web.
Millions of citations were also extracted, and then correlated with the patent records. The
correlations were the basis for a ranking value measurement. The more a patent is
referenced, the greater value or usefulness it has.

The analysis can be completed in hours, compared with weeks that would be required if
manual methods were used. Ongoing tracking and refresh capabilities allow business
professionals to review updates as they occur, providing current and comprehensive insights
into potential acquisitions.

8 Applications of Big Data in Real Life

Big Data has totally changed and revolutionized the way businesses and organizations work. In this section, we will go deep into the major Big Data applications in various sectors and industries and learn how these sectors benefit from them.

In this era, where every aspect of our day-to-day lives is gadget-oriented, a huge volume of data emanates from various digital sources.

Needless to say, we have faced a lot of challenges in the analysis and study of such a huge volume of data with traditional data processing tools. To overcome these challenges, big data solutions such as Hadoop were introduced, and these tools really helped realize the applications of big data.

More and more organizations, both big and small, are leveraging the benefits provided by big data applications. Businesses find that these benefits can help them grow fast. There are lots of opportunities emerging in this area; if you want to become a master in Big Data, check out this Big Data Hadoop Training.

Big Data in Education Industry

The education industry is flooded with huge amounts of data related to students, faculty, courses, results, and more. We now realize that the proper study and analysis of this data can provide insights that can be used to improve the operational effectiveness and working of educational institutes.

Following are some of the fields in the education industry that have been transformed by big
data-motivated changes:

 Customized and Dynamic Learning Programs

Customized programs and schemes to benefit individual students can be created using the data collected on the basis of each student's learning history. This improves the overall student results.

 Reframing Course Material

Reframing the course material according to data collected through real-time monitoring of course components, on the basis of what a student learns and to what extent, is beneficial for the students.

 Grading Systems

New advancements in grading systems have been introduced as a result of a proper analysis
of student data.

 Career Prediction

Appropriate analysis and study of every student’s records will help understand each
student’s progress, strengths, weaknesses, interests, and more. It would also help in
determining which career would be the most suitable for the student in future.

The applications of big data have provided a solution to one of the biggest pitfalls in the education system, that is, the one-size-fits-all fashion of the academic set-up, by contributing to e-learning solutions.

Example

The University of Alabama has more than 38,000 students and an ocean of data. In the past, when there were no real solutions to analyze that much data, much of it seemed useless. Now, administrators are able to use analytics and data visualization on this data to draw out patterns of student behavior, revolutionizing the university's operations, recruitment, and retention efforts.

Big Data in Healthcare Industry

Healthcare is yet another industry which is bound to generate a huge amount of data.
Following are some of the ways in which big data has contributed to healthcare:

 Big data reduces the cost of treatment, since there is less chance of having to perform unnecessary diagnoses.

 It helps in predicting outbreaks of epidemics and in deciding what preventive measures could be taken to minimize their effects.

 It helps avoid preventable diseases by detecting them in early stages, which prevents them from getting worse and in turn makes their treatment easier and more effective.

 Patients can be provided with evidence-based medicine, which is identified and prescribed after research on past medical results.

Example

Wearable devices and sensors have been introduced in the healthcare industry which can provide a real-time feed to a patient's electronic health record. One such technology is from Apple.

Apple has come up with Apple HealthKit, CareKit, and ResearchKit. The main goal is to empower iPhone users to store and access their real-time health records on their phones.

Big Data in Government Sector

Governments of every country come face to face with a huge amount of data on an almost daily basis. The reason is that they have to keep track of various records and databases regarding their citizens, their growth, energy resources, geographical surveys, and much more. All this data contributes to big data. The proper study and analysis of this data, hence, helps governments in endless ways. A few of them are as follows:

Welfare Schemes

 In making faster and informed decisions regarding various political programs

 To identify areas that are in immediate need of attention

 To stay up to date in the field of agriculture by keeping track of all existing land and
livestock.

 To overcome national challenges such as unemployment, terrorism, energy resources exploration, and much more.

Cyber Security

 Big Data is hugely used for fraud detection.

 It is also used in catching tax evaders.

Example

The Food and Drug Administration (FDA), which runs under the jurisdiction of the federal government of the USA, leverages the analysis of big data to discover patterns and associations in order to identify and examine expected or unexpected occurrences of food-based infections.

Big Data in Media and Entertainment Industry

With people having access to various digital gadgets, the generation of large amounts of data is inevitable, and this is the main cause of the rise of big data in the media and entertainment industry.

Beyond this, social media platforms are another source of huge amounts of data. Businesses in the media and entertainment industry have realized the importance of this data and have been able to benefit from it for their growth.

Some of the benefits extracted from big data in the media and entertainment industry
are given below:

 Predicting the interests of audiences

 Optimized or on-demand scheduling of media streams in digital media distribution platforms

 Getting insights from customer reviews

 Effective targeting of the advertisements

Example

Spotify, an on-demand music providing platform, uses Big Data Analytics, collects data from
all its users around the globe, and then uses the analyzed data to give informed music
recommendations and suggestions to every individual user.

Amazon Prime, which offers videos, music, and Kindle books in a one-stop shop, also makes heavy use of big data.

Big Data in Weather Patterns

There are weather sensors and satellites deployed all around the globe. A huge amount of
data is collected from them, and then this data is used to monitor the weather and
environmental conditions.

All of the data collected from these sensors and satellites contribute to big data and can be
used in different ways such as:

 In weather forecasting

 To study global warming

 In understanding the patterns of natural disasters

 To make necessary preparations in the case of crises

 To predict the availability of usable water around the world

Example

IBM Deep Thunder, which is a research project by IBM, provides weather forecasting through
high-performance computing of big data. IBM is also assisting Tokyo with improved weather forecasting for natural disasters and with predicting the probability of damaged power lines.

Big Data in Transportation Industry

Since the rise of big data, it has been used in various ways to make transportation more
efficient and easy. Following are some of the areas where big data contributes to
transportation.

 Route planning: Big data can be used to understand and estimate users’ needs on
different routes and on multiple modes of transportation and then utilize route
planning to reduce their wait time.

 Congestion management and traffic control: Using big data, real-time estimation
of congestion and traffic patterns is now possible. For example, people are using
Google Maps to locate the least traffic-prone routes.

 Safety level of traffic: Using the real-time processing of big data and predictive
analysis to identify accident-prone areas can help reduce accidents and increase the
safety level of traffic.

Example

Let’s take Uber as an example here. Uber generates and uses a huge amount of data
regarding drivers, their vehicles, locations, every trip from every vehicle, etc. All this data is
analyzed and then used to predict supply, demand, location of drivers, and fares that will be
set for every trip.

And guess what? We too make use of this application when we choose a route to save fuel
and time, based on our knowledge of having taken that particular route sometime in the
past. In this case, we analyzed and made use of the data that we had previously acquired on
account of our experience, and then we used it to make a smart decision. It's pretty cool that big data plays a part not only in big fields but also in our smallest day-to-day decisions.

Big Data in Banking Sector

The amount of data in the banking sector is skyrocketing every second. According to a GDC forecast, this data is estimated to grow by 700 percent by the end of the next year. Proper study and analysis of this data can help detect any and all illegal activities that are being carried out, such as:

o Misuse of credit/debit cards

o Venture credit hazard treatment

o Business clarity

o Customer statistics alteration

o Money laundering

o Risk mitigation

Example

Various anti-money-laundering software packages such as SAS AML use data analytics in banking for the purpose of detecting suspicious transactions and analyzing customer data. Bank of America has been a SAS AML customer for more than 25 years.

Big Data in Transforming Real Estate

Until the involvement of big data, the real estate industry had been using the same old ways of networking, and many other businesses likewise carried on with outdated methods for handling their transactions. Today, however, the real estate industry is also welcoming big data. Like the retail market, the banking sector, and health services, which have been using big data to advance their services and solve their internal problems, real estate has also started to use it for its own advancement. Still, some real estate agents question whether big data can really offer better facilities than the experts who know the ins and outs of property sales. But by analyzing the record of big data in education, health, technology, and other services, and by looking at certain examples, we can conclude that the application of big data can really change the real estate game.

Here are a few of the ways in which big data can bring change to the real estate industry:

 Developing a Great Residential Society

 Providing the Real Estate Buyers with Standard Information

 Even Smart Property Agents Cannot Fool You Anymore

 Property Dealers Cannot Fool Investors Anymore

 Help Real Estate Developers to Satisfy every Need of the Customer

 Making Construction Administration Better

Developing a Great Residential Society

Big data is helping real estate not only with essential information but also with newer and interesting techniques for planning a society-building strategy. To build a residential society, proper planning has to be done. Here big data helps a great deal by supplying research that lets property developers know about the quality of the environment, a balanced atmosphere, and much more that is essential for the construction of a residential society. Big data analytics also provides real estate builders with information regarding health care and energy efficiency needed for proper construction development. Data even plays a big role in engineering, helping civil engineers design buildings with stronger foundations and excavations.

Providing the Real Estate Sellers and Buyers with Standard Information

Today, using big data, companies are able to build accurate performance insights for their businesses. Without big data technologies, real estate businesses used to rely on unprocessed data, which never gave them completely accurate results. The application of big data has turned real estate into a far more transparent business, and along with such transparency, business analytics has also taken a new turn. The processing of raw data and related information is now handled with big data analytics, which today manages information extracted from census outputs, the results of consumer appraisals, catalogues of homes for sale and lease, geographic information system data, and more. Big data analytic tools study the patterns customers follow while buying properties, and this data helps improve real estate sales and also helps track home-value trends within a particular environment. From the information generated by big data about real estate, people can easily find out about the different buildings and houses available for rent or for sale in a given place.

You do not have to worry about paying a broker a huge sum to find a property in your neighbourhood. All you have to do is check the websites that list real estate houses and properties for rent, lease, and sale. Big data has made information about sales in every corner of a region available on the internet. Even if you hire a real estate agent, you can always verify prices by comparing them with similar properties online. A quick web survey will help you reach a better conclusion and protect you from a huge loss you might otherwise have suffered. A real estate agent holds complete knowledge of buyer incentives, exclusive offers in the market, and other insights. Hiring a good real estate agent is really helpful, but do not forget to use the internet to cross-check what the agent is telling you.

Property Dealers Cannot Fool Investors Anymore

With the help of information gained from big data analytics, even the investment sector, including banks, can now find out whether the exclusive offers made by real estate businesses are really worthwhile. Thanks to the transparency that big data insights bring to real estate, people are well aware of which properties they should buy and which they should not. Big data has also given the banking sector full visibility into real estate insights, so that before investing any money, investors can make sure that every rupee is well spent on that particular property.

Help Real Estate Developers to Satisfy every Need of the Customers

Just as big data helps customers buy products safely, it also gives a complete picture of the search procedures customers follow while looking for properties. Using those customer search patterns, big data helps real estate developers promote their businesses the way customers want to see them. Information from different sources allows real estate developers to gain a deeper understanding of what customers want, so they can build those requirements into their properties and satisfy customer demands.

Making Construction Administration Better

Big data has also extended a helping hand to construction management, as a result of which real estate developers are able to keep track of all their construction projects. With the help of sensors and cameras, every movement on the construction site can be tracked, resulting in more secure businesses. The application of smart devices in real estate projects may seem a bit costly, but the results are worth the investment.

Conclusion

No wonder there is so much hype around big data, given all of its applications. The importance of big data lies in how an organization uses the collected data, not in how much data it has been able to collect. There are Big Data solutions that make the analysis of big data easy and efficient, and they are used to gain benefits from the heaping amounts of data in almost all industry verticals.

Kick-start Your Career in Big Data and Hadoop

We are sure that, like everybody else, you have heard about how Big Data is taking the world by storm and how a career in Hadoop can really take your future places. But we are also sure that, like everybody else, you have scores of questions but just don't know whom to ask for guidance.

A recent study from the global consulting powerhouse McKinsey revealed that, for the first time, cross-border data exchange is contributing more to global GDP than the global exchange of goods (international trade, to be precise). In short, data is the new trade that the world is enamored with. Did you know that cross-border data exchange via undersea internet cables has grown 45 times since 2005? It is further expected to grow another 9 times within the next five years. That is data on a scale almost too grand to fathom, right?

So the natural question you might ask is: where does all this data come from? The biggest contributor is internet video and our penchant for watching it on YouTube or Facebook; then there are our online conversations and interactions on social networking sites, our ecommerce shopping data, financial and credit card information, all the website content, blogs, infographics, statistics, and so on. But we now have a new contributor to this league, and it comes from the various machines connected to the internet in the grand scheme of things called the Internet of Things. In the future, IoT will be the biggest contributor of data exchange travelling from one end of the globe to the other at the speed of light over our internet networks. So that is Big Data for you in a nutshell.

Now that your doubts about what comprises Big Data and how big it really is have been cleared once and for all, we can move on. This should naturally convince you to take up a career in Big Data and Hadoop. The next question that you might have is: what is Hadoop and why do I need to learn it? Hadoop is a software framework for processing amounts of data that are just too overwhelming for regular database processing tools or software. Scalability is one of the hallmarks of Hadoop, as it can be scaled from a single machine to thousands as and when the need arises. Hadoop is agile, resilient, and versatile, among other features, making it the most important Big Data processing framework available today.

Prerequisites for a career in Hadoop

Let us start by saying that the first and foremost requirement is a love for working with data, an inquisitive mind, a readiness to think out of the box, and a creative thought process. This is not to confuse you in any manner; all we want to say is that the technical know-how and the programming skills can all be acquired, with the right frame of mind, in a very short time.

So, if you are really keen on a career in Hadoop, we suggest you start by learning the nuances of any programming language, be it Java, Perl, Python, Ruby, or even C. Once you have a basic grasp of how programming languages work and how to write an effective algorithm, you are good to go to begin learning Hadoop.

Career prospects upon completion of Hadoop Certification

Let us reiterate: since the amount of data is only going to increase in the future and the field of Big Data hasn't yet been saturated, you can expect stellar career growth opportunities with the right skill sets and domain-level expertise in Hadoop. In India, with just a few years of experience, Big Data professionals can command a salary upwards of Rs. 10 lakhs per annum. While there are many different roles and job opportunities in the Big Data domain, you can choose the one that suits you best from among those mentioned here: Hadoop Architect, Hadoop Developer, Data Scientist, Data Visualizer, Research Analyst, Data Engineer, Data Analyst, and Code Tester.

Industries that are currently hiring Hadoop professionals

Today we have reached a phase where working with huge amounts of data is imperative for
organizations regardless of their industry vertical and customer segmentation. So expect a
whole host of industry verticals vying for your attention in their pursuit of hiring the best
talent in Big Data and Hadoop. Some of the popular business sectors currently hiring are
banking, insurance, ecommerce, hospitality, manufacturing, marketing, advertising, social
media, healthcare, transportation, and the list can be almost endless. There will be an
overwhelming need for at least 1.5 million Big Data professionals and analysts by 2018 in the
United States alone. Know that most of these could be filled by talented Indian professionals
since India is the biggest talent pool in the IT sector currently in the world!

How is all that Big Data being used by business enterprises?

Today our world is firmly in a knowledge economy. It is no longer about how much capital or manpower your business organization has; it inevitably boils down to how much data, information, and knowledge you have in the grand scheme of things. Data is the new arms race, and the global multinational enterprises are the new superpowers.

Let’s consider a few examples to elucidate this:

Some of the largest technology companies in the world, viz. Google, Apple, Amazon, eBay, and Facebook, run on the power of the Big Data they access and make sense of in order to gain a definitive advantage over their rivals. It is vital to emphasize that more data is not just a little bit extra: more signifies new, more signifies better, and more can even signify something radically different. Did you know that, using IBM Watson cognitive computing technology, doctors in the United States were able to detect symptoms of cancer they never knew existed in the first place? All this thanks to Big Data!

Among other things, Big Data helps organizations to understand their customers better,
detect patterns, segregate demographics, anticipate sales, and better pitch the products and
services to the customers in a more personalized and user-friendly manner. It assists
enterprises to predict the future, understand growth opportunities, look for newer markets
and better ways to service the customers among a million other things that only Big Data
can afford these forward-thinking enterprises.

Finally, remember that only 0.5% of all the data that is available today has ever been
analyzed or utilized. So a whole sea of information lies right in front of us, and it needs the unflinching support of professionals like you to steer the global economy into the next orbit of growth and prosperity. So are you game to take up a career in Big Data and Hadoop now? One thing we can assure you of is that with Big Data and Hadoop your career will never be the same again.

Most Valuable Data Science Skills Of 2020

“Information is the oil of the 21st century, and analytics is the combustion engine.”–
Peter Sondergaard, SVP, Gartner Research.

So you have heard how data science is one of the best jobs of the 21st century. Since it is a
relatively young domain the scope and opportunities in this field are aplenty for people with
the right set of skills and qualifications.

2019 could be the defining year in the Data Science domain.

Today, regardless of business size or industry type, there is a need for quality professionals who can decipher all that unstructured data and strategize business goals. So let's delve deeper into the skills needed to pursue a career in the Big Data sphere.

It’s a no-brainer that technical skills are a must for any data scientist. Having the right
education matters a lot. It could be an Engineering degree, a Master’s degree or even a PhD
in a field of your choice. Having an analytical bent of mind goes a long way in securing your
future in this arena.

Some of The Popular Fields of Science That Are Much Sought-After

 Statistics and Mathematics

 Computer Science

 Analytical Reasoning

When it comes to analytical skills, most companies are looking for people skilled in either SAS or R programming. For data science, though, R is the most preferred analytical tool.

Technical Expertise and Coding Skills

Knowledge of Hadoop: Since Hadoop is the most popular Big Data framework, you are expected to have a good understanding of it. This includes MapReduce processing and working with the Hadoop Distributed File System (HDFS).

Programming skills: There are various programming languages, the most common of which are Python, Java, C, and Perl. A good grasp of any of these coding languages will be very useful.

SQL database: SQL is the most common way of getting information from a database and
updating it. Candidates need to know how to query in SQL since it is the most preferred
language for RDBMS.

NoSQL: the range of today’s data collected is so wide and diverse that SQL alone cannot
provide all the solutions. This is where NoSQL takes over in order to make sense of
databases that are not in tabular form and are thus more complex.

Non-Technical Expertise

Intellectual curiosity: Though technical skills are vital for a successful career in the data science domain, they are by far not the only requirement. Candidates should have a strong thirst for knowledge and the initiative to use their intelligence to parse a problem. It is the skill of understanding not only "what is" but also "what can be" when it comes to Big Data applications.

Strong business acumen: All data science personnel work in a business environment, and hence a clear understanding of the business domain is a must-have skill. Knowledge of what real-world problems your organization is trying to solve is expected, plus a knack for deploying data in new ways so that your organization can benefit in hitherto unheard-of ways.

Excellent communication: A large part of a data scientist's job involves communicating with different departments in order to get the work done. Sometimes they have to be the liaison between technical and non-technical staff, so a good knowledge of the industry is a must. Apart from that, they need good management and people skills in order to take all stakeholders into confidence.

Skill Set Advancement

Online courses: A lot of online training courses and tutorials are available to help freshers and seasoned professionals alike make it big in the data science domain.

Professional certification: Companies like IBM and Cisco are at the forefront of ensuring that the right candidates get the right jobs. Hence, they provide industry-recognized certifications to worthy candidates upon completion of certain courses and training.

Reputable hackathons: If you live in a city with a vibrant IT ecosystem (like San Francisco in the USA or Bangalore in India), chances are that there are regular hackathons where programmers and other technical professionals meet and work on short, intense projects with huge real-world significance.

Big Data Job Responsibilities and Skills

Hadoop is the new data warehouse. It is the new source of data within the enterprise. There is
a premium on people who know enough about the guts of Hadoop to help companies take
advantage of it. – James Koibelus, Analyst at Forrester Research

Big Data and Big Data jobs are everywhere. Let’s leave the clichés behind and cut to the
chase: a Hadoop professional can earn an average salary of $112,000 per year and, in San
Francisco, it can go up to $160,000. Now that we have your undivided attention, let us delve
into what exactly is meant by being a Hadoop professional and what the roles and
responsibilities of a Hadoop professional are.

General Skills Expected from Hadoop Professionals

 Ability to work with huge volumes of data so as to derive Business Intelligence

 Ability to analyze data, uncover information, derive insights, and propose data-driven strategies

 Knowledge of OOP languages like Java, C++, and Python

 Understanding of database theories, structures, categories, properties, and best practices

 Knowledge of installing, configuring, maintaining, and securing Hadoop

 Analytical mind and ability to learn-unlearn-relearn concepts

What Are the Various Job Roles Under the Hadoop Domain?

 Hadoop Developer

 Hadoop Architect

 Hadoop Administrator

 Hadoop Tester

 Data Scientist

The US alone faces a shortage of 1.4–1.9 million Big Data Analysts!

Hadoop Developer Roles and Responsibilities

The primary job of a Hadoop Developer involves coding. Hadoop Developers are essentially
software programmers working in the Big Data Hadoop domain. They are adept at coming up
with design concepts for building large-scale software applications and are proficient in
programming languages.

A professional Hadoop Developer can expect an average salary of US$100,000 per annum!

Below are duties you can expect as part of your Hadoop Developer work routine:

 Have knowledge of the agile methodology for delivering software solutions

 Design, develop, document, and architect Hadoop applications

 Manage and monitor Hadoop log files

 Develop MapReduce code that runs seamlessly on Hadoop clusters

 Have working knowledge of SQL, NoSQL, data warehousing, and database administration

 Be an expert in newer concepts like Apache Spark and Scala programming

 Acquire complete knowledge of the Hadoop ecosystem and Hadoop Common

 Seamlessly convert hard-to-grasp technical requirements into outstanding designs

 Design web services for swift data tracking and query data at high speeds

 Test software prototypes, propose standards, and smoothly transfer them to operations

Most companies estimate that they’re analyzing a mere 12 percent of the data they have!

Hadoop Architect Roles and Responsibilities

A Hadoop Architect, as the name suggests, is entrusted with the tremendous responsibility of
deciding where the organization will go in terms of Big Data Hadoop deployment. The
architect is involved in planning, designing, and strategizing the roadmap and deciding how
the organization moves forward.

Below are duties you can expect as part of your Hadoop Architect work routine:

 Have hands-on experience in working with Hadoop distribution platforms like
Hortonworks, Cloudera, MapR, and others

 Take end-to-end responsibility of the Hadoop life cycle in the organization

 Be the bridge between Data Scientists, Engineers, and the organizational needs

 Perform in-depth requirement analyses and select the appropriate work platform

 Acquire full knowledge of Hadoop architecture and HDFS

 Have working knowledge of MapReduce, HBase, Pig, Java, and Hive

 Choose a Hadoop solution that can be deployed without any hindrance

75 percent of companies are investing or planning to invest in Big Data already – Gartner

Hadoop Administrator Roles and Responsibilities

The Hadoop Administrator is also a very prominent role, responsible for ensuring that there is
no roadblock to the smooth functioning of the Hadoop framework. The roles and
responsibilities of this job profile resemble those of a System Administrator. A complete
knowledge of the hardware ecosystem and Hadoop architecture is critical.

A certified Hadoop Administrator can expect an average salary of US$123,000 per year!

Below are duties you can expect as part of your Hadoop Administrator work routine:

 Manage and maintain Hadoop clusters for uninterrupted job execution

 Routinely check, back up, and monitor the entire system

 Ensure that the connectivity and network are always up and running

 Plan for capacity upgrading or downsizing as and when the need arises

 Manage HDFS and ensure that it is working optimally at all times

 Secure the Hadoop cluster in a foolproof manner

 Regulate the administration rights depending on the job profile of users

 Add new users over time and discard redundant users smoothly

 Have full knowledge of HBase for efficient Hadoop administration

 Be proficient in Linux scripting and also in Hive, Oozie, and HCatalog

For a Fortune 1000 company, a 10 percent increase in data accessibility can result in US$65
million in additional income!

Hadoop Tester Roles and Responsibilities

The job of a Hadoop Tester has become extremely critical as Hadoop networks grow bigger
and more complex with each passing day. This creates new challenges around viability and
security, and around ensuring that everything works smoothly without any bugs or issues. A
Hadoop Tester is primarily responsible for troubleshooting Hadoop applications and rectifying
any problems discovered as early as possible, before they become serious threats.

An expert Hadoop Testing Professional can earn a salary of up to US$132,000 per annum!

Below are duties you can expect as part of your Hadoop Tester work routine:

 Construct and deploy both positive and negative test cases

 Discover, document, and report bugs and performance issues

 Ensure that MapReduce jobs are running at peak performance

 Check if the constituent Hadoop scripts like HiveQL and Pig Latin are robust

 Have expert knowledge of Java to efficiently do the MapReduce testing

 Understand the MRUnit and JUnit testing frameworks

 Be fully proficient in Apache Pig and Hive

 Be an expert in working with the Selenium test automation tool

 Be able to come up with contingency plans in case of breakdown

Data Scientist Roles and Responsibilities

Data Scientist is a much sought-after job role in the market today, and there aren't enough
qualified professionals to take up this high-paying job that enterprises are ready to offer.
What makes a Data Scientist such a hot commodity in the job market? Part of the allure lies
in the fact that a Data Scientist wears multiple hats over the course of a typical day at the
office: scientist, artist, and magician!

The average salary of a Data Scientist is US$123,000 per annum!

So far, only less than 0.5 percent of all data is ever analyzed and used!

Data Scientists are basically Data Analysts with wider responsibilities. Below are duties you
can expect as part of your Data Scientist work routine:

 Master a wide range of data analysis techniques

 Expect to solve real business issues backed by solid data

 Tailor the data analytics ecosystem to suit the specific business needs

 Have a strong grasp of mathematics and statistics

 Keep the big picture in mind at all times to know what needs to be done

 Develop data mining architecture, data modeling standards, and more

 Have an advanced knowledge of SQL, Hive, and Pig

 Be an expert in working with R, SPSS, and SAS

 Acquire the ability to corroborate actions with data and insights

 Have the creativity to do things that can work wonders for the business

 Have top-notch communication skills to connect with everybody on board in the organization

Big Data Terminologies You Must Know

We will discuss the terminology related to the Big Data ecosystem. This will give you a complete
understanding of Big Data and its terms.

Over time, Hadoop has become the nucleus of the Big Data ecosystem, where many new
technologies have emerged and been integrated with Hadoop. So it is important that we first
understand and appreciate the nucleus of modern Big Data architecture.

The Apache Hadoop software library is a framework that allows for the distributed
processing of large data sets across clusters of computers, using simple programming
models. It is designed to scale up from single servers to thousands of machines, each
offering local computation and storage.

Components of the Hadoop Ecosystem

Let’s begin by looking at some of the components of the Hadoop ecosystem:

Hadoop Distributed File System (HDFS™):


This is a distributed file system that provides high-throughput access to application
data. Data in a Hadoop cluster is broken down into smaller pieces (called blocks) and
distributed throughout the cluster. In this method, the map and reduce functions can be
executed on smaller subsets of your larger data sets, and this provides the scalability needed
for Big Data processing.
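
To make the storage model concrete, here is a minimal sketch of writing and reading a file on HDFS from Python using pyarrow's HadoopFileSystem. It assumes a reachable NameNode (the host name and port below are placeholders) and a machine with the Hadoop client libraries (libhdfs) configured, so treat it as an illustration rather than a ready-to-run recipe.

from pyarrow import fs

# Connect to the cluster's NameNode (host and port are placeholder values;
# requires libhdfs and the Hadoop client environment to be set up locally).
hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)

# Write a small file; HDFS transparently splits large files into blocks
# and replicates them across DataNodes.
with hdfs.open_output_stream("/tmp/hello.txt") as out:
    out.write(b"hello hdfs\n")

# Read it back.
with hdfs.open_input_stream("/tmp/hello.txt") as f:
    print(f.read())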

MapReduce:
MapReduce is a programming model implemented specifically for processing large data sets
on a Hadoop cluster. It is the core component of the Hadoop framework, and it is the only
execution engine available in Hadoop 1.0.

The MapReduce framework consists of two parts:

1. A function called ‘Map’, which distributes the work to the different nodes in the cluster.

2. A function called ‘Reduce’, which collects the results from those nodes and reduces them into a single output.

The main advantage of the MapReduce framework is its fault tolerance: each node in the cluster
is expected to report back periodically with its status and completed work, and if a node stays
silent for longer than the expected interval, its work is reassigned to other nodes.

The MapReduce framework is inspired by the ‘Map’ and ‘Reduce’ functions used in functional
programming. The computational processing occurs on data stored in a file system or within
a database, which takes a set of input key values and produces a set of output key values.
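
As a concrete illustration of that key/value flow, below is a minimal word-count sketch written as two Python scripts in the Hadoop Streaming style (the file names mapper.py and reducer.py are our own choice). The mapper emits (word, 1) pairs, and the reducer, which receives its input sorted by key, sums the counts per word.

# mapper.py -- reads raw text from stdin and emits tab-separated (word, 1) pairs
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py -- input arrives grouped and sorted by word; sum the counts per word
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")

On a real cluster these two scripts are typically wired together through the Hadoop Streaming jar, with HDFS paths for input and output; locally the same pipeline can be simulated with: cat input.txt | python3 mapper.py | sort | python3 reducer.py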

Each day, numerous MapReduce programs and MapReduce jobs are executed on Google’s
clusters. Programs are automatically parallelized and executed on a large cluster of
commodity machines.

MapReduce is used in distributed grep, distributed sort, web link-graph reversal, web access
log statistics, document clustering, Machine Learning, and statistical machine translation.

Pig:
Pig is a data flow language that allows users to write complex MapReduce operations in a
simple scripting language. Pig then transforms those scripts into MapReduce jobs.

Hive:
Apache Hive data warehouse software facilitates querying and managing large datasets
residing in distributed storage. Hive provides a mechanism for querying the data using a
SQL-like language called HiveQL. At the same time, this language also allows traditional
map/reduce programmers to plug in their custom mappers and reducers when it is
inconvenient or inefficient to express this logic in HiveQL.

Sqoop:
Enterprises that use Hadoop often find it necessary to transfer some of their data from
traditional relational database management systems (RDBMSs) to the Hadoop ecosystem.

Sqoop, an integral part of Hadoop, can perform this transfer in an automated fashion.
Moreover, the data imported into Hadoop can be transformed with MapReduce before
exporting them back to the RDBMS. Sqoop can also generate Java classes for
programmatically interacting with imported data.
Sqoop uses a connector-based architecture that allows it to use plugins to connect with
external databases.

Flume:
Flume is a service for streaming logs into Hadoop. Apache Flume is a distributed, reliable,
and available service for efficiently collecting, aggregating, and moving large amounts of
streaming data into the Hadoop Distributed File System (HDFS).

Storm:
Storm is a distributed, real-time computation system for processing large volumes of high-
velocity data. Storm is extremely fast and can process over a million records per second per
node on a cluster of modest size. Enterprises harness this speed and combine it with other
data-access applications in Hadoop to prevent undesirable events or to optimize positive
outcomes.

Kafka:
Apache Kafka supports a wide range of use cases such as a general-purpose messaging
system for scenarios where high throughput, reliable delivery, and horizontal scalability are
important. Apache Storm and Apache HBase both work very well in combination with Kafka.
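
As a hedged sketch of how an application might publish to and read from Kafka, the snippet below uses the kafka-python client; the broker address and topic name are placeholders, and a running Kafka broker is assumed.

from kafka import KafkaProducer, KafkaConsumer

# Publish a few messages to a topic (broker address and topic name are placeholders).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(3):
    producer.send("clickstream", f"event-{i}".encode("utf-8"))
producer.flush()

# Read the messages back from the beginning of the topic.
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating when no new messages arrive
)
for message in consumer:
    print(message.value.decode("utf-8"))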

Oozie:
Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie Workflow jobs
are Directed Acyclic Graphs (DAGs) of actions, whereas Oozie Coordinator jobs are recurrent
Oozie Workflow jobs triggered by time (frequency) and data availability.

Oozie is integrated with the rest of the Hadoop stack and supports several types of Hadoop
jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and
Distcp) as well as system-specific jobs (such as Java programs and shell scripts). Oozie is a
scalable, reliable and extensible system.

Spark:
Apache Spark is a fast, in-memory data processing engine for distributed computing clusters
like Hadoop. It runs on top of existing Hadoop clusters and accesses the Hadoop data store
(HDFS).
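
For a flavor of what this looks like in practice, here is a minimal PySpark sketch that counts words in a file; the HDFS input path is a placeholder, and it assumes a Spark installation that can reach the cluster (or run locally).

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

# Start (or reuse) a Spark session; configuration is picked up from the environment.
spark = SparkSession.builder.appName("WordCountSketch").getOrCreate()

# The input path is a placeholder; it could equally be a local file:// path.
lines = spark.read.text("hdfs:///data/sample.txt")

# Split lines into words and count occurrences, executed in memory across the cluster.
words = lines.select(explode(split(lines.value, r"\s+")).alias("word"))
counts = words.where("word != ''").groupBy("word").count()
counts.orderBy("count", ascending=False).show(10)

spark.stop()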

Apache Solr:
Apache Solr is a fast, open-source Java search server. Solr enables you to easily create search
engines that search websites, databases, and files for Big Data.

Apache Yarn:
Apache Hadoop YARN (Yet Another Resource Negotiator) is a cluster management
technology. YARN is one of the key features in the second-generation Hadoop 2 version of
the Apache Software Foundation’s open-source distributed processing framework. Originally
described by Apache as a redesigned resource manager, YARN is now characterized as a
large-scale, distributed operating system for big data applications.

Tez:
Tez is an execution engine for Hadoop that allows jobs to meet the demands for fast
response times and extreme throughput at petabyte scale. Tez represents computations as
dataflow graphs and can be used with Hadoop 2 YARN.

Apache Drill:
Apache Drill is an open-source, low-latency query engine for Hadoop that delivers secure,
interactive SQL analytics at petabyte scale. With the ability to discover schemas on the go,
Drill is a pioneer in delivering self-service data exploration capabilities on data stored in
multiple formats in files or NoSQL databases. By adhering to ANSI SQL standards, Drill does
not require a learning curve and integrates seamlessly with visualization tools.

Apache Phoenix:
Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and co-
ordinates the running of those scans to produce a regular JDBC result set. Apache Phoenix
enables OLTP and operational analytics in Hadoop for low-latency applications by combining
the best of both worlds. Apache Phoenix is fully integrated with other Hadoop products such
as Spark, Hive, Pig, Flume, and MapReduce.

Cloud Computing:
Cloud Computing is a type of computing that relies on sharing computing resources rather
than having local servers or personal devices to handle applications. Cloud Computing is
comparable to grid computing, a type of computing where the unused processing cycles of
all computers in a network are harnessed to solve problems that are too processor-intensive
for any single machine.

In Cloud Computing, the word cloud (also phrased as “the cloud”) is used as a metaphor for
the Internet, hence the phrase cloud computing means “a type of Internet-based computing”
in which different services such as servers, storage and applications are delivered to an
organization’s computers and devices via the Internet.

NoSQL:
The NoSQL database, also called Not Only SQL, is an approach to data management and
database design that’s useful for very large sets of distributed data. This database system is
non-relational, distributed, open-source and horizontally scalable. NoSQL seeks to solve the
scalability and big-data performance issues that relational databases weren’t designed to
address.

Apache Cassandra:
Apache Cassandra is an open-source distributed database system designed for storing and
managing large amounts of data across commodity servers. Cassandra can serve as both a
real-time operational data store for online transactional applications and a read-intensive
database for large-scale business intelligence (BI) systems.
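
To show what interacting with Cassandra looks like, here is a minimal sketch using the DataStax Python driver (cassandra-driver); it assumes a single-node cluster reachable on localhost, and the keyspace and table names are made up for the example.

from cassandra.cluster import Cluster

# Connect to a local, single-node cluster (the address is a placeholder).
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# Keyspace and table are invented for the example; SimpleStrategy with a
# replication factor of 1 is only suitable for a development setup.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("CREATE TABLE IF NOT EXISTS demo.users (id int PRIMARY KEY, name text)")

# Insert and read back a row using parameterized CQL.
session.execute("INSERT INTO demo.users (id, name) VALUES (%s, %s)", (1, "Alice"))
for row in session.execute("SELECT id, name FROM demo.users"):
    print(row.id, row.name)

cluster.shutdown()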

SimpleDB:
Amazon Simple Database Service (SimpleDB), also known as a key value data store, is a
highly available and flexible non-relational database that allows developers to request and
store data, with minimal database management and administrative responsibility.

This service offers simplified access to a data store and query functions that let users
instantly add data and effortlessly recover or edit that data.

SimpleDB is best used by customers who have a relatively simple data requirement, like data
storage. For example, a business might use cookies to track visitors that visit its company
website. Some applications might read the cookies to get the visitor’s identifier and look up
the feeds they’re interested in. Amazon SimpleDB gives users the option to store tens of
attributes for a million customers, but not thousands of attributes for a single customer.

Google BigTable:
Google’s BigTable is a distributed, column-oriented data store created by Google Inc. to
handle very large amounts of structured data associated with the company’s Internet search
and Web services operations.

BigTable was designed to support applications requiring massive scalability; from its first
iteration, the technology was intended to be used with petabytes of data. The database was
designed to be deployed on clustered systems and uses a simple data model that Google
has described as “a sparse, distributed, persistent multidimensional sorted map.” Data is
assembled in order by row key, and indexing of the map is arranged according to row,
column keys, and timestamps. Here, compression algorithms help achieve high capacity.

MongoDB:
MongoDB is a cross-platform, document-oriented database. Classified as a NoSQL database,
MongoDB shuns the traditional table-based relational database structure in favor of JSON-
like documents with dynamic schemas (MongoDB calls the format BSON), making the
integration of data in certain types of applications easier and faster.

MongoDB is developed by MongoDB Inc. and is published as free and open-source software
under a combination of the GNU Affero General Public License and the Apache License. As of
July 2015, MongoDB is the fourth most popular type of database management system, and
the most popular for document stores.
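
The document model described above is easy to see in code. The minimal sketch below uses the pymongo driver against a local MongoDB instance; the database, collection, and field names are made up for the example.

from pymongo import MongoClient

# Connect to a local MongoDB server (the connection string is a placeholder).
client = MongoClient("mongodb://localhost:27017/")
db = client["demo"]

# Documents are schemaless, JSON-like structures; nested fields and arrays are fine.
db.events.insert_one({"user": "alice", "action": "login", "tags": ["web", "mobile"]})

# Query by field value; MongoDB returns the matching document as a Python dict.
print(db.events.find_one({"user": "alice"}))

client.close()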

HBase:
Apache HBase (Hadoop DataBase) is an open-source NoSQL database that runs on the top
of the database and provides real-time read/write access to those large data sets.

HBase scales linearly to handle huge data sets with billions of rows and millions of columns,
and it easily combines data sources that use a wide variety of different structures and
schema. HBase is natively integrated with Hadoop and works seamlessly alongside other
data access engines through YARN.
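
As a rough illustration of reading and writing HBase from an application, the sketch below goes through the happybase library, which requires the HBase Thrift server to be running; the table, column family, and row key names are placeholders.

import happybase

# Connect via the HBase Thrift gateway (the host is a placeholder).
connection = happybase.Connection("localhost")

# Assume a table 'metrics' with a column family 'cf' has already been created,
# e.g. from the HBase shell: create 'metrics', 'cf'
table = connection.table("metrics")

# Write a cell and read the row back; keys and values are raw bytes in HBase.
table.put(b"row-2024-01-01", {b"cf:visits": b"42"})
print(table.row(b"row-2024-01-01"))

connection.close()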

Neo4j:
Neo4j is a graph database management system developed by Neo Technology, Inc. Neo4j is
described by its developers as an ACID-compliant transactional database with native graph
storage and processing. According to db-engines.com, Neo4j is the most popular graph
database.

CouchDB:
CouchDB is a database that completely embraces the web. It stores your data with JSON
documents. It accesses your documents and queries your indexes with your web browser, via
HTTP. It indexes, combines, and transforms your documents with JavaScript.

CouchDB works well with modern web and mobile apps. You can even serve web apps
directly out of CouchDB. You can distribute your data, or your apps, efficiently using
CouchDB’s incremental replication. CouchDB supports master-master setups with automatic
conflict detection.
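
Because CouchDB speaks plain HTTP, you can drive it with nothing more than an HTTP client. The minimal sketch below uses the Python requests library; the server URL, credentials, and database name are placeholders for a local installation.

import requests

# Placeholder URL and credentials for a local CouchDB server.
BASE = "http://admin:password@127.0.0.1:5984"

# Create a database (an HTTP PUT on the database name).
requests.put(f"{BASE}/demo")

# Store a JSON document (an HTTP POST to the database).
resp = requests.post(f"{BASE}/demo", json={"type": "event", "user": "alice"})
doc_id = resp.json()["id"]

# Fetch the document back over HTTP GET.
print(requests.get(f"{BASE}/demo/{doc_id}").json())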

Key Takeaways

Data has intrinsic value. But it’s of no use until that value is discovered. Equally important:
How truthful is your data and how much can you rely on it?

Today, big data has become capital. Think of some of the world’s biggest tech companies. A
large part of the value they offer comes from their data, which they’re constantly analyzing
to produce more efficiency and develop new products.

Recent technological breakthroughs have exponentially reduced the cost of data storage and
compute, making it easier and less expensive to store more data than ever before. With an
increased volume of big data now cheaper and more accessible, you can make more
accurate and precise business decisions.

Finding value in big data isn’t only about analyzing it (which is a whole other benefit). It’s an
entire discovery process that requires insightful analysts, business users, and executives who
ask the right questions, recognize patterns, make informed assumptions, and predict
behavior.

The importance of big data doesn't revolve around how much data you have, but what you do
with it. You can take data from any source and analyze it to find answers that enable 1) cost
reductions, 2) time reductions, 3) new product development and optimized offerings, and 4)
smart decision making. When you combine big data with high-powered analytics, you can
accomplish a wide range of business-related tasks.
