Big Data


HADOOP MAPREDUCE HIVE IMPALA PIG SQOOP FLUME HBASE CASSANDRA SPARK KAFKA STORM ZOOKEEPER OOZIE INPUT FORMAT STREAMING
Hadoop CLI Commands
1) hadoop fs -: Every Hadoop CLI command starts with this prefix.
Syntax:- root@localhost# hadoop fs -<command>
2) hadoop fs -ls /: To see all the directories and files in the root of the Hadoop file system.
Syntax:- root@localhost# hadoop fs -ls /
3) hadoop fs -ls /user: To see all the directories and files under /user in the Hadoop file system.
Syntax:- root@localhost# hadoop fs -ls /user
4) hadoop fs -ls /user/root: To see the list of directories and files of the /user/root directory in the Hadoop file system.
Syntax:- root@localhost# hadoop fs -ls /user/root
5) hadoop fs -mkdir: To create a new directory in the Hadoop file system.
A) Syntax:- root@localhost# hadoop fs -mkdir test
The directory test will be created in the default directory of the Hadoop file system, i.e. /user/root.
To check, the command is
root@localhost# hadoop fs -ls /user/root
Output will be:
drwxr-xr-x - root supergroup 0 2011-07-29 12:25 /user/root/test
B) Syntax:- root@localhost# hadoop fs -mkdir /user/root/test1
The new directory will be created in the specified path.
6) hadoop fs -ls /user/root | grep <name>: Used to return only the lines which contain the name specified after grep.
Example:- root@localhost# hadoop fs -ls /user/root | grep test
Output: drwxr-xr-x - root supergroup 2011-07-09 12:25 /user/root/test
drwxr-xr-x - root supergroup 2011-07-09 12:32 /user/root/test1
7) hadoop fs -put: To copy a file from the local file system to the Hadoop file system.
Syntax:- root@localhost# hadoop fs -put <source path> <destination path>
Example:- root@localhost# hadoop fs -put input.txt /user/test
8) hadoop fs -cat: To display the contents of a file.
Syntax:- root@localhost# hadoop fs -cat <file name>
9) hadoop fs -lsr: To see the recursive list of files and directories.
Syntax:- root@localhost# hadoop fs -lsr /user
10) hadoop fs -du: To see the size of the files.
Syntax:- root@localhost# hadoop fs -du <file name>
11) hadoop fs -chmod 777: To give full permissions to a file.
Syntax:- root@localhost# hadoop fs -chmod 777 <file name>
12) hadoop fs -copyFromLocal: To copy a file from the local file system to HDFS.
Syntax:- root@localhost# hadoop fs -copyFromLocal input.txt /user/root/test
13) hadoop fs -get: To get a file from HDFS to the local file system.
Syntax:- root@localhost# hadoop fs -get <source path> <destination path>
14) hadoop fs -copyToLocal: To copy files from HDFS to the local file system.
Syntax:- root@localhost# hadoop fs -copyToLocal <HDFS path> <local path/file name>
15) hadoop fs -moveFromLocal: To move files from the local file system to HDFS (the local copy is deleted after the move).
Ex:- 1. # hadoop fs -moveFromLocal test/file1.txt hdfstest
2. # hadoop fs -moveFromLocal test/* hdfstest
16) hadoop fs -moveToLocal: Not supported, because there are many replicas of a file present on HDFS, so moving is not possible.
17) hadoop fs -rm: To remove files from a directory, i.e. the same as in the local file system.
Ex: a. # hadoop fs -rm hdfstest/*
Removes all files from the directory.
b. # hadoop fs -rm hdfstest/file1.txt
Removes the file file1.txt from the directory.
18) hadoop fs -rmr: To remove a directory recursively.
Ex: # hadoop fs -rmr hdfstest
(hdfstest is the directory name)
19) hadoop fs -mv: Used to move files from one directory to another directory in HDFS, i.e. the files are no longer available in the source directory.
20) hadoop fs -cp: To copy files from one directory to another directory.
Ex: # hadoop fs -cp hdfstest1/* hdfstest
(copies all files)
21) hadoop fs -touchz: It will create an empty (zero-length) dummy file.
Ex: # hadoop fs -touchz hdfstest/file1.txt
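A minimal end-to-end sketch that chains several of the commands above (the file name input.txt and the directory /user/root/test are illustrative, not part of the original list):
root@localhost# hadoop fs -mkdir /user/root/test (create a working directory in HDFS)
root@localhost# hadoop fs -put input.txt /user/root/test (copy a local file into it)
root@localhost# hadoop fs -ls /user/root/test (confirm the file is there)
root@localhost# hadoop fs -cat /user/root/test/input.txt (print its contents)
root@localhost# hadoop fs -get /user/root/test/input.txt output.txt (copy it back to the local disk)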
What is Big Data?
Big Data is an extremely large volume of data. More precisely, it is a collection of large datasets that cannot be processed using traditional computing techniques.
What is considered huge – 10 GB, 100 GB or 100 TB?
Well, there is no hard number that defines huge or Big Data. There are two reasons why there is no straightforward answer.
First, what is considered big or huge today in terms of data size or volume need not be considered big a year from now. It is very much a moving target. Second, it is all relative. What you and I consider to be big may not be the case for companies like Google and Facebook. For these two reasons, it is very hard to put a number on what defines Big Data volume.

What do the V’s of Big Data denote?


Volume: Scale of data
This is the obvious factor. Say that after 3 months our startup has 100,000 active users and our volume is 1 TB. If we keep growing at the same rate, at the end of the year we will have 400,000 users and our volume will be 4 TB, and at the end of year 2, with the same growth rate, we will have 8 TB. What if we doubled or tripled our user base every 3 months? So the bottom line is that we should not just look at the volume when we think of Big Data; we should also look at the rate at which our data grows. In other words, we should watch the velocity or speed of our data growth.
Velocity
This is an important factor. Velocity tells you how fast your data is growing. If your data volume stays at 1 TB for a year, all you need is a good database, but if the growth rate is 1 TB every week then you have to think about a scalable solution. Most of the time Volume and Velocity are all you need to decide whether you have a Big Data problem or not.
Variety: Different forms of data
Variety adds one more dimension to your analysis. Data in traditional databases is highly structured, i.e. rows and columns. But take for instance our hypothetical startup mail service: it receives data in various formats – text for mail messages, images and videos as attachments. When data comes into your system in different formats and you have to process or analyze it in those different formats, traditional database systems are sure to fail, and when this is combined with high volume and velocity you for sure have a Big Data problem.
Veracity refers to the messiness or trustworthiness of the data. With many forms of big data, quality and accuracy are less controllable (just think of Twitter posts with hashtags, abbreviations, typos and colloquial speech, as well as the reliability and accuracy of content), but big data and analytics technology now allows us to work with these types of data. The volumes often make up for the lack of quality or accuracy.
Value: Then there is another V to take into account when looking at Big Data: Value! It is all well and good having
access to big data but unless we can turn it into value it is useless. So you can safely argue that 'value' is the most
important V of Big Data. It is important that businesses make a business case for any attempt to collect and leverage
big data. It is so easy to fall into the buzz trap and embark on big data initiatives without a clear understanding of
costs and benefits.
What are types of Data in bigdata?
Structured Data
Structured data is data that is easily identifiable because it is organized in a structure. The most common form of structured data is a database where specific information is stored in tables, that is, rows and columns. Such data has relational keys and can easily be mapped into pre-designed fields. Today, structured data is the most processed in development and the simplest way to manage information.
Traditional data: RDBMS, schemas, structures.
Semi-structured data
Semi-structured data is information that doesn’t reside in a relational database but that does have some organizational properties that make it easier to analyze. With some processing you can store it in a relational database (this can be very hard for some kinds of semi-structured data), but the semi-structure exists to save space and to aid clarity or computation.
Examples of semi-structured data: CSV, as well as XML and JSON documents; NoSQL databases are also considered semi-structured.
Unstructured data
Unstructured data refers to any data that cannot be identified easily. It could be in the form of images, videos, documents, email, logs and random text. It is not in the form of rows and columns.
Unstructured data represents around 80% of data. Examples include e-mail messages, word processing documents, videos, photos, audio files, presentations, webpages and many other kinds of business documents. Note that while these sorts of files may have an internal structure, they are still considered “unstructured” because the data they contain doesn’t fit neatly in a database.

Here are some examples of machine-generated unstructured data:


 Satellite images: This includes weather data or the data that the government captures in its satellite
surveillance imagery. Just think about Google Earth, and you get the picture.
 Scientific data: This includes seismic imagery, atmospheric data, and high energy physics.
 Photographs and video: This includes security, surveillance, and traffic video.
 Radar or sonar data: This includes vehicular, meteorological, and oceanographic seismic profiles.
Can you give some examples of Big Data?

 Let’s talk about science first. The Large Hadron Collider at CERN produces about 1 petabyte of data every second, mostly sensor data. The volume is so huge that they don’t even retain or store all the data they produce.
 NASA gathers about 1.73 gigabytes of data every hour about weather, geolocation data, etc.
 Let’s talk about the government. The NSA is known for its controversial data collection programs, and guess how much the NSA’s data center in Utah can house in terms of volume? A yottabyte of data, that is, 1 trillion terabytes of data. Pretty massive, isn’t it?
 In March of 2012 the Obama administration announced it would spend $200 million on Big Data initiatives. Even though we cannot technically classify the next one under government, it’s an interesting use case so we included it anyway: Obama’s 2nd-term election campaign used big data analytics, which gave it a competitive edge to win the election.
 Next let’s look at the private sector. With the advent of social media like Facebook, Twitter, LinkedIn etc., there is no scarcity of data. eBay is known to have a 30 PB cluster and Facebook 30 PB.
 Let’s say you shop at amazon.com. Amazon is not only capturing data when you click checkout; every click on their website is tracked to bring a personalized shopping experience. When Amazon shows you recommendations, big data analytics is at work behind the scenes.
What is a commodity hardware? Does commodity hardware include RAM?
Commodity hardware is an inexpensive system which is not of high quality or high availability. Hadoop can be installed on any average commodity hardware. We don’t need supercomputers or high-end hardware to work with Hadoop. Yes, commodity hardware includes RAM, because there will be some services running in RAM.
What are different types of filesystem?
A filesystem is used to control how data is stored and retrieved. There are different filesystems, each with a different structure and logic, and with different properties of speed, flexibility, security, size and more. Disk filesystems are filesystems put on hard drives and memory cards; such filesystems are designed for this type of hardware. Common examples include NTFS, ext3, HFS+, UFS, XFS, and HDFS. Flash drives commonly use disk filesystems like FAT32.
What is difference between GFS and HDFS?
The Google File System (GFS) is a distributed file system developed by Google and specially designed to provide efficient, reliable access to data using large clusters of commodity servers. Files are divided into chunks of 64 megabytes, and are usually appended to or read, and only extremely rarely overwritten or shrunk. Compared with traditional file systems, GFS is designed and optimized to run in data centers to provide extremely high data throughput and low latency, and to survive individual server failures. Inspired by GFS, the open-source Hadoop Distributed File System (HDFS) stores large files across multiple machines. It achieves reliability by replicating the data across multiple servers. Similarly to GFS, data is stored on multiple geo-diverse nodes. The file system is built from a cluster of data nodes, each of which serves blocks of data over the network using a block protocol specific to HDFS. In order to perform certain operations in GFS and HDFS, a programming model is required. GFS has its own programming model called Mapreduce, developed by Google Inc. Apache adopted the ideas of Google Mapreduce and developed Hadoop Mapreduce.
How are codecs useful to Hadoop?
A codec is an implementation of a compression-decompression algorithm. In Hadoop, a codec is represented by an implementation of the CompressionCodec interface.
Can large files be compressed in mapreduce?
For large files, you should not use a compression format that does not support splitting of the whole file, because you lose locality and make mapreduce applications very inefficient.
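To illustrate the point, here is a hedged shell sketch (the file and directory names are only examples): a single large gzip file is not splittable and ends up in one map task, while the same data compressed with bzip2 is splittable and can be processed by many mappers in parallel.
# gzip is NOT splittable: one large logs.txt.gz is processed by a single mapper
gzip -k logs.txt
hadoop fs -put logs.txt.gz /data/
# bzip2 IS splittable: the same data can be split across many map tasks
bzip2 -k logs.txt
hadoop fs -put logs.txt.bz2 /data/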
Big Data Technologies
Big data technologies are important in providing more accurate analysis, which may lead to more concrete
decision-making resulting in greater operational efficiencies, cost reductions, and reduced risks for the business.
While looking into the technologies that handle big data, we examine the following two classes of technology:
Operational Big Data
This includes systems like MongoDB that provide operational capabilities for real-time, interactive workloads where
data is primarily captured and stored.
Nosql Big Data systems are designed to take advantage of new cloud computing architectures that have emerged
over the past decade to allow massive computations to be run inexpensively and efficiently. This makes operational
big data workloads much easier to manage, cheaper, and faster to implement.
Some nosql systems can provide insights into patterns and trends based on real-time data with minimal coding and
without the need for data scientists and additional infrastructure.
Analytical Big Data
This includes systems like Massively Parallel Processing (MPP) database systems and mapreduce that provide
analytical capabilities for retrospective and complex analysis that may touch most or all of the data.
Mapreduce provides a new method of analyzing data that is complementary to the capabilities provided by SQL, and a system based on mapreduce can be scaled up from single servers to thousands of high- and low-end machines.
These two classes of technology are complementary and frequently deployed together.
Operational vs. Analytical Systems
                    Operational            Analytical
Latency             1 ms - 100 ms          1 min - 100 min
Concurrency         1,000 - 100,000        1 - 10
Access Pattern      Writes and Reads       Reads
Queries             Selective              Unselective
Data Scope          Operational            Retrospective
End User            Customer               Data Scientist
Technology          Nosql                  Mapreduce, MPP Database

Big Data Challenges


The major challenges associated with big data are as follows:

 Capturing data
 Curation
 Storage
 Searching
 Sharing
 Transfer
 Analysis
 Presentation
To address the above challenges, organizations normally take the help of enterprise servers.
What are the problems that come with Big Data?
 Big data comes with big problems. Let’s talk about a few problems you may run into when you deal with
Big Data.
 Since the datasets are huge, you need to find a way to store them as efficiently as possible. I am not just
talking about efficiency in terms of storage space but also efficiency in storing the dataset in a way that is
suitable for computation.
 Another problem: when you deal with a big dataset you should worry about data loss due to corruption in the
data or due to hardware failure, and you need to have proper recovery strategies in place.
 The main purpose of storing data is to analyze it, and how much time it takes to analyze and provide
a solution to a problem using your big data is a million dollar question. What good is storing the data
when you cannot analyze or process the data in reasonable time? With big datasets, computation with
reasonable execution times is a challenge.
 Finally, cost. You are going to need a lot of storage space, so the storage solution that you plan to use
should be cost effective.

That is exactly what Hadoop offers!!!

Apache Hadoop is an open-source software framework for distributed storage and distributed processing of very
large data sets on computer clusters built from commodity hardware.
Hadoop can handle huge volumes of data, it can store data efficiently both in terms of storage and computation, it has
a good recovery solution for data loss and, above all, it can scale horizontally, so as your data gets bigger you add more
nodes and Hadoop takes care of the rest. That simple.
Above all, Hadoop is cost effective – meaning we don’t need any specialized hardware to run Hadoop – and hence it is
great even for startups.
Hadoop vs. Traditional Solutions

RDBMS
What about Hadoop vs. RDBMS? Is Hadoop a replacement for Database?
The straight answer is – no. There are things Hadoop is good at and there are things that a database is good at.
RDBMS works exceptionally well with volumes in the low terabytes, whereas with Hadoop the volume we speak of is in
terms of petabytes.
 Hadoop can work with a dynamic schema and can support files in many different formats, whereas a
database schema is very strict, not so flexible, and cannot handle multiple formats.
 Database solutions can scale vertically, meaning you can add more resources to the existing solution, but
they will not scale horizontally; that is, you cannot bring down the execution time of a query by simply adding
more computers.
 Finally the cost: database solutions can get expensive very quickly when you increase the volume of data you
are trying to process, whereas Hadoop offers a cost effective solution.
 In RDBMS, data needs to be pre-processed before being stored, whereas Hadoop requires no pre-processing.
 RDBMS is generally used for OLTP processing whereas Hadoop is used for analytical requirements on
huge volumes of data.
 A database cluster in RDBMS uses the same data files in shared storage, whereas in Hadoop the storage is
independent on each processing node.
 Traditional RDBMS is used for transactional systems to report and archive the data, whereas Hadoop is an
approach to store huge amounts of data in a distributed file system and process it.
 RDBMS will be useful when you want to seek one record from Big data, whereas Hadoop will be useful
when you want Big data in one shot and perform analysis on that later.

Hadoop is a batch processing system. It is not as interactive as a database. You cannot expect millisecond response
times with Hadoop as you would expect in a database. With Hadoop you write the file or dataset once and operate or
analyze the data multiple times whereas with the database you can read and write multiple times.
What is Hadoop?
Apache Hadoop is an open-source software framework for distributed storage and distributed processing of very
large data sets on computer clusters built from commodity hardware.
Hadoop can handle huge volumes of data, it can store data efficiently both in terms of storage and computation, it has
a good recovery solution for data loss and, above all, it can scale horizontally, so as your data gets bigger you add more
nodes and Hadoop takes care of the rest.
Why do we need Hadoop?
Every day a large amount of unstructured data is getting dumped into our machines. The major challenge is not to
store large data sets in our systems but to retrieve and analyze the big data in organizations, and that too for data present
in different machines at different locations. In this situation a necessity for Hadoop arises. Hadoop has the ability to
analyze the data present in different machines at different locations very quickly and in a very cost effective way. It
uses the concept of mapreduce, which enables it to divide the query into small parts and process them in parallel.
This is also known as parallel computing.
Give a brief overview of Hadoop history.
In 2002, Doug Cutting created an open-source web crawler project. In 2004, Google published the mapreduce and GFS
papers. In 2006, Doug Cutting developed the open-source mapreduce and HDFS projects. In 2008, Yahoo ran a 4,000-
node Hadoop cluster and Hadoop won the terabyte sort benchmark. In 2009, Facebook launched SQL support for
Hadoop.
What is a daemon?
A daemon is a process or service that runs in the background. In general, we use this word in the UNIX environment.
The equivalent of a daemon in Windows is a service, and in DOS it is a TSR.
List the various Hadoop daemons and their roles in a Hadoop cluster.
Namenode: It is the Master node, which is responsible for storing the metadata for all the files and directories. It has
information about the blocks that make up a file, and where those blocks are located in the cluster.
Datanode: It is the Slave node that contains the actual data. It reports information about the blocks it contains to the
namenode in a periodic fashion. The datanode manages the storage attached to a node, and a cluster can contain many
such nodes.
Secondary Namenode: It periodically merges the changes recorded in the edit log into the namenode's file system image
so that the edit log doesn’t grow too large in size. It also keeps a copy of the image which can be used in case of failure
of the namenode.
Jobtracker: This is a daemon that runs on the Namenode machine for submitting and tracking mapreduce jobs in Hadoop.
It assigns the tasks to the different tasktrackers.
Tasktracker: This is a daemon that runs on datanodes. Tasktrackers manage the execution of individual tasks on
the slave node. They are responsible for instantiating and monitoring individual Map and Reduce tasks, i.e. the
tasktracker on each datanode performs the actual work.
Resourcemanager (Hadoop 2.x): It is the central authority that manages resources and schedules applications
running on top of YARN.
Nodemanager (Hadoop 2.x): It runs on slave machines, and is responsible for launching the application’s
containers, monitoring their resource usage (CPU, memory, disk, network) and reporting these to the
resourcemanager.
Jobhistoryserver (Hadoop 2.x): It maintains information about mapreduce jobs after the applicationmaster
terminates.
The applicationmaster performs the role of negotiating resources from the Resource Manager and working with the
Node Manager(s) to execute and monitor the tasks. The applicationmaster requests containers for all map tasks and
reduce tasks. Once containers are assigned to tasks, the applicationmaster starts the containers by notifying their Node
Managers. The applicationmaster collects progress information from all tasks, and the aggregated values are propagated
to the client node or user. An applicationmaster is specific to a single application, which is a single job in classic
mapreduce or a cycle of jobs. Once the job execution is completed, the applicationmaster will no longer exist.
What does ‘jps’ command do?
It gives the status of the daemons which run the Hadoop cluster. It gives output mentioning the status of the namenode,
datanode, secondary namenode, jobtracker and tasktracker.
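A quick sketch of how this is typically checked (the exact list of entries depends on which daemons your node runs, and the PIDs will differ per machine):
# jps lists the Java processes running on the local machine
jps
# On a healthy pseudo-distributed 1.x node you would expect entries such as
# NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker, each prefixed by its PID.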
What is metadata?
Metadata is the information about the data stored in datanodes, such as the location of the file, the size of the file and so on.
What is a namenode and what is a datanode?
Datanode is the place where the data actually resides before any processing takes place. Namenode is the master
node that contains file system metadata and has information about - which file maps to which block locations and
which blocks are stored on the datanode.
How to restart Namenode?
Step-1. Run stop-all.sh and then run start-all.sh, OR
Step-2. Write sudo hdfs (press enter), su - hdfs (press enter), /etc/init.d/ha (press enter) and then /etc/init.d/hadoop-
0.20-namenode start (press enter).
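As a hedged alternative on many installations, the namenode daemon alone can be restarted with the hadoop-daemon.sh helper script (its location, bin/ or sbin/, depends on the Hadoop version and HADOOP_HOME):
# Restart only the namenode without touching the datanodes
$HADOOP_HOME/bin/hadoop-daemon.sh stop namenode
$HADOOP_HOME/bin/hadoop-daemon.sh start namenode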
Which are the three modes in which Hadoop can be run?
The three modes in which Hadoop can be run are:
1. Standalone (local) mode
2. Pseudo-distributed mode
3. Fully distributed mode
In stand-alone mode there are no daemons, everything runs on a single JVM. It has no DFS and utilizes the local file
system. Stand-alone mode is suitable only for running mapreduce programs during development. It is one of the
least used environments.
Pseudo mode is used both for development and in the QA environment. In the Pseudo mode all the daemons run on
the same machine.
Fully Distributed mode is used in the production environment, where we have ‘n’ number of machines forming a
Hadoop cluster. Hadoop daemons run on a cluster of machines. There is one host onto which Namenode is running
and another host on which datanode is running and then there are machines on which tasktracker/nodemanager is
running. We have separate masters and separate slaves in this distribution.
What does /etc /init.d do?
/etc /init.d specifies where daemons (services) are placed or to see the status of these daemons. It is very LINUX
specific, and nothing to do with Hadoop.
What are the port numbers of Namenode, job tracker and task tracker?
The port number for the Namenode is 50070, for the job tracker it is 50030 and for the task tracker it is 50060.
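These are the default HTTP ports of the daemons' built-in web UIs in Hadoop 1.x, so the cluster status can be checked from a browser (the host names below are placeholders):
http://<namenode-host>:50070 (HDFS / namenode status page)
http://<jobtracker-host>:50030 (mapreduce / job tracker status page)
http://<tasktracker-host>:50060 (task tracker status page)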
What are the Hadoop configuration files at present?
There are 3 configuration files in Hadoop:
1. Core-site.xml
2. Hdfs-site.xml
3. Mapred-site.xml
The Hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml, Masters and Slaves are all available under ‘conf’
directory of Hadoop installation directory.
Core-site.xml and hdfs-site.xml:
The core-site.xml file informs Hadoop daemon where namenode runs in the cluster. It contains the configuration
settings for Hadoop Core such as I/O settings that are common to HDFS and mapreduce.

The hdfs-site.xml file contains the configuration settings for HDFS daemons; the namenode, the Secondary
namenode, and the datanodes. Here, we can configure hdfs-site.xml to specify default block replication and
permission checking on HDFS. The actual number of replications can also be specified when the file is created. The
default is used if replication is not specified at create time.
What is the Hadoop-core configuration?
Hadoop core used to be configured by two xml files:
1. hadoop-default.xml and
2. hadoop-site.xml.
These files are written in xml format. We have certain properties in these xml files, which consist of a name and a
value. But these files do not exist any more.
Which are the three main hdfs-site.xml properties?
The three main hdfs-site.xml properties are:
1. dfs.name.dir, which gives you the location where the metadata will be stored and whether DFS is located on disk
or on a remote machine.
2. dfs.data.dir, which gives you the location where the data is going to be stored.
3. fs.checkpoint.dir, which is for the secondary Namenode.
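The effective value of any of these properties can be read from the command line, as in the sketch below (hdfs getconf is available on Hadoop 2.x; the property names above are the older 1.x names, which newer releases expose under dfs.namenode.* / dfs.datanode.* equivalents):
hdfs getconf -confKey dfs.name.dir (where the namenode keeps its metadata)
hdfs getconf -confKey dfs.data.dir (where the datanodes store block data)
hdfs getconf -confKey fs.checkpoint.dir (where the secondary namenode writes checkpoints)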
What does hadoop-metrics.properties file do?
Hadoop-metrics.properties is used for ‘Reporting‘ purposes. It controls the reporting for Hadoop. The default
status is ‘not to report‘.
Mapred-site.xml:

The mapred-site.xml file contains the configuration settings for the mapreduce daemons: the jobtracker and the
tasktrackers.

Defining mapred-site.xml:

Per-Process Run Time Environment:

Hadoop-env.sh provides the environment for Hadoop to run. JAVA_HOME is set here.
This file offers a way to provide custom parameters for each of the servers. Hadoop-env.sh is sourced by all of the
Hadoop core scripts provided in the ‘conf/’ directory of the installation.
Environment variables
export HADOOP_DATANODE_HEAPSIZE="128"
export HADOOP_TASKTRACKER_HEAPSIZE="512"
What is a spill factor with respect to the RAM?
Spill factor is the size after which your files move to the temp file. Hadoop-temp directory is used for this.
What does the mapred.job.tracker property do?
The mapred.job.tracker property specifies which of your nodes acts as the job tracker (and the port it listens on).
What is Data Locality in Hadoop
Data Locality in Hadoop refers to the proximity of the data with respect to the Mapper tasks working on the data.
Why is Data Locality important?
When a dataset is stored in HDFS, it is divided into blocks and stored across the datanodes in the Hadoop cluster.
When a mapreduce job is executed against the dataset, the individual Mappers will process the blocks (input splits).
When the data is not available for the Mapper in the same node where it is being executed, the data needs to be
copied over the network from the datanode which has the data to the datanode which is executing the Mapper task.
Imagine a mapreduce job with over 100 Mappers, each trying to copy data from another datanode in the cluster at the
same time; this would result in serious network congestion and is not ideal. So it is always effective and cheap to move
the computation closer to the data than to move the data closer to the computation.
How is data proximity defined?
When a jobtracker (mrv1) or applicationmaster (mrv2) receives a request to run a job, it looks at which nodes in the
cluster have sufficient resources to execute the Mappers and Reducers for the job. At this point serious consideration
is given to deciding on which nodes the individual Mappers will be executed, based on where the data for the Mapper
is located.

Data Local
When the data is located on the same node as the Mapper working on the data, it is referred to as Data Local. In this
case the proximity of the data is closer to the computation. The jobtracker (mrv1) or applicationmaster
(mrv2) prefers the node which has the data that is needed by the Mapper to execute the Mapper.

Rack Local
Although Data Local is the ideal choice, it is not always possible to execute the Mapper on the same node as the data
due to resource constraints on a busy cluster. In such instances it is preferred to run the Mapper on a different node
but on the same rack as the node which has the data. In this case, the data will be moved between nodes from the
node with the data to the node executing the Mapper within the same rack.

Different Rack
In a busy cluster sometimes Rack Local is also not possible. In that case, a node on a different rack is chosen to
execute the Mapper and the data will be copied from the node which has the data to the node executing the Mapper
between racks. This is the least preferred scenario.

Default Replica Placement Policy of HDFS?

The very first replica of a block will be stored on the same node as the client which is trying to upload the file.
The second replica will be stored on a node in a different rack, i.e. not the rack where the first replica is stored.
The third replica will be stored on a node in the same rack as the second replica but on a different node.

What is a rack?
Rack is a storage area with all the datanodes put together. These datanodes can be physically located at different
places. Rack is a physical collection of datanodes which are stored at a single location. There can be multiple racks
in a single location.
On what basis data will be stored on a rack?
When the client is ready to load a file into the cluster, the content of the file will be divided into blocks. Now the
client consults the Namenode and gets 3 datanodes for every block of the file which indicates where the block
should be stored. While placing the datanodes, the key rule followed is for every block of data, two copies will exist
in one rack, third copy in a different rack. This rule is known as Replica Placement Policy.
Do we need to place 2nd and 3rd data in rack 2 only?
Yes, this is to avoid datanode failure.
What if rack 2 and datanode fails?
If both rack 2 and the datanode present in rack 1 fail, then there is no chance of getting the data back. In order to avoid
such situations, we need to replicate that data more times instead of replicating it only thrice. This can be
done by changing the value of the replication factor, which is set to 3 by default.
How do you define rack awareness in Hadoop?
It is the manner in which the Namenode decides how blocks are placed, based on rack definitions to minimize
network traffic between datanodes within the same rack. Let’s say we consider replication factor 3 (default), the
policy is that for every block of data, two copies will exist in one rack, third copy in a different rack. This rule is
known as the Replica Placement Policy.
Hadoop Architecture
Hadoop framework includes following four modules:
 Hadoop Common: These are Java libraries and utilities required by other Hadoop modules. These libraries
provide filesystem and OS level abstractions and contain the necessary Java files and scripts required to
start Hadoop.
 Hadoop YARN: This is a framework for job scheduling and cluster resource management.
 Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access
to application data.
 Hadoop mapreduce: This is a YARN-based system for parallel processing of large data sets.
Hadoop has 2 core components – HDFS & mapreduce.
HDFS
HDFS stands for Hadoop Distributed File System, and it takes care of all your storage-related complexities: splitting
your dataset into blocks, replicating each block to more than one node, and keeping track of which block is stored on
which node.
Mapreduce
Mapreduce is a programming model, and it takes care of all the computational complexities, like bringing the
intermediate results from every single node together to offer a consolidated output.
What is distcp?
Distcp (distributed copy) is a tool used for large inter/intra-cluster copying. Distcp is very efficient because it uses
mapreduce to copy the files or datasets, which means the copy operation is distributed across multiple nodes in
your cluster, and hence it is very effective as opposed to a hadoop fs -cp operation.
How does it work?
Distcp expands a list of files and directories and distributes the work between multiple Map tasks; each Map task
will copy a partition of the files specified in the source list.
Syntax
When you are trying to copy files between 2 clusters, both HDFS versions should be the same, or the higher version
must be backward compatible.
hadoop distcp hdfs://namenode:port/source hdfs://namenode:port/destination
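A hedged usage sketch (the host names, port and paths below are placeholders, not values from this document):
# Copy a directory from cluster A to cluster B
hadoop distcp hdfs://namenodeA:8020/data/logs hdfs://namenodeB:8020/data/logs
# Distcp also works within a single cluster for large copies
hadoop distcp /data/logs /backup/logs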
How to change default block size in HDFS?
In the older versions of Hadoop the default block size was 64 MB and in the newer versions the default block size is
128 MB.
Why would you want to make the block size of specific dataset from 128 to 256 MB?
A single HDFS block (64 MB or 128 MB or ...) will be written to disk sequentially.
When you write the data sequentially there is a fair chance that the data will be written into contiguous space on disk
which means that data will be written next to each other in a continuous fashion.
When a data is laid out in the disk in continuous fashion it reduces the number of disk seeks during the read
operation resulting in an efficient read. So that is why block size in HDFS is huge when compared to the other file
systems.
Let’s say you have a dataset which is 2 petabytes in size. Having a 64 MB block size for this dataset will result in 31
million+ blocks, which would put stress on the Name Node to manage all those blocks. Having a lot of blocks will
also result in a lot of mappers during mapreduce execution. So in this case you may decide to increase the block size
just for that dataset.
Changing the block size when you upload a file in HDFS is very simple.
--Create directory if it does not exist
hadoop fs -mkdir blksize
--Copy a file to HDFS with the default block size (128 MB)
hadoop fs -copyFromLocal /hirw-starterkit/hdfs/commands/dwp-payments-april10.csv blksize
--Override the default block size with 256 MB
hadoop fs -D dfs.blocksize=268435456 -copyFromLocal /hirw-starterkit/hdfs/commands/dwp-payments-april10.csv blksize/dwp-payments-april10_256MB.csv
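To confirm the override took effect, the block size of the uploaded file can be printed, as in this sketch (the %o format of hadoop fs -stat reports the block size in bytes):
hadoop fs -stat %o blksize/dwp-payments-april10_256MB.csv
--Expected output: 268435456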
HDFS Block Placement Policy
When a file is uploaded in to HDFS it will be divided in to blocks. HDFS will have to decide where to place these
individual blocks in the cluster. HDFS block placement policy dictates a strategy of how and where to place
replica blocks in the cluster.
Inputsplit vs Block
The central idea behind mapreduce is distributed processing and hence the most important thing is to divide the
dataset in to chunks and you have separate process working on the dataset on every chunk of data. The chunks are
called input splits and the process working on the chunks (inputsplits) are called Mappers.
Are inputsplits Same As Blocks?
Inputsplit is not the same as the block.
A block is a hard division of data at the block size. So if the block size in the cluster is 128 MB, each block for the
dataset will be 128 MB except for the last block which could be less than the block size if the file size is not
entirely divisible by the block size. So a block is a hard cut at the block size and blocks can end even before a
logical record ends.
Consider that the block size in your cluster is 128 MB and each logical record in your file is about 100 MB (yes, huge
records).
The first record will fit perfectly in a block, no problem, since the record size of 100 MB is well within the block
size of 128 MB. However, the 2nd record cannot fit entirely in the remaining space, so record number 2 will start in
block 1 and will end in block 2.
If you assign a mapper to block 1, in this case, the Mapper cannot process record 2 because block 1 does not
have the complete record 2. That is exactly the problem inputsplits solve. In this case inputsplit 1 will have both
record 1 and record 2. Inputsplit 2 does not start with record 2 since record 2 is already included in inputsplit 1,
so inputsplit 2 will have only record 3. As you can see, record 3 is divided between blocks 2 and 3, but inputsplit 2
will still have the whole of record 3.
Blocks are physical chunks of data stored on disks, whereas an inputsplit is not a physical chunk of data. It is a Java
class with pointers to start and end locations in blocks. So when a Mapper tries to read the data it clearly knows
where to start reading and where to stop reading. The start location of an inputsplit can be in one block and its end in
another block.
Inputsplits respect logical record boundaries, and that is why they are so important. During mapreduce
execution Hadoop scans through the blocks and creates inputsplits, and each inputsplit is assigned to an individual
mapper for processing.
What Is Replication Factor?
Replication factor dictates how many copies of a block should be kept in your cluster for being fault tolerant. The
replication factor is 3 by default and hence any file you create in HDFS will have a replication factor of 3 and each
block from the file will be copied to 3 different nodes in your cluster.
Change Replication Factor – Why?
Let’s say you have a 1 TB dataset and the default replication factor is 3 in your cluster. Which means each block
from the dataset will be replicated 3 times in the cluster. Let’s say this 1 TB dataset is not that critical for you,
meaning if the dataset is corrupted or lost it would not cause a business impact. In that case you can set the
replication factor on just this dataset to 1 leaving the other files or datasets in HDFS untouched.
Use the -setrep command to change the replication factor for files that already exist in HDFS. -R flag would
recursively change the replication factor on all the files under the specified folder
The replication factor in HDFS can be modified or overwritten in 2 ways-
1) Using the Hadoop FS shell, the replication factor can be changed on a per-file basis using the command below:
$ hadoop fs -setrep -w 2 /my/test_file
2) Using the Hadoop FS shell, the replication factor of all files under a given directory can be modified using the command
below:
$ hadoop fs -setrep -w 5 /my/test_dir
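The change can be verified afterwards with either of the commands in this sketch (the second column of hadoop fs -ls is the replication factor for a file, and fsck reports it per file as well):
hadoop fs -ls /my/test_file
hdfs fsck /my/test_file -files -blocks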
Replication causes data redundancy, then why is it pursued in HDFS?
HDFS works with commodity hardware (systems with average configurations) that has high chances of getting
crashed any time. Thus, to make the entire system highly fault-tolerant, HDFS replicates and stores data in different
places. Any data on HDFS gets stored at at least 3 different locations. So, even if one of them is corrupted and
another is unavailable for some time for any reason, the data can still be accessed from the third one. Hence, there is no
chance of losing the data. This replication factor helps us attain the Hadoop feature called fault tolerance.
What is Fault Tolerance?
Suppose you have a file stored in a system, and due to some technical problem that file gets destroyed. Then there is
no chance of getting the data back present in that file. To avoid such situations, Hadoop has introduced the feature of
fault tolerance in HDFS. In Hadoop, when we store a file, it automatically gets replicated at two other locations also.
So even if one or two of the systems collapse, the file is still available on the third system.
Since the data is replicated thrice in HDFS, does it mean that any calculation done on one node will also be
replicated on the other two?
Since the data is replicated on 3 nodes, when we send the mapreduce programs, calculations will be done only on the
original data. The master node will know exactly which node has that particular data. In case one of the nodes is not
responding, it is assumed to have failed. Only then will the required calculation be done on the second replica.
If we want to copy 10 blocks from one machine to another, but another machine can copy only 8.5 blocks, can
the blocks be broken at the time of replication?
In HDFS, blocks cannot be broken down. Before copying the blocks from one machine to another, the Master node
will figure out the actual amount of space required, how many blocks are being used, how much space is
available, and it will allocate the blocks accordingly.
HDFS
HDFS is the primary distributed storage used by Hadoop applications. A HDFS cluster primarily consists of a Name
Node that manages the file system metadata and Data Nodes that store the actual data. Clients contact Name Node
for file metadata or file modifications and perform actual file I/O directly with the Data Nodes.
The following are some of the salient features that could be of interest to many users.
 The Name Node and Data Nodes have built-in web servers that make it easy to check the current status of the
cluster.
 The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity
hardware.
 HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
 HDFS provides high throughput access to application data and is suitable for applications that have large
data sets.
 New features and improvements are regularly implemented in HDFS. The following is a subset of useful
features in HDFS:
o File permissions and authentication.
o Rack awareness: to take a node’s physical location into account while scheduling tasks and
allocating storage.
o Safemode: an administrative mode for maintenance.
o Fsck: a utility to diagnose health of the file system, to find missing files or blocks.
o Fetchdt: a utility to fetch delegationtoken and store it in a file on the local system.
o Balancer: tool to balance the cluster when the data is unevenly distributed among datanodes.
o Upgrade and rollback: after a software upgrade, it is possible to roll back to HDFS’ state before
the upgrade in case of unexpected problems.
o Secondary namenode: performs periodic checkpoints of the namespace and helps keep the size
of file containing log of HDFS modifications within certain limits at the namenode.
o Checkpoint node: performs periodic checkpoints of the namespace and helps minimize the size
of the log stored at the namenode containing changes to the HDFS. Replaces the role previously
filled by the Secondary namenode, though is not yet battle hardened. The namenode allows
multiple Checkpoint nodes simultaneously, as long as there are no Backup nodes registered with
the system.
o Backup node: An extension to the Checkpoint node. In addition to checkpointing it also receives
a stream of edits from the namenode and maintains its own in-memory copy of the namespace,
which is always in sync with the active namenode namespace state. Only one Backup node may be
registered with the namenode at once.
Secondary namenode
The namenode stores modifications to the file system as a log appended to a native file system file, edits. When a
namenode starts up, it reads HDFS state from an image file, fsimage, and then applies edits from the edits log file. It
then writes new HDFS state to the fsimage and starts normal operation with an empty edits file. Since namenode
merges fsimage and edits files only during start up, the edits log file could get very large over time on a busy cluster.
Another side effect of a larger edits file is that next restart of namenode takes longer.
The secondary namenode merges the fsimage and the edits log files periodically and keeps edits log size within a
limit. It is usually run on a different machine than the primary namenode since its memory requirements are on the
same order as the primary namenode.
The start of the checkpoint process on the secondary namenode is controlled by two configuration parameters.

 Dfs.namenode.checkpoint.period, set to 1 hour by default, specifies the maximum delay between two
consecutive checkpoints, and
 Dfs.namenode.checkpoint.txns, set to 1 million by default, defines the number of uncheckpointed
transactions on the namenode which will force an urgent checkpoint, even if the checkpoint period has not
been reached.

The secondary namenode stores the latest checkpoint in a directory which is structured the same way as the primary
namenode’s directory. So that the check pointed image is always ready to be read by the primary namenode if
necessary.
Import Checkpoint
The latest checkpoint can be imported to the namenode if all other copies of the image and the edits files are lost. In
order to do that one should:
 Create an empty directory specified in the dfs.namenode.name.dir configuration variable;
 Specify the location of the checkpoint directory in the configuration variable dfs.namenode.checkpoint.dir;
 And start the namenode with -importcheckpoint option.
The namenode will upload the checkpoint from the dfs.namenode.checkpoint.dir directory and then save it to the
namenode directory(s) set in dfs.namenode.name.dir. The namenode will fail if a legal image is contained
in dfs.namenode.name.dir. The namenode verifies that the image in dfs.namenode.checkpoint.dir is consistent, but
does not modify it in any way.
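A brief sketch of that procedure as shell commands (the configuration values themselves live in hdfs-site.xml; paths are placeholders):
# dfs.namenode.name.dir must point to an empty directory and
# dfs.namenode.checkpoint.dir to the directory holding the checkpoint, then:
hdfs namenode -importCheckpoint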
Safemode
During start up the namenode loads the file system state from the fsimage and the edits log file. It then waits for
datanodes to report their blocks so that it does not prematurely start replicating the blocks though enough replicas
already exist in the cluster. During this time namenode stays in Safemode. Safemode for the namenode is essentially
a read-only mode for the HDFS cluster, where it does not allow any modifications to file system or blocks. Normally
the namenode leaves Safemode automatically after the datanodes have reported that most file system blocks are
available. If required, HDFS could be placed in Safemode explicitly using bin/hdfs dfsadmin -safemode command.
Namenode front page shows whether Safemode is on or off. A more detailed description and configuration is
maintained as javadoc for setsafemode().
How does one switch off SAFEMODE in HDFS?
You use the command: hadoop dfsadmin -safemode leave
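A short sketch of the related dfsadmin sub-commands (hdfs dfsadmin is the newer entry point; hadoop dfsadmin works on older releases):
hdfs dfsadmin -safemode get (report whether safemode is on or off)
hdfs dfsadmin -safemode enter (put the namenode into safemode manually)
hdfs dfsadmin -safemode leave (force the namenode out of safemode)
hdfs dfsadmin -safemode wait (block until safemode is off, which is useful in scripts)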
Fsck
HDFS supports the fsck command to check for various inconsistencies. It is designed for reporting problems with
various files, for example, missing blocks for a file or under-replicated blocks. Unlike a traditional fsck utility for
native file systems, this command does not correct the errors it detects. Normally namenode automatically corrects
most of the recoverable failures. By default fsck ignores open files but provides an option to select all files during
reporting. The HDFS fsck command is not a Hadoop shell command. It can be run as bin/hdfs fsck. Fsck can be run
on the whole file system or on a subset of files.
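A hedged usage sketch (the paths are examples; the options shown are standard fsck reporting flags):
hdfs fsck / (check the whole file system)
hdfs fsck /user/root -files -blocks -locations (per-file block and location report for a subtree)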
Fetchdt
HDFS supports the fetchdt command to fetch Delegation Token and store it in a file on the local system. This token
can be later used to access secure server (namenode for example) from a non secure client. Utility uses either RPC
or HTTPS (over Kerberos) to get the token, and thus requires kerberos tickets to be present before the run (run kinit
to get the tickets). The HDFS fetchdt command is not a Hadoop shell command. It can be run as bin/hdfs fetchdt
dtfile. After you got the token you can run an HDFS command without having Kerberos tickets, by
pointing HADOOP_TOKEN_FILE_LOCATION environmental variable to the delegation token file. For command
usage, see fetchdt command.
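A brief sketch of the flow just described (the token file name dtfile is only an example):
kinit (obtain Kerberos tickets first)
hdfs fetchdt dtfile (fetch a delegation token and store it in the local file dtfile)
export HADOOP_TOKEN_FILE_LOCATION=dtfile (later HDFS commands authenticate with the token)
hadoop fs -ls / (example command that now runs without Kerberos tickets)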
Recovery Mode
Typically, you will configure multiple metadata storage locations. Then, if one storage location is corrupt, you can
read the metadata from one of the other storage locations.
However, what can you do if the only storage locations available are corrupt? In this case, there is a special
namenode startup mode called Recovery mode that may allow you to recover most of your data.You can start the
namenode in recovery mode like so: namenode -recover. When in recovery mode, the namenode will interactively
prompt you at the command line about possible courses of action you can take to recover your data. If you don’t
want to be prompted, you can give the -force option. This option will force recovery mode to always select the first
choice. Normally, this will be the most reasonable choice. Because Recovery mode can cause you to lose data, you
should always back up your edit log and fsimage before using it.
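A minimal sketch of the commands mentioned above (back up the edit log and fsimage first, since recovery can lose data):
hdfs namenode -recover (interactive recovery; prompts at each decision point)
hdfs namenode -recover -force (non-interactive; always selects the first choice)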
Upgrade and Rollback
When Hadoop is upgraded on an existing cluster, as with any software upgrade, it is possible there are new bugs or
incompatible changes that affect existing applications and were not discovered earlier. In any non-trivial HDFS
installation, it is not an option to lose any data, let alone to restart HDFS from scratch. HDFS allows administrators
to go back to earlier version of Hadoop and rollback the cluster to the state it was in before the upgrade. HDFS
upgrade is described in more detail in Hadoop Upgrade Wiki page. HDFS can have one such backup at a time.
Before upgrading, administrators need to remove existing backup using bin/hadoop dfsadmin -
finalizeupgrade command. The following briefly describes the typical upgrade procedure:
 Before upgrading the Hadoop software, finalize the previous upgrade if there is an existing backup. Dfsadmin
-upgradeprogress status can tell if the cluster needs to be finalized.
 Stop the cluster and distribute new version of Hadoop.
 Run the new version with -upgrade option (bin/start-dfs.sh -upgrade).
 Most of the time, cluster works just fine. Once the new HDFS is considered working well (may be after a
few days of operation), finalize the upgrade. Note that until the cluster is finalized, deleting the files that
existed before the upgrade does not free up real disk space on the datanodes.
 If there is a need to move back to the old version,
o Stop the cluster and distribute earlier version of Hadoop.
o Run the rollback command on the namenode (bin/hdfs namenode -rollback).
o Start the cluster with rollback option. (sbin/start-dfs.sh -rollback).
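A condensed sketch of the commands named in the procedure above (script locations vary between bin/ and sbin/ across Hadoop versions):
hadoop dfsadmin -finalizeUpgrade (remove the previous backup before starting a new upgrade)
start-dfs.sh -upgrade (start the new version with the upgrade option)
hadoop dfsadmin -finalizeUpgrade (finalize once the new version is trusted)
hdfs namenode -rollback and start-dfs.sh -rollback (only if you need to go back to the old version)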
Namenode and datanodes
HDFS has a master/slave architecture. An HDFS cluster consists of a single namenode, a master server that manages
the file system namespace and regulates access to files by clients. In addition, there are a number of datanodes,
usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a
file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and
these blocks are stored in a set of datanodes. The namenode executes file system namespace operations like opening,
closing, and renaming files and directories. It also determines the mapping of blocks to datanodes. The datanodes
are responsible for serving read and write requests from the file system’s clients. The datanodes also perform block
creation, deletion, and replication upon instruction from the namenode.
The namenode and datanode are pieces of software designed to run on commodity machines. These machines
typically run a GNU/Linux operating system (OS).
What is a heartbeat in HDFS?
A heartbeat is a signal indicating that it is alive. A datanode sends heartbeat to Namenode and task tracker will send
its heart beat to job tracker. If the Namenode or job tracker does not receive heart beat then they will decide that
there is some problem in datanode or task tracker is unable to perform the assigned task.
How HDFS Works
An HDFS cluster is comprised of a namenode, which manages the cluster metadata, and datanodes that store the
data. Files and directories are represented on the namenode by inodes. Inodes record attributes like permissions,
modification and access times, or namespace and disk space quotas.

The file content is split into large blocks (typically 128 megabytes), and each block of the file is independently
replicated at multiple datanodes. The blocks are stored on the local file system on the datanodes.
The Namenode actively monitors the number of replicas of a block. When a replica of a block is lost due to a
datanode failure or disk failure, the namenode creates another replica of the block. The namenode maintains the
namespace tree and the mapping of blocks to datanodes, holding the entire namespace image in RAM.
The namenode does not directly send requests to datanodes. It sends instructions to the datanodes by replying to
heartbeats sent by those datanodes. The instructions include commands to:

 Replicate blocks to other nodes,


 Remove local block replicas,
 Re-register and send an immediate block report, or
 Shut down the node.
--Change replication to 2
hadoop fs -setrep -R -w 2 replication-test/dwp-payments-april10.csv
Note that if you are trying to set the replication factor to a number higher than the number of nodes, the command will
keep trying until it can replicate the desired number of blocks, so it will wait until you add additional nodes to the
cluster.
You can also use the dfs.replication property to specify the replication factor when you upload files into HDFS, but this
will only work on newly created files, not on existing files.
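A sketch of overriding dfs.replication at upload time with the same -D generic-option style used for dfs.blocksize earlier (the file name is illustrative):
hadoop fs -D dfs.replication=2 -copyFromLocal dwp-payments-april10.csv replication-test/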
Namenode and datanode
In this post let’s talk about the 2 important types of nodes and their functions in your Hadoop cluster – the namenode and
the datanode.
Namenode

1. Namenode is the centerpiece of HDFS and also known as the Master


2. Namenode only stores the metadata of HDFS – the directory tree of all files in the file system, and tracks
the files across the cluster.
3. Namenode knows the list of blocks and their locations for any given file in HDFS. With this information
the namenode knows how to construct the file from blocks.
4. Namenode is so critical to HDFS that when the namenode is down, the HDFS/Hadoop cluster is inaccessible
and considered down.
5. Namenode is a single point of failure in a Hadoop cluster.
6. Namenode is usually configured with a lot of memory (RAM), because the block locations are held in
main memory.
Datanode

1. Datanode is responsible for storing the actual data in HDFS and is also known as the Slave.
2. Namenode and datanode are in constant communication.
3. When a datanode starts up it announces itself to the namenode along with the list of blocks it is responsible
for.
4. When a datanode is down, it does not affect the availability of data or the cluster. The namenode will arrange
replication for the blocks managed by the datanode that is not available.
5. Datanode is usually configured with a lot of hard disk space, because the actual data is stored on the
datanode.
Ways To Change Number of Reducers
Update the driver program and call setNumReduceTasks with the desired value on the job object.
job.setNumReduceTasks(5);
There is also a better way to change the number of reducers, which is by using the mapred.reduce.tasks property.
This is a better option because if you decide to increase or decrease the number of reducers later, you can do so
without changing the mapreduce program.
-D mapred.reduce.tasks=10
Usage
hadoop jar /hirw-starterkit/mapreduce/stocks/maxcloseprice-1.0.jar com.hirw.maxcloseprice.maxcloseprice -D
mapred.reduce.tasks=10 /user/hirw/input/stocks output/mapreduce/stocks
Explain what is Speculative Execution?
When the Hadoop framework detects that a certain task (Mapper or Reducer) is taking longer on average compared to the
other tasks from the same job, it clones the long-running task and runs it on another node. This is called Speculative
Execution. In Hadoop, during Speculative Execution a certain number of duplicate tasks are launched, so multiple
copies of the same map or reduce task can be executed on different slave nodes.
Is Speculative Execution Always Beneficial?
In some cases this is beneficial because in a cluster with hundreds of nodes problems like hardware failure or network congestion are common, and proactively running a parallel duplicate task is better than waiting for the problematic task to complete.
But in some cases it is expected that certain map or reduce tasks will run a little longer than others, so in such instances it is not always advisable to speculatively execute tasks, as it would unnecessarily take up cluster resources.
How Can I Enable/Disable Speculative Execution?
You can enable and disable map-side and reduce-side Speculative Execution using the properties mapreduce.map.speculative and mapreduce.reduce.speculative.
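For example, a driver could turn speculative execution off for both phases with a fragment like the following (a sketch, assuming the usual org.apache.hadoop.conf and mapreduce imports and that this runs before the job is submitted):
Configuration conf = new Configuration();
//Disable speculative execution for map and reduce tasks
conf.setBoolean("mapreduce.map.speculative", false);
conf.setBoolean("mapreduce.reduce.speculative", false);
Job job = Job.getInstance(conf, "no speculation");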
Why is it that in HDFS, Reading is performed in parallel but Writing is not?
Using a mapreduce program, a file can be read in parallel by splitting it into blocks. While writing, however, mapreduce cannot be applied and parallel writing is not possible, because the incoming values are not yet known to the system and so the data cannot be split up front.
What is a block and block scanner in HDFS?
Block - The minimum amount of data that can be read or written is generally referred to as a block in HDFS. The
default size of a block in HDFS is 64MB.
Block Scanner - Block Scanner tracks the list of blocks present on a datanode and verifies them to find any kind of
checksum errors. Block Scanners use a throttling mechanism to reserve disk bandwidth on the datanode.
What is the port number for namenode, Task Tracker and Job Tracker?
Namenode – 50070
Job Tracker – 50030
Task Tracker – 50060
Explain about the indexing process in HDFS.
Indexing process in HDFS depends on the block size. HDFS stores the last part of the data that further points to the
address where the next part of data chunk is stored.
Whenever a client submits a hadoop job, who receives it?
Namenode receives the Hadoop job which then looks for the data requested by the client and provides the block
information. Jobtracker takes care of resource allocation of the hadoop job to ensure timely completion.
Is client the end user in HDFS?
No, Client is an application which runs on your machine, which is used to interact with the Namenode (job tracker)
or datanode (task tracker).
What is the communication channel between client and namenode/datanode?
The client communicates with the namenode and datanodes using RPC over TCP/IP; SSH is only used by the cluster start-up and shut-down scripts, not for client communication.
What is throughput? How does HDFS get a good throughput?
Throughput is the amount of work done in a unit time. It describes how fast the data is getting accessed from the
system and it is usually used to measure performance of the system. In HDFS, all the systems will be executing the
tasks assigned to them independently and in parallel. In this way, the HDFS gives good throughput. By reading data
in parallel, we decrease the actual time to read data tremendously.
What is streaming access?
As HDFS works on the principle of Write Once, Read Many, the feature of streaming access is extremely important
in HDFS. HDFS focuses not so much on storing the data but how to retrieve it at the fastest possible speed,
especially while analyzing logs. In HDFS, reading the complete data is more important than the time taken to fetch a
single record from the data.
Why do we use HDFS for applications having large data sets and not when there are lot of small files?
HDFS is more suitable for storing a large amount of data in a single file than for small amounts of data spread across multiple files. This is because the Namenode is a very expensive, high-performance system, so it is not prudent to fill its memory with the disproportionate amount of metadata that is generated for many small files.
What is Streaming?
Streaming is a feature with Hadoop framework that allows us to do programming using mapreduce in any
programming language which can accept standard input and can produce standard output. It could be Perl, Python,
Ruby and not necessarily be Java. However, customization in mapreduce can only be done using Java and not any
other programming language.
What happens when a datanode fails during the write process?
When a datanode fails during the write process, a new replication pipeline that contains the other datanodes opens
up and the write process resumes from there until the file is closed. Namenode observes that one of the blocks is
under-replicated and creates a new replica asynchronously.
What happens when a datanode fails?
When a datanode fails
Jobtracker and namenode detect the failure
On the failed node, all tasks are re-scheduled
Namenode replicates the user’s data to another node
Mention what are the main configuration parameters that user need to specify to run Mapreduce Job?
The user of Mapreduce framework needs to specify
Job’s input locations in the distributed file system
Job’s output location in the distributed file system
Input format
Output format
Class containing the map function
Class containing the reduce function
JAR file containing the mapper, reducer and driver classes
How does namenode tackle datanode failures?
All datanodes periodically send a notification, a.k.a. a Heartbeat signal, to the namenode, which implies that the datanode is alive. Apart from the Heartbeat, the namenode also receives a Block report from each datanode, which lists all the blocks on that datanode. If the namenode does not receive these, it marks the datanode as a dead node. As soon as a datanode is marked as non-functional or dead, the namenode initiates replication of the blocks it held, copying them from the datanodes that still hold replicas.

MAPREDUCE
What is mapreduce?
Mapreduce is a programming model for processing distributed datasets on clusters of computers.
Mapreduce Features:
 Distributed programming complexity is hidden
 Built in fault-tolerance
 Programming model is language independent
 Parallelization and distribution are automatic
 Enable data local processing
What are ‘maps’ and ‘reduces’?
‘Maps’ and ‘Reduces’ are two phases of solving a query in HDFS. ‘Map’ is responsible for reading data from the input location and, based on the input type, generating a key/value pair, that is, an intermediate output on the local machine. ‘Reducer’ is responsible for processing the intermediate output received from the mapper and generating the final output.
What are the Key/Value Pairs in Mapreduce framework?
Mapreduce framework implements a data model in which data is represented as key/value pairs. Both input and
output data to mapreduce framework should be in key/value pairs only.
What are the constraints to Key and Value classes in Mapreduce?
Any data type used for a Value field in a mapper or reducer must implement the org.apache.hadoop.io.Writable interface so that the field can be serialized and deserialized.
Key fields must additionally be comparable with each other, so they must implement Hadoop’s org.apache.hadoop.io.WritableComparable interface, which in turn extends Hadoop’s Writable and java.lang.Comparable interfaces.
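A minimal sketch of a custom key type satisfying both constraints (the StockKey name and its fields are made up for illustration):
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class StockKey implements WritableComparable<StockKey> {
    private String symbol = "";
    private long volume;

    //Serialization: write the fields to the binary stream
    public void write(DataOutput out) throws IOException {
        out.writeUTF(symbol);
        out.writeLong(volume);
    }

    //Deserialization: read the fields back in the same order
    public void readFields(DataInput in) throws IOException {
        symbol = in.readUTF();
        volume = in.readLong();
    }

    //Ordering used when the framework sorts map output keys
    public int compareTo(StockKey other) {
        return symbol.compareTo(other.symbol);
    }
}
If such a key is used with the default HashPartitioner, it is also worth overriding hashCode() so that equal keys always land in the same partition.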
What will hadoop do when a task is failed in a list of suppose 50 spawned tasks?
It will restart the map or reduce task on another node manager, and only if the task fails more than 4 times will it kill the job. The maximum number of attempts for map tasks and reduce tasks can be configured with the below properties in the mapred-site.xml file.
mapreduce.map.maxattempts
mapreduce.reduce.maxattempts
The default value for both properties is 4.
Consider case scenario: In Mapreduce system, HDFS block size is 256 MB and we have 3 files of size 256 KB,
266 MB and 500 MB then how many input splits will be made by Hadoop framework?
Hadoop will make 5 splits as follows
1 split for 256 KB file
2 splits for 266 MB file (1 split of size 256 MB and another split of size 10 MB)
2 splits for 500 MB file (1 Split of size 256 MB and another of size 244 MB)
How can you set random number of mappers and reducers for a Hadoop job?
The number of mappers is calculated by Hadoop based on the DFS block size. It is possible to set an upper limit for the mappers using the conf.setNumMapTasks(int num) method. However, it is not possible to set it to a value lower than the one calculated by Hadoop.
During command line execution of the jar, use the following options to set the number of mappers and reducers: -D mapred.map.tasks=4 and -D mapred.reduce.tasks=3.
The above command will allocate 4 mappers and 3 reducers for the job.
What happens if the number of reducers is 0?
The output of each mapper will be stored in a separate file on HDFS as the final output.
What are the steps involved in mapreduce framework?
 First, the mapper maps input key/value pairs to a set of intermediate key/value pairs.
 Maps are the individual tasks that transform input records into intermediate records.
 The transformed intermediate records do not need to be of the same type as the input records.
 A given input pair may map to zero or many output pairs.
 The Hadoop mapreduce framework creates one map task for each InputSplit generated by the InputFormat for the job.
 It then calls map(WritableComparable, Writable, Context) for each key/value pair in the InputSplit for that task.
 All intermediate values associated with a given output key are then grouped and passed to the Reducers (a minimal Mapper sketch follows below).
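The sketch below (class and field names are illustrative, not from the source) shows a Mapper that receives one key/value pair per call and emits intermediate pairs of a different type than its input.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    //Called once for every key/value pair in the InputSplit
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); //emit an intermediate (word, 1) pair
            }
        }
    }
}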
Where is the Mapper Output stored?
The mapper output is stored on the Local file system of each individual mapper nodes. The intermediate data is
cleaned up after the Hadoop Job completes.
Detail description of the Reducer and its phases?
Reducer reduces a set of intermediate values which share a key to a smaller set of values. The framework then calls reduce().
Shuffle:
Sorted output (Mapper) → Input (Reducer). The framework fetches the relevant partition of the output of all the mappers.
Sort:
The framework groups Reducer inputs by key. The shuffle and sort phases occur simultaneously; while map outputs are being fetched they are merged.
Reduce:
The reduce(WritableComparable, Iterable<Writable>, Context) method is called for each <key, (list of values)> pair in the grouped inputs.
Secondary Sort:
If the grouping of intermediate keys for reduction needs to differ from the ordering of keys before reduction, a custom comparator can be specified via Job.setSortComparatorClass(Class).
The output of the reduce task is typically written using Context.write(WritableComparable, Writable).
What does jobconf class do?
Mapreduce needs to logically separate different jobs running on the same cluster. The JobConf class helps to do job-level settings, such as declaring a job in a real environment. It is recommended that the job name be descriptive and represent the type of job being executed. JobConf specifies the Mapper, Combiner, Partitioner, Reducer, InputFormat and OutputFormat implementations, and other advanced job facets like Comparators.
What does conf. Setmapper Class do?
conf.setMapperClass() sets the mapper class and everything related to the map job, such as reading the data and generating a key/value pair out of the mapper.
What are the primary phases of a Reducer or Explain about the partitioning, shuffle and sort phase
Shuffle Phase-Once the first map tasks are completed, the nodes continue to perform several other map tasks and
also exchange the intermediate outputs with the reducers as required. This process of moving the intermediate
outputs of map tasks to the reducer as inputs is referred to as Shuffling.
Sort Phase- Hadoop mapreduce automatically sorts the set of intermediate keys on a single node before they are
given as input to the reducer.
Partitioning Phase-The process that determines which intermediate keys and value will be received by each reducer
instance is referred to as partitioning. The destination partition is same for any key irrespective of the mapper
instance that generated it.
What does a mapreduce partitioner do and how the user can control which key will go to which reducer?
A mapreduce partitioner makes sure that all the values of a single key go to the same reducer, thus allowing an even distribution of the map output over the reducers. It redirects the mapper output to the reducer by determining which reducer is responsible for a particular key.
The partition for a key is decided using a hash function; the default partitioner is HashPartitioner.
A custom partitioner is implemented to control which keys go to which Reducer.
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class SamplePartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        //Assumption: partition by the hash of the key, kept non-negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
The getPartition method returns the partition number (a value between 0 and numReduceTasks - 1), where numReduceTasks is the configured number of reducers.
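To actually use such a partitioner, it has to be registered on the job in the driver; a hedged fragment (assuming 'job' is the org.apache.hadoop.mapreduce.Job object):
//Register the custom partitioner and fix the number of reducers
job.setPartitionerClass(SamplePartitioner.class);
job.setNumReduceTasks(5); //getPartition must then return a value between 0 and 4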
How number of partitioners and reducers are related?
The total numbers of partitions are the same as the number of reduce tasks for the job
How can we control particular key should go in a specific reducer?
By using a custom partitioner.
How to write a custom partitioner for a Hadoop mapreduce job?
Steps to write a Custom Partitioner for a Hadoop mapreduce Job-
 A new class must be created that extends the pre-defined Partitioner class.
 The getPartition method of the Partitioner class must be overridden.
 The custom partitioner can be supplied to the job through a config file in the wrapper that runs the Hadoop mapreduce job, or it can be added to the job by using the setPartitionerClass method of the Job class.
In Hadoop, if custom partitioner is not defined then, how is data partitioned before it is sent to the reducer?
In this case default partitioner is used, which does all the work of hashing and partitioning assignment to the reducer.
What is a Combiner?
A ‘Combiner’ is a mini reducer that performs the local reduce task. It receives the input from the mapper on a
particular node and sends the output to the reducer. Combiners help in enhancing the efficiency of mapreduce by
reducing the quantum of data that is required to be sent to the reducers.
Explain the differences between a combiner and reducer.
A combiner can be considered as a mini reducer that performs the local reduce task. It runs on the map output and produces the input to the reducers. It is usually used for network optimization when the map generates a large number of outputs. Unlike a reducer, the combiner has a constraint that its input and output key and value types must match the output types of the Mapper. Combiners can only be used for operations that are commutative and associative. Combiner functions get their input from a single mapper, whereas reducers can get data from multiple mappers as a result of partitioning.
What are combiners and its purpose?
 Combiners are used to increase the efficiency of a mapreduce program. It can be used to aggregate intermediate map
output locally on individual mapper outputs.
 Combiners can help reduce the amount of data that needs to be transferred across to the reducers.
 Reducer code can be reused as a combiner if the operation performed is commutative and associative.
 Hadoop may or may not execute a combiner.
But the question is, is it always a good idea to reuse reducer program for combiner?
Reducer for Combiner – Good use case
Let’s say we are writing a mapreduce program to calculate maximum closing price for each symbol from a stocks
dataset. The mapper program will emit the symbol as the key and closing price as the value for each stock record
from the dataset. The reducer will be called once for each stock symbol and with a list of closing prices. The reducer
will then loop through all the closing prices for the symbol and will calculate the maximum closing price from the
list of closing prices for that symbol.
Assume Mapper 1 processed 3 records for symbol ABC with closing prices – 50, 60 and 111. Let’s also assume that
Mapper 2 processed 2 records for symbol ABC with closing prices – 100 and 31.
Now the reducer will receive five closing prices for symbol ABC – 50, 60, 111, 100 and 31. The job of the reducer is very simple: it will loop through all 5 closing prices and calculate the maximum closing price, which is 111.
We can use the same reducer program for combiner after each Mapper. The combiner on mapper 1 will process 3
closing prices - 50, 60 and 111 and will emit only 111 since it is the maximum closing price of the 3 values which is
111. The combiner on mapper 2 will process 2 closing prices - 100 and 31 and will emit only 100 since it is the
maximum closing price of the 2 values which is 100.
Now with combiner reducer will only process 2 closing prices for symbol ABC which is 111 from Mapper 1 and
100 from Mapper 2 and will calculate the maximum closing price as 111 from both the values.
As we can see, the output is the same with and without the combiner, hence in this case reusing the reducer as a combiner works with no issues.
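A hedged sketch of this reuse (the MaxClosePriceMapper/MaxClosePriceReducer class names are assumptions, not the actual starter-kit code): the same reducer class is registered both as the combiner and as the reducer, which is safe because max() is commutative and associative.
import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxClosePriceReducer extends Reducer<Text, FloatWritable, Text, FloatWritable> {
    protected void reduce(Text symbol, Iterable<FloatWritable> prices, Context context)
            throws IOException, InterruptedException {
        float max = Float.NEGATIVE_INFINITY;
        for (FloatWritable price : prices) {
            max = Math.max(max, price.get()); //the max of partial maxima is still the global max
        }
        context.write(symbol, new FloatWritable(max));
    }
}
//In the driver:
//job.setMapperClass(MaxClosePriceMapper.class);
//job.setCombinerClass(MaxClosePriceReducer.class); //reducer reused as combiner
//job.setReducerClass(MaxClosePriceReducer.class);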
Reducer for Combiner – Bad use case
Let’s say we are writing a mapreduce program to calculate the average volume for each symbol from a stocks
dataset. The mapper program will emit the symbol as the key and volume as the value for each stock record from the
dataset. The reducer will be called once for each stock symbol and with a list of volumes. The reducer will then loop
through all the volumes for the symbol and will calculate the average volume from the list of volumes for that
symbol.
Assume Mapper 1 processed 3 records for symbol ABC with volumes – 50, 60 and 111. Let’s also assume that
Mapper 2 processed 2 records for symbol ABC with volumes – 100 and 31.
Now the reducer will receive five volume values for symbol ABC – 50, 60, 111, 100 and 31. The job of the reducer is very simple: it will loop through all 5 volumes and calculate the average volume to be 70.4.
(50 + 60 + 111 + 100 + 31) / 5 = 352 / 5 = 70.4
Let’s see what happens if we use the same reducer program as combiner after each Mapper. The combiner on
mapper 1 will process 3 volumes - 50, 60 and 111 and will calculate the average of the 3 volumes 73.66
The combiner on mapper 2 will process 2 volumes - 100 and 31 and will calculate the average volume of the 2
values which is 65.5.
Now with the combiner in place, the reducer will only process 2 average volumes for symbol ABC – 73.66 from Mapper 1 and 65.5 from Mapper 2 – and will calculate the average volume of symbol ABC as (73.66 + 65.5) / 2 = 69.58, which is incorrect, as the correct average volume is 70.4.
So as we can see, the Reducer cannot always be reused as a Combiner. Whenever you decide to reuse the reducer as a combiner, ask yourself this question – will my output be the same with and without the use of the combiner?
Scenario 1 – A job with 100 mappers and 1 reducer takes a long time for the reducer to start after all the mappers are complete. One of the reasons could be that the reducer is spending a lot of time copying the map outputs. So in this case we can try a couple of things.
1. If possible add a combiner to reduce the amount of output from the mapper to be sent to the reducer
2. Enable map output compression – this will further reduce the size of the outputs to be transferred to the
reducer.
Scenario 2 - A particular task is using a lot of memory which is causing the slowness or failure, I will look for ways
to reduce the memory usage.
1. Make sure the joins are made in an optimal way with memory usage in mind. For example, in Pig joins, the LEFT side tables are sent to the reducer first and held in memory, and the RIGHT most table is streamed to the reducer. So make sure the RIGHT most table is the largest of the datasets in the join.
2. We can also increase the memory requirements needed by the map and reduce tasks by setting
– mapreduce.map.memory.mb and mapreduce.reduce.memory.mb
Scenario 3 – Understanding the data helps a lot in optimizing the way we use the datasets in PIG and HIVE scripts.
1. If you have smaller tables in join, they can be sent to distributed cache and loaded in memory on the Map
side and the entire join can be done on the Map side thereby avoiding the shuffle and reduce phase
altogether. This will tremendously improve performance. Look up USING REPLICATED in Pig
and MAPJOIN or hive.auto.convert.join in Hive
2. If the data is already sorted you can use USING MERGE which will do a Map Only join
3. If the data is bucketed in Hive, you may use hive.optimize.bucketmapjoin or hive.optimize.bucketmapjoin.sortedmerge, depending on the characteristics of the data.
Scenario 4 – The Shuffle process is the heart of a mapreduce program and it can be tweaked for performance
improvement.
1. If you see lots of records are being spilled to the disk (check for Spilled Records in the counters in your
mapreduce output) you can increase the memory available for Map to perform the Shuffle by increasing the
value in io.sort.mb. This will reduce the amount of Map Outputs written to the disk so the sorting of the
keys can be performed in memory.
2. On the reduce side, the merge operation (merging the output from several mappers) can be kept in memory by setting mapred.inmem.merge.threshold to 0.
What are the four basic parameters of a mapper?
The four basic parameters of a mapper are LongWritable, Text, Text and IntWritable. The first two represent the input key/value types and the second two represent the intermediate output key/value types.
What are the four basic parameters of a reducer?
The four basic parameters of a reducer are Text, IntWritable, Text and IntWritable. The first two represent the intermediate input key/value types and the second two represent the final output key/value types.
What do the master class and the output class do?
Master is defined to update the Master or the job tracker and the output class is defined to write data onto the output
location.
What is side data distribution in Mapreduce framework?
The extra read-only data needed by a mapreduce job to process the main data set is called as side data.
There are two ways to make side data available to all the map or reduce tasks.
Job Configuration
Distributed cache
How to distribute side data using job configuration?
Side data can be distributed by setting arbitrary key-value pairs in the job configuration using the various setter methods on the Configuration object.
In the task, we can retrieve the data from the configuration returned by the Context's getConfiguration() method.
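A small hedged sketch (the property name stocks.exchange.filter is made up, and 'conf' is assumed to be the job's Configuration object): set the value in the driver and read it back in the task.
//In the driver, before submitting the job:
conf.set("stocks.exchange.filter", "NYSE");

//In the mapper or reducer (e.g. in setup()):
String exchange = context.getConfiguration().get("stocks.exchange.filter", "ALL");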
When can we use side data distribution by job configuration, and when is it not advisable?
Side data distribution by job configuration is useful only when we need to pass a small piece of metadata to map/reduce tasks.
We shouldn't use this mechanism for transferring more than a few kilobytes of data, because it puts pressure on memory usage, particularly in a system running hundreds of jobs.
Explain what is distributed Cache in mapreduce Framework?
Distributed Cache is an important feature provided by the mapreduce framework. When you want to share some files across all nodes in a Hadoop cluster, the distributed cache is used. The files could be executable jar files or simple properties files.
To save network bandwidth, files are normally copied to any particular node only once per job.
How to supply files or archives to mapreduce job in distributed cache mechanism?
The files that need to be distributed can be specified as a comma-separated list of URIs as the argument to the -files option of the hadoop job command. Files can be on the local file system or on HDFS.
Archive files (ZIP files, tar files, and gzipped tar files) can also be copied to task nodes by the distributed cache using the -archives option; these are un-archived on the task node.
The -libjars option will add JAR files to the classpath of the mapper and reducer tasks.
Jar command with distributed cache
$ hadoop jar example.jar exampleprogram -files Inputpath/example.txt input/filename /output/
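As a rough illustration of how a task can then read a file shipped with -files, here is a hedged sketch of a mapper's setup() method (assuming the usual java.io imports: BufferedReader, FileReader, IOException). The file name lookup.txt and the lookup-table idea are hypothetical; the sketch relies on the -files option creating a symlink to the localized file in the task's working directory.
//Sketch of a Mapper setup() method reading a side file shipped with -files
protected void setup(Context context) throws IOException {
    try (BufferedReader reader = new BufferedReader(new FileReader("lookup.txt"))) {
        String line;
        while ((line = reader.readLine()) != null) {
            //populate an in-memory lookup structure with each line here
        }
    }
}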
How distributed cache works in Mapreduce Framework?
When a mapreduce job is submitted with distributed cache options, the node managers copy the files specified by the -files, -archives and -libjars options from the distributed cache to a local disk. The files are said to be localized at this point. The local.cache.size property can be configured to set the cache size on the local disk of the node managers. Files are localized under the ${hadoop.tmp.dir}/mapred/local directory on the node manager nodes.
Why can’t we just have the file in HDFS and have the application read it instead of distributed cache?
Distributed cache copies the file to all node managers at the start of the job. Now if the node manager runs 10 or 50
map or reduce tasks, it will use the same file copy from distributed cache.
On the other hand, if a file needs to read from HDFS in the job then every map or reduce task will access it from
HDFS and hence if a node manager runs 100 map tasks then it will read this file 100 times from HDFS. Accessing
the same file from node manager’s Local FS is much faster than from HDFS data nodes.
What mechanism does Hadoop framework provides to synchronize changes made in Distribution Cache
during run time of the application?
Distributed cache mechanism provides service for copying just read-only data needed by a mapreduce job but not
the files which can be updated. So, there is no mechanism to synchronize the changes made in distributed cache as
changes are not allowed to distributed cached files.
Compare RDBMS with Hadoop mapreduce.
Feature | RDBMS | Mapreduce
Size of Data | Traditional RDBMS can handle up to gigabytes of data. | Hadoop mapreduce can handle petabytes of data or more.
Updates | Read and write multiple times. | Read many times but write once model.
Schema | Static schema that needs to be pre-defined. | Has a dynamic schema.
Processing Model | Supports both batch and interactive processing. | Supports only batch processing.
Scalability | Non-linear | Linear

When is it not recommended to use mapreduce paradigm for large scale data processing?
It is not suggested to use mapreduce for iterative processing use cases, as it is not cost effective, instead Apache Pig
can be used for the same.
Where is Mapper output stored?
The intermediate key value data of the mapper output will be stored on local file system of the mapper nodes. This
directory location is set in the config file by the Hadoop Admin. Once the Hadoop job completes execution, the intermediate data is cleaned up.
What is the relationship between Job and Task in Hadoop?
A single job can be broken down into one or many tasks in Hadoop.
Jobtracker and tasktracker
Jobtracker and tasktracker are 2 essential processes involved in mapreduce execution in mrv1 (or Hadoop version 1). Both processes are now deprecated in mrv2 (or Hadoop version 2) and replaced by the Resource Manager, applicationmaster and Node Manager daemons.
Job Tracker -
1. Jobtracker process runs on a separate node and not usually on a datanode.
2. Jobtracker is an essential Daemon for mapreduce execution in mrv1. It is replaced by
resourcemanager/applicationmaster in mrv2.
3. Jobtracker receives the requests for mapreduce execution from the client.
4. Jobtracker talks to the namenode to determine the location of the data.
5. Jobtracker finds the best tasktracker nodes to execute tasks based on the data locality (proximity of the
data) and the available slots to execute a task on a given node.
6. Jobtracker monitors the individual tasktrackers and submits the overall status of the job back to the client.
7. When the jobtracker is down, HDFS will still be functional but the mapreduce execution cannot be started
and the existing mapreduce jobs will be halted.
Tasktracker -
1. Tasktracker runs on datanode. Mostly on all datanodes.
2. Tasktracker is replaced by Node Manager in mrv2.
3. Mapper and Reducer tasks are executed on datanodes administered by tasktrackers.
4. Tasktrackers will be assigned Mapper and Reducer tasks to execute by jobtracker.
5. Tasktracker will be in constant communication with the jobtracker signalling the progress of the task in
execution.
6. Tasktracker failure is not considered fatal. When a tasktracker becomes unresponsive, jobtracker will
assign the task executed by the tasktracker to another node.
Explain how jobtracker schedules a task?
The task tracker sends out heartbeat messages to the jobtracker, usually every few seconds, to make sure that the jobtracker is active and functioning. The message also informs the jobtracker about the number of available slots, so the jobtracker can stay up to date with where in the cluster work can be delegated.
What is a taskinstance?
The actual Hadoop mapreduce jobs that run on each slave node are referred to as Task instances. Every task instance
has its own JVM process. For every new task instance, a JVM process is spawned by default for a task.
Is it possible to rename the output file?
Yes, this can be done by implementing the multiple format output class.
When should you use a reducer?
It is possible to process the data without a reducer but when there is a need to combine the output from multiple
mappers – reducers are used. Reducers are generally used when shuffle and sort are required.
What is identity Mapper and identity reducer?
Identity Mapper is a default Mapper class provided by Hadoop. When no mapper is specified in a mapreduce job, this mapper is executed. It doesn't process, manipulate or perform any computation on the input data; it simply writes the input data to the output. Its class name is org.apache.hadoop.mapred.lib.IdentityMapper.
Identity Reducer passes the input key/value pairs on to the output directory. Its class name is org.apache.hadoop.mapred.lib.IdentityReducer. When no reducer class is specified in a mapreduce job, this class is picked up by the job automatically.
What do you understand by the term Straggler?
A map or reduce task that takes long time to finish is referred to as straggler.
What do you understand by chain Mapper and chain Reducer?
The ChainMapper class is a special implementation of Mapper through which a set of mapper classes can be run in a chained fashion within a single map task. In this chained execution pattern, the first mapper's output becomes the input of the second mapper, the second mapper's output goes to the third mapper, and so on until the last mapper.
Its class name is org.apache.hadoop.mapreduce.lib.chain.ChainMapper.
ChainReducer is similar to ChainMapper: it allows a single reducer, followed by a chain of mappers, to be run within a single reduce task. Unlike ChainMapper, a chain of reducers is not executed; one reducer runs and its output is then passed through a chain of mappers.
Its class name is org.apache.hadoop.mapreduce.lib.chain.ChainReducer.
How can we mention multiple mappers and reducer classes in Chain Mapper or Chain Reducer classes?
In ChainMapper,
the ChainMapper.addMapper() method is used to add mapper classes.
In ChainReducer,
the ChainReducer.setReducer() method is used to specify the single reducer class (and ChainReducer.addMapper() can chain mappers after it).
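As a rough driver fragment (the mapper/reducer class names are hypothetical, and this assumes the new-API chain classes in org.apache.hadoop.mapreduce.lib.chain and a 'job' object that is already configured):
//Two mappers chained inside the map task, then a single reducer
ChainMapper.addMapper(job, TokenizerMapper.class,
        LongWritable.class, Text.class, Text.class, IntWritable.class, new Configuration(false));
ChainMapper.addMapper(job, FilterMapper.class,
        Text.class, IntWritable.class, Text.class, IntWritable.class, new Configuration(false));
ChainReducer.setReducer(job, SumReducer.class,
        Text.class, IntWritable.class, Text.class, IntWritable.class, new Configuration(false));
//Further mappers can be chained after the reducer with ChainReducer.addMapper(...)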
What are the core changes in Hadoop 2.0?
Hadoop 2.x provides an upgrade to Hadoop 1.x in terms of resource management, scheduling and the manner in
which execution occurs. In Hadoop 2.x the cluster resource management capabilities work in isolation from the
mapreduce specific programming logic. This helps Hadoop to share resources dynamically between multiple parallel
processing frameworks. Hadoop 2.x allows workable and fine grained resource configuration leading to efficient
and better cluster utilization so that the application can scale to process larger number of jobs.
List the difference between Hadoop 1 and Hadoop 2.
In Hadoop 1.x, Namenode is the single point of failure. In Hadoop 2.x, we have Active and Passive Namenodes. If
the active Namenode fails, the passive Namenode takes charge. High availability is achieved in Hadoop 2.x.
Also, in Hadoop 2.x, YARN provides a central resource manager. With YARN, you can now run multiple
applications in Hadoop, all sharing a common resource. MR2 is a distributed application that runs the mapreduce
framework on top of YARN. Other tools can also perform data processing via YARN, which was a problem in
Hadoop 1.x.
 Hadoop 2.x processing is taken care of by other processing models and YARN is responsible for cluster management, whereas in Hadoop 1.x, mapreduce is responsible for both processing and cluster management.
 Hadoop 2.x scales better when compared to Hadoop 1.x with close to 10000 nodes per cluster.
 Hadoop 2.x works on containers and can also run generic tasks whereas Hadoop 1.x works on the concept of slots
What are active and passive Namenodes?
In Hadoop-2.x, we have two Namenodes – Active Namenode and Passive Namenode. Active Namenode is the
Namenode which runs in the cluster. Passive Namenode is a standby Namenode, which has similar data as active
Namenode. When the active Namenode fails, the passive Namenode replaces the active Namenode in the cluster.
Hence, the cluster is never without a Namenode and so it never fails.
What comes in Hadoop 2.0 and mapreduce V2 (YARN)?
Namenode: High Availability (HA) and Federation.
Jobtracker: split into cluster resource management and per-application management (YARN).
What is Apache Hadoop YARN?
YARN is a large scale distributed system for running big data applications which is part of Hadoop 2.0.
YARN stands for Yet Another Resource Negotiator which is also called as Next generation Mapreduce or
Mapreduce 2 or mrv2.
It was introduced in the Hadoop 0.23 release to overcome the scalability shortcomings of the classic mapreduce framework by splitting the functionality of the jobtracker into a Resource Manager and a Scheduler.
Is YARN a replacement of Hadoop mapreduce?
YARN is not a replacement of Hadoop but it is a more powerful and efficient technology that supports mapreduce
and is also referred to as Hadoop 2.0 or mapreduce 2.
What are the additional benefits YARN brings in to Hadoop?
 Effective utilization of the resources, as multiple applications can be run in YARN, all sharing a common resource pool. In Hadoop mapreduce there are separate slots for Map and Reduce tasks, whereas in YARN there is no fixed slot; the same container can be used for Map and Reduce tasks, leading to better utilization.
 YARN is backward compatible, so all existing mapreduce jobs can run on it without modification.
 Using YARN, one can even run applications that are not based on the mapreduce model
What are the main components of Job flow in YARN architecture?
Mapreduce job flow on YARN involves below components.
A Client node, which submits the Mapreduce job.
The YARN Resource Manager, which allocates the cluster resources to jobs.
The YARN Node Managers, which launch and monitor the tasks of jobs.
The mapreduce applicationmaster, which coordinates the tasks running in the mapreduce job.
The HDFS file system is used for sharing job files between the above entities.
How can native libraries be included in YARN jobs?
There are two ways to include native libraries in YARN jobs-
1) By setting the -D java.library.path on the command line but in this case there are chances that the native libraries
might not be loaded correctly and there is possibility of errors.
2) The better option to include native libraries is to the set the LD_LIBRARY_PATH in the .bashrc file.
What is the fundamental idea behind YARN?
In YARN (Yet Another Resource Negotiator), the jobtracker responsibility is split into:
 Resource management
 Job scheduling/monitoring, with separate daemons for each.
YARN supports additional processing models and implements a more flexible execution engine.
What does the mapreduce (YARN) framework consist of?
Resourcemanager (RM)
 Global resource scheduler
 One master RM
Nodemanager (NM)
 One slave NM per cluster-node
Applicationmaster (AM)
 One AM per application
 Runs in a Container
Container
 RM creates Containers upon request by the AM
 An application runs in one or more containers
What are the two main components of resourcemanager?
Scheduler
It allocates the resources (containers) to various running applications: Container elements such as memory, CPU,
disk etc.
Applicationmanager
It accepts job submissions, negotiates the first container for executing the application-specific applicationmaster, and provides the service for restarting the applicationmaster container on failure.
What is the function of nodemanager?
The nodemanager is the resource manager for the node (Per machine) and is responsible for containers, monitoring
their resource usage (cpu, memory, disk, network) and reporting the same to the resourcemanager
What is the role of applicationmaster in YARN architecture?
 Applicationmaster performs the role of negotiating resources from the Resource Manager and working with
the Node Manager(s) to execute and monitor the tasks status and progress.
 Applicationmaster requests containers for all map tasks and reduce tasks. Containers are assigned to tasks,
once task is done applicationmaster starts notifying its Node Manager. Applicationmaster collects progress
information from all tasks and aggregate values are propagated to Client Node or user.
 Applicationmaster is specific to a single application which is a single job in classic mapreduce or a cycle of
jobs. Once the job execution is completed, applicationmaster will no longer exist.
What is HDFS Federation?
All clients have to go through the Namenode to perform any READ or WRITE operation in HDFS. Since the Namenode holds the entire metadata of HDFS, in a big cluster the metadata can become huge in volume; its memory becomes a limiting factor and the Namenode will start to slow down. Hence the Namenode can become a bottleneck and cause performance issues.
To tackle this issue, HDFS Federation was introduced and you can add multiple Namenodes to a cluster. Each
Namenode is responsible to manage a portion of the filesystem there by sharing the workload of the cluster.
For instance, let’s say we have 2 teams – Marketing and Research in our company funding the Hadoop cluster. You
can create a Namespace called /marketing which will be managed by one Namenode and another Namespace under
/research which will be managed by another Namenode.
The advantage of this is that you don’t have to run two different Hadoop clusters. You are able to run a single Hadoop infrastructure, but one Namenode will manage all the files under /marketing and another Namenode will manage all the files under /research.
Will both Namenodes share information?
No. Each Namenode is only responsible for its assigned namespace; the Namenodes do not share metadata or information and do not communicate with one another. When the Namenode managing /marketing goes down, it will affect all the files under /marketing, but users will still be able to access HDFS and the files under /research, since the Namenode managing /research is fully functional.
Why data could get corrupted in HDFS?
Data could get corrupted during storage or processing. In an environment like Hadoop, data (blocks) gets transferred over the network quite often, and there is always a chance of data corruption during transmission. There is also a high chance of data getting corrupted due to hard disk failures.
How does HDFS detects corruption?
HDFS detects corruption using checksums. Think of a checksum as a unique value that is calculated for a piece of data from the data itself. Whenever data comes into HDFS, a checksum is calculated on the data. Blocks in HDFS are validated during reads, writes and at periodic intervals by comparing the checksum that was initially stored with the checksum calculated on the data at that time. If the checksums differ during any of these checks, the block of data is deemed corrupted.
When does HDFS verifies checksums?
When a client requests to read data from HDFS, the client is responsible for verifying the checksums returned by the datanodes against the checksums it calculates on the data it receives. This way we can be sure that the data was not corrupted while it was stored on disk in the datanodes.
In addition to verifying the data during read and write to HDFS, datanodes also run a background process called
datablockscanner which scans the blocks stored in the datanode for any possible corruptions due to issues with the
hard disks.
How does HDFS fix corrupted data?
This is very simple. HDFS is built from the ground up to handle failures. By default, each block in HDFS is replicated on 3 different nodes across the cluster. So when a block corruption is identified, HDFS simply arranges to copy a good block from one of the replica nodes to the node with the corrupted block.
What is meant by Map side and Reduce side join in Hadoop?
Map Side Join
As the name suggests, if a join is performed at mapper side it is termed as Map Side Join. To perform this join, the
data has to be partitioned equally, sorted by the same key and records for the key should be in the same partition.
Reduce Side Join
It is much simpler than Map Side Join, as reducers get the structured data after processing, unlike Map Side Join which requires the data to be sorted and partitioned. Reduce Side Joins are not as efficient as Map Side Joins because they have to go through the sort and shuffle phases.
Join implementation depends on the size of the dataset and how they are partitioned. If the size of dataset is too large
to be distributed across all the nodes in a cluster or it is too small to be distributed - in either case, Side Data
Distribution technique is used.
Explain a mapreduce program or What are the main components of mapreduce Job
A mapreduce program consists of the following 3 parts:
The Driver code runs on the client machine and is responsible for building the configuration of the job and
submitting it to the Hadoop Cluster. The Driver code will contain the main() method that accepts arguments from
the command line.
The Mapper code reads the input files as <Key, Value> pairs and emits key/value pairs. The Mapper class extends MapReduceBase and implements the Mapper interface. The Mapper interface expects four generics, which define the types of the input and output key/value pairs. The first two parameters define the input key and value types, the second two define the output key and value types.
The Reducer code reads the outputs generated by the different mappers as <Key, Value> pairs and emits key/value pairs. The Reducer class extends MapReduceBase and implements the Reducer interface. The Reducer interface expects four generics, which define the types of the input and output key/value pairs. The first two parameters define the intermediate key and value types, the second two define the final output key and value types. The keys are WritableComparables, the values are Writables.
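As a hedged sketch of a driver (the mapper and reducer class names are placeholders, and this uses the newer mapreduce API rather than the MapReduceBase-style interface described above):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenMapper.class);   //placeholder Mapper class
        job.setReducerClass(SumReducer.class);   //placeholder Reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   //input location
        FileOutputFormat.setOutputPath(job, new Path(args[1])); //output location
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}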
What are Hadoop Writables?
Hadoop Writables allow Hadoop to read and write data in a serialized form for transmission as compact binary files. This helps in straightforward random access and higher performance. Hadoop provides built-in classes which implement Writable: Text, IntWritable, LongWritable, FloatWritable, and BooleanWritable.
What is a nullwritable?
It is a special type of Writable that has a zero-length serialization. In mapreduce, a key or a value can be declared as NullWritable if we don't need that position, storing a constant empty value.
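For example, a mapper that only cares about keys can emit NullWritable as its value (a fragment; it assumes the job's output value class is also set to NullWritable):
//Emit only the key; NullWritable.get() returns the shared, zero-byte instance
context.write(word, NullWritable.get());
//In the driver: job.setOutputValueClass(NullWritable.class);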
What is the purpose of rawcomparator interface?
RawComparator allows implementors to compare records read from a stream without deserializing them into objects, so it is optimized, as there is no overhead of object creation.
What is data serialization?
Serialization is the process of converting object data into byte stream data for transmission over a network across
different nodes in a cluster or for persistent data storage.
What is deserialization of data?
Deserialization is the reverse process of serialization and converts byte stream data into object data for reading data
from HDFS. Hadoop provides Writables for serialization and deserialization purpose.
What is Avro Serialization System?
Avro is a language-neutral data serialization system. It has data formats that work with different languages. Avro
data is described using a language independent schema (usually written in JSON). Avro data files
support compression and are splittable.
Avro provides AvroMapper and AvroReducer classes to run mapreduce programs.
What is outputcommitter?
OutputCommitter describes the commit of task output for a mapreduce job. FileOutputCommitter is the default OutputCommitter class in mapreduce. It performs the following operations:
 Creates a temporary output directory for the job during initialization.
 Then, it cleans up the job, i.e. removes the temporary output directory after job completion.
 Sets up the task temporary output.
 Identifies whether a task needs a commit. The commit is applied if required.
 Job setup, job cleanup and task cleanup are important tasks during output commit.
What are the steps to submit a Hadoop job?
Steps involved in Hadoop job submission:
 The Hadoop job client submits the job jar/executable and configuration to the resourcemanager.
 The resourcemanager then distributes the software/configuration to the slaves.
 The resourcemanager then schedules the tasks and monitors them.
 Finally, job status and diagnostic information is provided to the client.
What is mapfile?
A MapFile is an indexed SequenceFile and it is used for look-ups by key.
What is a Counter and its purpose?
A Counter is a facility for mapreduce applications to report statistics. Counters can be used to track job progress in a very easy and flexible manner. They are defined either by the mapreduce framework or by applications. Each Counter can be of any Enum type. Applications can define counters of an Enum type and update them via Reporter.incrCounter() (old API) or Context.getCounter(...).increment() (new API) in the map and/or reduce methods.
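A hedged sketch with the new API (the enum name and counter semantics are made up):
//Application-defined counters
public enum RecordQuality { GOOD, MALFORMED }

//Inside map() or reduce(): bump a counter through the Context
context.getCounter(RecordQuality.MALFORMED).increment(1);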
Define the different types of Counters.
Built in Counters:
 Mapreduce Task Counters
 Job Counters
Custom Java Counters:
Mapreduce allows users to specify their own counters (using Java enums) for performing their own counting
operation.
Why Counter values are shared by all map and reduce tasks across the mapreduce framework?
Counters are global so shared across the mapreduce framework and aggregated at the end of the job across all the
tasks.
What is the default value of map and reduce max attempts?
Framework will try to execute a map task or reduce task by default 4 times before giving up on it.
Explain the Job outputformat?
OutputFormat describes the details of the output for a mapreduce job.
The mapreduce framework depends on the OutputFormat of the job to:
 Check the job output specification.
 Provide the RecordWriter implementation used to write the output files of the job as <key, value> pairs.
Default: TextOutputFormat
Is there an option in Hadoop to skip bad records?
Hadoop provides an option where a certain set of bad input records can be skipped when processing map inputs. This feature can be controlled through the SkipBadRecords class.
What is the syntax to run a mapreduce program?
hadoop jar file.jar /input_path /output_path
Do we require two servers for the Namenode and the datanodes?
Yes, we need different servers for the Namenode and the datanodes. This is because the Namenode requires a highly configured system, as it stores information about the location details of all the files stored in different datanodes, whereas datanodes only require low-configuration systems.
Consider case scenario: In M/R system, - HDFS block size is 64 MB. Now Input format is fileinputformat and
we have 3 files of size 64K, 65Mb and 127Mb. How many input splits will be made by Hadoop framework?
Hadoop will make 5 splits as follows:
 1 split for 64K files
 2 splits for 65MB files
 2 splits for 127MB files
Why are the number of splits equal to the number of maps?
The number of maps is equal to the number of input splits because we want the key and value pairs of all the input
splits.
Is a job split into maps?
No, a job is not split into maps. A split is created for the file. The file is placed on datanodes in blocks. For each split, a map is needed.
How can we change the split size if our commodity hardware has less storage space?
If our commodity hardware has less storage space, we can change the split size by writing the ‘custom splitter‘.
There is a feature of customization in Hadoop which can be called from the main method.
How is Hadoop different from other data processing tools?
In Hadoop, based upon your requirements, you can increase or decrease the number of mappers without bothering
about the volume of data to be processed. This is the beauty of parallel processing in contrast to the other data
processing tools available.
Can we rename the output file?
Yes we can rename the output file by implementing multiple format output class.
What happens if number of reducers are 0?
It is legal to set the number of reduce tasks to zero if no reduction is desired. In this case the outputs of the map tasks go directly to the filesystem, into the output path set by setOutputPath(Path). The framework does not sort the map outputs before writing them out to the filesystem.
How many instances of jobtracker can run on a Hadoop cluster?
One. There can only be one jobtracker in the cluster. It can be run on the same machine running the namenode.
How does the namenode handle datanode failures?
Through heartbeats and block reports. Every datanode periodically sends a heartbeat to the namenode; if the namenode stops receiving heartbeats from a datanode, it marks that datanode as dead and re-replicates its blocks on other datanodes. (Checksums, by contrast, are used to detect data corruption: every piece of data is stored with a checksum, and a mismatch reports a data corruption error.)
Can I set the number of reducers to zero?
Yes, it can be set to zero. In that case the mapper output is the final output and is stored directly in HDFS.
How many methods does the Writable interface define in Hadoop?
Two – write(DataOutput out) and readFields(DataInput in).
How would you debug a Hadoop code?
There are many ways to debug Hadoop codes but the most popular methods are:
 Using Counters.
 Using the web interface provided by the Hadoop framework.
What are the main configuration parameters in a mapreduce program?
Users of the mapreduce framework need to specify these parameters:
 Job’s input locations in the distributed file system
 Job’s output location in the distributed file system
 Input format
 Output format
 Class containing the map function
 Class containing the reduce function
What does a split do?
Before transferring the data from its hard disk location to the map method, there is a phase or method called the 'Split Method'. The split method pulls a block of data from HDFS to the framework. The Split class does not write anything, but reads data from the block and passes it to the mapper. By default, the split is taken care of by the framework. The split size is equal to the block size and is used to divide a block into a bunch of splits.
What is inputsplit in mapreduce?
InputSplit is a Java class that points to the start and end locations within the block.
What is the difference between HDFS block and inputsplit?
An HDFS block splits data into physical divisions, while an InputSplit in mapreduce splits input files logically.
While the InputSplit is used to control the number of mappers, the size of splits is user defined. In contrast, the HDFS block size is fixed at 64 MB, i.e. for 1 GB of data there will be 1 GB / 64 MB = 16 blocks. However, if the input split size is not defined by the user, it takes the HDFS default block size.
What is the default input type/format in mapreduce?
By default, the input type in mapreduce is text (TextInputFormat).
Why we cannot do aggregation (addition) in a mapper? Why we require reducer for that?
We cannot do aggregation (addition) in a mapper because sorting is not done in a mapper; sorting happens only on the reducer side. Mapper initialization depends on each input split, so while doing aggregation we would lose the value of the previous instance – for each row a new mapper gets initialized, and we do not keep track of the previous row's value.
What is the inputformat?
The InputFormat class is one of the fundamental classes in the Hadoop mapreduce framework. It specifies the process of reading data from files into an instance of the Mapper. This class is responsible for defining two main things:
 Data splits
 Recordreader
1. Data split is a fundamental concept in Hadoop mapreduce framework which defines both the size of individual Map
tasks and its potential execution server.
2. The recordreader is responsible for actual reading records from the input file and submitting them (as key/value
pairs) to the mapper.
Explain the input data formats of the Hadoop framework.
FileInputFormat, TextInputFormat, KeyValueTextInputFormat, SequenceFileInputFormat, SequenceFileAsTextInputFormat and WholeFileInputFormat are file formats in the Hadoop framework.
What are the most commonly defined input formats in Hadoop?
The most common Input Formats defined in Hadoop are:
Text Input Format- This is the default input format defined in Hadoop.
Key Value Input Format- This input format is used for plain text files wherein the files are broken down into lines.
Sequence File Input Format- This input format is used for reading files in sequence.
What does the text inputformat do?
By default, the input format in mapreduce is 'text'. In the text input format, each line of the file becomes a record: the key is the byte offset of the line within the file and the value is the whole line of text. This is how the data gets processed by a mapper: the mapper receives the key (the byte offset of the line) as a LongWritable parameter and the value (the content of the line) as a Text parameter.
What happens in a textinputformat?
In TextInputFormat, each line in the text file is a record. The key is the byte offset of the line and the value is the content of the line. For instance, key: LongWritable, value: Text.
java.lang.Object
org.apache.hadoop.mapreduce.InputFormat<K,V>
org.apache.hadoop.mapreduce.lib.input.FileInputFormat<LongWritable,Text>
org.apache.hadoop.mapreduce.lib.input.TextInputFormat
What do you know about keyvaluetextinputformat?
In KeyValueTextInputFormat, each line in the text file is a 'record'. The first separator character divides each line. Everything before the separator is the key and everything after the separator is the value. For instance, key: Text, value: Text.
What do you know about NLineInputFormat?
NLineInputFormat splits 'N' lines of input as one split.
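For instance, a driver fragment (the value 100 is just illustrative, and 'job' is the Job object):
//Each map task will receive exactly 100 input lines
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.setNumLinesPerSplit(job, 100);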
What do you know about Sequencefileinputformat?
SequenceFileInputFormat is an input format for reading sequence files. Keys and values are user defined. A sequence file is a specific compressed binary file format which is optimized for passing data between the output of one mapreduce job and the input of another mapreduce job. Storing a huge number of small files directly would also increase the memory overhead of the namenode, which has to store information about each of them, so packing small files into sequence files helps; they are often used in high-performance map-reduce jobs.
Explain use cases where sequencefile class can be a good fit?
When the data is of type binary then sequencefile will provide a persistent structure for binary key-value pairs.
Sequencefiles also work well as containers for smaller files as HDFS and mapreduce are optimized for large files.
What is inputsplit and recordreader?
An InputSplit specifies the data to be processed by an individual Mapper.
In general, an InputSplit presents a byte-oriented view of the input.
Default: FileSplit
The RecordReader reads <key, value> pairs from an InputSplit, then processes them and presents a record-oriented view.
What is the purpose of recordreader in Hadoop?
RecordReader is a class that loads data from files and converts it into the key/value pair format required by the mapper. The InputFormat class instantiates the RecordReader after it splits the data.
Can reducers communicate with each other?
Reducers always run in isolation and they can never communicate with each other as per the Hadoop mapreduce
programming paradigm.
Explain the usage of Context Object.
The Context object is used to help the mapper interact with other Hadoop systems. It can be used for updating counters, reporting progress and providing any application-level status updates. The Context object holds the configuration details for the job and also interfaces that help it generate the output.
What are the core methods of a Reducer?
The 3 core methods of a reducer are –
1) setup() – This method of the reducer is used for configuring various parameters like the input data size, distributed cache, heap size, etc.
Function definition – public void setup(Context context)
2) reduce() – This is the heart of the reducer and is called once per key with the associated list of values.
Function definition – public void reduce(Key key, Iterable<Value> values, Context context)
3) cleanup() – This method is called only once, at the end of the reduce task, for clearing up temporary files and resources.
Function definition – public void cleanup(Context context)
A minimal skeleton showing all three methods follows below.
What is the process of changing the split size if there is limited storage space on Commodity Hardware?
If there is limited storage space on commodity hardware, the split size can be changed by implementing the Custom
Splitter. The call to Custom Splitter can be made from the main method.
Writing A File To HDFS – Java Program
Writing a file to HDFS is very easy; we can simply execute the hadoop fs -copyFromLocal command to copy a file from the local filesystem to HDFS. In this post we will write our own Java program to write a file from the local file system to HDFS.
Here is the program - FileWriteToHDFS.java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileWriteToHDFS {
  public static void main(String[] args) throws Exception {
    //Source file in the local file system
    String localSrc = args[0];
    //Destination file in HDFS
    String dst = args[1];
    //Input stream for the file in the local file system to be written to HDFS
    InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
    //Get the configuration of the Hadoop system
    Configuration conf = new Configuration();
    System.out.println("Connecting to -- " + conf.get("fs.defaultFS"));
    //Destination file system (HDFS) and output stream
    FileSystem fs = FileSystem.get(URI.create(dst), conf);
    OutputStream out = fs.create(new Path(dst));
    //Copy the file from local to HDFS
    IOUtils.copyBytes(in, out, 4096, true);
    System.out.println(dst + " copied to HDFS");
  }
}
The program takes in 2 parameters. The first parameter is the file and its location in the local file system that will be copied to the location mentioned in the second parameter in HDFS.
//Source file in the local file system
String localSrc = args[0];
//Destination file in HDFS
String dst = args[1];
We create an InputStream using a BufferedInputStream object around the first parameter, which is the location of the file in the local file system. The input stream objects are regular java.io stream objects and not Hadoop classes, because we are still referencing a file from the local file system and not HDFS.
//Input stream for the file in the local file system to be written to HDFS
InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
Now we need to create an output stream to the file location in HDFS where we can write the contents of the file from the local file system. The first thing we need is some key information about the cluster, such as the NameNode details. These details are already specified in the configuration files during cluster setup.
The easiest way to get the configuration of the cluster is to instantiate a Configuration object; this reads the configuration files from the classpath and loads all the information the program needs.
//Get the configuration of the Hadoop system
Configuration conf = new Configuration();
System.out.println("Connecting to -- " + conf.get("fs.defaultFS"));
//Destination file system (HDFS) and output stream
FileSystem fs = FileSystem.get(URI.create(dst), conf);
OutputStream out = fs.create(new Path(dst));
In the next line we get the FileSystem object using the URL that we passed as the program's input and the configuration that we just created. The file system that is returned is a DistributedFileSystem object. Once we have the file system object, the next thing we need is the OutputStream to the file to which we would like to write the contents of the local file.
We then call the create method on the file system object, using the location of the file in HDFS which we passed to the program as the second parameter.
//Copy the file from local to HDFS
IOUtils.copyBytes(in, out, 4096, true);
Finally we use the copyBytes method from Hadoop's IOUtils class and supply the input and output stream objects. It reads 4096 bytes at a time from the input stream and writes them to the output stream, which copies the entire file from the local file system to HDFS.
Reading A File From HDFS – Java Program
In the last post we saw how to write a file to HDFS with our own Java program. In this post we will see how to read a file from HDFS by writing a Java program.
Here is the program - FileReadFromHDFS.java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileReadFromHDFS {
  public static void main(String[] args) throws Exception {
    //File to read in HDFS
    String uri = args[0];
    Configuration conf = new Configuration();
    //Get the file system - HDFS
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    FSDataInputStream in = null;
    try {
      //Open the path mentioned in HDFS
      in = fs.open(new Path(uri));
      IOUtils.copyBytes(in, System.out, 4096, false);
      System.out.println("End of file: HDFS file read complete");
    } finally {
      IOUtils.closeStream(in);
    }
  }
}
This program takes one argument, the fully qualified HDFS path of the file whose contents we want to read and display on the screen. It simulates the hadoop fs -cat command.
//File to read in HDFS
String uri = args[0];
We need some key information about the cluster, such as the NameNode details. These details are already specified in the configuration files during cluster setup.
Configuration conf = new Configuration();
The easiest way to get the configuration of the cluster is to instantiate a Configuration object; this reads the configuration files from the classpath and loads all the information the program needs.
//Get the file system - HDFS
FileSystem fs = FileSystem.get(URI.create(uri), conf);
FSDataInputStream in = null;
In the next line we get the FileSystem object using the URL that we passed as the program input and the configuration that we just created. This returns a DistributedFileSystem object, and once we have the file system object, the next thing we need is the input stream to the file that we would like to read.
in = fs.open(new Path(uri));
IOUtils.copyBytes(in, System.out, 4096, false);
We get the input stream by calling the open method on the file system object, supplying the HDFS URL of the file we would like to read. Then we use the copyBytes method from Hadoop's IOUtils class to read the entire file's contents from the input stream and print them on the screen.
Steps
1. Create the input file
Create the input.txt file with sample text.
$ vi input.txt
Thanks Lord Krishna for helping us write this book
Hare Krishna Hare Krishna Krishna Krishna Hare Hare
Hare Rama Hare Rama Rama Rama Hare Hare
2. Move the input file into HDFS
Use the -put or -copyFromLocal command to move the file into HDFS
$ hadoop fs -put input.txt
3. Code for the mapreduce program
Java files:
WordCountProgram.java // Driver Program
WordMapper.java // Mapper Program
WordReducer.java // Reducer Program
————————————————–
WordCountProgram.java File: Driver Program
————————————————–
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountProgram extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "WordCountProgram");
    job.setJarByClass(getClass());
    // Configure input source
    TextInputFormat.addInputPath(job, new Path(args[0]));
    job.setInputFormatClass(TextInputFormat.class);
    job.setMapperClass(WordMapper.class);
    job.setReducerClass(WordReducer.class);
    // Configure output
    TextOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new WordCountProgram(), args);
    System.exit(exitCode);
  }
}
————————————————–
WordMapper.java File: Mapper Program
————————————————–
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable count = new IntWritable(1);
  private final Text nameText = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokenizer = new StringTokenizer(value.toString(), " ");
    while (tokenizer.hasMoreTokens()) {
      nameText.set(tokenizer.nextToken());
      context.write(nameText, count);
    }
  }
}
————————————————–
WordReducer.java File: Reducer Program
————————————————–
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text t, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable count : counts) {
      sum += count.get();
    }
    context.write(t, new IntWritable(sum));
  }
}
4. Run the mapreduce program
Create a jar of the code in Step 3 and use the following command to run the mapreduce program
$ hadoop jar wordcount.jar WordCountProgram input.txt output1
Here,
wordcount.jar: name of the exported jar containing all the classes
WordCountProgram: driver program holding the entire job configuration
input.txt: input file
output1: output folder where the output file will be stored
5. View the Output
View the output in the output1 folder
$ hadoop fs -cat /user/cloudera/output1/part-r-00000
Hare 8
Krishna 5
Lord 1
Rama 4
Thanks 1
book 1
for 1
helping 1
this 1
us 1
write 1
HDFS Use Case-
Nokia deals with more than 500 terabytes of unstructured data and close to 100 terabytes of structured data. Nokia
uses HDFS for storing all the structured and unstructured data sets as it allows processing of the stored data at a
petabyte scale.
Mapreduce Use Case:
Skybox has developed an economical image satellite system for capturing videos and images from any location on
earth. Skybox uses Hadoop to analyse the large volumes of image data downloaded from the satellites. The image
processing algorithms of Skybox are written in C++. Busboy, a proprietary framework of Skybox makes use of
built-in code from java based mapreduce framework.
YARN Use Case:
Yahoo has close to 40,000 nodes running Apache Hadoop with 500,000 mapreduce jobs per day, taking 230 extra compute years of processing every day. YARN at Yahoo helped increase the load on the most heavily used Hadoop cluster to 125,000 jobs a day, compared to 80,000 jobs a day previously, which is close to a 50% increase.
Data Access Components of Hadoop Ecosystem- Hive and Pig
HIVE
What is Hive?
Hive is a Hadoop-based system for querying and analyzing large volumes of structured data stored on HDFS. In other words, Hive is a query engine built on top of Hadoop that can compile queries into mapreduce jobs and run them on the cluster.
In which scenarios is Hive a good fit?
Data warehousing applications where more static data is analyzed.
Fast response time is not the criteria.
Data is not changing rapidly.
It provides an abstraction over the underlying mapreduce programs
Like SQL
What are the limitations of Hive?
Limitations in Hive:
 Hive cannot do point updates or deletes. The closest to a delete is dropping partitions. Why? Because behind the scenes, Hive works against files in HDFS.
 Hive does not support triggers.
 Transactions: only very high-level support, and it was added recently.
 Indexes: rudimentary support.
 Views: read-only views are allowed; materialized views are not supported.
 Speed: Hive relies on mapreduce and Hadoop for execution, which works very well with big datasets but is not great for split-second results.
What are the differences between Hive and RDBMS?
HIVE:
Schema on Read
Batch processing jobs
Data stored on HDFS
Processed using mapreduce
RDBMS:
Schema on write
Real time jobs
Data stored on internal structure
Processed using database
What is the purpose of storing the metadata?
People want to read the dataset with a particular schema in mind.
For example, a BA and the CFO of a company may look at the data with a particular schema in mind. The BA may be interested in, say, the IP addresses and timings of the clicks in a weblog, while the CFO may be interested in, say, which clicks were direct clicks on the website and which came from paid Google ads.
Underneath it is the same dataset that is accessed, and this schema is used again and again, so it makes sense to store the schema in an RDBMS.
What is the reason for creating a new metastore_db whenever Hive query is run from a different directory?
Embedded mode:
Whenever Hive runs in embedded mode, it checks whether the metastore exists. If the metastore does not exist, it creates a local metastore. Because the default Derby connection URL uses a relative path, the metastore_db directory is created relative to the directory from which Hive is launched, so running Hive from a different directory creates a new metastore_db.
Property: default value
javax.jdo.option.ConnectionURL = jdbc:derby:;databaseName=metastore_db;create=true
What are the components of Hive architecture?
UI :-User Interface for users to submit queries and other operations to the system.
Driver :- The Driver receives the queries from the UI. This component implements the notion of session handles and provides execute and fetch APIs modeled on JDBC/ODBC interfaces.
Compiler :- The component that parses the query, does semantic analysis on the different query blocks and query
expressions and eventually generates an execution plan with the help of the table and partition metadata looked up
from the metastore.
Metastore:- The component that stores all the structure information of the various tables and partitions in the
warehouse including column and column type information, the serializers and deserializers necessary to read and
write data and the corresponding HDFS files where the data is stored.
Execution Engine: - The component which executes the execution plan created by the compiler. The plan is a DAG
of stages. The execution engine manages the dependencies between these different stages of the plan and executes
these stages on the appropriate system components.
Hadoop Hive Data Types With Examples
Hadoop Hive data types are mainly divided into two categories: primitive data types and complex data types.
Primitive Data Types
Primitive data types are further divided into four groups, mentioned below:
Numeric Data Type
String Data Type
Date/Time Data Type
Miscellaneous Data Type
The data types and sizes of the Hive primitive types are similar to the primitive data types of Java and SQL.
Numeric Data Type
Numerical Data types are mainly divided into 2 types
Integral Data Types
Floating Data Types
Integral Data Types
TINYINT (This TINYINT is equal to Java’s BYTE data type)
SMALLINT (This SMALLINT is equal to Java’s SHORT data type)
INT (This INT is equal to Java’s INT data type)
BIGINT (This BIGINT is equal to Java’s LONG data type)
Floating Data Types
FLOAT (This FLOAT is equal to Java’s FLOAT data type )
DOUBLE (This DOUBLE is equal to Java’s DOUBLE data type)
DECIMAL (This DECIMAL is equal to SQL’s DECIMAL data type)
DECIMAL(4,3) represents a total of 4 digits, of which 3 are decimal digits.
By default, Hive treats integral literals as the INT data type unless the value crosses the range of INT values. If we need a small integral value such as 100 to be treated as TINYINT, SMALLINT or BIGINT, we suffix the value with Y, S or L respectively.
Examples: 100Y – TINYINT, 100S – SMALLINT, 100L – BIGINT
String Data Types
In Hive, string data types are mainly divided into 3 types, mentioned below:
STRING
VARCHAR
CHAR
VARCHAR and CHAR are supported only from Hive 0.14 onwards.
What is the difference between CHAR and VARCHAR?
CHAR is fixed length.
VARCHAR is of variable length, but we need to specify the maximum length of the field (example: name VARCHAR(64)). If a value is shorter than the specified maximum length, the remaining space is freed.
The maximum length of CHAR is 255, while the maximum length of VARCHAR is 65535.
VARCHAR optimizes space/storage because unused bytes are released, whereas with CHAR the unused bytes are not released but are padded with spaces.
If a string value being assigned to a VARCHAR value exceeds the length specified, then the string is silently
truncated.
DATE/TIME Data Types
Date/Time data types are mainly divided into 2 types
DATE
TIMESTAMP
These Hive date/time data types hold UNIX-timestamp-style values for date/time related fields in Hive.
DATE
Represented format is YYYY-MM-DD
Example: DATE ‘2014-12-07’
Date ranges allowed are 0000-01-01 to 9999-12-31
TIMESTAMP
TIMESTAMP use the format yyyy-mm-dd hh:mm:ss[.f...].
Cast type and result:
cast(date as date) – returns the same date value.
cast(date as string) – the date is formatted as a string in the form 'YYYY-MM-DD'.
cast(date as timestamp) – midnight of the year/month/day of the date value is returned as a timestamp.
cast(string as date) – if the string is in the form 'YYYY-MM-DD', the corresponding date value is returned; if the string does not match this format, NULL is returned.
cast(timestamp as date) – the year/month/day of the timestamp is returned as a date value.
Miscellaneous Types
Hive Supports 2 more primitive Data types
BOOLEAN
BINARY
Hive BOOLEAN is similar to Java’s BOOLEAN types, it can stores true or false values only
BINARY is an array of Bytes and like VARBINARY in many rdbmss. BINARY columns are stored within the
record, not separately like blobs
Implicit Conversion Between Primitive Data Types
TINYINT—>SMALLINT–>INT–>BIGINT–>FLOAT–>DOUBLE
TINYINT can be implicitly converted to any other numeric data type, but BIGINT can only be converted to FLOAT or DOUBLE; the reverse conversions are not done implicitly.
Boolean & Binary data types will not be converted to any other data type implicitly.
Explicit Conversion Between Primitive Data Types
Explicit type conversion can be done using the cast operator only.
Example: CAST('500' AS INT) converts the string '500' to the integer value 500. But if cast is used incorrectly, as in CAST('Hello' AS INT), the cast operation fails and returns NULL.
Complex Data Types
ARRAY Data Type
Similar to arrays in Java: an ordered sequence of elements indexed using zero-based integers. All elements in the array must be of the same data type.
Select date,city,mytemp[3] from Temperature;
MAP Data Type
Collection of key-value pairs. Fields are accessed using array notation of keys (e.g., [‘key’]).
It is an unordered collection of key-value pairs. Keys must be of primitive types. Values can be of any type.
Example
Select total[2017] from myschools where state='Chhattisgarh' and gender='Female';
STRUCT Data Type
It is a record type which encapsulates a set of named fields that can be any primitive data type. Elements in
STRUCT type are accessed using the DOT (.) Notation.
Select bikefeatures.enginetype from mybikes where name='Suzuki Swish';
UNIONTYPE
UNIONTYPE is a collection of heterogeneous data types, similar to unions in C. At any point in time, a UNIONTYPE value holds exactly one of its specified data types.
Example: UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>
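To tie the complex types together, here is a minimal sketch of a table that uses ARRAY, MAP, STRUCT and UNIONTYPE columns; the table and column types below are illustrative assumptions, chosen so that the earlier example queries (mytemp[3], total[2017], bikefeatures.enginetype) would be valid against such columns.
CREATE TABLE complex_types_demo (
  id INT,
  mytemp ARRAY<INT>,
  total MAP<INT, INT>,
  bikefeatures STRUCT<enginetype:STRING, cc:INT>,
  misc UNIONTYPE<INT, DOUBLE, ARRAY<STRING>, STRUCT<a:INT, b:STRING>>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
STORED AS TEXTFILE;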
We can create tables in Hive using two different approaches: Hive external tables and Hive managed tables.
i) Hive External Table
ii) Hive Managed Table
Create External Table Hive
Hive> CREATE external TABLE IF NOT EXISTS company(cid int,cname string,cloc string,empid int)
> comment 'company data'
> row format delimited
> FIELDS terminated BY '\t'
> LINES terminated BY '\n'
> stored AS textfile location '/hive_external_table_company';
Load Data From HDFS
LOAD DATA inpath '/companydata/company.txt' INTO TABLE company;
When Hive loads data from HDFS, the data file at '/companydata/company.txt' is automatically moved into the specified location. If you do not set the LOCATION keyword for your Hive external table, the external table data is stored in /user/hive/warehouse/company.
Create Managed Table in Hive
CREATE TABLE IF NOT EXISTS employeetab(id int,name string,sal int)
> comment 'employee details'
> row format delimited
> FIELDS terminated BY '\t'
> LINES terminated BY '\n'
> stored AS textfile;
Hive managed tables are also called Hive internal tables. In a managed table we can also use the LOCATION keyword to specify the storage path. The default path of Hive managed and external tables is:
Hadoop fs -ls /user/hive/warehouse/your_table_name
Hive stores the data for these tables in a subdirectory under the directory defined
by hive.metastore.warehouse.dir (e.g., /user/hive/warehouse), by default.
Load data Local inpath
Hive> LOAD DATA LOCAL inpath '/home/mahesh/pig-releated/file.txt' INTO TABLE employeetab;
Hive Managed Table Load Data From HDFS
LOAD DATA inpath '/import22/part-m-00000' INTO TABLE employeetab;
Hive warehouse path
hadoop fs -ls /user/hive/warehouse/employeetab
hadoop fs -cat /user/hive/warehouse/employeetab/part-m-00000
Hive Create Table Differences between Managed/Internal table and External Table
One of the main differences between an external and a managed table in Hive is that when an external table is
dropped, the data associated with it doesn’t get deleted, only the metadata (number of columns, type of columns,
terminators, etc.) gets dropped from the Hive metastore. When a managed table gets dropped, both the metadata
and data get dropped.
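As a quick sketch of this difference, using the company (external) and employeetab (managed) tables created earlier in this section:
DROP TABLE company;      -- external: only the metadata is removed; the files under /hive_external_table_company remain in HDFS
DROP TABLE employeetab;  -- managed: both the metadata and the files under /user/hive/warehouse/employeetab are deleted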
However, most (if not all) of the changes to schema can now be made through ALTER TABLE
Feature – Internal – External
Schema – data on schema – schema on data
Storage location – /usr/hive/warehouse – HDFS location
Data availability – within local FS – within HDFS
Hive Metastore Data Lost for External Tables
The main doubt here is: for an external table, if we drop the table, the Hive metastore data is lost while the actual data persists in the external location. After the metadata is lost, how can we find the table data in the external location?
Hive Metastore Data Lost (Deleted)
We can set the external table location in two different ways, The first way to set the location of Hive external table.
Hive> CREATE external TABLE depttab(deptid int,dname string,deptloc string,empid int)
> row format delimited
> FIELDS terminated BY'\t'
> stored AS textfile;
For the table above, the Hive external table data is stored in the default warehouse path, such as /user/hive/warehouse. We can check the location using the HDFS command below.
Hadoop fs -ls /user/hive/warehouse
In this warehouse folder your hive table name will be created as a one of the sub folder like
/user/hive/warehouse/your_table_name
Here is the second way to set the location of Hive external table
Hive> CREATE external TABLE depttab(deptid int,dname string,deptloc string,empid int)
> row format delimited
> FIELDS terminated BY '\t'
> stored AS textfile location '/hivepath/hive_external_table';
If we drop the external table, the Hive metastore data is lost but the actual data remains at the table's storage location. So we do not need to worry about finding the table data in the external location.
Hadoop Hive Input Format Selection
Input formats play a very important role in Hive performance. The primary choices of input format are Text, Sequence File, RC File and ORC.
Text Input Format:-
Default, Json, CSV formats are available
Slow to read and write and Can’t split compressed files (Leads to Huge maps)
Need to read/decompress all fields.
Sequence File Input Format:-
i) A traditional mapreduce binary file format which stores keys and values as classes. It is not a natural fit for Hive, which has SQL types; Hive stores the entire line as the value.
ii) Default block size is 1 MB.
iii) Need to read and decompress all the fields.
RC (Row Columnar File) Input Format:-
i) Columns stored separately:
a) Read and decompress only the needed columns.
b) Better compression.
ii) Columns are stored as binary blobs, which depend on the metastore to supply the data types.
iii) Large blocks (4 MB by default), but the reader must still search the file for split boundaries.
ORC (Optimized Row Columnar) Input Format :-
ORC is the Optimized Row Columnar file format. ORC provides a more efficient way to store relational data than the RC file format; by using ORC we can reduce the size of the original data by up to 75%. Compared to the Text, Sequence and RC file formats, ORC is better.
Using ORC files improves performance when Hive is reading, writing, and processing data. RC and ORC show better performance than the Text and Sequence file formats.
ORC takes less time to access the data and less space to store it than RC. However, ORC increases CPU overhead because of the time it takes to decompress the relational data.
Syntax To Create ORC File Format Table
CREATE TABLE table_orc (column1 STRING, column2 STRING, column3 INT, column4 INT
) STORED AS ORC;
Here is the example table of creating a hive table with Partition, Bucket and ORC File Format
CREATE TABLE airanalytics (flightdate date, dayofweek int,depttime int,crsdepttime int,arrtime int,crsarrtime
int,uniquecarrier varchar(10),flightno int,tailnum int,aet int,cet int,airtime int,arrdelay int,depdelay int,origin
varchar(5),dest varchar(5),distance int,taxin int,taxout int,cancelled int,cancelcode int,diverted string,carrdelay
string,weatherdelay string,securtydelay string,cadelay string,lateaircraft string)
PARTITIONED BY (flight_year String)
CLUSTERED BY (uniquecarrier)
SORTED BY (flightdate)
INTO 24 BUCKETS
STORED AS orc tblproperties('orc.compress'='NONE','orc.stripe.size'='67108864','orc.row.index.stride'='25000');
In the statement above we declare the ORC table properties:
orc.compress indicates the compression technique (NONE, Snappy, LZO, etc.)
orc.stripe.size indicates the stripe size of the file
orc.row.index.stride indicates the index stride
Inserting the data into the airanalytics table
INSERT INTO TABLE airanalytics partition(flight_year) SELECT datenew, dayofweek, depttime, crsdepttime,
arrtime, crsarrtime, uniquecarrier, flightno, tailnum, aet, cet, airtime, arrdelay, depdelay, origindest, distance, taxin,
taxout, cancelled, cancelcode, diverted, carrdelay, weatherdelay, securtydelay, cadelay, lateaircraft, year(datenew)
AS flight_year FROM airlines
SORT BY flight_year;
Advantages With ORC File Format
Column stored separately
Stores statistics (Min, Max, Sum,Count)
Has Light weight Index
Larger blocks (256 MB by default), which reduces access time and storage space
Explain the difference between partitioning and bucketing.
Hive organizes tables into partitions a way of dividing a table into coarse-grained parts based on the value of a
partition column, such as a date. Using partitions can make it faster to do queries on slices of the data.
Tables or partitions may be subdivided further into buckets to give extra structure to the data that may be used for
more efficient queries. For example, bucketing by user ID means we can quickly evaluate a user-based query by
running it on a randomized sample of the total set of users.
CREATE TABLE logs (ts BIGINT, line STRING)
PARTITIONED BY (dt STRING, country STRING);
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS;
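To illustrate the sampling benefit mentioned above, here is a sketch (not part of the original text) of sampling one bucket of the bucketed_users table with the standard TABLESAMPLE clause:
SELECT * FROM bucketed_users
TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);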
 Partitioning helps execute queries faster only if the partitioning scheme matches a common filter, i.e. filtering by timestamp ranges, by location, etc. Bucketing, by contrast, is not enforced by default (hive.enforce.bucketing must be set).
 Partitioning helps eliminate data when used in WHERE clause. Bucketing helps organize data inside the partition
into multiple files so that same set of data will always be written in the same bucket. Bucketing helps in joining
various columns.
 In partitioning technique, a partition is created for every unique value of the column and there could be a situation
where several tiny partitions may have to be created. However, with bucketing, one can limit it to a specific number
and the data can then be decomposed in those buckets.
 Basically, a bucket is a file in Hive whereas partition is a directory.
Explain about the different types of partitioning in Hive?
Partitions are created when data is inserted into the table. In static partitions, the name of the partition is hardcoded
into the insert statement whereas in a dynamic partition, Hive automatically identifies the partition based on the
value of the partition field. Based on how data is loaded into the table, requirements for data and the format in which
data is produced at source, a static or dynamic partition can be chosen. In dynamic partitioning, the complete data in the file is read and partitioned into the tables through a mapreduce job, based on a particular field in the file.
Dynamic partitions are usually helpful during ETL flows in the data pipeline.
When loading data from huge files, static partitions are preferred over dynamic partitions as they save time in
loading data. The partition is added to the table and then the file is moved into the static partition. The partition
column value can be obtained from the file name without having to read the complete file.
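As a sketch of the two styles, using the partitioned logs table shown earlier and a hypothetical staging table raw_logs with columns ts, line, dt and country:
-- Static partition: the partition values are hard-coded in the statement
INSERT OVERWRITE TABLE logs PARTITION (dt='2014-01-01', country='GB')
SELECT ts, line FROM raw_logs WHERE dt='2014-01-01' AND country='GB';
-- Dynamic partition: Hive derives the partition values from the last columns of the SELECT
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE logs PARTITION (dt, country)
SELECT ts, line, dt, country FROM raw_logs;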
Hive Buckets
Optimization Techniques
Hive Buckets
Hive buckets (or clusters) are another technique for decomposing data into more manageable, roughly equal parts. For example, suppose we have a table with columns like date, employee_name, employee_id, salary, leaves etc. Using the date column as the top-level partition and employee_id as the second-level partition would lead to too many small partitions. So here the employee table is partitioned by date and bucketed by employee_id. The value of this column is hashed into a user-defined number of buckets, and records with the same employee_id are always stored in the same bucket.
With partitioning there is a real chance of creating thousands of tiny partitions, but with bucketing the number of buckets cannot grow unbounded, because it must be declared at table creation time. For Hive partitions we use PARTITIONED BY, whereas for Hive buckets we use CLUSTERED BY.
Advantages with Hive Buckets
The number of buckets is fixed so it does not fluctuate with data
Hash(column) MOD(number of buckets) –evenly distributed
Used for Query optimization Techniques
CREATE TABLE order (
Username STRING, orderdate STRING, amount DOUBLE, tax DOUBLE
) PARTITIONED BY (company STRING)
CLUSTERED BY (username) INTO 25 BUCKETS;
Here we divided the table into 25 buckets. Set the number of reducers to the same number of buckets specified in the table metadata (i.e. 25): set mapred.reduce.tasks = 25;
To enforce bucketing: set hive.enforce.bucketing = true;
(With hive.enforce.bucketing set, Hive sets the reducer count to the number of buckets automatically.)
Load Data Into Table
[Screenshot in original: loading data into the bucketed Hive table]
After loading, three files named 000000_0, 000001_0 and 000002_0 are created in the table directory; these are the bucket data files.
[Screenshot in original: Hive buckets table data load output]
The partition of hive table has been modified to point to a new directory location. Do I have to move the data
to the new location or the data will be moved automatically to the new location?
Changing the partition location will not move the data to the new location. The data has to be moved manually from the old location to the new one.
INCREMENTAL UPDATES IN APACHE HIVE TABLES
In the BI world, a delta/incremental load that updates existing records and inserts new records is a very common process.
Note: I am assuming that enough temp space is maintained for your data volume, so that we do not face temp-space issues during the process.
Step 1: Create a Hive target table and do a full load from your source. My target table is orders; a sketch of its create statement is shown below. Let's say the full load is done; now we have data in our target table orders (I have loaded only 5 records).
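A minimal sketch of what such a target table could look like; the column names (order_no, quantity, amount, last_update_date) and the order_date partition are inferred from the merge query in Step 3, and the types are assumptions:
CREATE TABLE orders (
  order_no INT,
  quantity INT,
  amount DOUBLE,
  last_update_date TIMESTAMP
)
PARTITIONED BY (order_date STRING)
STORED AS ORC;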
Step 2: Create a stage table to hold all the delta data (records which need to be updated and new records which need to be inserted into the DW table). I created a stage table orders_stage; a sketch of its creation script and of loading the delta data is shown below.
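A sketch of the stage table and a delta load, again with assumed types; the stage table mirrors the target table's layout, and the load path is hypothetical:
CREATE TABLE orders_stage (
  order_no INT,
  quantity INT,
  amount DOUBLE,
  last_update_date TIMESTAMP,
  order_date STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH '/tmp/orders_delta.csv' INTO TABLE orders_stage;  -- path is hypothetical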
Step 3: Create one more temporary table, orders_temp, to merge the delta records in orders_stage with the Hive target table orders (a sketch of its creation is shown below).
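A sketch of the temporary merge table, with the same layout as the target table so that select * works in the union below:
CREATE TABLE orders_temp (
  order_no INT,
  quantity INT,
  amount DOUBLE,
  last_update_date TIMESTAMP
)
PARTITIONED BY (order_date STRING)
STORED AS ORC;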
Now let’s merge orders and orders_stage data and load it into temporary table orders_temp using below script:
insert into orders_temp partition (order_date)
select t1.* from
  (select * from orders union all select * from orders_stage) t1
join
  (select order_no, max(last_update_date) as last_update_date
   from (select * from orders union all select * from orders_stage) t2
   group by order_no) t3
on t1.order_no = t3.order_no and t1.last_update_date = t3.last_update_date;
Step 4: Overwrite the main Hive table from the temp table using dynamic partitioning.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table orders partition (order_date) select * from orders_temp;
What is a Hive UDF?
Users can write their own custom user defined functions, called UDFs (user defined functions).
Steps to create Hive UDF
Step 1 :-
Open your Eclipse and Create a java Class Name like replace.java
Step 2 :-
You should add jar files to that Project folder like
Right Click on project —> Build Path —> Configure Build Path —> Libraries —> Add External Jars —> Select
Hadoop and Hive Lib folder Jars files and Add other Jars files In Hadoop folder —–> Click Ok.
Step 3 :-
Now your Hive Java program compiles in Eclipse without any errors. The basic pattern of a Hive UDF is public class ClassName extends UDF, with an evaluate() method that returns the value.
package dateudf1;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
public class replacenum1 extends UDF {
  private Text result = new Text();
  public Text evaluate(String str, String str1, String str2) {
    String rep = str.replace(str1, str2);
    result.set(rep);
    return result;
  }
}
Step 4 :-
We should create an object with Text value like private Text result = new Text();
Step 5 :-
evaluate() – this method is called once for every row of data being processed. The parameter String str receives the column value, i.e. str holds the value from each row of the Hive table.
How to Execute this code In Hive UDF?
Step 1 :-
Right click on program —> Export —> create Jar
Step 2 :-
Open Hive Terminal—-> add jar to your current Hive location path
Add jar jarname;
Step 3 :-
Create a temporary function in hive terminal
CREATE TEMPORARY FUNCTION replaceword as 'dateudf1.replacenum1';
dateudf1 is the package name of the program and replacenum1 is the class name; both are case sensitive.
Step 4 :-
Create a Database and load related matching data into Hive table.
CREATE TABLE names(fname string,lname string) row format delimited FIELDS terminated BY '\t' stored AS textfile;
Input Data
Mahesh chimmiri
rajesh chimmiri
Load the input file from local to table
Load data local inpath 'names.txt' into table names;
Step 5 :-
Select replaceword(fname,'rajesh','suresh') from names;
fname is the table column name, 'rajesh' is the original value in the table, 'suresh' is the replacement word and names is the table name.
Hive UDF Example Convert From Int to Float
Create table inttofloat (number int) stored as textfile;
My input is
10
11
14
When I apply my UDF on this Hive table, the int values are immediately converted into float, like
10.0
11.0
14.0
Convert From Int to Float
package dateudf1;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
public class Inttofloat212 extends UDF {
  private Text result = new Text();
  public Text evaluate(String str) {
    int num = Integer.parseInt(str);
    float f = (float) num;
    String s = Float.toString(f);
    result.set(s);
    return result;
  }
}
Upper Case And Lower Case
In this Hive UDF program I wrote a UDF that converts a string from lower case to upper case, or from upper case to lower case, depending on whether the string length is greater than 5.
Below is the code for converting strings between upper and lower case.
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
public class Upper extends UDF {
  private Text result = new Text();
  public Text evaluate(Text str) {
    String word = str.toString();
    if (word.length() > 5) {
      result.set(word.toUpperCase());
    } else {
      result.set(word.toLowerCase());
    }
    return result;
  }
}
Hive Commands
 Hive -e command is used to execute the query without entering into hive terminal
$HIVE_HOME/bin/hive -e ‘select a.col from tab1 a’
 Hive -f command is used to execute one or more queries from an HQL script
$HIVE_HOME/bin/hive -f /home/my/hive-script.sql
 The .hiverc file is used to pre-execute queries when Hive starts; the .hiverc file should be in your home folder, not in the Hive folder.
$home/.hiverc
 Reset – Resets the configuration to the default values. Use quit or exit to leave the interactive shell.
 Set <key>=<value> – Sets the value of a particular configuration variable (key).
 Set – Prints a list of configuration variables that are overridden by the user or Hive.
 Set -v – Prints all Hadoop and Hive configuration variables.
 Add files (or) jars (or) archives – Adds one or more files, jars, or archives to the list of resources in the
distributed cache.
 Delete files (or) jars (or) archives <file path > – Removes the resource(s) from the distributed cache.
 ! <command> – Executes a shell command from the Hive shell.
 Dfs <dfs command> – Executes a dfs command from the Hive shell.
 <query string> – Executes a Hive query and prints results to standard output.
 Source <filepath> – Executes a script file inside the CLI.
 If you want a comment at table or database creation time, just use COMMENT 'your comment'. In a script.hql file, comments are written as -- your comment.
 EXTENDED is used to describe the database details.
 FORMATTED describes the database or table details in full, in a more readable way.
 DESCRIBE is used to describe the database or table details.
 CASCADE is used to delete a database even if it contains tables.
 RESTRICT (instead of CASCADE) is equivalent to the default behavior.
 You can set key-value pairs in the DBPROPERTIES associated with a database using the
ALTER DATABASE command. No other metadata about the database can be changed,
including its name and directory location:
Hive> ALTER DATABASE financials SET DBPROPERTIES (‘edited-by’ = ‘Joe Dba’);
 TBLPROPERTIES is used to assign properties to a table.
 DBPROPERTIES is used to assign properties to a database.
 IF NOT EXISTS checks whether the table or database already exists before creating it.
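A short sketch pulling several of these commands together, using the financials database from the ALTER DATABASE example above; the comment text and property values are illustrative:
CREATE DATABASE IF NOT EXISTS financials
COMMENT 'Financial reporting data'
WITH DBPROPERTIES ('creator' = 'Joe Dba', 'date' = '2015-01-01');
-- CASCADE drops the database even if it still contains tables;
-- RESTRICT (the default) refuses to drop a non-empty database.
DROP DATABASE IF EXISTS financials CASCADE;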
Hive Jdbc Connection Program
The Hive JDBC connection program below connects to Hive using the driver org.apache.hive.jdbc.HiveDriver, supplying the connection URL along with a username and password.
HiveServer2 has a new JDBC driver. It supports both embedded and remote access to HiveServer2.
Connection URL for Remote or Embedded Mode
The JDBC connection URL format has the prefix jdbc:hive2:// and the driver class is org.apache.hive.jdbc.HiveDriver. Note that this is different from the old HiveServer.
For a remote server, the URL format is jdbc:hive2://<host>:<port>/<db> (the default port for HiveServer2 is 10000).
For an embedded server, the URL format is jdbc:hive2:// (no host or port).
Program:
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class hivejdbc { //class name hivejdbc
  private static String driverName = "org.apache.hive.jdbc.HiveDriver";

  public static void main(String[] args) throws SQLException {
    try {
      Class.forName(driverName);
    } catch (ClassNotFoundException e) {
      e.printStackTrace();
      System.exit(1);
    }
    //replace "hdfs" here with the name of the user the queries should run as
    Connection con = DriverManager.getConnection("jdbc:hive2://pcp.dev.local:10000/default", "hdfs", "hdfs");
    Statement stmt = con.createStatement();
    String database = "airanalytics";
    String tableName = "airlines";
    stmt.execute("use " + database);
    //stmt.execute("drop table if exists " + tableName);
    //stmt.execute("create table " + tableName + " (key int, value string)");
    // run a query
    String sql = "select origin from airlines where year=1988";
    System.out.println("Running: " + sql);
    ResultSet res = stmt.executeQuery(sql);
    while (res.next()) {
      System.out.println(res.getString("origin"));
    }
  }
}
What is Apache HCatalog?
HCatalog is built on top of the Hive metastore and incorporates Hive's DDL. Apache HCatalog is a table and data management layer for Hadoop; we can process the data registered in HCatalog using Apache Pig, Apache MapReduce and Apache Hive. HCatalog displays data from RCFile format, text files, or sequence files in a tabular view. It also provides REST APIs so that external systems can access these tables' metadata.
HCatalog can be used to share data structures with external systems, including traditional data management tools. It provides access to the Hive metastore for users of other tools on Hadoop, so that they can read and write data to Hive's data warehouse.
What is the WebHCat server?
The WebHCat server provides a REST-like web API for HCatalog. Applications make HTTP requests to run Pig, Hive, and HCatalog DDL from within applications.
Difference between SQL and HiveQL?
Explain about the different types of join in Hive.
Hiveql has 4 different types of joins –
Inner Join: the records common to both tables are retrieved.
SELECT c.Id, c.Name, c.Age, o.Amount FROM sample_joins c JOIN sample_joins1 o ON(c.Id=o.Id);
FULL OUTER JOIN – Combines the records of both the left and right outer tables that fulfil the join condition.
SELECT c.Id, c.Name, o.Amount, o.Date1 FROM sample_joins c FULL OUTER JOIN sample_joins1 o
ON(c.Id=o.Id)
LEFT OUTER JOIN- All the rows from the left table are returned even if there are no matches in the right table.
SELECT c.Id, c.Name, o.Amount, o.Date1 FROM sample_joins c LEFT OUTER JOIN sample_joins1 o
ON(c.Id=o.Id);
RIGHT OUTER JOIN-All the rows from the right table are returned even if there are no matches in the left table.
SELECT c.Id, c.Name, o.Amount, o.Date1 FROM sample_joins c RIGHT OUTER JOIN sample_joins1 o
ON(c.Id=o.Id)
What is Map side join in Hive?
For performing Map side joins, there should be two files, one is of larger size and the other is of smaller size.
As it is a Map side join, the number of reducers will be set to 0 automatically.
Done in memory
Advantages: Map side join helps in minimizing the cost that is incurred for sorting and merging in
the shuffle and reduce stages.
Map-side join also helps in improving the performance of the task by decreasing the time to finish the task.
The smaller table is turned into a hash table. This hash table is being serialized and shipped as distributed cache
prior to the job execution.
SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM a JOIN b ON a.key = b.key
Disadvantages: Map side join is adequate only when one of the tables being joined is small enough to fit into memory. Hence it is not suitable when both tables contain huge amounts of data.
It can also be expensive: building, serializing and distributing the hash table has a cost.
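Besides the /*+ MAPJOIN(b) */ hint shown above, newer Hive versions can convert eligible joins to map joins automatically; a sketch of the commonly used settings (the threshold value shown is illustrative):
SET hive.auto.convert.join=true;                -- let Hive convert eligible joins to map joins
SET hive.mapjoin.smalltable.filesize=25000000;  -- size threshold in bytes for the small table
SELECT a.key, a.value
FROM a JOIN b ON a.key = b.key;                 -- b is broadcast if it is below the threshold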
What is Reduce side join in Hive?
Reduce side joins are straight forward due to the fact that Hadoop sends identical keys to the same reducer, so by
default the data is organized for us.
Handy when all the files to be joined are huge in size; joining huge data takes time.
It is done off memory (via the shuffle) and is cheap in terms of memory.
How can you configure remote metastore mode in Hive?
To configure metastore in Hive, hive-site.xml file has to be configured with the below property –
<property>
<name>hive.metastore.uris</name>
<value>thrift://node1:9083</value>
<description>IP address and port of the metastore host</description>
</property>
Explain about the SMB Join in Hive.
In SMB join in Hive, each mapper reads a bucket from the first table and the corresponding bucket from the second
table and then a merge sort join is performed. Sort Merge Bucket (SMB) join in hive is mainly used as there is no
limit on file or partition or table join. SMB join can best be used when the tables are large. In SMB join the columns
are bucketed and sorted using the join columns. All tables should have the same number of buckets in SMB join.
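A sketch of how an SMB join is typically set up: both tables are bucketed and sorted on the join key with the same number of buckets, and the SMB-related settings are enabled. The table names here are illustrative, and the exact setting names can vary by Hive version:
CREATE TABLE orders_smb  (order_id INT, amount DOUBLE)
CLUSTERED BY (order_id) SORTED BY (order_id ASC) INTO 32 BUCKETS;
CREATE TABLE returns_smb (order_id INT, reason STRING)
CLUSTERED BY (order_id) SORTED BY (order_id ASC) INTO 32 BUCKETS;
SET hive.enforce.bucketing=true;
SET hive.enforce.sorting=true;
SET hive.auto.convert.sortmerge.join=true;
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;
SELECT o.order_id, o.amount, r.reason
FROM orders_smb o JOIN returns_smb r ON o.order_id = r.order_id;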
How can you connect an application, if you run Hive as a server?
When running Hive as a server, the application can be connected in one of the 3 ways-
ODBC Driver-This supports the ODBC protocol
JDBC Driver- This supports the JDBC protocol
Thrift Client- This client can be used to make calls to all hive commands using different programming language like
PHP, Python, Java, C++ and Ruby.
What is serde in Hive? How can you write your own custom serde?
SerDe stands for Serializer/Deserializer. Hive uses a SerDe to read and write data from tables. The Deserializer interface takes
a string or binary representation of a record, and translates it into a Java object that Hive can manipulate.
The Serializer, however, will take a Java object that Hive has been working with, and turn it into something that hive
can write to HDFS or another supported system. A serde allows Hive to read in data from a table, and write it back
out to HDFS in any custom format.
Can the default Hive Metastore be used by multiple users (processes) at the same time?
Derby database is the default Hive Metastore. Multiple users (processes) cannot access it at the same time. It is
mainly used to perform unit tests.
What is a generic UDF in Hive?
It is a UDF which is created using a Java program to serve some specific need not covered under the existing
functions in Hive. It can detect the type of input argument programmatically and provide appropriate responses.
Suppose that I want to monitor all the open and aborted transactions in the system along with the transaction
id and the transaction state. Can this be achieved using Apache Hive?
Hive 0.13.0 and above version support SHOW TRANSACTIONS command that helps administrators monitor
various hive transactions.
Write a query to rename a table Student to Student_New.
ALTER TABLE Student RENAME TO Student_New;
Differentiate between describe and describe extended.
Describe database/schema- This query displays the name of the database, the root location on the file system and
comments if any.
Describe extended database/schema- Gives the details of the database or schema in a detailed manner.
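A small sketch against the financials database used earlier in this material:
DESCRIBE DATABASE financials;           -- name, comment and HDFS location
DESCRIBE DATABASE EXTENDED financials;  -- the above plus DBPROPERTIES and owner details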
Is it possible to overwrite Hadoop mapreduce configuration in Hive?
Yes, hadoop mapreduce configuration can be overwritten by changing the hive-conf settings file.
Can we write the select output of hive to file in HDFS?
INSERT OVERWRITE DIRECTORY '/path/to/output/dir' SELECT * FROM table
hive -e 'select books from table' > /home/lvermeer/temp.tsv
What is the use of explode in Hive?
Explode in Hive is used to convert complex data types into desired table formats. Explode UDTF basically emits all
the elements in an array into multiple rows.
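A small sketch of explode with LATERAL VIEW, against a hypothetical table orders_arr that has an order_id column and an items ARRAY<STRING> column:
-- explode turns each element of the array into its own row
SELECT o.order_id, item
FROM orders_arr o
LATERAL VIEW explode(o.items) item_view AS item;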
Explain about SORT BY, ORDER BY, DISTRIBUTE BY and CLUSTER BY in Hive.
SORT BY – Data is ordered at each of ‘N’ reducers where the reducers can have overlapping range of data.
SELECT * from employees SORT BY Id DESC;
ORDER BY- This is similar to the ORDER BY in SQL where total ordering of data takes place by passing it to a
single reducer. ORDER BY alters the order in which items are returned.
SELECT * FROM employees ORDER BY Department;
DISTRIBUTE BY – It is used to distribute the rows among the reducers. Rows that have the same distribute by
columns will go to the same reducer.
SELECT Id, Name from employees DISTRIBUTE BY Id;
CLUSTER BY- It is a combination of DISTRIBUTE BY and SORT BY where each of the N reducers gets non
overlapping range of data which is then sorted by those ranges at the respective reducers.
SELECT Id, Name from employees CLUSTER BY Id;
GROUP BY will aggregate records by the specified columns which allows you to perform aggregation functions on
non-grouped columns (such as SUM, COUNT, AVG, etc).
SELECT Department, count(*) FROM employees GROUP BY Department;
What is the difference between UNION and UNION ALL?
UNION removes duplicate records (where all columns in the results are the same), UNION ALL does not.
 UNION performs a DISTINCT on the result set, eliminating any duplicate rows.
 UNION ALL does not remove duplicates, and it is therefore faster than UNION.
Write a hive query to view all the databases whose name begins with db
SHOW DATABASES LIKE ‘db.*’
How can you prevent a large job from running for a long time?
This can be achieved by setting the mapreduce jobs to execute in strict mode set hive.mapred.mode=strict;
The strict mode ensures that the queries on partitioned tables cannot execute without defining a WHERE clause.
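For example, with the partitioned logs table from the partitioning question earlier (a sketch):
SET hive.mapred.mode=strict;
SELECT * FROM logs;                       -- fails: no partition predicate
SELECT * FROM logs WHERE dt='2014-01-01'; -- allowed: the partition column is filtered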
What will be the output of cast (‘XYZ’ as INT)?
It will return a NULL value.
What is Index?
Indexes are pointers to particular columns of a table.
The user has to define the index manually.
Creating an index means creating a pointer to a particular column of the table.
Any changes made to the indexed column are tracked using the index value created on that column.
Syntax:
CREATE INDEX <index_name> ON TABLE <table_name> (<column names>);
I do not need the index created in the first question anymore. How can I delete the above index named
index_bonuspay?
DROP INDEX index_bonuspay ON employee;
What happens on executing the below query? After executing the below query, if you modify the column –
how will the changes be tracked?
Hive> CREATE INDEX index_bonuspay ON TABLE employee (bonus)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';
The query creates an index named index_bonuspay which points to the bonus column in the employee table.
Whenever the value of bonus is modified it will be stored using an index value.
Apache Tez is an extensible framework for building high-performance batch and interactive data applications on YARN in Hadoop, and it can handle datasets from terabytes to petabytes. Using Tez we can get fast response times and extreme throughput at petabyte scale.
Sometimes it is necessary to tune memory settings for Tez, which counts against it. MapReduce is more mature (this may change in the future) and in most cases does not require Java memory tuning.
What Apache Tez Does
Tez provides a developer API and framework to write a native YARN application that bridge the spectrum of
interactive and batch workloads. It allows you to express complex computations as dataflow graphs and allows for
dynamic performance optimizations based on real information about the data and the resources required to process
it.
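In Hive, switching between the two execution engines is a one-line setting; a small sketch:
SET hive.execution.engine=tez;  -- run subsequent queries on Tez
SET hive.execution.engine=mr;   -- fall back to classic MapReduce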
If Hive Metastore service is down, then what will be its impact on the Hadoop cluster?
If the Hive metastore service is down, Hadoop cluster just works fine.
Load JSON Data
Hive has two built-in functions, get_json_object and json_tuple, for dealing with JSON. There are also a couple of
JSON SerDe's (Serializer/Deserializers) for Hive.
Here's the first document:
{
"Foo": "ABC",
"Bar": "20090101100000",
"Quux": {
"QuuxId": 1234,
"QuuxName": "Sam"
}
}
Collapsed Version: {"Foo":"ABC","Bar":"20090101100000","Quux":{"QuuxId":1234,"QuuxName":"Sam"}}
CREATE TABLE json_table ( json string );
LOAD DATA LOCAL INPATH '/tmp/simple.json' INTO TABLE json_table;
Built in function #1: get_json_object
The get_json_object takes two arguments: tablename.fieldname and the JSON field to parse, where '$' represents the
root of the document.
Syntax: select get_json_object(json_table.json, '$') from json_table;
select get_json_object(json_table.json, '$.Foo') as foo,
get_json_object(json_table.json, '$.Bar') as bar,
get_json_object(json_table.json, '$.Quux.QuuxId') as qid,
get_json_object(json_table.json, '$.Quux.QuuxName') as qname
from json_table;
You should get the output:
foo bar qid qname
ABC 20090101100000 1234 Sam
(Note: to get the header fields, enter set hive.cli.print.header=true at the hive prompt or in your $HOME/.hiverc file.)
This works and has a nice JavaScript like "dotted" notation, but notice that you have to parse the same document once for every field you want to pull out of your JSON document, so it is rather inefficient.
Built in function #2: json_tuple
So let's see what json_tuple looks like. It has the benefit of being able to pass in multiple fields, but it only works to
a single level deep. You also need to use Hive's slightly odd LATERAL VIEW notation:
A lateral view first applies the UDTF to each row of base table and then joins resulting output rows to
the input rows to form a virtual table having the supplied table alias.
select v.foo, v.bar, v.quux, v.qid
from json_table jt
LATERAL VIEW json_tuple(jt.json, 'Foo', 'Bar', 'Quux', 'Quux.QuuxId') v
as foo, bar, quux, qid;
This returns:
foo bar quux qid
ABC 20090101100000 {"QuuxId":1234,"QuuxName":"Sam"} NULL
It doesn't know how to look inside the Quux subdocument. And this is where json_tuple gets clunky fast - you have to create another lateral view for each subdocument you want to descend into:
select v1.foo, v1.bar, v2.qid, v2.qname
from json_table jt
LATERAL VIEW json_tuple(jt.json, 'Foo', 'Bar', 'Quux') v1
as foo, bar, quux
LATERAL VIEW json_tuple(v1.quux, 'QuuxId', 'QuuxName') v2
as qid, qname;
This gives us the output we want:
foo bar qid qname
ABC 20090101100000 1234 Sam
With a complicated highly nested JSON doc, json_tuple is also quite inefficient and clunky as hell. So let's turn to a
custom SerDe to solve this problem.
The best option: rcongiu's Hive-JSON SerDe
A SerDe is a better choice than a json function (UDF) for at least two reasons:
It only has to parse each JSON record once
ADD JAR /path/to/json-serde-1.1.6.jar;
You can do this either at the hive prompt or put it in your $HOME/.hiverc file.
Now let's define the Hive schema that this SerDe expects and load the simple.json doc:
CREATE TABLE json_serde (
Foo string,
Bar string,
Quux struct<QuuxId:int, QuuxName:string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
LOAD DATA LOCAL INPATH '/tmp/simple.json' INTO TABLE json_serde;
With the openx JsonSerDe, you can define subdocuments as maps or structs. I prefer structs, as it allows you to use
the convenient dotted-path notation (e.g., Quux.QuuxId) and you can match the case of the fields. With maps, all the
keys you pass in have to be lowercase, even if you defined them as upper or mixed case in your JSON.
SELECT Foo, Bar, Quux.QuuxId, Quux.QuuxName
FROM json_serde;
Result:
foo bar quuxid quuxname
ABC 20090101100000 1234 Sam
And now let's do a more complex JSON document:
{
"DocId": "ABC",
"User": {
"Id": 1234,
"Username": "sam1234",
"Name": "Sam",
"ShippingAddress": {
"Address1": "123 Main St.",
"Address2": null,
"City": "Durham",
"State": "NC"
},
"Orders": [
{
"ItemId": 6789,
"OrderDate": "11/11/2012"
},
{
"ItemId": 4352,
"OrderDate": "12/12/2012"
}
]
}
}
Collapsed version:
{"DocId":"ABC","User":{"Id":1234,"Username":"sam1234","Name":"Sam",
"ShippingAddress":{"Address1":"123 Main St.","Address2":"","City":"Durham","State":"NC"},
"Orders":[{"ItemId":6789,"OrderDate":"11/11/2012"},{"ItemId":4352,"OrderDate":"12/12/2012"}]}}
Hive Schema:
CREATE TABLE complex_json (
DocId string,
User struct<Id:int,
Username:string,
Name: string,
ShippingAddress:struct<Address1:string,
Address2:string,
City:string,
State:string>,
Orders:array<struct<ItemId:int,
OrderDate:string>>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
LOAD DATA LOCAL INPATH '/tmp/complex.json' OVERWRITE INTO TABLE complex_json;
SELECT DocId, User.Id, User.ShippingAddress.City as city,
User.Orders[0].ItemId as order0id,
User.Orders[1].ItemId as order1id
FROM complex_json;
Result:
Docid id city order0id order1id
ABC 1234 Durham 6789 4352
But what if we don't know how many orders there are and we want a list of all a user's order Ids? This will work:
SELECT DocId, User.Id, User.Orders.ItemId
FROM complex_json;
Result:
docid id itemid
ABC 1234 [6789,4352]
Oooh, it returns an array of ItemIds. Pretty cool. One of Hive's nice features.
Finally, does the openx JsonSerDe require me to define the whole schema? Or what if I have two JSON docs (say
version 1 and version 2) where they differ in some fields? How constraining is this Hive schema definition?
Let's add two more JSON entries to our JSON document - the first has no orders; the second has a new
"PostalCode" field in Shipping Address.
{
"DocId": "ABC",
"User": {
"Id": 1235,
"Username": "fred1235",
"Name": "Fred",
"ShippingAddress": {
"Address1": "456 Main St.",
"Address2": "",
"City": "Durham",
"State": "NC"
}
}
}

{
"DocId": "ABC",
"User": {
"Id": 1236,
"Username": "larry1234",
"Name": "Larry",
"ShippingAddress": {
"Address1": "789 Main St.",
"Address2": "",
"City": "Durham",
"State": "NC",
"PostalCode": "27713"
},
"Orders": [
{
"ItemId": 1111,
"OrderDate": "11/11/2012"
},
{
"ItemId": 2222,
"OrderDate": "12/12/2012"
}
]
}
}

Collapsed version:
{"DocId":"ABC","User":{"Id":1235,"Username":"fred1235","Name":"Fred","ShippingAddress":{"Address1":"456
Main St.","Address2":"","City":"Durham","State":"NC"}}}
{"DocId":"ABC","User":{"Id":1236,"Username":"larry1234","Name":"Larry","ShippingAddress":{"Address1":"78
9 Main St.","Address2":"","City":"Durham","State":"NC","PostalCode":"27713"},
"Orders":[{"ItemId":1111,"OrderDate":"11/11/2012"},{"ItemId":2222,"OrderDate":"12/12/2012"}]}}

Append those records to complex.json and reload the data into the complex_json table.

Now try the query:

SELECT DocId, User.Id, User.Orders.ItemId


FROM complex_json;

It works just fine and gives the result:

docid id itemid
ABC 1234 [6789,4352]
ABC 1235 null
ABC 1236 [1111,2222]

How to set variables in HIVE scripts (or) how do you pass parameters in hive scripts (or) while invoking a
script how to pass parameters
Arguments can be passed in two ways:
1. Passing the value through the CLI
The command is: hive -hiveconf current_date=01-01-2015 -f argument.hql
Here the script is:
select * from glvc.product where date = '${hiveconf:current_date}';
The command executes fine and returns the result.

2. Setting the value inside the script
In this case the value is already set in the script file, so it does not need to be passed through the CLI. Running hive -hiveconf:current_date -f argument.hql without assigning a value does not return the result, which is why the variable is set inside the script instead.
Script -
set current_date=01-01-2015;
select * from glvc.product where date = '${hiveconf:current_date}';
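Hive also supports the hivevar namespace for user-defined variables; a minimal sketch, assuming the same argument.hql script, would be:
hive --hivevar current_date=01-01-2015 -f argument.hql
and inside the script:
select * from glvc.product where date = '${hivevar:current_date}';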
Regular Expression:
Hive RegexSerDe can be used to extract columns from the input file using regular expressions. It is used only to deserialize data; data serialization is not supported.
There are two classes available:

org.apache.hadoop.hive.contrib.serde2.RegexSerDe
org.apache.hadoop.hive.serde2.RegexSerDe
A regex group is defined by parenthesis "(...)" inside the regex.
On individual lines, if a row matches the regex but has less than expected groups, the missing groups and table fields
will be NULL. If a row matches the regex but has more than expected groups, the additional groups are just ignored.
If a row doesn't match the regex then all fields will be NULL.

Note that the regex contains 3 regex groups capturing the first, second and fifth field on each line, corresponding to
3 table columns:

(\\d+), the leading integer id composed of 1 or more digits,


([^\\t]*), a string: everything except tab, positioned between the 2nd and 3rd delimiting tabs. If we knew the column contained no spaces we could also use "\\S+"; in our example this is not the case (however, we do make that assumption about the 3rd and the 4th fields), and
(\\d++.\\d++), a float with at least 1 digit before and after the decimal point. The full DDL of the table using this regex is sketched below.
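A likely reconstruction of the citiesr1 DDL, assuming the same regex that is reused for citiesr2 further down and a hypothetical location /user/it1/hive/serde/regex1:
hive> CREATE EXTERNAL TABLE citiesr1 (id int, city_org string, ppl float) ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES
('input.regex'='^(\\d+)\\t([^\\t]*)\\t\\S+\\t\\S+\\t(\\d++.\\d++).*') LOCATION '/user/it1/hive/serde/regex1';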
Input sample (files used in examples are available in the attachment):

110 La Coruña Corunna Spain 0.37


112 Cádiz Cadiz Spain 0.4
120 Köln Cologne Germany 0.97
hive> select * from citiesr1 where id>100 and id<121;
110 La Coruña 0.37
112 Cádiz 0.4
120 Köln 0.97
Now, let's consider a case when some fields are missing in the input file, and we attempt to read it using the same
regex used for the table above:

$ hdfs dfs -mkdir -p hive/serde/regex2


$ hdfs dfs -put allcities-flds-missing.utf8.tsv hive/serde/regex2
hive> CREATE EXTERNAL TABLE citiesr2 (id int, city_org string, ppl float) ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES
('input.regex'='^(\\d+)\\t([^\\t]*)\\t\\S+\\t\\S+\\t(\\d++.\\d++).*') LOCATION '/user/it1/hive/serde/regex2';
Input sample:

2<tab>大阪<tab>Osaka<tab><tab>
31<tab>Якутск<tab>Yakutsk<tab>Russia
121<tab>München<tab>Munich<tab><tab>1.2
On lines 1 and 3 we have 5 fields, but some are empty, while on the second line we have only 4 fields and 3 tabs. If
we attempt to read the file using the regex given for table citiesr1 we'll end up with all NULLs on these 3 lines
because the regex doesn't match these lines. To rectify the problem we can change the regex slightly to allow for
such cases:

hive> CREATE EXTERNAL TABLE citiesr3 (id int, city_org string, ppl float) ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES
('input.regex'='^(\\d+)\\t([^\\t]*)\\t[^\\t]*\\t[^\\t]*[\\t]*(.*)') LOCATION '/user/it1/hive/serde/regex2';
The first 2 groups are unchanged, however we have replaced both "\\S+" for unused columns with [^\\t]*, the last delimiting tab is made optional, and the last group is now set to "(.*)", meaning everything after the last tab including the empty string. With these changes, the above 3 lines become:

hive> select * from citiesr3 where id in (2, 31, 121);


2 大阪 NULL
31 Якутск NULL
121 München 1.2
Hive Use Case-
Hive simplifies Hadoop at Facebook with the execution of 7500+ Hive jobs daily for Ad-hoc analysis, reporting and
machine learning.
What is Impala?
Impala is an MPP (Massively Parallel Processing) SQL query engine for processing large volumes of data stored in a Hadoop cluster. It gives high performance and low latency compared to other engines, and is among the highest performing SQL-on-Hadoop engines.
Why Impala?
Impala combines the SQL support and multi-user performance of a traditional analytic database with the scalability and flexibility of Apache Hadoop, by utilizing standard components such as HDFS, hbase, Metastore, YARN, and Sentry.
Impala can read almost all the file formats used by Hadoop, such as Parquet, Avro, and rcfile.
Unlike Apache Hive, Impala is not based on mapreduce algorithms. It implements a distributed architecture based on daemon processes that are responsible for all aspects of query execution and that run on the same machines as the data.
Thus, it avoids the latency of mapreduce, and this makes Impala faster than Apache Hive.
Advantages of Impala
Here is a list of some noted advantages of Cloudera Impala.
 Using Impala, you can process data that is stored in HDFS at lightning-fast speed.
 Since data processing is carried out where the data resides (on the Hadoop cluster), data transformation and data movement are not required for data stored on Hadoop while working with Impala.
 Using Impala, you can access data that is stored in HDFS, hbase, and Amazon s3 without knowledge of Java (mapreduce jobs). You can access it with a basic knowledge of SQL queries.
 To write queries in business tools, data normally has to go through a complex extract-transform-load (ETL) cycle. With Impala, this process is shortened. The time-consuming stages of loading and reorganizing are overcome with new techniques such as exploratory data analysis and data discovery, making the process faster.
 Impala is pioneering the use of the Parquet file format, a columnar storage layout that is optimized for large-scale queries typical in data warehouse scenarios.
Relational Databases and Impala
Impala uses a query language that is similar to SQL and hiveql. The following points describe some of the key differences between Impala and relational databases:
• Impala uses an SQL-like query language that is similar to hiveql; relational databases use SQL.
• In Impala, you cannot update or delete individual records; in relational databases, it is possible to update or delete individual records.
• Impala does not support transactions; relational databases support transactions.
• Impala does not support indexing; relational databases support indexing.
• Impala stores and manages large amounts of data (petabytes); relational databases handle smaller amounts of data (terabytes) when compared to Impala.

Hive, Hbase, and Impala

Though Cloudera Impala uses the same query language, metastore, and user interface as Hive, it differs from Hive and hbase in certain aspects. The following points present a comparative analysis among hbase, Hive, and Impala.

• Hbase is a wide-column store database based on Apache Hadoop and uses the concepts of bigtable. Hive is data warehouse software, used to access and manage large distributed datasets built on Hadoop. Impala is a tool to manage and analyze data that is stored on Hadoop.
• Data model: the data model of hbase is wide-column store; Hive and Impala follow the relational model.
• Implementation language: hbase is developed using Java; Hive is developed using Java; Impala is developed using C++.
• Schema: the data model of hbase is schema-free; the data models of Hive and Impala are schema-based.
• APIs: hbase provides Java, restful and Thrift APIs; Hive provides JDBC, ODBC and Thrift APIs; Impala provides JDBC and ODBC APIs.
• Language support: hbase supports programming languages like C, C#, C++, Groovy, Java, PHP, Python, and Scala; Hive supports languages like C++, Java, PHP, and Python; Impala supports all languages supporting JDBC/ODBC.
• Triggers: hbase supports triggers; Hive does not support triggers; Impala does not support triggers.

All these three databases −


• Are NOSQL databases.
• Available as open source.
• Support server-side scripting.
• Follow ACID properties like Durability and Concurrency.
• Use sharding for partitioning.
Drawbacks of Impala
Some of the drawbacks of using Impala are as follows −
• Impala does not provide any support for Serialization and Deserialization.
• Impala can only read text files, not custom binary files.
• Whenever new records/files are added to the data directory in HDFS, the table needs to be refreshed (see the sketch below).
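After new files land in a table's data directory, something like the following would typically be run in impala-shell (the table name sales_table is hypothetical):
REFRESH sales_table; -- pick up newly added data files for an existing table
INVALIDATE METADATA sales_table; -- reload metadata after changes made outside Impala (e.g. new tables or schema changes)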
What is Impala Architecture?
Impala is an MPP (Massively Parallel Processing) query execution engine that runs on a number of systems in the Hadoop cluster. Unlike traditional storage systems, Impala is decoupled from its storage engine. It has three main components, namely the Impala daemon (Impalad), the Impala Statestore, and the Impala metadata or metastore.
Impala daemon (Impalad)
The Impala daemon (also known as impalad) runs on each node where Impala is installed.
Every time a query is submitted to an impalad on a particular node, that node serves as the "coordinator node" for that query. Other queries are served by Impalad instances running on other nodes as well. After accepting a query, Impalad reads and writes data files and parallelizes the query by distributing the work to the other Impala nodes in the cluster. When queries are processed by the various Impalad instances, all of them return the results to the central coordinating node.
Impala State Store
Impala has another important component called the Impala Statestore, which is responsible for checking the health of each Impalad and then communicating each Impala daemon's health to the other daemons frequently. It can run on the same node where the Impala server is running or on another node within the cluster.
Impala Metadata & Meta Store
Impala metadata and the metastore are other important components. Vital information such as table and column data and table definitions is stored in a centralized database called the metastore.
Each Impala node caches all the metadata locally. When dealing with an extremely large amount of data and/or many partitions, retrieving table-specific metadata could take a significant amount of time, so a locally stored metadata cache helps in providing such information instantly.
When a table definition or table data is updated, other Impala daemons must update their metadata cache by retrieving the latest metadata before issuing a new query against the table in question.
Query Processing Interfaces
To process queries, Impala provides three interfaces, as listed below (a sketch of an impala-shell invocation follows this list).
•Impala-shell − after setting up Impala using the Cloudera VM, you can start the Impala shell by typing the command impala-shell in the terminal.
•Hue interface − inside the Hue browser, you have the Impala query editor where you can type and execute impala queries.
•ODBC/JDBC drivers − just like other databases, Impala provides ODBC/JDBC drivers. Using these drivers, you can connect to impala through programming languages that support those drivers and build applications that process queries in impala using those programming languages.
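The sketch below connects with impala-shell and runs a query; the host name and table are hypothetical, and 21000 is the usual impalad port used by the shell:
impala-shell -i impalad-host:21000 -q "select count(*) from emp;"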
PIG
What is apache pig?
Apache pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs, together with the infrastructure for evaluating those programs.
Apache pig takes a set of instructions from the user, called pig latin instructions, analyzes them, converts them into a mapreduce program and executes the instructions on a hadoop cluster.
Pig features:
Ease of programming
Generates mapreduce programs automatically
Flexible: metadata is optional
Extensible: easy extensible by udfs
Resides on the client machine
Pig’s infrastructure layer consists of a compiler that produces sequences of mapreduce programs; a minimal example script follows.
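The sketch below (input path and field names are hypothetical) is a pig latin script that pig would compile into mapreduce jobs:
logs = load '/user/hirw/input/weblogs' using PigStorage('\t') as (ip:chararray, url:chararray, bytes:int);
by_ip = group logs by ip;
hits = foreach by_ip generate group as ip, COUNT(logs) as hit_count;
dump hits;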

What is the difference between pig and hive?


Type of data: Apache pig is usually used for semi-structured data, while Hive is used for structured data.
Schema: In pig, schema is optional; Hive requires a well-defined schema.
Language: Pig latin is a procedural data flow language; hiveql follows an sql dialect and is a declarative language.
Purpose: Pig is mainly used for programming; Hive is mainly used for reporting.
General usage: Pig is usually used on the client side of the hadoop cluster; Hive is usually used on the server side of the hadoop cluster.
Coding style: Pig is verbose; Hive is more like sql.

Hiveql follows a flat relational data model, whereas piglatin has nested relational data model.
Hiveql and piglatin both convert the commands into mapreduce jobs.
They cannot be used for olap transactions as it is difficult to execute low latency queries.
Differentiate between hadoop mapreduce and pig
Type of language: Mapreduce is a compiled language; pig is a scripting language.
Level of abstraction: Mapreduce offers a low level of abstraction; pig offers a higher level of abstraction.
Code: Mapreduce requires more lines of code; pig requires comparatively fewer lines of code.
Code efficiency: Mapreduce code efficiency is high; pig code efficiency is relatively less.

In which scenario pig is better fit than mapreduce?


Processing of weblogs
Pig provides common data operations (joins, filters, group by, order by, union) and nested data types (tuple, bag and
maps), which are missing from mapreduce.
An ad hoc way of creating and executing map reduce jobs on large datasets
In which scenario mapreduce is a better fit than pig?
Some problems are harder to express in pig. For example:
Complex grouping or joins
Combining lot of datasets
Replicated join
Complex cross products
In such cases, pig’s mapreduce relational operator can be used, which allows plugging in a java mapreduce job; a sketch of this operator follows.
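The sketch below assumes a hypothetical wordcount.jar containing a driver class org.myorg.WordCount:
a = load 'WordcountInput.txt';
b = mapreduce 'wordcount.jar' store a into 'inputDir' load 'outputDir' as (word:chararray, count:int) `org.myorg.WordCount inputDir outputDir`;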
Where not to use pig?
Completely unstructured data. For example: images, audio, video
When more power to optimize the code is required
Retrieving a single record in a very large dataset- complex business logic
Why do we need mapreduce during pig programming?
Pig is a high-level platform that makes hadoop data analysis issues easier to execute. The language we use for this
platform is pig latin. So, when a program is written in pig latin, the pig compiler will convert the program into
mapreduce jobs. Here, mapreduce acts as the execution engine.
How pig differs from mapreduce
In mapreduce, the groupby operation is performed at the reducer side, while filter and projection can be implemented in the map phase. Pig latin provides standard operations similar to mapreduce, like orderby, filter, groupby, etc. We can analyze a pig script to understand the data flow, which also allows errors to be found early. Pig latin is much lower cost to write and maintain than java code for mapreduce.
What can be feed to pig?
We can input structured, semi-structured or unstructured data to pig.
For example, csv’s, tsv’s, delimited data, logs
How is pig useful?
We can use pig in three categories: 1) etl data pipelines 2) research on raw data 3) iterative processing.
The most common use case for pig is the data pipeline. For example, web-based companies get weblogs, and before storing the data into a warehouse they perform operations on it like cleaning and aggregation, i.e. transformations on the data.
Is piglatin a strongly typed language? If yes, then how did you come to the conclusion?
In a strongly typed language, the user has to declare the type of all variables upfront. In apache pig, when you
describe the schema of the data, it expects the data to come in the same format you mentioned. However, when the
schema is not known, the script will adapt to the actual data types at runtime. So, it can be said that piglatin is strongly
typed in most cases but in rare cases it is gently typed, i.e. It continues to work with data that does not live up to its
expectations.
What are the components of apache pig platform?
Pig engine: parser and optimizer; produces sequences of mapreduce programs.
Grunt: pig’s interactive shell. It allows users to enter pig latin interactively and interact with hdfs.
Pig latin: a high-level and easy to understand dataflow language; provides ease of programming, extensibility and optimization.
What are the execution modes in pig? Or how do users interact with the shell in apache pig?
Using grunt i.e. Apache pig’s interactive shell, users can interact with hdfs or the local file system
Pig has two execution modes:
Local mode
No hadoop / hdfs installation is required
All processing takes place in only one local jvm
Only files within the local file system can be accessed.
Used only for quick prototyping and debugging pig latin script
Used for small size files
Pig -x local
Mapreduce mode (default) parses, checks and optimizes locally
Plans execution as one mapreduce job
Submits job to hadoop
To access data file present on hdfs
Monitors job progress
Large size files give good performance
Pig or pig -x mapreduce
Different running modes for running pig?
Pig has two running modes:
Interactive mode pig commands runs one at a time in the grunt shell
Batch mode commands are in pig script file.
What are the scalar datatypes in pig?
Scalar datatype
int -4bytes,
float -4bytes,
double -8bytes,
long -8bytes,
chararray,
bytearray
What are the complex datatypes in pig?
Map:
A map in pig is a chararray-to-data-element mapping, where the element can be of any pig data type including complex data types.
Example of a map: [‘city’#’hyd’,’pin’#500086]
It can also be called a set of key-value pairs where keys → chararray and values → any pig data type.
For example [‘student’#’mahi’, ’rank’#1]
In the above examples, city and pin are data elements (keys) mapping to values.
Tuple: analogous to an rdbms row in a table.
A tuple is an ordered set of fields; fields can be of any data type.
It can also be called a sequence of fields of any type.
A tuple has a fixed length and its fields can include collection datatypes. A tuple contains multiple fields, and tuples are ordered.
Example: (hyd,500086), which contains two fields.
Bag:
A bag contains a collection of tuples which are unordered, with possible duplicates. Bag constants are constructed using braces, with tuples in the bag separated by commas. For example, {(‘hyd’, 500086), (‘chennai’, 510071), (‘bombay’, 500185)}
Bags are used to store collections while grouping. The size of a bag is bounded by the size of the local disk, which means the size of a bag is limited. When a bag is full, pig will spill the bag to the local disk and keep only some parts of the bag in memory. It is not necessary that the complete bag fit into memory. We represent a bag with {}.
Which type in pig is not required to fit in memory?
Bag is the type not required to fit in memory, as it can be quite large.
It can store bags to disk when necessary.
What do you understand by an inner bag and outer bag in pig?
A relation inside a bag is referred to as inner bag and outer bag is just a relation in pig
Does pig support multi-line commands?
Yes
what does flatten do in pig?
Sometimes there is data in a tuple or bag and if we want to remove the level of nesting from that data then flatten
modifier in pig can be used. Flatten un-nests bags and tuples. For tuples, the flatten operator will substitute the fields
of a tuple in place of a tuple whereas un-nesting bags is a little complex because it requires creating new tuples.
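A small sketch of flatten on a bag (relation and field names are hypothetical):
a = load 'input' as (name:chararray, orders:bag{t:(itemid:int)});
b = foreach a generate name, flatten(orders);
-- each (name, itemid) combination now becomes its own output record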
What are the different relational operations in pig latin?
The relational operations in pig latin include: foreach, order by, filter, group, distinct, join and limit.
Explain about the execution plans of a pig script Or Differentiate between the logical and physical plan.
Logical and physical plans are created during the execution of a pig script. Pig scripts go through interpreter checking. The logical plan is produced after semantic checking and basic parsing, and no data processing takes place during the creation of a logical plan. Whenever an error is encountered within the script, an exception is thrown and program execution ends; otherwise, each statement in the script has its own logical plan.
A logical plan contains collection of operators in the script but does not contain the edges between the operators.
After the logical plan is generated, the script execution moves to the physical plan where there is a description about
the physical operators, apache pig will use, to execute the pig script. A physical plan is more or less like a series of
mapreduce jobs but then the plan does not have any reference on how it will be executed in mapreduce. During the
creation of physical plan, cogroup logical operator is converted into 3 physical operators namely –local rearrange,
global rearrange and package. Load and store functions usually get resolved in the physical plan.
Is the keyword ‘define’ like a function name?
Yes, the keyword ‘define’ is like a function name. Once you have registered, you have to define it. Now the
compiler will check the function in exported jar. When the function is not present in the library, it looks into your
jar.
Is the keyword ‘functional’ a user defined function (udf)?
No, the keyword ‘functional’ is not a user defined function (udf). While using udf, we have to override some
functions. The keyword ‘functional’ is a built-in function i.e a pre-defined function, therefore it does not work as a
udf.
Does pig give any warning when there is a type mismatch or missing field?
No, pig will not show any warning if there is no matching field or a mismatch. If you assume that pig gives such a
warning, then it is difficult to find in log file. If any mismatch is found, it assumes a null value in pig.
Compare apache pig and sql.
 Apache pig differs from sql in its usage for etl, lazy evaluation, store data at any given point of time in the pipeline,
support for pipeline splits and explicit declaration of execution plans. Sql is oriented around queries which produce a
single result. Sql has no in-built mechanism for splitting a data processing stream and applying different operators to
each sub-stream.
 Apache pig allows user code to be included at any point in the pipeline, whereas if sql were to be used, the data would need to be imported into the database first and only then would the process of cleaning and transformation begin.
What do you know about the case sensitivity of apache pig?
It is difficult to say whether apache pig is case sensitive or case insensitive. For instance, user defined functions,
relations and field names in pig are case sensitive i.e. The function COUNT is not the same as function count or
X=load ‘foo’ is not same as x=load ‘foo’. On the other hand, keywords in apache pig are case insensitive i.e. LOAD
is same as load.
What is the role of a co-group in pig?
Co-group joins the data set by grouping one particular data set only. It groups the elements by their common field
and then returns a set of records containing two separate bags. The first bag consists of records from the first data set
with the common data set, while the second bag consists of records from the second data set along with the common
data set.
Can we say co-group is a group of more than 1 data set?
In the case of more than one data sets, co-group will group all the data sets and join them based on the common
field.
Differentiate between group and cogroup operators.
Both group and cogroup operators are identical and can work with one or more relations. Group operator is
generally used to group the data in a single relation for better readability, whereas cogroup can be used to group the
data in 2 or more relations. Cogroup is more like a combination of group and join, i.e., it groups the tables based on
a column and then joins them on the grouped columns. It is possible to cogroup up to 127 relations at a time.
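A small sketch of cogroup on two relations (file and field names are hypothetical):
a = load 'students' as (name:chararray, city:chararray);
b = load 'employees' as (name:chararray, dept:chararray);
c = cogroup a by name, b by name;
-- each record of c is (group, {bag of matching a tuples}, {bag of matching b tuples})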
Explain the difference between count_star and count functions in apache pig?
The COUNT function does not include null values when counting the number of elements in a bag, whereas the COUNT_STAR function includes null values while counting.
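A sketch showing the difference (relation and field names are hypothetical):
a = load 'input' as (x:chararray, y:int);
grp = group a by x;
counts = foreach grp generate group, COUNT(a.y) as nonnull_cnt, COUNT_STAR(a.y) as total_cnt;
-- COUNT skips tuples whose first field (y) is null, COUNT_STAR counts them all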
What are the various diagnostic operators available in apache pig?
1. Dump operator- it is used to display the output of pig latin statements on the screen, so that developers can debug
the code.
 Explain utility is used when trying to debug errors or optimize piglatin scripts. Explain can be applied on a particular alias in the script, or it can be applied to the entire script in the grunt interactive shell. The explain utility produces several graphs in text format which can be printed to a file.
 Describe debugging utility is helpful to developers when writing pig scripts as it shows the schema of a relation in
the script. For beginners who are trying to learn apache pig can use the describe utility to understand how each
operator makes alterations to data.
Illustrate - executing pig scripts on large data sets usually takes a long time. For instance, if the script has a join operator there should be at least a few records in the sample data that have the same key, otherwise the join operation will not return any results. To tackle these kinds of issues, illustrate is used. Illustrate takes a sample from the data and whenever it comes across operators like join or filter that remove data, it ensures that only some records pass through and some do not, by making modifications to the records such that they meet the condition. Illustrate just shows the output of each stage but does not run any mapreduce task.
how will you merge the contents of two or more relations and divide a single relation into two or more
relations?
This can be accomplished using the union and split operators.
i have a relation r. How can i get the top 10 tuples from the relation r.?
The TOP() function returns the top n tuples from a bag of tuples or a relation. N is passed as a parameter to TOP() along with the column whose values are to be compared and the relation r (a sketch follows).
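The sketch below assumes relation r has at least two fields and that the comparison column is the field at index 1:
grp = group r all;
top10 = foreach grp generate flatten(TOP(10, 1, r));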
what are the different types of udf’s in java supported by apache pig?
Algebraic, eval and filter functions are the various types of udf’s supported in pig.
can you join multiple fields in apache pig scripts?
Yes, it is possible to join multiple fields in pig scripts because the join operations takes records from one input and
joins them with another input. This can be achieved by specifying the keys for each input and the two rows will be
joined when the keys are equal.
Loading datasets
Dataset in /user/hirw/input/stocks (hdfs location) is a comma delimited stocks dataset with day by day stocks
information for several stock symbols.
Abcse, kbt,2001-05-22,11.75,11.75,10.80,11.10,765000,5.04
Abcse, kbt,2001-05-21,11.75,11.87,11.51,11.75,814600,5.33
Abcse, kbt,2001-05-18,12.10,12.10,11.83,11.85,250600,5.38
Abcse, kib,2010-02-08,3.03,3.15,2.94,3.10,455600,3.10
Abcse, kib,2010-02-05,3.01,3.06,2.96,3.04,635400,3.04
Abcse, kib,2010-02-04,3.03,3.50,2.85,3.05,990400,3.05
Pigstorage(',') instructs pig that we are loading a comma-delimited dataset. Note that the column names are listed in the same order as they are laid out in the dataset, along with their data types. Also, the data types resemble the datatypes in java.
Different load variations in pig.
Variation 1 – load without column names or types
Grunt> stocks1 = load '/user/hirw/input/stocks' using pigstorage(',');
Variation 2 – load with column names but no types
Grunt> stocks2 = load '/user/hirw/input/stocks' using pigstorage(',') as (exchange, symbol, date, open, high,
low, close, volume, adj_close);
Variation 3 – load with column names and types
Grunt> stocks3 = load '/user/hirw/input/stocks' using pigstorage(',') as (exchange:chararray,
symbol:chararray, date:datetime, open:float, high:float, low:float, close:float, volume:int, adj_close:float);
The structure of stocks3 (variation 3) is well defined. But what is the structure of stocks1 and stocks2? To look up
the structure of a relation (for eg. Stocks1) use the describe operator.
Describe operator
Pig cannot guess the structure of stocks1 as we did not provide either column names or types.
Grunt> describe stocks1;
Schema for stocks1 unknown.
With stocks2, pig knows the column names and makes all the column types default to bytearray.
Grunt> describe stocks2;
Stocks2: {exchange: bytearray, symbol: bytearray, date: bytearray, open: bytearray, high: bytearray, low: bytearray,
close: bytearray, volume: bytearray, adj_close: bytearray}
How is the ‘load’ keyword useful in pig scripts?
The first step in a dataflow language is to specify the input, which is done by using the ‘load’ keyword. Load looks for your data on hdfs in a tab-delimited file using the default load function ‘pigstorage’. Suppose we want to load data from hbase, we would use the loader for hbase, ‘hbasestorage‘.
Example of pigstorage loader
A = load ‘/home/ravi/work/flight.tsv’ using pigstorage (‘\t’) as (origincode:chararray,
destinationcode:chararray, origincity:chararray, destinationcity:chararray, passengers:int, seats:int,
flights:int, distance:int, year:int, originpopulation:int, destpopulation:int);
Example of hbasestorage loader
X= load ‘a’ using hbasestorage();
If you don’t specify any loader function, the built-in function ‘pigstorage‘ is used. The ‘load’ statement can also have an ‘as’ keyword for creating a schema, which allows you to specify the schema of the data you are loading.
‘pigstorage‘ and ‘textloader’ are the two built-in pig load functions that operate on hdfs files.
How is the ‘store’ keyword useful in pig scripts?
After we have completed processing, the result should be written somewhere; pig provides the store statement for this purpose:
store processed into ‘/data/ex/process';
If you do not specify a store function, pigstorage will be used. You can specify a different store function with a using clause:
Store processed into ‘processed’ using hbasestorage();
We can also pass arguments to the store function, for example:
Store processed into ‘processed’ using pigstorage(‘,’);
In x = load ‘a’, x is the relation name in which the result is stored; once made, an assignment is permanent, although it is possible to reuse relation names. Pig latin also has field names, for example x = load ‘a’ as (b,c,d) where b, c, d are field names. Both relation names and field names must start with an alphabetic character.
Project and manipulate columns
Load dataset with column names and types
Grunt> stocks = load '/user/hirw/input/stocks' using pigstorage(',') as (exchange:chararray,
symbol:chararray, date:datetime, open:float, high:float, low:float, close:float, volume:int, adj_close:float);
Use foreach, Generate operator for projecting the columns in the dataset.
Grunt> projection = foreach stocks generate symbol, substring(exchange, 0, 1) as sub_exch, close - open as
up_or_down;
Now projection will have 3 columns – symbol, sub_exch which is a result of a substring operation on the exchange
column and up_or_down which is a result of close price minus the opening price.
Grunt> top100 = limit projection 100;
Grunt> dump top100;
Load dataset with out column names and types
Grunt> stocks1 = load '/user/hirw/input/stocks' using pigstorage(',');
Grunt> describe stocks1;
Schema for stocks1 unknown.
With schema unknown for stocks1 relation how can we project and manipulate column? Very simple, we can use
the column position starting with index 0 for first column.
Grunt> projection = foreach stocks1 generate $1 as symbol, substring($0, 0, 1) as sub_exch, $6 - $3 as
up_or_down;
Similarly close ($6) and open ($3) are converted to double (casting to integer or float might lose precision) since
these columns are involved in numerical calculation.
Grunt> describe projection;
2015-12-09 12:41:55,172 [main] warn org.apache.pig.pigserver - encountered warning using_overloaded_function 1
time(s).
2015-12-09 12:41:55,172 [main] warn org.apache.pig.pigserver - encountered warning implicit_cast_to_chararray 1
time(s).
2015-12-09 12:41:55,172 [main] warn org.apache.pig.pigserver - encountered warning implicit_cast_to_double 2
time(s).
Projection: {symbol: bytearray,sub_exch: chararray,up_or_down: double}
Finally, display the results.
Grunt> top100 = limit projection 100;
Grunt> dump top100;
Display results on screen
What is the purpose of ‘dump’ keyword in pig?
Dump operator is used to display the results on screen.
Grunt> dump stocks;
Limit results
Don’t like to see all the results? Use the limit operator.
Grunt> top100 = limit stocks 100;
Grunt> dump top100;
What does foreach do?
Foreach is used to apply transformations to the data and to generate new data items. The name itself indicates that
for each element of a data bag, the respective action will be performed.
Foreach takes a set of expressions and applies them to every record in the data pipeline
Syntax: foreach bagname generate expression1, expression2, …..
The meaning of this statement is that the expressions mentioned after generate will be applied to the current record
of the data bag.
A = load ‘input’ as (user:chararray, id:long, address:chararray, phone:chararray,preferences:map[]);
B = foreach a generate user, id;
Positional references are preceded by a $ (dollar sign) and start from 0:
c = foreach d generate $2 - $1;
Suppose we want to project a range of fields by using (..):
A = load ‘input’ as (high, mediumhigh, avg, low)
b=foreach a generate high..low;(produces high,mediumhigh,avg,low)
c=foreach a generate ..low; (produces high,mediumhigh,avg,low)
To extract data from complex datatypes such as tuple, bag and map, we use projection operators.
How to write ‘foreach’ statement for map datatype in pig scripts?
For map we can use hash(‘#’)
bball = load ‘baseball’ as (name:chararray, team:chararray,position:bag{t:(p:chararray)}, bat:map[]);
avg = foreach bball generate bat#’batting_average';
How to write ‘foreach’ statement for tuple datatype in pig scripts?
For tuple we can use dot(‘.’)
a = load ‘input’ as (t:tuple(x:int, y:int));
b = foreach a generate t.x, t.$1;
How to write ‘foreach’ statement for bag datatype in pig scripts?
When you project fields in a bag, you are creating a new bag with only those fields:
a = load ‘input’ as (b:bag{t:(x:int, y:int)});
b = foreach a generate b.x;
we can also project multiple field in bag
a = load ‘input’ as (b:bag{t:(x:int, y:int)});
b = foreach a generate b.(x, y);
Filter records
Grunt> stocks = load '/user/hirw/input/stocks' using pigstorage(',') as (exchange:chararray,
symbol:chararray, date:datetime, open:float, high:float, low:float, close:float, volume:int, adj_close:float);
Filter operator
Filter stock records with volume over 400,000. Resulting filter_by_volume relation will have stock records with
volume greater than 400,000.
Grunt> filter_by_volume = filter stocks by volume > 400000;
Filter stock records from year 2003. We use the getyear function to extract the year from the date column. Resulting
filter_by_yr relation will have stock records from year 2003.
Grunt> filter_by_yr = filter stocks by getyear(date) == 2003;
Display results
Display both filter_by_volume and filter _by_year. Limit the number of records to 100.
Grunt> top100 = limit filter_by_volume 100;
Grunt> dump top100;
Grunt> top100 = limit filter_by_yr 100;
Grunt> dump top100;
What is the use of having filters in apache pig?
Just like the where clause in sql, apache pig has filters to extract records based on a given condition or predicate. The record is passed down the pipeline if the predicate or condition turns out to be true. Predicates contain various operators like ==, >=, <=, !=; of these, == and != can also be applied to maps and tuples.
Example -
x = load ‘inputs’ as (name, address);
y = filter x by name matches ‘mr.*’;
Grouping records
Our stocks dataset has day by day stocks information like opening, closing prices, high, low and volume for the day
for several symbols across many years. Let’s calculate the average volume for all stocks symbol from year 2003.
Load & filter records from 2003
Grunt> stocks = load '/user/hirw/input/stocks' using pigstorage(',') as (exchange:chararray,
symbol:chararray, date:datetime, open:float, high:float, low:float, close:float, volume:int, adj_close:float);
Grunt> filter_by_yr = filter stocks by getyear(date) == 2003;
Grouping records
Now that the dataset is loaded and filtered, let’s group the records by symbol. Grouping records is very simple, use the group…by operator as follows:
Grunt> grp_by_sym = group filter_by_yr by symbol;
Finding average
Now that the records are grouped by symbol we can do more meaningful operations on the grouped records. In order to project grp_by_sym we need to know the structure of grp_by_sym. Use the describe operator on grp_by_sym:
Grunt> describe grp_by_sym;
Grp_by_sym: {group: chararray,filter_by_yr: {(exchange: chararray,symbol: chararray,date: datetime,open: float,high: float,low: float,close: float,volume: int,adj_close: float)}}
From the describe operation we can see that grp_by_sym has two columns, group and filter_by_yr. Group represents the column used for grouping; in our case we have grouped the records by the symbol column, so group represents symbol. Filter_by_yr will have the list of records for a given symbol.
Our goal is to project the symbol and the average volume for the symbol. So the below foreach projection would achieve that with the avg function.
Grunt> avg_volume = foreach grp_by_sym generate group, round(avg(filter_by_yr.volume)) as avgvolume;
Display result
Grunt> dump avg_volume;
Group
The group statement collects together records with the same key. In sql, the group by clause creates a group that must feed directly into one or more aggregate functions. In pig latin there is no direct connection between group and aggregate functions.
Input2 = load ‘daily’ as (exchanges, stocks);
grpds = group input2 by stocks;
Orderby
The orderby statement sorts your data for you, producing a total order of your output data. The syntax of order is similar to group: you indicate a key or set of keys by which you wish to order your data.
input2 = load ‘daily’ as (exchanges, stocks);
grpds = order input2 by exchanges;
Distinct
The distinct statement is very simple. It removes duplicate records. It works only on entire records, not on individual fields:
input2 = load ‘daily’ as (exchanges, stocks);
grpds = distinct input2;
Join
Join selects records from one input and joins them with another input. This is done by indicating keys for each input. When those keys are equal, the two rows are joined.
Input2 = load ‘daily’ as (exchanges, stocks);
input3 = load ‘week’ as (exchanges, stocks);
grpds = join input2 by stocks,input3 by stocks;
We can also join multiple keys
example:
input2 = load ‘daily’ as (exchanges, stocks);
input3 = load ‘week’ as (exchanges, stocks);
grpds = join input2 by (exchanges,stocks),input3 by (exchanges,stocks);
Limit
Sometimes you want to see only a limited number of results. Limit allows you to do this:
Input2 = load ‘daily’ as (exchanges, stocks);
first10 = limit input2 10;
You have a file employee.txt in the hdfs directory with 100 records. You want to see only the first 10 records
from the employee.txt file. How will you do this?
The first step would be to load the file employee.txt with the relation name employee.
The first 10 records of the employee data can be obtained using the limit operator -
result = limit employee 10;
Ordering records
First let’s load, group and find the average volume of stocks symbol from year 2003.
Grunt> stocks = load '/user/hirw/input/stocks' using pigstorage(',') as (exchange:chararray, symbol:chararray,
date:datetime, open:float, high:float, low:float, close:float, volume:int, adj_close:float);
Grunt> filter_by_yr = filter stocks by getyear(date) == 2003;
Grunt> grp_by_sym = group filter_by_yr by symbol;
Grunt> avg_volume = foreach grp_by_sym generate group, round(avg(filter_by_yr.volume)) as avgvolume;
Ordering records
Use the order operator to order the records. By default records are ordered in ascending order; use desc to order records in descending order.
Grunt> avg_vol_ordered = order avg_volume by avgvolume desc;
We can also choose to perform ordering on multiple columns. In the below instruction, the records will be ordered by symbol and then by volume; here group refers to the symbol column.
Grunt> avg_vol_ordered = order avg_volume by group, avgvolume desc;
Display results
Grunt> dump avg_vol_ordered;
Executing as a script
Dump vs. Store
Dump operator is used to display or print data on the screen, but more often than not we would like to store the results in hdfs.
Store operator is used to store the results in hdfs. With store we can also specify what delimiter to use when storing the results. For example, below we are instructing pig to store the records from the top10 relation into output/pig/avg-volume in hdfs, and the column delimiter is specified using the pigstorage function. In this case the columns will be delimited by a comma.
Grunt> top10 = limit avg_vol_ordered 10;
Grunt> store top10 into 'output/pig/avg-volume' using pigstorage(',');
Running instructions as a script
Running a series of pig instructions is very simple. Simply save the instructions in a file. The file extension .pig is not mandatory but more of a convention. Execute the file like below:
pig /hirw-workshop/pig/scripts/average-volume.pig
Executing script with parameters
Parameter placeholders
Take a look at the below sample. ‘$input’ is a parameter which defines the input location and ‘$output’ is a
parameter which defines the output location.
Prices = load '$input' using pigstorage(',') as (exchange:chararray, symbol:chararray, date:datetime,
open:float, high:float, low:float, close:float,volume:int, adj_close:float);
Store top10 into '$output' using pigstorage(',');
Passing parameters
Now that we know how to set the parameters, let’s see how to pass parameters from the command line. Use -param and the parameter name to set and pass the value for the parameters. In the below sample, at run time the value of the $input in the script will be substituted as /user/hirw/input/stocks
Pig -param input=/user/hirw/input/stocks -param output=output/pig/avg-volume-params /hirw-
workshop/pig/scripts/average-volume-parameters.pig
Tuple & bag
Grunt> stocks = load '/user/hirw/input/stocks' using pigstorage(',') as (exchange:chararray,
symbol:chararray, date:datetime, open:float, high:float, low:float, close:float, volume:int, adj_close:float);
Grunt> filter_by_yr = filter stocks by getyear(date) == 2003;
Grunt> grp_by_sym = group filter_by_yr by symbol;
Grunt> describe grp_by_sym;
From the output, we can see that the datatype of group is chararray. What is the datatype of filter_by_yr?
You can see the structure of filter_by_yr has a curly brace { followed by a parenthesis (. Whenever you see curly braces, it is referred to as a bag. Whenever you see a parenthesis, it is referred to as a tuple.
Tuple vs. Bag
Tuple is nothing but a record – with a collection of columns, in our case exchange, symbol, date etc. Bag is nothing
but a collection of records or tuples. So if you look at the below structure of filter_by_yr , you can see filter_by_yr is
a bag or in other words, filter_by_yr is a collection of records with columns exchange, symbol, date, open etc.
Grunt> describe grp_by_sym;
You can also say grp_by_sym is a bag because it has curly braces. Although in this case the parenthesis is omitted from the display, grp_by_sym represents a bag or a collection of records or tuples. We can define grp_by_sym as a collection of tuples with 2 columns in each tuple: group, which is of type chararray, and filter_by_yr, which is of type bag.
Map
Take a look at couple of records from department dataset. The first column has the department number, second
column has department name. Third column has the address. But the structure of it looks weird doesn’t it? It is a
map.
328;admin hearng;[street#939 w el camino,city#chicago,state#il]
43;animal contrl;[street#415 n mary ave,city#chicago,state#il]
When you see a square bracket, we can infer it is a map. Map is nothing but a key value pair. Above records have 3
key value pairs – street, city and state.
Load & project a map
Grunt> departments = load '/user/hirw/input/employee-pig/department_dataset_chicago' using
pigstorage(';') as (dept_id:int, dept_name:chararray, address:map[]);
Grunt> dept_addr = foreach departments generate dept_name, address#'street' as street, address#'city' as
city, address#'state' as state;
Loading is easy, for the type simply say map[]. Address is a map with key value pairs. To project the value for street
key from the address column, you can say address#’street’. Similarly for city you can say address#’city’.
Display results
Grunt> top100 = limit dept_addr 100;
Grunt> dump top100;
How to write a pig udf example in java
Steps to create pig udf
Step 1 :-
Open your eclipse and create a java class name like ucfirst.java
Step 2 :-
You should add jar files to that project folder like
Right click on project —> build path —> configure build path —> libraries —> add external jars —> select hadoop
and pig lib folder jars files and add other jars files in hadoop folder —–> click ok.
Step 3 :-
Now your pig java program is supported in your eclipse without any errors. The basic step in a pig udf is public class ucfirst extends EvalFunc<return type>, and you return the value.
package myudfs;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;
public class ucfirst extends EvalFunc<String> {
public String exec(Tuple input) throws IOException {
if (input == null || input.size() == 0)
return null;
try {
// take the first field of the tuple, upper-case its first character and return it
String str = (String) input.get(0);
char ch = str.toUpperCase().charAt(0);
String str1 = String.valueOf(ch);
return str1;
} catch (Exception e) {
throw WrappedIOException.wrap("Caught exception processing input row", e);
}
}
}
Step 4 :-
public String exec(Tuple input) throws IOException {
if (input.size() == 0)
return null;
The return type is String, and the entire row of the text file is considered as a tuple. First it checks whether the input is empty; if the input is empty, it returns null.
Step 5 :-
In the try/catch block, we write the logic inside the try block:
try {
String str = (String) input.get(0);
char ch = str.toUpperCase().charAt(0);
String str1 = String.valueOf(ch);
return str1;
Step 6 :-
Catch block only for exception handling
How to execute this code in pig udf ?
Step 1 :-
Right click on program —> export —> create jar
Step 2 :-
Register jarname;
Step 3 :-
Write the pig script
Register ucfirst.jar;
a = load ‘sample.txt’ as (logid:chararray);
b = foreach a generate myudfs.ucfirst(logid);
dump b;
In the above script myudfs is the package name and ucfirst is the class name. Run it in local mode:
pig -x local ucfirst.pig
Output
(m)
(s)
(r)
(r)
This is the way to write pig udf example in java

Pig use case-


The personal healthcare data of an individual is confidential and should not be exposed to others. This
information should be masked to maintain confidentiality but the healthcare data is so huge that identifying and
removing personal healthcare data is crucial. Apache pig can be used under such circumstances to de-identify
health information.

Data Integration Components of Hadoop Ecosystem- Sqoop and Flume


SQOOP
What is Sqoop?
The Sqoop component is used for importing data from external sources into related Hadoop components like HDFS, hbase or Hive. It can also be used for exporting data from Hadoop to other external structured data stores. Sqoop parallelizes data transfer, mitigates excessive loads, allows data imports, supports efficient data analysis and copies data quickly. In other words, sqoop is used to import and export huge amounts of data from an RDBMS or mainframe to HDFS and from HDFS to an RDBMS.
Examples of such RDBMSs are Mysql and oracle.
How to Enter Into mysql prompt?
Mysql -u root -p
How to grant all databases Permissions to single user in mysql?
Mysql> grant all privileges on databasename.* to '%'@'localhost';
How to grant all databases Permissions to all user in mysql?
Mysql> grant all privileges on databasename.* to @'localhost';
How to create a table In Mysql?
Mysql> create table emp(empid int, ename varchar(30), esal int);
Query OK, 0 rows affected (0.11 sec)
How to Insert the values Into the table?
Mysql> insert into emp values(111,'mahesh', 28000);
Query OK, 1 row affected (0.00 sec)
Mysql> insert into emp values(112,'neelesh', 30000);
Query OK, 1 row affected (0.00 sec)
Mysql> insert into emp values(113,'rupesh', 26000);
Query OK, 1 row affected (0.00 sec)
Mysql> insert into emp values(114,'vijay', 26000);
Query OK, 1 row affected (0.00 sec)
How to update the row in a table?
Mysql> update emp set esal= 28000 where empid = 114;
What is Sqoop scripts standard location?
/usr/bin/Sqoop

Give a sqoop command to import the columns employee_id, first_name, last_name from the mysql table
Employee
$ sqoop import --connect jdbc:mysql://host/dbname --table EMPLOYEES \
--columns employee_id,first_name,last_name
Give a sqoop command to run only 8 mapreduce tasks in parallel
$ sqoop import --connect jdbc:mysql://host/dbname --table table_name\ -m 8
Give a Sqoop command to import all the records from employee table divided into groups of records by the
values in the column department_id.
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \ --split-by dept_id
What does the following query do?
$ sqoop import --connect jdbc:mysql://db.foo.com/somedb --table sometable \
--where "id > 1000" --target-dir /incremental_dataset --append
It performs an incremental import of new data, after having already imported the first 1000 rows of the table.
What is the use of –append command in sqoop?
The append option is used to add new output records to an existing target directory; there is no need to overwrite or create a new directory, the new files are appended to the old directory.
Example for append command in sqoop?
sqoop import --connect jdbc:mysql://localhost/mahesh --table emp -m 1 --columns 'empid,ename' --target-dir /importdata --fields-terminated-by '\t' --append
O/P:-
root@ubuntu:/home/mahesh/sqoop-related# hadoop fs -ls /importdata
Found 5 items
-rw-r–r– 1 root supergroup 0 2013-11-07 19:05 /importdata/_SUCCESS
drwxr-xr-x – root supergroup 0 2013-11-07 19:04 /importdata/_logs
drwxr-xr-x – root supergroup 0 2013-11-07 19:11 /importdata/_logs-00000
-rw-r–r– 1 root supergroup 67 2013-11-07 19:04 /importdata/part-m-00000
-rw-r–r– 1 root supergroup 43 2013-11-07 19:11 /importdata/part-m-00001
root@ubuntu:/home/mahesh/sqoop-related# hadoop fs -cat /importdata/part-m-00001
111 mahesh
112 neelesh
113 rupesh
114 vijay
When the source data keeps getting updated frequently, what is the approach to keep it in sync with the data
in HDFS imported by sqoop? Or What is the process to perform an incremental data load in Sqoop?
Sqoop can have 2 approaches.
A − Use the --incremental parameter with the append option, where the value of a check column is examined and only rows whose value is greater than the last imported value are imported as new rows.
B − Use the --incremental parameter with the lastmodified option, where a date column in the source is checked for records which have been updated after the last import.
(Example commands for both options are sketched below.)
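The sketches below reuse the emp table from earlier; last_updated is a hypothetical timestamp column for option B:
sqoop import --connect jdbc:mysql://localhost/mahesh --table emp --target-dir /importdata \
--incremental append --check-column empid --last-value 114
sqoop import --connect jdbc:mysql://localhost/mahesh --table emp --target-dir /importdata \
--incremental lastmodified --check-column last_updated --last-value "2015-01-01 00:00:00" --merge-key empid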
Give a sqoop command to import data from all tables in the mysql DB DB1.
Sqoop import-all-tables --connect jdbc:mysql://host/DB1
What are the basic available commands in Sqoop?
Codegen: Generate code to interact with database records
Create-hive-table: Import a table definition into Hive
Eval: Evaluate a SQL statement and display the results
Export: Export an HDFS directory to a database table
Help
Import
Import-all-tables
List-databases
List-tables
Versions Display version information
What is the advantage with Eval command in Sqoop?
We can see the output directly in the terminal; there is no need to go and check the output on HDFS.
Eval Query Examples In sqoop?
sqoop eval --connect jdbc:mysql://localhost/mahesh --query "select * from emp";
| empid | ename | esal |
| 111 | mahesh | 28000 |
| 112 | neelesh | 30000 |
| 113 | rupesh | 26000 |
| 114 | vijay | 28000 |
sqoop eval --connect jdbc:mysql://localhost/mahesh --query "select * from emp limit 2";
| empid | ename | esal |
| 111 | mahesh | 28000 |
| 112 | neelesh | 30000 |
sqoop eval --connect jdbc:mysql://localhost/mahesh --query "select * from emp where empid = 111";
| empid | ename | esal |
| 111 | mahesh | 28000 |
Create Table In sqoop by using Eval ?
sqoop eval --connect jdbc:mysql://localhost/mahesh --query "create table evaltab(evalid int, evalname varchar(30), evalscope varchar(30))";
13/11/07 19:35:39 INFO manager.mysqlmanager: Preparing to use a mysql streaming resultset.
13/11/07 19:35:39 INFO tool.evalsqltool: 0 row(s) updated.
sqoop eval --connect jdbc:mysql://localhost/mahesh --query "select * from evaltab";
13/11/07 19:36:02 INFO manager.mysqlmanager: Preparing to use a mysql streaming resultset.
| evalid | evalname | evalscope |
Insert Values Into Eval Table ?
sqoop eval --connect jdbc:mysql://localhost/mahesh --query "insert into evaltab values(111,'aaa', 'app')";
13/11/07 19:37:37 INFO manager.mysqlmanager: Preparing to use a mysql streaming resultset.
13/11/07 19:37:37 INFO tool.evalsqltool: 1 row(s) updated.
sqoop eval --connect jdbc:mysql://localhost/mahesh --query "insert into evaltab values(112,'bbb', 'prgrm')";
13/11/07 19:37:51 INFO manager.mysqlmanager: Preparing to use a mysql streaming resultset.
13/11/07 19:37:52 INFO tool.evalsqltool: 1 row(s) updated.
sqoop eval --connect jdbc:mysql://localhost/mahesh --query "insert into evaltab values(113,'ccc', 'project')";
13/11/07 19:38:09 INFO manager.mysqlmanager: Preparing to use a mysql streaming resultset.
13/11/07 19:38:10 INFO tool.evalsqltool: 1 row(s) updated.
Root@ubuntu:/home/mahesh/sqoop-related# sqoop eval --connect jdbc:mysql://localhost/mahesh --query "select * from evaltab";
13/11/07 19:38:27 INFO manager.mysqlmanager: Preparing to use a mysql streaming resultset.
| evalid | evalname | evalscope |
| 111 | aaa | app |
| 112 | bbb | prgrm |
| 113 | ccc | project |
How to show tables in Database by using Eval?
sqoop eval --connect jdbc:mysql://localhost/mahesh --query "show tables";
13/11/07 19:40:22 INFO manager.mysqlmanager: Preparing to use a mysql streaming resultset.
| TABLE_NAME |
| emp |
| evaltab |
How to describe Eval Table In Query?
sqoop eval --connect jdbc:mysql://localhost/mahesh --query "desc evaltab";
13/11/07 19:40:39 INFO manager.mysqlmanager: Preparing to use a mysql streaming resultset.
| COLUMN_NAME | COLUMN_TYPE | IS_NULLABLE | COLUMN_KEY | COLUMN_DEFAULT | EXTRA |
| evalid | int(11) | YES | | (null) | |
| evalname | varchar(30) | YES | | (null) | |
| evalscope | varchar(30) | YES | | (null) | |
Give a command to call a stored procedure named proc1 which exports the data in the HDFS directory named Dir1 into the mysql db named DB1.
$ sqoop export --connect jdbc:mysql://host/DB1 --call proc1 \ --export-dir /Dir1
Export Command In Sqoop?
sqoop export --connect jdbc:mysql://localhost/mahesh -m 1 --table emp --export-dir /emptab/part-m-00000;
How will you update the rows that are already exported?
The parameter --update-key can be used to update existing rows. It takes a comma-separated list of columns which uniquely identify a row. All of these columns are used in the WHERE clause of the generated UPDATE query. All other table columns will be used in the SET part of the query. (An example follows.)
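The example below reuses the emp table from earlier, with empid assumed to be the unique key:
sqoop export --connect jdbc:mysql://localhost/mahesh --table emp \
--export-dir /emptab --update-key empid --update-mode allowinsert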
What is the difference between the parameters sqoop.export.records.per.statement and
sqoop.export.statements.per.transaction
The parameter sqoop.export.records.per.statement specifies the number of records that will be used in each insert
statement.
But the parameter sqoop.export.statements.per.transaction specifies how many insert statements can be processed in parallel during a transaction.
How can you sync a exported table with HDFS data in which some rows are deleted
Truncate the target table and load it again.
How can you export only a subset of columns to a relational table using sqoop?
By using the --columns parameter, in which we mention the required column names as a comma-separated list of values, as shown below.
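A minimal sketch, assuming the emp table and the empid and ename columns used elsewhere in this document:
sqoop export --connect jdbc:mysql://localhost/mahesh \
  --table emp \
  --columns "empid,ename" \
  --export-dir /emptab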
What are the command aliases in Sqoop?
sqoop-(toolname), for example:
sqoop-import, sqoop-export
How are large objects handled in Sqoop?
Sqoop provides the capability to store large sized data into a single field based on the type of data. Sqoop supports
the ability to store-
1) CLOBs – Character Large Objects
2) BLOBs – Binary Large Objects
Large objects in Sqoop are handled by importing them into a file referred to as a LobFile, i.e. a Large Object File. The LobFile can store records of huge size; each record in the LobFile is a large object.
Can free form SQL queries be used with Sqoop import command? If yes, then how can they be used?
Sqoop allows us to use free form SQL queries with the import command. The import command should be used with the -e or --query option to execute free form SQL queries. When using the -e or --query option with the import command, the --target-dir value must be specified.
How can you choose a name for the mapreduce job which is created on submitting a free-form query import?
By using the --mapreduce-job-name parameter. Below is an example of the command.
sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--query 'SELECT normcities.id, \
countries.country, \
normcities.city \
FROM normcities \
JOIN countries USING(country_id) \
WHERE $CONDITIONS' \
--split-by id \
--target-dir cities \
--mapreduce-job-name normcities
What do you mean by Free Form Import in Sqoop?
Sqoop can import data from a relational database using any SQL query, rather than only the table and column name parameters.
How can you force sqoop to execute a free form SQL query only once and import the rows serially?
By using the -m 1 clause in the import command, sqoop creates only one mapreduce task, which will import the rows sequentially.
Differentiate between Sqoop and distcp.
Distcp utility can be used to transfer data between clusters whereas Sqoop can be used to transfer data only between
Hadoop and RDBMS.
What are the limitations of importing RDBMS tables into Hcatalog directly?
There is an option to import RDBMS tables into Hcatalog directly by making use of the --hcatalog-database option with --hcatalog-table, but the limitation is that several arguments like --as-avrodatafile, --direct, --as-sequencefile, --target-dir and --export-dir are not supported.
How to check List of Databases in RDBMS by using Sqoop?
sqoop list-databases --connect jdbc:mysql://localhost;
information_schema
Gopal_Lab
newyeardb
RK
batch18
bhargav
chandu
kelly
How to check List of Tables in single database by using Sqoop?
sqoop list-tables --connect jdbc:mysql://localhost/mahesh;
O/P:- emp
Write a command to Import RDBMS data into HDFS?
sqoop import --connect jdbc:mysql://localhost/mahesh --table emp -m 1;
Where is the data created for the above command?
In HDFS; check with hadoop fs -ls /
How to read the RDBMS table data in HDFS?
hadoop fs -cat /emp/part-m-00000
111, mahesh,28000
112,neelesh,30000
113,rupesh,26000
114,vijay,28000
What is the Default delimiter between RDBMS table columns?
Comma (,)
How to set target directory and delimiter to Sqoop?
sqoop import --connect jdbc:mysql://localhost/mahesh --table emp -m 1 --target-dir /importdata --fields-terminated-by '\t';
O/P:-
hadoop fs -cat /importdata/part-m-00000
111 mahesh 28000
112 neelesh 30000
113 rupesh 26000
114 vijay 28000
What Indicates -m 1 in above sqoop commands ?
-m 1 indicates output file divided into only 1 file, suppose we write -m 2 that means the output devided into 2 parts
of files like part-r-00000 and part-r-00001
How to select only specific columns in a table using Sqoop?
By using the --columns option:
sqoop import --connect jdbc:mysql://localhost/mahesh --table emp -m 1 --columns 'empid,ename' --target-dir /importdata --fields-terminated-by '\t'
How to write queries on a RDBMS table ?
sqoop import --connect jdbc:mysql://localhost/mahesh --table emp -m 1 --columns 'empid,ename' --target-dir /importdata --fields-terminated-by '\t' --where 'esal>26000' --append;
O/P:-
hadoop fs -cat /importdata/part-m-00002
111 mahesh
112 neelesh
114 vijay
Import Command In Sqoop?
sqoop import --connect jdbc:mysql://localhost/mahesh --table emp -m 1;
Job command In Sqoop?
sqoop job --create deptdata -- import --connect jdbc:mysql://localhost/mahesh --table dept -m 1 --target-dir /jobimport --append;
Explain about some important Sqoop commands other than import and export.
Create Job (--create)
Here we are creating a job with the name myjob, which can import the table data from an RDBMS table to HDFS. The following command is used to create a job that imports data from the employee table in the db database into HDFS.
$ sqoop job --create myjob \
--import \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee --m 1
Verify Job (--list)
‘--list’ argument is used to verify the saved jobs. The following command is used to verify the list of saved Sqoop
jobs.
$ sqoop job --list
Inspect Job (--show)
‘--show’ argument is used to inspect or verify particular jobs and their details. The following command and sample
output is used to verify a job called myjob.
$ sqoop job --show myjob
Execute Job (--exec)
‘--exec’ option is used to execute a saved job. The following command is used to execute a saved job called myjob.
$ sqoop job --exec myjob
How Sqoop can be used in a Java program?
The Sqoop jar should be included in the classpath of the Java code. After this, the method Sqoop.runTool() must be invoked. The necessary parameters should be passed to Sqoop programmatically, just as on the command line.
How can you check all the tables present in a single database using Sqoop?
The command to check the list of all tables present in a single database using Sqoop is as follows-
sqoop list-tables --connect jdbc:mysql://localhost/user;
What is the role of JDBC driver in a Sqoop set up?
To connect to different relational databases sqoop needs a connector. Almost every DB vendor makes this connector available as a JDBC driver which is specific to that DB. So Sqoop needs the JDBC driver of each database it needs to interact with.
Is JDBC driver enough to connect sqoop to the databases?
No. Sqoop needs both JDBC and connector to connect to a database.
When to use --target-dir and when to use --warehouse-dir while importing data?
To specify a particular directory in HDFS use --target-dir, but to specify the parent directory of all the sqoop jobs use --warehouse-dir. In this case, under the parent directory sqoop will create a directory with the same name as the table.
How can you import only a subset of rows from a table?
By using the WHERE clause in the sqoop import statement we can import only a subset of rows.
How can we import a subset of rows from a table without using the where clause?
We can run a filtering query on the database and save the result to a temporary table in database. Then use the sqoop
import command without using the --where clause
What is the advantage of using --password-file rather than -P option while preventing the display of password
in the sqoop import statement?
The --password-file option can be used inside a sqoop script while the -P option reads from standard input,
preventing automation.
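A minimal sketch, assuming the password is stored in a hypothetical HDFS file /user/root/.mysql.password:
sqoop import --connect jdbc:mysql://localhost/mahesh \
  --username root \
  --password-file /user/root/.mysql.password \
  --table emp -m 1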
What is the default extension of the files produced from a sqoop import using the --compress parameter?
.gz
What is the significance of using the --compression-codec parameter?
To get the output file of a sqoop import in formats other than .gz, such as .bz2, we use the --compression-codec parameter.
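A hedged sketch requesting bzip2 output with the standard Hadoop codec class; the target directory /emp_bz2 is an assumption:
sqoop import --connect jdbc:mysql://localhost/mahesh \
  --table emp -m 1 \
  --compress \
  --compression-codec org.apache.hadoop.io.compress.BZip2Codec \
  --target-dir /emp_bz2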
What is a disadvantage of using --direct parameter for faster data load by sqoop?
The native utilities used by databases to support faster load do not work for binary data formats like sequencefile
How can you control the number of mappers used by the sqoop command?
The parameter --num-mappers is used to control the number of mappers executed by a sqoop command. We should start with a small number of map tasks and then gradually scale up, as choosing a high number of mappers initially may slow down the performance on the database side.
How can you avoid importing tables one-by-one when importing a large number of tables from a database?
Using the command:
sqoop import-all-tables --connect --username --password --exclude-tables table1,table2 ..
This will import all the tables except the ones mentioned in the exclude-tables clause.
What is the usefulness of the options file in sqoop?
The options file is used in sqoop to specify the command line values in a file and use it in the sqoop commands.
For example, the --connect parameter's value and the --username value can be stored in a file and used again and again with different sqoop commands.
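A minimal sketch, assuming a hypothetical options file named import.txt.
Contents of import.txt (one option or value per line):
import
--connect
jdbc:mysql://localhost/mahesh
--username
root
Then run:
sqoop --options-file import.txt --table emp -m 1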
Is it possible to add a parameter while running a saved job?
Yes, we can add an argument to a saved job at runtime by using the --exec option
sqoop job --exec jobname -- --newparameter
How do you fetch data which is the result of join between two tables?
By using the --query parameter in place of --table parameter we can specify a sql query. The result of the query will
be imported
How can we slice the data to be imported to multiple parallel tasks?
Using the --split-by parameter we specify the column name based on which sqoop will divide the data to be
imported into multiple chunks to be run in parallel.
Before starting the data transfer using a mapreduce job, sqoop takes a long time to retrieve the minimum and maximum values of the column mentioned in the --split-by parameter. How can we make it efficient?
We can use the --boundary-query parameter, in which we specify the min and max values for the column based on which the split into multiple mapreduce tasks can happen. This makes it faster, as the query inside the --boundary-query parameter is executed first and the job is ready with the information on how many mapreduce tasks to create before executing the main query.
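A minimal sketch, assuming the emp table with an integer empid column; the boundary query and target directory are illustrative assumptions:
sqoop import --connect jdbc:mysql://localhost/mahesh \
  --table emp \
  --split-by empid \
  --boundary-query "SELECT MIN(empid), MAX(empid) FROM emp" \
  --target-dir /emp_split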
How will you implement all-or-nothing load using sqoop?
Using the --staging-table option, we first load the data into a staging table and then load it into the final target table only if the staging load is successful.
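A minimal sketch, assuming a hypothetical staging table emp_stage created with the same schema as emp:
sqoop export --connect jdbc:mysql://localhost/mahesh \
  --table emp \
  --staging-table emp_stage \
  --clear-staging-table \
  --export-dir /emptab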
How do you clear the data in a staging table before loading it by Sqoop?
By specifying the --clear-staging-table option, we can clear the staging table before it is loaded. This can be done again and again till we get proper data in staging.
How can we load to a column in a relational table which is not null but the incoming value from HDFS has a
null value?
By using the --input-null-string parameter, we can specify a default value that will allow the row to be inserted into the target table.
Sqoop imported a table successfully to hbase but it is found that the number of rows is fewer than expected.
What can be the cause?
Some of the imported records might have null values in all the columns. As Hbase does not allow all null values in a
row, those rows get dropped.
Give a sqoop command to show all the databases in a mysql server.
$ sqoop list-databases --connect jdbc:mysql://database.example.com/
In a sqoop import command you have mentioned to run 8 parallel Mapreduce task but sqoop runs only 4.
What can be the reason?
The Mapreduce cluster is configured to run 4 parallel tasks. So the sqoop command must have a number of parallel tasks less than or equal to that of the mapreduce cluster.
What is the importance of --split-by clause in running parallel import tasks in sqoop?
The --split-by clause mentions the column name based on whose value the data will be divided into groups of records. These groups of records will be read in parallel by the mapreduce tasks.
What does this sqoop command achieve?
$ sqoop import --connect <connect-str> --table foo --target-dir /dest
It imports the contents of the table foo into the HDFS directory /dest (rather than into the default warehouse directory).
What happens when a table is imported into a HDFS directory which already exists using the –append
parameter?
Using the --append argument, Sqoop will import data to a temporary directory and then rename the files into the
normal target directory in a manner that does not conflict with existing filenames in that directory.
How can you control the mapping between SQL data types and Java types?
By using the --map-column-java property we can configure the mapping between SQL types and Java types.
Below is an example:
$ sqoop import ... --map-column-java id=String,value=Integer
How to import only the updated rows from a table into HDFS using sqoop, assuming the source has last update timestamp details for each row?
By using the lastmodified incremental mode. Rows where the check column holds a timestamp more recent than the timestamp specified with --last-value are imported.
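A minimal sketch, assuming a hypothetical last_updated timestamp column on the emp table and an illustrative target directory:
sqoop import --connect jdbc:mysql://localhost/mahesh \
  --table emp \
  --incremental lastmodified \
  --check-column last_updated \
  --last-value "2013-11-07 00:00:00" \
  --target-dir /emp_incr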
What are the two file formats supported by sqoop for import?
Delimited text and Sequence Files.
What is a sqoop metastore?
It is a tool using which Sqoop hosts a shared metadata repository. Multiple users and/or remote users can define and
execute saved jobs (created with sqoop job) defined in this metastore.
Clients must be configured to connect to the metastore in sqoop-site.xml or with the --meta-connect argument.
What is the purpose of sqoop-merge?
The merge tool combines two datasets where entries in one dataset should overwrite entries of an older dataset
preserving only the newest version of the records between both the data sets.
Give the sqoop command to see the content of the job named myjob?
sqoop job --show myjob
Which database the sqoop metastore runs on?
Running sqoop-metastore launches a shared HSQLDB database instance on the current machine.
Where can the metastore database be hosted?
The metastore database can be hosted anywhere within or outside of the Hadoop cluster.
Sqoop Use Case-
Online marketer Coupons.com uses the Sqoop component of the Hadoop ecosystem to enable transmission of data between Hadoop and the IBM Netezza data warehouse, and pipes the results back into Hadoop using Sqoop.
FLUME:
What problem does Apache Flume solve?
Scenario:
• There are several services, running on different servers, that produce a large number of logs. These logs need to be accumulated, stored and analyzed together.
• Hadoop has emerged as a cost effective and scalable framework for storage and analysis of big data.
Problem:
• How can these logs be collected, aggregated and stored in a place where Hadoop can process them?
• There is a requirement for a reliable, scalable, extensible and manageable solution.
What is Apache Flume?
Apache Flume is a distributed, reliable and available system for efficiently collecting, aggregating and moving
large amounts of log data from many different sources to a centralized data store.
The use of Apache Flume is not only restricted to log data aggregation. Since data sources are customizable, Flume
can be used to transport massive quantities of event data including but not limited to network traffic data, social-
media-generated data, email messages and pretty much any data source possible.
Flume features:
• Ensures guaranteed data delivery
• Gathers high volume data streams in real time
• Streams data from multiple sources into Hadoop for analysis
• Scales horizontally
How is Flume-NG different from Flume 0.9?
Flume 0.9:
Centralized configuration of the agents handled by Zookeeper.
Input data and writing data are handled by same thread.
Flume 1.X (Flume-NG):
No centralized configuration. Instead a simple on-disk configuration file is used.
Different threads called runners handle input data and writing data.
What is the problem with HDFS and streaming data (like logs)?
• In a regular filesystem, when you open a file and write data, it exists on disk even before it is closed.
• In HDFS, the file exists only as a directory entry of zero length until it is closed. This implies that if data is written to a file for an extended period without closing it, you may be left with an empty file if there is a network disconnect with the client.
• It is not a good approach to close the files frequently and create smaller files, as this leads to poor efficiency in HDFS.
What are core components of Flume?
Flume architecture:
Event- The single log entry or unit of data that is transported.
Client- The component that transmits event to the source that operates with the agent.
Flume Agent:
• An agent is a daemon (a physical Java virtual machine) running Flume.
• It receives and stores the data until it is written to the next destination.
• Flume source, channel and sink run in an agent.
Source:
• A source receives data from some application that is producing data.
• A source writes events to one or more channels.
• Sources either poll for data or wait for data to be delivered to them.
For example: log4j, Avro, syslog, etc.
Sink:
• A sink removes the events from the agent and delivers them to the destination.
• The destination could be a different agent, or HDFS, HBase, Solr, etc.
For example: Console, HDFS, HBase, etc.
Channel:
• A channel holds events passing from a source to a sink.
• A source ingests events into the channel while a sink removes them.
• A sink gets events from one channel only.
For example: Memory, File, JDBC, etc.
Explain a common use case for Flume?
Common Use case: Receiving web logs from several sources into HDFS.
Web server logs → Apache Flume → HDFS (Storage) → Pig/Hive (ETL) → HBase (Database) → Reporting (BI Tools)
• Logs are generated by several log servers and saved on local hard disks; they need to be pushed into HDFS using the Flume framework.
• Flume agents, which run on the log servers, collect the logs and push them into HDFS.
• Data analytics tools like Pig or Hive then process this data.
• The analysed data is stored in structured format in HBase or another database.
• Business intelligence tools then generate reports on this data.
What are Flume events?
Flume events:
• The basic payload of data transported by Flume (typically a single log entry)
• It has zero or more headers and a body
Event headers are key-value pairs that are used to make routing decisions or carry other structured information such as:
• Timestamp of the event
• Hostname of the server where the event originated
Event Body
The event body is an array of bytes that contains the actual payload.
Can we change the body of the flume event?
Yes, editing Flume Event using interceptors can change its body.
What are interceptors?
Interceptor
An interceptor is a point in your data flow where you can inspect and alter flume events. After the source creates an event, there can be zero or more interceptors tied together before it is delivered to the sink.
What are channel selectors?
Channel selectors:
Channel selectors are used to handle multiple channels. A channel selector is responsible for how an event moves from a source to one or more channels.
Types of channel selectors are:
• Replicating Channel Selector: the default channel selector, which puts a copy of the event into each channel
• Multiplexing Channel Selector: routes different events to different channels depending on header information and/or interceptor logic (see the sample configuration after this list)
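A hedged configuration sketch for a multiplexing selector, assuming a hypothetical agent a1 with a source s1, channels c1 and c2, and events carrying a header named type:
a1.sources.s1.selector.type = multiplexing
a1.sources.s1.selector.header = type
a1.sources.s1.selector.mapping.error = c1
a1.sources.s1.selector.mapping.info = c2
a1.sources.s1.selector.default = c2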
What are sink processors?
A sink processor is a mechanism for failover and load balancing of events across multiple sinks from a channel.
How to Configure an Agent?
• An agent is configured using a simple Java property file of key/value pairs.
• This configuration file is passed as an argument to the agent upon startup.
• You can configure multiple agents in a single configuration file; it is required to pass an agent identifier (called a name).
• Each agent configuration starts with:
agent.sources = <list of sources>
agent.channels = <list of channels>
agent.sinks = <list of sinks>
• Each source, channel and sink also has a distinct name within the context of that agent.
Explain the Hello world example in flume.
In the following example, the source listens on a socket for network clients to connect and sends event data. Those
events were written to an in-memory channel and then fed to a log4j sink to become output.
Configuration file for one agent (called a1) that has a source named s1, a channel named c1 and a sink named k1.
# Name the components on this agent
a1.sources = s1
a1.channels = c1
a1.sinks = k1
# Configure the source
a1.sources.s1.type = netcat
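The original example stops at the source type; below is a hedged sketch of the remaining configuration, assuming the standard netcat source, memory channel and logger sink types (the bind address and port values are illustrative only):
a1.sources.s1.bind = localhost
a1.sources.s1.port = 12345
# Configure the channel
a1.channels.c1.type = memory
# Configure the sink
a1.sinks.k1.type = logger
# Bind the source and sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
The agent could then be started with something like: flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console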
Does Flume provide 100% reliability to the data flow?
Yes, Apache Flume provides end to end reliability because of its transactional approach in data flow.
Explain about the different channel types in Flume. Which channel type is faster?
The 3-different built in channel types available in Flume are-
MEMORY Channel – Events are read from the source into memory and passed to the sink.
JDBC Channel – JDBC Channel stores the events in an embedded Derby database.
FILE Channel –File Channel writes the contents to a file on the file system after reading the event from a source.
The file is deleted only after the contents are successfully delivered to the sink.
MEMORY Channel is the fastest channel among the three however has the risk of data loss. The channel that you
choose completely depends on the nature of the big data application and the value of each event.
Which is the reliable channel in Flume to ensure that there is no data loss?
FILE Channel is the most reliable channel among the 3 channels JDBC, FILE and MEMORY.
How multi-hop agent can be setup in Flume?
Avro RPC Bridge mechanism is used to setup Multi-hop agent in Apache Flume.
Does Apache Flume provide support for third party plug-ins?
Yes. Apache Flume has a plug-in based architecture, so it can load data from external sources and transfer it to external destinations.
Is it possible to leverage real time analysis on the big data collected by Flume directly? If yes, then explain.
Data from Flume can be extracted, transformed and loaded in real time to Apache Solr servers using MorphlineSolrSink.
How can Flume be used with hbase?
Apache Flume can be used with HBase using one of the two HBase sinks:
• HBaseSink (org.apache.flume.sink.hbase.HBaseSink) supports secure HBase clusters and also the novel HBase IPC that was introduced in HBase 0.96.
• AsyncHBaseSink (org.apache.flume.sink.hbase.AsyncHBaseSink) has better performance than the HBase sink, as it can easily make non-blocking calls to HBase.
Working of the HBaseSink:
In HBaseSink, a Flume event is converted into HBase increments or puts. The serializer implements the HbaseEventSerializer, which is instantiated when the sink starts. For every event, the sink calls the initialize method in the serializer, which then translates the Flume event into HBase increments and puts to be sent to the HBase cluster.
Working of the AsyncHBaseSink:
AsyncHBaseSink implements the AsyncHbaseEventSerializer. The initialize method is called only once by the sink when it starts. The sink invokes the setEvent method and then makes calls to the getIncrements and getActions methods, similar to the HBase sink. When the sink stops, the cleanup method is called by the serializer.
Differentiate between filesink and filerollsink
The major difference is that the HDFS File Sink writes the events into the Hadoop Distributed File System (HDFS), whereas the File Roll Sink stores the events on the local file system.
Flume Use Case –
Twitter source connects through the streaming API and continuously downloads the tweets (called as events). These
tweets are converted into JSON format and sent to the downstream Flume sinks for further analysis of tweets and
retweets to engage users on Twitter.
Data Storage Components of the Hadoop Ecosystem – HBase, Cassandra
NoSQL Databases
A NoSQL database is a widely adopted technology due to its schema-less design and its ability to scale both vertically and horizontally with relatively little effort. As data volumes grow, RDBMS performance degrades, cost increases and manageability decreases, so NoSQL provides an edge over RDBMS in these scenarios.
NoSQL usually provides either consistency or availability (availability of nodes for processing), depending upon the architecture and design.
Types of NoSQL Databases
There are four general types of NoSQL databases, each with their own specific attributes:
• Key-value store: These databases are designed for storing data in a key-value store, i.e. in a schema-less way. In a key-value store, all of the data consists of an indexed key and a value, hence the name. The key can be custom, synthetic or auto-generated, and the value can be a complex object such as XML, JSON or a BLOB. The key is indexed for faster access to the data, improving the retrieval of the value. Some popular key-value databases are DynamoDB, Azure Table Storage (ATS), Riak and BerkeleyDB.
• Column store: These databases are designed for storing data as a group of column families. Read/write operations are done using columns rather than rows. While this simple description sounds like the inverse of a standard database, wide-column stores offer very high performance and a highly scalable architecture. One of the advantages is the scope for compression, which can efficiently save space and avoid memory scans of the column. Some popular column store databases are HBase, Bigtable, Cassandra, Vertica and Hypertable.
• Document database: These databases are designed for storing, retrieving and managing document-oriented information. A document database expands on the idea of key-value stores: values or documents are stored using some structure and are encoded in formats such as XML, YAML or JSON, or in binary forms such as BSON, PDF or Microsoft Office documents (MS Word, Excel). The advantage of storing in an encoded format like XML or JSON is that we can search with a key within the document, which is quite useful for ad hoc querying and semi-structured data. Some popular document databases are MongoDB and CouchDB.
• Graph database: These databases are designed for data whose relations are well represented as a tree or a graph, with elements, usually nodes and edges, which are interconnected. Graph-theoretic algorithms are useful for prediction, user tracking, clickstream analysis, calculating the shortest path, and so on; these are processed much more efficiently by graph databases, as the algorithms themselves are complex. Some popular graph databases are Neo4J and Polyglot.
The following lays out some of the key attributes that should be considered when evaluating NoSQL databases.
Key-value store: Performance High, Scalability High, Flexibility High, Complexity None, Functionality Variable (None). Examples: DynamoDB, Azure Table Storage (ATS).
Column store: Performance High, Scalability High, Flexibility Moderate, Complexity Low, Functionality Minimal. Examples: HBase, Bigtable, Cassandra, Vertica, Hypertable.
Document store: Performance High, Scalability Variable (High), Flexibility High, Complexity Low, Functionality Variable (Low). Examples: MongoDB, CouchDB.
Graph database: Performance Variable, Scalability Variable, Flexibility High, Complexity High, Functionality Graph theory. Examples: Neo4J, Polyglot.
SQL vs NoSQL: High-Level Differences
• SQL databases are primarily called Relational Databases (RDBMS), whereas NoSQL databases are primarily called non-relational or distributed databases.
• SQL databases are table-based databases, whereas NoSQL databases are document-based, key-value pairs, graph databases or wide-column stores. This means that SQL databases represent data in the form of tables consisting of n rows of data, whereas NoSQL databases are collections of key-value pairs, documents, graph databases or wide-column stores which do not have standard schema definitions that they need to adhere to.
• SQL databases have a predefined schema, whereas NoSQL databases have a dynamic schema for unstructured data.
• SQL databases are vertically scalable, whereas NoSQL databases are horizontally scalable. SQL databases are scaled by increasing the horse-power of the hardware; NoSQL databases are scaled by increasing the number of database servers in the pool of resources to reduce the load.
• SQL databases use SQL (Structured Query Language) for defining and manipulating data, which is very powerful. In NoSQL databases, queries are focused on collections of documents; this is sometimes called UnQL (Unstructured Query Language). The syntax of UnQL varies from database to database.
• For complex queries: SQL databases are a good fit for complex query-intensive environments, whereas NoSQL databases are not a good fit for complex queries. At a high level, NoSQL databases don't have standard interfaces to perform complex queries, and the queries themselves in NoSQL are not as powerful as the SQL query language.
• For the type of data to be stored: SQL databases are not the best fit for hierarchical data storage. NoSQL databases fit better for hierarchical data storage, as they follow the key-value pair way of storing data, similar to JSON data.
• For scalability: In most typical situations, SQL databases are vertically scalable. You can manage increasing load by increasing the CPU, RAM, SSD, etc., on a single server. On the other hand, NoSQL databases are horizontally scalable. You can just add a few more servers to your NoSQL database infrastructure to handle the large traffic.
• For high transactional applications: SQL databases are the best fit for heavy-duty transactional applications, as they are more stable and promise atomicity as well as integrity of the data. While you can use NoSQL for transactional purposes, it is still not comparable and not stable enough for high load and for complex transactional applications.
• For properties: SQL databases emphasize ACID properties (Atomicity, Consistency, Isolation and Durability), whereas NoSQL databases follow Brewer's CAP theorem (Consistency, Availability and Partition tolerance).
• For DB types: At a high level, we can classify SQL databases as either open-source or closed-source from commercial vendors. NoSQL databases can be classified, on the basis of the way of storing data, as graph databases, key-value store databases, document store databases, column store databases and XML databases.
HBASE
What is the difference between HBase and Hive?
HBase:
• Does not allow execution of SQL queries.
• Is schema-less.
• Runs on top of HDFS.
• Is a NoSQL column-oriented database.
• Is ideal for real-time querying of big data.
• Supports 4 primary operations: put, get, scan and delete.
Hive:
• Allows execution of most SQL queries.
• Has a fixed schema.
• Runs on top of Hadoop MapReduce.
• Is a data warehouse framework.
• Is an ideal choice for analytical querying of data collected over a period of time.
• Helps SQL-savvy people run MapReduce jobs.
Compare HDFS and HBase
• Data write process: HDFS uses the append method; HBase supports bulk incremental and random writes.
• Data read process: HDFS uses table scans; HBase supports table scans, random reads and small range scans.
• Hive SQL querying: excellent on HDFS; average on HBase.
• Looking up records in a file: HDFS does not provide lookups; HBase provides them, even for large tables.
• Latency: HDFS is for high latency operations; HBase is for low latency operations.
• HDFS is only a storage area; HBase can perform both storage and processing.
• HDFS is write once, read many times; HBase supports random reads and writes.
• HDFS is primarily accessed through MapReduce (MR) jobs; HBase is accessed through shell commands, a client API in Java, REST, Avro or Thrift.
Mention the differences between HBase and relational databases (RDBMS)?
HBase:
• Is schema-less.
• Is a column-oriented data store.
• Is used to store de-normalized data.
• Contains sparsely populated tables.
• Automated partitioning is done in HBase.
• Well suited for OLAP systems.
• Reads only the relevant data from the database.
• Structured and semi-structured data can be stored and processed using HBase.
• Enables aggregation over many rows and columns.
RDBMS:
• Is a schema-based database.
• Is a row-oriented data store.
• Is used to store normalized data.
• Contains thin tables.
• Has no such provision or built-in support for partitioning.
• Well suited for OLTP systems.
• Retrieves one row at a time and hence could read unnecessary data if only some of the data in a row is required.
• Structured data can be stored and processed using RDBMS.
• Aggregation is an expensive operation.
Explain what is Hbase?
Hbase is a column-oriented database management system which runs on top of HDFS. Hbase is not a relational data
store, and it does not support structured query language like SQL. In Hbase, a master node regulates the cluster and
region servers to store portions of the tables and operates the work on the data.
Explain why to use Hbase?
High capacity storage system
Distributed design to cater large tables
Column-Oriented Stores
Horizontally Scalable
High performance & Availability
Base goal of Hbase is millions of columns, thousands of versions and billions of rows
Unlike HDFS (Hadoop Distribute File System), it supports random real time CRUD operations
What key terms are used when designing the HBase data model?
Table: An HBase table consists of rows.
Row: A row in HBase contains a row key and one or more columns (column families with values associated).
Column family: A set of columns and their values; the column families should be considered carefully during schema design.
Column: A column in HBase consists of a column family and a column qualifier, which are delimited by a : (colon) character.
Column qualifier: A column qualifier is added to a column family to provide the index for a given piece of data.
Cell: A cell is a combination of row, column family and column qualifier, and contains a value and a timestamp, which represents the value's version.
Namespace: A logical grouping of tables.
Timestamp: Represents the time on the region server when the data was written, but you can specify a different timestamp value when you put data into the cell.
Mention what are the key components of Hbase?
Zookeeper: It does the co-ordination work between client and Hbase Master
HMaster: Hbase Master monitors the Region Server
HRegionserver: regionserver monitors the Region
HRegion: It contains in memory data store(memstore) and Hfile.
Catalog Tables: Catalog tables consist of ROOT and META. ROOT table tracks where the META table is and
META table stores all the regions in the system.
HMaster:
HMaster is the implementation of Master server in HBase architecture. It acts like monitoring agent to monitor all
Region Server instances present in the cluster and acts as an interface for all the metadata changes. In a distributed
cluster environment, Master runs on NameNode. Master runs several background threads.
The following are important roles performed by HMaster in HBase.
• Plays a vital role in terms of performance and maintaining nodes in the cluster.
• HMaster provides admin functions and distributes services to different region servers.
• HMaster assigns regions to region servers.
• HMaster has features like controlling load balancing and failover to handle the load over nodes present in the cluster.
• When a client wants to change any schema or perform any metadata operation, HMaster takes responsibility for these operations.
Some of the methods exposed by the HMaster interface are primarily metadata-oriented methods:
• Table (createTable, removeTable, enable, disable)
• ColumnFamily (add Column, modify Column)
• Region (move, assign)
The client communicates in a bi-directional way with both HMaster and ZooKeeper. For read and write operations,
it directly contacts with HRegion servers. HMaster assigns regions to region servers and in turn check the health
status of region servers.
In entire architecture, we have multiple region servers. Hlog present in region servers which are going to store all the
log files.
HRegions Servers:
When a Region Server receives write and read requests from the client, it assigns the request to a specific region, where the actual column family resides. The client can directly contact HRegion servers; there is no need for mandatory HMaster permission for the client to communicate with HRegion servers. The client requires HMaster's help only when operations related to metadata and schema changes are required.
HRegionServer is the Region Server implementation. It is responsible for serving and managing regions, i.e. the data that is present in the distributed cluster. The region servers run on the Data Nodes present in the Hadoop cluster.
HMaster can get into contact with multiple HRegion servers, which perform the following functions:
• Hosting and managing regions
• Splitting regions automatically
• Handling read and write requests
• Communicating with the client directly
HRegions:
HRegions are the basic building elements of HBase cluster that consists of the distribution of tables and are
comprised of Column families. It contains multiple stores, one for each column family. It consists of mainly two
components, which are Memstore and Hfile.
Data flow in Hbase
Write and Read operations
The Read and Write operations from Client into Hfile can be shown in below diagram.
Step 1) The client wants to write data, and in turn first communicates with the Region server and then the region.
Step 2) The region contacts the MemStore for storing the data associated with the column family.
Step 3) The data is first stored in the MemStore, where it is sorted, and after that it is flushed into an HFile. The main reason for using the MemStore is to store data in the distributed file system sorted by row key. The MemStore is placed in the Region server's main memory, while HFiles are written into HDFS.
Step 4) The client wants to read data from the regions.
Step 5) In turn the client can have direct access to the MemStore, and it can request data from it.
Step 6) The client approaches the HFiles to get the data. The data is fetched and retrieved by the client.
Memstore holds in-memory modifications to the store. The hierarchy of objects in HBase Regions is as shown from
top to bottom in below table.
Table: The HBase table present in the HBase cluster
Region: The HRegions for the presented tables
Store: There is one store per ColumnFamily for each region of the table
Memstore: There is one MemStore for each store for each region of the table; it sorts data before flushing into HFiles, and write and read performance increase because of the sorting
StoreFile: The StoreFiles for each store for each region of the table
Block: The blocks present inside StoreFiles
ZooKeeper:
In Hbase, Zookeeper is a centralized monitoring server which maintains configuration information and provides
distributed synchronization. Distributed synchronization is to access the distributed applications running across the
cluster with the responsibility of providing coordination services between nodes. If the client wants to communicate
with regions, the servers client has to approach ZooKeeper first.
It is an open source project, and it provides so many important services.
Services provided by ZooKeeper
• Maintains configuration information
• Provides distributed synchronization
• Establishes client communication with region servers
• Provides ephemeral nodes, which represent different region servers
• The master server uses these ephemeral nodes to discover available servers in the cluster
• Tracks server failures and network partitions
Master and HBase slave nodes ( region servers) registered themselves with ZooKeeper. The client needs access to
ZK(zookeeper) quorum configuration to connect with master and region servers.
During a failure of nodes that present in HBase cluster, ZKquoram will trigger error messages, and it starts to repair
the failed nodes.
When you should use Hbase?
Data size is huge: when you have millions of records to operate on.
Complete redesign: when you are moving from an RDBMS to HBase, consider it a complete re-design rather than merely changing the ports. An RDBMS typically runs on a single database server, while HBase is distributed, scalable and also runs on commodity hardware.
SQL-less commands: when you can live without RDBMS features such as transactions, inner joins, typed columns, etc.
Infrastructure investment: you need to have a large enough cluster for HBase to be really useful.
What are the different operational commands in hbase at record level and table level?
Record Level Operational Commands in hbase are –put, get, increment, scan and delete.
Table Level Operational Commands in hbase are-describe, list, drop, disable and scan.
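A hedged hbase shell sketch of a few of these operations, using a hypothetical table t1 with a column family cf1:
create 't1', 'cf1'
put 't1', 'row1', 'cf1:name', 'mahesh'
get 't1', 'row1'
scan 't1'
delete 't1', 'row1', 'cf1:name'
disable 't1'
drop 't1'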
What is column families? Explain what happens if you alter the block size of a column family on an already
occupied database?
The logical division of data is represented through a key known as the column family. Column families consist of the
basic unit of physical storage on which compression features can be applied. When you alter the block size of the
column family, the new data occupies the new block size while the old data remains within the old block size.
During data compaction, old data will take the new block size. New files as they are flushed, have a new block size
whereas existing data will continue to be read correctly. All data should be transformed to the new block size, after
the next major compaction.
Explain how does Hbase actually delete a row?
In HBase, whatever you write is stored from RAM to disk, and these disk writes are immutable barring compaction. During the deletion process in HBase, the major compaction process removes the delete marker while minor compactions don't. A normal delete results in a delete tombstone marker; the deleted data it represents is removed during compaction.
Also, if you delete data and add more data, but with an earlier timestamp than the tombstone timestamp, further Gets may be masked by the delete/tombstone marker and hence you will not receive the inserted value until after the major compaction.
Explain what is the row key?
Every row in an HBase table has a unique identifier known as the row key (its primary key). The row key is defined by the application. As the combined key is prefixed by the row key, it enables the application to define the desired sort order. It also allows logical grouping of cells and makes sure that all cells with the same row key are co-located on the same server. The row key is internally regarded as a byte array.
Explain row deletion in Hbase? Mention what are the three types of tombstone markers in Hbase for
deletion?
When you delete the cell in Hbase, the data is not actually deleted but a tombstone marker is set, making the deleted
cells invisible. Hbase deleted are actually removed during compactions.
Three types of tombstone markers are there:
Version delete marker: For deletion, it marks a single version of a column
Column delete marker: For deletion, it marks all the versions of a column
Family delete marker: For deletion, it marks of all column for a column family
Explain about hlog and WAL in hbase.
All edits in the HStore are stored in the HLog. Every region server has one HLog. The HLog contains entries for edits of all regions performed by a particular region server. WAL abbreviates to Write Ahead Log, in which all the HLog edits are written immediately. WAL edits remain in memory till the flush period in the case of deferred log flush.
What are the main features of Apache hbase?
Apache HBase has many features. It supports both linear and modular scaling; HBase tables are distributed on the cluster via regions, and regions are automatically split and re-distributed as your data grows (automatic sharding). HBase also supports a Block Cache and Bloom Filters for high volume query optimization.
What are datamodel operations in hbase or Mention how many operational commands in Hbase?
Get: returns attributes for a specified row; Gets are executed via HTable.get.
Put: either adds new rows to a table (if the key is new) or updates existing rows (if the key already exists); Puts are executed via HTable.put (writeBuffer) or HTable.batch (non-writeBuffer).
Scan: allows iteration over multiple rows for specified attributes.
Delete: removes a row from a table; Deletes are executed via HTable.delete.
Increment
hbase does not modify data in place, and so deletes are handled by creating new markers called tombstones. These
tombstones, along with the dead values, are cleaned up on major compaction.
How are filters useful in Apache HBase?
Filters in the HBase shell: the Filter Language was introduced in Apache HBase 0.92. It allows you to perform server-side filtering when accessing HBase over Thrift or in the HBase shell.
How many filters are available in Apache hbase?
In total, HBase supports 18 filters:
ColumnPrefixFilter, TimestampsFilter, PageFilter, MultipleColumnPrefixFilter, FamilyFilter, ColumnPaginationFilter, SingleColumnValueFilter, RowFilter, QualifierFilter, ColumnRangeFilter, ValueFilter, PrefixFilter, ColumnCountGetFilter, SingleColumnValueExcludeFilter, InclusiveStopFilter, DependentColumnFilter, FirstKeyOnlyFilter, KeyOnlyFilter
How can we use mapreduce with hbase?
Apache MapReduce is a software framework used to analyze large amounts of data, and is the framework used most often with Apache Hadoop. HBase can be used as a data source (TableInputFormat) and as a data sink (TableOutputFormat or MultiTableOutputFormat) for MapReduce jobs. When writing MapReduce jobs that read or write HBase, it is advisable to subclass TableMapper and/or TableReducer.
How do we back up my hbase cluster?
There are two broad strategies for performing hbase backups: backing up with a full cluster shutdown, and backing
up on a live cluster. Each approach has pros and cons.
1)Full Shutdown Backup
Some environments can tolerate a periodic full shutdown of their HBase cluster, for example if it is being used as a back-end analytic capacity and is not serving front-end web pages. The benefit is that the NameNode/Master and RegionServers are down, so there is no chance of missing any in-flight changes to either StoreFiles or metadata. The obvious con is that the cluster is down.
2) Live Cluster Backup
Live cluster backup – CopyTable: the CopyTable utility can be used either to copy data from one table to another on the same cluster, or to copy data to a table on another cluster.
Live cluster backup – Export: the Export approach dumps the content of a table to HDFS on the same cluster.
Does hbase support SQL?
Not really. SQL-ish support for HBase via Hive is in development; however, Hive is based on MapReduce, which is not generally suitable for low-latency requests. Using Apache Phoenix, data can be retrieved from HBase with SQL queries.
What is bloommapfile used for?
The BloomMapFile is a class that extends MapFile, so its functionality is similar to MapFile. BloomMapFile uses dynamic Bloom filters to provide a quick membership test for the keys. It is used in the HBase table format.
How can I troubleshoot my HBase cluster?
Always start with the master log (TODO: Which lines?). Normally it's just printing the same lines over and over again. If not, then there's an issue. Google or search-hadoop.com should return some hits for the exceptions you're seeing.
An error rarely comes alone in Apache HBase; usually when something gets screwed up, what follows may be hundreds of exceptions and stack traces coming from all over the place. The best way to approach this type of problem is to walk the log up to where it all began. For example, one trick with RegionServers is that they will print some metrics when aborting, so grepping for Dump should get you around the start of the problem.
RegionServer suicides are ‘normal’, as this is what they do when something goes wrong. For example, if ulimit and
max transfer threads (the two most important initial settings, see [ulimit] and dfs.datanode.max.transfer.threads )
aren’t changed, it will make it impossible at some point for DataNodes to create new threads that from the HBase
point of view is seen as if HDFS was gone. Think about what would happen if your MySQL database was suddenly
unable to access files on your local file system, well it’s the same with HBase and HDFS. Another very common
reason to see RegionServers committing seppuku is when they enter prolonged garbage collection pauses that last
longer than the default ZooKeeper session timeout. For more information on GC pauses, see the 3 part blog post by
Todd Lipcon and Long GC pauses above.
Explain what HBase consists of.
• HBase consists of a set of tables.
• Each table contains rows and columns, like a traditional database.
• Each table must contain an element defined as a primary key.
• An HBase column denotes an attribute of an object.
It's easier to understand the data model as a multidimensional map. The first row from the table in Figure 1 has been represented as a multidimensional map in Figure 2.
CASSANDRA
Explain what is Cassandra?
Cassandra is an open source data storage system developed at Facebook for inbox search and designed for storing and managing large amounts of data across commodity servers without any failure. It can serve as both:
• A real time data store system for online applications
• A read intensive database for business intelligence systems
Cassandra is perfect for managing large amounts of structured, semi-structured and unstructured data across multiple data centers and the cloud. Cassandra delivers continuous availability, linear scalability and operational simplicity across many commodity servers with no single point of failure, along with a powerful dynamic data model designed for maximum flexibility and fast response times.
Why Cassandra? Why not any other no SQL like Hbase?
Apache Cassandra is an open source, free to use, distributed, decentralized, elastically and linearly scalable, highly
available, fault-tolerant, tune-ably consistent, column-oriented database that bases its distribution design on
Amazon’s Dynamo and its data model on Google’s Bigtable.
Our use case was more write intensive. Since Cassandra provides high availability with tunable consistency, which was a requirement of our use case, we preferred Cassandra. HBase is good for low-latency read/write kinds of use cases.
Explain Cassandra Data Model.
– The Cassandra data model has 4 main concepts which are cluster, keyspace, column, column family.
– Clusters contain many nodes (machines) and can contain multiple keyspaces.
– A keyspace is a namespace to group multiple column families, typically one per application.
– A column contains a name, value and timestamp.
– A column family contains multiple columns referenced by a row keys.
What platforms does Cassandra run on?
Cassandra is a Java application, meaning that a compiled binary distribution of Cassandra can run on any platform that has a Java Runtime Environment (JRE), also referred to as a Java Virtual Machine (JVM). DataStax strongly recommends using the Oracle/Sun Java Runtime Environment (JRE), version 1.6.0_19 or later, for optimal performance. Packaged releases are provided for Red Hat, CentOS, Debian and Ubuntu Linux platforms.
What management tools exist for Cassandra?
DataStax supplies both a free and a commercial version of OpsCenter, which is a visual, browser-based management tool for Cassandra. With OpsCenter, a user can visually carry out many administrative tasks, monitor a cluster for performance, and do much more. Downloads of OpsCenter are available on the DataStax website.
A number of command line tools also ship with Cassandra for querying/writing to the database, performing administration functions, etc.
Cassandra also exposes a number of statistics and management operations via Java Management Extensions (JMX). JMX is a Java technology that supplies tools for managing and monitoring Java applications and services. Any statistic or operation that a Java application has exposed as an MBean can then be monitored or manipulated using JMX.
During normal operation, Cassandra outputs information and statistics that you can monitor using JMX-compliant tools such as JConsole, the Cassandra nodetool utility, or the DataStax OpsCenter centralized management console. With the same tools, you can perform certain administrative commands and operations such as flushing caches or doing a repair.
Explain CAP theorem.
The CAP theorem (also called Brewer's theorem after its author, Eric Brewer) states that within a large-scale distributed data system, there are three requirements that have a relationship of sliding dependency: Consistency, Availability, and Partition Tolerance.
The CAP theorem states that in any given system, you can strongly support only two of these three. Cassandra is generally placed in the AP bucket of the CAP theorem: it favours Availability and Partition tolerance, while consistency is tunable.
What do you understand by Elastic Scalability?
Elastic Scalability means that your cluster can seamlessly scale up and scale back down. That actually means that
adding more servers to cluster would improve and scale performance of cluster in linear fashion without any manual
interventions. Vice versa is equally true.
Cassandra is said to be Tuneable Consistent. Why?
Consistency essentially means that a read always returns the most recently written value. Cassandra allows you to
easily decide the level of consistency you require, in balance with the level of availability. This is controlled by
parameters like replication factor and consistency level.
How Cassandra Achieve High Availability and Fault Tolerance?
Cassandra is highly available. You can easily remove a few failed Cassandra nodes from the cluster without actually losing any data and without bringing the whole cluster down. In a similar fashion, you can also improve performance by replicating data to multiple data centers.
What is cluster in Cassandra?
In Cassandra, the cluster is an outermost container for keyspaces that arranges the nodes in a ring format and assigns
data to them. These nodes have a replica which takes charge in case of data handling failure.
What are the differences between a node, a cluster, and datacenter in Cassandra?
Node: A node is a single machine running Cassandra.
Cluster: A cluster is a collection of nodes that contains similar types of data together.
Datacenter: A datacenter is a useful component when serving customers in different geographical areas. Different
nodes of a cluster can be grouped into different data centers. A data center can be a physical data center or virtual
data center. Replication is set by data center. Depending on the replication factor, data can be written to multiple
data centers. However, data centers should never span physical locations, whereas a cluster contains one or more data centers and can span physical locations.
Explain what is composite type in Cassandra?
In Cassandra, composite type allows to define key or a column name with a concatenation of data of different type.
You can use two types of composite type:
• Row key
• Column name
How Cassandra stores data?
Cassandra stores all data as bytes. When you specify validator, Cassandra ensures that those bytes are encoded as
per requirement and then a comparator orders the column based on the ordering specific to the encoding. While
composite are just byte arrays with a specific encoding, for each component it stores a two byte length followed by
the byte encoded component followed by a termination bit.
How does Cassandra work?
Cassandra’s built-for-scale architecture means that it is capable of handling petabytes of information and thousands
of concurrent users/operations per second.
Mention what are the main components of Cassandra Data Model?
The main components of the Cassandra data model are:
• Cluster
• Keyspace
• Column
• Column family
List out the other components of Cassandra?
The other components of Cassandra are:
• Node
• Data center
• Cluster
• Commit log
• Mem-table
• SSTable
• Bloom filter
What are key spaces and column family in Cassandra?
In Cassandra, the logical division that associates similar data is called a column family. The basic Cassandra data structures are: the column, which is a name/value pair (and a client-supplied timestamp of when it was last updated), and a column family, which is a container for rows that have similar, but not identical, column sets. Each row has a unique identifier which could be called a row key. A keyspace is the outermost container for data in Cassandra, corresponding closely to a relational database.
Explain what is a keyspace in Cassandra?
In Cassandra, a keyspace is a namespace that determines data replication on nodes. A cluster consists of one keyspace per node.
What is the syntax to create keyspace in Cassandra?
Syntax for creating keyspace in Cassandra is
CREATE KEYSPACE <identifier> WITH <properties>
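A minimal CQL sketch, assuming a hypothetical keyspace named testks on a single data center using SimpleStrategy:
CREATE KEYSPACE testks
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};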
What is the difference between Column and supercolumn?
The values in columns are strings, while the values in a supercolumn are a map of columns with different data types. Unlike columns, super columns do not contain the third component, the timestamp.
What is supercolumn in Cassandra?
In Cassandra, supercolumn is a unique element containing similar collection of data. They are actually key-value
pairs with values as columns.
Why are super columns in Cassandra no longer favoured?
Super columns suffer from a number of problems, not least of which is that it is necessary for Cassandra to
deserialize all of the sub-columns of a super column when querying (even if the result will only return a small
subset). As a result, there is a practical limit to the number of sub-columns per super column that can be stored
before performance suffers.
In theory, this could be fixed within Cassandra by properly indexing sub-columns, but consensus is that composite
columns are a better solution, and they work without the added complexity.
Mention what are the values stored in the Cassandra Column?
In Cassandra Column, basically there are three values
 Column Name
 Value
 Time Stamp
Mention when you can use ALTER KEYSPACE?
ALTER KEYSPACE can be used to change properties such as the number of replicas and the durable_writes setting of a
keyspace.
Mention what the shell commands CAPTURE and CONSISTENCY determine?
There are various cqlsh shell commands in Cassandra. The CAPTURE command captures the output of a command and
adds it to a file, while the CONSISTENCY command displays the current consistency level or sets a new consistency level.
What is mandatory while creating a table in Cassandra?
While creating a table, a primary key is mandatory; it is made up of one or more columns of the table.
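An illustrative sketch (the table and column names are hypothetical):
CREATE TABLE employees (
  emp_id int,
  dept text,
  name text,
  PRIMARY KEY (emp_id)
);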
Mention what needs to be taken care while adding a Column?
While adding a column you need to take care that the
 Column name is not conflicting with the existing column names
 Table is not defined with compact storage option
Explain how Cassandra writes.
Cassandra writes first to a commit log on disk for durability then commits to an in-memory structure called a
memtable. A write is successful once both commits are complete. Writes are batched in memory and written to disk
in a table structure called an sstable (sorted string table). Memtables and sstables are created per column family.
With this design Cassandra has minimal disk I/O and offers high speed write performance because the commit log is
append-only and Cassandra doesn’t seek on writes. In the event of a fault when writing to the sstable Cassandra can
simply replay the commit log.
How does Cassandra perform write function?
Cassandra performs the write function by applying two commits:
o The first commit is applied on disk (the commit log) and then the second commit to an in-memory structure known as the memtable.
o When both commits are applied successfully, the write is achieved.
o Writes are written in the table structure as sstable (sorted string table).
Explain how Cassandra writes data?
Cassandra writes data in three components
 Commitlog write
 Memtable write
 Sstable write
Cassandra first writes data to the commit log, then to an in-memory table structure called the memtable, and finally to an SSTable.
What is a commit log?
It is a crash-recovery mechanism. All data is written first to the commit log (file) for durability. After all its data has
been flushed to sstables, it can be archived, deleted, or recycled.
Explain what is Memtable in Cassandra?
 Cassandra writes the data to an in-memory structure known as the memtable
 It is an in-memory cache with content stored in key/column format
 Memtable data is sorted by key
 There is a separate memtable for each column family, and it retrieves column data by key
 It stores the writes until it is full, and is then flushed out.
What is an SSTable? Is it similar to an RDBMS table?
A sorted string table (SSTable) is an immutable data file to which Cassandra writes memtables periodically. SSTables
are append only, stored on disk sequentially, and maintained for each Cassandra table. An RDBMS table, by contrast,
is a collection of ordered columns fetched by row.
What does an SSTable consist of?
An SSTable consists mainly of 2 files
 Index file ( Bloom filter & Key offset pairs)
 Data file (Actual column data)
Explain what a Bloom filter is used for in Cassandra?
A Bloom filter is a space-efficient data structure that is used to test whether an element is a member of a set. Bloom
filters are accessed on every query. In other words, it is used to determine whether an SSTable has data for a
particular row. In Cassandra it is used to save IO when performing a key lookup.
Explain how Cassandra writes changed data into commitlog?
 Cassandra appends the changed data to the commit log
 The commit log acts as a crash recovery log for the data
 The write operation will never be considered successful until the changed data has been appended to the commit log
 Data will not be lost once the commit log is flushed out to file.
Explain how Cassandra deletes data?
SSTables are immutable, so a row cannot be removed from them. When a row needs to be deleted, Cassandra assigns
the column value a special marker called a tombstone. When the data is read, the tombstone value is treated as
deleted.
What is Gossip protocol?
Gossip is a peer-to-peer communication protocol in which nodes periodically exchange state information about
themselves and about other nodes they know about. The gossip process runs every second and exchanges state
messages with up to three other nodes in the cluster.
What is Order Preserving partitioner?
This is a kind of Partitioner that stores rows by key order, aligning the physical structure of the data with your sort
order. Configuring your column family to use order-preserving partitioning allows you to perform range slices,
meaning that Cassandra knows which nodes have which keys. This partitioner is somewhat the opposite of the
Random Partitioner; it has the advantage of allowing for efficient range queries, but the disadvantage of unevenly
distributing keys.
The order-preserving partitioner (OPP) is implemented by the org.apache.cassandra.dht.OrderPreservingPartitioner
class. There is a special kind of OPP called the collating order-preserving partitioner (COPP). This acts like a regular
OPP, but sorts the data in a collated manner according to English/US lexicography instead of byte ordering. For this
reason, it is useful for locale-aware applications. The COPP is implemented by the
org.apache.cassandra.dht.CollatingOrderPreservingPartitioner class.
What is a materialized view? Why is it normal practice in Cassandra to have one?
Materialized means storing a full copy of the original data so that everything you need to answer a query is right
there, without forcing you to look up the original data. Because you don't have a SQL WHERE clause, you can
recreate this effect by writing your data to a second column family that is created specifically to represent that
query.
Why Time stamp is so important while inserting data in Cassandra?
This is important because Cassandra uses timestamps to determine the most recent write value.
What are advantages and disadvantages of secondary indexes in Cassandra?
Querying becomes more flexible when you add secondary indexes to table columns. You can add indexed columns
to the WHERE clause of a SELECT.
When to use secondary indexes: you want to query on a column that isn't the primary key and isn't part of a
composite key, and the column you want to query on has few unique values (for example, a column Town is a good
choice for secondary indexing because lots of people will be from the same town; date of birth, however, is not such
a good choice).
When to avoid secondary indexes: try not to use secondary indexes on columns that contain a high count of unique
values and that will produce few results. Remember that a secondary index makes writing to the DB much slower,
you can find a value only by an exact index match, and you need to make requests to all servers in the cluster to find
a value by index.
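An illustrative sketch of creating and querying a secondary index (the table, column and index names are hypothetical):
CREATE INDEX users_town_idx ON users (town);
SELECT * FROM users WHERE town = 'Springfield';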
What is the CQL Language?
Cassandra 0.8 is the first release to introduce Cassandra Query Language (CQL), the first standardized query
language for Apache Cassandra. CQL pushes all of the implementation details to the server in the form of a CQL
parser. Clients built on CQL only need to know how to interpret query result objects. CQL is the start of the first
officially supported client API for Apache Cassandra. CQL drivers for the various languages are hosted with the
Apache Cassandra project.
CQL Syntax is based on SQL (Structured Query Language), the standard for relational database manipulation.
Although CQL has many similarities to SQL, it does not change the underlying Cassandra data model. There is no
support for JOINS, for example.
Mention what is Cassandra- CQL collections?
Cassandra CQL collections help you to store multiple values in a single variable. In Cassandra, you can use CQL
collections in the following ways (see the sketch after this list):
 List: it is used when the order of the data needs to be maintained and a value may be stored multiple times (it holds
an ordered list of elements, duplicates allowed)
 Set: it is used for a group of elements to store; elements are unique and are returned in sorted order
 Map: it is a data type used to store key-value pairs of elements
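A short illustrative sketch of the three collection types (the table and column names are hypothetical):
CREATE TABLE users (
  id int PRIMARY KEY,
  emails set<text>,          -- unique elements, returned in sorted order
  tasks list<text>,          -- ordered, duplicates allowed
  todo map<timestamp, text>  -- key/value pairs
);
UPDATE users SET emails = emails + {'a@example.com'} WHERE id = 1;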
What is Cassandra-Cqlsh?
Cassandra cqlsh is a query shell used to communicate with the Cassandra database. cqlsh facilitates you to do
the following things:
o Define a schema
o Insert data, and
o Execute a query
It is a Python-based command-line client for Cassandra.
What is the use of SOURCE command?
SOURCE command is used to execute a file that contains CQL statements.
How do you query Cassandra?
We query Cassandra using cql (Cassandra query language). We use cqlsh for interacting with DB.
What is the use of HELP command?
It is used to display a synopsis and a brief description of all cqlsh commands.
What is the use of capture command?
The CAPTURE command captures the output of a command and adds it to a file.
Does Cassandra work on Windows?
Yes, Cassandra works pretty well on Windows. Right now both Linux- and Windows-compatible versions are available.
Why is denormalization preferred in Cassandra?
Because Cassandra does not support joins, data is denormalized so queries can be served from a single table; users can join data at their own end if needed.
What is the use of the CONSISTENCY command?
The CONSISTENCY command is used to display the current consistency level, or to set a new consistency level.
Does Cassandra Support Transactions?
Yes and No, depending on what you mean by ‘transactions’. Unlike relational databases, Cassandra does not offer
fully ACID-compliant transactions. There are no locking or transactional dependencies when concurrently updating
multiple rows or column families. But if by ‘transactions’ you mean real-time data entry and retrieval, with
durability and tunable consistency, then yes.
Cassandra does not support transactions in the sense of bundling multiple row updates into one all-or-nothing
operation. Nor does it roll back when a write succeeds on one replica but fails on other replicas. It is possible in
Cassandra to have a write operation report a failure to the client, but still actually persist the write to a replica.
However, this does not mean that Cassandra cannot be used as an operational or real time data store. Data is very
safe in Cassandra because writes in Cassandra are durable. All writes to a replica node are recorded both in memory
and in a commit log before they are acknowledged as a success. If a crash or server failure occurs before the
memory tables are flushed to disk, the commit log is replayed on restart to recover any lost writes.
What is Compaction in Cassandra?
The compaction process merges keys, combines columns, evicts tombstones, consolidates sstables, and creates a
new index in the merged sstable.
What is Anti-Entropy?
Anti-entropy, or replica synchronization, is the mechanism in Cassandra for ensuring
that data on different nodes is updated to the newest version.
What are consistency levels for read operations?
 ONE: returns the value from the first node that responds, and performs a read repair in the background.
 QUORUM: queries all nodes and returns the record with the most recent timestamp after a quorum of nodes has
responded, where a quorum is (n / 2) + 1.
 DCQUORUM: It ensures that only nodes in the same data center are queried. It is applicable when using Rack-
Aware placement strategy.
 ALL: Queries all nodes and returns the value with the most recent timestamp. This level waits for all nodes to
respond, and if one doesn’t, it fails the read operation.
What do you understand by Consistency in Cassandra?
Consistency refers to how synchronized and up-to-date a row of Cassandra data is on all of its replicas.
Explain Zero Consistency?
With zero consistency, write operations are handled in the background, asynchronously. It is the fastest way to write data, but the
one that offers the least confidence that operations will succeed.
Explain Any Consistency?
It assures that our write operation was successful on at least one node, even if the acknowledgment is only for a hint.
It is a relatively weak level of consistency.
Explain ONE consistency?
It is used to ensure that the write operation was written to at least one node, including its commit log and memtable.
Explain QUORUM consistency?
A quorum is the number of nodes that represents consensus on an operation. It is determined by
(replication factor / 2) + 1, using integer division; for example, with a replication factor of 3 the quorum is 3/2 + 1 = 2, so two replicas must respond.
Explain ALL consistency?
Every node, as specified by your replication factor configuration, must successfully acknowledge the write
operation. If any node does not acknowledge the write operation, the write fails. This has the highest level of
consistency and the lowest level of performance.
What do you mean by hinted handoff?
It is a mechanism to ensure availability, fault tolerance, and graceful degradation. If a write operation occurs and a
node that is intended to receive that write goes down, a note (the hint) is given (handed off) to a different live node
to indicate that it should replay the write operation to the unavailable node when it comes back online. This does two
things: it reduces the amount of time that it takes for a node to get all the data it missed once it comes back online,
and it improves write performance in lower consistency levels.
What is Merkle Tree? Where is it used in Cassandra?
Merkle tree is a binary tree data structure that summarizes in short form the data in a larger dataset. Merkle trees are
used in Cassandra to ensure that the peer-to-peer network of nodes receives data blocks unaltered and unharmed.
What do you mean by multiget?
It means a query by column name for a set of keys.
What is a SEED node in Cassandra?
A seed is a node that already exists in a Cassandra cluster and is used by newly added nodes to get up and running.
The newly added node can start gossiping with the seed node to get state information and learn the topology of the
node ring. There may be many seeds in a cluster.
What is Slice and Range slice in Cassandra?
This is a type of read query. Use get_slice() to query by a single column name or a range of column names. Use
get_range_slice() to return a subset of columns for a range of keys.
What is Multiget Slice?
It is a query that gets a subset of columns for a set of keys.
What is Tombstone in Cassandra world?
Cassandra does not immediately delete data following a delete operation. Instead, it marks the data with a
tombstone, an indicator that the column has been deleted but not removed entirely yet. The tombstone can then be
propagated to other replicas.
What is Thrift?
Thrift is the RPC framework (and client API) used to communicate with the Cassandra server.
What is Batch Mutates?
Like a batch update in the relational world, the batch_mutate operation allows grouping calls on many keys into a
single call in order to save on the cost of network round trips. If batch_mutate fails in the middle of its list of
mutations, there will be no rollback, so any updates that have already occurred up to this point will remain intact.
What is Random Partitioner?
This is a kind of partitioner that uses a BigIntegerToken with an MD5 hash to determine where to place the keys on
the node ring. This has the advantage of spreading your keys evenly across your cluster, but the disadvantage of
causing inefficient range queries. This is the default partitioner.
What is Read Repair?
This is another mechanism to ensure consistency throughout the node ring. In a read operation, if Cassandra detects
that some nodes have responded with data that is inconsistent with the response of other, newer nodes, it makes a
note to perform a read repair on the old nodes. The read repair means that Cassandra will send a write request to the
nodes with stale data to get them up to date with the newer data returned from the original read operation. It does
this by pulling all the data from the node, performing a merge, and writing the merged data back to the nodes that
were out of sync. The detection of inconsistent data is made by comparing timestamps and checksums.
What is Snitch in Cassandra?
A snitch is Cassandra’s way of mapping a node to a physical location in the network.
It helps determine the location of a node relative to another node in order to assist with discovery and ensure
efficient request routing.
What are the benefits/ advantages of Cassandra?
o Cassandra was designed to handle big data workloads across multiple nodes without any single point of
failure.
o Cassandra delivers near real-time performance simplifying the work of Developers, Administrators, Data
Analysts and Software Engineers.
o It provides extensible scalability and can be easily scaled up and scaled down as per the requirements.
o It is fault tolerant and consistent.
o It is a column-oriented database.
o It has no single point of failure.
o There is no need for separate caching layer.
o It has flexible schema design.
o It has flexible data storage, easy data distribution, and fast writes.
o It offers atomicity, isolation and durability at the row level, with tunable consistency.
o It is multi-data center and cloud capable.
What's new in Cassandra
Cassandra 2.0 included major enhancements to CQL, security, and performance. CQL for Cassandra 2.0.6
adds several important features including batching of conditional updates, static columns, and increased control over
slicing of clustering columns.
Key features of Cassandra 2.0 are:
 Support for lightweight transactions
o Use of the IF keyword in CQL INSERT and UPDATE statements
o New SERIAL consistency level
 Triggers
The first phase of support for triggers for firing an event that executes a set of programmatic logic, which
runs either inside or outside a database cluster
 CQL paging support
Paging of result sets of SELECT statements executed over a CQL native protocol 2 connection, which
eliminates the need to use the token function to page through results. For example, to page through data in
this table, a simple SELECT statement after Cassandra 2.0 replaces the complex one using the token
function before Cassandra 2.0.
 Improved authentication
SASL support for easier and better authentication over prior versions of the CQL native protocol
 Drop column support
Re-introduction of the ALTER TABLE DROP command
 SELECT column aliases
Support for column aliases in a SELECT statement, similar to aliases in RDBMS SQL:
SELECT hdate AS hired_date
FROM emp WHERE empid = 500
 Conditional DDL
Conditionally tests for the existence of a table, keyspace, or index before issuing a DROP or CREATE
statement using IF EXISTS or IF NOT EXISTS
 Index enhancements
Indexing of any part, partition key or clustering columns, portion of a compound primary key
 One-off prepare and execute statements
Use of a prepared statement, even for the single execution of a query to pass binary values for a statement,
for example to avoid a conversion of a blob to a string, over a native protocol version 2 connection
 Performance enhancements
o Off-heap partition summary
o Eager retries support
Sending the user request to other replicas before the query times out when a replica is unusually
slow in delivering needed data
o Compaction improvements
Hybrid (leveled and size-tiered) compaction improvements to the leveled compaction strategy to
reduce the performance overhead on read operations when compaction cannot keep pace with
write-heavy workloads
Other changes in Cassandra 2.0 are:
 New commands to disable background compactions
Nodetool disableautocompaction and nodetool enableautocompaction
 A change to random token selection during cluster setup
Auto_bootstrapping of a single-token node with no initial_token
 Removal of super column support
Continued support for apps that query super columns, translation of super columns on the fly into CQL
constructs and results
 Removal of the cqlsh ASSUME command
Use the blobastype and typeasblob conversion functions instead of ASSUME
 Cqlsh COPY command support for collections
 Inclusion of the native protocol version in the system.local table
 Inclusion of default_time_to_live, speculative_retry, and memtable_flush_period_in_ms in cqlsh
DESCRIBE TABLE output
 Support for an empty list of values in the IN clause of SELECT, UPDATE, and DELETE commands,
useful in Java Driver applications when passing empty arrays as arguments for the IN clause
Why do you need HBase when you can use Hive to query Hadoop?
Hive is suited to batch, SQL-style analysis of data at rest, whereas HBase provides low-latency, random read/write access to individual rows, which Hive cannot do.
HBase Use Case –
Facebook is one of the largest users of HBase, with its messaging platform built on top of HBase in 2010. HBase is also
used by Facebook for streaming data analysis, its internal monitoring system, the Nearby Friends feature, search indexing
and scraping data for its internal data warehouses.
Streaming Data of Hadoop Ecosystem- Spark, Kafka, Storm
SPARK
Batch Data Analytics: collect data and process it later (historical data processing). Day 1 data is processed on day 2, so there is a time lag.
Real-Time Analytics: processing on current data.
E.g., when a card is swiped at 2 different locations, either send the OTP or block the transaction, approving it only when
confirmed. Day 1 data is processed on day 1 itself, with no time lag.
Explain the key features of Spark.
• Spark allows Integration with Hadoop and files included in HDFS.
• Polyglot: Spark applications can be written in different languages – Scala, Java, Python, R.
• It consists of RDD’s (Resilient Distributed Datasets), that can be cached across computing nodes in a cluster.
Additionally, some of the salient features of Spark include:
Lightning-fast processing: Spark makes this possible by reducing the number of read/write operations to the disc. It
stores this intermediate processing data in memory. It runs programs up to 100x faster than Hadoop MapReduce in
memory, or 10x faster on disk. It aptly utilizes RAM to produce faster results.
Support for sophisticated analytics: In addition to simple map and reduce operations, Spark supports SQL queries,
streaming data, and complex analytics such as machine learning and graph algorithms. This allows users to combine
all these capabilities in a single workflow.
Real-time stream processing: Spark can handle real-time streaming. Mapreduce primarily handles and processes
previously stored data even though there are other frameworks to obtain real-time streaming. Spark does this in the
best way possible.
Compare Hadoop & Spark
Criteria | Hadoop | Spark
Dedicated storage | HDFS | None
Speed of processing | Average | Excellent
Libraries | Separate tools available | Spark Core, SQL, Streaming, MLlib, GraphX

Criteria | Hadoop MapReduce | Apache Spark
Memory | Does not leverage the memory of the Hadoop cluster to the maximum. | Saves data in memory with the use of RDDs.
Disk usage | MapReduce is disk oriented. | Spark caches data in-memory and ensures low latency.
Processing | Only batch processing is supported. | Supports real-time processing through Spark Streaming.
Installation | Is bound to Hadoop. | Is not bound to Hadoop.
Replication | Hadoop uses replication to achieve fault tolerance. | Spark uses a different data storage model, resilient distributed datasets (RDDs).
Is there any point in learning MapReduce, then?
Yes, for the following reasons:
• MapReduce is a paradigm used by many big data tools including Spark. So, understanding the MapReduce
paradigm and how to convert a problem into a series of MR tasks is very important.
• When the data grows beyond what can fit into the memory on your cluster, the Hadoop MapReduce
paradigm is still very relevant.
• Almost every other tool, such as Hive or Pig, converts its query into MapReduce phases. If you understand
MapReduce, you will be able to optimize your queries better.
When running Spark on Yarn, do I need to install Spark on all nodes of Yarn Cluster?
Since spark runs on top of Yarn, it utilizes yarn for the execution of its commands over the cluster’s nodes.
So, you just have to install Spark on one node.
What are the downsides of Spark?
Spark utilizes memory, so the developer has to be careful. A casual developer might make the following mistakes:
• She may end up running everything on the local node instead of distributing work over to the cluster.
• She might hit some web service too many times, for example by calling it from inside map() or reduce(); because
these calls run in many parallel tasks, this kind of overloading of a service is easy to cause with Spark.
List some use cases where Spark outperforms Hadoop in processing.
i. Sensor Data Processing –Apache Spark’s ‘In-memory computing’ works best here, as data is retrieved and
combined from different sources.
ii. Spark is preferred over Hadoop for real time querying of data
iii. Stream Processing – For processing logs and detecting frauds in live streams for alerts, Apache Spark is the best
solution.
iv. Iterative machine learning.
v. Interactive data analytics and processing.
What is a Sparse Vector?
A sparse vector has two parallel arrays –one for indices and the other for values. These vectors are used for storing
non-zero entries to save space.
What is sparkcontext?
A sparkcontext represents the connection to a Spark cluster to create a new sparkcontext object, and can be used to
create rdds, accumulators and broadcast variables on that cluster. Sparkcontext tell spark how to access the cluster.
To create a SparkContext you first need to build a SparkConf object that contains information about your
application.
Scala:
val conf = new SparkConf().setAppName(appName).setMaster(master)
new SparkContext(conf)
Python:
conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)
In the Spark shell, a SparkContext is already created for you, in the variable called sc.
What is RDD?
Rdds (Resilient Distributed Datasets) are basic abstraction in Apache Spark that represent the data coming into the
system in object format. Rdds are used for in-memory computations for elements partitioned across multiple nodes
on large clusters.
Immutable – RDDs cannot be altered. Once an RDD is created and assigned a value, its data cannot be changed;
transformations produce new RDDs instead.
Distributed – the data in an RDD is automatically distributed across different parallel computing nodes.
Resilient – if a node holding a partition fails, the partition is recomputed on another node.
There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a
dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a
Hadoop InputFormat.
How does one create rdds in Spark?
1. Parallelized collections are created by calling SparkContext’s parallelize method on an existing collection in your
driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in
parallel. For example, here is how to create a parallelized collection holding the numbers 1 to 5:
Scala:
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
Python:
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
Once created, the distributed dataset (distData) can be operated on in parallel. For example, we might call
distData.reduce((a, b) => a + b) to add up the elements of the array.
2. Spark can create distributed datasets from any storage source supported by Hadoop, including your local
file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any
other Hadoop InputFormat.
Text file RDDs can be created using SparkContext’s textFile method. This method takes a URI for the file (either a
local path on the machine, or a hdfs://, s3n://, etc URI) and reads it as a collection of lines.
Scala:
val distFile = sc.textFile("data.txt")
Python:
distFile = sc.textFile("data.txt")
If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. The textFile method also takes an optional second
argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of
the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing
a larger value. Note that you cannot have fewer partitions than blocks.
Define Partitions.
Spark will run one task for each partition of the cluster. Typically you want 2-4 partitions for each CPU in your
cluster. Normally, Spark tries to set the number of partitions automatically based on your cluster. However, you can
also set it manually by passing it as a second parameter to parallelize (e.g. sc.parallelize(data, 10)).
Here an RDD is created in 10 partitions
Spark splits data into partitions and executes computations on the partitions in parallel. You need to manually adjust
the partitioning to keep your Spark computations running efficiently.
A Partition is a smaller and logical division of data, that is similar to the split in mapreduce. Partitioning is the
process that helps derive logical units of data in order to speed up data processing and support scalability.
What is an RDD Lineage?
Spark doesn’t support data replication in memory. In the event of any data loss, the data is rebuilt using the RDD
lineage, a process that reconstructs lost data partitions. Each RDD remembers how it was built from other
datasets.
What is lineage graph?
The rdds in Spark, depend on one or more other rdds. The representation of dependencies in between rdds is known
as the lineage graph. Lineage graph information is used to compute each RDD on demand, so that whenever a part
of persistent RDD is lost, the data that is lost can be recovered using the lineage graph information.
Hadoop uses replication to achieve fault tolerance. How is this achieved in Apache Spark?
Data storage model in Apache Spark is based on rdds. Rdds help achieve fault tolerance through lineage. RDD
always has the information on how to build from other datasets. If any partition of a RDD is lost due to failure,
lineage helps build only that particular lost partition.
Does Apache Spark provide check pointing?
Lineage graphs are always useful to recover rdds from a failure but this is generally time consuming if the rdds have
long lineage chains. Spark has an API for check pointing i.e. A REPLICATE flag to persist. However, the decision
on which data to checkpoint - is decided by the user. Checkpoints are useful when the lineage graphs are long and
have wide dependencies.
What operations does the RDD support?
 Transformations
 Actions
Transformations:
The transformations are the functions that are applied on an RDD (resilient distributed dataset). A transformation
results in another RDD. All transformations in Spark are lazy, in that they do not compute their results right away.
Instead, they just remember the transformations applied to some base dataset. The transformations are only computed
when an action requires a result to be returned to the driver program.
Example of transformations: Map, flatmap, groupbykey, reducebykey, filter, co-group, join, sortbykey, Union,
distinct, sample are common spark transformations.
Examples of transformations are (see the sketch below):
1. map() – applies the function passed to it on each element of the RDD, resulting in a new RDD.
2. filter() – creates a new RDD by picking the elements from the current RDD which pass the function
argument.
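A minimal Scala sketch of these two transformations (the data values are illustrative); nothing is computed until the action collect() is called:
val nums = sc.parallelize(1 to 10)
val squares = nums.map(x => x * x)       // transformation: one output element per input
val evens = squares.filter(_ % 2 == 0)   // transformation: keeps elements that pass the predicate
evens.collect()                          // action: Array(4, 16, 36, 64, 100)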
Actions:
Actions are the results of RDD computations or transformations. An action returns a value to the driver program after
running a computation on the dataset. Execution of an action triggers all of the previously created transformations.
Reduce, collect, takesample, take, first, saveastextfile, saveassequencefile, countbykey, foreach are common actions
in Apache spark.
Examples of actions are:
• reduce() – executes the function passed to it again and again until only one value is left. The function should
take two arguments and return one value.
• take(n) – returns the first n elements of the RDD to the local (driver) node.
val lines = sc.textFile("data.txt")
val lineLengths = lines.map(s => s.length)
val totalLength = lineLengths.reduce((a, b) => a + b)
Spark breaks the computation into tasks to run on separate machines, and each machine runs both its part of the map
and a local reduction, returning only its answer to the driver program.
What is Map and flatmap in Spark?
Map is a specific line or row to process that data. In flatmap each input item can be mapped to multiple output items
(so function should return a Seq rather than a single item). So most frequently used to return Array elements.
Map | FlatMap
Spark's map function expresses a one-to-one transformation. | Spark's flatMap function expresses a one-to-many transformation.
It transforms each element of a collection into one element of the resulting collection. | It transforms each element to 0 or more elements.
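A small Scala sketch of the difference (the input strings are illustrative):
val lines = sc.parallelize(Seq("hello world", "hi"))
lines.map(_.split(" ")).collect()      // Array(Array("hello", "world"), Array("hi")) – one output per input
lines.flatMap(_.split(" ")).collect()  // Array("hello", "world", "hi") – the outputs are flattened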
What do you understand by Pair RDD?
Special operations can be performed on rdds in Spark using key/value pairs and such rdds are referred to as Pair
rdds. Pair rdds allow users to access each key in parallel. They have a reducebykey () method that collects data
based on each key and a join () method that combines different rdds together, based on the elements having the same
key.
Scala:
val lines = sc.textFile("data.txt")
val pairs = lines.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)
We could also use counts.sortByKey(), for example, to sort the pairs alphabetically, and finally counts.collect() to
bring them back to the driver program as an array of objects.
Note: when using custom objects as the key in key-value pair operations, you must be sure that a custom equals()
method is accompanied with a matching hashCode() method.
Difference between coalesce and repartition
coalesce uses existing partitions to minimize the amount of data that's shuffled. repartition creates new partitions
and does a full shuffle. coalesce results in partitions with different amounts of data (sometimes partitions of very
different sizes), while repartition results in roughly equal-sized partitions.
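A brief Scala sketch, assuming an existing RDD named rdd (hypothetical):
val shuffled = rdd.repartition(10)   // full shuffle into 10 roughly equal partitions
val merged = rdd.coalesce(2)         // merges existing partitions without a full shuffle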
Is coalesce or repartition faster?
coalesce may run faster than repartition, but unequal sized partitions are generally slower to work with than equal
sized partitions. You'll usually need to repartition datasets after filtering a large data set. I've found repartition to be
faster overall because Spark is built to work with equal sized partitions.
What do you understand by Lazy Evaluation?
When a transformation like map () is called on a RDD-the operation is not performed immediately. Transformations
in Spark are not evaluated till you perform an action. This helps optimize the overall data processing workflow.
Because of lazy evaluation, errors in transformations are not surfaced until an action is executed.
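A small Scala sketch; the HDFS path is hypothetical, and any error only surfaces when the action runs:
val missing = sc.textFile("hdfs:///no/such/path")   // no error yet: textFile is lazy
val upper = missing.map(_.toUpperCase)              // still no error: map is a lazy transformation
// upper.count()                                    // the action would trigger evaluation and surface the error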
Define a worker node.
A node that can run the Spark application code in a cluster can be called a worker node. A worker node can have
more than one worker, which is configured by setting the SPARK_WORKER_INSTANCES property in the spark-
env.sh file. Only one worker is started if the SPARK_WORKER_INSTANCES property is not defined.
What do you understand by Executor Memory in a Spark application?
Every Spark application has a fixed heap size and a fixed number of cores for each Spark executor. The heap size is
what is referred to as the Spark executor memory, which is controlled with the spark.executor.memory property or the
--executor-memory flag. Every Spark application will have one executor on each worker node. The executor memory
is basically a measure of how much RAM of the worker node the application will utilize.
What is a Spark Executor?
When sparkcontext connects to a cluster manager, it acquires an Executor on the cluster nodes. Executors are Spark
processes that run computations and store the data on the worker node. The final tasks by sparkcontext are
transferred to executors.
How RDD persist the data?
There are two methods to persist the data: persist(), which lets you choose a storage level, and cache(), which persists
the data in memory using the default storage level. Different storage level options exist, such as MEMORY_ONLY,
MEMORY_AND_DISK, DISK_ONLY and many more.
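A minimal Scala sketch, assuming an existing RDD named rdd (hypothetical):
import org.apache.spark.storage.StorageLevel
rdd.cache()                                  // same as rdd.persist() with the default MEMORY_ONLY level
// or, to choose a different storage level instead:
// rdd.persist(StorageLevel.MEMORY_AND_DISK) // spills partitions to disk when memory is full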
What are the various levels of persistence in Apache Spark?
Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel
operations. Spark has various persistence levels to store the rdds on disk or in memory or as a combination of both
with different replication levels.
The various storage/persistence levels in Spark are -
 MEMORY_ONLY
 MEMORY_ONLY_SER
 MEMORY_AND_DISK
 MEMORY_AND_DISK_SER, DISK_ONLY
 OFF_HEAP
How can you minimize data transfers when working with Spark?
Minimizing data transfers and avoiding shuffling helps write spark programs to run in a fast and reliable manner.
Spark supports 2 types of shared variables called broadcast variables (like Hadoop distributed cache) and
accumulators (Hadoop counters). The various ways in which data transfers can be minimized when working with
Apache Spark are:
1. Using Broadcast Variable- Broadcast variable enhances the efficiency of joins between small and large rdds.
2. Using Accumulators – Accumulators help update the values of variables in parallel while executing.
The most common way is to avoid ByKey operations, repartition, or any other operations which trigger shuffles.
What are broadcast variables?
Broadcast variables let the programmer keep a read-only variable cached on each machine, rather than shipping a copy
of the variable with every task, so data can be processed faster. Broadcast variables are stored as Array Buffers, which
send read-only values to worker nodes. Broadcast variables help in storing a lookup table inside the memory, which
enhances retrieval efficiency compared to an RDD lookup().
Scala:
val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar.value
What are Accumulators in Spark?
Accumulators in Spark are used specifically to provide a mechanism for safely updating a variable when execution
is split up across worker nodes in a cluster.
Spark accumulators are similar to Hadoop counters; you can use accumulators to count events and track what is happening during a job.
A numeric accumulator can be created by
calling SparkContext.longAccumulator() or SparkContext.doubleAccumulator() to accumulate values of type Long
or Double, respectively. Tasks running on a cluster can then add to it using the add method. However, they cannot
read its value. Only the driver program can read the accumulator’s value, using its value method. aggregateByKey()
and combineByKey() use accumulators.
Scala:
val accum = sc.longAccumulator("My Accumulator")
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum.add(x))
accum.value
Spark cluster client mode difference
Client mode: the driver runs on the machine that submits the application (the client/gateway node).
Cluster mode: the driver runs inside the cluster, on a worker node chosen by the cluster manager.
In yarn-cluster mode, the driver runs in the Application Master (inside a YARN container). In yarn-client mode, it
runs in the client.
A common deployment strategy is to submit your application from a gateway machine that is physically co-located
with your worker machines (e.g. Master node in a standalone EC2 cluster). In this setup, client mode is appropriate.
In client mode, the driver is launched directly within the spark-submit process which acts as a client to the cluster.
The input and output of the application is attached to the console. Thus, this mode is especially suitable for
applications that involve the REPL (e.g. Spark shell).
Alternatively, if your application is submitted from a machine far from the worker machines (e.g. locally on your
laptop), it is common to use cluster mode to minimize network latency between the drivers and the executors.
Currently, standalone mode does not support cluster mode for Python applications.
What do you understand by schemardd?
An RDD that consists of row objects (wrappers around basic string or integer arrays) with schema information about
the type of data in each column.
What does the Spark Engine do?
Spark Engine is responsible for scheduling, distributing and monitoring the data application across the cluster.
How does Spark partition the data?
Spark uses the MapReduce API (InputFormat) to partition the data. In the InputFormat we can set the number of partitions. By
default the HDFS block size is the partition size (for best performance), but it is possible to change the partition size, similar to a split.
Why are partitions immutable?
Every transformation generates a new partition. Partitions use the HDFS API, so a partition is immutable, distributed
and fault tolerant. Partitions are also aware of data locality.
How can you trigger automatic clean-ups in Spark to handle accumulated metadata?
You can trigger the clean-ups by setting the parameter ‘spark.cleaner.ttl’ or by dividing the long running jobs into
different batches and writing the intermediary results to the disk.
Which spark library allows reliable file sharing at memory speed across different cluster frameworks?
Tachyon
What is a Parquet in Spark?
Parquet is a columnar format file supported by many data processing systems. Spark SQL performs both read and
write operations with the Parquet file.
What is the advantage of a Parquet file?
Parquet file is a columnar format file that helps –
 Limit I/O operations
 Consumes less space
 Fetches only required columns.
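An illustrative Scala sketch using Spark SQL (the spark session is assumed to exist, and the file paths and the name column are hypothetical):
val df = spark.read.parquet("hdfs:///data/people.parquet")     // read a Parquet file into a DataFrame
df.select("name").write.parquet("hdfs:///data/names.parquet")  // only the required column is read and written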
What are the various data sources available in sparksql?
 Parquet file
 JSON Datasets
 Hive tables
AVRO vs Parquet
Avro | Parquet
It is a language-neutral data serialization system. | Parquet is a columnar format.
Row-based storage format for Hadoop. | Column-based storage format for Hadoop.
If your use case typically scans or retrieves all of the fields in a row in each query, Avro is usually the best choice. | If your dataset has many columns, and your use case typically involves working with a subset of those columns rather than entire records, Parquet is usually the better choice.
It has data formats that work with different languages. | Supported by many data processing systems.
Avro data is described using a language-independent schema (usually written in JSON). | Parquet is language independent.
How is Spark SQL different from HQL and SQL?
Spark SQL is a special component on the Spark Core engine that supports SQL and Hive Query Language (HQL) without changing
any syntax. It is possible to join an SQL table and an HQL table.
What is Spark SQL?
Spark SQL is a Spark interface to work with structured as well as semi-structured data. It has the capability to load
data from multiple structured sources like textfiles, JSON files, Parquet files, Hive among others. Spark SQL
provides a special type of RDD called schemardd. These are row objects, where each object represents a record.
Here’s how you can create SQL and Hive contexts in Spark SQL (in the Spark shell, where sc is the SparkContext):
SQL context: scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
Hive context: scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
What is dataframes?
A dataframe is a distributed collection of data organized into named columns. It is conceptually equivalent to a table
in a relational database or a data frame in R/Python, but with richer optimizations under the hood. Dataframes can be
constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing
rdds.
What is a Dataset?
A Dataset is a new experimental interface added in Spark 1.6 that tries to provide the benefits of rdds (strong typing,
ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine. A Dataset
can be constructed from JVM objects and then manipulated using functional transformations (map, flatmap, filter,
etc.).
The unified Dataset API can be used both in Scala and Java. Python does not yet have support for the Dataset API,
but due to its dynamic nature many of the benefits are already available (i.e. You can access the field of a row by
name naturally row.columnname). Full python support will be added in a future release.
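A short Scala sketch of the Dataset API, assuming a Spark shell where spark.implicits._ is available (the Person case class and its values are illustrative):
case class Person(name: String, age: Long)
import spark.implicits._                          // encoders for case classes
val ds = Seq(Person("Andy", 32), Person("Justin", 19)).toDS()
ds.filter(_.age > 20).map(_.name).show()          // typed lambdas, checked at compile time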
What is Spark Session?
SparkSession in Spark 2.0 provides builtin support for Hive features including the ability to write queries using
HiveQL, access to Hive UDFs, and the ability to read data from Hive tables.
With a SparkSession, applications can create DataFrames from an existing RDD, from a Hive table, or from Spark
data sources.
val df = spark.read.json("examples/src/main/resources/people.json")
// Displays the content of the DataFrame to stdout
df.show()
df.select("name").show()
// Select everybody, but increment the age by 1
df.select($"name", $"age" + 1).show()
// Select people older than 21
df.filter($"age" > 21).show()
// Count people by age
df.groupBy("age").count().show()
// Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")
val sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
Global Temporary View
// Register the DataFrame as a global temporary view
df.createGlobalTempView("people")
// Global temporary view is tied to a system preserved database `global_temp`
spark.sql("SELECT * FROM global_temp.people").show()
// +----+-------+
// | age| name|
// +----+-------+
// |null|Michael|
// | 30| Andy|
// | 19| Justin|
// +----+-------+
// Global temporary view is cross-session
spark.newSession().sql("SELECT * FROM global_temp.people").show()
Can we do real-time processing using Spark SQL?

Not directly but we can register an existing RDD as a SQL table and trigger SQL queries on top of that.
What is Hive on Spark?
Hive is a component of Hortonworks’ Data Platform (HDP). Hive provides an SQL-like interface to data stored in
the HDP. Spark users will automatically get the complete set of Hive’s rich features, including any new features that
Hive might introduce in the future.
The main task in implementing the Spark execution engine for Hive lies in query planning, where Hive
operator plans from the semantic analyzer are translated into a task plan that Spark can execute. It also includes
query execution, where the generated Spark plan gets actually executed in the Spark cluster.
What is Catalyst framework?
Catalyst framework is a new optimization framework present in Spark SQL. It allows Spark to automatically
transform SQL queries by adding new optimizations to build a faster processing system.
Why is blinkdb used?
Blinkdb is a query engine for executing interactive SQL queries on huge volumes of data and renders query results
marked with meaningful error bars. Blinkdb helps users balance ‘query accuracy’ with response time.
How Spark uses Hadoop?
Spark has its own cluster management computation and mainly uses Hadoop for storage.
Which one will you choose for a project –Hadoop mapreduce or Apache Spark?
The answer to this question depends on the given project scenario - as it is known that Spark makes use of memory
instead of network and disk I/O. However, Spark uses a large amount of RAM and requires a dedicated machine to
produce effective results. So the decision to use Hadoop or Spark varies dynamically with the requirements of the
project and budget of the organization.
Explain about the different types of transformations on dstreams?
 Stateless Transformations- Processing of the batch does not depend on the output of the previous batch. Examples –
map (), reducebykey (), filter ().
 Stateful Transformations- Processing of the batch depends on the intermediary results of the previous batch.
Examples –Transformations that depend on sliding windows.
Is Apache Spark a good fit for Reinforcement learning?
No. Apache Spark works well only for simple machine learning algorithms like clustering, regression, classification.
What is a Spark Driver?
Spark Driver is the program that runs on the master node of the machine and declares transformations and actions on
data rdds. The driver also delivers RDD graphs to the Master, where the standalone cluster manager runs.
How can you remove the elements with a key present in any other RDD?
Use the subtractByKey() function, as in the sketch below.
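A tiny Scala sketch (the key/value pairs are illustrative):
val left = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
val right = sc.parallelize(Seq(("b", 99)))
left.subtractByKey(right).collect()   // keys present in right are removed; the result contains ("a", 1) and ("c", 3)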
How Spark handles monitoring and logging in Standalone mode?
Spark has a web based user interface for monitoring the cluster in standalone mode that shows the cluster and job
statistics. The log output for each job is written to the work directory of the slave nodes.
How can you launch Spark jobs inside Hadoop mapreduce?
Using SIMR (Spark in mapreduce) users can run any spark job inside mapreduce without requiring any admin
rights.
How can you achieve high availability in Apache Spark?
 Implementing single node recovery with local file system
 Using standby Masters with Apache zookeeper.
Explain about the core components of a distributed Spark application.
 Driver- The process that runs the main () method of the program to create rdds and perform transformations and
actions on them.
 Executor –The worker processes that run the individual tasks of a Spark job.
 Cluster Manager-A pluggable component in Spark, to launch Executors and Drivers. The cluster manager allows
Spark to run on top of other external managers like Apache Mesos or YARN.
What are the disadvantages of using Apache Spark over Hadoop mapreduce?
Apache spark does not scale well for compute intensive jobs and consumes large number of system resources.
Apache Spark’s in-memory capability at times comes a major roadblock for cost efficient processing of big data.
Also, Spark does have its own file management system and hence needs to be integrated with other cloud based data
platforms or apache hadoop.
What makes Apache Spark good at low-latency workloads like graph processing and machine learning?
Apache Spark stores data in-memory for faster model building and training. Machine learning algorithms require
multiple iterations to generate a resulting optimal model, and similarly graph algorithms traverse all the nodes and
edges. These low-latency workloads that need multiple iterations can see increased performance. Less disk access
and controlled network traffic make a huge difference when there is a lot of data to be processed.
What is caching in Spark?
Spark keeps data in memory for computation rather than going to disk, so it can access cached data up to 100 times
faster than Hadoop.
How does Spark store data?
Spark is a processing engine; there is no storage engine. It can retrieve data from any storage engine like HDFS, S3
and other data resources.
Is it mandatory to start Hadoop to run a Spark application?
No, it is not mandatory, but since there is no separate storage in Spark, it uses the local file system to store data. You can load
data from the local system and process it; Hadoop or HDFS is not mandatory to run a Spark application.
Feature comparison: RDD vs DataFrames vs Datasets
Data Representation
o RDD: a distributed collection of data elements spread across many machines in the cluster.
o DataFrame: a distributed collection of data organized into named columns; conceptually equal to a table in a relational database.
o Dataset: an extension of the DataFrame API that provides a type-safe, object-oriented programming interface like the RDD API.
Data Formats
o RDD: can easily and efficiently process data which is structured as well as unstructured. RDD does not infer the schema of the ingested data and requires the user to specify it.
o DataFrame: can process structured and unstructured data efficiently. It organizes the data into named columns, and DataFrames allow Spark to manage the schema.
o Dataset: also efficiently processes structured and unstructured data. It represents data as JVM objects of row, or as a collection of row objects, represented in tabular form through encoders.
Data Sources API
o RDD: the data source API allows an RDD to come from any data source, e.g. a text file or a database via JDBC, and to easily handle data with no predefined structure.
o DataFrame: the data source API allows data processing in different formats (Avro, CSV, JSON) and storage systems (HDFS, Hive tables, MySQL). It can read from and write to various data sources.
o Dataset: the Dataset API of Spark also supports data from different sources.
Immutability and Interoperability
o RDD: RDDs contain a collection of records which are partitioned. The basic unit of parallelism in an RDD is called a partition. Each partition is one logical division of data, which is immutable and created through some transformation on existing partitions.
o DataFrame: after transforming into a DataFrame one cannot regenerate a domain object. For example, if you generate testDF from testRDD, you won’t be able to recover the original RDD of the test class.
o Dataset: overcomes the limitation of DataFrame to regenerate the RDD from a DataFrame. Datasets allow you to convert your existing RDDs and DataFrames into Datasets.
Compile-time type safety
o RDD: provides a familiar object-oriented programming style with compile-time type safety.
o DataFrame: if you try to access a column which does not exist in the table, the DataFrame API does not raise a compile-time error; it detects attribute errors only at runtime.
o Dataset: provides compile-time type safety.
Optimization
o RDD: no inbuilt optimization engine is available; developers optimize each RDD on the basis of its attributes.
o DataFrame: optimization takes place using the Catalyst optimizer. DataFrames use the Catalyst tree transformation framework in four phases: analyzing a logical plan to resolve references, logical plan optimization, physical planning, and code generation to compile parts of the query to Java bytecode.
o Dataset: includes the DataFrame Catalyst optimizer for optimizing the query plan.
Serialization
o RDD: whenever Spark needs to distribute the data within the cluster or write the data to disk, it does so using Java serialization. The overhead of serializing individual Java and Scala objects is expensive and requires sending both data and structure between nodes.
o DataFrame: can serialize the data into off-heap storage (in memory) in binary format and then perform many transformations directly on this off-heap memory, because Spark understands the schema; there is no need to use Java serialization to encode the data. It provides a Tungsten physical execution backend which explicitly manages memory and dynamically generates bytecode for expression evaluation.
o Dataset: the Dataset API has the concept of an encoder which handles conversion between JVM objects and the tabular representation. It stores the tabular representation using the Spark-internal Tungsten binary format. Datasets allow performing operations on serialized data and improve memory use, with on-demand access to individual attributes without deserializing the entire object.
Garbage Collection
o RDD: there is overhead for garbage collection that results from creating and destroying individual objects.
o DataFrame: avoids the garbage collection costs of constructing individual objects for each row in the dataset.
o Dataset: there is also no need for the garbage collector to destroy objects, because serialization takes place through Tungsten, which uses off-heap data serialization.
Efficiency / Memory use
o RDD: efficiency decreases when serialization is performed individually on Java and Scala objects, which takes a lot of time.
o DataFrame: use of off-heap memory for serialization reduces the overhead. It generates bytecode dynamically so that many operations can be performed on the serialized data; small operations do not need deserialization.
o Dataset: allows performing operations on serialized data and improves memory use, with on-demand access to individual attributes without deserializing the entire object.
Schema Projection
o RDD: in the RDD API, schema projection is used explicitly; we need to define the schema manually.
o DataFrame: auto-discovers the schema from the files and exposes them as tables through the Hive metastore, so standard SQL clients can connect to the engine and explore the dataset without defining the schema of the files.
o Dataset: auto-discovers the schema of the files by using the Spark SQL engine.
Aggregation
o RDD: the RDD API is slower for simple grouping and aggregation operations.
o DataFrame: the DataFrame API is very easy to use and is faster for exploratory analysis and creating aggregated statistics on large data sets.
o Dataset: in Datasets it is faster to perform aggregation operations on plenty of data sets.
What is the bottom layer of abstraction in the Spark Streaming API ?
Dstream.
What is Spark Streaming?
Spark supports stream processing, essentially an extension to the Spark API. This allows stream processing of live
data streams. The data from different sources like Flume and HDFS is streamed and processed to file systems, live
dashboards and databases. It is similar to batch processing as the input data is divided into streams like batches.
Business use cases for Spark Streaming: each Spark component has its own use case. Whenever you want to analyze
data with a latency of less than 15 minutes and greater than 2 minutes, i.e., near real time, Spark Streaming is what you use.
The data can be processed using complex algorithms expressed with high-level functions like map, reduce, join
and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you
can apply Spark’s machine learning and graph processing algorithms on data streams.
When do we use Spark Streaming?
Spark Streaming is an API for real-time processing of streaming data. It gathers streaming data from different sources such as web server log files, social media feeds and stock market data, or from Hadoop-ecosystem tools like Flume and Kafka.
How does the Spark Streaming API work?
Internally, it works as follows. Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.
Spark Streaming provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs. The core engine generates the final results as a stream of batches, and the output is also in the form of batches, which lets streaming data and batch data be processed in the same way.
Spark Streaming programs can be written in Scala, Java or Python (Python support was introduced in Spark 1.2), all of which work with DStreams.
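A minimal sketch of such a program in PySpark, assuming a socket source on localhost:9999 (the host, port and batch interval are placeholders):
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 10)                      # 10-second batch interval

# Each batch of lines received on the socket becomes an RDD inside the DStream.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.pprint()            # output operation: print each batch of results
ssc.start()                # start receiving and processing
ssc.awaitTermination()     # keep the streaming job running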
What is the significance of the Sliding Window operation?
Spark Streaming provides windowed computations, where transformations on RDDs are applied over a sliding window of data: you specify a window length and a sliding interval, both multiples of the batch interval. Whenever the window slides, the RDDs that fall within the particular window are combined and operated upon to produce the new RDDs of the windowed DStream.
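A hedged sketch of a windowed word count built on the lines DStream and ssc from the previous snippet (window of 30 seconds sliding every 10 seconds; the checkpoint directory is a placeholder). The inverse-reduce variant of reduceByKeyAndWindow requires checkpointing:
# Checkpointing is required when the inverse-reduce form of the window is used.
ssc.checkpoint("/tmp/spark-streaming-checkpoint")   # placeholder path

pairs = lines.flatMap(lambda line: line.split()).map(lambda word: (word, 1))

# Word counts over the last 30 seconds of data, recomputed every 10 seconds.
windowed_counts = pairs.reduceByKeyAndWindow(
    lambda a, b: a + b,    # add counts of batches entering the window
    lambda a, b: a - b,    # subtract counts of batches leaving the window
    30,                    # window length in seconds
    10)                    # sliding interval in seconds

windowed_counts.pprint()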
What is a DStream?
A Discretized Stream (DStream) is a sequence of Resilient Distributed Datasets (RDDs) representing a stream of data that is processed as micro-batches. DStreams can be created from various sources like Apache Kafka, HDFS, and Apache Flume.
DStreams support two kinds of operations:
 Transformations, which produce a new DStream.
 Output operations, which write data to an external system.
What is the biggest shortcoming of Spark?
Spark utilizes more storage space compared to Hadoop MapReduce.
Also, Spark Streaming is not true record-at-a-time streaming: it uses micro-batching, so some window functions cannot work properly on top of micro-batches.
Explain the major libraries that constitute the Spark ecosystem.
 Spark MLlib – the machine learning library in Spark for commonly used learning algorithms like clustering, regression, classification, etc.
 Spark Streaming – this library is used to process real-time streaming data.
 Spark GraphX – the Spark API for graph-parallel computations with basic operators like joinVertices, subgraph, aggregateMessages, etc.
 Spark SQL – helps execute SQL-like queries on Spark data using standard visualization or BI tools.
Explain the different cluster managers in Apache Spark.
The three different cluster managers supported in Apache Spark are:
 YARN
 Apache Mesos – has rich resource scheduling capabilities and is well suited to run Spark along with other applications. It is advantageous when several users run interactive shells because it scales down the CPU allocation between commands.
 Standalone deployments – well suited for new deployments which only run Spark and are easy to set up.
How can Spark be connected to Apache Mesos?
To connect Spark with Mesos-
 Configure the spark driver program to connect to Mesos. Spark binary package should be in a location accessible by
Mesos. (or)
 Install Apache Spark in the same location as that of Apache Mesos and configure the property
‘spark.mesos.executor.home’ to point to the location where it is installed.
Is it possible to run Apache Spark on Apache Mesos?
Yes, Apache Spark can be run on the hardware clusters managed by Mesos.
Is it possible to run Spark and Mesos along with Hadoop?
Yes, it is possible to run Spark and Mesos with Hadoop by launching each of these as a separate service on the
machines. Mesos acts as a unified scheduler that assigns tasks to either Spark or Hadoop.
What are the benefits of using Spark with Apache Mesos?
It renders scalable partitioning among various Spark instances and dynamic partitioning between Spark and other
big data frameworks.
When running Spark applications, is it necessary to install Spark on all the nodes of YARN cluster?
Spark need not be installed when running a job under YARN or Mesos because Spark can execute on top of YARN
or Mesos clusters without affecting any change to the cluster.
What are the functions of Spark Core?
Spark Core performs an array of critical functions like memory management, monitoring jobs, fault tolerance, job scheduling and interaction with storage systems.
It is the foundation of the overall project. It provides distributed task dispatching, scheduling, and basic input and output functionality. The RDD in Spark Core makes it fault tolerant: an RDD is a collection of items distributed across many nodes that can be manipulated in parallel. Spark Core provides many APIs for building and manipulating these collections.
What is Spark MLlib?
Just as Mahout is a machine learning library for Hadoop, MLlib is Spark's machine learning library. MLlib provides different algorithms that scale out on the cluster for data processing, and most data scientists use this library.
What is GraphX?
GraphX is the Spark API for manipulating graphs and graph-parallel collections. It unifies ETL, exploratory analysis, and iterative graph computation. It is among the fastest graph systems and provides fault tolerance and ease of use without requiring special skills.
What is the function of MLlib?
MLlib is Spark's machine learning library. It aims at making machine learning easy and scalable with common learning algorithms and real-life use cases including clustering, regression, filtering, and dimensionality reduction, among others.
What is File System API?
FS API can read data from different storage devices like HDFS, S3 or local filesystem. Spark uses FS API to read
data from different storage engines.
Say I have a huge list of numbers in an RDD (say myrdd), and I wrote the following code to compute the average:
def myavg(x, y):
    return (x + y) / 2.0
avg = myrdd.reduce(myavg)
What is wrong with it? And how would you correct it?
The average function is not commutative and associative. I would simply sum the values and then divide by the count:
def sum(x, y):
    return x + y
total = myrdd.reduce(sum)
avg = total / myrdd.count()
The only problem with the above code is that the total might become very big and overflow. So, I would rather divide each number by the count first and then sum, in the following way:
cnt = myrdd.count()
def dividebycnt(x):
    return x / cnt
myrdd1 = myrdd.map(dividebycnt)
avg = myrdd1.reduce(sum)
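As a quick sanity check, PySpark's built-in helper computes the same average without a hand-written reducer (the sample numbers are illustrative):
# Hypothetical RDD of numbers; sc is an existing SparkContext.
myrdd = sc.parallelize([1.0, 2.0, 3.0, 4.0])

# mean() maintains a running, numerically stable aggregate internally,
# so it avoids both the non-associativity and the overflow problems above.
print(myrdd.mean())        # 2.5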
Say I have a huge list of numbers in a file in HDFS. Each line has one number, and I want to compute the square root of the sum of squares of these numbers. How would you do it?
# First load the file as an RDD from HDFS on Spark
numsastext = sc.textFile("hdfs://hadoop1.knowbigdata.com/user/student/sgiri/mynumbersfile.txt")
# Define the function to compute the squares
def tosqint(s):
    v = int(s)
    return v * v
# Run the function on the Spark RDD as a transformation
nums = numsastext.map(tosqint)
# Run the summation as a reduce action
total = nums.reduce(lambda x, y: x + y)
# Finally compute the square root, for which we need to import math
import math
print math.sqrt(total)
Is the following approach correct? Is sqrtofsumofsq a valid reducer?
numsastext = sc.textFile("hdfs://hadoop1.knowbigdata.com/user/student/sgiri/mynumbersfile.txt")
def toint(s):
    return int(s)
nums = numsastext.map(toint)
import math
def sqrtofsumofsq(x, y):
    return math.sqrt(x * x + y * y)
total = nums.reduce(sqrtofsumofsq)
print total
A: Yes. The approach is correct and sqrtofsumofsq is a valid reducer: combining two partial results a and b gives sqrt(a*a + b*b), so applying it repeatedly yields the square root of the sum of squares regardless of the order in which the values are combined.
Could you compare the pros and cons of your approach (in Question 2 above) and my approach (in Question 3 above)?
You are doing the square and the square root as part of the reduce action, while I am squaring in map() and summing in reduce().
My approach will be faster, because in your case the reducer code is heavier: it calls math.sqrt(), and the reducer code is generally executed approximately n-1 times over the Spark RDD.
The only downside of my approach is that there is a chance of integer overflow, because I am computing the sum of squares as part of the map.
If you have to compute the total counts of each of the unique words on Spark, how would you go about it?
# This will load bigtextfile.txt as an RDD in Spark
lines = sc.textFile("hdfs://hadoop1.knowbigdata.com/user/student/sgiri/bigtextfile.txt")
# Define a function that can break each line into words
def towords(line):
    return line.split()
# Run the towords function on each element of the RDD as a flatMap transformation.
# We use flatMap instead of map because our function returns multiple values.
words = lines.flatMap(towords)
# Convert each word into a (key, value) pair. Here the key will be the word itself and the value will be 1.
def totuple(word):
    return (word, 1)
wordstuple = words.map(totuple)
# Now we can easily do the reduceByKey() operation.
def sum(x, y):
    return x + y
counts = wordstuple.reduceByKey(sum)
# Now, print the result
counts.collect()
In a very huge text file, you want to just check if a particular keyword exists. How would you do this using Spark?
lines = sc.textFile("hdfs://hadoop1.knowbigdata.com/user/student/sgiri/bigtextfile.txt")
mykeyword = "Adventures"   # the keyword to look for
def isfound(line):
    if line.find(mykeyword) > -1:
        return 1
    return 0
foundbits = lines.map(isfound)
total = foundbits.reduce(lambda x, y: x + y)
if total > 0:
    print "FOUND"
else:
    print "NOT FOUND"
Can you improve the performance of the code in the previous answer?
Yes. The search does not stop even after the word we are looking for has been found: our map code keeps executing on all the nodes, which is very inefficient. We could use an accumulator to report whether the word has been found and then stop the job. Something along these lines:
import thread, threading
from time import sleep

result = "Not Set"
lock = threading.Lock()
accum = sc.accumulator(0)

def map_func(line):
    # introduce delay to emulate slowness
    sleep(1)
    if line.find("Adventures") > -1:
        accum.add(1)
        return 1
    return 0

def start_job():
    global result
    try:
        sc.setJobGroup("job_to_cancel", "some description")
        lines = sc.textFile("hdfs://hadoop1.knowbigdata.com/user/student/sgiri/wordcount/input/big.txt")
        result = lines.map(map_func)
        result.take(1)
    except Exception as e:
        result = "Cancelled"
    lock.release()

def stop_job():
    while accum.value < 3:
        sleep(1)
    sc.cancelJobGroup("job_to_cancel")

suppress = lock.acquire()
suppress = thread.start_new_thread(start_job, tuple())
suppress = thread.start_new_thread(stop_job, tuple())
suppress = lock.acquire()
Which file systems does Spark support?
 Hadoop Distributed File System (HDFS)
 Local File system
 S3
What is YARN?
YARN (Yet Another Resource Negotiator) is Hadoop's large-scale, distributed resource management layer for big data applications. Spark can run on YARN, which then provides a central resource management platform to deliver scalable operations across the cluster.
Define PageRank.
PageRank is a measure of the importance of each vertex in a graph, based on the number and quality of the edges pointing to it; it is one of the algorithms shipped with GraphX.
Apache Spark Examples Input
On the top of the Crumpetty Tree
The Quangale Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
The above input file contains many characters. The task is to produce, for each frequency, the list of characters that appear that many times (which characters appear once, which appear twice, and so on):
(1,(d, ., v, H, Q, C, W, m))
(2,(T, p, B, ,, O, i, y, g))
(3,(f, l, r))
(4,(, s, c))
(5,(h))
(6,(n, u))
(7,(o))
(8,(a))
(10,(t))
(13,(e))
(20,( ))
Here are the steps for the Apache Spark character count example.
Starting Spark on the terminal:
start-master.sh
start-slave.sh spark://<your hostname>:7077
spark-shell
Step 1: Load the text file from local storage or HDFS
val textfile = sc.textFile(inputpath)
Step 2: Mapper
val chars = textfile.flatMap(line => line.split("")).map(char => (char, 1))
Step 3: Reducer
val counts = chars.reduceByKey(_ + _)
counts.collect()
Step 4: Swap key and value to value and key
val reversemap = for ((k, v) <- counts) yield (v, k)
Step 5: Group the values by key and save as a text file
reversemap.groupByKey().sortByKey().coalesce(1, true).saveAsTextFile(outputpath)
Apache Spark Example Program
val textfile = sc.textFile(inputpath)
val chars = textfile.flatMap(line => line.split("")).map(char => (char, 1))
val counts = chars.reduceByKey(_ + _)
val reversemap = for ((k, v) <- counts) yield (v, k)
reversemap.groupByKey().sortByKey().coalesce(1, true).saveAsTextFile(outputpath)
Apache Spark Examples on Character Count Output
(1,(d, ., v, H, Q, C, W, m))
(2,(T, p, B, ,, O, i, y, g))
(3,(f, l, r))
(4,(, s, c))
(5,(h))
(6,(n, u))
(7,(o))
(8,(a))
(10,(t))
(13,(e))
(20,( ))
So this is the Apache Spark example of character count using the Scala language.
If you have configured Java version 8 for Hadoop and Java version 7 for Apache Spark, how will you set the environment variables in the basic configuration file?
Typically, you would point JAVA_HOME to the Java 8 installation in Hadoop's hadoop-env.sh and point JAVA_HOME to the Java 7 installation in Spark's conf/spark-env.sh, so that each framework launches with its own JVM.
KAFKA
What is the main difference between Kafka and Flume?
 Data flow: Kafka is pull-based, while Flume is push-based.
 Hadoop integration: Kafka's Hadoop integration is loose, while Flume's is tight.
 Functionality: Kafka is a publish-subscribe messaging system, while Flume is a system for data collection, aggregation and movement.
What is Kafka?
Apache Kafka is publish-subscribe messaging rethought as a distributed commit log; it acts as a message broker.
Fast: A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of
clients.
Scalable: Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the
capability of any single machine and to allow clusters of co-ordinated consumers
Durable: Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can
handle terabytes of messages without performance impact.
Distributed by Design: Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance
guarantees.
Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging
system, but with a unique design.
List the various components in Kafka.
The four major components of Kafka are:
 Topic – a stream of messages belonging to the same type
 Producer – publishes messages to a topic
 Brokers – a set of servers where the published messages are stored
 Consumer – subscribes to various topics and pulls data from the brokers.
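A minimal sketch of a producer and a consumer using the third-party kafka-python client (the broker address, topic name and consumer group are placeholders):
from kafka import KafkaProducer, KafkaConsumer

# Producer: publishes messages to a topic held by the brokers.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("page-views", value=b"user-42 viewed /index.html")
producer.flush()

# Consumer: subscribes to the topic and pulls data from the brokers.
consumer = KafkaConsumer("page-views",
                         bootstrap_servers="localhost:9092",
                         group_id="analytics",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.topic, message.partition, message.offset, message.value)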
Elaborate on the Kafka architecture.
A cluster contains multiple brokers, since Kafka is a distributed system. Each topic is divided into multiple partitions, and each broker stores one or more of those partitions, so that multiple producers and consumers can publish and retrieve messages at the same time.
At a high level, producers send messages over the network to the Kafka cluster, which in turn serves them up to consumers.
What is the maximum size of a message that the Kafka server can receive?
By default, the maximum size of a message that the Kafka server can receive is 1,000,000 bytes (roughly 1 MB); this limit is configurable on the broker.
Explain the role of the offset.
Messages contained in the partitions are assigned a unique ID number called the offset. The role of the offset is to uniquely identify every message within the partition.
The consumer's auto.offset.reset setting controls what happens when there is no valid committed offset:
earliest: automatically reset the offset to the earliest offset
latest: automatically reset the offset to the latest offset
none: throw an exception to the consumer if no previous offset is found for the consumer's group
anything else: throw an exception to the consumer.
Where is the offset stored - in ZooKeeper or in Kafka?
Older versions of Kafka (pre 0.9) store offsets only in ZooKeeper, while newer versions of Kafka by default store offsets in an internal Kafka topic called __consumer_offsets (newer versions might still commit to ZooKeeper, though).
The advantage of committing offsets to the broker is that the consumer does not depend on ZooKeeper, and thus clients only need to talk to brokers, which simplifies the overall architecture. If you use brokers at version 0.10.1.0, you can commit offsets to the topic __consumer_offsets.
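For illustration, a kafka-python sketch of committing offsets to the broker (stored in __consumer_offsets) and reading the committed offset back; the topic, partition and group names are placeholders:
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092",
                         group_id="analytics",
                         enable_auto_commit=False,     # commit explicitly below
                         auto_offset_reset="earliest")
tp = TopicPartition("page-views", 0)
consumer.assign([tp])

for message in consumer:
    print(message.offset, message.value)
    consumer.commit()       # offset goes to the broker's __consumer_offsets topic
    break

print(consumer.committed(tp))   # last committed offset for this partition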
What is a Consumer Group?
Consumer Groups is a concept exclusive to Kafka. Every Kafka consumer group consists of one or more consumers
that jointly consume a set of subscribed topics.
What is the role of ZooKeeper?
Kafka is an open-source distributed system that is built to use ZooKeeper. The basic responsibility of ZooKeeper is to coordinate the different nodes in a cluster. In older versions, consumer offsets were periodically committed to ZooKeeper, so that if any node failed, the previously committed offset could be used for recovery.
ZooKeeper is also responsible for configuration management, leader detection, detecting when any node leaves or joins the cluster, synchronization, etc.
Kafka uses ZooKeeper to store offsets of messages consumed for a specific topic and partition by a specific consumer group.
Is it possible to use Kafka without ZooKeeper?
No, it is not possible to bypass ZooKeeper and connect directly to the Kafka server. If, for some reason, ZooKeeper is down, you cannot service any client request.
Explain the concept of Leader and Follower.
Every partition in Kafka has one server which plays the role of Leader, and zero or more servers that act as Followers. The Leader performs all read and write requests for the partition, while the Followers passively replicate the Leader. If the Leader fails, one of the Followers takes over the role of Leader. This ensures load balancing across the servers.
What are consumers or users?
Messaging traditionally has two models: queuing and publish-subscribe. In a queue, a pool of consumers may read
from a server and each message goes to one of them; in publish-subscribe the message is broadcast to all consumers.
Kafka offers a single consumer abstraction that generalizes both of these—the consumer group. Consumers label
themselves with a consumer group name, and each message published to a topic is delivered to one consumer
instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate
machines.
 If all the consumer instances have the same consumer group, then this works just like a traditional queue
balancing load over the consumers.
 If all the consumer instances have different consumer groups, then this works like publish-subscribe and all
messages are broadcast to all consumers.
Explain how a message is consumed by a consumer in Kafka.
Transfer of messages in Kafka is done using the sendfile API. It enables the transfer of bytes directly from the file (page cache) to the socket in kernel space, saving the extra copies and context switches between kernel space and user space.
Explain how you can improve the throughput of a remote consumer?
If the consumer is located in a different data center from the broker, you may require to tune the socket buffer size to
amortize the long network latency.
Explain how you can get exactly-once messaging from Kafka during data production.
To get exactly-once messaging from Kafka you have to take care of two things: avoiding duplicates during data production and avoiding duplicates during data consumption.
Here are two ways to get exactly-once semantics during data production:
 Use a single writer per partition, and every time you get a network error, check the last message in that partition to see whether your last write succeeded.
 Include a primary key (UUID or similar) in the message and de-duplicate on the consumer.
What roles do Replicas and the ISR play?
Replicas are essentially a list of nodes that replicate the log for a particular partition irrespective of whether they
play the role of the Leader. On the other hand, ISR stands for In-Sync Replicas. It is essentially a set of message
replicas that are synced to the leaders.
Explain how you can reduce churn in ISR? When does broker leave the ISR?
ISR is a set of message replicas that are completely synced up with the leaders, in other word ISR has all messages
that are committed. ISR should always include all replicas until there is a real failure. A replica will be dropped out
of ISR if it deviates from the leader.
Why are Replications critical in Kafka?
Replication ensures that published messages are not lost and can be consumed in the event of any machine error,
program error or frequent software upgrades. It is defined for particular topic
If a Replica stays out of the ISR for a long time, what does it signify?
It means that the Follower is unable to fetch data as fast as data accumulated by the Leader.
Mention what happens if the preferred replica is not in the ISR?
If the preferred replica is not in the ISR, the controller will fail to move leadership to the preferred replica.
Is it possible to get the message offset after producing?
You cannot do that from a class that behaves as a producer, because, as in most queue systems, its role is to fire and forget the messages. The broker does the rest of the work, such as appropriate metadata handling with IDs, offsets, etc.
As a consumer of the message, you can get the offset from a Kafka broker. If you look at the SimpleConsumer class, you will notice it fetches MultiFetchResponse objects that include offsets as a list. In addition, when you iterate over the Kafka messages, you will have MessageAndOffset objects that include both the offset and the message sent.
What is the process for starting a Kafka server?
Since Kafka uses zookeeper, it is essential to initialize the zookeeper server, and then fire up the Kafka server.
 To start the zookeeper server: > bin/zookeeper-server-start.sh config/zookeeper.properties
 Next, to start the Kafka server: > bin/kafka-server-start.sh config/server.properties
How do you define a Partitioning Key?
Within the Producer, the role of a partitioning key is to indicate the destination partition of the message. By default, a hashing-based partitioner is used to determine the partition ID given the key. Alternatively, users can also plug in custom partitioners.
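A hedged kafka-python sketch of keyed sends: with the default hashing partitioner, all messages that share a key land in the same partition (topic name and key are placeholders):
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# The default partitioner hashes the key, so both messages below
# are routed to the same partition of the topic.
producer.send("page-views", key=b"user-42", value=b"clicked /home")
producer.send("page-views", key=b"user-42", value=b"clicked /cart")
producer.flush()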
In the Producer, when does QueueFullException occur?
QueueFullException typically occurs when the Producer attempts to send messages at a pace the Broker cannot handle. Since the Producer doesn't block, users will need to add enough brokers to collaboratively handle the increased load.
Explain the role of the Kafka Producer API.
The role of Kafka's Producer API is to wrap the two producers – kafka.producer.SyncProducer and kafka.producer.async.AsyncProducer. The goal is to expose all the producer functionality through a single API to the client.
Even though both are used for real-time processing, Kafka is scalable and ensures message durability.
Communication between the clients and the servers is done with a simple, high-performance, language
agnostic TCP protocol. We provide a Java client for Kafka, but clients are available in many languages.
Topics and Logs
Let's first dive into the high-level abstraction Kafka provides—the topic. A topic is a category or feed name to which
messages are published. For each topic, the Kafka cluster maintains a partitioned log that looks like this:
Each partition is an ordered, immutable sequence of messages that is continually appended to—a commit log. The
messages in the partitions are each assigned a sequential id number called the offset that uniquely identifies each
message within the partition.
The Kafka cluster retains all published messages—whether or not they have been consumed—for a configurable
period of time. For example if the log retention is set to two days, then for the two days after a message is published
it is available for consumption, after which it will be discarded to free up space. Kafka's performance is effectively
constant with respect to data size so retaining lots of data is not a problem.
In fact the only metadata retained on a per-consumer basis is the position of the consumer in the log, called the
offset. This offset is controlled by the consumer: normally a consumer will advance its offset linearly as it reads
messages, but in fact the position is controlled by the consumer and it can consume messages in any order it likes.
For example a consumer can reset to an older offset to reprocess.
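A small kafka-python sketch of that consumer-controlled position, rewinding to an older offset to reprocess messages (topic, partition and offset value are placeholders):
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092", group_id="replayer")
tp = TopicPartition("page-views", 0)
consumer.assign([tp])

consumer.seek(tp, 100)      # jump back to offset 100 within this partition
for message in consumer:
    print(message.offset, message.value)   # messages are re-read from offset 100
    break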
How is fault tolerance achieved in Kafka?
The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and
requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault
tolerance.
Each partition has one server which acts as the leader and zero or more servers which act as followers. The leader
handles all read and write requests for the partition while the followers passively replicate the leader. If the leader
fails, one of the followers will automatically become the new leader. Each server acts as a leader for some of its
partitions and a follower for others so load is well balanced within the cluster.
What are Producers?
Producers publish data to the topics of their choice. The producer is responsible for choosing which message to
assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load or it can
be done according to some semantic partition function (say based on some key in the message). More on the use of
partitioning in a second.
A two server Kafka cluster hosting four partitions (P0-P3) with two consumer groups. Consumer group A
has two consumer instances and group B has four.
Kafka has stronger ordering guarantees than a traditional messaging system, too.
A traditional queue retains messages in-order on the server, and if multiple consumers consume from the queue then
the server hands out messages in the order they are stored. However, although the server hands out messages in
order, the messages are delivered asynchronously to consumers, so they may arrive out of order on different
consumers. This effectively means the ordering of the messages is lost in the presence of parallel consumption.
Messaging systems often work around this by having a notion of exclusive consumer that allows only one process to
consume from a queue, but of course this means that there is no parallelism in processing.
Kafka does it better. By having a notion of parallelism—the partition—within the topics, Kafka is able to provide
both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the
partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one
consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes
the data in order. Since there are many partitions this still balances the load over many consumer instances. Note
however that there cannot be more consumer instances in a consumer group than partitions.
Kafka only provides a total order over messages within a partition, not between different partitions in a topic. Per-
partition ordering combined with the ability to partition data by key is sufficient for most applications. However, if
you require a total order over messages this can be achieved with a topic that has only one partition, though this will
mean only one consumer process per consumer group.
Guarantees
At a high-level Kafka gives the following guarantees:
 Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That
is, if a message M1 is sent by the same producer as a message M2, and M1 is sent first, then M1 will have a
lower offset than M2 and appear earlier in the log.
 A consumer instance sees messages in the order they are stored in the log.
 For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any messages
committed to the log.
STORM
Why use Storm?
Storm makes it easy to reliably process unbounded streams of data, doing for real time processing what Hadoop did
for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use!
Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC,
ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is
scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.
Storm integrates with the queueing and database technologies you already use. A Storm topology consumes streams
of data and processes those streams in arbitrarily complex ways, repartitioning the streams between each stage of the
computation however needed.
Components of a Storm cluster
A Storm cluster is superficially similar to a Hadoop cluster. Whereas on Hadoop you run MapReduce jobs, on Storm you run topologies. Jobs and topologies themselves are very different -- one key difference is that a MapReduce job eventually finishes, whereas a topology processes messages forever (or until you kill it).
There are two kinds of nodes on a Storm cluster: the master node and the worker nodes. The master node runs a daemon called Nimbus that is similar to Hadoop's JobTracker. Nimbus is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures.
Each worker node runs a daemon called the Supervisor. The supervisor listens for work assigned to its machine and
starts and stops worker processes as necessary based on what Nimbus has assigned to it. Each worker process
executes a subset of a topology; a running topology consists of many worker processes spread across many
machines.
All coordination between Nimbus and the Supervisors is done through a Zookeeper cluster. Additionally, the
Nimbus daemon and Supervisor daemons are fail-fast and stateless; all state is kept in Zookeeper or on local disk.
This means you can kill -9 Nimbus or the Supervisors and they'll start back up like nothing happened. This design
leads to Storm clusters being incredibly stable.
Topologies
To do realtime computation on Storm, you create what are called topologies. A topology is a graph of computation.
Each node in a topology contains processing logic, and links between nodes indicate how data should be passed
around between nodes.
Running a topology is straightforward. First, you package all your code and dependencies into a single jar. Then,
you run a command like the following:
storm jar all-my-code.jar backtype.storm.MyTopology arg1 arg2
This runs the class backtype.storm.MyTopology with the arguments arg1 and arg2. The main function of the class defines the topology and submits it to Nimbus. The storm jar part takes care of connecting to Nimbus and uploading the jar.
Since topology definitions are just Thrift structs, and Nimbus is a Thrift service, you can create and submit
topologies using any programming language. The above example is the easiest way to do it from a JVM-based
language. See Running topologies on a production cluster for more information on starting and stopping topologies.
Streams
The core abstraction in Storm is the stream. A stream is an unbounded sequence of tuples. Storm provides the
primitives for transforming a stream into a new stream in a distributed and reliable way. For example, you may
transform a stream of tweets into a stream of trending topics. Streams are defined with a schema that names the fields
in the stream's tuples. By default, tuples can contain integers, longs, shorts, bytes, strings, doubles, floats, booleans,
and byte arrays. You can also define your own serializers so that custom types can be used natively within tuples.
Every stream is given an id when declared. Since single-stream spouts and bolts are so common, OutputFieldsDeclarer has convenience methods for declaring a single stream without specifying an id. In this case, the stream is given the default id of "default".
Resources:
Tuple: streams are composed of tuples
OutputFieldsDeclarer: used to declare streams and their schemas
Serialization: information about Storm's dynamic typing of tuples and declaring custom serializations
ISerialization: custom serializers must implement this interface
CONFIG.TOPOLOGY_SERIALIZATIONS: custom serializers can be registered using this configuration
The basic primitives Storm provides for doing stream transformations are spouts and bolts. Spouts and bolts have
interfaces that you implement to run your application-specific logic.
A spout is a source of streams. For example, a spout may read tuples off of a Kestrel queue and emit them as a
stream. Or a spout may connect to the Twitter API and emit a stream of tweets.
Spouts
A spout is a source of streams in a topology. Generally spouts will read tuples from an external source and emit
them into the topology (e.g. A Kestrel queue or the Twitter API). Spouts can either be reliable or unreliable. A
reliable spout is capable of replaying a tuple if it failed to be processed by Storm, whereas an unreliable spout
forgets about the tuple as soon as it is emitted.
Spouts can emit more than one stream. To do so, declare multiple streams using the declareStream method of OutputFieldsDeclarer and specify the stream to emit to when using the emit method on SpoutOutputCollector.
The main method on spouts is nextTuple. nextTuple either emits a new tuple into the topology or simply returns if there are no new tuples to emit. It is imperative that nextTuple does not block for any spout implementation, because Storm calls all the spout methods on the same thread.
The other main methods on spouts are ack and fail. These are called when Storm detects that a tuple emitted from the spout either successfully completed through the topology or failed to be completed. ack and fail are only called for reliable spouts. See the Javadoc for more information.
Resources:
IRichSpout: this is the interface that spouts must implement.
Guaranteeing message processing
A bolt consumes any number of input streams, does some processing, and possibly emits new streams. Complex
stream transformations, like computing a stream of trending topics from a stream of tweets, require multiple steps
and thus multiple bolts. Bolts can do anything from run functions, filter tuples, do streaming aggregations, do
streaming joins, talk to databases, and more.
Bolts
All processing in topologies is done in bolts. Bolts can do anything from filtering, functions, aggregations, joins,
talking to databases, and more.
Bolts can do simple stream transformations. Doing complex stream transformations often requires multiple steps and
thus multiple bolts. For example, transforming a stream of tweets into a stream of trending images requires at least
two steps: a bolt to do a rolling count of retweets for each image, and one or more bolts to stream out the top X
images (you can do this particular stream transformation in a more scalable way with three bolts than with two).
Bolts can emit more than one stream. To do so, declare multiple streams using the declareStream method of OutputFieldsDeclarer and specify the stream to emit to when using the emit method on OutputCollector.
When you declare a bolt's input streams, you always subscribe to specific streams of another component. If you want to subscribe to all the streams of another component, you have to subscribe to each one individually. InputDeclarer has syntactic sugar for subscribing to streams declared on the default stream id. Saying declarer.shuffleGrouping("1") subscribes to the default stream on component "1" and is equivalent to declarer.shuffleGrouping("1", DEFAULT_STREAM_ID).
The main method in bolts is the execute method, which takes a new tuple as input. Bolts emit new tuples using the OutputCollector object. Bolts must call the ack method on the OutputCollector for every tuple they process so that Storm knows when tuples are completed (and can eventually determine that it is safe to ack the original spout tuples). For the common case of processing an input tuple, emitting zero or more tuples based on that tuple, and then acking the input tuple, Storm provides an IBasicBolt interface which does the acking automatically.
Please note that OutputCollector is not thread-safe, and all emits, acks, and fails must happen on the same thread. Please refer to Troubleshooting for more details.
Resources:
IRichBolt: this is the general interface for bolts.
IBasicBolt: this is a convenience interface for defining bolts that do filtering or simple functions.
OutputCollector: bolts emit tuples to their output streams using an instance of this class
Guaranteeing message processing
Networks of spouts and bolts are packaged into a topology which is the top-level abstraction that you submit to
Storm clusters for execution. A topology is a graph of stream transformations where each node is a spout or bolt.
Edges in the graph indicate which bolts are subscribing to which streams. When a spout or bolt emits a tuple to a
stream, it sends the tuple to every bolt that subscribed to that stream.
Links between nodes in your topology indicate how tuples should be passed around. For example, if there is a link
between Spout A and Bolt B, a link from Spout A to Bolt C, and a link from Bolt B to Bolt C, then everytime Spout
A emits a tuple, it will send the tuple to both Bolt B and Bolt C. All of Bolt B's output tuples will go to Bolt C as
well.
Each node in a Storm topology executes in parallel. In your topology, you can specify how much parallelism you
want for each node, and then Storm will spawn that number of threads across the cluster to do the execution.
A topology runs forever, or until you kill it. Storm will automatically reassign any failed tasks. Additionally, Storm
guarantees that there will be no data loss, even if machines go down and messages are dropped.
Stream groupings
A stream grouping tells a topology how to send tuples between two components. Remember, spouts and bolts
execute in parallel as many tasks across the cluster. If you look at how a topology is executing at the task level, it
looks something like this:
Stream groupings
Part of defining a topology is specifying for each bolt which streams it should receive as input. A stream grouping
defines how that stream should be partitioned among the bolt's tasks.
There are eight built-in stream groupings in Storm, and you can implement a custom stream grouping by implementing the CustomStreamGrouping interface:
1. Shuffle grouping: Tuples are randomly distributed across the bolt's tasks in a way such that each bolt is
guaranteed to get an equal number of tuples.
2. Fields grouping: The stream is partitioned by the fields specified in the grouping. For example, if the
stream is grouped by the user-id field, tuples with the same user-id will always go to the same task, but
tuples with different user-id's may go to different tasks.
3. Partial Key grouping: The stream is partitioned by the fields specified in the grouping, like the Fields
grouping, but are load balanced between two downstream bolts, which provides better utilization of
resources when the incoming data is skewed. This paper provides a good explanation of how it works and
the advantages it provides.
4. All grouping: The stream is replicated across all the bolt's tasks. Use this grouping with care.
5. Global grouping: The entire stream goes to a single one of the bolt's tasks. Specifically, it goes to the task
with the lowest id.
6. None grouping: This grouping specifies that you don't care how the stream is grouped. Currently, none
groupings are equivalent to shuffle groupings. Eventually though, Storm will push down bolts with none
groupings to execute in the same thread as the bolt or spout they subscribe from (when possible).
7. Direct grouping: This is a special kind of grouping. A stream grouped this way means that the producer of the tuple decides which task of the consumer will receive this tuple. Direct groupings can only be declared on streams that have been declared as direct streams. Tuples emitted to a direct stream must be emitted using one of the emitDirect methods. A bolt can get the task ids of its consumers either by using the provided TopologyContext or by keeping track of the output of the emit method in OutputCollector (which returns the task ids that the tuple was sent to).
8. Local or shuffle grouping: If the target bolt has one or more tasks in the same worker process, tuples will be shuffled to just those in-process tasks. Otherwise, this acts like a normal shuffle grouping.
Resources:
TopologyBuilder: use this class to define topologies
InputDeclarer: this object is returned whenever setBolt is called on TopologyBuilder and is used for declaring a bolt's input streams and how those streams should be grouped
CoordinatedBolt: this bolt is useful for distributed RPC topologies and makes heavy use of direct streams and direct groupings
When a task for Bolt A emits a tuple to Bolt B, which task should it send the tuple to?
A stream grouping answers this question by telling Storm how to send tuples between sets of tasks. Before we dig
into the different kinds of stream groupings, let's take a look at another topology from storm-starter.
Guaranteeing message processing
Earlier on in this tutorial, we skipped over a few aspects of how tuples are emitted. Those aspects were part of
Storm's reliability API: how Storm guarantees that every message coming off a spout will be fully processed.
See Guaranteeing message processing for information on how this works and what you have to do as a user to take
advantage of Storm's reliability capabilities
Transactional topologies
Storm guarantees that every message will be played through the topology at least once. A common question asked is
how do you do things like counting on top of Storm? Won't you overcount? Storm has a feature called transactional
topologies that let you achieve exactly-once messaging semantics for most computations. Read more about
transactional topologies here.
Distributed RPC
This tutorial showed how to do basic stream processing on top of Storm. There's lots more things you can do with
Storm's primitives. One of the most interesting applications of Storm is Distributed RPC, where you parallelize the
computation of intense functions on the fly. Read more about Distributed RPC here.
Topologies
The logic for a realtime application is packaged into a Storm topology. A Storm topology is analogous to a MapReduce job. One key difference is that a MapReduce job eventually finishes, whereas a topology runs forever (or until you kill it, of course). A topology is a graph of spouts and bolts that are connected with stream groupings. These concepts are described below.
Resources:
TopologyBuilder: use this class to construct topologies in Java
Running topologies on a production cluster
Local mode: Read this to learn how to develop and test topologies in local mode.
Reliability
Storm guarantees that every spout tuple will be fully processed by the topology. It does this by tracking the tree of
tuples triggered by every spout tuple and determining when that tree of tuples has been successfully completed.
Every topology has a message timeout associated with it. If Storm fails to detect that a spout tuple has been
completed within that timeout, then it fails the tuple and replays it later.
To take advantage of Storm's reliability capabilities, you must tell Storm when new edges in a tuple tree are being created and tell Storm whenever you've finished processing an individual tuple. These are done using the OutputCollector object that bolts use to emit tuples. Anchoring is done in the emit method, and you declare that you're finished with a tuple using the ack method.
This is all explained in much more detail in Guaranteeing message processing.
Tasks
Each spout or bolt executes as many tasks across the cluster. Each task corresponds to one thread of execution, and stream groupings define how to send tuples from one set of tasks to another set of tasks. You set the parallelism for each spout or bolt in the setSpout and setBolt methods of TopologyBuilder.
Workers
Topologies execute across one or more worker processes. Each worker process is a physical JVM and executes a
subset of all the tasks for the topology. For example, if the combined parallelism of the topology is 300 and 50
workers are allocated, then each worker will execute 6 tasks (as threads within the worker). Storm tries to spread the
tasks evenly across all the workers.
Five key abstractions help to understand how Storm processes data:
 Tuples– an ordered list of elements. For example, a 4-tuple might be (7, 1, 3, 7)
 Streams – an unbounded sequence of tuples.
 Spouts –sources of streams in a computation (e.g. A Twitter API)
 Bolts – process input streams and produce output streams. They can: run functions; filter, aggregate, or join
data; or talk to databases.
 Topologies – the overall calculation, represented visually as a network of spouts and bolts (as in the
following diagram)
Here are some typical prevent and optimize use cases for Storm:
 Financial Services – Prevent: securities fraud, operational risks & compliance violations. Optimize: order routing, pricing.
 Telecom – Prevent: security breaches, network outages. Optimize: bandwidth allocation, customer service.
 Retail – Prevent: shrinkage, stock outs. Optimize: offers, pricing.
 Manufacturing – Prevent: preventative maintenance, quality assurance. Optimize: supply chain optimization, reduced plant downtime.
 Transportation – Prevent: driver monitoring, predictive maintenance. Optimize: routes, pricing.
 Web – Prevent: application failures, operational issues. Optimize: personalized content.
Explain what Apache Storm is. What are the components of Storm?
Apache Storm is an open-source, distributed real-time computation system used for processing real-time big data analytics. Unlike Hadoop's batch processing, Apache Storm does real-time processing and can be used with any programming language.
Components of Apache Storm include:
• Nimbus: works like Hadoop's JobTracker. It distributes code across the cluster, uploads computations for execution, allocates workers across the cluster, and monitors computation and reallocates workers as needed.
• ZooKeeper: used as a mediator for communication within the Storm cluster.
• Supervisor: interacts with Nimbus through ZooKeeper; depending on the signals received from Nimbus, it executes the process.
Why Apache Storm is the first choice for Real Time Processing?
Five characteristics make Storm ideal for real-time data processing workloads. Storm is:
Fast – benchmarked as processing one million 100 byte messages per second per node
Scalable – with parallel calculations that run across a cluster of machines
Fault-tolerant – when workers die, Storm will automatically restart them. If a node dies, the worker will be
restarted on another node.
Reliable – Storm guarantees that each unit of data (tuple) will be processed at least once or exactly once. Messages
are only replayed when there are failures.
Easy to operate – standard configurations are suitable for production on day one. Once deployed, Storm is easy to
operate.
Explain how data flows in Apache Storm.
In Apache Storm, data flows through three components: Spout, Bolt and Tuple.
Spout: a spout is a source of data in Storm.
Bolt: a bolt processes the data.
Tuple: data is passed around as tuples.
Mention the difference between Apache HBase and Storm.
• Apache Storm: provides data processing in real time but does not store the data. You stream your data through it so that alerts and actions can be raised in real time if needed.
• Apache HBase: stores the data but does not process it; it offers low-latency reads of processed data for querying later.
Q. Mention the difference between Apache Kafka and Apache Storm.
• Apache Kafka: a distributed and robust messaging system that can handle huge amounts of data and allows passage of messages from one end-point to another.
• Apache Storm: a real-time message processing system in which you can edit or manipulate data in real time. Apache Storm pulls the data from Kafka and applies the required manipulation.
Q. Explain how you can streamline log files using Apache Storm.
To read from the log files, you can configure your spout to emit one tuple per line as it reads the log. The output can then be assigned to a bolt for analysis.
Q. Explain what streams is and stream grouping in Apache storm?
In Apache Storm, stream is referred as a group or unbounded sequence of Tuples while stream grouping determines
how stream should be partitioned among the bolt’s tasks.
Q. Explain what Topology_Message_Timeout_secs is in Apache Storm.
It is the maximum amount of time allotted to the topology to fully process a message released by a spout. If the message is not acknowledged in the given time frame, Apache Storm will fail the message on the spout.
Q. Explain how a message is fully processed in Apache Storm.
Storm requests a tuple from the Spout by calling the nextTuple method on the Spout. The Spout uses the SpoutOutputCollector provided in the open method to emit a tuple to one of its output streams. While emitting a tuple, the Spout assigns a message id that will be used to identify the tuple later.
After that, the tuple gets sent to consuming bolts, and Storm takes charge of tracking the tree of messages that is produced. If Storm is confident that a tuple has been processed thoroughly, it calls the ack method on the originating Spout task with the message id that the Spout gave to Storm.
Q. Explain how to write the output into a file using Storm.
In the Spout, when you are reading a file, create the file-reader object in the open() method, so that it is initialized once per worker node, and then use that object in the nextTuple() method.
Q. When using field grouping in Storm, is there any time-out or limit to known field values?
Field grouping in Storm uses a mod-hash function to decide which task to send a tuple to, ensuring that tuples with the same field value always go to the same task. No cache is required for this, so there is no time-out or limit to known field values.
Q. Does Apache include a search engine?
Yes, Apache contains a search engine. You can search for a report name in Apache by using the search title.
Q. Why does not Apache include SSL?
SSL (Secure Socket Layer) data transport requires encryption, and many governments have restrictions upon the
import, export, and use of encryption technology. If Apache included SSL in the base package, its distribution would
involve all sorts of legal and bureaucratic issues, and it would no longer be freely available. Also, some of the
technology required to talk to current clients using SSL is patented by RSA Data Security, who restricts its use
without a license.
Q. Does Apache include any sort of database integration?
No. Apache is a Web (HTTP) server, not an application server. The base package does not include any such
functionality. See the PHP project and the mod_perl project for examples of modules that allow you to work with
databases from within the Apache environment.
Q. Does Apache come with Java support?
The base Apache Web server package does not include support for Java
Q. Mention how storm application can be beneficial in financial services?
In financial services, Storm can be helpful in preventing Securities fraud, Order routing, Pricing, Compliance
Violations
Q. Can we use Active Server Pages (ASP) with Apache?
Apache Web Server package does not include ASP support. However, a number of projects provide ASP or ASP-
like functionality for Apache. Some of these are: Apache::ASP, mod_mono
Q. While installing, why does Apache have three config files – srm.conf, access.conf and httpd.conf?
The first two are remnants from the NCSA times, and generally you should be ok if you delete the first two, and
stick with httpd.conf.
How do you stop Apache?
To stop Apache you can use the /etc/init.d/httpd stop command.
How do you check the httpd.conf consistency and any errors in it?
We can check the syntax of the httpd configuration file by using the following command:
httpd -S
What is servertype directive in Apache Server?
It defines whether Apache should spawn itself as a child process (standalone) or keep everything in a single process
(inetd). Keeping it inetd conserves resources. This is deprecated, however.
In which folder are Java applications stored in Apache?
Java applications are not stored in Apache; Apache can only be connected to another web server that hosts Java web applications, using the mod_jk connector.
What is mod_vhost_alias?
This module creates dynamically configured virtual hosts, by allowing the IP address and/or the Host: header of the
HTTP request to be used as part of the pathname to determine what files to serve. This allows for easy use of a huge
number of virtual hosts with similar configurations.
What is Struts and explain its purpose?
Struts is an open-source framework for creating Java web applications.
Is running Apache as root a security risk?
No. The root process opens port 80 but never listens on it, so no user will actually enter the site with root rights. If you kill the root process, you will see the other child processes disappear as well.
CORE JAVA
1. What are the principle concepts of OOPS?
These are the four principle concepts of object oriented design and programming:
 Abstraction
 Polymorphism
 Inheritance
 Encapsulation
2. How does abstraction differ from encapsulation?
 Abstraction focuses on the interface of an object, whereas encapsulation prevents clients from seeing its inside view, i.e. where the behavior of the abstraction is implemented.
 Abstraction solves the problem on the design side, while encapsulation is the implementation.
 Encapsulation is the deliverable of abstraction. Encapsulation is essentially about grouping your abstraction to suit the developer's needs.
3. What is an immutable object? How do you create one in Java?
Immutable objects are those whose state cannot be changed once they are created. Any modification will result in a
new object e.g. String, Integer, and other wrapper class.
4. What are the differences between processes and threads?
 A process is an execution of a program whereas a Thread is a single execution sequence within a process. A
process can contain multiple threads.
 Thread is at times called a lightweight process.
5. What is the purpose of garbage collection in Java? When is it used?
The purpose of garbage collection is to identify and discard the objects that are no longer needed by the application
to facilitate the resources to be reclaimed and reused.
6. What is Polymorphism?
Polymorphism is briefly described as one interface, many implementations. Polymorphism is a characteristic of
being able to assign a different meaning or usage to something in different contexts – specifically, to allow an entity
such as a variable, a function, or an object to have more than one form. There are two types of polymorphism:
 Compile time polymorphism
 Run time polymorphism.
Compile time polymorphism is method overloading. Runtime time polymorphism is done using inheritance and
interface.
7. In Java, what is the difference between method overloading and method overriding?
Method overloading in Java occurs when two or more methods in the same class have the exact same name, but
different parameters. On the other hand, method overriding is defined as the case when a child class redefines the
same method as a parent class. Overridden methods must have the same name, argument list, and return type. The
overriding method may not limit the access of the method it overrides.
8. How do you differentiate abstract class from interface?
 The abstract keyword is used to create an abstract class; interface is the keyword for interfaces.
 Abstract classes can have method implementations, whereas interfaces cannot (prior to Java 8's default methods).
 A class can extend only one abstract class, but it can implement multiple interfaces.
 You can run an abstract class if it has a main() method, but not an interface.
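A compact sketch of these differences (all names are only examples):

interface Drawable {
    void draw();                  // implicitly public and abstract
}

abstract class Vehicle {
    abstract void start();        // no body; concrete subclasses must override it
    void stop() {                 // abstract classes may carry method implementations
        System.out.println("stopped");
    }
}

// A class extends at most one abstract class but may implement several interfaces.
public class Car extends Vehicle implements Drawable {
    @Override void start() { System.out.println("car started"); }
    @Override public void draw() { System.out.println("drawing a car"); }

    public static void main(String[] args) {
        Car c = new Car();
        c.start();
        c.stop();   // concrete implementation inherited from the abstract class
        c.draw();
    }
}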
9. Can you override a private or static method in Java?
You cannot override a private or static method in Java. If you create a similar method with same return type and
same method arguments in child class then it will hide the super class method; this is known as method hiding.
Similarly, you cannot override a private method in sub class because it’s not accessible there. What you can do is
create another private method with the same name in the child class.
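A short sketch of static method hiding (class names are only examples):

class Parent {
    static void greet() { System.out.println("Parent.greet"); }
}

class Child extends Parent {
    // This does NOT override Parent.greet(); it hides it.
    static void greet() { System.out.println("Child.greet"); }
}

public class HidingDemo {
    public static void main(String[] args) {
        Parent p = new Child();
        // A static call is resolved by the compile-time type of the reference, not the object:
        p.greet();   // prints "Parent.greet"
    }
}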
10. What is Inheritance in Java?
Inheritance in Java is a mechanism in which one object acquires all the properties and behaviors of the parent object.
The idea behind inheritance in Java is that you can create new classes building upon existing classes. When you
inherit from an existing class, you can reuse methods and fields of parent class, and you can also add new methods
and fields.
Inheritance represents the IS-A relationship, also known as parent-child relationship.
Inheritance is used for:
 Method Overriding (so runtime polymorphism can be achieved)
 Code Reusability
11. What is super in Java?
The super keyword in Java is a reference variable that is used to refer to the immediate parent class object. Whenever
you create an instance of a subclass, an instance of the parent class is created implicitly and is referred to by the super
reference variable.
Java super Keyword is used to refer:
 Immediate parent class instance variable
 Immediate parent class constructor
 Immediate parent class method
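A small sketch showing the IS-A relationship and the three uses of super (names are only examples):

class Animal {
    String name = "animal";
    Animal() { System.out.println("Animal constructor"); }
    void eat() { System.out.println("eating"); }
}

class Dog extends Animal {            // Dog IS-A Animal
    String name = "dog";
    Dog() {
        super();                          // immediate parent class constructor
        System.out.println(super.name);   // immediate parent class instance variable
    }
    void bark() {
        super.eat();                      // immediate parent class method
        System.out.println("barking");
    }
}

public class InheritanceDemo {
    public static void main(String[] args) {
        Dog d = new Dog();   // prints "Animal constructor" and then "animal"
        d.eat();             // reused parent method
        d.bark();            // prints "eating" then "barking"
    }
}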
12. What is constructor?
Constructor in Java is a special type of method that is used to initialize the object. It is invoked at the time of object
creation. It constructs the values, i.e. provides data for the object, and that is why it is known as a constructor. Rules for
creating Java constructor:
 Constructor name must be same as its class name
 Constructor must have no explicit return type
Types of Java constructors:
 Default constructor (no-arg constructor)
 Parameterized constructor
13. What is the purpose of default constructor?
A constructor that has no parameters is known as a default constructor.
Syntax of default constructor:
<class_name>(){}
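For example, a minimal sketch with both constructor types (names and values are only examples):

public class Employee {
    private final String name;
    private final int id;

    // Parameterized constructor: same name as the class, no explicit return type.
    public Employee(String name, int id) {
        this.name = name;
        this.id = id;
    }

    // No-arg constructor, chaining to the parameterized one.
    public Employee() {
        this("unknown", 0);
    }

    public static void main(String[] args) {
        Employee e1 = new Employee();           // no-arg construction
        Employee e2 = new Employee("Asha", 7);  // parameterized construction
        System.out.println(e1.name + " / " + e2.id);
    }
}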
14. What kind of variables can a class consist?
A class consists of Local Variable, Instance Variables and Class Variables.
15. What is the default value of the local variables?
The local variables are not initialized to any default value; neither primitives nor object references.
16. What are the differences between path and classpath variables?
PATH is an environment variable used by the operating system to locate the executables. This is the reason we need
to add the directory location in the PATH variable when we install Java or want any executable to be found by OS.
Classpath is specific to Java and used by Java executables to locate class files. We can provide the classpath location
while running a Java application and it can be a directory, ZIP file or JAR file.
17. What does the ‘static’ keyword mean? Is it possible to override private or static method in Java?
The static keyword denotes that a member variable or method can be accessed, without requiring an instantiation of
the class to which it belongs. You cannot override static methods in Java, because method overriding is based upon
dynamic binding at runtime, whereas static methods are statically bound at compile time. A static method is not
associated with any instance of a class, so the concept of overriding is not applicable.
18. What are the differences between Heap and Stack Memory?
Major difference between Heap and Stack memory are:
 Heap memory is used by all the parts of the application whereas stack memory is used only by one thread of
execution.
 When an object is created, it is always stored in the Heap space and stack memory contains the reference to it.
 Stack memory only contains local primitive variables and reference variables to objects in heap space.
 Memory management in stack is done in LIFO manner; it is more complex in Heap memory as it is used
globally.
19. Explain different ways of creating a Thread. Which one would you prefer and why?
There are three ways of creating a Thread:
1) A class may extend the Thread class
2) A class may implement the Runnable interface
3) An application can use the Executor framework, in order to create a thread pool.
The Runnable interface is preferred, as it does not require an object to inherit the Thread class.
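A minimal sketch of the three approaches (the pool size and messages are only examples):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadCreationDemo {
    public static void main(String[] args) {
        // 1) Extend the Thread class.
        Thread t1 = new Thread() {
            @Override public void run() { System.out.println("extended Thread"); }
        };
        t1.start();

        // 2) Implement Runnable (preferred: the class stays free to extend something else).
        Runnable task = () -> System.out.println("Runnable task");
        new Thread(task).start();

        // 3) Use the Executor framework to create a thread pool.
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.submit(task);
        pool.shutdown();
    }
}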
20. What is synchronization?
Synchronization refers to multi-threading. A synchronized block of code can be executed by only one thread at a
time. As Java supports execution of multiple threads, two or more threads may access the same fields or objects.
Synchronization is a process which keeps all concurrent threads in execution in sync. Synchronization avoids
memory consistency errors caused by an inconsistent view of shared memory. When a method is declared as
synchronized the thread holds the monitor for that method’s object. If another thread is executing the synchronized
method the thread is blocked until that thread releases the monitor.
21. How can we achieve thread safety in Java?
The ways of achieving thread safety in Java are:
 Synchronization
 Atomic concurrent classes
 Implementing concurrent Lock interface
 Using volatile keyword
 Using immutable classes
 Thread safe classes.
22. What are the uses of synchronized keyword?
Synchronized keyword can be applied to static/non-static methods or a block of code. Only one thread at a time can
access synchronized methods and if there are multiple threads trying to access the same method then other threads
have to wait for the execution of method by one thread. Synchronized keyword provides a lock on the object and
thus prevents race condition.
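A minimal sketch of a synchronized method and the equivalent synchronized block (the Counter class is only an example):

public class Counter {
    private int count = 0;

    // Only one thread at a time can execute this method on a given Counter instance.
    public synchronized void increment() {
        count++;
    }

    // Equivalent form using a synchronized block on the same monitor (this).
    public void incrementWithBlock() {
        synchronized (this) {
            count++;
        }
    }

    public synchronized int get() {
        return count;
    }
}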
23. What are the differences between wait() and sleep()?
 wait() is a method of the Object class, whereas sleep() is a static method of the Thread class.
 sleep() sends the thread into the sleeping state for x milliseconds. When a thread goes into the sleep state it does
not release the lock it holds.
 wait() makes the thread release the lock and go into the waiting state. The thread becomes active again only when
notify() or notifyAll() is called on the same object.
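A minimal wait()/notifyAll() sketch (the Mailbox class is only an example). By contrast, Thread.sleep(1000) would pause the current thread without releasing any lock it holds:

public class Mailbox {
    private String message;

    // wait() releases the monitor of 'this' until another thread calls notify()/notifyAll().
    public synchronized String take() throws InterruptedException {
        while (message == null) {
            wait();
        }
        String m = message;
        message = null;
        return m;
    }

    public synchronized void put(String m) {
        message = m;
        notifyAll();   // wake up any thread waiting on this object's monitor
    }
}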
24. How does hashmap work in Java ?
A hashmap in Java stores key-value pairs. The hashmap requires a hash function and uses hashcode and equals
methods in order to put and retrieve elements to and from the collection. When the put method is invoked, the
hashmap calculates the hash value of the key and stores the pair in the appropriate index inside the collection. If the
key exists then its value is updated with the new value. Some important characteristics of a hashmap are its capacity,
its load factor and its resize threshold.
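For example (the keys, values, capacity and load factor are only illustrative):

import java.util.HashMap;
import java.util.Map;

public class HashMapDemo {
    public static void main(String[] args) {
        // Initial capacity 16 and load factor 0.75 are also the defaults.
        Map<String, Integer> wordCounts = new HashMap<>(16, 0.75f);
        wordCounts.put("hadoop", 1);                 // index chosen from the key's hash
        wordCounts.put("spark", 2);
        wordCounts.put("hadoop", 3);                 // existing key: the value is replaced
        System.out.println(wordCounts.get("hadoop"));            // 3
        System.out.println(wordCounts.getOrDefault("hive", 0));  // 0
    }
}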
25. What are the differences between String, StringBuffer and StringBuilder?
String is immutable and final in Java, so a new String is created whenever we do String manipulation. As String
manipulations are resource consuming, Java provides two utility classes: StringBuffer and StringBuilder.
 StringBuffer and StringBuilder are mutable classes. StringBuffer operations are thread-safe and synchronized,
whereas StringBuilder operations are not thread-safe.
 StringBuffer is to be used when multiple threads are working on the same String, and StringBuilder in a single-
threaded environment.
 StringBuilder is faster than StringBuffer because it has no synchronization overhead.
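A small sketch of the three classes in use (the values are only illustrative):

public class StringConcatDemo {
    public static void main(String[] args) {
        String s = "a";
        s = s + "b";               // builds a new String object; "a" itself never changes

        StringBuilder sb = new StringBuilder("a");   // mutable, not synchronized
        sb.append("b").append("c");

        StringBuffer buf = new StringBuffer("a");    // mutable, methods are synchronized
        buf.append("b");

        System.out.println(sb);    // abc
        System.out.println(buf);   // ab
    }
}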
Monitoring, Management and Orchestration Components of Hadoop Ecosystem- Oozie and Zookeeper
ZOOKEEPER
What is zookeeper?
Apache Zookeeper is an effort to develop and maintain an open-source server which enables highly reliable
distributed coordination. Zookeeper is a centralized service for maintaining configuration information, naming,
providing distributed synchronization, and providing group services: an open-source server that reliably coordinates
distributed processes. Apache Zookeeper provides operational services for a Hadoop cluster: a distributed
configuration service, a synchronization service and a naming registry for distributed systems. Distributed
applications use Zookeeper to store and mediate updates to important configuration information.
Name a few companies that use Zookeeper.
Yahoo, Solr, Helprace, Neo4j, Rackspace
What is the role of Zookeeper in hbase architecture?
In the hbase architecture, zookeeper is the monitoring server that provides different services such as tracking server
failures and network partitions, maintaining the configuration information, establishing communication between the
clients and region servers, and using ephemeral nodes to identify the available servers in the cluster.
Explain about zookeeper in Kafka
Apache Kafka uses zookeeper to be a highly distributed and scalable system. Zookeeper is used by Kafka to store
various configurations and use them across the hadoop cluster in a distributed manner. To achieve this,
configurations are distributed and replicated throughout the leader and follower nodes in the zookeeper ensemble.
We cannot connect to Kafka directly, bypassing zookeeper, because if zookeeper is down Kafka will not be able to
serve client requests.
What zookeeper Does
Zookeeper provides a very simple interface and services. Zookeeper brings these key benefits:
 Fast. Zookeeper is especially fast with workloads where reads to the data are more common than writes.
The ideal read/write ratio is about 10:1.
 Reliable. Zookeeper is replicated over a set of hosts (called an ensemble) and the servers are aware of each
other. As long as a critical mass of servers is available, the zookeeper service will also be available. There
is no single point of failure.
 Simple. Zookeeper maintains a standard hierarchical name space, similar to files and directories.
 Ordered. The service maintains a record of all transactions, which can be used for higher-level abstractions,
like synchronization primitives.
How zookeeper Works
Zookeeper allows distributed processes to coordinate with each other through a shared hierarchical name space of
data registers, known as znodes. Every znode is identified by a path, with path elements separated by a slash (/).
Aside from the root, every znode has a parent, and a znode cannot be deleted if it has children.
This is much like a normal file system, but zookeeper provides superior reliability through redundant services. The
service is replicated over a set of machines and each maintains an in-memory image of the data tree and the
transaction logs. Clients connect to a single zookeeper server and maintain a TCP connection through which they
send requests and receive responses.
This architecture allows zookeeper to provide high throughput and availability with low latency, but the size of the
database that zookeeper can manage is limited by memory.
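As an illustration, a minimal sketch with the ZooKeeper Java client; the connection string, znode path and data are placeholders:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkBasics {
    public static void main(String[] args) throws Exception {
        // Connect to one server of the ensemble (address and timeout are placeholders).
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});

        // Create a persistent znode under the root; path elements are separated by "/".
        if (zk.exists("/app-config", false) == null) {
            zk.create("/app-config", "v1".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Read the data back, much like reading a small file.
        byte[] data = zk.getData("/app-config", false, null);
        System.out.println(new String(data));

        zk.close();
    }
}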
Explain how Zookeeper works
Zookeeper is referred to as the King of Coordination and distributed applications use zookeeper to store and
facilitate important configuration information updates. Zookeeper works by coordinating the processes of distributed
applications. Zookeeper is a robust replicated synchronization service with eventual consistency. A set of nodes is
known as an ensemble and persisted data is distributed between multiple nodes.
Three or more independent servers collectively form a zookeeper cluster and elect a master. A client connects to any
one of the servers and migrates to another server if that node fails. The ensemble of zookeeper nodes stays alive as
long as a majority of the nodes are working. The master node in zookeeper is dynamically selected by consensus
within the ensemble, so if the master node fails, the master role migrates to another node which is selected
dynamically. Writes are linear and reads are concurrent in zookeeper.
List some examples of Zookeeper use cases.
 Found by Elastic uses Zookeeper comprehensively for resource allocation, leader election, high-priority notifications
and discovery. The entire service of Found is built up of various systems that read and write to Zookeeper.
 Apache Kafka, which depends on zookeeper, is used by LinkedIn.
 Storm, which relies on zookeeper, is used by popular companies like Groupon and Twitter.
How to use Apache Zookeeper command line interface?
Zookeeper has a command line client support for interactive use. The command line interface of zookeeper is
similar to the file and shell system of UNIX. Data in zookeeper is stored in a hierarchy of Znodes where each znode
can contain data just similar to a file. Each znode can also have children just like directories in the UNIX file
system.
Zookeeper-client command is used to launch the command line client. If the initial prompt is hidden by the log
messages after entering the command, users can just hit ENTER to view the prompt.
What are the different types of Znodes?
There are 2 types of Znodes namely- Ephemeral and Sequential Znodes.
 The znodes that get destroyed as soon as the client that created them disconnects are referred to as ephemeral znodes.
 A sequential znode is one in which a sequence number chosen by the zookeeper ensemble is appended to the name
the client assigns to the znode.
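A minimal sketch of both kinds with the ZooKeeper Java client (the connection string and paths are placeholders, and the /workers and /queue parent znodes are assumed to exist already):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeTypes {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});

        // Ephemeral: deleted automatically when this client's session ends.
        zk.create("/workers/worker", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Sequential: the ensemble appends a monotonically increasing counter to the name,
        // e.g. /queue/task-0000000001.
        String path = zk.create("/queue/task-", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        System.out.println("created " + path);

        zk.close();
    }
}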
What are watches?
Client disconnection can be a troublesome problem, especially when we need to keep track of the state of znodes
at regular intervals. Zookeeper has an event system referred to as a watch, which can be set on a znode to trigger an
event whenever the znode is removed or altered, or any new children are created below it.
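A minimal sketch of registering a watch with the ZooKeeper Java client (connection string and path are placeholders):

import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class WatchDemo {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});

        // A watch is one-shot: it fires a single event when /app-config is created, deleted
        // or its data changes, and must then be registered again.
        Watcher watcher = event ->
                System.out.println("event " + event.getType() + " on " + event.getPath());
        zk.exists("/app-config", watcher);

        Thread.sleep(60_000);   // keep the session alive long enough to observe an event
        zk.close();
    }
}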
What problems can be addressed by using Zookeeper?
In the development of distributed systems, creating your own protocols for coordinating the hadoop cluster results in
failure and frustration for the developers. The architecture of a distributed system can be prone to deadlocks,
inconsistency and race conditions. This leads to various difficulties in making the hadoop cluster fast, reliable and
scalable. To address all such problems, Apache zookeeper can be used as a coordination service to write correct
distributed applications without having to reinvent the wheel from the beginning.
What is the role of zookeeper in a Hadoop cluster?
The purpose of zookeeper is cluster management. Zookeeper will help you achieve coordination between Hadoop
nodes. Zookeeper also helps to:
 Manage configuration across nodes
 Implement reliable messaging
 Implement redundant services
 Synchronize process execution
OOZIE
How can you schedule a sqoop job using Oozie?
Oozie has in-built sqoop actions inside which we can mention the sqoop commands to be executed.
Apache Oozie Workflow Scheduler for Hadoop
Overview
Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie Workflow jobs are Directed
Acyclic Graphs (DAGs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time
(frequency) and data availability.
Oozie is integrated with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (such
as Java map-reduce, streaming map-reduce, Pig, Hive, Sqoop and DistCp) as well as system-specific jobs (such as
Java programs and shell scripts).
Oozie is a scalable, reliable and extensible system.
What Oozie Does
Apache Oozie is a Java Web application used to schedule Apache Hadoop jobs. Oozie combines multiple jobs
sequentially into one logical unit of work. It is integrated with the Hadoop stack, with YARN as its architectural
center, and supports Hadoop jobs for Apache MapReduce, Apache Pig, Apache Hive, and Apache Sqoop. Oozie can
also schedule jobs specific to a system, like Java programs or shell scripts.
Apache Oozie is a tool for Hadoop operations that allows cluster administrators to build complex data
transformations out of multiple component tasks. Oozie provides greater control over jobs and also makes it easier
to repeat those jobs at predetermined intervals.
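As an illustration, a workflow already deployed to HDFS can be submitted and monitored programmatically. The sketch below uses the Oozie Java client API; the Oozie URL, HDFS paths and the inputDir/outputDir property names are placeholders that would have to match your own workflow definition:

import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class SubmitWorkflow {
    public static void main(String[] args) throws Exception {
        // URL of the Oozie server (placeholder).
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        Properties conf = oozie.createConfiguration();
        // HDFS directory containing workflow.xml (placeholder path).
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/root/apps/wordcount");
        // Parameters referenced by the workflow definition (names are examples).
        conf.setProperty("inputDir", "/user/root/input");
        conf.setProperty("outputDir", "/user/root/output");

        String jobId = oozie.run(conf);             // submit and start the workflow
        WorkflowJob job = oozie.getJobInfo(jobId);  // poll its current status
        System.out.println(jobId + " is " + job.getStatus());
    }
}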
There are two basic types of Oozie jobs:
Oozie Workflow: An Oozie Workflow is a collection of actions arranged in a Directed Acyclic Graph (DAG).
Control nodes define job chronology, setting rules for beginning and ending a workflow, which controls the
workflow execution path with decision, fork and join nodes. Action nodes trigger the execution of tasks.
Workflow nodes are classified in control flow nodes and action nodes:
Control flow nodes: nodes that control the start and end of the workflow and workflow job execution path.
Action nodes: nodes that trigger the execution of a computation/processing task.
Workflow definitions can be parameterized. The parameterization of workflow definitions is done using JSP
Expression Language syntax, allowing not only variables as parameters but also functions and complex
expressions.
EL expressions can be used in the configuration values of action and decision nodes, and in XML attribute and
element values.
Oozie Coordinator jobs are recurrent Oozie Workflow jobs that are triggered by time and data availability. Oozie
Coordinator can also manage multiple workflows that are dependent on the outcome of preceding workflows: the
outputs of one workflow become the input to the next workflow. This chain is called a data application
pipeline.
Oozie processes coordinator jobs in a fixed timezone with no DST (typically UTC); this timezone is referred to as
the 'Oozie processing timezone'.
The Oozie processing timezone is used to resolve coordinator jobs start/end times, job pause times and the initial-
instance of datasets. Also, all coordinator dataset instance URI templates are resolved to a datetime in the Oozie
processing time-zone.
A coordinator application is a program that triggers actions (commonly workflow jobs) when a set of conditions are
met. Conditions can be a time frequency, the availability of new dataset instances or other external events.
Types of coordinator applications:
Synchronous: Its coordinator actions are created at specified time intervals.
The usage of Oozie Coordinator can be categorized in 3 different segments:
Small: consisting of a single coordinator application with embedded dataset definitions
Medium: consisting of a single shared dataset definitions and a few coordinator applications
Large: consisting of a single or multiple shared dataset definitions and several coordinator applications
Oozie Bundle is a higher-level Oozie abstraction that batches a set of coordinator applications. The user is able to
start/stop/suspend/resume/rerun at the bundle level, resulting in better and easier operational control.
More specifically, the Oozie Bundle system allows the user to define and execute a bunch of coordinator
applications, often called a data pipeline. There is no explicit dependency among the coordinator applications in a
bundle. However, a user could use the data dependency of coordinator applications to create an implicit data
application pipeline.
Oozie executes workflows based on:
 Time Dependency (Frequency)
 Data Dependency
 Zookeeper-
Zookeeper is the king of coordination and provides simple, fast, reliable and ordered operational services for a
Hadoop cluster. Zookeeper is responsible for synchronization service, distributed configuration service and for
providing a naming registry for distributed systems.
Zookeeper Use Case-
Found by Elastic uses Zookeeper comprehensively for resource allocation, leader election, high-priority
notifications and discovery. The entire service of Found is built up of various systems that read and write
to Zookeeper.
 Oozie-
Oozie is a workflow scheduler where the workflows are expressed as Directed Acyclic Graphs. Oozie runs in a Java
servlet container (Tomcat) and makes use of a database to store all the running workflow instances, their states and
variables, along with the workflow definitions, to manage Hadoop jobs (MapReduce, Sqoop, Pig and Hive). The
workflows in Oozie are executed based on data and time dependencies.
Oozie Use Case:
The American video game publisher Riot Games uses Hadoop and the open source tool Oozie to understand the
player experience.