Big Data
Let’s talk about science first. The Large Hadron Collider at CERN produces about 1 petabyte of data every
second, mostly sensor data. The volume is so huge that CERN doesn’t even retain or store all the data it
produces.
NASA gathers about 1.73 gigabytes of data every hour about weather, geolocation, and other observations.
Let’s talk about the government. The NSA is known for its controversial data collection programs, and guess
how much data the NSA’s data center in Utah can house? A yottabyte, that is, 1 trillion terabytes of data.
Pretty massive, isn’t it?
In March 2012, the Obama administration announced $200 million in Big Data initiatives.
Even though we cannot technically classify the next one under government, it’s an interesting use case, so
we included it anyway. Obama’s second-term election campaign used big data analytics, which gave it a
competitive edge in winning the election.
Next, let’s look at the private sector. With the advent of social media like Facebook, Twitter, LinkedIn, etc.,
there is no scarcity of data. eBay is known to have a 30 PB cluster, and Facebook another 30 PB.
Let’s say you shop at amazon.com. Amazon is not only capturing data when you click checkout; every click
on the website is tracked to deliver a personalized shopping experience. When Amazon shows you
recommendations, big data analytics is at work behind the scenes.
What is commodity hardware? Does commodity hardware include RAM?
Commodity hardware is inexpensive hardware that is not built for high-end quality or high availability. Hadoop can
be installed on any average commodity hardware; we don’t need supercomputers or high-end hardware to work with
Hadoop. Yes, commodity hardware includes RAM, because some services run in RAM.
What are the different types of filesystems?
A filesystem controls how data is stored and retrieved. There are many different filesystems, each with its own
structure and logic and its own properties of speed, flexibility, security, size, and more. Disk filesystems are
filesystems put on hard drives and memory cards and are designed for that type of hardware; common examples
include NTFS, ext3, HFS+, UFS, and XFS. HDFS, in contrast, is a distributed filesystem layered on top of the local
disk filesystem. Flash drives commonly use disk filesystems like FAT32.
What is the difference between GFS and HDFS?
The Google File System (GFS) is a distributed filesystem developed by Google, specially designed to provide
efficient, reliable access to data using large clusters of commodity servers. Files are divided into chunks of 64
megabytes and are usually appended to or read; they are only extremely rarely overwritten or shrunk. Compared
with traditional filesystems, GFS is designed and optimized to run in data centers, providing extremely high data
throughput and low latency while surviving individual server failures. Inspired by GFS, the open source Hadoop
Distributed File System (HDFS) stores large files across multiple machines. It achieves reliability by replicating the
data across multiple servers. As in GFS, data is stored on multiple, possibly geographically diverse, nodes. The
filesystem is built from a cluster of datanodes, each of which serves blocks of data over the network using a block
protocol specific to HDFS. To perform computations on data in GFS and HDFS, a programming model is required.
Google developed the MapReduce programming model for GFS; Apache adopted the ideas from Google’s
MapReduce paper and developed the open source Hadoop MapReduce.
How are codecs useful to Hadoop?
A codec is an implementation of a compression-decompression algorithm. In Hadoop, a codec is represented by an
implementation of the CompressionCodec interface.
Are large compressed files supported in MapReduce?
For large files, you should not use a compression format that does not support splitting of the whole file, because
you lose data locality and make MapReduce applications very inefficient.
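To make the splitting point concrete, here is a small plain-Python sketch (not the Hadoop API) of how the choice of codec affects parallelism; the codec list and the 128 MB block size are illustrative assumptions:

```python
# Illustrative summary (not the Hadoop API): which common codecs support splitting.
SPLITTABLE = {
    "gzip": False,   # a DEFLATE stream cannot be entered mid-stream
    "bzip2": True,   # block-oriented format, split points can be found
    "snappy": False, # a raw snappy stream is not splittable
    "lzo": False,    # splittable only after building an external index
}

def map_tasks_for(codec: str, file_size_mb: int, block_size_mb: int = 128) -> int:
    """Return how many map tasks can work on one compressed file."""
    if SPLITTABLE.get(codec, False):
        # Splittable codec: one mapper per block (rounded up).
        return -(-file_size_mb // block_size_mb)
    # Non-splittable codec: the whole file goes to a single mapper.
    return 1

print(map_tasks_for("gzip", 1024))   # 1 mapper processes the whole 1 GB file
print(map_tasks_for("bzip2", 1024))  # 8 mappers, one per 128 MB block
```

With a splittable codec like bzip2, a 1 GB file can be processed by eight mappers in parallel; gzip forces the whole file through one mapper and loses locality.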
Big Data Technologies
Big data technologies are important in providing more accurate analysis, which may lead to more concrete
decision-making resulting in greater operational efficiencies, cost reductions, and reduced risks for the business.
While looking into the technologies that handle big data, we examine the following two classes of technology:
Operational Big Data
This includes systems like MongoDB that provide operational capabilities for real-time, interactive workloads where
data is primarily captured and stored.
NoSQL Big Data systems are designed to take advantage of the new cloud computing architectures that have
emerged over the past decade, which allow massive computations to be run inexpensively and efficiently. This
makes operational big data workloads much easier to manage, and cheaper and faster to implement.
Some NoSQL systems can provide insights into patterns and trends based on real-time data with minimal coding and
without the need for data scientists or additional infrastructure.
Analytical Big Data
This includes systems like Massively Parallel Processing (MPP) databases and MapReduce that provide analytical
capabilities for retrospective, complex analysis that may touch most or all of the data.
MapReduce provides a method of analyzing data that is complementary to the capabilities of SQL, and a system
based on MapReduce can be scaled up from single servers to thousands of high- and low-end machines.
These two classes of technology are complementary and frequently deployed together.
Operational vs. Analytical Systems
Operational Analytical
Capturing data
Curation
Storage
Searching
Sharing
Transfer
Analysis
Presentation
To fulfill the above challenges, organizations normally take the help of enterprise servers.
What are the problems that come with Big Data?
Big data comes with big problems. Let’s talk about a few problems you may run into when you deal with
Big Data.
Since the datasets are huge, you need to find a way to store them as efficiently as possible. I am not talking
about efficiency just in terms of storage space, but also about storing the dataset in a form that is suitable for
computation.
Another problem with big datasets is data loss, due to corruption in the data or to hardware failure, and you
need proper recovery strategies in place.
The main purpose of storing data is to analyze it, and how much time it takes to analyze your big data and
produce a solution to a problem is the million dollar question. What good is storing the data if you cannot
analyze or process it in reasonable time? With big datasets, computation with reasonable execution times is
a challenge.
Finally, cost. You are going to need a lot of storage space, so the storage solution you plan to use should be
cost effective.
Apache Hadoop is an open-source software framework for distributed storage and distributed processing of very
large data sets on computer clusters built from commodity hardware.
Hadoop can handle huge volumes of data, it stores data efficiently both for storage and for computation, it has a
good recovery solution for data loss, and above all it scales horizontally: as your data gets bigger you add more
nodes and Hadoop takes care of the rest. That simple.
And Hadoop is cost effective, meaning we don’t need any specialized hardware to run it, which makes it great even
for startups.
Hadoop vs. Traditional Solutions
RDBMS
What about Hadoop vs. RDBMS? Is Hadoop a replacement for Database?
The straight answer is – no. There are things Hadoop is good at and there are things that database is good at.
RDBMS works exceptionally well with volume in the low terabytes whereas with Hadoop the volume we speak is in
terms of Petabytes.
Hadoop can work with a dynamic schema and supports files in many different formats, whereas a database
schema is very strict, not so flexible, and cannot handle multiple formats.
Database solutions can scale vertically, meaning you can add more resources to the existing solution, but
they do not scale horizontally: you cannot bring down the execution time of a query by simply adding more
computers.
Finally, cost: database solutions get expensive very quickly when you increase the volume of data you are
trying to process, whereas Hadoop offers a cost effective solution.
In an RDBMS, data needs to be pre-processed before being stored, whereas Hadoop requires no pre-processing.
RDBMS is generally used for OLTP processing whereas Hadoop is used for analytical requirements on
huge volumes of data.
Database cluster in RDBMS uses the same data files in shared storage whereas in Hadoop the storage is
independent of each processing node.
Traditional RDBMS is used for transactional systems to report and archive the data, whereas Hadoop is an
approach to store huge amount of data in the distributed file system and process it.
An RDBMS is useful when you want to look up one record in big data, whereas Hadoop is useful when you
want to ingest big data in one shot and perform analysis on it later.
Hadoop is a batch processing system. It is not as interactive as a database. You cannot expect millisecond response
times with Hadoop as you would expect in a database. With Hadoop you write the file or dataset once and operate or
analyze the data multiple times whereas with the database you can read and write multiple times.
What is Hadoop?
As described above, Apache Hadoop is an open-source software framework for distributed storage and distributed
processing of very large data sets on clusters of commodity hardware. It stores data efficiently, recovers well from
data loss, and scales horizontally: as your data grows, you simply add more nodes.
Why do we need Hadoop?
Every day a large amount of unstructured data is dumped into our machines. The major challenge is not storing
large data sets in our systems but retrieving and analyzing the big data in organizations, especially data present in
different machines at different locations. This is where the need for Hadoop arises. Hadoop can analyze the data
present in different machines at different locations very quickly and in a very cost effective way. It uses the
MapReduce concept, which enables it to divide a query into small parts and process them in parallel. This is also
known as parallel computing.
Give a brief overview of Hadoop history.
In 2002, Doug Cutting created Nutch, an open source web crawler project. In 2004, Google published the
MapReduce and GFS papers. In 2006, Doug Cutting started the open source Hadoop project, implementing
MapReduce and HDFS. In 2008, Yahoo ran a 4,000-node Hadoop cluster and Hadoop won the terabyte sort
benchmark. In 2009, Facebook launched SQL support for Hadoop.
What is a daemon?
A daemon is a process or service that runs in the background. In general, we use this word in the UNIX
environment. The equivalent of a daemon in Windows is a service, and in DOS a TSR.
List the various Hadoop daemons and their roles in a Hadoop cluster.
Namenode: The master node, responsible for storing the metadata for all files and directories. It has information
about the blocks that make up each file and about where those blocks are located in the cluster.
Datanode: The slave node that contains the actual data. It reports information about the blocks it holds to the
namenode periodically. The datanode manages the storage attached to its node, and there can be many such nodes
in a cluster.
Secondary Namenode: It periodically merges the edit log into the filesystem image (fsimage) so that the edit log
doesn’t grow too large. It also keeps a copy of the merged image, which can be used in case of namenode failure.
Jobtracker: A daemon that runs on the master node for submitting and tracking MapReduce jobs in Hadoop. It
assigns tasks to the different tasktrackers.
Tasktracker: A daemon that runs on the datanodes. Tasktrackers manage the execution of individual tasks on the
slave nodes; each is responsible for instantiating and monitoring individual map and reduce tasks, i.e., the
tasktracker on each datanode performs the actual work.
Resourcemanager (Hadoop 2.x): It is the central authority that manages resources and schedules applications
running on top of YARN.
Nodemanager (Hadoop 2.x): It runs on slave machines, and is responsible for launching the application’s
containers, monitoring their resource usage (CPU, memory, disk, network) and reporting these to the
resourcemanager.
Jobhistoryserver (Hadoop 2.x): It maintains information about mapreduce jobs after the applicationmaster
terminates.
Applicationmaster (Hadoop 2.x): It negotiates resources from the resourcemanager and works with the
nodemanager(s) to execute and monitor tasks. The applicationmaster requests containers for all map and reduce
tasks; once containers are assigned, it starts them by notifying the corresponding nodemanager. It collects progress
information from all tasks, and aggregate values are propagated to the client node or user. An applicationmaster is
specific to a single application, which is a single job in classic MapReduce or a cycle of jobs. Once job execution is
complete, the applicationmaster no longer exists.
What does ‘jps’ command do?
It gives the status of the daemons which run the Hadoop cluster, listing the status of the namenode, datanode,
secondary namenode, jobtracker, and tasktracker.
What is metadata?
Metadata is information about the data stored in datanodes, such as the location of a file, its size, and so on.
What is a namenode and what is a datanode?
Datanode is the place where the data actually resides before any processing takes place. Namenode is the master
node that contains file system metadata and has information about - which file maps to which block locations and
which blocks are stored on the datanode.
How to restart Namenode?
Step 1. Run stop-all.sh and then run start-all.sh, OR
Step 2. Run sudo hdfs (press enter), su-hdfs (press enter), /etc/init.d/ha (press enter) and then /etc/init.d/hadoop-
0.20-namenode start (press enter).
Which are the three modes in which Hadoop can be run?
The three modes in which Hadoop can be run are:
1. Standalone (local) mode
2. Pseudo-distributed mode
3. Fully distributed mode
In stand-alone (local) mode there are no daemons; everything runs in a single JVM. It has no DFS and uses the local
filesystem. Stand-alone mode is suitable only for running MapReduce programs during development, and it is one
of the least used environments.
Pseudo-distributed mode is used both for development and in the QA environment. In pseudo mode, all the
daemons run on the same machine.
Fully distributed mode is used in the production environment, where ‘n’ machines form a Hadoop cluster and the
Hadoop daemons run across the cluster: one host runs the namenode, other hosts run datanodes, and still other
machines run the tasktracker/nodemanager. We have separate masters and separate slaves in this distribution.
What does /etc/init.d do?
/etc/init.d is where init scripts for daemons (services) are placed; it is also used to check the status of these
daemons. It is Linux-specific and has nothing to do with Hadoop.
What are the port numbers of Namenode, job tracker and task tracker?
The default port number for the namenode web UI is 50070, for the jobtracker 50030, and for the tasktracker 50060.
What are the Hadoop configuration files at present?
There are 3 main configuration files in Hadoop:
1. core-site.xml
2. hdfs-site.xml
3. mapred-site.xml
The hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml, masters, and slaves files are all available under
the ‘conf’ directory of the Hadoop installation directory.
core-site.xml and hdfs-site.xml:
The core-site.xml file tells the Hadoop daemons where the namenode runs in the cluster. It contains the
configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.
The hdfs-site.xml file contains the configuration settings for the HDFS daemons: the namenode, the secondary
namenode, and the datanodes. Here we can configure hdfs-site.xml to specify the default block replication and
permission checking on HDFS. The actual number of replications can also be specified when a file is created; the
default is used if replication is not specified at create time.
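As an illustration, a minimal hdfs-site.xml might look like the sketch below (property names from Hadoop 1.x; the values are examples, not recommendations):

```xml
<!-- hdfs-site.xml: minimal sketch (example values, not recommendations) -->
<configuration>
  <!-- default block replication for newly created files -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <!-- enable permission checking on HDFS (Hadoop 1.x property name) -->
  <property>
    <name>dfs.permissions</name>
    <value>true</value>
  </property>
</configuration>
```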
What is the Hadoop-core configuration?
Hadoop core used to be configured by two xml files:
1. hadoop-default.xml, which was later renamed to 2. hadoop-site.xml.
These files were written in xml format, with properties consisting of a name and a value. These files no longer exist
in current Hadoop versions.
Which are the three main hdfs-site.xml properties?
The three main hdfs-site.xml properties are:
1. dfs.name.dir, which gives the location where the namenode metadata will be stored and whether the DFS is
located on disk or on a remote machine.
2. dfs.data.dir, which gives the location where the data is going to be stored.
3. fs.checkpoint.dir, which is used by the secondary namenode.
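Put together, these three properties would appear in hdfs-site.xml roughly as follows (the paths are hypothetical placeholders):

```xml
<!-- hdfs-site.xml fragment: storage locations (hypothetical example paths) -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/data/dfs/nn</value>   <!-- where namenode metadata is stored -->
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/data/dfs/dn</value>   <!-- where datanode blocks are stored -->
  </property>
  <property>
    <name>fs.checkpoint.dir</name>
    <value>/data/dfs/snn</value>  <!-- secondary namenode checkpoint directory -->
  </property>
</configuration>
```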
What does hadoop-metrics.properties file do?
hadoop-metrics.properties is used for reporting purposes. It controls metrics reporting for Hadoop, and the default
is not to report.
Mapred-site.xml:
The mapred-site.xml file contains the configuration settings for the MapReduce daemons: the jobtracker and the
tasktrackers.
hadoop-env.sh:
hadoop-env.sh provides the environment for Hadoop to run; JAVA_HOME is set here. This file offers a way to
provide custom parameters for each of the servers. hadoop-env.sh is sourced by all the Hadoop core scripts
provided in the ‘conf/’ directory of the installation.
Environment variables
Exporthadoop_DATANODE_HEAPSIZE=128″
Exporthadoop_TASKTRACKER_HEAPSIZE=512″
What is the spill factor with respect to RAM?
The spill factor determines when the in-memory buffer used by a map task is spilled to temporary files on disk; the
Hadoop temp directory is used for this.
What does mapred.job.tracker do?
The mapred.job.tracker property specifies which node in your cluster acts as the jobtracker (its host and port).
What is Data Locality in Hadoop?
Data Locality in Hadoop refers to the proximity of the data with respect to the Mapper tasks working on the data.
Why is Data Locality important?
When a dataset is stored in HDFS, it is divided into blocks and stored across the datanodes in the Hadoop cluster.
When a MapReduce job is executed against the dataset, the individual mappers process the blocks (input splits).
When the data is not available to a mapper on the same node where it is being executed, the data has to be copied
over the network from the datanode that holds it to the datanode executing the mapper task. Imagine a MapReduce
job with over 100 mappers, each trying to copy data from another datanode in the cluster at the same time; this
would result in serious network congestion, which is not ideal. So it is always effective and cheap to move the
computation closer to the data rather than move the data closer to the computation.
How is data proximity defined?
When a jobtracker (MRv1) or applicationmaster (MRv2) receives a request to run a job, it looks at which nodes in
the cluster have sufficient resources to execute the mappers and reducers for the job. At this point, serious
consideration is given to deciding on which nodes the individual mappers will execute, based on where the data for
each mapper is located.
Data Local
When the data is located on the same node as the mapper working on it, this is referred to as data local. In this case
the data is as close to the computation as possible, and the jobtracker (MRv1) or applicationmaster (MRv2) prefers
to execute the mapper on the node that holds its data.
Rack Local
Although data local is the ideal choice, it is not always possible to execute the mapper on the same node as the data
due to resource constraints on a busy cluster. In such instances it is preferred to run the mapper on a different node
but on the same rack as the node which has the data. In this case, the data will be moved from the node holding it to
the node executing the mapper, within the same rack.
Different Rack
In a busy cluster, sometimes rack local is also not possible. In that case, a node on a different rack is chosen to
execute the mapper, and the data is copied across racks from the node which has the data to the node executing the
mapper. This is the least preferred scenario.
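The three locality levels above can be sketched in a few lines of plain Python (a simplification, not actual scheduler code); the node and rack names are made up:

```python
# Sketch of the scheduler's locality preference (simplified, not actual Hadoop code).
def classify_locality(mapper_node, mapper_rack, block_locations):
    """block_locations: list of (node, rack) pairs holding replicas of the split."""
    nodes = {n for n, _ in block_locations}
    racks = {r for _, r in block_locations}
    if mapper_node in nodes:
        return "data-local"   # best: no network copy at all
    if mapper_rack in racks:
        return "rack-local"   # copy stays within one rack's switch
    return "off-rack"         # worst: copy crosses racks

replicas = [("node1", "rack1"), ("node2", "rack1"), ("node7", "rack2")]
print(classify_locality("node1", "rack1", replicas))  # data-local
print(classify_locality("node3", "rack1", replicas))  # rack-local
print(classify_locality("node9", "rack3", replicas))  # off-rack
```

The scheduler tries data-local first, falls back to rack-local, and only then accepts an off-rack assignment.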
What is a rack?
A rack is a physical collection of datanodes stored at a single location. A cluster can contain multiple racks, and the
datanodes of a cluster can therefore be physically located in different places.
On what basis data will be stored on a rack?
When the client is ready to load a file into the cluster, the content of the file is divided into blocks. The client then
consults the namenode and gets 3 datanodes for every block of the file, which indicate where each block should be
stored. While placing the blocks, the key rule followed is that for every block of data, two copies will exist in one
rack and the third copy in a different rack. This rule is known as the Replica Placement Policy.
Do we need to place the 2nd and 3rd replicas on rack 2 only?
Yes; placing them on a different rack from the first copy protects against both datanode and rack failure.
What if rack 2 and datanode fails?
If both rack 2 and the datanode in rack 1 fail, there is no way to get the data back. To avoid such situations, we need
to replicate the data more times instead of replicating it only thrice. This can be done by changing the replication
factor, which is set to 3 by default.
How do you define rack awareness in Hadoop?
It is the manner in which the namenode decides how blocks are placed, based on rack definitions, so as to minimize
network traffic between racks. With the default replication factor of 3, the policy is that for every block of data, two
copies will exist in one rack and the third copy in a different rack. This rule is known as the Replica Placement
Policy.
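The Replica Placement Policy described above can be sketched as follows (a simplification of the real policy, which also weighs node load and free space; assumes every rack has at least two datanodes):

```python
# Sketch of the default Replica Placement Policy for replication factor 3.
# Assumption: every rack in `racks` has at least two datanodes.
import random

def place_replicas(racks, writer_rack):
    """racks: dict mapping rack name -> list of datanodes.
    Returns 3 nodes: one in the writer's rack, two distinct nodes in one other rack."""
    first = random.choice(racks[writer_rack])                  # copy 1: writer's rack
    remote = random.choice([r for r in racks if r != writer_rack])
    second = random.choice(racks[remote])                      # copy 2: a different rack
    third = random.choice([n for n in racks[remote] if n != second])  # copy 3: same remote rack, different node
    return [first, second, third]

racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas(racks, "rack1"))  # one node from rack1, two distinct nodes from rack2
```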
Hadoop Architecture
Hadoop framework includes following four modules:
Hadoop Common: Java libraries and utilities required by other Hadoop modules. These libraries provide
filesystem- and OS-level abstractions and contain the necessary Java files and scripts required to start
Hadoop.
Hadoop YARN: a framework for job scheduling and cluster resource management.
Hadoop Distributed File System (HDFS™): a distributed file system that provides high-throughput access
to application data.
Hadoop MapReduce: a YARN-based system for parallel processing of large data sets.
Hadoop has 2 core components: HDFS and MapReduce.
HDFS
HDFS stands for Hadoop Distributed File System. It takes care of all your storage-related complexities: splitting
your dataset into blocks, replicating each block to more than one node, and keeping track of which block is stored
on which node.
Mapreduce
MapReduce is a programming model, and it takes care of the computational complexities, such as bringing the
intermediate results from every node together to offer a consolidated output.
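The map/shuffle/reduce flow can be imitated in a few lines of plain Python (a toy word count, not the Hadoop API):

```python
# A toy, in-process imitation of the MapReduce model (not the Hadoop API):
# map emits (key, value) pairs, the framework groups by key, reduce aggregates.
from collections import defaultdict

def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data big cluster", "big data"]
result = reduce_phase(shuffle(map_phase(lines)))
print(result)  # {'big': 3, 'data': 2, 'cluster': 1}
```

In real Hadoop, the map and reduce functions run as distributed tasks and the shuffle moves data across the network, but the data flow is the same.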
What is distcp?
Distcp (distributed copy) is a tool used for large inter- and intra-cluster copying. It is very efficient because it uses
MapReduce to copy files or datasets, which means the copy operation is distributed across multiple nodes in your
cluster; this makes it much more effective than a hadoop fs -cp operation.
How does it work?
Distcp expands a list of files and directories and distributes the work among multiple map tasks; each map task
copies a partition of the files specified in the source list.
Syntax
When you are copying files between 2 clusters, both HDFS versions should be the same, or the higher version must
be backward compatible.
hadoop distcp hdfs://namenode:port/source hdfs://namenode:port/destination
How to change default block size in HDFS?
In the older versions of Hadoop the default block size was 64 MB and in the newer versions the default block size is
128 MB.
Why would you want to make the block size of specific dataset from 128 to 256 MB?
A single HDFS block (64 MB, 128 MB, or more) will be written to disk sequentially. When you write data
sequentially, there is a fair chance that it will be written into contiguous space on disk, meaning the data is laid out
next to each other in a continuous fashion. When data is laid out on disk continuously, the number of disk seeks
during read operations is reduced, resulting in efficient reads. That is why the block size in HDFS is huge compared
to other filesystems.
Let’s say you have a dataset which is 2 petabytes in size. Having a 64 MB block size for this dataset will result in
more than 31 million blocks, which would put stress on the namenode managing all those blocks. Having a lot of
blocks also results in a lot of mappers during MapReduce execution. So in this case you may decide to increase the
block size just for that dataset.
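A quick back-of-the-envelope check of these block counts (decimal units assumed, replication ignored):

```python
# Back-of-the-envelope check of the 2 PB example above.
def block_count(dataset_bytes, block_size_bytes):
    return -(-dataset_bytes // block_size_bytes)  # ceiling division

PB = 10**15
MB = 10**6
print(block_count(2 * PB, 64 * MB))   # 31,250,000 blocks
print(block_count(2 * PB, 128 * MB))  # 15,625,000 blocks
print(block_count(2 * PB, 256 * MB))  # 7,812,500 blocks
```

Doubling the block size halves the number of blocks the namenode must track, and with it the number of mappers a full scan of the dataset spawns.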
Changing the block size when you upload a file in HDFS is very simple.
--Create directory if it does not exist
hadoop fs -mkdir blksize
--Copy a file to HDFS with default block size (128 MB)
hadoop fs -copyFromLocal /hirw-starterkit/hdfs/commands/dwp-payments-april10.csv blksize
--Override the default block size with 256 MB (268435456 bytes)
hadoop fs -D dfs.blocksize=268435456 -copyFromLocal /hirw-starterkit/hdfs/commands/dwp-payments-
april10.csv blksize/dwp-payments-april10_256MB.csv
HDFS Block Placement Policy
When a file is uploaded in to HDFS it will be divided in to blocks. HDFS will have to decide where to place these
individual blocks in the cluster. HDFS block placement policy dictates a strategy of how and where to place
replica blocks in the cluster.
Inputsplit vs Block
The central idea behind MapReduce is distributed processing, so the most important thing is to divide the dataset
into chunks and have a separate process work on every chunk of data. The chunks are called input splits and the
processes working on the chunks (input splits) are called mappers.
Are inputsplits Same As Blocks?
Inputsplit is not the same as the block.
A block is a hard division of data at the block size. So if the block size in the cluster is 128 MB, each block for the
dataset will be 128 MB except for the last block which could be less than the block size if the file size is not
entirely divisible by the block size. So a block is a hard cut at the block size and blocks can end even before a
logical record ends.
Consider that the block size in your cluster is 128 MB and each logical record in your file is about 100 MB. (Yes,
huge records.)
The first record fits perfectly in the first block, no problem, since the record size of 100 MB is well within the block
size of 128 MB. However, the second record cannot fit entirely in block 1, so record number 2 will start in block 1
and end in block 2.
If you assign a mapper to block 1, in this case, the mapper cannot process record 2 because block 1 does not hold
the complete record 2. That is exactly the problem input splits solve. In this case, input split 1 will have both record
1 and record 2. Input split 2 does not start with record 2, since record 2 is already included in input split 1; it will
start with record 3. As you can see, record 3 is divided between blocks 2 and 3, but input split 2 will still have the
whole of record 3.
Blocks are physical chunks of data stored on disk, whereas an input split is not a physical chunk of data: it is a Java
class with pointers to the start and end locations within blocks. So when a mapper reads the data, it knows exactly
where to start and where to stop reading. An input split can start in one block and end in another.
Input splits respect logical record boundaries, and that is why they are so important. During MapReduce execution,
Hadoop scans through the blocks, creates input splits, and assigns each input split to an individual mapper for
processing.
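The example above can be sketched numerically (sizes in MB; a simplification of what the InputFormat actually does):

```python
# Sketch contrasting hard block cuts with record-aligned input splits
# (sizes in MB; real Hadoop works in bytes and relies on the InputFormat).
def block_boundaries(file_size, block_size):
    # blocks are hard cuts: they can end mid-record
    return list(range(block_size, file_size, block_size)) + [file_size]

def split_boundaries(record_size, num_records):
    # an input split always ends exactly at a record boundary
    return [record_size * (i + 1) for i in range(num_records)]

file_size = 300                          # three 100 MB records
print(block_boundaries(file_size, 128))  # [128, 256, 300] - cuts fall inside records 2 and 3
print(split_boundaries(100, 3))          # [100, 200, 300] - always record-aligned
```

The block cut at 128 MB falls in the middle of record 2, while every split boundary lines up with the end of a record; that is exactly the gap input splits close.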
What Is Replication Factor?
The replication factor dictates how many copies of each block are kept in your cluster for fault tolerance. The
replication factor is 3 by default, so any file you create in HDFS will have a replication factor of 3 and each block
of the file will be copied to 3 different nodes in your cluster.
Change Replication Factor – Why?
Let’s say you have a 1 TB dataset and the default replication factor is 3 in your cluster. Which means each block
from the dataset will be replicated 3 times in the cluster. Let’s say this 1 TB dataset is not that critical for you,
meaning if the dataset is corrupted or lost it would not cause a business impact. In that case you can set the
replication factor on just this dataset to 1 leaving the other files or datasets in HDFS untouched.
Use the -setrep command to change the replication factor for files that already exist in HDFS. -R flag would
recursively change the replication factor on all the files under the specified folder
The replication factor in HDFS can be modified or overwritten in 2 ways-
1) Using the Hadoop FS shell, the replication factor can be changed on a per-file basis using the command below:
$ hadoop fs -setrep -w 2 /my/test_file
2) Using the Hadoop FS shell, the replication factor of all files under a given directory can be modified using the
command below:
$ hadoop fs -setrep -w 5 /my/test_dir
Replication causes data redundancy, then why is it pursued in HDFS?
HDFS works with commodity hardware (systems with average configurations) that has a high chance of crashing at
any time. Thus, to make the entire system highly fault-tolerant, HDFS replicates data and stores it in different
places. Any data on HDFS is stored in at least 3 different locations by default. So even if one copy is corrupted and
another is unavailable for some time for any reason, data can still be accessed from the third. Hence, there is
practically no chance of losing the data. This replication factor is what gives Hadoop its fault tolerance.
What is Fault Tolerance?
Suppose you have a file stored in a system, and due to some technical problem that file gets destroyed. Then there is
no chance of getting the data back present in that file. To avoid such situations, Hadoop has introduced the feature of
fault tolerance in HDFS. In Hadoop, when we store a file, it automatically gets replicated at two other locations also.
So even if one or two of the systems collapse, the file is still available on the third system.
Since the data is replicated thrice in HDFS, does it mean that any calculation done on one node will also be
replicated on the other two?
Since there are 3 replicas, when we run MapReduce programs, calculations are done only on the original data. The
master node knows which node exactly has that particular data. If one of the nodes is not responding, it is assumed
to have failed; only then is the required calculation done on the second replica.
If we want to copy 10 blocks from one machine to another, but another machine can copy only 8.5 blocks, can
the blocks be broken at the time of replication?
In HDFS, blocks cannot be broken down. Before copying blocks from one machine to another, the master node will
figure out the actual amount of space required, how many blocks are being used, and how much space is available,
and it will allocate the blocks accordingly.
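Since blocks are indivisible, capacity is effectively counted in whole blocks; a tiny sketch with illustrative numbers:

```python
# Sketch: blocks are never broken up, so capacity is counted in whole blocks.
def blocks_that_fit(free_space_mb, block_size_mb=128):
    # a partially-fitting block is not placed; only whole blocks count
    return free_space_mb // block_size_mb

free = int(8.5 * 128)          # a machine with room for "8.5 blocks"
print(blocks_that_fit(free))   # 8 - the half block cannot be placed
```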
HDFS
HDFS is the primary distributed storage used by Hadoop applications. An HDFS cluster primarily consists of a Name Node that manages the file system metadata and Data Nodes that store the actual data. Clients contact the Name Node for file metadata or file modifications and perform actual file I/O directly with the Data Nodes.
The following are some of the salient features that could be of interest to many users.
The Name Node and Data Nodes have built-in web servers that make it easy to check the current status of the cluster.
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity
hardware.
HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
HDFS provides high throughput access to application data and is suitable for applications that have large
data sets.
New features and improvements are regularly implemented in HDFS. The following is a subset of useful
features in HDFS:
o File permissions and authentication.
o Rack awareness: to take a node’s physical location into account while scheduling tasks and
allocating storage.
o Safemode: an administrative mode for maintenance.
o Fsck: a utility to diagnose the health of the file system and to find missing files or blocks.
o Fetchdt: a utility to fetch a delegation token and store it in a file on the local system.
o Balancer: a tool to balance the cluster when the data is unevenly distributed among datanodes.
o Upgrade and rollback: after a software upgrade, it is possible to roll back to HDFS’ state before
the upgrade in case of unexpected problems.
o Secondary namenode: performs periodic checkpoints of the namespace and helps keep the size
of the file containing the log of HDFS modifications within certain limits at the namenode.
o Checkpoint node: performs periodic checkpoints of the namespace and helps minimize the size
of the log stored at the namenode containing changes to the HDFS. It replaces the role previously
filled by the Secondary namenode, though it is not yet battle-hardened. The namenode allows
multiple Checkpoint nodes simultaneously, as long as there are no Backup nodes registered with
the system.
o Backup node: An extension to the Checkpoint node. In addition to checkpointing it also receives
a stream of edits from the namenode and maintains its own in-memory copy of the namespace,
which is always in sync with the active namenode namespace state. Only one Backup node may be
registered with the namenode at once.
Secondary namenode
The namenode stores modifications to the file system as a log appended to a native file system file, edits. When a
namenode starts up, it reads HDFS state from an image file, fsimage, and then applies edits from the edits log file. It
then writes new HDFS state to the fsimage and starts normal operation with an empty edits file. Since namenode
merges fsimage and edits files only during start up, the edits log file could get very large over time on a busy cluster.
Another side effect of a larger edits file is that the next restart of the namenode takes longer.
The secondary namenode merges the fsimage and the edits log files periodically and keeps edits log size within a
limit. It is usually run on a different machine than the primary namenode since its memory requirements are on the
same order as the primary namenode.
The start of the checkpoint process on the secondary namenode is controlled by two configuration parameters.
dfs.namenode.checkpoint.period, set to 1 hour by default, specifies the maximum delay between two
consecutive checkpoints, and
dfs.namenode.checkpoint.txns, set to 1 million by default, defines the number of uncheckpointed
transactions on the namenode which will force an urgent checkpoint, even if the checkpoint period has not
been reached.
The secondary namenode stores the latest checkpoint in a directory which is structured the same way as the primary namenode's directory, so that the checkpointed image is always ready to be read by the primary namenode if necessary.
Import Checkpoint
The latest checkpoint can be imported to the namenode if all other copies of the image and the edits files are lost. In
order to do that one should:
Create an empty directory specified in the dfs.namenode.name.dir configuration variable;
Specify the location of the checkpoint directory in the configuration variable dfs.namenode.checkpoint.dir;
And start the namenode with -importcheckpoint option.
The namenode will upload the checkpoint from the dfs.namenode.checkpoint.dir directory and then save it to the namenode directory(s) set in dfs.namenode.name.dir. The namenode will fail if a legal image is already contained in dfs.namenode.name.dir. The namenode verifies that the image in dfs.namenode.checkpoint.dir is consistent, but does not modify it in any way.
Safemode
During start up the namenode loads the file system state from the fsimage and the edits log file. It then waits for datanodes to report their blocks so that it does not prematurely start replicating blocks even though enough replicas already exist in the cluster. During this time the namenode stays in Safemode. Safemode for the namenode is essentially a read-only mode for the HDFS cluster, where it does not allow any modifications to the file system or blocks. Normally the namenode leaves Safemode automatically after the datanodes have reported that most file system blocks are available. If required, HDFS can be placed in Safemode explicitly using the bin/hdfs dfsadmin -safemode command. The namenode front page shows whether Safemode is on or off. A more detailed description and configuration is maintained in the javadoc for setSafeMode().
How does one switch off the SAFEMODE in HDFS?
You use the command: hdfs dfsadmin -safemode leave
Fsck
HDFS supports the fsck command to check for various inconsistencies. It is designed for reporting problems with various files, for example, missing blocks for a file or under-replicated blocks. Unlike a traditional fsck utility for native file systems, this command does not correct the errors it detects. Normally the namenode automatically corrects most of the recoverable failures. By default fsck ignores open files but provides an option to select all files during reporting. The HDFS fsck command is not a Hadoop shell command. It can be run as bin/hdfs fsck. Fsck can be run
on the whole file system or on a subset of files.
Fetchdt
HDFS supports the fetchdt command to fetch a delegation token and store it in a file on the local system. This token can later be used to access a secure server (the namenode, for example) from a non-secure client. The utility uses either RPC or HTTPS (over Kerberos) to get the token, and thus requires Kerberos tickets to be present before the run (run kinit to get the tickets). The HDFS fetchdt command is not a Hadoop shell command. It can be run as bin/hdfs fetchdt dtfile. After you have the token, you can run an HDFS command without having Kerberos tickets by pointing the HADOOP_TOKEN_FILE_LOCATION environment variable to the delegation token file. For command usage, see the fetchdt command.
Recovery Mode
Typically, you will configure multiple metadata storage locations. Then, if one storage location is corrupt, you can
read the metadata from one of the other storage locations.
However, what can you do if the only storage locations available are corrupt? In this case, there is a special namenode startup mode called Recovery mode that may allow you to recover most of your data. You can start the namenode in recovery mode like so: namenode -recover. When in recovery mode, the namenode will interactively prompt you at the command line about possible courses of action you can take to recover your data. If you don't want to be prompted, you can give the -force option. This option will force recovery mode to always select the first choice. Normally, this will be the most reasonable choice. Because recovery mode can cause you to lose data, you should always back up your edit log and fsimage before using it.
Upgrade and Rollback
When Hadoop is upgraded on an existing cluster, as with any software upgrade, it is possible there are new bugs or incompatible changes that affect existing applications and were not discovered earlier. In any non-trivial HDFS installation, it is not an option to lose any data, let alone to restart HDFS from scratch. HDFS allows administrators to go back to an earlier version of Hadoop and roll back the cluster to the state it was in before the upgrade. HDFS upgrade is described in more detail in the Hadoop Upgrade Wiki page. HDFS can have one such backup at a time. Before upgrading, administrators need to remove the existing backup using the bin/hadoop dfsadmin -
finalizeupgrade command. The following briefly describes the typical upgrade procedure:
Before upgrading the Hadoop software, finalize if there is an existing backup. dfsadmin -upgradeprogress status
can tell if the cluster needs to be finalized.
Stop the cluster and distribute new version of Hadoop.
Run the new version with -upgrade option (bin/start-dfs.sh -upgrade).
Most of the time, the cluster works just fine. Once the new HDFS is considered to be working well (maybe after a
few days of operation), finalize the upgrade. Note that until the cluster is finalized, deleting files that
existed before the upgrade does not free up real disk space on the datanodes.
If there is a need to move back to the old version,
o Stop the cluster and distribute earlier version of Hadoop.
o Run the rollback command on the namenode (bin/hdfs namenode -rollback).
o Start the cluster with rollback option. (sbin/start-dfs.sh -rollback).
Namenode and datanodes
HDFS has a master/slave architecture. An HDFS cluster consists of a single namenode, a master server that manages
the file system namespace and regulates access to files by clients. In addition, there are a number of datanodes,
usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a
file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and
these blocks are stored in a set of datanodes. The namenode executes file system namespace operations like opening,
closing, and renaming files and directories. It also determines the mapping of blocks to datanodes. The datanodes
are responsible for serving read and write requests from the file system’s clients. The datanodes also perform block
creation, deletion, and replication upon instruction from the namenode.
The namenode and datanode are pieces of software designed to run on commodity machines. These machines
typically run a GNU/Linux operating system (OS).
What is a heartbeat in HDFS?
A heartbeat is a signal indicating that a node is alive. A datanode sends heartbeats to the namenode, and a task tracker sends heartbeats to the job tracker. If the namenode or job tracker does not receive heartbeats, it decides that there is some problem with the datanode, or that the task tracker is unable to perform the assigned task.
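Heartbeat-based failure detection can be sketched as follows. This is an illustrative simulation, not Hadoop code; the timeout value mirrors HDFS's common default of roughly 10 minutes without heartbeats before a datanode is declared dead, and the node names and timestamps are made up.

```python
# Hypothetical sketch of heartbeat-based failure detection.

HEARTBEAT_TIMEOUT = 600  # seconds without a heartbeat before a node is marked dead

def live_nodes(last_heartbeat, now):
    """Return the datanodes whose last heartbeat is within the timeout window."""
    return {node for node, ts in last_heartbeat.items() if now - ts <= HEARTBEAT_TIMEOUT}

# dn2's last heartbeat was 700 seconds ago, so it is considered dead.
last_heartbeat = {"dn1": 995, "dn2": 300, "dn3": 999}
print(sorted(live_nodes(last_heartbeat, now=1000)))  # ['dn1', 'dn3']
```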
How HDFS Works
An HDFS cluster is comprised of a namenode, which manages the cluster metadata, and datanodes that store the
data. Files and directories are represented on the namenode by inodes. Inodes record attributes like permissions,
modification and access times, or namespace and disk space quotas.
The file content is split into large blocks (typically 128 megabytes), and each block of the file is independently
replicated at multiple datanodes. The blocks are stored on the local file system on the datanodes.
The Namenode actively monitors the number of replicas of a block. When a replica of a block is lost due to a
datanode failure or disk failure, the namenode creates another replica of the block. The namenode maintains the
namespace tree and the mapping of blocks to datanodes, holding the entire namespace image in RAM.
The namenode does not directly send requests to datanodes. It sends instructions to the datanodes by replying to
heartbeats sent by those datanodes. Some key points about datanodes:
1. The datanode is responsible for storing the actual data in HDFS and is also known as the slave.
2. The namenode and datanodes are in constant communication.
3. When a datanode starts up, it announces itself to the namenode along with the list of blocks it is responsible for.
4. When a datanode is down, it does not affect the availability of data or the cluster. The namenode will arrange replication for the blocks managed by the datanode that is not available.
5. A datanode is usually configured with a lot of hard disk space, because the actual data is stored in the datanode.
Ways To Change Number of Reducers
Update the driver program and call setNumReduceTasks with the desired value on the job object:
job.setNumReduceTasks(5);
There is also a better way to change the number of reducers, which is by using the mapred.reduce.tasks property. This is a better option because if you decide to increase or decrease the number of reducers later, you can do so without changing the mapreduce program:
-D mapred.reduce.tasks=10
Usage
hadoop jar /hirw-starterkit/mapreduce/stocks/maxcloseprice-1.0.jar com.hirw.maxcloseprice.MaxClosePrice -D
mapred.reduce.tasks=10 /user/hirw/input/stocks output/mapreduce/stocks
Explain what is Speculative Execution?
When the Hadoop framework finds that a certain task (mapper or reducer) is taking longer on average compared to the other tasks from the same job, it clones the long-running task and runs it on another node. This is called speculative execution. During speculative execution a certain number of duplicate tasks are launched, so multiple copies of the same map or reduce task can be executed on different slave nodes.
Is Speculative Execution Always Beneficial?
In some cases this is beneficial: in a cluster with hundreds of nodes, problems like hardware failure or network congestion are common, and preemptively running a parallel or duplicate task is better than waiting for the problematic task to complete.
But in some cases certain map or reduce tasks are expected to run a little longer than others, so in such instances it is not advisable to speculatively execute tasks, as doing so would unnecessarily take up cluster resources.
How Can I Enable/Disable Speculative Execution?
You can enable and disable map-side and reduce-side speculative execution using the properties mapreduce.map.speculative and mapreduce.reduce.speculative.
Why is it that in HDFS, Reading is performed in parallel but Writing is not?
Using a mapreduce program, a file can be read in parallel by splitting it into blocks. While writing, however, the incoming values are not yet known to the system, so mapreduce cannot be applied and no parallel writing is possible.
What is a block and block scanner in HDFS?
Block - The minimum amount of data that can be read or written is generally referred to as a block in HDFS. The default block size in HDFS is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x.
Block Scanner - Block Scanner tracks the list of blocks present on a datanode and verifies them to find any kind of
checksum errors. Block Scanners use a throttling mechanism to reserve disk bandwidth on the datanode.
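The block scanner's checksum verification can be sketched as follows. This is a simplified illustration in plain Python: HDFS actually keeps CRC32 checksums per 512-byte chunk of a block, while this sketch checksums the whole block in one go.

```python
import zlib

# Illustrative sketch: a block scanner re-reads block data and compares the
# stored checksum against a freshly computed one.

def checksum(data: bytes) -> int:
    return zlib.crc32(data)

def verify_block(data: bytes, stored_checksum: int) -> bool:
    """True if the block's bytes still match the checksum recorded at write time."""
    return checksum(data) == stored_checksum

block = b"some block bytes"
stored = checksum(block)            # recorded when the block was written
print(verify_block(block, stored))  # True: block is intact
print(verify_block(b"corrupted" + block, stored))  # False: corruption detected
```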
What is the port number for namenode, Task Tracker and Job Tracker?
Namenode: 50070
Job Tracker: 50030
Task Tracker: 50060
Explain about the indexing process in HDFS.
The indexing process in HDFS depends on the block size. HDFS stores the last part of the data, which further points to the address where the next part of the data chunk is stored.
Whenever a client submits a hadoop job, who receives it?
The namenode receives the Hadoop job; it then looks for the data requested by the client and provides the block information. The jobtracker takes care of resource allocation for the Hadoop job to ensure timely completion.
Is client the end user in HDFS?
No. The client is an application that runs on your machine and is used to interact with the namenode (job tracker) or datanode (task tracker).
What is the communication channel between client and namenode/datanode?
Client communication with the namenode and datanodes happens over RPC on top of TCP/IP; SSH is used only by the cluster start/stop scripts, not for data transfer.
What is throughput? How does HDFS get a good throughput?
Throughput is the amount of work done in a unit of time. It describes how fast data can be accessed from the system and is usually used to measure the performance of the system. In HDFS, all the systems execute the tasks assigned to them independently and in parallel; this is how HDFS achieves good throughput. By reading data in parallel, we tremendously decrease the actual time needed to read the data.
What is streaming access?
As HDFS works on the principle of Write Once, Read Many, the feature of streaming access is extremely important in HDFS. HDFS focuses not so much on storing the data but on how to retrieve it at the fastest possible speed, especially while analyzing logs. In HDFS, reading the complete data is more important than the time taken to fetch a single record from the data.
Why do we use HDFS for applications having large data sets and not when there are lot of small files?
HDFS is more suitable for a large amount of data in a single file than for small amounts of data spread across multiple files. This is because the namenode is a very expensive high-performance system, and it is not prudent to fill the namenode's memory with the unnecessary metadata that is generated for many small files.
What is Streaming?
Streaming is a feature of the Hadoop framework that allows us to write mapreduce programs in any programming language that can accept standard input and produce standard output. It could be Perl, Python, or Ruby, and does not necessarily need to be Java. However, customization in mapreduce can only be done using Java and not any other programming language.
What happens when a datanode fails during the write process?
When a datanode fails during the write process, a new replication pipeline containing the remaining datanodes is opened and the write process resumes from there until the file is closed. The namenode observes that one of the blocks is under-replicated and creates a new replica asynchronously.
What happens when a datanode fails?
When a datanode fails:
The jobtracker and namenode detect the failure.
All tasks on the failed node are re-scheduled.
The namenode replicates the user's data to another node.
Mention what are the main configuration parameters that user need to specify to run Mapreduce Job?
The user of Mapreduce framework needs to specify
Job’s input locations in the distributed file system
Job’s output location in the distributed file system
Input format
Output format
Class containing the map function
Class containing the reduce function
JAR file containing the mapper, reducer and driver classes
How does namenode tackle datanode failures?
All the datanodes periodically send notifications, known as heartbeat signals, to the namenode, which imply that the datanode is alive. Apart from heartbeats, the namenode also receives a block report from each datanode, which lists all the blocks on that datanode. If the namenode does not receive these, it marks the datanode as dead. As soon as the datanode is marked as non-functional or dead, the namenode initiates re-replication of its blocks from the surviving replicas to other datanodes.
MAPREDUCE
What is mapreduce?
Mapreduce is a programming model for processing distributed datasets on clusters of computers.
Mapreduce Features:
Distributed programming complexity is hidden
Built-in fault tolerance
Programming model is language independent
Parallelization and distribution are automatic
Enables data-local processing
What are ‘maps’ and ‘reduces’?
'Map' and 'Reduce' are two phases of solving a query in HDFS. The map is responsible for reading data from the input location and, based on the input type, generating key/value pairs, that is, an intermediate output on the local machine. The reducer is responsible for processing the intermediate output received from the mapper and generating the final output.
What are the Key/Value Pairs in Mapreduce framework?
Mapreduce framework implements a data model in which data is represented as key/value pairs. Both input and
output data to mapreduce framework should be in key/value pairs only.
What are the constraints to Key and Value classes in Mapreduce?
Any data type used for a value field in a mapper or reducer must implement the org.apache.hadoop.io.Writable interface to enable the field to be serialized and deserialized.
By default, key fields should be comparable with each other, so they must implement Hadoop's org.apache.hadoop.io.WritableComparable interface, which in turn extends Hadoop's Writable interface and the java.lang.Comparable interface.
What will hadoop do when a task is failed in a list of suppose 50 spawned tasks?
It will restart the map or reduce task on some other node manager, and only if the task fails more than 4 times will it kill the job. The maximum number of attempts for map tasks and reduce tasks can be configured with the properties below in the mapred-site.xml file:
mapreduce.map.maxattempts
mapreduce.reduce.maxattempts
The default value for both properties is 4.
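The retry policy can be sketched in a few lines. This is a hedged simulation in plain Python, not framework code: a task attempt that fails is rescheduled until it succeeds or the attempt budget (default 4) is exhausted, at which point the whole job fails.

```python
# Sketch of the task retry policy (illustrative, not Hadoop code).

MAX_ATTEMPTS = 4  # mirrors mapreduce.map.maxattempts / mapreduce.reduce.maxattempts

def run_with_retries(task, max_attempts=MAX_ATTEMPTS):
    """Run task(); re-run on failure; re-raise once all attempts are used up."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except RuntimeError:
            if attempt == max_attempts:
                raise  # attempt budget exhausted: the job is failed

failures = {"count": 0}
def flaky_task():
    # Fails twice, then succeeds - well within the 4-attempt budget.
    if failures["count"] < 2:
        failures["count"] += 1
        raise RuntimeError("task attempt failed")
    return "done"

print(run_with_retries(flaky_task))  # done
```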
Consider case scenario: In Mapreduce system, HDFS block size is 256 MB and we have 3 files of size 256 KB,
266 MB and 500 MB then how many input splits will be made by Hadoop framework?
Hadoop will make 5 splits as follows:
1 split for 256 KB file
2 splits for 266 MB file (1 split of size 256 MB and another split of size 10 MB)
2 splits for 500 MB file (1 Split of size 256 MB and another of size 244 MB)
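The split arithmetic above can be checked with a short sketch. This assumes the split size equals the block size (256 MB) and that each file is split independently, as in the scenario; a file smaller than one block still gets one split.

```python
import math

# Number of input splits per file when split size == HDFS block size (256 MB).
BLOCK_SIZE_MB = 256

def num_splits(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """A file smaller than one block still produces one split."""
    return max(1, math.ceil(file_size_mb / block_size_mb))

files_mb = [0.25, 266, 500]  # 256 KB, 266 MB, 500 MB
splits = [num_splits(size) for size in files_mb]
print(splits, sum(splits))  # [1, 2, 2] 5
```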
How can you set an arbitrary number of mappers and reducers for a Hadoop job?
The numbers of mappers and reducers are calculated by Hadoop based on the DFS block size. It is possible to set an upper limit for the mappers using the conf.setNumMapTasks(int num) method; however, it is not possible to set it to a value lower than the one calculated by Hadoop.
During command-line execution of the jar, use the following options to set the number of mappers and reducers: -D mapred.map.tasks=4 -D mapred.reduce.tasks=3
The above command will allocate 4 mappers and 3 reducers for the job.
What happens if the number of reducers is 0?
The output of the mappers will be stored in separate files on HDFS.
What are the steps involved in mapreduce framework?
First, the mapper maps input key/value pairs to a set of intermediate key/value pairs.
Maps are the individual tasks that transform input records into intermediate records.
The transformed intermediate records do not need to be of the same type as the input records.
A given input pair may map to zero or many output pairs.
The Hadoop mapreduce framework creates one map task for each InputSplit generated by the InputFormat for the job.
It then calls map(WritableComparable, Writable, Context) for each key/value pair in the InputSplit for that task.
All intermediate values associated with a given output key are grouped together and passed to the reducers.
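The map, group-by-key, and reduce steps above can be sketched with a minimal in-memory word count. This is plain Python illustrating the data flow, not the Hadoop API; the shuffle step stands in for the framework's grouping of intermediate values by key.

```python
from collections import defaultdict

# Minimal in-memory sketch of the map -> group-by-key -> reduce flow.

def map_phase(records):
    """Each input record maps to zero or more intermediate (key, value) pairs."""
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Group all intermediate values by key, as the framework does for reducers."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Each key's list of values reduces to a single output value."""
    return {key: sum(values) for key, values in groups.items()}

result = reduce_phase(shuffle(map_phase(["big data", "big cluster"])))
print(result)  # {'big': 2, 'data': 1, 'cluster': 1}
```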
Where is the Mapper Output stored?
The mapper output is stored on the local file system of each individual mapper node. The intermediate data is cleaned up after the Hadoop job completes.
Describe the reducer and its phases in detail.
The reducer reduces a set of intermediate values which share a key to a smaller set of values. The framework then calls reduce(). The reducer has the following phases:
Shuffle:
The sorted mapper output is the input to the reducer. In this phase the framework fetches the relevant partition of the output of all the mappers.
Sort:
The framework groups reducer inputs by key. The shuffle and sort phases occur simultaneously; while map outputs are being fetched, they are merged.
Reduce:
The reduce(WritableComparable, Iterable<Writable>, Context) method is called for each <key, (list of values)> pair in the grouped inputs.
Secondary sort:
If the rules for grouping the intermediate keys are required to be different from those for sorting keys before reduction, one may specify a comparator via Job.setSortComparatorClass(Class).
The output of the reduce task is typically written using Context.write(WritableComparable, Writable).
What does the JobConf class do?
Mapreduce needs to logically separate different jobs running on the same cluster. The JobConf class helps to do job-level settings, such as declaring a job in a real environment. It is recommended that the job name be descriptive and represent the type of job being executed. JobConf specifies the mapper, combiner, partitioner, reducer, InputFormat, and OutputFormat implementations, and other advanced job facets like comparators.
What does conf.setMapperClass do?
conf.setMapperClass sets the mapper class and everything related to the map job, such as reading the data and generating a key/value pair out of the mapper.
What are the primary phases of a Reducer or Explain about the partitioning, shuffle and sort phase
Shuffle Phase-Once the first map tasks are completed, the nodes continue to perform several other map tasks and
also exchange the intermediate outputs with the reducers as required. This process of moving the intermediate
outputs of map tasks to the reducer as inputs is referred to as Shuffling.
Sort Phase- Hadoop mapreduce automatically sorts the set of intermediate keys on a single node before they are
given as input to the reducer.
Partitioning Phase - The process that determines which intermediate keys and values will be received by each reducer instance is referred to as partitioning. The destination partition is the same for a given key irrespective of the mapper instance that generated it.
What does a mapreduce partitioner do and how the user can control which key will go to which reducer?
A mapreduce partitioner makes sure that all the values of a single key go to the same reducer, thus allowing even distribution of the map output over the reducers. It redirects the mapper output to the reducer by determining which reducer is responsible for a particular key.
The partition is decided by applying a hash function to the key; the default partitioner is HashPartitioner.
A custom partitioner is implemented to control which keys go to which reducer:
public class SamplePartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        // return a partition number in the range [0, numReduceTasks)
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
getPartition returns the partition number; numReduceTasks is the number of configured reducers.
How number of partitioners and reducers are related?
The total number of partitions is the same as the number of reduce tasks for the job.
How can we control particular key should go in a specific reducer?
By using a custom partitioner.
How to write a custom partitioner for a Hadoop mapreduce job?
Steps to write a custom partitioner for a Hadoop mapreduce job:
Create a new class that extends the pre-defined Partitioner class.
Override the getPartition method of the Partitioner class.
Add the custom partitioner to the job as a config file in the wrapper which runs Hadoop mapreduce, or add it to the job using the job's setPartitionerClass method.
In Hadoop, if custom partitioner is not defined then, how is data partitioned before it is sent to the reducer?
In this case the default partitioner, HashPartitioner, is used, which does all the work of hashing each key and assigning its records to a reducer.
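Default hash partitioning boils down to `partition = hash(key) mod numReduceTasks`, so every occurrence of a key lands on the same reducer. The sketch below illustrates this in plain Python; Hadoop's HashPartitioner actually uses Java's hashCode, and a stable CRC stands in here so the example is deterministic.

```python
import zlib

NUM_REDUCERS = 3

def get_partition(key: str, num_reducers: int = NUM_REDUCERS) -> int:
    """Hash the key and map it to one of the reducer partitions."""
    return zlib.crc32(key.encode()) % num_reducers

# The same key always maps to the same partition, regardless of which
# mapper produced it, and the partition is always in [0, NUM_REDUCERS).
p1 = get_partition("ABC")
p2 = get_partition("ABC")
print(p1 == p2, 0 <= p1 < NUM_REDUCERS)  # True True
```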
What is a Combiner?
A 'combiner' is a mini-reducer that performs the local reduce task. It receives the input from the mapper on a particular node and sends the output to the reducer. Combiners help enhance the efficiency of mapreduce by reducing the quantum of data that needs to be sent to the reducers.
Explain the differences between a combiner and reducer.
A combiner can be considered a mini-reducer that performs the local reduce task. It runs on the map output and produces the input to the reducers. It is usually used for network optimization when the map generates a large amount of output. Unlike a reducer, the combiner has a constraint that its input and output key and value types must match the output types of the mapper. Combiners can be used only for functions that are commutative and associative. Combiner functions get their input from a single mapper, whereas reducers can get data from multiple mappers as a result of partitioning.
What are combiners and its purpose?
Combiners are used to increase the efficiency of a mapreduce program. They can be used to aggregate intermediate map output locally on individual mapper outputs.
Combiners can help reduce the amount of data that needs to be transferred across to the reducers.
Reducer code can be reused as a combiner if the operation performed is commutative and associative.
Note that Hadoop may or may not execute a combiner.
But the question is, is it always a good idea to reuse reducer program for combiner?
Reducer for Combiner – Good use case
Let’s say we are writing a mapreduce program to calculate maximum closing price for each symbol from a stocks
dataset. The mapper program will emit the symbol as the key and closing price as the value for each stock record
from the dataset. The reducer will be called once for each stock symbol and with a list of closing prices. The reducer
will then loop through all the closing prices for the symbol and will calculate the maximum closing price from the
list of closing prices for that symbol.
Assume Mapper 1 processed 3 records for symbol ABC with closing prices – 50, 60 and 111. Let’s also assume that
Mapper 2 processed 2 records for symbol ABC with closing prices – 100 and 31.
Now the reducer will receive five closing prices for symbol ABC - 50, 60, 111, 100 and 31. The job of the reducer is very simple: it will loop through all 5 closing prices and calculate the maximum closing price, which is 111.
We can use the same reducer program as a combiner after each mapper. The combiner on mapper 1 will process 3 closing prices - 50, 60 and 111 - and will emit only 111, the maximum of those 3 values. The combiner on mapper 2 will process 2 closing prices - 100 and 31 - and will emit only 100, the maximum of those 2 values.
Now, with the combiner, the reducer will only process 2 closing prices for symbol ABC, 111 from mapper 1 and 100 from mapper 2, and will calculate the maximum closing price as 111.
As we can see, the output is the same with and without the combiner, hence in this case reusing the reducer as a combiner works with no issues.
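The good-case walkthrough above can be verified with a short sketch: because max is commutative and associative, taking the max of per-mapper maxima gives the same answer as the max over all values, so the reducer can safely be reused as a combiner here.

```python
# Max closing price with and without a combiner (illustrative values from the text).
mapper1 = [50, 60, 111]
mapper2 = [100, 31]

without_combiner = max(mapper1 + mapper2)               # reducer sees all 5 values
with_combiner = max(max(mapper1), max(mapper2))         # combiner pre-aggregates per mapper
print(without_combiner, with_combiner)  # 111 111
```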
Reducer for Combiner – Bad use case
Let’s say we are writing a mapreduce program to calculate the average volume for each symbol from a stocks
dataset. The mapper program will emit the symbol as the key and volume as the value for each stock record from the
dataset. The reducer will be called once for each stock symbol and with a list of volumes. The reducer will then loop
through all the volumes for the symbol and will calculate the average volume from the list of volumes for that
symbol.
Assume Mapper 1 processed 3 records for symbol ABC with volumes – 50, 60 and 111. Let’s also assume that
Mapper 2 processed 2 records for symbol ABC with volumes – 100 and 31.
Now the reducer will receive five volume values for symbol ABC - 50, 60, 111, 100 and 31. The job of the reducer is very simple: it will loop through all 5 volumes and calculate the average volume, which is 70.4:
(50 + 60 + 111 + 100 + 31) / 5 = 352 / 5 = 70.4
Let's see what happens if we use the same reducer program as a combiner after each mapper. The combiner on mapper 1 will process 3 volumes - 50, 60 and 111 - and will calculate their average as 73.66. The combiner on mapper 2 will process 2 volumes - 100 and 31 - and will calculate their average as 65.5.
Now, with the combiner in place, the reducer will only process 2 average volumes for symbol ABC, 73.66 from mapper 1 and 65.5 from mapper 2, and will calculate the average volume of symbol ABC as (73.66 + 65.5) / 2 = 69.58, which is incorrect, as the correct average volume is 70.4.
So, as we can see, the reducer cannot always be reused as a combiner. Whenever you decide to reuse a reducer as a combiner, ask yourself this question: will my output be the same with and without the combiner?
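The bad-case walkthrough above can also be checked with a sketch: the average of per-mapper averages is not the overall average, so reusing the reducer as a combiner gives the wrong result. A correct combiner for averages would emit (sum, count) pairs instead of averages.

```python
# Average volume with and without a reducer-as-combiner (values from the text).
mapper1 = [50, 60, 111]
mapper2 = [100, 31]

true_average = sum(mapper1 + mapper2) / len(mapper1 + mapper2)   # 352 / 5 = 70.4
avg_of_averages = (sum(mapper1) / len(mapper1) + sum(mapper2) / len(mapper2)) / 2
print(round(true_average, 2), round(avg_of_averages, 2))  # 70.4 69.58

# A correct combiner emits partial (sum, count) pairs; the reducer then
# combines them, and the result matches the true average.
partials = [(sum(mapper1), len(mapper1)), (sum(mapper2), len(mapper2))]
fixed = sum(s for s, _ in partials) / sum(c for _, c in partials)
print(fixed)  # 70.4
```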
Scenario 1 – A job with 100 mappers and 1 reducer takes a long time for the reducer to start after all the mappers are complete. One of the reasons could be that the reducer is spending a lot of time copying the map outputs. In this case we can try a couple of things:
1. If possible add a combiner to reduce the amount of output from the mapper to be sent to the reducer
2. Enable map output compression – this will further reduce the size of the outputs to be transferred to the
reducer.
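As a sketch, map output compression can be turned on with the following Hadoop 2.x properties (set in mapred-site.xml or per job; SnappyCodec is shown as one common codec choice, assuming the Snappy native libraries are installed):

```xml
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```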
Scenario 2 - A particular task is using a lot of memory which is causing the slowness or failure, I will look for ways
to reduce the memory usage.
1. Make sure the joins are made in an optimal way with memory usage in mind. For e.g. in Pig joins, the
LEFT side tables are sent to the reducer first and held in memory, and the RIGHT most table is streamed to
the reducer. So make sure the RIGHT most table is the largest of the datasets in the join.
2. We can also increase the memory allotted to the map and reduce tasks by setting
– mapreduce.map.memory.mb and mapreduce.reduce.memory.mb
Scenario 3 – Understanding the data helps a lot in optimizing the way we use the datasets in PIG and HIVE scripts.
1. If you have smaller tables in join, they can be sent to distributed cache and loaded in memory on the Map
side and the entire join can be done on the Map side thereby avoiding the shuffle and reduce phase
altogether. This will tremendously improve performance. Look up USING REPLICATED in Pig
and MAPJOIN or hive.auto.convert.join in Hive
2. If the data is already sorted you can use USING MERGE which will do a Map Only join
3. If the data is bucketed in Hive, you may use hive.optimize.bucketmapjoin or
hive.optimize.bucketmapjoin.sortedmerge depending on the characteristics of the data
Scenario 4 – The Shuffle process is the heart of a mapreduce program and it can be tweaked for performance
improvement.
1. If you see lots of records are being spilled to the disk (check for Spilled Records in the counters in your
mapreduce output) you can increase the memory available for Map to perform the Shuffle by increasing the
value in io.sort.mb. This will reduce the amount of Map Outputs written to the disk so the sorting of the
keys can be performed in memory.
2. On the reduce side, the merge operation (merging the output from several mappers) can be done in memory
by setting mapred.inmem.merge.threshold to 0
What are the four basic parameters of a mapper?
The four basic parameters of a mapper are LongWritable, Text, Text and IntWritable. The first two represent the
input parameters and the second two the intermediate output parameters.
What are the four basic parameters of a reducer?
The four basic parameters of a reducer are Text, IntWritable, Text and IntWritable. The first two represent the
intermediate input parameters and the second two the final output parameters.
What do the master class and the output class do?
The master class is defined to update the master or the jobtracker, and the output class is defined to write data to
the output location.
What is side data distribution in Mapreduce framework?
The extra read-only data needed by a mapreduce job to process the main dataset is called side data.
There are two ways to make side data available to all the map or reduce tasks.
Job Configuration
Distributed cache
How to distribute side data using job configuration?
Side data can be distributed by setting arbitrary key-value pairs in the job configuration using the various setter
methods on the Configuration object.
In the task, we can retrieve the data from the configuration returned by the Context’s getConfiguration() method.
When can we use side data distribution by job configuration, and when is it not recommended?
Side data distribution by job configuration is useful only when we need to pass a small piece of meta data to
map/reduce tasks.
We shouldn’t use this mechanism for transferring more than a few kilobytes of data because it puts pressure on
memory usage, particularly in a system running hundreds of jobs.
Explain what is distributed Cache in mapreduce Framework?
Distributed cache is an important feature provided by the mapreduce framework. When you want to share some
files across all nodes in a Hadoop cluster, the distributed cache is used. The files could be executable jar files or
simple properties files.
To save network bandwidth, files are normally copied to a given node only once per job.
How to supply files or archives to mapreduce job in distributed cache mechanism?
The files that need to be distributed can be specified as a comma-separated list of URIs as the argument to the
-files option of the hadoop job command. Files can be on the local file system or on HDFS.
Archive files (ZIP files, tar files, and gzipped tar files) can also be copied to task nodes by the distributed cache
using the -archives option; these are unarchived on the task node.
The -libjars option will add JAR files to the classpath of the mapper and reducer tasks.
Jar command with distributed cache
$ hadoop jar example.jar exampleprogram -files Inputpath/example.txt input/filename /output/
How distributed cache works in Mapreduce Framework?
When a mapreduce job is submitted with distributed cache options, the node managers copy the files specified by
the -files, -archives and -libjars options from the distributed cache to a local disk. The files are said to be localized
at this point. The local.cache.size property can be configured to set the cache size on the local disk of the node
managers. Files are localized under the ${hadoop.tmp.dir}/mapred/local directory on the node manager nodes.
Why can’t we just have the file in HDFS and have the application read it instead of distributed cache?
Distributed cache copies the file to all node managers at the start of the job. Now if the node manager runs 10 or 50
map or reduce tasks, it will use the same file copy from distributed cache.
On the other hand, if a file needs to be read from HDFS in the job, then every map or reduce task will access it
from HDFS, so if a node manager runs 100 map tasks it will read this file 100 times from HDFS. Accessing the
same file from the node manager’s local FS is much faster than from the HDFS data nodes.
What mechanism does the Hadoop framework provide to synchronize changes made in the distributed cache
during the run time of the application?
Distributed cache mechanism provides service for copying just read-only data needed by a mapreduce job but not
the files which can be updated. So, there is no mechanism to synchronize the changes made in distributed cache as
changes are not allowed to distributed cached files.
Compare RDBMS with Hadoop mapreduce.
Size of Data: a traditional RDBMS can handle up to gigabytes of data; Hadoop mapreduce can handle petabytes of
data or more.
Updates: an RDBMS supports reading and writing multiple times; Hadoop follows a write-once, read-many-times
model.
When is it not recommended to use mapreduce paradigm for large scale data processing?
It is not suggested to use mapreduce for iterative processing use cases, as it is not cost effective; instead, Apache
Pig can be used for the same.
Where is Mapper output stored?
The intermediate key-value data of the mapper output will be stored on the local file system of the mapper nodes.
This directory location is set in the config file by the Hadoop admin. Once the Hadoop job completes execution,
the intermediate data will be cleaned up.
What is the relationship between Job and Task in Hadoop?
A single job can be broken down into one or many tasks in Hadoop.
Jobtracker and tasktracker
Jobtracker and tasktracker are 2 essential processes involved in mapreduce execution in mrv1 (or Hadoop version
1). Both processes are now deprecated in mrv2 (or Hadoop version 2) and replaced by the Resource Manager,
applicationmaster and Node Manager daemons.
Job Tracker -
1. Jobtracker process runs on a separate node and not usually on a datanode.
2. Jobtracker is an essential Daemon for mapreduce execution in mrv1. It is replaced by
resourcemanager/applicationmaster in mrv2.
3. Jobtracker receives the requests for mapreduce execution from the client.
4. Jobtracker talks to the namenode to determine the location of the data.
5. Jobtracker finds the best tasktracker nodes to execute tasks based on the data locality (proximity of the
data) and the available slots to execute a task on a given node.
6. Jobtracker monitors the individual tasktrackers and submits the overall status of the job back to the
client.
7. When the jobtracker is down, HDFS will still be functional but the mapreduce execution cannot be started
and the existing mapreduce jobs will be halted.
Tasktracker -
1. Tasktracker runs on datanode. Mostly on all datanodes.
2. Tasktracker is replaced by Node Manager in mrv2.
3. Mapper and Reducer tasks are executed on datanodes administered by tasktrackers.
4. Tasktrackers will be assigned Mapper and Reducer tasks to execute by jobtracker.
5. Tasktracker will be in constant communication with the jobtracker signalling the progress of the task in
execution.
6. Tasktracker failure is not considered fatal. When a tasktracker becomes unresponsive, jobtracker will
assign the task executed by the tasktracker to another node.
Explain how jobtracker schedules a task?
The tasktracker sends out heartbeat messages to the jobtracker, usually every few seconds, to make sure that the
jobtracker is active and functioning. The message also informs the jobtracker about the number of available slots,
so the jobtracker can stay up to date with where in the cluster work can be delegated.
What is a taskinstance?
The actual Hadoop mapreduce tasks that run on each slave node are referred to as task instances. Every task
instance has its own JVM process; by default, a new JVM process is spawned for every task instance.
Is it possible to rename the output file?
Yes, this can be done by using the MultipleOutputs class (or by implementing a custom output format).
When should you use a reducer?
It is possible to process the data without a reducer but when there is a need to combine the output from multiple
mappers – reducers are used. Reducers are generally used when shuffle and sort are required.
What is identity Mapper and identity reducer?
Identity Mapper is the default Mapper class provided by Hadoop. When no mapper is specified in a mapreduce
job, this mapper is executed. It doesn’t process, manipulate or perform any computation on the input data; it
simply writes the input data to the output. Its class name is org.apache.hadoop.mapred.lib.IdentityMapper.
Identity reducer passes the input key/value pairs on to the output. Its class name is
org.apache.hadoop.mapred.lib.IdentityReducer. When no reducer class is specified in a mapreduce job, this class
will be picked up by the job automatically.
What do you understand by the term Straggler?
A map or reduce task that takes a long time to finish is referred to as a straggler.
What do you understand by chain Mapper and chain Reducer?
Chain Mapper is a special implementation of the Mapper class through which a set of mapper classes can be run in
a chained fashion within a single map task. In this chained execution, the first mapper’s output becomes the input
of the second mapper, the second mapper’s output the input of the third, and so on until the last mapper.
Its class name is org.apache.hadoop.mapreduce.lib.chain.ChainMapper.
Chain Reducer is similar to Chain Mapper: using it, a single reducer followed by a chain of mappers can be run
within a single reduce task. Unlike Chain Mapper, it does not chain reducers; one reducer runs first and its output
is then passed through the chain of mappers.
Its class name is org.apache.hadoop.mapreduce.lib.chain.ChainReducer.
How can we mention multiple mappers and reducer classes in Chain Mapper or Chain Reducer classes?
In Chain Mapper,
ChainMapper.addMapper() method is used to add mapper classes.
In Chain Reducer,
ChainReducer.setReducer() method is used to specify the single reducer class.
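Conceptually, a chain of mappers behaves like function composition: each stage consumes the previous stage’s output. This plain-Java analogy (not the Hadoop ChainMapper API; the stage functions are made up for illustration) shows the idea:

```java
import java.util.function.Function;

public class ChainAnalogy {

    // Three hypothetical "mapper" stages chained within one task.
    static String chainApply(String record) {
        Function<String, String> stage1 = String::toLowerCase;        // mapper 1
        Function<String, String> stage2 = String::trim;               // mapper 2
        Function<String, String> stage3 = s -> s.split(" ")[0];       // mapper 3

        // Like ChainMapper, each stage's output feeds the next stage's input.
        return stage1.andThen(stage2).andThen(stage3).apply(record);
    }

    public static void main(String[] args) {
        System.out.println(chainApply("  Hello World  ")); // prints "hello"
    }
}
```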
What are the core changes in Hadoop 2.0?
Hadoop 2.x provides an upgrade to Hadoop 1.x in terms of resource management, scheduling and the manner in
which execution occurs. In Hadoop 2.x the cluster resource management capabilities work in isolation from the
mapreduce specific programming logic. This helps Hadoop to share resources dynamically between multiple parallel
processing frameworks. Hadoop 2.x allows workable and fine grained resource configuration leading to efficient
and better cluster utilization so that the application can scale to process larger number of jobs.
List the difference between Hadoop 1 and Hadoop 2.
In Hadoop 1.x, Namenode is the single point of failure. In Hadoop 2.x, we have Active and Passive Namenodes. If
the active Namenode fails, the passive Namenode takes charge. High availability is achieved in Hadoop 2.x.
Also, in Hadoop 2.x, YARN provides a central resource manager. With YARN, you can now run multiple
applications in Hadoop, all sharing a common resource. MR2 is a distributed application that runs the mapreduce
framework on top of YARN. Other tools can also perform data processing via YARN, which was a problem in
Hadoop 1.x.
In Hadoop 2.x, processing is taken care of by other processing models and YARN is responsible for cluster
management, whereas in Hadoop 1.x, mapreduce is responsible for both processing and cluster management.
Hadoop 2.x scales better when compared to Hadoop 1.x, with close to 10000 nodes per cluster.
Hadoop 2.x works on containers and can also run generic tasks whereas Hadoop 1.x works on the concept of slots
What are active and passive Namenodes?
In Hadoop 2.x, we have two Namenodes – an active Namenode and a passive Namenode. The active Namenode is
the Namenode which serves the cluster. The passive Namenode is a standby Namenode, which holds the same data
as the active Namenode. When the active Namenode fails, the passive Namenode takes its place in the cluster.
Hence, the cluster is never without a Namenode.
What comes in Hadoop 2.0 and mapreduce V2 (YARN)?
Namenode: High Availability and Federation
Jobtracker: split into cluster resource management (Resource Manager) and per-application management
(applicationmaster)
What is Apache Hadoop YARN?
YARN is a large scale distributed system for running big data applications which is part of Hadoop 2.0.
YARN stands for Yet Another Resource Negotiator and is also called next-generation Mapreduce, Mapreduce 2 or
mrv2.
It was introduced in the hadoop 0.23 release to overcome the scalability shortcomings of the classic Mapreduce
framework, by splitting the functionality of the Job tracker into a Resource Manager and a per-application
scheduler (applicationmaster).
Is YARN a replacement of Hadoop mapreduce?
YARN is not a replacement of Hadoop but it is a more powerful and efficient technology that supports mapreduce
and is also referred to as Hadoop 2.0 or mapreduce 2.
What are the additional benefits YARN brings in to Hadoop?
Effective utilization of resources, as multiple applications can run in YARN all sharing a common pool of
resources. In Hadoop mapreduce there are separate slots for Map and Reduce tasks, whereas in YARN there is no
fixed slot. The same container can be used for Map and Reduce tasks, leading to better utilization.
YARN is backward compatible, so all existing mapreduce jobs can run on it unchanged.
Using YARN, one can even run applications that are not based on the mapreduce model
What are the main components of Job flow in YARN architecture?
Mapreduce job flow on YARN involves below components.
A Client node, which submits the Mapreduce job.
The YARN Resource Manager, which allocates the cluster resources to jobs.
The YARN Node Managers, which launch and monitor the tasks of jobs.
The mapreduce applicationmaster, which coordinates the tasks running in the mapreduce job.
The HDFS file system is used for sharing job files between the above entities.
How can native libraries be included in YARN jobs?
There are two ways to include native libraries in YARN jobs-
1) By setting -Djava.library.path on the command line, but in this case there are chances that the native libraries
might not be loaded correctly and there is a possibility of errors.
2) The better option to include native libraries is to set LD_LIBRARY_PATH in the .bashrc file.
What is the fundamental idea behind YARN?
In YARN (Yet Another Resource Negotiator), the jobtracker responsibility is split into:
Resource management
Job scheduling/monitoring
handled by separate daemons. YARN supports additional processing models and implements a more flexible
execution engine.
//Input stream for the file in local file system to be written to HDFS
InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
Now we need to create an output stream to the file location in HDFS where we can write the contents of the file
from the local file system. The very first thing we need is a few key pieces of information about the cluster, like
the name node details etc. These details are already specified in the configuration files during cluster setup.
The easiest way to get the configuration of the cluster is by instantiating the Configuration object; this will read
the configuration files from the classpath and load all the information that is needed by the program.
//Get configuration of Hadoop system
Configuration conf = new Configuration();
System.out.println("Connecting to -- " + conf.get("fs.defaultFS"));
//Destination file in HDFS
FileSystem fs = FileSystem.get(URI.create(dst), conf);
OutputStream out = fs.create(new Path(dst));
In the next line we get the FileSystem object using the URL that we passed as the program’s input and the
configuration that we just created. The file system that is returned is a DistributedFileSystem object. Once we
have the file system object, the next thing we need is the OutputStream to the file to which we would like to write
the contents of the file from the local file system.
We will then call the create method on the file system object using the location of the file in HDFS which we passed
to the program as the second parameter
//Copy file from local to HDFS
IOUtils.copyBytes(in, out, 4096, true);
Finally we use the copyBytes method from Hadoop’s IOUtils class, supplying the input and output stream objects.
It reads 4096 bytes at a time from the input stream and writes them to the output stream, copying the entire file
from the local file system to HDFS.
Reading A File From HDFS – Java Program
In the last post we saw how to write a file to HDFS by writing our own Java program. In this post we will see how
to read a file from HDFS by writing a Java program.
Here is the program - FileReadFromHDFS.java
public class FileReadFromHDFS {
public static void main(String[] args) throws Exception {
//File to read in HDFS
String uri = args[0];
Configuration conf = new Configuration();
//Get the file system - HDFS
FileSystem fs = FileSystem.get(URI.create(uri), conf);
FSDataInputStream in = null;
try {
//Open the path mentioned in HDFS
in = fs.open(new Path(uri));
IOUtils.copyBytes(in, System.out, 4096, false);
} finally {
IOUtils.closeStream(in);
}
}
}
This program will take in an argument which is nothing but the fully qualified HDFS path to a file which we would
read and display the contents of the file on the screen. This program will simulate the hadoop fs -cat command.
In the next line we get the FileSystem object using the URL that we passed as the program input and the
configuration that we just created. This returns a DistributedFileSystem object, and once we have the file system
object the next thing we need is the input stream to the file that we would like to read.
in = fs.open(new Path(uri));
IOUtils.copyBytes(in, System.out, 4096, false);
We get the input stream by calling the open method on the file system object, supplying the HDFS URL of the
file we would like to read. Then we use the copyBytes method from Hadoop’s IOUtils class to read the entire
file’s contents from the input stream and print it on the screen.
Steps
1. Create the input file
Create the input.txt file with sample text.
$ vi input.txt
Thanks Lord Krishna for helping us write this book
Hare Krishna Hare Krishna Krishna Krishna Hare Hare
Hare Rama Hare Rama Rama Rama Hare Hare
2. Move the input file into HDFS
Use the -put or -copyFromLocal command to move the file into HDFS
$ hadoop fs -put input.txt
3. Code for the mapreduce program
Java files:
WordCountProgram.java // Driver Program
WordMapper.java // Mapper Program
WordReducer.java // Reducer Program
————————————————–
WordCountProgram.java File: Driver Program
————————————————–
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class WordCountProgram extends Configured implements Tool {
@Override
public int run(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "WordCountProgram");
job.setJarByClass(getClass());
// Configure the input source
TextInputFormat.addInputPath(job, new Path(args[0]));
job.setInputFormatClass(TextInputFormat.class);
job.setMapperClass(WordMapper.class);
job.setReducerClass(WordReducer.class);
// Configure the output
TextOutputFormat.setOutputPath(job, new Path(args[1]));
job.setOutputFormatClass(TextOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new WordCountProgram(), args);
System.exit(exitCode);
}
}
————————————————–
WordMapper.java File: Mapper Program
————————————————–
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable count = new IntWritable(1);
private final Text nameText = new Text();
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException,
InterruptedException {
StringTokenizer tokenizer = new StringTokenizer(value.toString(), " ");
while (tokenizer.hasMoreTokens()) {
nameText.set(tokenizer.nextToken());
context.write(nameText, count);
}
}
}
————————————————–
WordReducer.java File: Reducer Program
————————————————–
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
protected void reduce(Text t, Iterable<IntWritable> counts, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable count : counts) {
sum += count.get();
}
context.write(t, new IntWritable(sum));
}
}
4. Run the mapreduce program
Create the jar of the Code in Step 3 and use the following command to run the mapreduce program
$ hadoop jar wordcount.jar WordCountProgram input.txt output1
Here,
wordcount.jar: Name of the exported jar containing all the classes.
WordCountProgram: Driver program holding the entire configuration
input.txt: Input file
output1: Output folder where the output file will be stored
5. View the Output
View the output in the output1 folder
$ hadoop fs -cat /user/cloudera/output1/part-r-00000
Hare 8
Krishna 5
Lord 1
Rama 4
Thanks 1
book 1
for 1
helping 1
this 1
us 1
write 1
HDFS Use Case-
Nokia deals with more than 500 terabytes of unstructured data and close to 100 terabytes of structured data. Nokia
uses HDFS for storing all the structured and unstructured data sets as it allows processing of the stored data at a
petabyte scale.
Miscellaneous Types
Hive Supports 2 more primitive Data types
BOOLEAN
BINARY
Hive BOOLEAN is similar to Java’s boolean type; it can store only true or false values.
BINARY is an array of bytes, like VARBINARY in many RDBMSs. BINARY columns are stored within the
record, not separately like BLOBs.
Implicit Conversion Between Primitive Data Types
TINYINT—>SMALLINT–>INT–>BIGINT–>FLOAT–>DOUBLE
TINYINT can be converted to any other numeric data type, but BIGINT can only be converted to FLOAT or
DOUBLE; the reverse conversions do not happen implicitly.
BOOLEAN and BINARY data types will not be converted to any other data type implicitly.
Explicit Conversion Between Primitive Data Types
Explicit type conversion can be done using the cast operator only.
Example: CAST('500' AS INT) will convert the string '500' to the integer value 500. But if CAST is used
incorrectly, as in CAST('Hello' AS INT), the cast operation fails and returns NULL.
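The same fail-to-NULL behaviour can be mimicked in plain Java (this is only an analogy of Hive’s CAST semantics; the castToInt helper is made up for illustration):

```java
public class CastDemo {

    // Rough analogue of Hive's CAST(s AS INT): returns the parsed value,
    // or null when the conversion is invalid (instead of throwing).
    static Integer castToInt(String s) {
        try {
            return Integer.valueOf(s.trim());
        } catch (NumberFormatException e) {
            return null; // like Hive, an invalid cast yields NULL
        }
    }

    public static void main(String[] args) {
        System.out.println(castToInt("500"));   // 500
        System.out.println(castToInt("Hello")); // null
    }
}
```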
Complex Data Types
ARRAY Data Type
Same as an array in Java: an ordered sequence of elements of a similar type, indexed using zero-based integers.
All the elements in the array must be of the same data type.
We can create table in hive in two different approaches, one approach is hive external table and another approach is
hive managed table.
i) Hive External Table
ii) Hive Managed Table
Create External Table Hive
hive> CREATE external TABLE IF NOT EXISTS company(cid int,cname string,cloc string,empid int)
> comment 'company data'
> row format delimited
> FIELDS terminated BY '\t'
> LINES terminated BY '\n'
> stored AS textfile location '/hive_external_table_company';
Load Data From HDFS
LOAD DATA inpath '/companydata/company.txt' INTO TABLE company;
When Hive loads data from HDFS, the data file at the path '/companydata/company.txt' is automatically moved to
the specified location. If you do not set the location keyword in your Hive external table, the external table data is
stored in /user/hive/warehouse/company
Create Managed Table in Hive
hive> CREATE TABLE IF NOT EXISTS employeetab(id int,name string,sal int)
> comment 'employee details'
> row format delimited
> FIELDS terminated BY '\t'
> LINES terminated BY '\n'
> stored AS textfile;
Hive managed tables are also called hive internal tables. In a managed table we can also use the LOCATION
keyword to specify the storage path. The default path of a Hive managed or external table can be listed with
hadoop fs -ls /user/hive/warehouse/your_table_name
Hive stores the data for these tables in a subdirectory under the directory defined
by hive.metastore.warehouse.dir (e.g., /user/hive/warehouse), by default.
Load data Local inpath
hive> LOAD DATA LOCAL inpath '/home/mahesh/pig-releated/file.txt' INTO TABLE employeetab;
Hive Managed Table Load Data From HDFS
LOAD DATA inpath '/import22/part-m-00000' INTO TABLE employeetab;
Hive warehouse path
hadoop fs -ls /user/hive/warehouse/employeetab
hadoop fs -cat /user/hive/warehouse/employeetab/part-m-00000
Hive Create Table Differences between Managed/Internal table and External Table
One of the main differences between an external and a managed table in Hive is that when an external table is
dropped, the data associated with it doesn’t get deleted; only the metadata (number of columns, type of columns,
terminators, etc.) gets dropped from the Hive metastore. When a managed table gets dropped, both the metadata
and the data get dropped.
However, most (if not all) of the changes to schema can now be made through ALTER TABLE
Feature | Internal | External
Schema | Data on schema | Schema on data
Storage Location | /user/hive/warehouse | user-specified HDFS location
Data Availability | Within the warehouse directory | Within HDFS
Hive Metastore Data Lost for External Table
The main doubt here: for an external table, if we drop the table, the Hive metastore data is lost while the actual
data persists at the external location. After the metadata is lost, how can we find the table’s data in the external
location?
Hive Metastore Data Lost (Deleted)
We can set the external table location in two different ways. Here is the first way to set the location of a Hive
external table.
hive> CREATE external TABLE depttab(deptid int,dname string,deptloc string,empid int)
> row format delimited
> FIELDS terminated BY '\t'
> stored AS textfile;
In the above table, the hive external table data will be stored under the default hive warehouse path,
like /user/hive/warehouse; we can check the location using the hdfs command below.
hadoop fs -ls /user/hive/warehouse
In this warehouse folder, your hive table name will be created as one of the subfolders, like
/user/hive/warehouse/your_table_name
Here is the second way to set the location of Hive external table
hive> CREATE external TABLE depttab(deptid int,dname string,deptloc string,empid int)
> row format delimited
> FIELDS terminated BY '\t'
> stored AS textfile location '/hivepath/hive_external_table';
If we drop the external table, the hive metastore data will be lost but the actual data remains at its storage location
in HDFS, so we do not need to worry about finding the table’s data in the external location.
Hadoop Hive Input Format Selection
Input formats are playing very important role in Hive performance. Primary choices of Input Format are Text,
Sequence File, RC File, ORC.
Text Input Format:-
Default, Json, CSV formats are available
Slow to read and write, and compressed files can’t be split (which leads to huge maps).
Need to read/decompress all fields.
Sequence File Input Format:-
i) A traditional mapreduce binary file format which stores keys and values as classes. Not a good fit for Hive,
which has SQL types; Hive always stores the entire line as the value.
ii) Default block size is 1 MB
iii) Need to read and decompress all the fields
RC (Row Columnar File) Input Format:-
i) Columns stored separately:
a) Read and decompress only the needed columns.
b) Better compression.
ii) Columns stored as binary blobs, which depend on the metastore to supply the data types
iii) Large blocks with a 4 MB default; still searches the file for split boundaries
ORC (Optimized Row Columnar) Input Format :-
ORC is the Optimized Row Columnar file format. It provides a more efficient way to store relational data than the
RC file format; by using the ORC file format we can reduce the size of the original data by up to 75%. Compared
to the Text, Sequence and RC file formats, ORC is better.
Using ORC files improves performance when Hive is reading, writing, and processing data. RC and ORC show
better performance than the Text and Sequence file formats.
ORC takes less time to access the data and less space to store it than RC. However, the ORC file format increases
CPU overhead by increasing the time it takes to decompress the relational data.
Syntax To Create ORC File Format Table
CREATE TABLE table_orc (column1 STRING, column2 STRING, column3 INT, column4 INT
) STORED AS ORC;
Here is the example table of creating a hive table with Partition, Bucket and ORC File Format
CREATE TABLE airanalytics (flightdate date, dayofweek int,depttime int,crsdepttime int,arrtime int,crsarrtime
int,uniquecarrier varchar(10),flightno int,tailnum int,aet int,cet int,airtime int,arrdelay int,depdelay int,origin
varchar(5),dest varchar(5),distance int,taxin int,taxout int,cancelled int,cancelcode int,diverted string,carrdelay
string,weatherdelay string,securtydelay string,cadelay string,lateaircraft string)
PARTITIONED BY (flight_year String)
Clustered BY (uniquecarrier)
Sorted BY (flightdate)
INTO 24 buckets
Stored AS orc tblproperties('orc.compress'='NONE','orc.stripe.size'='67108864','orc.row.index.stride'='25000')
In the above we are declaring the ORC table properties:
orc.compress indicates the compression technique, e.g. NONE, Snappy, LZO etc.
orc.stripe.size indicates the stripe size of the file
orc.row.index.stride indicates the number of rows between index entries
Inserting the data into the airanalytics table
INSERT INTO TABLE airanalytics partition(flight_year) SELECT datenew, dayofweek, depttime, crsdepttime,
arrtime, crsarrtime, uniquecarrier, flightno, tailnum, aet, cet, airtime, arrdelay, depdelay, origindest, distance, taxin,
taxout, cancelled, cancelcode, diverted, carrdelay, weatherdelay, securtydelay, cadelay, lateaircraft, year(datenew)
AS flight_year FROM airlines
Sort BY flight_year;
Advantages of the ORC file format:
Columns are stored separately
Stores statistics (min, max, sum, count)
Has a lightweight index
Large blocks (256 MB by default) reduce access time and storage space
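The stored min/max statistics are what make the lightweight index useful: a reader can skip whole stripes whose value range cannot match the query predicate. The following is an illustrative sketch of that pruning idea in plain Python (not actual ORC reader internals; the stripe layout here is hypothetical):

```python
# Illustrative sketch: per-stripe min/max statistics let a reader skip
# stripes whose value range cannot satisfy a range predicate.

def prune_stripes(stripes, lo, hi):
    """Keep only stripes whose [min, max] range overlaps [lo, hi]."""
    return [s for s in stripes if not (s["max"] < lo or s["min"] > hi)]

stripes = [
    {"id": 0, "min": 1,   "max": 100},
    {"id": 1, "min": 101, "max": 200},
    {"id": 2, "min": 201, "max": 300},
]

# Predicate: WHERE col BETWEEN 150 AND 250 -> only stripes 1 and 2 are read.
survivors = prune_stripes(stripes, 150, 250)
print([s["id"] for s in survivors])  # [1, 2]
```

Only two of the three stripes are touched; stripe 0 is eliminated purely from its statistics, without decompressing any data.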
Explain the difference between partitioning and bucketing.
Hive organizes tables into partitions, a way of dividing a table into coarse-grained parts based on the value of a
partition column, such as a date. Using partitions can make queries on slices of the data faster.
Tables or partitions may be subdivided further into buckets to give extra structure to the data that may be used for
more efficient queries. For example, bucketing by user ID means we can quickly evaluate a user-based query by
running it on a randomized sample of the total set of users.
CREATE TABLE logs (ts BIGINT, line STRING)
PARTITIONED BY (dt STRING, country STRING);
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS;
Partitioning helps execute queries faster only if the partitioning scheme matches some common range filter, e.g.
timestamp ranges or locations. Bucketing, by contrast, is not enforced by default.
Partitioning helps eliminate data when used in the WHERE clause. Bucketing organizes the data inside a partition
into multiple files so that the same set of key values is always written to the same bucket. Bucketing also speeds
up joins on the bucketed columns.
With the partitioning technique, a partition is created for every unique value of the column, so there could be a
situation where several tiny partitions have to be created. With bucketing, however, one can limit the data to a
specific number of buckets, and the data is then decomposed into those buckets.
Basically, a bucket is a file in Hive whereas partition is a directory.
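The bucket a row lands in is determined by hashing the clustering column modulo the bucket count, so equal keys always share a bucket file. A simplified Python sketch (for integer keys, Hive's hash is effectively the value itself; that simplification is assumed here):

```python
# Illustrative sketch of bucket assignment: hash(column) MOD num_buckets.
# All rows sharing the same key value always land in the same bucket file.

NUM_BUCKETS = 4  # matches INTO 4 BUCKETS in the bucketed_users DDL above

def bucket_for(user_id: int) -> int:
    # For integer columns the hash is the value itself; modulo picks the bucket.
    return user_id % NUM_BUCKETS

rows = [0, 1, 2, 3, 4, 5, 6, 7, 4, 0]
assignments = [bucket_for(r) for r in rows]
print(assignments)  # [0, 1, 2, 3, 0, 1, 2, 3, 0, 0]
```

Note how ids 0 and 4 both map to bucket 0 every time, which is exactly what makes bucketed sampling and bucketed map joins possible.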
Explain about the different types of partitioning in Hive?
Partitions are created when data is inserted into the table. In static partitions, the name of the partition is hardcoded
into the insert statement whereas in a dynamic partition, Hive automatically identifies the partition based on the
value of the partition field. Static or dynamic partitioning is chosen based on how data is loaded into the table,
the requirements for the data, and the format in which data is produced at the source. In dynamic partitioning,
the complete data in the file is read and partitioned into the table through a MapReduce job based on a particular
field in the file.
Dynamic partitions are usually helpful during ETL flows in the data pipeline.
When loading data from huge files, static partitions are preferred over dynamic partitions as they save time in
loading data. The partition is added to the table and then the file is moved into the static partition. The partition
column value can be obtained from the file name without having to read the complete file.
Optimization Techniques
Hive Buckets
Hive buckets (or clusters) are another technique for decomposing data into more manageable, roughly equal
parts. For example, consider a table with columns such as date, employee_name, employee_id, salary, and leaves.
Using the date column as the top-level partition and employee_id as the second-level partition would lead to too
many small partitions. So here the employee table is partitioned by date and bucketed by employee_id. The value
of this column is hashed into a user-defined number of buckets, so records with the same employee_id will always
be stored in the same bucket.
With partitioning there is a good chance of creating thousands of tiny partitions, but with buckets this cannot
happen, because the number of buckets for a table must be declared at table-creation time. Hive partitions use
PARTITIONED BY, whereas buckets use CLUSTERED BY.
Advantages of Hive buckets:
The number of buckets is fixed, so it does not fluctuate with the data
hash(column) MOD (number of buckets) gives an even distribution
Useful for query optimization techniques
CREATE TABLE order (
Username STRING, orderdate STRING, amount DOUBLE, tax DOUBLE
) PARTITIONED BY (company STRING)
CLUSTERED BY (username) INTO 25 BUCKETS;
Here we divided the table into 25 buckets. Set the maximum number of reducers to the same number of buckets
specified in the table metadata (i.e. 25): set mapred.reduce.tasks = 25
To enforce bucketing: set hive.enforce.bucketing = true
With hive.enforce.bucketing enabled, Hive sets the number of reducers to match the bucket count automatically.
The partition of hive table has been modified to point to a new directory location. Do I have to move the data
to the new location or the data will be moved automatically to the new location?
Changing the partition location will not move the data to the new location. It has to be moved manually from the
old location to the new one.
INCREMENTAL UPDATES IN APACHE HIVE TABLES
In the BI world, a delta/incremental load that updates existing records and inserts new ones is a very common
process.
I am assuming that adequate temp space is available for your data volume, so that we do not face temp-space
issues during the process.
Step1: Create a Hive target table and do a full load from your source.
My target table is orders; its create statement:
Let's say the full load is done. Now we have data in our target table orders (I have loaded only 5 records).
Step2: Create a stage table to hold all the delta data (records which need to be updated and new records which
need to be inserted into the DW table).
I created a stage table orders_stage; its table-creation script:
Now load the delta data:
Step3: Create one more temporary table to merge the delta records in orders_stage with the Hive target table orders.
Now let’s merge orders and orders_stage data and load it into temporary table orders_temp using below script:
insert into orders_temp partition (order_date)
select t1.* from
(select * from orders union all select * from orders_stage) t1
join
(select order_no, max(last_update_date) as last_update_date
from (select * from orders union all select * from orders_stage) t2
group by order_no) t3
on t1.order_no = t3.order_no and t1.last_update_date = t3.last_update_date;
Step4: Overwrite the main Hive table with the temp table using dynamic partitioning.
Insert overwrite table orders select * from orders_temp;
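The logic of the Step 3 merge can be sketched outside Hive as well: union the target and stage records, then keep only the newest record per order_no (the max last_update_date). A minimal Python sketch with made-up sample rows:

```python
# Illustrative sketch of the incremental merge: UNION ALL of target and
# stage records, keeping the record with the latest last_update_date
# per order_no (updates win; brand-new orders are inserted).

def merge(orders, orders_stage):
    newest = {}
    for rec in orders + orders_stage:  # UNION ALL
        key = rec["order_no"]
        if key not in newest or rec["last_update_date"] > newest[key]["last_update_date"]:
            newest[key] = rec
    return sorted(newest.values(), key=lambda r: r["order_no"])

orders = [
    {"order_no": 1, "amount": 100, "last_update_date": "2016-01-01"},
    {"order_no": 2, "amount": 200, "last_update_date": "2016-01-01"},
]
orders_stage = [
    {"order_no": 2, "amount": 250, "last_update_date": "2016-02-01"},  # update
    {"order_no": 3, "amount": 300, "last_update_date": "2016-02-01"},  # insert
]

merged = merge(orders, orders_stage)
print(merged)
# order 1 unchanged, order 2 updated to amount 250, order 3 newly inserted
```

The Hive query achieves the same effect via the self-join on (order_no, max(last_update_date)).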
This works and has a nice JavaScript-like "dotted" notation, but notice that you have to parse the same document
once for every field you want to pull out of your JSON document, so it is rather inefficient.
It doesn't know how to look inside the Quux subdocument, and this is where json_tuple gets clunky fast: you have
to create another lateral view for each subdocument you want to descend into.
Now let's define the Hive schema that this SerDe expects and load the simple.json doc:
{
"DocId": "ABC",
"User": {
"Id": 1234,
"Username": "sam1234",
"Name": "Sam",
"ShippingAddress": {
"Address1": "123 Main St.",
"Address2": null,
"City": "Durham",
"State": "NC"
},
"Orders": [
{
"ItemId": 6789,
"OrderDate": "11/11/2012"
},
{
"ItemId": 4352,
"OrderDate": "12/12/2012"
}
]
}
}
Collapsed version:
{"DocId":"ABC","User":{"Id":1234,"Username":"sam1234","Name":"Sam",
"ShippingAddress":{"Address1":"123 Main St.","Address2":"","City":"Durham","State":"NC"},
"Orders":[{"ItemId":6789,"OrderDate":"11/11/2012"},{"ItemId":4352,"OrderDate":"12/12/2012"}]}}
Hive Schema:
But what if we don't know how many orders there are and we want a list of all a user's order Ids? This will work:
Result:
docid id itemid
ABC 1234 [6789,4352]
Oooh, it returns an array of ItemIds. Pretty cool. One of Hive's nice features.
Finally, does the openx JsonSerDe require me to define the whole schema? Or what if I have two JSON docs (say
version 1 and version 2) where they differ in some fields? How constraining is this Hive schema definition?
Let's add two more JSON entries to our JSON document - the first has no orders; the second has a new
"PostalCode" field in Shipping Address.
{
"DocId": "ABC",
"User": {
"Id": 1235,
"Username": "fred1235",
"Name": "Fred",
"ShippingAddress": {
"Address1": "456 Main St.",
"Address2": "",
"City": "Durham",
"State": "NC"
}
}
}
{
"DocId": "ABC",
"User": {
"Id": 1236,
"Username": "larry1234",
"Name": "Larry",
"ShippingAddress": {
"Address1": "789 Main St.",
"Address2": "",
"City": "Durham",
"State": "NC",
"PostalCode": "27713"
},
"Orders": [
{
"ItemId": 1111,
"OrderDate": "11/11/2012"
},
{
"ItemId": 2222,
"OrderDate": "12/12/2012"
}
]
}
}
Collapsed version:
{"DocId":"ABC","User":{"Id":1235,"Username":"fred1235","Name":"Fred","ShippingAddress":{"Address1":"456
Main St.","Address2":"","City":"Durham","State":"NC"}}}
{"DocId":"ABC","User":{"Id":1236,"Username":"larry1234","Name":"Larry","ShippingAddress":{"Address1":"78
9 Main St.","Address2":"","City":"Durham","State":"NC","PostalCode":"27713"},
"Orders":[{"ItemId":1111,"OrderDate":"11/11/2012"},{"ItemId":2222,"OrderDate":"12/12/2012"}]}}
Append those records to complex.json and reload the data into the complex_json table.
docid id itemid
ABC 1234 [6789,4352]
ABC 1235 null
ABC 1236 [1111,2222]
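What the SerDe-backed query computes can be mirrored in plain Python: pull DocId, User.Id, and the list of Orders[].ItemId from each JSON line, with a missing "Orders" key surfacing as null, just like row 1235 above. (This sketch uses the standard json module, not the SerDe itself.)

```python
# Illustrative sketch of the complex_json query result: DocId, User.Id,
# and the array of Orders[].ItemId; a missing "Orders" key yields None,
# which matches the null in the Hive output above.
import json

docs = [
    '{"DocId":"ABC","User":{"Id":1234,"Orders":[{"ItemId":6789},{"ItemId":4352}]}}',
    '{"DocId":"ABC","User":{"Id":1235}}',  # no Orders field
]

results = []
for line in docs:
    doc = json.loads(line)
    user = doc["User"]
    orders = user.get("Orders")          # None when the field is absent
    item_ids = [o["ItemId"] for o in orders] if orders else None
    results.append((doc["DocId"], user["Id"], item_ids))

print(results)
# [('ABC', 1234, [6789, 4352]), ('ABC', 1235, None)]
```

This is also why the schema definition is not very constraining: fields absent from a document simply come back as null, and extra fields (like PostalCode) are ignored unless declared.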
How do you set variables in Hive scripts, i.e. how do you pass parameters when invoking a script?
Arguments can be passed in two ways:
1. Passing the value through the CLI
Command: hive -hiveconf current_date=01-01-2015 -f argument.hql
Here my script is:
select * from glvc.product where date = '${hiveconf:current_date}';
The command executes fine and returns the expected result.
2. Setting the value inside the script
In this case I have already set the value in my script file and don't want to pass it through the CLI.
Running hive -hiveconf:current_date -f argument.hql gives no result, which is why the variable is set
inside the script instead.
Script -
set current_date = 01-01-2015;
select * from glvc.product where date = '${hiveconf:current_date}';
Regular Expression:
Hive's RegexSerDe can be used to extract columns from the input file using regular expressions. It is used only
to deserialize data; data serialization is not supported.
There are two classes available:
org.apache.hadoop.hive.contrib.serde2.RegexSerDe
org.apache.hadoop.hive.serde2.RegexSerDe
A regex group is defined by parenthesis "(...)" inside the regex.
On individual lines, if a row matches the regex but has less than expected groups, the missing groups and table fields
will be NULL. If a row matches the regex but has more than expected groups, the additional groups are just ignored.
If a row doesn't match the regex then all fields will be NULL.
Note that the regex contains 3 regex groups capturing the first, second and fifth field on each line, corresponding to
3 table columns:
2<tab>大阪<tab>Osaka<tab><tab>
31<tab>Якутск<tab>Yakutsk<tab>Russia
121<tab>München<tab>Munich<tab><tab>1.2
On lines 1 and 3 we have 5 fields, but some are empty, while on line 2 we have only 4 fields and 3 tabs. If
we attempt to read the file using the regex given for table citiesr1, we'll end up with all NULLs on these 3 lines
because the regex doesn't match them. To rectify the problem we can change the regex slightly to allow for
such cases:
hive> CREATE EXTERNAL TABLE citiesr3 (id int, city_org string, ppl float) ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES
('input.regex'='^(\\d+)\\t([^\\t]*)\\t[^\\t]*\\t[^\\t]*[\\t]*(.*)') LOCATION '/user/it1/hive/serde/regex2';
The first 2 groups are unchanged; however, we have replaced both "\\S+" patterns for the unused columns with
[^\\t]*, made the last delimiting tab optional, and set the last group to "(.*)", meaning everything after the
last tab including the empty string. With these changes, the above 3 lines become:
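The behavior of the corrected regex can be sanity-checked with Python's re module (assuming Java and Python regex semantics agree for this simple pattern). Group 1 is id, group 2 is city_org, group 3 is ppl; an empty group 3 corresponds to a NULL ppl value in the Hive table:

```python
# Checking the citiesr3 regex against the three sample lines.
import re

pattern = re.compile(r'^(\d+)\t([^\t]*)\t[^\t]*\t[^\t]*[\t]*(.*)')

lines = [
    '2\t大阪\tOsaka\t\t',
    '31\tЯкутск\tYakutsk\tRussia',
    '121\tMünchen\tMunich\t\t1.2',
]

results = [pattern.match(line).groups() for line in lines]
print(results)
# [('2', '大阪', ''), ('31', 'Якутск', ''), ('121', 'München', '1.2')]
```

All three lines now match, and only line 3 yields a non-empty ppl value (1.2); the other two come back as NULL.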
Though Cloudera Impala uses the same query language, metastore, and user interface as Hive, it differs from Hive
and HBase in certain aspects. The following table presents a comparative analysis of HBase, Hive, and Impala.
HiveQL follows a flat relational data model, whereas Pig Latin has a nested relational data model.
Both HiveQL and Pig Latin commands are converted into MapReduce jobs.
Neither can be used for OLTP transactions, as it is difficult to execute low-latency queries.
Differentiate between Hadoop MapReduce and Pig
Characteristic | MapReduce | Pig
Type of language | Compiled language | Scripting language
Level of abstraction | Low level of abstraction | Higher level of abstraction
Code | More lines of code required | Comparatively fewer lines of code
Code efficiency | High | Relatively lower
Give a sqoop command to import the columns employee_id, first_name, last_name from the mysql table
Employee
$ sqoop import --connect jdbc:mysql://host/dbname --table EMPLOYEES \
--columns employee_id,first_name,last_name
Give a sqoop command to run only 8 mapreduce tasks in parallel
$ sqoop import --connect jdbc:mysql://host/dbname --table table_name \ -m 8
Give a Sqoop command to import all the records from employee table divided into groups of records by the
values in the column department_id.
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \ --split-by dept_id
What does the following query do?
$ sqoop import --connect jdbc:mysql://db.foo.com/somedb --table sometable \
--where "id > 1000" --target-dir /incremental_dataset --append
It performs an incremental import of new data, after having already imported the first 1000 rows of a table
What is the use of the --append option in sqoop?
The --append option adds new output files alongside the existing ones in the old directory; there is no need to
overwrite it or create a new directory, as the records are appended to the old directory.
Example of the append option in sqoop:
sqoop import --connect jdbc:mysql://localhost/mahesh --table emp -m 1 --columns 'empid,ename' --target-dir
/importdata --fields-terminated-by '\t' --append;
O/P:-
root@ubuntu:/home/mahesh/sqoop-related# hadoop fs -ls /importdata
Found 5 items
-rw-r--r-- 1 root supergroup 0 2013-11-07 19:05 /importdata/_SUCCESS
drwxr-xr-x - root supergroup 0 2013-11-07 19:04 /importdata/_logs
drwxr-xr-x - root supergroup 0 2013-11-07 19:11 /importdata/_logs-00000
-rw-r--r-- 1 root supergroup 67 2013-11-07 19:04 /importdata/part-m-00000
-rw-r--r-- 1 root supergroup 43 2013-11-07 19:11 /importdata/part-m-00001
root@ubuntu:/home/mahesh/sqoop-related# hadoop fs -cat /importdata/part-m-00001
111 mahesh
112 neelesh
113 rupesh
114 vijay
When the source data keeps getting updated frequently, what is the approach to keep it in sync with the data
in HDFS imported by sqoop? Or What is the process to perform an incremental data load in Sqoop?
Sqoop supports two approaches:
A − Use the --incremental parameter with the append option, where a check column's value is compared against
--last-value and only rows with greater values are imported as new rows.
B − Use the --incremental parameter with the lastmodified option, where a date column in the source is checked
for records which have been updated after the last import.
Give a sqoop command to import data from all tables in the mysql DB DB1.
$ sqoop import-all-tables --connect jdbc:mysql://host/DB1
What are the basic available commands in Sqoop?
codegen: Generate code to interact with database records
create-hive-table: Import a table definition into Hive
eval: Evaluate a SQL statement and display the results
export: Export an HDFS directory to a database table
help: List available commands
import: Import a table from a database to HDFS
import-all-tables: Import tables from a database to HDFS
list-databases: List available databases on a server
list-tables: List available tables in a database
version: Display version information
What is the advantage with Eval command in Sqoop?
With eval we can see the output directly in the terminal; there is no need to go and check the output in HDFS.
Eval Query Examples In sqoop?
sqoop eval --connect jdbc:mysql://localhost/mahesh --query "select * from emp";
| empid | ename | esal |
| 111 | mahesh | 28000 |
| 112 | neelesh | 30000 |
| 113 | rupesh | 26000 |
| 114 | vijay | 28000 |
sqoop eval --connect jdbc:mysql://localhost/mahesh --query "select * from emp limit 2";
| empid | ename | esal |
| 111 | mahesh | 28000 |
| 112 | neelesh | 30000 |
sqoop eval --connect jdbc:mysql://localhost/mahesh --query "select * from emp where empid = 111";
| empid | ename | esal |
| 111 | mahesh | 28000 |
How to create a table by using eval?
sqoop eval --connect jdbc:mysql://localhost/mahesh --query "create table evaltab(evalid int, evalname varchar(30),
evalscope varchar(30))";
13/11/07 19:35:39 INFO manager.mysqlmanager: Preparing to use a mysql streaming resultset.
13/11/07 19:35:39 INFO tool.evalsqltool: 0 row(s) updated.
sqoop eval --connect jdbc:mysql://localhost/mahesh --query "select * from evaltab";
13/11/07 19:36:02 INFO manager.mysqlmanager: Preparing to use a mysql streaming resultset.
| evalid | evalname | evalscope |
How to insert values into a table by using eval?
sqoop eval --connect jdbc:mysql://localhost/mahesh --query "insert into evaltab values(111,'aaa','app')";
13/11/07 19:37:37 INFO manager.mysqlmanager: Preparing to use a mysql streaming resultset.
13/11/07 19:37:37 INFO tool.evalsqltool: 1 row(s) updated.
sqoop eval --connect jdbc:mysql://localhost/mahesh --query "insert into evaltab values(112,'bbb','prgrm')";
13/11/07 19:37:51 INFO manager.mysqlmanager: Preparing to use a mysql streaming resultset.
13/11/07 19:37:52 INFO tool.evalsqltool: 1 row(s) updated.
sqoop eval --connect jdbc:mysql://localhost/mahesh --query "insert into evaltab values(113,'ccc','project')";
13/11/07 19:38:09 INFO manager.mysqlmanager: Preparing to use a mysql streaming resultset.
13/11/07 19:38:10 INFO tool.evalsqltool: 1 row(s) updated.
root@ubuntu:/home/mahesh/sqoop-related# sqoop eval --connect jdbc:mysql://localhost/mahesh --query "select *
from evaltab";
13/11/07 19:38:27 INFO manager.mysqlmanager: Preparing to use a mysql streaming resultset.
| evalid | evalname | evalscope |
| 111 | aaa | app |
| 112 | bbb | prgrm |
| 113 | ccc | project |
How to show tables in Database by using Eval?
sqoop eval --connect jdbc:mysql://localhost/mahesh --query "show tables";
13/11/07 19:40:22 INFO manager.mysqlmanager: Preparing to use a mysql streaming resultset.
| TABLE_NAME |
| emp |
| evaltab |
How to describe a table by using eval?
sqoop eval --connect jdbc:mysql://localhost/mahesh --query "desc evaltab";
13/11/07 19:40:39 INFO manager.mysqlmanager: Preparing to use a mysql streaming resultset.
| COLUMN_NAME | COLUMN_TYPE | IS_NULLABLE | COLUMN_KEY | COLUMN_DEFAULT | EXTRA |
| evalid | int(11) | YES | | (null) | |
| evalname | varchar(30) | YES | | (null) | |
| evalscope | varchar(30) | YES | | (null) | |
Give a command to execute a stored procedure named proc1 which exports data to from mysql db named
DB1 into a HDFS directory named Dir1.
$ sqoop export --connect jdbc:mysql://host/DB1 --call proc1 \ --export-dir /Dir1
Export Command In Sqoop?
sqoop export --connect jdbc:mysql://localhost/mahesh -m 1 --table emp --export-dir /emptab/part-m-00000;
How will you update the rows that are already exported?
The --update-key parameter can be used to update existing rows. It takes a comma-separated list of columns that
uniquely identifies a row. All of these columns are used in the WHERE clause of the generated UPDATE query;
all other table columns are used in the SET part of the query.
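The split of columns between WHERE and SET can be illustrated with a small Python sketch (a hypothetical statement builder, not Sqoop's actual code generator; the table and column names are made up):

```python
# Illustrative sketch: how --update-key splits columns. Key columns go into
# the WHERE clause; all remaining columns go into the SET clause.

def build_update(table, columns, update_key, row):
    keys = [c.strip() for c in update_key.split(",")]
    sets = [c for c in columns if c not in keys]
    set_sql = ", ".join(f"{c} = {row[c]!r}" for c in sets)
    where_sql = " AND ".join(f"{c} = {row[c]!r}" for c in keys)
    return f"UPDATE {table} SET {set_sql} WHERE {where_sql}"

row = {"id": 7, "name": "sam", "salary": 28000}
sql = build_update("emp", ["id", "name", "salary"], "id", row)
print(sql)  # UPDATE emp SET name = 'sam', salary = 28000 WHERE id = 7
```

(Real Sqoop uses parameterized JDBC statements rather than string interpolation; this sketch only shows which columns land where.)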
What is the difference between the parameters sqoop.export.records.per.statement and
sqoop.export.statements.per.transaction
The parameter sqoop.export.records.per.statement specifies the number of records that will be used in each insert
statement.
The parameter sqoop.export.statements.per.transaction specifies how many insert statements are executed in a
single transaction before a commit is issued.
How can you sync an exported table with HDFS data in which some rows have been deleted?
Truncate the target table and load it again.
How can you export only a subset of columns to a relational table using sqoop?
By using the --columns parameter, in which we mention the required column names as a comma-separated list of
values.
Command aliases in Sqoop?
Sqoop-(toolname)
sqoop-import, sqoop-export
How are large objects handled in Sqoop?
Sqoop provides the capability to store large sized data into a single field based on the type of data. Sqoop supports
the ability to store-
1)CLOB ‘s – Character Large Objects
2)BLOB’s –Binary Large Objects
Large objects in Sqoop are handled by importing them into a file referred to as a LobFile, i.e. a Large Object
File. A LobFile can store records of huge size; each record in the LobFile is a large object.
Can free form SQL queries be used with Sqoop import command? If yes, then how can they be used?
Sqoop allows us to use free-form SQL queries with the import command. The import command should be used with
the -e or --query option to execute free-form SQL queries. When using the -e or --query option with the import
command, the --target-dir value must be specified.
How can you choose a name for the mapreduce job which is created on submitting a free-form query import?
By using the --mapreduce-job-name parameter. Below is an example of the command.
$ sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--query 'SELECT normcities.id, \
countries.country, \
normcities.city \
FROM normcities \
JOIN countries USING(country_id) \
WHERE $CONDITIONS' \
--split-by id \
--target-dir cities \
--mapreduce-job-name normcities
What do you mean by Free Form Import in Sqoop?
Sqoop can import data from a relational database using any SQL query, rather than only the table and column
name parameters.
How can you force sqoop to execute a free form Sql query only once and import the rows serially.
By using the -m 1 option in the import command, sqoop creates only one map task, which imports the rows
sequentially.
Differentiate between Sqoop and distcp.
The distcp utility can be used to transfer data between clusters, whereas Sqoop can be used to transfer data only
between Hadoop and an RDBMS.
What are the limitations of importing RDBMS tables into Hcatalog directly?
There is an option to import RDBMS tables into HCatalog directly by using the --hcatalog-database option with
--hcatalog-table, but the limitation is that several arguments like --as-avrodatafile, --direct,
--as-sequencefile, --target-dir, and --export-dir are not supported.
How to check List of Databases in RDBMS by using Sqoop?
sqoop list-databases --connect jdbc:mysql://localhost;
information_schema
Gopal_Lab
newyeardb
RK
batch18
bhargav
chandu
kelly
How to check List of Tables in single database by using Sqoop?
sqoop list-tables --connect jdbc:mysql://localhost/mahesh;
O/P:-emp
Write a command to Import RDBMS data into HDFS?
sqoop import --connect jdbc:mysql://localhost/mahesh --table emp -m 1;
Where is the table created by the above command?
In HDFS; check with: hadoop fs -ls /
How to read the RDBMS table data in HDFS?
hadoop fs -cat /emp/part-m-00000
111, mahesh,28000
112,neelesh,30000
113,rupesh,26000
114,vijay,28000
What is the Default delimiter between RDBMS table columns?
Comma (,)
How to set target directory and delimiter to Sqoop?
sqoop import --connect jdbc:mysql://localhost/mahesh --table emp -m 1 \
--target-dir /importdata --fields-terminated-by '\t';
O/P:-
hadoop fs -cat /importdata/part-m-00000
111 mahesh 28000
112 neelesh 30000
113 rupesh 26000
114 vijay 28000
What does -m 1 indicate in the above sqoop commands?
-m 1 indicates that the output is written to a single file. If we write -m 2, the output is divided into 2 files,
such as part-m-00000 and part-m-00001.
How to select only specific columns In a table using Sqoop?
By using the --columns option:
sqoop import --connect jdbc:mysql://localhost/mahesh --table emp -m 1 --columns 'empid,ename' --target-dir
/importdata --fields-terminated-by '\t'
How to apply conditions when importing an RDBMS table?
sqoop import --connect jdbc:mysql://localhost/mahesh --table emp -m 1 --columns 'empid,ename' --target-dir
/importdata --fields-terminated-by '\t' --where 'esal>26000' --append;
O/P:-
hadoop fs -cat /importdata/part-m-00002
111 mahesh
112 neelesh
114 vijay
Import Command In Sqoop?
sqoop import --connect jdbc:mysql://localhost/mahesh --table emp -m 1;
Job command In Sqoop?
sqoop job --create deptdata -- import --connect jdbc:mysql://localhost/mahesh --table dept -m 1 --target-dir
/jobimport --append;
Explain about some important Sqoop commands other than import and export.
Create Job (--create)
Here we create a job named myjob, which imports the table data from an RDBMS table to HDFS. The following
command creates a job that imports data from the employee table in the db database to an HDFS file.
$ sqoop job --create myjob \
--import \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee --m 1
Verify Job (--list)
‘--list’ argument is used to verify the saved jobs. The following command is used to verify the list of saved Sqoop
jobs.
$ sqoop job --list
Inspect Job (--show)
‘--show’ argument is used to inspect or verify particular jobs and their details. The following command and sample
output is used to verify a job called myjob.
$ sqoop job --show myjob
Execute Job (--exec)
‘--exec’ option is used to execute a saved job. The following command is used to execute a saved job called myjob.
$ sqoop job --exec myjob
How Sqoop can be used in a Java program?
The Sqoop jar should be included in the Java program's classpath. After this, the Sqoop.runTool() method must
be invoked, with the necessary parameters created for Sqoop programmatically just as on the command line.
How can you check all the tables present in a single database using Sqoop?
The command to check the list of all tables present in a single database using Sqoop is as follows-
sqoop list-tables --connect jdbc:mysql://localhost/user;
What is the role of JDBC driver in a Sqoop set up?
To connect to different relational databases, Sqoop needs a connector. Almost every DB vendor makes this
connector available as a JDBC driver specific to that DB, so Sqoop needs the JDBC driver of each database it
needs to interact with.
Is JDBC driver enough to connect sqoop to the databases?
No. Sqoop needs both JDBC and connector to connect to a database.
When to use --target-dir and when to use --warehouse-dir while importing data?
To specify a particular directory in HDFS, use --target-dir; to specify the parent directory of all the sqoop
jobs, use --warehouse-dir. In the latter case, sqoop will create a directory under the parent directory with the
same name as the table.
How can you import only a subset of rows from a table?
By using the WHERE clause in the sqoop import statement we can import only a subset of rows.
How can we import a subset of rows from a table without using the where clause?
We can run a filtering query on the database and save the result to a temporary table in database. Then use the sqoop
import command without using the --where clause
What is the advantage of using --password-file rather than -P option while preventing the display of password
in the sqoop import statement?
The --password-file option can be used inside a sqoop script while the -P option reads from standard input,
preventing automation.
What is the default extension of the files produced from a sqoop import using the --compress parameter?
.gz
What is the significance of the --compression-codec parameter?
To get the output file of a sqoop import in formats other than .gz, such as .bz2, we use the --compression-codec
parameter.
What is a disadvantage of using --direct parameter for faster data load by sqoop?
The native utilities used by databases to support faster load do not work for binary data formats like sequencefile
How can you control the number of mappers used by the sqoop command?
The parameter --num-mappers controls the number of mappers executed by a sqoop command. We should start
with a small number of map tasks and gradually scale up, as choosing a high number of mappers initially may
slow down performance on the database side.
How can you avoid importing tables one-by-one when importing a large number of tables from a database?
Using the command
sqoop import-all-tables --connect --username --password --exclude-tables table1,table2 ..
This will import all the tables except the ones mentioned in the exclude-tables clause.
What is the usefulness of the options file in sqoop?
The options file is used in sqoop to specify the command line values in a file and use it in the sqoop commands.
For example, the --connect parameter's value and the --username value can be stored in a file and reused
with different sqoop commands.
Is it possible to add a parameter while running a saved job?
Yes, we can add an argument to a saved job at runtime by using the --exec option
sqoop job --exec jobname -- --newparameter
How do you fetch data which is the result of join between two tables?
By using the --query parameter in place of --table parameter we can specify a sql query. The result of the query will
be imported
How can we slice the data to be imported to multiple parallel tasks?
Using the --split-by parameter we specify the column name based on which sqoop will divide the data to be
imported into multiple chunks to be run in parallel.
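The chunking that --split-by drives can be sketched in Python: Sqoop retrieves the min and max of the split column and divides that range into one contiguous slice per mapper, each slice becoming a WHERE condition. A simplified sketch (real Sqoop handles types and edge cases this toy version does not):

```python
# Illustrative sketch: dividing the [min, max] range of the split column
# into one contiguous chunk per mapper, as --split-by does for integers.

def make_splits(lo, hi, num_mappers):
    """Return (start, end) ranges covering [lo, hi], one per mapper."""
    step = (hi - lo + 1) / num_mappers
    splits = []
    for i in range(num_mappers):
        start = lo + round(i * step)
        end = lo + round((i + 1) * step) - 1
        splits.append((start, end))
    return splits

# e.g. ids 1..100 with 4 mappers -> 4 WHERE clauses like "id >= 1 AND id <= 25"
print(make_splits(1, 100, 4))  # [(1, 25), (26, 50), (51, 75), (76, 100)]
```

Each range becomes an independent query run by one map task, which is why a well-distributed split column matters: skewed values produce unbalanced chunks.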
Before starting the data transfer using a MapReduce job, sqoop takes a long time to retrieve the minimum and
maximum values of the column mentioned in the --split-by parameter. How can we make this efficient?
We can use the --boundary-query parameter, in which we specify the min and max values for the column on
which the split into multiple MapReduce tasks will happen. This is faster because the query inside the
--boundary-query parameter is executed first, and the job knows how many MapReduce tasks to create before
executing the main query.
How will you implement all-or-nothing load using sqoop?
Using the staging-table option we first load the data into a staging table and then load it to the final target table only
if the staging load is successful.
How do you clear the data in a staging table before loading it by Sqoop?
By specifying the --clear-staging-table option we can clear the staging table before it is loaded. This can be
done repeatedly until we get proper data in staging.
How can we load to a column in a relational table which is not null but the incoming value from HDFS has a
null value?
By using the --input-null-string parameter we can specify a default value, which allows the row to be inserted
into the target table.
Sqoop imported a table successfully to hbase but it is found that the number of rows is fewer than expected.
What can be the cause?
Some of the imported records might have null values in all the columns. As HBase does not allow a row with all
null values, those rows get dropped.
Give a sqoop command to show all the databases in a mysql server.
$ sqoop list-databases --connect jdbc:mysql://database.example.com/
In a sqoop import command you have mentioned to run 8 parallel Mapreduce task but sqoop runs only 4.
What can be the reason?
The MapReduce cluster is configured to run 4 parallel tasks, so the sqoop command must use a number of parallel
tasks less than or equal to what the MapReduce cluster allows.
What is the importance of --split-by clause in running parallel import tasks in sqoop?
The --split-by clause mentions the column name based on whose values the data will be divided into groups of
records. These groups of records are read in parallel by the MapReduce tasks.
What does this sqoop command achieve?
$ sqoop import --connect <connect-str> --table foo --target-dir /dest \
What happens when a table is imported into a HDFS directory which already exists using the –append
parameter?
Using the --append argument, Sqoop will import data to a temporary directory and then rename the files into the
normal target directory in a manner that does not conflict with existing filenames in that directory.
How can you control the mapping between SQL data types and Java types?
By using the --map-column-java property we can configure the mapping between SQL data types and Java types.
Below is an example:
$ sqoop import ... --map-column-java id=String,value=Integer
How to import only the updated rows from a table into HDFS using sqoop, assuming the source has last
update timestamp details for each row?
By using the lastmodified mode. Rows where the check column holds a timestamp more recent than the timestamp
specified with --last-value are imported.
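A sketch of such an incremental import (table, column and timestamp values are placeholder assumptions; the command is printed, not executed):

```shell
# Hypothetical incremental import in lastmodified mode; only rows whose
# last_update_ts is newer than --last-value are fetched. Names are placeholders.
SQOOP_INCR="sqoop import \
  --connect jdbc:mysql://database.example.com/sales \
  --table orders \
  --incremental lastmodified \
  --check-column last_update_ts \
  --last-value '2016-01-01 00:00:00' \
  --target-dir /data/orders"
echo "$SQOOP_INCR"
```

After a run, Sqoop prints the value to use as --last-value for the next incremental import.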
What are the two file formats supported by sqoop for import?
Delimited text and Sequence Files.
What is a sqoop metastore?
It is a tool using which Sqoop hosts a shared metadata repository. Multiple users and/or remote users can define and
execute saved jobs (created with sqoop job) defined in this metastore.
Clients must be configured to connect to the metastore in sqoop-site.xml or with the --meta-connect argument.
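For example, a saved job stored in a shared metastore might be defined like this (host, port and job name are placeholder assumptions; printed, not executed):

```shell
# Hypothetical saved job kept in a shared sqoop metastore. Everything after
# the bare -- is an ordinary import command that the job will run.
SQOOP_JOB="sqoop job \
  --meta-connect jdbc:hsqldb:hsql://metastore.example.com:16000/sqoop \
  --create daily_orders \
  -- import --connect jdbc:mysql://database.example.com/sales --table orders"
echo "$SQOOP_JOB"
```

Other users pointing at the same --meta-connect URL could then run the job with `sqoop job --exec daily_orders`.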
What is the purpose of sqoop-merge?
The merge tool combines two datasets where entries in one dataset should overwrite entries of an older dataset
preserving only the newest version of the records between both the data sets.
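A sketch of a merge run (dataset paths, jar, class and key column are placeholder assumptions; printed, not executed):

```shell
# Hypothetical merge of a fresh incremental import over an older dataset,
# keeping the newest record per primary key id; all names are placeholders.
SQOOP_MERGE="sqoop merge \
  --new-data /data/orders_new \
  --onto /data/orders_old \
  --target-dir /data/orders_merged \
  --jar-file orders.jar \
  --class-name orders \
  --merge-key id"
echo "$SQOOP_MERGE"
```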
Give the sqoop command to see the content of the job named myjob?
$ sqoop job --show myjob
Which database does the sqoop metastore run on?
Running sqoop-metastore launches a shared HSQLDB database instance on the current machine.
Where can the metastore database be hosted?
The metastore database can be hosted anywhere within or outside of the Hadoop cluster.
Sqoop Use Case-
Online marketer Coupons.com uses the Sqoop component of the Hadoop ecosystem to transmit data
between Hadoop and the IBM Netezza data warehouse, and pipes the results back into Hadoop using Sqoop.
FLUME:
What problem does Apache Flume solve?
Scenario:
Several services running on different servers produce large numbers of logs. These logs need to be
accumulated, stored and analyzed together.
Hadoop has emerged as a cost effective and scalable framework for the storage and analysis of big data.
Problem:
How can these logs be collected, aggregated and stored to a place where Hadoop can process them?
Now there is a requirement for a reliable, scalable, extensible and manageable solution.
What is Apache Flume?
Apache Flume is a distributed, reliable and available system for efficiently collecting, aggregating and moving
large amounts of log data from many different sources to a centralized data store.
The use of Apache Flume is not only restricted to log data aggregation. Since data sources are customizable, Flume
can be used to transport massive quantities of event data including but not limited to network traffic data, social-
media-generated data, email messages and pretty much any data source possible.
Flume features:
Ensures guaranteed data delivery
Gathers high volume data streams in real time
Streams data from multiple sources into Hadoop for analysis
Scales horizontally
How is Flume-NG different from Flume 0.9?
Flume 0.9:
Centralized configuration of the agents handled by Zookeeper.
Input data and writing data are handled by same thread.
Flume 1.X (Flume-NG):
No centralized configuration. Instead a simple on-disk configuration file is used.
Different threads called runners handle input data and writing data.
What is the problem with HDFS and streaming data (like logs)?
In a regular filesystem when you open a file and write data, it exists on disk even before it is closed.
Whereas in HDFS, the file exists only as a directory entry of zero length till it is closed. This implies that if data is
written to a file for an extended period without closing it, you may be left with an empty file if there is a network
disconnect with the client.
It is not a good approach to close the files frequently and create smaller files as this leads to poor efficiency in
HDFS.
What are core components of Flume?
Flume architecture:
Event- The single log entry or unit of data that is transported.
Client- The component that generates events and transmits them to the source it operates with.
Flume Agent:
An agent is a daemon (physical Java virtual machine) running Flume.
It receives and stores the data until it is written to a next destination.
Flume source, channel and sink run in an agent.
Source:
A source receives data from some application that is producing data.
A source writes events to one or more channels.
Sources either poll for data or wait for data to be delivered to them.
For Example: log4j, Avro, syslog, etc.
Sink:
A sink removes events from the agent and delivers them to the destination.
The destination could be a different agent, or HDFS, hbase, Solr etc.
For Example: Console, HDFS, hbase, etc.
Channel:
A channel holds events passing from a source to a sink.
A source ingests events into the channel while a sink removes them.
A sink gets events from one channel only.
For Example: Memory, File, JDBC etc.
Explain a common use case for Flume?
Common Use case: Receiving web logs from several sources into HDFS.
Web server logs → Apache Flume → HDFS (Storage) → Pig/Hive (ETL) → hbase (Database) → Reporting (BI
Tools)
Logs are generated by several log servers and saved in local hard disks, which need to be pushed into HDFS using
Flume framework.
Flume agents running on the log servers collect the logs and push them into HDFS.
Data analytics tools like Pig or Hive then process this data.
The analysed data is stored in structured format in hbase or other database.
Business intelligence tools will then generate reports on this data.
What are Flume events?
Flume events:
Basic payload of data transported by Flume (typically a single log entry)
It has zero or more headers and a body
Event Headers are key-value pairs that are used to make routing decisions or carry other structured information like:
Timestamp of the event
Hostname of the server where event has originated
Event Body
Event Body is an array of bytes that contains the actual payload.
Can we change the body of the flume event?
Yes, interceptors can edit a Flume event and change its body.
What are interceptors?
Interceptor
An interceptor is a point in your data flow where you can inspect and alter flume events. After the source creates an
event, there can be zero or more interceptors chained together before the event is delivered to the sink.
What are channel selectors?
Channel selectors:
Channel selectors are used to handle multiple channels. A selector is responsible for how an event moves from a source to one
or more channels.
Types of channel selectors are:
Replicating Channel Selector: This is the default channel selector; it puts a copy of the event into each channel
Multiplexing Channel Selector: routes different events to different channels depending on header information
and/or interceptor logic
What are sink processors?
Sink processor is a mechanism for failover and load balancing events across multiple sinks from a channel
How to Configure an Agent?
An agent is configured using a simple Java property file of key/value pairs
This configuration file is passed as an argument to the agent upon startup.
You can configure multiple agents in a single configuration file. It is required to pass an agent identifier
(called a name).
Each agent is configured starting with:
<agent-name>.sources=<list of sources>
<agent-name>.channels=<list of channels>
<agent-name>.sinks=<list of sinks>
Each source, channel and sink also has a distinct name within the context of that agent.
Explain the Hello world example in flume.
In the following example, the source listens on a socket for network clients to connect and send event data. Those
events are written to an in-memory channel and then fed to a logger sink to become output.
Configuration file for one agent (called a1) that has a source named s1, a channel named c1 and a sink named k1.
# Name of the components on this agent
a1.sources=s1
a1.channels=c1
a1.sinks=k1
# Configure the source, channel and sink
a1.sources.s1.type=netcat
a1.sources.s1.bind=0.0.0.0
a1.sources.s1.port=12345
a1.channels.c1.type=memory
a1.sinks.k1.type=logger
a1.sources.s1.channels=c1
a1.sinks.k1.channel=c1
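Assuming the configuration above is saved to a file (example.conf here, an assumed name), the agent could be launched with a command along these lines (printed, not executed):

```shell
# Hypothetical launch of agent a1; paths are placeholders. --conf points at
# the directory holding flume-env.sh, --conf-file at the agent properties.
FLUME_CMD="flume-ng agent \
  --conf ./conf \
  --conf-file example.conf \
  --name a1 \
  -Dflume.root.logger=INFO,console"
echo "$FLUME_CMD"
```

Sending text to the netcat source's port (for example with nc) should then show each line echoed by the logger sink on the console.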
Does Flume provide 100% reliability to the data flow?
Yes, Apache Flume provides end to end reliability because of its transactional approach in data flow.
Explain about the different channel types in Flume. Which channel type is faster?
The three built-in channel types available in Flume are-
MEMORY Channel – Events are read from the source into memory and passed to the sink.
JDBC Channel – JDBC Channel stores the events in an embedded Derby database.
FILE Channel –File Channel writes the contents to a file on the file system after reading the event from a source.
The file is deleted only after the contents are successfully delivered to the sink.
MEMORY Channel is the fastest channel among the three but carries the risk of data loss. The channel you
choose depends entirely on the nature of the big data application and the value of each event.
Which is the reliable channel in Flume to ensure that there is no data loss?
FILE Channel is the most reliable channel among the 3 channels JDBC, FILE and MEMORY.
How multi-hop agent can be setup in Flume?
Avro RPC Bridge mechanism is used to setup Multi-hop agent in Apache Flume.
Does Apache Flume provide support for third party plug-ins?
Yes. Flume has a plug-in based architecture; through third party plug-ins it can load data from external sources
and transfer it to external destinations.
Is it possible to leverage real time analysis on the big data collected by Flume directly? If yes, then explain.
Data from Flume can be extracted, transformed and loaded in real-time to Apache Solr servers using
morphlinesolrsink.
How can Flume be used with hbase?
Apache Flume can be used with HBase using one of the two HBase sinks –
HBaseSink (org.apache.flume.sink.hbase.HBaseSink) supports secure HBase clusters and also the new HBase IPC that
was introduced in HBase 0.96.
AsyncHBaseSink (org.apache.flume.sink.hbase.AsyncHBaseSink) has better performance than HBaseSink as it can
easily make non-blocking calls to HBase.
Working of the HBaseSink –
In HBaseSink, a Flume event is converted into HBase Increments or Puts. The serializer implements
HBaseEventSerializer, which is instantiated when the sink starts. For every event, the sink calls the initialize method
in the serializer, which then translates the Flume event into HBase Increments and Puts to be sent to the HBase cluster.
Working of the AsyncHBaseSink-
AsyncHBaseSink implements the AsyncHBaseEventSerializer. The initialize method is called only once by the sink
when it starts. The sink invokes the setEvent method and then calls the getIncrements and getActions methods,
similar to HBaseSink. When the sink stops, the cleanup method is called by the serializer.
Differentiate between HDFS File Sink and File Roll Sink
The major difference between HDFS filesink and filerollsink is that HDFS File Sink writes the events into the
Hadoop Distributed File System (HDFS) whereas File Roll Sink stores the events into the local file system.
Flume Use Case –
The Twitter source connects through the streaming API and continuously downloads the tweets (called events). These
tweets are converted into JSON format and sent to the downstream Flume sinks for further analysis of tweets and
retweets to engage users on Twitter.
Nosql Databases
Nosql databases are a widely adopted technology due to their schema-less design and their ability to scale vertically
and horizontally simply and with much less effort. As data grows, RDBMS performance degrades, cost increases, and
manageability decreases; nosql provides an edge over RDBMS in these scenarios.
Nosql usually provides either consistency or availability (availability of nodes for processing), depending upon the
architecture and design.
HBASE
What is the difference between HBase and Hive?
Hbase vs Hive:
Hbase does not allow execution of SQL queries, whereas Hive allows execution of most SQL queries.
Hbase is schema-less, whereas Hive has a fixed schema.
Hbase runs on top of HDFS, whereas Hive runs on top of Hadoop mapreduce.
Hbase is a nosql column database, whereas Hive is a datawarehouse framework.
Hbase is ideal for real time querying of big data, whereas Hive is an ideal choice for analytical querying of data
collected over a period of time.
Hbase supports 4 primary operations (put, get, scan and delete), whereas Hive helps SQL savvy people run
mapreduce jobs.
Compare HDFS & hbase
Criteria: HDFS vs Hbase
Data write process: HDFS uses an append method; Hbase supports bulk incremental and random writes.
Data read process: HDFS uses table scans; Hbase supports table scan, random read and small range scan.
Hive SQL querying: excellent on HDFS; average on Hbase.
Record lookup in a file: HDFS doesn't provide it; Hbase provides it for large tables.
Latency: HDFS has high latency operations; Hbase has low latency operations.
Usage: HDFS is for storage only; Hbase supports both storage and processing.
Access pattern: HDFS is write once, read many times; Hbase supports random reads and writes.
Access method: HDFS is primarily accessed through MR (Map Reduce) jobs; Hbase is accessed through shell
commands, a client API in Java, REST, Avro or Thrift.
Mention the differences between hbase and Relational Databases or RDBMS?
Hbase vs RDBMS:
Hbase is schema-less; an RDBMS is a schema based database.
Hbase is a column-oriented data store; an RDBMS is a row-oriented data store.
Hbase is used to store de-normalized data; an RDBMS is used to store normalized data.
Hbase contains sparsely populated tables; an RDBMS contains thin tables.
Automated partitioning is done in Hbase; an RDBMS has no such provision or built-in support for partitioning.
Hbase is well suited for OLAP systems; an RDBMS is well suited for OLTP systems.
Hbase reads only the relevant data from the database; an RDBMS retrieves one row at a time and hence could read
unnecessary data if only some of the data in a row is required.
Structured and semi-structured data can be stored and processed using Hbase; an RDBMS stores and processes
structured data.
Hbase enables aggregation over many rows and columns; in an RDBMS aggregation is an expensive operation.
Explain what is Hbase?
Hbase is a column-oriented database management system which runs on top of HDFS. Hbase is not a relational data
store, and it does not support a structured query language like SQL. In Hbase, a master node regulates the cluster, and
region servers store portions of the tables and perform the work on the data.
Explain why to use Hbase?
High capacity storage system
Distributed design to cater large tables
Column-Oriented Stores
Horizontally Scalable
High performance & Availability
The base goal of Hbase is millions of columns, thousands of versions and billions of rows
Unlike HDFS (Hadoop Distribute File System), it supports random real time CRUD operations
What are key terms are used for designing of hbase datamodel?
Table: an Hbase table consists of rows
Row: a row in Hbase contains a row key and one or more columns (column families with associated values)
Column family: a set of columns and their values; the column families should be considered carefully during
schema design
Column: a column in Hbase consists of a column family and a column qualifier, which are delimited by a : (colon)
character
Column qualifier: a column qualifier is added to a column family to provide the index for a given piece of data
Cell: a cell is a combination of row, column family, and column qualifier, and contains a value and a timestamp,
which represents the value's version
Namespace: a logical grouping of tables
Timestamp: represents the time on the regionserver when the data was written; you can specify a different
timestamp value when you put data into a cell
Mention what are the key components of Hbase?
Zookeeper: It does the co-ordination work between client and Hbase Master
HMaster: Hbase Master monitors the Region Server
HRegionserver: regionserver monitors the Region
HRegion: It contains in memory data store(memstore) and Hfile.
Catalog Tables: Catalog tables consist of ROOT and META. ROOT table tracks where the META table is and
META table stores all the regions in the system.
HMaster:
HMaster is the implementation of Master server in HBase architecture. It acts like monitoring agent to monitor all
Region Server instances present in the cluster and acts as an interface for all the metadata changes. In a distributed
cluster environment, Master runs on NameNode. Master runs several background threads.
The following are important roles performed by HMaster in HBase.
Plays a vital role in terms of performance and in maintaining the nodes in the cluster.
HMaster provides admin functions and distributes services to the different region servers.
HMaster assigns regions to region servers.
HMaster controls load balancing and failover to handle the load over the nodes present
in the cluster.
When a client wants to change a schema or perform any metadata operation, HMaster takes
responsibility for these operations.
Some of the methods exposed by HMaster Interface are primarily Metadata oriented methods.
Table (createTable, removeTable, enable, disable)
ColumnFamily (add Column, modify Column)
Region (move, assign)
The client communicates in a bi-directional way with both HMaster and ZooKeeper. For read and write operations,
it directly contacts the HRegion servers. HMaster assigns regions to region servers and in turn checks the health
status of the region servers.
In the entire architecture, we have multiple region servers. An HLog, present on each region server, stores all the
log files.
HRegions Servers:
When a Region Server receives write and read requests from the client, it assigns the request to the specific region
where the actual column family resides. The client can contact HRegion servers directly; mandatory HMaster
permission is not needed for the client to communicate with HRegion servers. The client requires
HMaster's help only when operations related to metadata and schema changes are required.
HRegionServer is the Region Server implementation. It is responsible for serving and managing regions or data that
is present in distributed cluster. The region servers run on Data Nodes present in the Hadoop cluster.
HMaster can get into contact with multiple HRegion servers and performs the following functions.
Hosting and managing regions
Splitting regions automatically
Handling read and writes requests
Communicating with the client directly
HRegions:
HRegions are the basic building elements of HBase cluster that consists of the distribution of tables and are
comprised of Column families. It contains multiple stores, one for each column family. It consists of mainly two
components, which are Memstore and Hfile.
Data flow in Hbase
Write and Read operations
The Read and Write operations from Client into Hfile can be shown in below diagram.
Step 1) The client wants to write data; it first communicates with the region server and then the region
Step 2) The region contacts the memstore associated with the column family for storing the data
Step 3) The data is first stored in the memstore, where it is sorted, and is later flushed into an HFile. The main reason
for using the memstore is the need to store data on the distributed file system sorted by row key. The memstore sits in
the region server's main memory while HFiles are written into HDFS.
Step 4) The client wants to read data from the regions
Step 5) The client can access the memstore first and request the data from it.
Step 6) The client then approaches the HFiles to get the data. The data is fetched and retrieved by the client.
Memstore holds in-memory modifications to the store. The hierarchy of objects in HBase Regions is as shown from
top to bottom in below table.
Table: the HBase table present in the HBase cluster
Region: the HRegions for the presented tables
Store: one store per ColumnFamily for each region of the table
Memstore: the memstore for each store for each region of the table; it sorts data before flushing into HFiles, which
increases write and read performance
StoreFile: the StoreFiles for each store for each region of the table
Block: the blocks present inside StoreFiles
ZooKeeper:
In Hbase, Zookeeper is a centralized monitoring server which maintains configuration information and provides
distributed synchronization. Distributed synchronization is to access the distributed applications running across the
cluster, with the responsibility of providing coordination services between nodes. If a client wants to communicate
with region servers, the client has to approach ZooKeeper first.
It is an open source project, and it provides so many important services.
Services provided by ZooKeeper
Maintains Configuration information
Provides distributed synchronization
Client Communication establishment with region servers
Provides ephemeral nodes, which represent different region servers
Master servers use ephemeral nodes to discover available servers in the cluster
Tracks server failures and network partitions
The Master and HBase slave nodes (region servers) register themselves with ZooKeeper. The client needs access to the
ZK (ZooKeeper) quorum configuration to connect with the master and region servers.
During a failure of nodes present in the HBase cluster, the ZK quorum will trigger error messages and start to repair
the failed nodes.
When you should use Hbase?
Data size is huge: when you have millions or billions of records to operate on
Complete redesign: when you are moving from an RDBMS to Hbase, consider it a complete re-design rather than a
mere port. An RDBMS runs on a single database server, but Hbase is distributed, scalable and runs on commodity
hardware.
SQL-less commands: you must be able to live without features like transactions, inner joins, typed columns, etc.
Infrastructure investment: you need to have a large enough cluster for Hbase to be really useful
What are the different operational commands in hbase at record level and table level?
Record Level Operational Commands in hbase are –put, get, increment, scan and delete.
Table Level Operational Commands in hbase are-describe, list, drop, disable and scan.
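A sketch of these operations in the hbase shell (table, column family and values are assumptions; the session is shown as text rather than executed here):

```shell
# Illustrative hbase shell session exercising the record- and table-level
# commands; 'users' and its column family 'cf' are made-up names.
HBASE_SHELL="create 'users', 'cf'
put 'users', 'row1', 'cf:name', 'Alice'
get 'users', 'row1'
incr 'users', 'row1', 'cf:visits', 1
scan 'users'
delete 'users', 'row1', 'cf:name'
disable 'users'
drop 'users'"
echo "$HBASE_SHELL"
```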
What is column families? Explain what happens if you alter the block size of a column family on an already
occupied database?
The logical division of data is represented through a key known as a column family. Column families constitute the
basic unit of physical storage on which compression features can be applied. When you alter the block size of the
column family, the new data occupies the new block size while the old data remains within the old block size.
During data compaction, old data will take the new block size. New files as they are flushed, have a new block size
whereas existing data will continue to be read correctly. All data should be transformed to the new block size, after
the next major compaction.
Explain how does Hbase actually delete a row?
In Hbase, whatever you write will be stored from RAM to disk, these disk writes are immutable barring compaction.
During deletion process in Hbase, major compaction process delete marker while minor compactions don’t. In
normal deletes, it results in a delete tombstone marker- these delete data they represent are removed during
compaction.
Also, if you delete data and add more data, but with an earlier timestamp than the tombstone timestamp,
further Gets may be masked by the delete/tombstone marker and hence you will not receive the inserted value until
after the major compaction.
Explain what is the row key?
Every row in an hbase table has a unique identifier known as rowkey(primary key). Row key is defined by the
application. As the combined key is pre-fixed by the rowkey, it enables the application to define the desired sort
order. It also allows logical grouping of cells and make sure that all cells with the same rowkey are co-located on the
same server. Rowkey is internally regarded as a byte array.
Explain row deletion in Hbase? Mention what are the three types of tombstone markers in Hbase for
deletion?
When you delete the cell in Hbase, the data is not actually deleted but a tombstone marker is set, making the deleted
cells invisible. Hbase deleted are actually removed during compactions.
Three types of tombstone markers are there:
Version delete marker: marks a single version of a column for deletion
Column delete marker: marks all the versions of a column for deletion
Family delete marker: marks all the columns of a column family for deletion
Explain about hlog and WAL in hbase.
All edits in the HStore are stored in the HLog. Every region server has one HLog, which contains entries for the edits
of all regions served by that region server. WAL stands for Write Ahead Log, to which all the HLog edits are written
immediately. WAL edits remain in memory till the flush period in the case of deferred log flush.
What are the main features of Apache hbase?
Apache hbase has many features. It supports both linear and modular scaling; hbase tables are distributed on the
cluster via regions, and regions are automatically split and re-distributed as your data grows (automatic sharding).
hbase supports a Block Cache and Bloom Filters for high volume query optimization.
What are datamodel operations in hbase or Mention how many operational commands in Hbase?
Get: returns the attributes for a specified row; Gets are executed via HTable.get.
Put: either adds new rows to a table (if the key is new) or updates existing rows (if the key already exists);
Puts are executed via HTable.put (writeBuffer) or HTable.batch (non-writeBuffer).
Scan: allows iteration over multiple rows for specified attributes.
Delete: removes a row from a table; Deletes are executed via HTable.delete.
Increment: atomically increments a counter column.
hbase does not modify data in place, and so deletes are handled by creating new markers called tombstones. These
tombstones, along with the dead values, are cleaned up on major compaction.
How are filters useful in Apache hbase?
The Filter Language was introduced in Apache hbase 0.92. It allows you to perform server-side
filtering when accessing hbase over Thrift or in the hbase shell.
How many filters are available in Apache hbase?
In total, 18 filters are supported by hbase. They are:
ColumnPrefixFilter, TimestampsFilter, PageFilter, MultipleColumnPrefixFilter, FamilyFilter, ColumnPaginationFilter,
SingleColumnValueFilter, RowFilter, QualifierFilter, ColumnRangeFilter, ValueFilter, PrefixFilter,
ColumnCountGetFilter, SingleColumnValueExcludeFilter, InclusiveStopFilter, DependentColumnFilter,
FirstKeyOnlyFilter, KeyOnlyFilter
How can we use mapreduce with hbase?
Apache mapreduce is a software framework used to analyze large amounts of data, and is the framework used most
often with Apache Hadoop. Hbase can be used as a data source (TableInputFormat) and as a data sink
(TableOutputFormat or MultiTableOutputFormat) for mapreduce jobs. When writing mapreduce jobs that read or
write hbase, it is advisable to subclass TableMapper and/or TableReducer.
How do we back up my hbase cluster?
There are two broad strategies for performing hbase backups: backing up with a full cluster shutdown, and backing
up on a live cluster. Each approach has pros and cons.
1)Full Shutdown Backup
Some environments can tolerate a periodic full shutdown of their hbase cluster, for example if it is being used as a
back-end analytic capacity and not for serving front-end web pages. The benefit is that the NameNode, Master and
regionservers are down, so there is no chance of missing any in-flight changes to either storefiles or metadata. The
obvious con is that the cluster is down.
2)Live Cluster Backup
Live cluster backup - CopyTable: the CopyTable utility can be used either to copy data from one table to another on the
same cluster, or to copy data to a table on another cluster.
Live cluster backup - Export: the Export approach dumps the content of a table to HDFS on the same cluster.
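As a sketch, the two live-cluster approaches map to commands like the following (table names, output paths and the peer ZooKeeper address are placeholder assumptions; printed, not executed):

```shell
# Hypothetical CopyTable run copying mytable to a peer cluster.
COPYTABLE_CMD="hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
  --peer.adr=backupcluster:2181:/hbase mytable"
# Hypothetical Export run dumping mytable to an HDFS directory.
EXPORT_CMD="hbase org.apache.hadoop.hbase.mapreduce.Export mytable /backups/mytable"
echo "$COPYTABLE_CMD"
echo "$EXPORT_CMD"
```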
Does hbase support SQL?
Not really. SQL-ish support for hbase via Hive is in development; however, Hive is based on mapreduce, which is not
generally suitable for low-latency requests. By using Apache Phoenix you can retrieve data from hbase with SQL
queries.
What is bloommapfile used for?
The bloommapfile is a class that extends mapfile. So its functionality is similar to mapfile. Bloommapfile uses
dynamic Bloom filters to provide quick membership test for the keys. It is used in Hbase table format.
How can I troubleshoot my HBase cluster?
Always start with the master log. Normally it's just printing the same lines over and over
again. If not, then there's an issue. Google or search-hadoop.com should return some hits for the exceptions
you're seeing.
An error rarely comes alone in Apache HBase; usually when something gets screwed up, what follows may be
hundreds of exceptions and stack traces coming from all over the place. The best way to approach this type of
problem is to walk the log up to where it all began. For example, one trick with RegionServers is that they will print
some metrics when aborting, so grepping for "Dump" should get you around the start of the problem.
RegionServer suicides are ‘normal’, as this is what they do when something goes wrong. For example, if ulimit and
max transfer threads (the two most important initial settings, see [ulimit] and dfs.datanode.max.transfer.threads )
aren't changed, it will at some point become impossible for DataNodes to create new threads, which from the HBase
point of view looks as if HDFS was gone. Think about what would happen if your MySQL database was suddenly
unable to access files on your local file system; well, it's the same with HBase and HDFS. Another very common
reason to see RegionServers committing seppuku is when they enter prolonged garbage collection pauses that last
longer than the default ZooKeeper session timeout. For more information on GC pauses, see the 3 part blog post by
Todd Lipcon and Long GC pauses above.
It’s easier to understand the data model as a multidimensional map. The first row from the table in Figure 1 has been
represented as a multidimensional map in figure 2
CASSANDRA
Explain what is Cassandra?
Cassandra is an open source data storage system developed at Facebook for inbox search and designed for storing
and managing large amounts of data across commodity servers without any failure. It can serve both as a
real time data store for online applications
and as a read intensive database for business intelligence systems
Cassandra is perfect for managing large amounts of structured, semi-structured, and unstructured data across
multiple data centers and the cloud. Cassandra delivers continuous availability, linear scalability, and operational
simplicity across many commodity servers with no single point of failure, along with a powerful dynamic data
model designed for maximum flexibility and fast response times.
Why Cassandra? Why not any other nosql database like Hbase?
Apache Cassandra is an open source, free to use, distributed, decentralized, elastically and linearly scalable, highly
available, fault-tolerant, tune-ably consistent, column-oriented database that bases its distribution design on
Amazon’s Dynamo and its data model on Google’s Bigtable.
Our use case was write intensive. Since Cassandra provides availability and partition tolerance with tunable
consistency, which our use case required, we preferred Cassandra. Hbase is good for low-latency read/write use cases.
Explain Cassandra Data Model.
– The Cassandra data model has 4 main concepts which are cluster, keyspace, column, column family.
– Clusters contain many nodes (machines) and can contain multiple keyspaces.
– A keyspace is a namespace to group multiple column families, typically one per application.
– A column contains a name, value and timestamp.
– A column family contains multiple columns referenced by a row keys.
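In CQL terms the same concepts can be sketched as follows (keyspace and table names are assumptions; a CQL table corresponds to a column family, and the statements are shown as text rather than executed against a cluster):

```shell
# Illustrative CQL mapping of the Cassandra data model concepts above.
CQL_SKETCH="CREATE KEYSPACE app
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
CREATE TABLE app.users (
  user_id text PRIMARY KEY,   -- the row key
  name text,                  -- each column has a name, value and timestamp
  email text
);"
echo "$CQL_SKETCH"
```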
What platforms does Cassandra run on?
Cassandra is a Java application, meaning that a compiled binary distribution of Cassandra can run on any platform
that has a Java Runtime Environment (JRE), also referred to as a Java Virtual Machine (JVM). DataStax strongly
recommends using the Oracle/Sun Java Runtime Environment (JRE), version 1.6.0_19 or later, for optimal
performance. Packaged releases are provided for Red Hat, CentOS, Debian and Ubuntu Linux platforms.
What management tools exist for Cassandra?
DataStax supplies both a free and a commercial version of OpsCenter, which is a visual, browser-based management
tool for Cassandra. With OpsCenter, a user can visually carry out many administrative tasks, monitor a cluster for
performance, and do much more. Downloads of OpsCenter are available on the DataStax website.
A number of command line tools also ship with Cassandra for querying/writing to the database, performing
administration functions, etc.
Cassandra also exposes a number of statistics and management operations via Java Management Extensions (JMX).
Java Management Extensions (JMX) is a Java technology that supplies tools for managing and monitoring Java
applications and services. Any statistic or operation that a Java application has exposed as an MBean can then be
monitored or manipulated using JMX.
During normal operation, Cassandra outputs information and statistics that you can monitor using JMX-compliant
tools such as JConsole, the Cassandra nodetool utility, or the DataStax OpsCenter centralized management console.
With the same tools, you can perform administrative commands and operations such as flushing caches or
running a repair.
Explain CAP theorem.
The CAP theorem (also called Brewer's theorem after its author, Eric Brewer) states that within a large-scale
distributed data system, there are three requirements that have a relationship of sliding dependency: Consistency,
Availability, and Partition Tolerance.
The CAP theorem states that in any given system, you can strongly support only two of these three. Cassandra falls
into the AP (Availability and Partition Tolerance) bucket of the CAP theorem, trading strong consistency for
tunable, eventual consistency.
What do you understand by Elastic Scalability?
Elastic scalability means that your cluster can seamlessly scale up and scale back down. Adding more servers to the
cluster improves its performance in a near-linear fashion without any manual intervention, and removing servers
works the same way.
Cassandra is said to be Tuneable Consistent. Why?
Consistency essentially means that a read always returns the most recently written value. Cassandra allows you to
easily decide the level of consistency you require, in balance with the level of availability. This is controlled by
parameters like replication factor and consistency level.
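The interplay of replication factor and consistency level can be sketched numerically. The quorum formula and the R + W > RF overlap rule below are the standard ones; the function names are invented for illustration:

```python
def quorum(replication_factor):
    """Quorum as Cassandra computes it: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1

def is_strongly_consistent(read_cl, write_cl, replication_factor):
    """A read is guaranteed to see the latest write whenever the
    read and write replica sets must overlap: R + W > RF."""
    return read_cl + write_cl > replication_factor

rf = 3
r = w = quorum(rf)   # QUORUM reads and writes: 2 of 3 replicas each
```

With RF = 3, QUORUM reads plus QUORUM writes (2 + 2 > 3) give strong consistency, while ONE reads plus ONE writes (1 + 1 = 2) do not.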
How does Cassandra achieve high availability and fault tolerance?
Cassandra is highly available. You can remove a few failed nodes from the cluster without losing any data and
without bringing the whole cluster down. Similarly, you can improve resilience and performance by replicating data
to multiple data centers.
What is cluster in Cassandra?
In Cassandra, the cluster is the outermost container for keyspaces. It arranges the nodes in a ring and assigns data
to them. Data is replicated across nodes, so a replica can take over if a node fails.
What are the differences between a node, a cluster, and datacenter in Cassandra?
Node: A node is a single machine running Cassandra.
Cluster: A cluster is a collection of nodes that together hold and serve the data.
Datacenter: A datacenter is a useful component when serving customers in different geographical areas. The nodes
of a cluster can be grouped into different data centers, which may be physical or virtual. Replication is configured
per data center, so depending on the replication factor, data can be written to multiple data centers. A data center
should never span physical locations, whereas a cluster contains one or more data centers and can span physical
locations.
Explain what is composite type in Cassandra?
In Cassandra, a composite type allows you to define a key or a column name as a concatenation of data of different types.
You can use two types of Composite Type
Row Key
Column Name
How does Cassandra store data?
Cassandra stores all data as bytes. When you specify a validator, Cassandra ensures that those bytes are encoded as
required, and a comparator then orders the columns based on the ordering specific to that encoding. Composites are
just byte arrays with a specific encoding: for each component, Cassandra stores a two-byte length, followed by
the byte-encoded component, followed by an end-of-component byte.
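The composite encoding described above can be sketched in Python (an illustrative approximation, not Cassandra's actual implementation; the function name is invented):

```python
import struct

def encode_composite(components):
    """Sketch of composite encoding: for each component, a two-byte
    big-endian length, the raw component bytes, then a single
    end-of-component byte (0)."""
    out = bytearray()
    for comp in components:
        data = comp.encode("utf-8")
        out += struct.pack(">H", len(data))  # two-byte length prefix
        out += data                          # component bytes
        out.append(0)                        # end-of-component byte
    return bytes(out)

encoded = encode_composite(["user", "2021"])
```

Because each component carries its own length, composites of different arity still compare component by component under byte ordering.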
How does Cassandra work?
Cassandra’s built-for-scale architecture means that it is capable of handling petabytes of information and thousands
of concurrent users/operations per second.
Mention what are the main components of Cassandra Data Model?
The main components of the Cassandra data model are:
Cluster
Keyspace
Column
Column family
List out the other components of Cassandra?
The other components of Cassandra are
Node
Data Center
Cluster
Commit log
Memtable
SSTable
Bloom filter
What are key spaces and column family in Cassandra?
In Cassandra, the logical division that associates similar data is called a column family. The basic Cassandra data
structures are: the column, which is a name/value pair (plus a client-supplied timestamp of when it was last updated),
and the column family, which is a container for rows that have similar, but not identical, column sets. Each row has
a unique identifier called a row key. A keyspace is the outermost container for data in Cassandra, corresponding
closely to a relational database.
Explain what is a keyspace in Cassandra?
In Cassandra, a keyspace is a namespace that determines data replication across nodes. A cluster contains one or
more keyspaces.
What is the syntax to create keyspace in Cassandra?
The syntax for creating a keyspace in Cassandra is:
CREATE KEYSPACE <identifier> WITH <properties>
For example: CREATE KEYSPACE demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
What is the difference between Column and supercolumn?
The value in a column is a byte array, while the value in a supercolumn is a map of columns with different data
types. Unlike columns, supercolumns do not contain the third component, the timestamp.
What is supercolumn in Cassandra?
In Cassandra, a supercolumn is a special column that groups a collection of related columns. Supercolumns are
effectively key-value pairs whose values are themselves maps of sub-columns.
Why are super columns in Cassandra no longer favoured?
Super columns suffer from a number of problems, not least of which is that it is necessary for Cassandra to
deserialize all of the sub-columns of a super column when querying (even if the result will only return a small
subset). As a result, there is a practical limit to the number of sub-columns per super column that can be stored
before performance suffers.
In theory, this could be fixed within Cassandra by properly indexing sub-columns, but consensus is that composite
columns are a better solution, and they work without the added complexity.
Mention what are the values stored in the Cassandra Column?
In a Cassandra column, there are basically three values:
Column Name
Value
Time Stamp
Mention when you can use Alter keyspace?
ALTER KEYSPACE can be used to change properties such as the number of replicas and the durable_writes setting
of a keyspace.
What do the shell commands CAPTURE and CONSISTENCY do?
There are various cqlsh shell commands in Cassandra. The CAPTURE command captures the output of a command
and appends it to a file, while the CONSISTENCY command displays the current consistency level or sets a new one.
What is mandatory while creating a table in Cassandra?
While creating a table, a primary key is mandatory; it is made up of one or more columns of the table.
Mention what needs to be taken care while adding a Column?
While adding a column, you need to make sure that:
the column name does not conflict with existing column names;
the table is not defined with the compact storage option.
Explain how Cassandra writes.
Cassandra writes first to a commit log on disk for durability, then commits to an in-memory structure called a
memtable. A write is successful once both commits are complete. Writes are batched in memory and written to disk
in a table structure called an SSTable (sorted string table). Memtables and SSTables are created per column family.
With this design Cassandra has minimal disk I/O and offers high-speed write performance, because the commit log
is append-only and Cassandra never seeks on writes. In the event of a fault while writing to the SSTable, Cassandra
can simply replay the commit log.
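The write path described above can be sketched in a few lines of Python (a toy illustration only; the class and field names are invented, and real SSTables are on-disk files, not lists):

```python
class TinyStore:
    """Toy sketch of Cassandra's write path: append to a commit log,
    apply to an in-memory memtable, and flush the memtable to an
    immutable, sorted 'SSTable' when it fills up."""
    def __init__(self, memtable_limit=2):
        self.commit_log = []      # durable, append-only log (simulated)
        self.memtable = {}        # in-memory writes, keyed by row key
        self.sstables = []        # immutable sorted tables (simulated)
        self.limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. log for durability
        self.memtable[key] = value            # 2. apply to memtable
        if len(self.memtable) >= self.limit:  # 3. flush when full
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

store = TinyStore()
store.write("b", 1)
store.write("a", 2)   # triggers a flush: the SSTable is sorted by key
```

If the process crashed before the flush, replaying `commit_log` would rebuild the memtable, which is exactly the recovery role the commit log plays.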
How does Cassandra perform write function?
Cassandra performs the write function by applying two commits:
o The first commit is applied to the commit log on disk, and the second commit to an in-memory structure known as the memtable.
o When both commits are applied successfully, the write is achieved.
o Writes are later flushed to disk in a table structure called an SSTable (sorted string table).
Explain how Cassandra writes data?
Cassandra writes data in three steps:
Commit log write
Memtable write
SSTable write
Cassandra first writes data to the commit log, then to an in-memory table structure called the memtable, and finally to an SSTable.
What is a commit log?
It is a crash-recovery mechanism. All data is written first to the commit log (file) for durability. After all its data has
been flushed to sstables, it can be archived, deleted, or recycled.
Explain what is Memtable in Cassandra?
Cassandra writes data to an in-memory structure known as the memtable.
It is an in-memory cache with content stored in key/column format.
Memtable data is sorted by key.
There is a separate memtable for each column family, and it serves column lookups by key.
It stores writes until it is full, and is then flushed to disk.
What is sstable? Is it similar to RDBMS table?
A sorted string table (SSTable) is an immutable data file to which Cassandra periodically writes memtables.
SSTables are append-only, stored on disk sequentially, and maintained per Cassandra table. An RDBMS table, by
contrast, is a mutable structure whose rows are read and updated in place.
What does an SSTable consist of?
An SSTable consists mainly of two files:
Index file (Bloom filter and key/offset pairs)
Data file (the actual column data)
What is a Bloom filter used for in Cassandra?
A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set.
Bloom filters are consulted on every read: they determine whether an SSTable might have data for a particular row.
In Cassandra they are used to save I/O when performing a key lookup.
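A minimal Bloom filter can be sketched with a few hash positions over a bit array (illustrative only; Cassandra's real filters use different hash functions and sizing). The key property is that it may report false positives but never false negatives:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: k hash positions per item over
    a fixed-size bit array."""
    def __init__(self, size=128, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive k deterministic positions from salted SHA-256 hashes.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True means "maybe present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("row-42")
```

On a read, a False answer lets Cassandra skip the SSTable entirely, which is where the I/O saving comes from.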
Explain how Cassandra writes changed data into commitlog?
Cassandra appends changed data to the commit log.
The commit log acts as a crash-recovery log for the data.
A write operation is never considered successful until the changed data has been appended to the commit log.
Data will not be lost once the commit log has been flushed to disk.
Explain how Cassandra delete Data?
SSTables are immutable, so a row cannot be removed from them in place. When a row needs to be deleted, Cassandra
marks the column value with a special marker called a tombstone. When the data is read, a tombstoned value is
treated as deleted.
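Tombstone-style deletion can be sketched as writing a marker instead of removing data (a toy illustration; names are invented and real tombstones carry timestamps used for later garbage collection during compaction):

```python
import time

TOMBSTONE = object()  # deletion marker

store = {}

def write(key, value, ts=None):
    store[key] = (value, ts if ts is not None else time.time())

def delete(key):
    # Instead of removing the entry, overwrite it with a tombstone.
    write(key, TOMBSTONE)

def read(key):
    entry = store.get(key)
    if entry is None or entry[0] is TOMBSTONE:
        return None  # tombstoned data reads as deleted
    return entry[0]

write("user:1", "alice")
delete("user:1")       # the entry stays in the store, marked deleted
```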
What is Gossip protocol?
Gossip is a peer-to-peer communication protocol in which nodes periodically exchange state information about
themselves and about other nodes they know about. The gossip process runs every second and exchanges state
messages with up to three other nodes in the cluster.
What is Order Preserving partitioner?
This is a kind of Partitioner that stores rows by key order, aligning the physical structure of the data with your sort
order. Configuring your column family to use order-preserving partitioning allows you to perform range slices,
meaning that Cassandra knows which nodes have which keys. This partitioner is somewhat the opposite of the
Random Partitioner; it has the advantage of allowing for efficient range queries, but the disadvantage of unevenly
distributing keys.
The order-preserving partitioner (OPP) is implemented by the org.apache.cassandra.dht.OrderPreservingPartitioner
class. There is a special kind of OPP called the collating order-preserving partitioner (COPP). It acts like a regular
OPP, but sorts the data in a collated manner according to English/US lexicography instead of byte ordering, which
makes it useful for locale-aware applications. The COPP is implemented by the
org.apache.cassandra.dht.CollatingOrderPreservingPartitioner class.
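The trade-off between the two partitioning styles can be sketched side by side (illustrative functions with invented names, not Cassandra's token-ring implementation):

```python
def hash_partition(keys, n):
    """Hash partitioning (like the random partitioner): keys spread
    evenly, but a contiguous key range scatters across partitions."""
    parts = [[] for _ in range(n)]
    for k in keys:
        parts[hash(k) % n].append(k)   # placement decided by hash
    return parts

def order_partition(keys, n):
    """Order-preserving partitioning: contiguous key ranges stay
    together, enabling efficient range queries, at the risk of
    uneven distribution when keys are skewed."""
    keys = sorted(keys)
    size = -(-len(keys) // n)          # ceiling division
    return [keys[i:i + size] for i in range(0, len(keys), size)]

parts = order_partition(["a", "d", "b", "c"], 2)
```

With order-preserving partitioning, a range query like "all keys from a to b" touches a single partition; with hash partitioning it would have to fan out to every node.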
What is materialized view? Why is it normal practice in Cassandra to have it?
Materialized means storing a full copy of the original data so that everything you need to answer a query is right
there, without forcing you to look up the original data. Because Cassandra supports neither joins nor arbitrary SQL
WHERE clauses, you recreate this effect by writing your data to a second column family that is created specifically
to represent that query.
Why Time stamp is so important while inserting data in Cassandra?
This is important because Cassandra uses timestamps to determine the most recently written value.
What are advantages and disadvantages of secondary indexes in Cassandra?
Querying becomes more flexible when you add secondary indexes to table columns. You can add indexed columns
to the WHERE clause of a SELECT.
When to use secondary indexes: you want to query on a column that isn't the primary key and isn't part of a
composite key, and the column has relatively few unique values. For example, a Town column is a good candidate
for secondary indexing because many people will be from the same town; a date-of-birth column, with many unique
values, is not.
When to avoid secondary indexes: avoid them on columns that contain a high count of unique values and that will
produce few results per value. Remember that they make writing to the database much slower, values can only be
found by exact match on the indexed value, and finding a value by index requires requests to all servers in the
cluster.
What is the CQL Language?
Cassandra 0.8 is the first release to introduce the Cassandra Query Language (CQL), the first standardized query
language for Apache Cassandra. CQL pushes all of the implementation details to the server in the form of a CQL
parser. Clients built on CQL only need to know how to interpret query result objects. CQL is the start of the first
officially supported client API for Apache Cassandra. CQL drivers for the various languages are hosted with the
Apache Cassandra project.
CQL Syntax is based on SQL (Structured Query Language), the standard for relational database manipulation.
Although CQL has many similarities to SQL, it does not change the underlying Cassandra data model. There is no
support for JOINS, for example.
Mention what is Cassandra- CQL collections?
Cassandra CQL collections help you store multiple values in a single variable. In Cassandra, you can use CQL
collections in the following ways:
List: used when the order of the data needs to be maintained and a value may be stored multiple times (holds an
ordered list that allows duplicate elements)
Set: used for a group of elements to be stored and returned in sorted order (holds unique, non-repeating elements)
Map: a data type used to store key-value pairs of elements
What is Cassandra-Cqlsh?
Cassandra cqlsh is a command-line shell for communicating with the Cassandra database through CQL. Cassandra
cqlsh facilitates you to do the following things:
o Define a schema
o Insert data
o Execute queries
It's a Python-based command-line client for Cassandra.
What is the use of SOURCE command?
SOURCE command is used to execute a file that contains CQL statements.
How do you query Cassandra?
We query Cassandra using CQL (the Cassandra Query Language). We use cqlsh to interact with the database.
What is the use of HELP command?
It is used to display a synopsis and a brief description of all cqlsh commands.
What is the use of capture command?
The CAPTURE command captures the output of a command and appends it to a file.
Does Cassandra work on Windows?
Yes, Cassandra works well on Windows. Both Linux- and Windows-compatible versions are available.
Why is denormalization preferred in Cassandra?
Because Cassandra does not support joins, data is denormalized so each query can be answered from a single table;
users can join data on their own end.
What is the use of consistency command?
The CONSISTENCY command displays the current consistency level, or sets a new consistency level.
Does Cassandra Support Transactions?
Yes and No, depending on what you mean by ‘transactions’. Unlike relational databases, Cassandra does not offer
fully ACID-compliant transactions. There are no locking or transactional dependencies when concurrently updating
multiple rows or column families. But if by ‘transactions’ you mean real-time data entry and retrieval, with
durability and tunable consistency, then yes.
Cassandra does not support transactions in the sense of bundling multiple row updates into one all-or-nothing
operation. Nor does it roll back when a write succeeds on one replica but fails on other replicas. It is possible in
Cassandra for a write operation to report a failure to the client, yet still persist the write to a replica.
However, this does not mean that Cassandra cannot be used as an operational or real time data store. Data is very
safe in Cassandra because writes in Cassandra are durable. All writes to a replica node are recorded both in memory
and in a commit log before they are acknowledged as a success. If a crash or server failure occurs before the
memory tables are flushed to disk, the commit log is replayed on restart to recover any lost writes.
What is Compaction in Cassandra?
The compaction process merges keys, combines columns, evicts tombstones, consolidates sstables, and creates a
new index in the merged sstable.
What is Anti-Entropy?
Anti-entropy, or replica synchronization, is the mechanism in Cassandra for ensuring
that data on different nodes is updated to the newest version.
What are consistency levels for read operations?
ONE: returns the value from the first node that responds, and performs a read repair in the background.
QUORUM: queries all nodes and returns the record with the most recent timestamp once a quorum of nodes has
responded, where a quorum is (n / 2) + 1.
DCQUORUM: ensures that only nodes in the same data center are queried. It applies when using a rack-aware
placement strategy.
ALL: queries all nodes and returns the value with the most recent timestamp. This level waits for all nodes to
respond, and if one doesn't, the read operation fails.
What do you understand by Consistency in Cassandra?
Consistency describes how synchronized and up-to-date a row of Cassandra data is on all of its replicas.
Explain Zero Consistency?
With zero consistency, write operations are handled in the background, asynchronously. It is the fastest way to write
data, and the one that offers the least confidence that operations will succeed.
Explain Any Consistency?
It assures that the write operation was successful on at least one node, even if the acknowledgment is only for a
hint. It is a relatively weak level of consistency.
Explain ONE consistency?
It is used to ensure that the write operation was written to at least one node, including its commit log and memtable.
Explain QUORUM consistency?
A quorum is a number of nodes that is used to represent the consensus on an operation. It is determined by
<replicationfactor> / 2 + 1.
Explain ALL consistency?
Every node as specified in your <replicationfactor> configuration entry must successfully acknowledge the write
operation. If any nodes do not acknowledge the write operation, the write fails. This has the highest level of
consistency and the lowest level of performance.
What do you mean by hint handoff?
It is a mechanism to ensure availability, fault tolerance, and graceful degradation. If a write operation occurs while a
node intended to receive that write is down, a note (the hint) is handed off to a different live node to indicate that it
should replay the write operation to the unavailable node when it comes back online. This does two things: it
reduces the amount of time it takes for a node to get all the data it missed once it comes back online, and it improves
write performance at lower consistency levels.
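The hint-and-replay cycle can be sketched with two toy nodes (illustrative only; names and structures are invented, and real hints are persisted with expiry windows):

```python
class Node:
    """Toy node for a hinted-handoff sketch."""
    def __init__(self, name):
        self.name, self.up, self.data, self.hints = name, True, {}, []

def write(target, coordinator, key, value):
    if target.up:
        target.data[key] = value
    else:
        # Target is down: hand a hint to a live node instead.
        coordinator.hints.append((target, key, value))

def replay_hints(coordinator):
    # When a target comes back online, replay its missed writes.
    remaining = []
    for target, key, value in coordinator.hints:
        if target.up:
            target.data[key] = value
        else:
            remaining.append((target, key, value))
    coordinator.hints = remaining

a, b = Node("a"), Node("b")
b.up = False
write(b, a, "k", "v")     # b is down; a stores the hint
b.up = True
replay_hints(a)           # the missed write is replayed to b
```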
What is Merkle Tree? Where is it used in Cassandra?
A Merkle tree is a binary hash tree that summarizes a larger dataset in short form: leaves hash blocks of data, and
each internal node hashes its children. Merkle trees are used in Cassandra's anti-entropy repair, letting replica nodes
compare their data cheaply and detect which ranges differ and need to be synchronized.
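The core Merkle construction can be sketched as repeated pairwise hashing (a generic hash-tree sketch, not Cassandra's repair implementation):

```python
import hashlib

def sha(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(blocks):
    """Hash the leaf blocks, then repeatedly hash pairs of child
    hashes until a single root digest remains."""
    level = [sha(b) for b in blocks]
    while len(level) > 1:
        if len(level) % 2:             # duplicate the last hash if odd
            level.append(level[-1])
        level = [sha(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

root = merkle_root([b"a", b"b", b"c", b"d"])
```

Two replicas holding the same data compute the same root; any differing block changes the root, so comparing a handful of digests is enough to locate divergent ranges.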
What do you mean by multiget?
It means a query by column name for a set of keys.
What is a SEED node in Cassandra?
A seed is a node that already exists in a Cassandra cluster and that newly added nodes use to get up and running.
A newly added node starts gossiping with a seed node to get state information and learn the topology of the node
ring. There may be many seeds in a cluster.
What is Slice and Range slice in Cassandra?
This is a type of read query. Use get_slice() to query by a single column name or a range of column names. Use
get_range_slice() to return a subset of columns for a range of keys.
What is Multiget Slice?
It means query to get a subset of columns for a set of keys.
What is Tombstone in Cassandra world?
Cassandra does not immediately delete data following a delete operation. Instead, it marks the data with a
tombstone, an indicator that the column has been deleted but not removed entirely yet. The tombstone can then be
propagated to other replicas.
What is Thrift?
Thrift is the RPC framework and protocol historically used by clients to communicate with the Cassandra server.
What is Batch Mutates?
Like a batch update in the relational world, the batch_mutate operation allows grouping calls on many keys into a
single call in order to save on the cost of network round trips. If batch_mutate fails in the middle of its list of
mutations, there will be no rollback, so any updates that have already occurred up to this point will remain intact.
What is Random Partitioner?
This is a kind of partitioner that uses a BigIntegerToken with an MD5 hash to determine where to place the keys on
the node ring. This has the advantage of spreading your keys evenly across your cluster, but the disadvantage of
causing inefficient range queries. This is the default partitioner.
What is Read Repair?
This is another mechanism to ensure consistency throughout the node ring. In a read operation, if Cassandra detects
that some nodes have responded with data that is inconsistent with the response of other, newer nodes, it makes a
note to perform a read repair on the old nodes. The read repair means that Cassandra will send a write request to the
nodes with stale data to get them up to date with the newer data returned from the original read operation. It does
this by pulling all the data from the node, performing a merge, and writing the merged data back to the nodes that
were out of sync. The detection of inconsistent data is made by comparing timestamps and checksums.
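The timestamp-based resolution at the heart of read repair can be sketched as follows (an illustrative function with invented names; real read repair also compares checksums and merges per column):

```python
def read_repair(replicas):
    """Pick the value with the newest timestamp and push it back to
    any replica holding stale data.
    `replicas` maps replica name -> (value, timestamp)."""
    newest = max(replicas.values(), key=lambda vt: vt[1])
    stale = [name for name, vt in replicas.items() if vt != newest]
    for name in stale:
        replicas[name] = newest       # write the newest value back
    return newest[0], stale

replicas = {"n1": ("old", 10), "n2": ("new", 20), "n3": ("new", 20)}
value, repaired = read_repair(replicas)
```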
What is Snitch in Cassandra?
A snitch is Cassandra’s way of mapping a node to a physical location in the network.
It helps determine the location of a node relative to another node in order to assist with discovery and ensure
efficient request routing.
What are the benefits/ advantages of Cassandra?
o Cassandra was designed to handle big data workloads across multiple nodes without any single point of
failure.
o Cassandra delivers near real-time performance simplifying the work of Developers, Administrators, Data
Analysts and Software Engineers.
o It provides elastic scalability and can be easily scaled up and scaled down as per the requirements.
o It is fault tolerant and consistent.
o It is a column-oriented database.
o It has no single point of failure.
o There is no need for separate caching layer.
o It has flexible schema design.
o It has flexible data storage, easy data distribution, and fast writes.
o It offers atomic, durable writes at the row level with tunable consistency, though not full ACID transactions across rows.
o It is multi-data-center and cloud capable.
Libraries: Hadoop relies on separate tools, whereas Spark ships with built-in libraries: Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX.
Special operations can be performed on RDDs in Spark using key/value pairs; such RDDs are referred to as pair
RDDs. Pair RDDs allow users to access each key in parallel. They have a reduceByKey() method that aggregates
data per key and a join() method that combines different RDDs together, based on the elements having the same
key.
Scala:
val lines = sc.textFile("data.txt")
val pairs = lines.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)
We could also use counts.sortByKey(), for example, to sort the pairs alphabetically, and finally counts.collect() to
bring them back to the driver program as an array of objects.
Note: when using custom objects as the key in key-value pair operations, you must be sure that a custom equals()
method is accompanied with a matching hashCode() method.
Difference between coalesce and repartition
coalesce uses existing partitions to minimize the amount of data that's shuffled, while repartition creates new
partitions and performs a full shuffle. coalesce can result in partitions with different amounts of data (sometimes
with very different sizes), whereas repartition results in roughly equal-sized partitions.
Is coalesce or repartition faster?
coalesce may run faster than repartition, but unequal sized partitions are generally slower to work with than equal
sized partitions. You'll usually need to repartition datasets after filtering a large data set. I've found repartition to be
faster overall because Spark is built to work with equal sized partitions.
What do you understand by Lazy Evaluation?
When a transformation like map() is called on an RDD, the operation is not performed immediately. Transformations
in Spark are not evaluated until you perform an action. This helps optimize the overall data-processing workflow.
Because of lazy evaluation, errors in transformations surface only when an action is executed.
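The same deferred-error behavior can be demonstrated with a plain Python generator (a language-level analogy to Spark's laziness, not Spark code):

```python
def lazy_map(func, data):
    """Generator-based sketch of lazy evaluation: nothing runs
    until the result is consumed (the 'action')."""
    for item in data:
        yield func(item)

# Building the pipeline raises no error, even though 10 // 0 is coming.
pipeline = lazy_map(lambda x: 10 // x, [5, 2, 0])

results = []
try:
    for value in pipeline:     # the 'action': iteration forces the work
        results.append(value)
except ZeroDivisionError:
    pass                       # the error only appears at this point
```

This mirrors why a buggy `map()` in Spark fails not at definition time but when `collect()`, `count()`, or another action finally runs the pipeline.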
Define a worker node.
A node that can run Spark application code in a cluster is called a worker node. A worker node can host more than
one worker, configured via the SPARK_WORKER_INSTANCES property in the spark-env.sh file. Only one worker
is started if the SPARK_WORKER_INSTANCES property is not defined.
What do you understand by Executor Memory in a Spark application?
Every Spark application has a fixed heap size and a fixed number of cores for each Spark executor. The heap size is
what is referred to as Spark executor memory, controlled via the spark.executor.memory property or the
--executor-memory flag. A Spark application typically runs one or more executors on each worker node. The
executor memory is essentially a measure of how much of the worker node's RAM the application will use.
What is a Spark Executor?
When SparkContext connects to a cluster manager, it acquires executors on the cluster's nodes. Executors are Spark
processes that run computations and store data on the worker nodes. The final tasks from SparkContext are
transferred to the executors.
How RDD persist the data?
There are two methods to persist data: persist(), which lets you choose a storage level, and cache(), which uses the
default storage level MEMORY_ONLY. Various storage level options exist, such as MEMORY_ONLY,
MEMORY_AND_DISK, DISK_ONLY, and many more.
What are the various levels of persistence in Apache Spark?
Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel
operations. Spark has various persistence levels to store the rdds on disk or in memory or as a combination of both
with different replication levels.
The various storage/persistence levels in Spark are -
MEMORY_ONLY
MEMORY_ONLY_SER
MEMORY_AND_DISK
MEMORY_AND_DISK_SER
DISK_ONLY
OFF_HEAP
How can you minimize data transfers when working with Spark?
Minimizing data transfers and avoiding shuffling helps you write Spark programs that run fast and reliably. Spark
supports two types of shared variables: broadcast variables (like the Hadoop distributed cache) and accumulators
(like Hadoop counters). The various ways in which data transfers can be minimized when working with Apache
Spark are:
1. Using broadcast variables - broadcast variables enhance the efficiency of joins between small and large RDDs.
2. Using accumulators - accumulators help update the values of variables in parallel while executing.
The most common way is to avoid byKey operations, repartition, or any other operations that trigger shuffles.
What are broadcast variables?
Broadcast variables let the programmer keep a read-only variable cached on each machine, rather than shipping a
copy of it with every task, so data can be processed faster. Broadcast variables send read-only values to worker
nodes once. They help store a lookup table in memory, which improves retrieval efficiency compared to an RDD
lookup().
Scala:
val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar.value
What are Accumulators in Spark?
Accumulators in Spark are used specifically to provide a mechanism for safely updating a variable when execution
is split up across worker nodes in a cluster.
Spark accumulators are similar to Hadoop counters, to count the number of events and what’s happening during job
you can use accumulators. A numeric accumulator can be created by
calling SparkContext.longAccumulator() or SparkContext.doubleAccumulator() to accumulate values of type Long
or Double, respectively. Tasks running on a cluster can then add to it using the add method; however, they cannot
read its value. Only the driver program can read the accumulator's value, using its value method.
Scala:
val accum = sc.longAccumulator("My Accumulator")
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum.add(x))
accum.value
Spark cluster mode vs. client mode
Client mode: the driver runs on the machine that submits the job (the client).
Cluster mode: the driver runs on a worker node chosen by the cluster manager.
In yarn-cluster mode, the driver runs in the Application Master (inside a YARN container). In yarn-client mode, it
runs in the client.
A common deployment strategy is to submit your application from a gateway machine that is physically co-located
with your worker machines (e.g. Master node in a standalone EC2 cluster). In this setup, client mode is appropriate.
In client mode, the driver is launched directly within the spark-submit process which acts as a client to the cluster.
The input and output of the application is attached to the console. Thus, this mode is especially suitable for
applications that involve the REPL (e.g. Spark shell).
Alternatively, if your application is submitted from a machine far from the worker machines (e.g. locally on your
laptop), it is common to use cluster mode to minimize network latency between the drivers and the executors.
Currently, standalone mode does not support cluster mode for Python applications.
Not directly, but we can register an existing RDD as a SQL table and trigger SQL queries on top of it.
What is Hive on Spark?
Hive is a component of Hortonworks’ Data Platform (HDP). Hive provides an SQL-like interface to data stored in
the HDP. Spark users will automatically get the complete set of Hive’s rich features, including any new features that
Hive might introduce in the future.
The main task around implementing the Spark execution engine for Hive lies in query planning, where Hive
operator plans from the semantic analyzer which is translated to a task plan that Spark can execute. It also includes
query execution, where the generated Spark plan gets actually executed in the Spark cluster.
What is Catalyst framework?
Catalyst framework is a new optimization framework present in Spark SQL. It allows Spark to automatically
transform SQL queries by adding new optimizations to build a faster processing system.
Why is BlinkDB used?
BlinkDB is a query engine for executing interactive SQL queries on huge volumes of data; it renders query results
marked with meaningful error bars. BlinkDB helps users balance query accuracy with response time.
How Spark uses Hadoop?
Spark has its own cluster management for computation and mainly uses Hadoop for storage.
Which one will you choose for a project –Hadoop mapreduce or Apache Spark?
The answer to this question depends on the given project scenario: Spark makes use of memory instead of network
and disk I/O. However, Spark uses a large amount of RAM and requires dedicated machines to produce effective
results. So the decision to use Hadoop or Spark varies dynamically with the requirements of the project and the
budget of the organization.
Explain the different types of transformations on DStreams.
Stateless transformations - Processing of a batch does not depend on the output of the previous batch. Examples -
map(), reduceByKey(), filter().
Stateful transformations - Processing of a batch depends on the intermediary results of previous batches.
Examples - transformations that depend on sliding windows.
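The difference can be sketched without a cluster. The following is a plain-Python emulation (not Spark API code) in which each list stands in for one micro-batch; the stateful branch mimics what updateStateByKey() does by carrying a running count per key across batches:

```python
# Plain-Python sketch (no Spark required) of the two transformation styles.
# Each "batch" stands in for one micro-batch of a DStream.
batches = [["a", "b", "a"], ["b", "c"], ["a"]]

# Stateless: each batch is processed independently (like map()/filter()).
stateless = [[w.upper() for w in batch] for batch in batches]

# Stateful: the result carries state forward across batches
# (like updateStateByKey(), which keeps a running count per key).
state = {}
stateful = []
for batch in batches:
    for w in batch:
        state[w] = state.get(w, 0) + 1
    stateful.append(dict(state))  # snapshot of running counts after this batch

print(stateless[0])   # ['A', 'B', 'A']
print(stateful[-1])   # {'a': 3, 'b': 2, 'c': 1}
```

Note how the stateless results depend only on the batch at hand, while each stateful result reflects everything seen so far.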
Is Apache Spark a good fit for Reinforcement learning?
No. Apache Spark works well only for simple machine learning algorithms like clustering, regression, and classification.
What is a Spark Driver?
Spark Driver is the program that runs on the master node of the machine and declares transformations and actions on
data RDDs. The driver also delivers the RDD graphs to the Master, where the standalone cluster manager runs.
How can you remove the elements with a key present in any other RDD?
Use the subtractByKey() function.
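For illustration, the semantics of subtractByKey() can be emulated in plain Python (no Spark needed); in PySpark itself you would simply call rdd1.subtractByKey(rdd2):

```python
# Plain-Python emulation of RDD.subtractByKey() semantics:
# keep only the (key, value) pairs whose key does NOT appear in the other RDD.
rdd1 = [("a", 1), ("b", 2), ("c", 3)]
rdd2 = [("b", 99), ("d", 4)]

other_keys = {k for k, _ in rdd2}
result = [(k, v) for k, v in rdd1 if k not in other_keys]
print(result)  # [('a', 1), ('c', 3)]
```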
How Spark handles monitoring and logging in Standalone mode?
Spark has a web based user interface for monitoring the cluster in standalone mode that shows the cluster and job
statistics. The log output for each job is written to the work directory of the slave nodes.
How can you launch Spark jobs inside Hadoop mapreduce?
Using SIMR (Spark in MapReduce), users can run any Spark job inside MapReduce without requiring any admin
rights.
How can you achieve high availability in Apache Spark?
Implementing single-node recovery with the local file system
Using standby Masters with Apache ZooKeeper
Explain about the core components of a distributed Spark application.
Driver - The process that runs the main() method of the program to create RDDs and perform transformations and
actions on them.
Executor - The worker processes that run the individual tasks of a Spark job.
Cluster Manager - A pluggable component in Spark that launches Executors and Drivers. The cluster manager allows
Spark to run on top of external managers like Apache Mesos or YARN.
What are the disadvantages of using Apache Spark over Hadoop mapreduce?
Apache Spark does not scale well for compute-intensive jobs and consumes a large number of system resources.
Apache Spark's in-memory capability at times becomes a major roadblock for cost-efficient processing of big data.
Also, Spark does not have its own file management system and hence needs to be integrated with other cloud-based
data platforms or Apache Hadoop.
What makes Apache Spark good at low-latency workloads like graph processing and machine learning?
Apache Spark stores data in-memory for faster model building and training. Machine learning algorithms require
multiple iterations to converge on an optimal model, and graph algorithms similarly traverse all the nodes and
edges. Keeping the data in memory across these iterations leads to increased performance for such low-latency
workloads. Less disk access and controlled network traffic make a huge difference when there is a lot of data to be
processed.
What is caching?
Caching keeps the data in memory for computation rather than going to disk. Because Spark can serve cached data
from memory, it can be up to 100 times faster than Hadoop reading from disk.
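As a rough sketch of why caching helps, here is a plain-Python analogy (not Spark code): an iterative job that reloads its input on every pass versus one that keeps an in-memory copy, as rdd.cache() would:

```python
# Analogy only: why caching helps iterative jobs. Without a cache the
# "expensive load" runs once per iteration; with one, it runs once total.
load_count = 0

def load_dataset():
    global load_count
    load_count += 1          # stands in for a slow disk/HDFS read
    return [1, 2, 3, 4]

# Without caching: 3 iterations -> 3 loads.
for _ in range(3):
    data = load_dataset()
without_cache_loads = load_count

# With caching: load once, reuse the in-memory copy (like rdd.cache()).
load_count = 0
cached = load_dataset()
for _ in range(3):
    data = cached
with_cache_loads = load_count

print(without_cache_loads, with_cache_loads)  # 3 1
```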
How does Spark store data?
Spark is a processing engine; it has no storage engine of its own. It can retrieve data from any storage engine like
HDFS, S3, and other data sources.
Is it mandatory to start Hadoop to run a Spark application?
No, it is not mandatory. Spark has no separate storage of its own, so it can use the local file system to store data.
You can load data from the local file system and process it; Hadoop or HDFS is not required to run a Spark
application.
Comparison of RDD, DataFrame, and Dataset, feature by feature:

Data Representation
- RDD: a distributed collection of data elements spread across many machines in the cluster.
- DataFrame: a distributed collection of data organized into named columns; conceptually equal to a table in a relational database.
- Dataset: an extension of the DataFrame API that provides the type-safe, object-oriented programming interface of the RDD API.

Data Formats
- RDD: can easily and efficiently process data that is structured as well as unstructured. An RDD does not infer the schema of the ingested data and requires the user to specify it.
- DataFrame: can process structured and unstructured data efficiently. It organizes the data into named columns, allowing Spark to manage the schema.
- Dataset: also efficiently processes structured and unstructured data. It represents data as JVM objects of Row, or a collection of Row objects, rendered in tabular form through encoders.

Data Sources API
- RDD: the Data Source API allows an RDD to come from any data source, e.g. a text file or a database via JDBC, and to handle data with no predefined structure.
- DataFrame: the Data Source API allows processing data in different formats (Avro, CSV, JSON) and storage systems (HDFS, Hive tables, MySQL). It can read from and write to various data sources.
- Dataset: the Dataset API also supports data from different sources.

Immutability and Interoperability
- RDD: contains a collection of partitioned records. The basic unit of parallelism in an RDD is called a partition; each partition is one logical division of data, which is immutable and created through some transformation on existing partitions.
- DataFrame: after transforming into a DataFrame, one cannot regenerate the domain object. For example, if you generate testDF from testRDD, you won't be able to recover the original RDD of the test class.
- Dataset: overcomes this limitation of DataFrames; Datasets allow you to convert your existing RDDs and DataFrames into Datasets.

Compile-time type safety
- RDD: provides a familiar object-oriented programming style with compile-time type safety.
- DataFrame: if you try to access a column that does not exist in the table, the DataFrame API does not raise a compile-time error; it detects the attribute error only at runtime.
- Dataset: provides compile-time type safety.

Optimization
- RDD: no built-in optimization engine is available; developers optimize each RDD on the basis of its attributes.
- DataFrame: optimization takes place using the Catalyst optimizer. DataFrames use the Catalyst tree-transformation framework in four phases: analyzing a logical plan to resolve references; logical plan optimization; physical planning; and code generation to compile parts of the query to Java bytecode.
- Dataset: includes the DataFrame Catalyst optimizer for optimizing query plans.

Serialization
- RDD: whenever Spark needs to distribute the data within the cluster or write the data to disk, it uses Java serialization. Serializing individual Java and Scala objects is expensive and requires sending both data and structure between nodes.
- DataFrame: can serialize the data into off-heap storage (in memory) in binary format and then perform many transformations directly on this off-heap memory, because Spark understands the schema. There is no need to use Java serialization to encode the data. It provides a Tungsten physical execution backend that explicitly manages memory and dynamically generates bytecode for expression evaluation.
- Dataset: when it comes to serializing data, the Dataset API has the concept of an encoder, which handles conversion between JVM objects and the tabular representation. It stores the tabular representation using the Spark-internal Tungsten binary format. Datasets allow performing operations on serialized data, improving memory use, and allow on-demand access to individual attributes without deserializing the entire object.

Garbage Collection
- RDD: there is overhead for garbage collection that results from creating and destroying individual objects.
- DataFrame: avoids the garbage-collection costs of constructing individual objects for each row in the dataset.
- Dataset: there is also no need for the garbage collector to destroy objects, because serialization takes place through Tungsten, which uses off-heap data serialization.

Efficiency / Memory use
- RDD: efficiency decreases when serialization is performed individually on Java and Scala objects, which takes a lot of time.
- DataFrame: the use of off-heap memory for serialization reduces the overhead. It generates bytecode dynamically, so many operations can be performed on the serialized data, with no need for deserialization for small operations.
- Dataset: allows performing operations on serialized data, improving memory use, and allows on-demand access to individual attributes without deserializing the entire object.

Schema Projection
- RDD: schema projection is used explicitly; we need to define the schema manually.
- DataFrame: auto-discovers the schema from the files and exposes them as tables through the Hive metastore. This lets standard SQL clients connect to the engine and explore a dataset without defining the schema of its files.
- Dataset: auto-discovers the schema of the files thanks to the Spark SQL engine.

Aggregation
- RDD: the RDD API is slower at performing simple grouping and aggregation operations.
- DataFrame: the DataFrame API is very easy to use, and is faster for exploratory analysis and for creating aggregated statistics on large datasets.
- Dataset: it is faster to perform aggregation operations on plenty of datasets.
What is the bottom layer of abstraction in the Spark Streaming API?
DStream.
What is Spark Streaming?
Spark supports stream processing, essentially an extension to the Spark API that allows stream processing of live
data streams. Data from different sources like Flume and HDFS is streamed, processed, and pushed out to file
systems, live dashboards, and databases. It is similar to batch processing in that the input data is divided into
batch-like streams.
Business use cases for Spark Streaming: each Spark component has its own use case. Whenever you want to analyze
data with a latency of less than 15 minutes but greater than 2 minutes, i.e. near real time, you use Spark
Streaming. The data can be processed using complex algorithms expressed with high-level functions like map, reduce,
join, and window. Finally, processed data can be pushed out to file systems, databases, and live dashboards. In fact,
you can apply Spark's machine learning and graph processing algorithms on data streams.
When do we use Spark Streaming?
Spark Streaming is an API for real-time processing of streaming data. Spark Streaming gathers streaming data from
different sources like web server log files, social media feeds, and stock market data, or from Hadoop-ecosystem
tools like Flume and Kafka.
How does the Spark Streaming API work?
Internally, it works as follows. Spark Streaming receives live input data streams and divides the data into batches,
which are then processed by the Spark engine to generate the final stream of results in batches.
Spark Streaming provides a high-level abstraction called a discretized stream, or DStream, which represents a
continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka,
Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as
a sequence of RDDs. The core engine generates the final results as a sequence of batches, so the output is also in
batch form; this allows streaming data and batch data to be processed in the same way. Spark Streaming programs
can be written in Scala, Java, or Python (Python support was introduced in Spark 1.2).
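The micro-batch model can be sketched in plain Python (no Spark required); each slice of the incoming lines stands in for one micro-batch handed to the engine:

```python
# Plain-Python sketch of the micro-batch model: a live stream is cut into
# small batches, and each batch is processed like an ordinary (small) RDD.
incoming = ["error ok", "ok ok warn", "error"]  # pretend lines arriving over time
batch_size = 1                                  # one line per micro-batch here

def process(batch_lines):
    # the "Spark engine" step: count words in this batch
    words = [w for line in batch_lines for w in line.split()]
    return len(words)

results = [process(incoming[i:i + batch_size])
           for i in range(0, len(incoming), batch_size)]
print(results)  # [2, 3, 1] -- one result per batch: the output is also a stream of batches
```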
What is the significance of the Sliding Window operation?
A sliding window controls which recent batches of data a computation is applied to. The Spark Streaming library
provides windowed computations, where the transformations on RDDs are applied over a sliding window of data.
Whenever the window slides, the RDDs that fall within that window are combined and operated upon to
produce the new RDDs of the windowed DStream.
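A plain-Python sketch of the idea (not Spark API code): each output value combines the batches that fall inside the current window:

```python
# Plain-Python sketch of a windowed computation: each output combines the
# RDDs (here, lists of numbers) from the last `window` micro-batches.
batches = [[1], [2, 3], [4], [5]]
window = 2  # window length in batches; slide interval = 1 batch

windowed_sums = []
for i in range(len(batches)):
    current = batches[max(0, i - window + 1): i + 1]  # batches in the window
    combined = [x for b in current for x in b]
    windowed_sums.append(sum(combined))

print(windowed_sums)  # [1, 6, 9, 9]
```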
What is a DStream?
A Discretized Stream (DStream) is a sequence of Resilient Distributed Datasets representing a stream of data,
processed as micro-batches. DStreams can be created from various sources like Apache Kafka, HDFS, and Apache
Flume. DStreams support two kinds of operations:
Transformations, which produce a new DStream.
Output operations, which write data to an external system.
What is the biggest shortcoming of Spark?
Spark utilizes more storage space compared to Hadoop MapReduce.
Also, Spark Streaming is not true streaming but micro-batching, so some window functions cannot work properly
on top of it.
Explain about the major libraries that constitute the Spark Ecosystem
Spark MLlib - Machine learning library in Spark for commonly used learning algorithms like clustering, regression,
classification, etc.
Spark Streaming - This library is used to process real-time streaming data.
Spark GraphX - Spark API for graph-parallel computations with basic operators like joinVertices, subgraph,
aggregateMessages, etc.
Spark SQL - Helps execute SQL-like queries on Spark data using standard visualization or BI tools.
Explain about the different cluster managers in Apache Spark
The 3 different cluster managers supported in Apache Spark are:
YARN
Apache Mesos - Has rich resource scheduling capabilities and is well suited to running Spark along with other
applications. It is advantageous when several users run interactive shells because it scales down the CPU allocation
between commands.
Standalone deployments - Well suited for new deployments, which only run Spark and are easy to set up.
How can Spark be connected to Apache Mesos?
To connect Spark with Mesos:
Configure the Spark driver program to connect to Mesos, with the Spark binary package in a location accessible by
Mesos; (or)
Install Apache Spark in the same location as Apache Mesos and configure the property
spark.mesos.executor.home to point to the location where it is installed.
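As an illustrative sketch only, the second option might look like the following spark-defaults.conf fragment; the Mesos master hostname and the install path are placeholders, not values from this document:

```
# conf/spark-defaults.conf (illustrative; hostname and path are placeholders)
spark.master                 mesos://mesos-master.example.com:5050
spark.mesos.executor.home    /opt/spark
```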
Is it possible to run Apache Spark on Apache Mesos?
Yes, Apache Spark can be run on the hardware clusters managed by Mesos.
Is it possible to run Spark and Mesos along with Hadoop?
Yes, it is possible to run Spark and Mesos with Hadoop by launching each of these as a separate service on the
machines. Mesos acts as a unified scheduler that assigns tasks to either Spark or Hadoop.
What are the benefits of using Spark with Apache Mesos?
It renders scalable partitioning among various Spark instances and dynamic partitioning between Spark and other
big data frameworks.
When running Spark applications, is it necessary to install Spark on all the nodes of YARN cluster?
Spark need not be installed on all nodes when running a job under YARN or Mesos, because Spark can execute on
top of YARN or Mesos clusters without requiring any change to the cluster.
What are the functions of Spark Core?
Spark Core performs an array of critical functions like memory management, job monitoring, fault tolerance, job
scheduling, and interaction with storage systems.
It is the foundation of the overall project. It provides distributed task dispatching, scheduling, and basic input and
output functionality. The RDD in Spark Core makes it fault tolerant. An RDD is a collection of items distributed
across many nodes that can be manipulated in parallel. Spark Core provides many APIs for building and
manipulating these collections.
What is Spark MLlib?
Just as Mahout is a machine learning library for Hadoop, MLlib is a machine learning library for Spark. MLlib
provides different algorithms that scale out on the cluster for data processing. Most data scientists use this MLlib
library.
What is GraphX?
GraphX is a Spark API for manipulating graphs and collections. It unifies ETL, other analysis, and iterative graph
computation. It is among the fastest graph systems, and it provides fault tolerance and ease of use without requiring
special skills.
What is the function of MLlib?
MLlib is Spark's machine learning library. It aims at making machine learning easy and scalable, with common
learning algorithms and real-life use cases including clustering, regression, filtering, and dimensionality reduction,
among others.
What is the File System API?
The FS API can read data from different storage devices like HDFS, S3, or the local filesystem. Spark uses the FS
API to read data from different storage engines.
Say I have a huge list of numbers in an RDD (say myrdd). And I wrote the following code to compute the average:
def myAvg(x, y):
    return (x + y) / 2.0
avg = myrdd.reduce(myAvg)
What is wrong with it? And how would you correct it?
The average function is neither commutative nor associative.
I would simply sum it and then divide by the count:
def sum(x, y):
    return x + y
total = myrdd.reduce(sum)
avg = total / myrdd.count()
The only problem with the above code is that the total might become very big and overflow. So, I would rather
divide each number by the count first and then sum, in the following way:
cnt = myrdd.count()
def divideByCnt(x):
    return x / cnt
myrdd1 = myrdd.map(divideByCnt)
avg = myrdd1.reduce(sum)
Say I have a huge list of numbers in a file in HDFS. Each line has one number, and I want to compute the
square root of the sum of squares of these numbers. How would you do it?
# We would first load the file as an RDD from HDFS on Spark
numsAsText = sc.textFile("hdfs://hadoop1.knowbigdata.com/user/student/sgiri/mynumbersfile.txt")
# Define the function to compute the squares
def toSqInt(s):
    v = int(s)
    return v * v
# Run the function on the Spark RDD as a transformation
nums = numsAsText.map(toSqInt)
# Run the summation as a reduce action, using the two-argument sum function defined in the previous answer
total = nums.reduce(sum)
# Finally compute the square root, for which we need to import math
import math
print(math.sqrt(total))
Is the following approach correct? Is sqrtOfSumOfSq a valid reducer?
numsAsText = sc.textFile("hdfs://hadoop1.knowbigdata.com/user/student/sgiri/mynumbersfile.txt")
def toInt(s):
    return int(s)
nums = numsAsText.map(toInt)
import math
def sqrtOfSumOfSq(x, y):
    return math.sqrt(x * x + y * y)
total = nums.reduce(sqrtOfSumOfSq)
print(total)
A: Yes. The approach is correct and sqrtOfSumOfSq is a valid reducer: combining the running result r with the
next number y as sqrt(r*r + y*y) preserves the square root of the sum of squares at every step.
Could you compare the pros and cons of your approach (in Question 2 above) and my approach (in Question
3 above)?
You are doing the square and square root as part of the reduce action, while I am squaring in map() and summing in
reduce() in my approach.
My approach will be faster, because in your case the reducer code is heavy: it calls math.sqrt(), and the reducer
code is executed approximately n-1 times over the Spark RDD.
The only downside of my approach is that there is a greater chance of integer overflow, because I am computing the
sum of squares as part of the map.
If you have to compute the total counts of each of the unique words on Spark, how would you go about it?
# This will load bigtextfile.txt as an RDD in Spark
lines = sc.textFile("hdfs://hadoop1.knowbigdata.com/user/student/sgiri/bigtextfile.txt")
# Define a function that can break each line into words
def toWords(line):
    return line.split()
# Run the toWords function on each element of the RDD as a flatMap transformation.
# We use flatMap instead of map because our function returns multiple values.
words = lines.flatMap(toWords)
# Convert each word into a (key, value) pair. Here the key will be the word itself and the value will be 1.
def toTuple(word):
    return (word, 1)
wordsTuple = words.map(toTuple)
# Now we can easily apply the reduceByKey() transformation.
def sum(x, y):
    return x + y
counts = wordsTuple.reduceByKey(sum)
# Now, print
counts.collect()
In a very huge text file, you want to just check if a particular keyword exists. How would you do this using
Spark?
lines = sc.textFile("hdfs://hadoop1.knowbigdata.com/user/student/sgiri/bigtextfile.txt")
def isFound(line):
    if line.find("mykeyword") > -1:
        return 1
    return 0
foundBits = lines.map(isFound)
total = foundBits.reduce(sum)  # sum(x, y) as defined earlier
if total > 0:
    print("FOUND")
else:
    print("NOT FOUND")
Can you improve the performance of the code in the previous answer?
Yes. The search does not stop even after the word we are looking for has been found; our map code would keep
executing on all the nodes, which is very inefficient. We could use an accumulator to report whether the word has
been found and then stop the job. Something along these lines:
import thread, threading
from time import sleep

result = "Not Set"
lock = threading.Lock()
accum = sc.accumulator(0)

def map_func(line):
    # introduce delay to emulate the slowness
    sleep(1)
    if line.find("Adventures") > -1:
        accum.add(1)
        return 1
    return 0

def start_job():
    global result
    try:
        sc.setJobGroup("job_to_cancel", "some description")
        lines = sc.textFile("hdfs://hadoop1.knowbigdata.com/user/student/sgiri/wordcount/input/big.txt")
        result = lines.map(map_func)
        result.take(1)
    except Exception as e:
        result = "Cancelled"
    lock.release()

def stop_job():
    while accum.value < 3:
        sleep(1)
    sc.cancelJobGroup("job_to_cancel")

supress = lock.acquire()
supress = thread.start_new_thread(start_job, tuple())
supress = thread.start_new_thread(stop_job, tuple())
supress = lock.acquire()
Which file systems does Spark support?
Hadoop Distributed File System (HDFS)
Local File system
S3
What is YARN?
YARN is a large-scale, distributed operating system for big data applications. It is one of the key features of the
second-generation Hadoop framework, providing a central resource management platform to deliver scalable
operations across the cluster, and it is one of the cluster managers on which Spark can run.
Define pagerank.
PageRank is a measure of the importance of each vertex in a graph.
Apache Spark Examples: Character Count Input
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
The above input file contains many characters; our task is to list the characters grouped by how many times each
appears: which characters appear once, which appear twice, and so on:
(1,(d, ., v, H, Q, C, W, m))
(2,(T, p, B, ,, O, i, y, g))
(3,(f, l, r))
(4,(, s, c))
(5,(h))
(6,(n, u))
(7,(o))
(8,(a))
(10,(t))
(13,(e))
(20,( ))
Here are the steps for the Apache Spark character count example.
Starting Spark on the terminal:
start-master.sh
start-slave.sh spark://yourhostname:7077
spark-shell
Step 1: Load the text file from local storage or HDFS
val textFile = sc.textFile(inputPath)
Step 2: Mapper
val counts = textFile.flatMap(line => line.split("").map(char => (char, 1)))
Step 3: Reducer
val reduced = counts.reduceByKey(_ + _)
Step 4: Swap each (key, value) to (value, key)
val reverseMap = for ((k, v) <- reduced) yield (v, k)
Step 5: Group values by key and save as a text file
reverseMap.groupByKey().sortByKey().coalesce(1, true).saveAsTextFile(outputPath)
Apache Spark Examples: Character Count Program
val textFile = sc.textFile(inputPath)
val counts = textFile.flatMap(line => line.split("").map(char => (char, 1)))
val reduced = counts.reduceByKey(_ + _)
val reverseMap = for ((k, v) <- reduced) yield (v, k)
reverseMap.groupByKey().sortByKey().coalesce(1, true).saveAsTextFile(outputPath)
Apache Spark Examples on Character Count Output
(1,(d, ., v, H, Q, C, W, m))
(2,(T, p, B, ,, O, i, y, g))
(3,(f, l, r))
(4,(, s, c))
(5,(h))
(6,(n, u))
(7,(o))
(8,(a))
(10,(t))
(13,(e))
(20,( ))
So this is the Apache Spark character count example using the Scala language.
If you have configured Java version 8 for Hadoop and Java version 7 for Apache Spark, how will you set the
environment variables in the basic configuration file?
Hadoop and Spark each read JAVA_HOME from their own environment files, so point JAVA_HOME at the Java 8
installation in hadoop-env.sh and at the Java 7 installation in spark-env.sh.
KAFKA
What is the main difference between Kafka and Flume?
Criteria for comparing Kafka and Flume:
Data flow: Kafka uses a pull model; Flume uses a push model.
Hadoop integration: Kafka's is loose; Flume's is tight.
Functionality: Kafka is a publish-subscribe messaging system; Flume is a system for data collection, aggregation,
and movement.
What is Kafka?
Apache Kafka is publish-subscribe messaging rethought as a distributed commit log; it functions as a message broker.
Fast: A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of
clients.
Scalable: Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the
capability of any single machine and to allow clusters of co-ordinated consumers
Durable: Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can
handle terabytes of messages without performance impact.
Distributed by Design: Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance
guarantees.
Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging
system, but with a unique design.
List the various components in Kafka.
The four major components of Kafka are:
Topic – a stream of messages belonging to the same type
Producer – that can publish messages to a topic
Brokers – a set of servers where the published messages are stored
Consumer – that subscribes to various topics and pulls data from the brokers.
Elaborate on the Kafka architecture.
A Kafka cluster contains multiple brokers, since it is a distributed system. Each topic is divided into multiple
partitions, and each broker stores one or more of those partitions, so that multiple producers and consumers can
publish and retrieve messages at the same time.
So, at a high level, producers send messages over the network to the Kafka cluster, which in turn serves them up to
consumers.
What is the maximum size of a message that the Kafka server can receive?
By default, the maximum size of a message that the Kafka server can receive is 1000000 bytes.
Explain the role of the offset.
Messages contained in the partitions are assigned a unique ID number that is called the offset. The role of the offset
is to uniquely identify every message within the partition.
The consumer setting auto.offset.reset controls what happens when no committed offset exists:
earliest: automatically reset the offset to the earliest offset
latest: automatically reset the offset to the latest offset
none: throw an exception to the consumer if no previous offset is found for the consumer's group
anything else: throw an exception to the consumer
Where is the offset stored: in ZooKeeper or in Kafka?
Older versions of Kafka (pre-0.9) store offsets only in ZooKeeper, while newer versions of Kafka by default store
offsets in an internal Kafka topic called __consumer_offsets (newer versions might still commit to ZooKeeper,
though).
The advantage of committing offsets to the broker is that the consumer does not depend on ZooKeeper; clients only
need to talk to brokers, which simplifies the overall architecture. If you use brokers at version 0.10.1.0 or later, you
can commit offsets to the topic __consumer_offsets.
Each partition is an ordered, immutable sequence of messages that is continually appended to—a commit log. The
messages in the partitions are each assigned a sequential id number called the offset that uniquely identifies each
message within the partition.
The Kafka cluster retains all published messages—whether or not they have been consumed—for a configurable
period of time. For example if the log retention is set to two days, then for the two days after a message is published
it is available for consumption, after which it will be discarded to free up space. Kafka's performance is effectively
constant with respect to data size so retaining lots of data is not a problem.
In fact the only metadata retained on a per-consumer basis is the position of the consumer in the log, called the
offset. This offset is controlled by the consumer: normally a consumer will advance its offset linearly as it reads
messages, but in fact the position is controlled by the consumer and it can consume messages in any order it likes.
For example a consumer can reset to an older offset to reprocess.
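The offset mechanics described above can be sketched in plain Python: the broker side is just an append-only list, and the consumer's only state is its position, which it may rewind to reprocess:

```python
# Plain-Python sketch of a partition as an append-only log. The broker keeps
# the messages; the consumer keeps only its position (the offset) and may
# rewind it to reprocess.
partition = []                       # the commit log for one partition

def append(msg):
    partition.append(msg)
    return len(partition) - 1        # the offset assigned to this message

for m in ["m0", "m1", "m2"]:
    append(m)

offset = 0                           # consumer-side state
first_read = []
while offset < len(partition):
    first_read.append(partition[offset])
    offset += 1                      # normal linear advance

offset = 1                           # rewind: reprocess from offset 1
replay = partition[offset:]
print(first_read, replay)  # ['m0', 'm1', 'm2'] ['m1', 'm2']
```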
Kafka has stronger ordering guarantees than a traditional messaging system, too.
A traditional queue retains messages in-order on the server, and if multiple consumers consume from the queue then
the server hands out messages in the order they are stored. However, although the server hands out messages in
order, the messages are delivered asynchronously to consumers, so they may arrive out of order on different
consumers. This effectively means the ordering of the messages is lost in the presence of parallel consumption.
Messaging systems often work around this by having a notion of exclusive consumer that allows only one process to
consume from a queue, but of course this means that there is no parallelism in processing.
Kafka does it better. By having a notion of parallelism—the partition—within the topics, Kafka is able to provide
both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the
partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one
consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes
the data in order. Since there are many partitions this still balances the load over many consumer instances. Note
however that there cannot be more consumer instances in a consumer group than partitions.
Kafka only provides a total order over messages within a partition, not between different partitions in a topic.
Per-partition ordering combined with the ability to partition data by key is sufficient for most applications. However, if
you require a total order over messages this can be achieved with a topic that has only one partition, though this will
mean only one consumer process per consumer group.
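The assignment rule described above (each partition to exactly one consumer in the group) can be sketched in plain Python; the round-robin strategy here is an illustration, not Kafka's exact assignor:

```python
# Plain-Python sketch of group assignment: each partition goes to exactly one
# consumer in the group (round-robin), so per-partition order is preserved
# while load spreads across the group.
def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

print(assign(["p0", "p1", "p2", "p3"], ["c0", "c1"]))
# {'c0': ['p0', 'p2'], 'c1': ['p1', 'p3']}

# With more consumers than partitions, the extras stay idle:
print(assign(["p0"], ["c0", "c1"]))  # {'c0': ['p0'], 'c1': []}
```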
Guarantees
At a high-level Kafka gives the following guarantees:
Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That
is, if a message M1 is sent by the same producer as a message M2, and M1 is sent first, then M1 will have a
lower offset than M2 and appear earlier in the log.
A consumer instance sees messages in the order they are stored in the log.
For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any messages
committed to the log.
STORM
Why use Storm?
Storm makes it easy to reliably process unbounded streams of data, doing for real time processing what Hadoop did
for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use!
Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC,
ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is
scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.
Storm integrates with the queueing and database technologies you already use. A Storm topology consumes streams
of data and processes those streams in arbitrarily complex ways, repartitioning the streams between each stage of the
computation however needed.
Components of a Storm cluster
A Storm cluster is superficially similar to a Hadoop cluster. Whereas on Hadoop you run MapReduce jobs, on Storm
you run topologies. Jobs and topologies themselves are very different -- one key difference is that a MapReduce job
eventually finishes, whereas a topology processes messages forever (or until you kill it).
There are two kinds of nodes on a Storm cluster: the master node and the worker nodes. The master node runs a
daemon called Nimbus that is similar to Hadoop's jobtracker. Nimbus is responsible for distributing code around the
cluster, assigning tasks to machines, and monitoring for failures.
Each worker node runs a daemon called the Supervisor. The supervisor listens for work assigned to its machine and
starts and stops worker processes as necessary based on what Nimbus has assigned to it. Each worker process
executes a subset of a topology; a running topology consists of many worker processes spread across many
machines.
All coordination between Nimbus and the Supervisors is done through a Zookeeper cluster. Additionally, the
Nimbus daemon and Supervisor daemons are fail-fast and stateless; all state is kept in Zookeeper or on local disk.
This means you can kill -9 Nimbus or the Supervisors and they'll start back up like nothing happened. This design
leads to Storm clusters being incredibly stable.
Topologies
To do realtime computation on Storm, you create what are called topologies. A topology is a graph of computation.
Each node in a topology contains processing logic, and links between nodes indicate how data should be passed
around between nodes.
Running a topology is straightforward. First, you package all your code and dependencies into a single jar. Then,
you run a command like the following:
storm jar all-my-code.jar backtype.storm.MyTopology arg1 arg2
This runs the class backtype.storm.MyTopology with the arguments arg1 and arg2. The main function of the class
defines the topology and submits it to Nimbus. The storm jar part takes care of connecting to Nimbus and uploading
the jar.
Since topology definitions are just Thrift structs, and Nimbus is a Thrift service, you can create and submit
topologies using any programming language. The above example is the easiest way to do it from a JVM-based
language. See Running topologies on a production cluster for more information on starting and stopping topologies.
Streams
The core abstraction in Storm is the stream. A stream is an unbounded sequence of tuples. Storm provides the
primitives for transforming a stream into a new stream in a distributed and reliable way. For example, you may
transform a stream of tweets into a stream of trending topics. Streams are defined with a schema that names the fields
in the stream's tuples. By default, tuples can contain integers, longs, shorts, bytes, strings, doubles, floats, booleans,
and byte arrays. You can also define your own serializers so that custom types can be used natively within tuples.
Every stream is given an id when declared. Since single-stream spouts and bolts are so
common, OutputFieldsDeclarer has convenience methods for declaring a single stream without specifying an id. In
this case, the stream is given the default id of "default".
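As a concrete illustration of a schema naming a tuple's fields, here is a small plain-Java sketch. This is not Storm's actual Tuple class; the NamedTuple name and its methods are invented for illustration:

```java
import java.util.Arrays;
import java.util.List;

// Illustrative stand-in for a Storm tuple: values addressed by field name.
public class NamedTuple {
    private final List<String> fields; // the declared schema
    private final List<Object> values; // one value per field

    public NamedTuple(List<String> fields, List<Object> values) {
        if (fields.size() != values.size())
            throw new IllegalArgumentException("schema/value size mismatch");
        this.fields = fields;
        this.values = values;
    }

    public Object getValueByField(String field) {
        return values.get(fields.indexOf(field));
    }

    public static void main(String[] args) {
        NamedTuple t = new NamedTuple(
            Arrays.asList("word", "count"),
            Arrays.asList("storm", 3));
        System.out.println(t.getValueByField("count")); // prints 3
    }
}
```

The point is only that a stream's schema gives every position in the tuple a name, so downstream bolts can address values by field rather than by index.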
Resources:
Tuple: streams are composed of tuples
OutputFieldsDeclarer: used to declare streams and their schemas
Serialization: information about Storm's dynamic typing of tuples and declaring custom serializations
ISerialization: custom serializers must implement this interface
Config.TOPOLOGY_SERIALIZATIONS: custom serializers can be registered using this configuration
The basic primitives Storm provides for doing stream transformations are spouts and bolts. Spouts and bolts have
interfaces that you implement to run your application-specific logic.
A spout is a source of streams. For example, a spout may read tuples off of a Kestrel queue and emit them as a
stream. Or a spout may connect to the Twitter API and emit a stream of tweets.
Spouts
A spout is a source of streams in a topology. Generally spouts will read tuples from an external source and emit
them into the topology (e.g. a Kestrel queue or the Twitter API). Spouts can either be reliable or unreliable. A
reliable spout is capable of replaying a tuple if it failed to be processed by Storm, whereas an unreliable spout
forgets about the tuple as soon as it is emitted.
Spouts can emit more than one stream. To do so, declare multiple streams using the declareStream method
of OutputFieldsDeclarer and specify the stream to emit to when using the emit method on SpoutOutputCollector.
The main method on spouts is nextTuple. nextTuple either emits a new tuple into the topology or simply returns if
there are no new tuples to emit. It is imperative that nextTuple does not block for any spout implementation, because
Storm calls all the spout methods on the same thread.
The other main methods on spouts are ack and fail. These are called when Storm detects that a tuple emitted from
the spout either successfully completed through the topology or failed to be completed. ack and fail are only called
for reliable spouts. See the Javadoc for more information.
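The contract above (a non-blocking nextTuple, plus ack/fail so a reliable spout can replay) can be modeled in a few lines of plain Java. This is only a toy sketch, not Storm's IRichSpout API; all names here are illustrative:

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Toy model of a reliable spout: emitted tuples are remembered until acked,
// and a failed tuple is re-queued for replay.
public class ReliableSpoutModel {
    private final Queue<String> source = new ArrayDeque<>();
    private final Map<Long, String> pending = new HashMap<>(); // msgId -> tuple
    private long nextId = 0;

    public void offer(String tuple) { source.add(tuple); }

    // Like nextTuple: never blocks; returns null when there is nothing to emit.
    public Long nextTuple() {
        String t = source.poll();
        if (t == null) return null;   // simply return, do not block
        long id = nextId++;
        pending.put(id, t);           // remember the tuple until it is acked
        return id;
    }

    public void ack(long msgId)  { pending.remove(msgId); }             // fully processed: forget it
    public void fail(long msgId) { source.add(pending.remove(msgId)); } // re-queue for replay

    public int pendingCount() { return pending.size(); }
}
```

An unreliable spout is the same model minus the pending map: once a tuple is emitted it is simply forgotten.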
Resources:
IRichSpout: this is the interface that spouts must implement.
Guaranteeing message processing
A bolt consumes any number of input streams, does some processing, and possibly emits new streams. Complex
stream transformations, like computing a stream of trending topics from a stream of tweets, require multiple steps
and thus multiple bolts. Bolts can do anything: run functions, filter tuples, do streaming aggregations, do
streaming joins, talk to databases, and more.
Bolts
All processing in topologies is done in bolts. Bolts can do anything: filtering, functions, aggregations, joins,
talking to databases, and more.
Bolts can do simple stream transformations. Doing complex stream transformations often requires multiple steps and
thus multiple bolts. For example, transforming a stream of tweets into a stream of trending images requires at least
two steps: a bolt to do a rolling count of retweets for each image, and one or more bolts to stream out the top X
images (you can do this particular stream transformation in a more scalable way with three bolts than with two).
Bolts can emit more than one stream. To do so, declare multiple streams using the declareStream method
of OutputFieldsDeclarer and specify the stream to emit to when using the emit method on OutputCollector.
When you declare a bolt's input streams, you always subscribe to specific streams of another component. If you
want to subscribe to all the streams of another component, you have to subscribe to each one
individually. InputDeclarer has syntactic sugar for subscribing to streams declared on the default stream id.
Saying declarer.shuffleGrouping("1") subscribes to the default stream on component "1" and is equivalent
to declarer.shuffleGrouping("1", DEFAULT_STREAM_ID).
The main method in bolts is the execute method, which takes in as input a new tuple. Bolts emit new tuples using
the OutputCollector object. Bolts must call the ack method on the OutputCollector for every tuple they process so that
Storm knows when tuples are completed (and can eventually determine that it's safe to ack the original spout tuples).
For the common case of processing an input tuple, emitting 0 or more tuples based on that tuple, and then acking the
input tuple, Storm provides an IBasicBolt interface which does the acking automatically.
Please note that OutputCollector is not thread-safe, and all emits, acks, and fails must happen on the same thread.
Please refer to Troubleshooting for more details.
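The execute/emit/ack cycle described above can be sketched in plain Java. This is a toy model of a sentence-splitting bolt, not Storm's IRichBolt; the collector is simulated with plain lists:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a bolt's execute/emit/ack cycle.
public class SplitBoltModel {
    private final List<String> emitted = new ArrayList<>(); // stands in for OutputCollector.emit
    private final List<String> acked = new ArrayList<>();   // stands in for OutputCollector.ack

    // execute: take one input tuple, emit zero or more tuples, then ack the input.
    public void execute(String sentence) {
        for (String word : sentence.split("\\s+")) {
            emitted.add(word);   // emit one new tuple per word
        }
        acked.add(sentence);     // ack so Storm can complete the tuple tree
    }

    public List<String> emitted() { return emitted; }
    public List<String> acked()   { return acked; }
}
```

This is exactly the pattern IBasicBolt automates: process the input, emit derived tuples, then ack the input tuple.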
Resources:
IRichBolt: this is the general interface for bolts.
IBasicBolt: this is a convenience interface for defining bolts that do filtering or simple functions.
OutputCollector: bolts emit tuples to their output streams using an instance of this class
Guaranteeing message processing
Networks of spouts and bolts are packaged into a topology which is the top-level abstraction that you submit to
Storm clusters for execution. A topology is a graph of stream transformations where each node is a spout or bolt.
Edges in the graph indicate which bolts are subscribing to which streams. When a spout or bolt emits a tuple to a
stream, it sends the tuple to every bolt that subscribed to that stream.
Links between nodes in your topology indicate how tuples should be passed around. For example, if there is a link
between Spout A and Bolt B, a link from Spout A to Bolt C, and a link from Bolt B to Bolt C, then every time Spout
A emits a tuple, it will send the tuple to both Bolt B and Bolt C. All of Bolt B's output tuples will go to Bolt C as
well.
Each node in a Storm topology executes in parallel. In your topology, you can specify how much parallelism you
want for each node, and then Storm will spawn that number of threads across the cluster to do the execution.
A topology runs forever, or until you kill it. Storm will automatically reassign any failed tasks. Additionally, Storm
guarantees that there will be no data loss, even if machines go down and messages are dropped.
Stream groupings
A stream grouping tells a topology how to send tuples between two components. Remember, spouts and bolts
execute in parallel as many tasks across the cluster. If you look at how a topology is executing at the task level, it
looks like a set of tasks for each spout and bolt, with streams of tuples flowing between those sets of tasks.
Part of defining a topology is specifying for each bolt which streams it should receive as input. A stream grouping
defines how that stream should be partitioned among the bolt's tasks.
There are eight built-in stream groupings in Storm, and you can implement a custom stream grouping by
implementing the CustomStreamGrouping interface:
1. Shuffle grouping: Tuples are randomly distributed across the bolt's tasks in a way such that each task is
guaranteed to get an equal number of tuples.
2. Fields grouping: The stream is partitioned by the fields specified in the grouping. For example, if the
stream is grouped by the user-id field, tuples with the same user-id will always go to the same task, but
tuples with different user-id's may go to different tasks.
3. Partial Key grouping: The stream is partitioned by the fields specified in the grouping, like the Fields
grouping, but are load balanced between two downstream bolts, which provides better utilization of
resources when the incoming data is skewed. This paper provides a good explanation of how it works and
the advantages it provides.
4. All grouping: The stream is replicated across all the bolt's tasks. Use this grouping with care.
5. Global grouping: The entire stream goes to a single one of the bolt's tasks. Specifically, it goes to the task
with the lowest id.
6. None grouping: This grouping specifies that you don't care how the stream is grouped. Currently, none
groupings are equivalent to shuffle groupings. Eventually though, Storm will push down bolts with none
groupings to execute in the same thread as the bolt or spout they subscribe from (when possible).
7. Direct grouping: This is a special kind of grouping. A stream grouped this way means that
the producer of the tuple decides which task of the consumer will receive this tuple. Direct groupings can
only be declared on streams that have been declared as direct streams. Tuples emitted to a direct stream
must be emitted using one of the emitDirect methods. A bolt can get the task ids of its consumers by either
using the provided TopologyContext or by keeping track of the output of the emit method
in OutputCollector (which returns the task ids that the tuple was sent to).
8. Local or shuffle grouping: If the target bolt has one or more tasks in the same worker process, tuples will
be shuffled to just those in-process tasks. Otherwise, this acts like a normal shuffle grouping.
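The first two groupings can be illustrated with the kind of task-selection logic involved. This is a sketch of the idea, not Storm's internal implementation:

```java
import java.util.Random;

// Illustrative task-selection logic behind two common stream groupings.
public class GroupingDemo {
    // Fields grouping: hash the grouping field, so the same field value
    // always maps to the same downstream task.
    public static int fieldsGrouping(Object fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    // Shuffle grouping: pick a task at random so load spreads evenly.
    private static final Random RAND = new Random();
    public static int shuffleGrouping(int numTasks) {
        return RAND.nextInt(numTasks);
    }
}
```

This is why a fields grouping on user-id guarantees that all tuples for one user land on one task, while a shuffle grouping makes no such promise but balances load.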
Resources:
TopologyBuilder: use this class to define topologies
InputDeclarer: this object is returned whenever setBolt is called on TopologyBuilder and is used for declaring a bolt's
input streams and how those streams should be grouped
CoordinatedBolt: this bolt is useful for distributed RPC topologies and makes heavy use of direct streams and direct
groupings
When a task for Bolt A emits a tuple to Bolt B, which task should it send the tuple to?
A stream grouping answers this question by telling Storm how to send tuples between sets of tasks. Before we dig
into the different kinds of stream groupings, let's take a look at another topology from storm-starter.
Guaranteeing message processing
Earlier on in this tutorial, we skipped over a few aspects of how tuples are emitted. Those aspects were part of
Storm's reliability API: how Storm guarantees that every message coming off a spout will be fully processed.
See Guaranteeing message processing for information on how this works and what you have to do as a user to take
advantage of Storm's reliability capabilities.
Transactional topologies
Storm guarantees that every message will be played through the topology at least once. A common question asked is
how you do things like counting on top of Storm: won't you overcount? Storm has a feature called transactional
topologies that lets you achieve exactly-once messaging semantics for most computations. Read more about
transactional topologies here.
Distributed RPC
This tutorial showed how to do basic stream processing on top of Storm. There are many more things you can do with
Storm's primitives. One of the most interesting applications of Storm is Distributed RPC, where you parallelize the
computation of intense functions on the fly. Read more about Distributed RPC here.
Topologies
The logic for a realtime application is packaged into a Storm topology. A Storm topology is analogous to a
MapReduce job. One key difference is that a MapReduce job eventually finishes, whereas a topology runs forever (or
until you kill it, of course). A topology is a graph of spouts and bolts that are connected with stream groupings.
These concepts are described below.
Resources:
TopologyBuilder: use this class to construct topologies in Java
Running topologies on a production cluster
Local mode: Read this to learn how to develop and test topologies in local mode.
Reliability
Storm guarantees that every spout tuple will be fully processed by the topology. It does this by tracking the tree of
tuples triggered by every spout tuple and determining when that tree of tuples has been successfully completed.
Every topology has a message timeout associated with it. If Storm fails to detect that a spout tuple has been
completed within that timeout, then it fails the tuple and replays it later.
To take advantage of Storm's reliability capabilities, you must tell Storm when new edges in a tuple tree are being
created and tell Storm whenever you've finished processing an individual tuple. These are done using
the OutputCollector object that bolts use to emit tuples. Anchoring is done in the emit method, and you declare that
you're finished with a tuple using the ack method.
This is all explained in much more detail in Guaranteeing message processing.
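Storm's acker tracks each tuple tree with an XOR checksum: every tuple id is XORed in once when the edge is anchored and once when it is acked, so the checksum returns to zero exactly when the whole tree has been processed. A toy model of that idea (names invented; not Storm's actual acker code):

```java
import java.util.HashMap;
import java.util.Map;

// Toy version of Storm's XOR-based acker: a tuple tree is complete when
// every emitted tuple id has been XORed in twice (once on anchor, once on ack).
public class AckerModel {
    private final Map<Long, Long> trees = new HashMap<>(); // spoutTupleId -> running XOR

    public void anchor(long spoutTupleId, long tupleId) {
        trees.merge(spoutTupleId, tupleId, (a, b) -> a ^ b);
    }

    public void ack(long spoutTupleId, long tupleId) {
        trees.merge(spoutTupleId, tupleId, (a, b) -> a ^ b);
    }

    // The tree is fully processed when the XOR checksum is back to zero.
    public boolean complete(long spoutTupleId) {
        return trees.getOrDefault(spoutTupleId, 0L) == 0L;
    }
}
```

The appeal of this scheme is that tracking an arbitrarily large tuple tree takes constant memory per spout tuple: just one running 64-bit checksum.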
Tasks
Each spout or bolt executes as many tasks across the cluster. Each task corresponds to one thread of execution, and
stream groupings define how to send tuples from one set of tasks to another set of tasks. You set the parallelism for
each spout or bolt in the setSpout and setBolt methods of TopologyBuilder.
Workers
Topologies execute across one or more worker processes. Each worker process is a physical JVM and executes a
subset of all the tasks for the topology. For example, if the combined parallelism of the topology is 300 and 50
workers are allocated, then each worker will execute 6 tasks (as threads within the worker). Storm tries to spread the
tasks evenly across all the workers.
Here are some typical prevent and optimize use cases for Storm:

Industry           | Prevent Use Cases                                           | Optimize Use Cases
Financial Services | Securities fraud; operational risks & compliance violations | Order routing; pricing
Telecom            | Security breaches; network outages                          | Bandwidth allocation; customer service
Retail             | Shrinkage; stock-outs                                       | Offers; pricing
Manufacturing      | Preventative maintenance; quality assurance                 | Supply chain optimization; reduced plant downtime
Transportation     | Driver monitoring; predictive maintenance                   | Routes; pricing
Web                | Application failures; operational issues                    | Personalized content
CORE JAVA
1. What are the principle concepts of OOPS?
These are the four principle concepts of object oriented design and programming:
Abstraction
Polymorphism
Inheritance
Encapsulation
2. How does abstraction differ from encapsulation?
Abstraction focuses on the interface of an object, whereas encapsulation prevents clients from seeing its inside
view, i.e. where the behavior of the abstraction is implemented.
Abstraction solves the problem at the design level, while encapsulation solves it at the implementation level.
Encapsulation is the delivery mechanism of abstraction: it groups state and behavior together and hides the
implementation details to suit the developer's needs.
3. What is an immutable object? How do you create one in Java?
Immutable objects are those whose state cannot be changed once they are created; any modification results in a
new object, e.g. String, Integer, and the other wrapper classes. To create one in Java, declare the class final, make
all fields private and final, provide no setters, and return defensive copies of any mutable fields.
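A minimal example of the usual recipe (final class, private final fields, no setters, "modifications" return new objects):

```java
// A minimal immutable class in the style of String and the wrapper classes.
public final class Point {
    private final int x;
    private final int y;

    public Point(int x, int y) { this.x = x; this.y = y; }

    public int getX() { return x; }
    public int getY() { return y; }

    // "Modification" returns a new object, like String concatenation does.
    public Point translate(int dx, int dy) {
        return new Point(x + dx, y + dy);
    }
}
```

Because no method can change x or y after construction, a Point can be shared freely between threads without synchronization.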
4. What are the differences between processes and threads?
A process is an execution of a program whereas a Thread is a single execution sequence within a process. A
process can contain multiple threads.
Thread is at times called a lightweight process.
5. What is the purpose of garbage collection in Java? When is it used?
The purpose of garbage collection is to identify and discard the objects that are no longer needed by the application
to facilitate the resources to be reclaimed and reused.
6. What is Polymorphism?
Polymorphism is briefly described as one interface, many implementations. Polymorphism is a characteristic of
being able to assign a different meaning or usage to something in different contexts – specifically, to allow an entity
such as a variable, a function, or an object to have more than one form. There are two types of polymorphism:
Compile time polymorphism
Run time polymorphism.
Compile time polymorphism is method overloading. Runtime time polymorphism is done using inheritance and
interface.
7. In Java, what is the difference between method overloading and method overriding?
Method overloading in Java occurs when two or more methods in the same class have the exact same name, but
different parameters. On the other hand, method overriding is defined as the case when a child class redefines the
same method as a parent class. Overridden methods must have the same name, argument list, and return type. The
overriding method may not limit the access of the method it overrides.
8. How do you differentiate abstract class from interface?
Abstract keyword is used to create abstract class. Interface is the keyword for interfaces.
Abstract classes can have method implementations whereas interfaces can’t.
A class can extend only one abstract class but it can implement multiple interfaces.
You can run an abstract class if it has a main() method, but you cannot run an interface.
9. Can you override a private or static method in Java?
You cannot override a private or static method in Java. If you create a similar method with same return type and
same method arguments in child class then it will hide the super class method; this is known as method hiding.
Similarly, you cannot override a private method in sub class because it’s not accessible there. What you can do is
create another private method with the same name in the child class.
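Method hiding can be seen in a short example: the call to a static method is resolved from the reference type at compile time, not from the object at runtime:

```java
// Static methods are hidden, not overridden: which method runs is decided
// at compile time from the class the call is made through.
public class HidingDemo {
    static class Parent {
        static String who() { return "parent"; }
    }
    static class Child extends Parent {
        static String who() { return "child"; } // hides Parent.who(), does not override it
    }

    public static void main(String[] args) {
        System.out.println(Parent.who()); // prints parent
        System.out.println(Child.who());  // prints child
    }
}
```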
10. What is Inheritance in Java?
Inheritance in Java is a mechanism in which one object acquires all the properties and behaviors of the parent object.
The idea behind inheritance in Java is that you can create new classes building upon existing classes. When you
inherit from an existing class, you can reuse methods and fields of parent class, and you can also add new methods
and fields.
Inheritance represents the IS-A relationship, also known as parent-child relationship.
Inheritance is used for:
Method Overriding (so runtime polymorphism can be achieved)
Code Reusability
11. What is super in Java?
The super keyword in Java is a reference variable that is used to refer to the immediate parent class object. Whenever
you create the instance of a subclass, an instance of the parent class is created implicitly and is referred to by the
super reference variable.
Java super Keyword is used to refer:
Immediate parent class instance variable
Immediate parent class constructor
Immediate parent class method
12. What is constructor?
A constructor in Java is a special type of method that is used to initialize the object. It is invoked at the time of object
creation. It constructs the values, i.e. provides data for the object, and that is why it is known as a constructor. Rules
for creating a Java constructor:
Constructor name must be same as its class name
Constructor must have no explicit return type
Types of Java constructors:
Default constructor (no-arg constructor)
Parameterized constructor
13. What is the purpose of default constructor?
A constructor that has no parameter is known as default constructor.
Syntax of default constructor:
<class_name>(){}
14. What kind of variables can a class consist?
A class consists of Local Variable, Instance Variables and Class Variables.
15. What is the default value of the local variables?
The local variables are not initialized to any default value; neither primitives nor object references.
16. What are the differences between path and classpath variables?
PATH is an environment variable used by the operating system to locate the executables. This is the reason we need
to add the directory location in the PATH variable when we install Java or want any executable to be found by OS.
Classpath is specific to Java and used by Java executables to locate class files. We can provide the classpath location
while running a Java application and it can be a directory, ZIP file or JAR file.
17. What does the ‘static’ keyword mean? Is it possible to override private or static method in Java?
The static keyword denotes that a member variable or method can be accessed, without requiring an instantiation of
the class to which it belongs. You cannot override static methods in Java, because method overriding is based upon
dynamic binding at runtime, whereas static methods are statically bound at compile time. A static method is not
associated with any instance of a class, so the concept of overriding is not applicable.
18. What are the differences between Heap and Stack Memory?
Major difference between Heap and Stack memory are:
Heap memory is used by all the parts of the application whereas stack memory is used only by one thread of
execution.
When an object is created, it is always stored in the Heap space and stack memory contains the reference to it.
Stack memory only contains local primitive variables and reference variables to objects in heap space.
Memory management in stack is done in LIFO manner; it is more complex in Heap memory as it is used
globally.
19. Explain different ways of creating a Thread. Which one would you prefer and why?
There are three ways of creating a Thread:
1) A class may extend the Thread class
2) A class may implement the Runnable interface
3) An application can use the Executor framework, in order to create a thread pool.
The Runnable interface is preferred, as it does not require an object to inherit the Thread class.
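A minimal sketch of the preferred Runnable approach, here passed as a lambda (the class names are illustrative):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Creating a thread with Runnable: the preferred approach, since the class
// stays free to extend something other than Thread.
public class RunnableDemo {
    public static boolean runOnce() throws InterruptedException {
        AtomicBoolean ran = new AtomicBoolean(false);
        Thread t = new Thread(() -> ran.set(true)); // the Runnable, as a lambda
        t.start();
        t.join(); // wait for the worker thread to finish
        return ran.get();
    }
}
```

With the Executor framework, the same Runnable would instead be submitted to a pool, e.g. Executors.newFixedThreadPool(4).submit(...), so threads are reused rather than created per task.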
20. What is synchronization?
Synchronization refers to multi-threading. A synchronized block of code can be executed by only one thread at a
time. As Java supports execution of multiple threads, two or more threads may access the same fields or objects.
Synchronization is a process which keeps all concurrent threads in execution in sync. Synchronization avoids
memory consistency errors caused by an inconsistent view of shared memory. When a method is declared as
synchronized, the thread holds the monitor for that method's object. If another thread is executing the synchronized
method, it is blocked until the first thread releases the monitor.
21. How can we achieve thread safety in Java?
The ways of achieving thread safety in Java are:
Synchronization
Atomic concurrent classes
Implementing concurrent Lock interface
Using volatile keyword
Using immutable classes
Thread safe classes.
22. What are the uses of synchronized keyword?
Synchronized keyword can be applied to static/non-static methods or a block of code. Only one thread at a time can
access synchronized methods and if there are multiple threads trying to access the same method then other threads
have to wait for the execution of method by one thread. Synchronized keyword provides a lock on the object and
thus prevents race condition.
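A classic illustration: a counter whose increment method is synchronized, so concurrent increments never interleave and lose updates:

```java
// A synchronized counter: without the synchronized keyword, concurrent
// count++ operations could interleave and lose updates (a race condition).
public class SyncCounter {
    private int count = 0;

    public synchronized void increment() { count++; } // one thread at a time
    public synchronized int get() { return count; }

    public static int countWith(int threads, int perThread) throws InterruptedException {
        SyncCounter c = new SyncCounter();
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) c.increment();
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        return c.get();
    }
}
```

If increment() were not synchronized, the final count would often be less than threads * perThread, because count++ is a read-modify-write sequence, not an atomic operation.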
23. What are the differences between wait() and sleep()?
wait() is a method of the Object class, whereas sleep() is a static method of the Thread class.
sleep() allows the thread to go to sleep state for x milliseconds. When a thread goes into sleep state it doesn't
release the lock.
wait() allows the thread to release the lock and go to the waiting state. The thread only becomes active again
when a notify() or notifyAll() method is called on the same object.
24. How does HashMap work in Java?
A HashMap in Java stores key-value pairs. The HashMap requires a hash function and uses the hashCode and equals
methods in order to put and retrieve elements to and from the collection. When the put method is invoked, the
HashMap calculates the hash value of the key and stores the pair at the appropriate index inside the collection. If the
key already exists, its value is updated with the new value. Some important characteristics of a HashMap are its
capacity, its load factor and its resizing threshold.
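A short example of the put/get behavior described above, including the update-on-duplicate-key case:

```java
import java.util.HashMap;
import java.util.Map;

// HashMap basics: hashCode picks the bucket, equals finds the key within it,
// and putting an existing key replaces the old value.
public class HashMapDemo {
    public static void main(String[] args) {
        Map<String, Integer> ages = new HashMap<>();
        ages.put("alice", 30);
        ages.put("bob", 25);
        ages.put("alice", 31); // same key: the value is updated, size stays 2

        System.out.println(ages.size());       // prints 2
        System.out.println(ages.get("alice")); // prints 31
    }
}
```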
25. What are the differences between String, StringBuffer and StringBuilder?
String is immutable and final in Java, so a new String is created whenever we do String manipulation. As String
manipulations are resource consuming, Java provides two utility classes: StringBuffer and StringBuilder.
StringBuffer and StringBuilder are mutable classes. StringBuffer operations are thread-safe and synchronized,
whereas StringBuilder operations are not thread-safe.
StringBuffer is to be used when multiple threads are working on the same String, and StringBuilder in a single-
threaded environment.
StringBuilder performance is faster when compared to StringBuffer because it has no synchronization overhead.
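A small example of using StringBuilder in a single-threaded context; StringBuffer could be swapped in unchanged if multiple threads shared the buffer:

```java
// String concatenation in a loop creates a new String on every iteration;
// StringBuilder mutates one internal buffer instead. StringBuffer offers the
// same API with synchronized (thread-safe, slower) methods.
public class BuilderDemo {
    public static String joinWithBuilder(String[] words) {
        StringBuilder sb = new StringBuilder(); // mutable, not thread-safe
        for (String w : words) {
            sb.append(w).append(' ');
        }
        return sb.toString().trim();
    }
}
```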
Monitoring, Management and Orchestration Components of Hadoop Ecosystem- Oozie and Zookeeper
ZOOKEEPER
What is ZooKeeper?
Apache ZooKeeper is an effort to develop and maintain an open-source server which enables highly reliable
distributed coordination. ZooKeeper is a centralized service for maintaining configuration information, naming,
and providing distributed synchronization and group services: an open-source server that reliably coordinates
distributed processes. Apache ZooKeeper provides operational services for a Hadoop cluster: a distributed
configuration service, a synchronization service and a naming registry for distributed systems. Distributed
applications use ZooKeeper to store and mediate updates to important configuration information.
Name a few companies that use Zookeeper.
Yahoo, Solr, Helprace, Neo4j, Rackspace
What is the role of ZooKeeper in HBase architecture?
In HBase architecture, ZooKeeper is the monitoring server that provides services such as tracking server failures
and network partitions, maintaining configuration information, establishing communication between the clients
and region servers, and using ephemeral nodes to identify the available servers in the cluster.
Explain about ZooKeeper in Kafka
Apache Kafka uses ZooKeeper to be a highly distributed and scalable system. ZooKeeper is used by Kafka to store
various configurations and use them across the cluster in a distributed manner. To achieve this, configurations are
distributed and replicated throughout the leader and follower nodes in the ZooKeeper ensemble. We cannot connect
directly to Kafka by bypassing ZooKeeper, because if ZooKeeper is down Kafka will not be able to serve client
requests.
What ZooKeeper Does
ZooKeeper provides a very simple interface and services. ZooKeeper brings these key benefits:
Fast. ZooKeeper is especially fast with workloads where reads to the data are more common than writes.
The ideal read/write ratio is about 10:1.
Reliable. ZooKeeper is replicated over a set of hosts (called an ensemble) and the servers are aware of each
other. As long as a critical mass of servers is available, the ZooKeeper service will also be available. There
is no single point of failure.
Simple. ZooKeeper maintains a standard hierarchical name space, similar to files and directories.
Ordered. The service maintains a record of all transactions, which can be used for higher-level abstractions,
like synchronization primitives.
How ZooKeeper Works
ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical name space of
data registers, known as znodes. Every znode is identified by a path, with path elements separated by a slash (/).
Aside from the root, every znode has a parent, and a znode cannot be deleted if it has children.
This is much like a normal file system, but ZooKeeper provides superior reliability through redundant services. A
service is replicated over a set of machines, and each machine maintains an in-memory image of the data tree and
transaction logs. Clients connect to a single ZooKeeper server and maintain a TCP connection through which they
send requests and receive responses.
This architecture allows ZooKeeper to provide high throughput and availability with low latency, but the size of the
database that ZooKeeper can manage is limited by memory.
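The znode rules above (slash-separated paths, mandatory parents, no deleting a znode that has children) can be modeled with a small in-memory sketch. This is an illustration only, not ZooKeeper's client API:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of ZooKeeper's znode namespace: every non-root znode has a
// parent, and a znode with children cannot be deleted.
public class ZnodeTreeModel {
    private final Map<String, byte[]> data = new HashMap<>();
    private final Map<String, Set<String>> children = new HashMap<>();

    public ZnodeTreeModel() {
        children.put("/", new HashSet<>()); // the root always exists
    }

    public void create(String path, byte[] value) {
        // parent of "/app/config" is "/app"; parent of "/app" is "/"
        String parent = path.substring(0, Math.max(1, path.lastIndexOf('/')));
        if (!children.containsKey(parent))
            throw new IllegalStateException("no parent znode: " + parent);
        data.put(path, value);
        children.put(path, new HashSet<>());
        children.get(parent).add(path);
    }

    public boolean delete(String path) {
        if (!children.getOrDefault(path, new HashSet<>()).isEmpty())
            return false; // a znode with children cannot be deleted
        data.remove(path);
        children.remove(path);
        for (Set<String> kids : children.values()) kids.remove(path);
        return true;
    }
}
```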
ZooKeeper
ZooKeeper is the king of coordination and provides simple, fast, reliable and ordered operational services for a
Hadoop cluster. ZooKeeper is responsible for the synchronization service, the distributed configuration service and
the naming registry for distributed systems.