Some of The Frequently Asked Interview Questions For Hadoop Developers Are
(15)How many job tracker processes can run on a single Hadoop cluster?
(17)How does job tracker schedule a job for the task tracker?
(19)What is PID?
(20)What is jps?
(23)What is fsck?
(33)Name the most common Input Formats defined in Hadoop? Which one is default?
(39)After the Map phase finishes, the Hadoop framework does Partitioning, Shuffle and sort.
Explain what happens in this phase?
(40)If no custom partitioner is defined in Hadoop then how is data partitioned before it is sent to the
reducer?
(41)What is JobTracker?
(43)What is TaskTracker?
(46)Suppose Hadoop spawned 100 tasks for a job and one of the task failed. What will Hadoop do?
(47)Hadoop achieves parallelism by dividing the tasks across many nodes, it is possible for a few
slow nodes to rate-limit the rest of the program and slow down the program. What mechanism
Hadoop provides to combat this?
- Kill a job?
(51)What is the characteristic of the streaming API that makes it flexible to run MapReduce jobs in
languages like Perl, Ruby, Awk etc.?
(53)Is it possible to provide multiple input to Hadoop? If yes then how can you give multiple
directories as input to the Hadoop job?
(54)Is it possible to have Hadoop job output in multiple directories? If yes, how?
(55)What will a Hadoop job do if you try to run it with an output directory that is already present?
Will it
- Overwrite it
(56)How can you set an arbitrary number of mappers to be created for a job in Hadoop?
(57)How can you set an arbitrary number of Reducers to be created for a job in Hadoop?
(62)Can you give a detailed overview about the Big Data being generated by Facebook?
(73)What is HDFS?
(77)Since the data is replicated thrice in HDFS, does it mean that any calculation done on one node
will also be replicated on the other two?
(81)What is a metadata?
(82)Why do we use HDFS for applications having large data sets and not when there are lot of small
files?
(83)What is a daemon?
(89)If we want to copy 10 blocks from one machine to another, but another machine can copy only
8.5 blocks, can the blocks be broken at the time of replication?
(94)When we send data to a node, do we allow it some settling time before sending more data to
that node?
(96)On what basis does the NameNode decide which DataNode to write to?
(101)What is a rack?
(106)What is the difference between Gen1 and Gen2 Hadoop with regards to the Namenode?
(110)Do we require two servers for the Namenode and the datanodes?
(117)What is a Task Tracker in Hadoop? How many instances of Task Tracker run on a hadoop
cluster
(122)What is a JobTracker in Hadoop? How many instances of JobTracker run on a Hadoop Cluster?
(138)How do you resolve an "unable to read options file" error while importing data from MySQL to HDFS?
(139)What problems have you faced when you are working on Hadoop code?
(140)How would you modify that solution to only count the number of unique words in all the
documents?
(141)What is the difference between a Hadoop and Relational Database and Nosql?
(145)If reducers do not start before all mappers finish then why does the progress on MapReduce
job shows something like Map(50%) Reduce(10%)? Why reducers progress percentage is displayed
when mapper is not finished yet?
(160)How would you tackle calculating the number of unique visitors for each hour by mining a huge
Apache log? You can use post processing on the output of the MapReduce job.
(163)How can you add the arbitrary key-value pairs in your mapper?
(164)What is a DataNode?
(167)Which interface needs to be implemented to create Mapper and Reducer for the Hadoop?
(168)What happens if you don't override the Mapper methods and keep them as they are?
(181)What is the Hadoop MapReduce API contract for a key and value Class?
(186)Can we write MapReduce programs in a language other than Java? If so, how?
(187)What alternate way does HDFS provide to recover data in case a Namenode, without backup,
fails and cannot be recovered?
(193)What is a NameNode?
(201)What do you understand about Object Oriented Programming (OOP)? Use Java examples.
(202)What are the main differences between versions 1.5 and version 1.6 of Java?
(205)Did you ever build a production process in Hadoop? If yes, what was the process when your
Hadoop job failed for any reason?
(206)Did you ever run into a lopsided job that resulted in an out-of-memory error? If yes, how did
you handle it?
(208)What is the benefit of Distributed Cache? Why can't we just have the file in HDFS and have the
application read it?
(211)What is configuration of a typical slave node on Hadoop cluster? How many JVMs run on a
slave node?
(212)What is configuration of a typical slave node on Hadoop cluster? How many JVMs run on a
slave node?
(215)Does MapReduce programming model provide a way for reducers to communicate with each
other? In a MapReduce job can a reducer communicate with another reducer?
(220)If reducers do not start before all mappers finish then why does the progress on MapReduce
job shows something like Map(50%) Reduce(10%)? Why reducers progress percentage is displayed
when mapper is not finished yet?
(221)What is HDFS Block size? How is it different from traditional file system block size?
(223)What is NoSQL?
(227)Why would NoSQL be better than using a SQL Database? And how much better is it?
(247)In Map Reduce why map write output to Local Disk instead of HDFS?
(252)What type of data we should put in Distributed Cache? When to put the data in DC? How much
volume we should put in?
(255)What are the new and old MapReduce APIs used while writing a MapReduce program? Explain how they
work.
(257)What is the utility of using Writable Comparable (Custom Class) in Map Reduce code?
(258)What are Input Format, Input Split & Record Reader and what they do?
(259)Why we use IntWritable instead of Int? Why we use LongWritable instead of Long?
(261)If data is present in HDFS and RF is defined, then how can we change Replication Factor?
(266)What will be the consideration while we do Hardware Planning for Master in Hadoop
architecture?
(269)In which location does the NameNode store its metadata and why?
(271)How are blocks distributed among all the DataNodes for a particular chunk of data?
(283)On what basis does the NameNode distribute blocks across the DataNodes?
(290)How to resolve the following error while running a query in hive: Error in metadata: Cannot
validate serde
(302)What is identity mapper and reducer? In which cases can we use them?
(305)Safe-mode exceptions
(307)What is an AMI?
(311)What do you understand by node redundancy, and does it exist in a Hadoop cluster?
(314)How to resolve "IOException: Cannot create directory" while formatting the NameNode in Hadoop.
Are there any problems which can only be solved by MapReduce and cannot be solved by PIG? In
which kind of scenarios MR jobs will be more useful than PIG?
2. How can we change the split size if our commodity hardware has less storage space?
3. What is the difference between an HDFS Block and Input Split?
4. How can we check whether Namenode is working or not?
5. Why do we need a password-less SSH in Fully Distributed environment?
6. Some details about SSH communication between Masters and the Slaves
7. Why is Replication pursued in HDFS in spite of its data redundancy?
Difference between map-side join and reduce side join?
Difference between static and dynamic partitioning?
What is safe-mode?
For a Hadoop developer, the questions most frequently asked during interviews are:
1. What is shuffling in map reduce?
2. Difference between HDFS block and split?
3. What are the mapfiles in hadoop?
4. What is the use of .pagination class?
5. What are the core components of hadoop?
3.Define RDD.
RDD is the acronym for Resilient Distributed Datasets, a fault-tolerant collection of operational
elements that run in parallel. The partitioned data in an RDD is immutable and distributed. There are
primarily two types of RDD:
Parallelized Collections: existing collections parallelized so that their partitions run in parallel with one another
Hadoop datasets: perform a function on each file record in HDFS or another storage system
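A minimal PySpark sketch of both RDD types, assuming an existing SparkContext sc; the HDFS path is just the illustrative file reused elsewhere in this document:
# A parallelized collection: an in-memory list distributed across 2 partitions.
data_rdd = sc.parallelize([1, 2, 3, 4, 5], 2)
# A Hadoop dataset: an RDD whose elements are the records of a file in HDFS.
file_rdd = sc.textFile("hdfs://namenode:9000/user/kalyan/mynumbersfile.txt")
print(data_rdd.count())
print(file_rdd.first())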
Serving as the base engine, SparkCore performs various important functions like memory
management, monitoring jobs, fault-tolerance, job scheduling and interaction with storage systems.
10.What is RDD Lineage?
Spark does not support data replication in memory, so if any data is lost it is rebuilt using
RDD lineage. RDD lineage is the process that reconstructs lost data partitions. The best part is that an RDD
always remembers how to build itself from other datasets.
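A small sketch of how lineage can be inspected from PySpark, assuming an existing SparkContext sc; toDebugString() prints the chain of parent RDDs Spark would replay to rebuild a lost partition:
lines = sc.textFile("hdfs://namenode:9000/user/kalyan/bigtextfile.txt")
words = lines.flatMap(lambda line: line.split(" "))
pairs = words.map(lambda w: (w, 1))
# Shows the lineage: textFile -> flatMap -> map
print(pairs.toDebugString())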
11.What is Spark Driver?
Spark Driver is the program that runs on the master node of the machine and declares
transformations and actions on data RDDs. In simple terms, the driver in Spark creates the SparkContext,
connected to a given Spark Master.
The driver also delivers the RDD graphs to the Master, where the standalone cluster manager runs.
12.What is Hive on Spark?
Hive contains significant support for Apache Spark, wherein Hive execution is configured to Spark:
hive> set spark.home=/location/to/sparkHome;
hive> set hive.execution.engine=spark;
Hive on Spark supports Spark on yarn mode by default.
13.Name commonly-used Spark Ecosystems.
Spark SQL (Shark)- for developers
Spark Streaming for processing live data streams
GraphX for generating and computing graphs
MLlib (Machine Learning Algorithms)
SparkR to promote R Programming in Spark engine.
Spark SQL, earlier known as Shark, is a module introduced in Spark to work with structured
data and perform structured data processing. Through this module, Spark executes relational SQL
queries on the data. The core of the component supports a different kind of RDD called
SchemaRDD, composed of row objects and schema objects defining the data type of each column in
the row. It is similar to a table in a relational database.
18.What is a Parquet file?
Parquet is a columnar format file supported by many other data processing systems. Spark SQL
performs both read and write operations with Parquet files and considers it to be one of the best big
data analytics formats so far.
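A short sketch of reading and writing Parquet from Spark SQL, assuming a Spark 1.4+ SQLContext and an existing SparkContext sc; the path and sample rows are illustrative:
from pyspark.sql import SQLContext, Row

sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame([Row(name="a", value=1), Row(name="b", value=2)])
# Write the DataFrame out as a columnar Parquet file and read it back.
df.write.parquet("hdfs://namenode:9000/user/kalyan/values.parquet")
back = sqlContext.read.parquet("hdfs://namenode:9000/user/kalyan/values.parquet")
back.show()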
19.What file systems Spark support?
Hadoop Distributed File System (HDFS)
Local File system
S3
20.What is Yarn?
Similar to Hadoop, YARN is one of the key features in Spark, providing a central resource
management platform to deliver scalable operations across the cluster. Running Spark on YARN
necessitates a binary distribution of Spark that is built with YARN support.
21.List the functions of Spark SQL.
Spark SQL is capable of:
Loading data from a variety of structured sources
Querying data using SQL statements, both inside a Spark program and from external tools that
connect to Spark SQL through standard database connectors (JDBC/ODBC). For instance, using
business intelligence tools like Tableau
Providing rich integration between SQL and regular Python/Java/Scala code, including the ability
to join RDDs and SQL tables, expose custom functions in SQL, and more
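A minimal sketch of these capabilities, assuming a Spark 1.x SQLContext and an existing SparkContext sc; the table and column names are illustrative:
from pyspark.sql import SQLContext, Row

sqlContext = SQLContext(sc)
people = sqlContext.createDataFrame([Row(name="alice", age=30), Row(name="bob", age=17)])
people.registerTempTable("people")            # expose the structured data to SQL
adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
# The result rows come back to regular Python code for further processing.
print(adults.collect())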
22.What are benefits of Spark over MapReduce?
Due to the availability of in-memory processing, Spark implements processing around 10-100x
faster than Hadoop MapReduce, which uses persistent storage for all of its data
processing tasks.
Unlike Hadoop, Spark provides in-built libraries to perform multiple tasks from the same core, such as
batch processing, streaming, machine learning and interactive SQL queries. However, Hadoop only
supports batch processing.
Hadoop is highly disk-dependent whereas Spark promotes caching and in-memory data storage.
Spark is capable of performing computations multiple times on the same dataset. This is called
iterative computation, while there is no iterative computing implemented by Hadoop.
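A small sketch of the iterative/caching point, assuming an existing SparkContext sc and the illustrative numbers file used later in this document; cache() keeps the parsed RDD in memory so the second pass does not re-read the file:
nums = sc.textFile("hdfs://namenode:9000/user/kalyan/mynumbersfile.txt").map(lambda s: int(s))
nums.cache()
total = nums.reduce(lambda x, y: x + y)   # first pass over the data
count = nums.count()                      # second pass reuses the cached partitions
print(total / float(count))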
Interview Questions & Answers on Apache Spark [Part 2]
Q1: Say I have a huge list of numbers in RDD(say myrdd). And I wrote the following code to
compute average:
def myAvg(x, y):
    return (x + y) / 2.0

avg = myrdd.reduce(myAvg)
What is wrong with it? And How would you correct it?
Ans: A reduce function must be commutative and associative, and this average function is not associative, so it gives a wrong result.
I would simply sum the numbers and then divide by the count.

def sum(x, y):
    return x + y

total = myrdd.reduce(sum)
avg = total / myrdd.count()

The only problem with the above code is that the total might become very big and thus overflow. So, I
would rather divide each number by the count and then sum, in the following way:

cnt = myrdd.count()

def divideByCnt(x):
    return x / cnt

myrdd1 = myrdd.map(divideByCnt)
avg = myrdd1.reduce(sum)
Q2: Say I have a huge list of numbers in a file in HDFS. Each line has one number. And I want to
compute the square root of the sum of squares of these numbers. How would you do it?
Ans:
# We would first load the file as an RDD from HDFS on Spark
numsAsText = sc.textFile("hdfs://namenode:9000/user/kalyan/mynumbersfile.txt")
# Define the function to compute the squares
def toSqInt(str):
    v = int(str)
    return v * v
# Run the function on the Spark RDD as a transformation
nums = numsAsText.map(toSqInt)
# Sum up the squares, reusing the two-argument sum function defined in Q1
total = nums.reduce(sum)
# Finally compute the square root, for which we need to import math
import math
print math.sqrt(total)
Q3: Is the following alternative approach to the same problem also correct?
numsAsText = sc.textFile("hdfs://namenode:9000/user/kalyan/mynumbersfile.txt")
def toInt(str):
    return int(str)
nums = numsAsText.map(toInt)
import math
def sqrtOfSumOfSq(x, y):
    return math.sqrt(x*x + y*y)
# Here the squaring and the square root both happen inside the reduce function.
total = nums.reduce(sqrtOfSumOfSq)
print total
Ans: Yes. The approach is correct and sqrtOfSumOfSq is a valid reducer.
Q4: Could you compare the pros and cons of your approach (in Question 2 above) and my
approach (in Question 3 above)?
Ans:
You are doing the square and the square root as part of the reduce action while I am squaring in map() and
summing in reduce() in my approach.
My approach will be faster because in your case the reducer code is heavy as it is calling math.sqrt(),
and reducer code is generally executed approximately n-1 times for n elements of the Spark RDD.
The only downside of my approach is that there is a chance of integer overflow because I am
computing the sum of squares as part of the map.
Q5: If you have to compute the total counts of each of the unique words on spark, how would you
go about it?
Ans:
# This will load bigtextfile.txt as an RDD in Spark
lines = sc.textFile("hdfs://namenode:9000/user/kalyan/bigtextfile.txt")
# Define the function that splits each line into words
def toWords(line):
    return line.split(" ")
# Run the toWords function on each element of the RDD as a flatMap transformation.
# We are going to use flatMap instead of map because our function returns multiple values.
words = lines.flatMap(toWords)
# Convert each word into a (key, value) pair. Here the key will be the word itself and the value will be 1.
def toTuple(word):
    return (word, 1)
wordsTuple = words.map(toTuple)
# reduceByKey adds up the counts per word, reusing the two-argument sum function defined in Q1.
counts = wordsTuple.reduceByKey(sum)
# Now, print the results
counts.collect()
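An earlier question in this document also asks how to count only the number of unique words; a minimal follow-up sketch on the same words RDD:
# distinct() drops duplicate words, count() then returns how many unique words exist.
unique_words = words.distinct().count()
print(unique_words)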
Q6: In a very huge text file, you want to just check if a particular keyword exists. How would you
do this using Spark?
Ans:
lines = sc.textFile("hdfs://namenode:9000/user/kalyan/bigtextfile.txt")
mykeyword = "keyword_to_search"   # illustrative value: the particular keyword we are looking for
def isFound(line):
    if line.find(mykeyword) > -1:
        return 1
    return 0
foundBits = lines.map(isFound)
# Again reusing the two-argument sum function defined in Q1
total = foundBits.reduce(sum)
if total > 0:
    print "FOUND"
else:
    print "NOT FOUND"
Q7: Can you improve the performance of this code in previous answer?
Ans: Yes.
The search does not stop even after the word we are looking for has been found. Our map code
would keep executing on all the nodes, which is very inefficient.
We could utilize accumulators to report whether the word has been found or not and then stop the
job. Something along these lines:

import thread, threading
from time import sleep

result = "Not Set"
lock = threading.Lock()
accum = sc.accumulator(0)

def map_func(line):
    # introduce a delay to emulate the slowness
    sleep(1)
    if line.find("Adventures") > -1:
        accum.add(1)
        return 1
    return 0

def start_job():
    global result
    try:
        sc.setJobGroup("job_to_cancel", "some description")
        lines = sc.textFile("hdfs://namenode:9000/user/kalyan/wordcount/input/big.txt")
        result = lines.map(map_func)
        result.take(1)
    except Exception as e:
        result = "Cancelled"
    lock.release()

def stop_job():
    while accum.value < 3:
        sleep(1)
    sc.cancelJobGroup("job_to_cancel")

supress = lock.acquire()
supress = thread.start_new_thread(start_job, tuple())
supress = thread.start_new_thread(stop_job, tuple())
Interview Questions & Answers on Apache Spark [Part 1]
Q1: When do you use apache spark? OR What are the benefits of Spark over Mapreduce?
Ans:
Spark is really fast. As per their claims, it runs programs up to 100x faster than Hadoop MapReduce
in memory, or 10x faster on disk. It aptly utilizes RAM to produce faster results.
In the MapReduce paradigm, you write many map-reduce tasks and then tie these tasks together using
Oozie or shell scripts. This mechanism is very time consuming and the map-reduce tasks have heavy
latency.
And quite often, translating the output of one MR job into the input of another MR job might
require writing additional code because Oozie may not suffice.
In Spark, you can basically do everything from a single application/console (pyspark or the Scala console)
and get the results immediately. Switching between 'running something on a cluster' and 'doing
something locally' is fairly easy and straightforward. This also leads to less context switching for the
developer and more productivity.
Spark is, in a way, MapReduce and Oozie put together.
Q2: Is there any point in learning MapReduce, then?
Ans: Yes, for the following reasons:
MapReduce is a paradigm used by many big data tools including Spark. So, understanding the
MapReduce paradigm and how to convert a problem into a series of MR tasks is very important.
When the data grows beyond what can fit into the memory on your cluster, the Hadoop MapReduce paradigm is still very relevant.
Almost every other tool such as Hive or Pig converts its query into MapReduce phases. If you
understand MapReduce then you will be able to optimize your queries better.
Q3: When running Spark on Yarn, do I need to install Spark on all nodes of Yarn Cluster?
Ans:
Since spark runs on top of Yarn, it utilizes yarn for the execution of its commands over the cluster's
nodes.
So, you just have to install Spark on one node.
Q4: What are the downsides of Spark?
Ans:
Spark utilizes memory, so the developer has to be careful. A casual developer might make the
following mistakes:
She may end up running everything on the local node instead of distributing work over to the
cluster.
She might hit some webservice too many times by way of using multiple clusters.
The first problem is well tackled by the Hadoop MapReduce paradigm as it ensures that the data your
code is churning is fairly small at any point in time, so you cannot make the mistake of trying to handle the whole
data on a single node.
The second mistake is possible in MapReduce too. While writing MapReduce, a user may hit a
service from inside map() or reduce() too many times. This overloading of a service is also possible
while using Spark.
Q5: What is a RDD?
Ans:
RDD stands for Resilient Distributed Dataset. It is a representation of data located on a
network which is:
Immutable - You can operate on the RDD to produce another RDD but you can't alter it.
Partitioned / Parallel - The data located in an RDD is operated on in parallel. Any operation on an RDD is
done using multiple nodes.
Resilient - If one of the nodes hosting a partition fails, another node takes over its data.
RDD provides two kinds of operations: Transformations and Actions.
Q6: What are Transformations?
Ans: Transformations are the functions that are applied on an RDD (resilient distributed data
set). A transformation results in another RDD. A transformation is not executed until an action
follows.
Examples of transformations are:
map() - applies the function passed to it on each element of the RDD, resulting in a new RDD.
filter() - creates a new RDD by picking the elements from the current RDD which pass the function
argument.
Q7: What are Actions?
Ans:
An action brings the data from the RDD back to the local machine. Execution of an action results in
all the previously created transformations being executed. Examples of actions are:
reduce() - executes the function passed to it again and again until only one value is left. The function
should take two arguments and return one value.
take(n) - brings the first n values of the RDD back to the local node.
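A compact sketch tying the two together, assuming an existing SparkContext sc; transformations are lazy and nothing runs until an action is called:
nums = sc.parallelize([1, 2, 3, 4, 5])
evens = nums.filter(lambda x: x % 2 == 0)    # transformation: nothing executes yet
doubled = evens.map(lambda x: x * 2)         # transformation: still lazy
print(doubled.reduce(lambda x, y: x + y))    # action: triggers the job, prints 12
print(doubled.take(1))                       # action: brings the first element to the driver, [4]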
Did you ever build a production process in Hadoop? If yes, what was the process when your
Hadoop job failed for any reason? (Open Ended Question)
Give some examples of companies that are using Hadoop architecture extensively.
Hadoop Admin Interview Questions
If you want to analyze 100TB of data, what is the best architecture for that?
Explain about the functioning of Master Slave architecture in Hadoop?
What is distributed cache and what are its benefits?
What are the points to consider when moving from an Oracle database to Hadoop clusters?
How would you decide the correct size and number of nodes in a Hadoop cluster?
How do you benchmark your Hadoop Cluster with Hadoop tools?
Hadoop Interview Questions on HDFS
Explain the major difference between an HDFS block and an InputSplit.
Does HDFS make block boundaries between records?
What is streaming access?
What do you mean by Heartbeat in HDFS?
If there are 10 HDFS blocks to be copied from one machine to another. However, the other
machine can copy only 7.5 blocks, is there a possibility for the blocks to be broken down
during the time of replication?
What is Speculative execution in Hadoop?
What is WebDAV in Hadoop?
What is fault tolerance in HDFS?
How can you set random number of mappers and reducers for a Hadoop job?
How many Daemon processes run on a Hadoop System?
What happens if the number of reducers is 0?
What is meant by Map-side and Reduce-side join in Hadoop?
How can the NameNode be restarted?
Hadoop attains parallelism by distributing the tasks across various nodes; it is possible for some
slow nodes to rate-limit the rest of the program and slow down the program. What
method does Hadoop provide to combat this?
What is the significance of the conf.setMapperClass method?
What are combiners and when are these used in a MapReduce job?
How does a DataNode know the location of the NameNode in Hadoop cluster?
How can you check whether the NameNode is working or not?
Pig Interview Questions
When doing a join in Hadoop, you notice that one reducer is running for a very long time.
How will you address this problem in Pig?
Are there any problems which can only be solved by MapReduce and cannot be solved by
PIG? In which kind of scenarios MR jobs will be more useful than PIG?
Give an example scenario on the usage of counters.
Hive Interview Questions
Explain the difference between ORDER BY and SORT BY in Hive?
Differentiate between HiveQL and SQL.
Gartner predicted that "the Big Data movement will generate 4.4 million new IT jobs by the end of 2015
and Hadoop will be in most advanced analytics products by 2015." With the increasing demand for
Hadoop for Big Data related issues, the prediction by Gartner is ringing true.
During March 2014, there were approximately 17,000 Hadoop Developer jobs advertised online. As
of 4th April 2015, there are about 50,000 job openings for Hadoop Developers across the world,
with close to 25,000 openings in the US alone. Of the 3,000 Hadoop students that we have trained so
far, the most popular blog article request was one on Hadoop interview questions.
There are 4 steps which you must take if you are trying to get a job in emerging technology
domains:
Carefully outline the roles and responsibilities
Make your resume highlight the required core skills
Document each and every step of your efforts
Purposefully Network
With more than 30,000 open Hadoop developer jobs, professionals must familiarize themselves with
each and every component of the Hadoop ecosystem to make sure they have a deep
understanding of what Hadoop is, so that they can form an effective approach to a given big data
problem.
With the help of Hadoop Instructors, we have put together a detailed list of Hadoop latest interview
questions based on the different components of the Hadoop Ecosystem such as MapReduce, Hive,
HBase, Pig, YARN, Flume, Sqoop, HDFS, etc.
Hadoop Basic Interview Questions
What is Big Data?
Any data that cannot be stored into traditional RDBMS is termed as Big Data. As we know most of
the data that we use today has been generated in the past 20 years. And this data is mostly
unstructured or semi structured in nature. More than the volume of the data it is the nature of the
data that defines whether it is considered as Big Data or not.
What do the four Vs of Big Data denote?
IBM has a nice, simple explanation for the four critical features of big data:
a) Volume - Scale of data
b) Velocity - Analysis of streaming data
c) Variety - Different forms of data
d) Veracity - Uncertainty of data
Catalog Tables - The two important catalog tables are ROOT and META. The ROOT table tracks where the
META table is and the META table stores all the regions in the system.
Hadoop Sqoop Interview Questions
Explain about some important Sqoop commands other than import and export.
Create Job (--create)
Here we are creating a job with the name myjob, which can import the table data from an RDBMS
table to HDFS. The following command is used to create a job that imports data from the
employee table in the db database to an HDFS file.
$ sqoop job --create myjob \
--import \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee --m 1
Verify Job (--list)
--list argument is used to verify the saved jobs. The following command is used to verify the list of
saved Sqoop jobs.
$ sqoop job --list
Inspect Job (--show)
--show argument is used to inspect or verify particular jobs and their details. The following
command and sample output is used to verify a job called myjob.
$ sqoop job --show myjob
Execute Job (--exec)
--exec option is used to execute a saved job. The following command is used to execute a saved job
called myjob.
$ sqoop job --exec myjob
better predictive analytics, providing customized recommendations and launching new products
based on customer preferences and needs. Walmart observed a significant 10% to 15% increase in
online sales, amounting to $1 billion in incremental revenue. There are many more companies like Facebook,
Twitter, LinkedIn, Pandora, JPMorgan Chase, Bank of America, etc. using big data analytics to boost
their revenue.
Name some companies that use Hadoop.
Yahoo (One of the biggest user & more than 80% code contributor to Hadoop)
Facebook
Netflix
Amazon
Adobe
eBay
Hulu
Spotify
Rubikloud
Twitter
Differentiate between Structured and Unstructured data.
Data which can be stored in traditional database systems in the form of rows and columns, for
example online purchase transactions, is referred to as structured data. Data which can be
stored only partially in traditional database systems, for example data in XML records, is
referred to as semi-structured data. Unorganized and raw data that cannot be categorized as semi-structured
or structured data is referred to as unstructured data. Facebook updates, tweets on
Twitter, reviews, web logs, etc. are all examples of unstructured data.
On what concept the Hadoop framework works?
The Hadoop framework works on the following two core components:
1) HDFS - Hadoop Distributed File System is the Java-based file system for scalable and reliable
storage of large datasets. Data in HDFS is stored in the form of blocks and it operates on the Master-Slave
architecture.
2) Hadoop MapReduce - This is a Java-based programming paradigm of the Hadoop framework that
provides scalability across various Hadoop clusters. MapReduce distributes the workload into
various tasks that can run in parallel. Hadoop jobs perform two separate tasks: the map job breaks
down the data sets into key-value pairs or tuples. The reduce job then takes the output of the map
job and combines the data tuples into a smaller set of tuples. The reduce job is always performed
after the map job is executed.
Hadoop Streaming: users can create and run jobs with any kind of shell script or executable as the Mapper
or Reducer.
What is the best hardware configuration to run Hadoop?
The best configuration for executing Hadoop jobs is dual core machines or dual processors with 4GB
or 8GB RAM that use ECC memory. Hadoop highly benefits from using ECC memory though it is not
low-end. ECC memory is recommended for running Hadoop because most Hadoop users
have experienced various checksum errors when using non-ECC memory. However, the hardware
configuration also depends on the workflow requirements and can change accordingly.
What are the most commonly defined input formats in Hadoop?
The most common input formats defined in Hadoop are:
Text Input Format - This is the default input format defined in Hadoop.
Key Value Input Format - This input format is used for plain text files where each line is split
into a key and a value by the first tab character.
Sequence File Input Format - This input format is used for reading files in sequence.
We have further categorized Big Data Interview Questions for Freshers and Experienced-
Hadoop Interview Questions and Answers for Freshers - Q.Nos- 1,2,4,5,6,7,8,9
Hadoop Interview Questions and Answers for Experienced - Q.Nos-3,8,9,10
Hadoop HDFS Interview Questions and Answers
What is a block and block scanner in HDFS?
Block - The minimum amount of data that can be read or written is generally referred to as a block
in HDFS. The default size of a block in HDFS is 64MB.
Block Scanner - Block Scanner tracks the list of blocks present on a DataNode and verifies them to
find any kind of checksum errors. Block Scanners use a throttling mechanism to reserve disk
bandwidth on the datanode.
Explain the difference between NameNode, Backup Node and Checkpoint NameNode.
NameNode: NameNode is at the heart of the HDFS file system. It manages the metadata, i.e. the
data of the files is not stored on the NameNode; rather, it holds the directory tree of all the files
present in the HDFS file system on a Hadoop cluster. NameNode uses two files for the namespace:
fsimage file - It keeps track of the latest checkpoint of the namespace.
edits file - It is a log of changes that have been made to the namespace since the checkpoint.
Checkpoint Node:
Checkpoint Node keeps track of the latest checkpoint in a directory that has the same structure as that
of the NameNode's directory. The Checkpoint Node creates checkpoints for the namespace at regular
intervals by downloading the edits and fsimage files from the NameNode and merging them locally. The
new image is then uploaded back to the active NameNode.
BackupNode:
Backup Node also provides check pointing functionality like that of the checkpoint node but it also
maintains its up-to-date in-memory copy of the file system namespace that is in sync with the active
NameNode.
What is commodity hardware?
Commodity hardware refers to inexpensive systems that do not have high availability or high
quality. Commodity hardware includes RAM because there are specific services that need to be
executed on RAM. Hadoop can be run on any commodity hardware and does not require any
supercomputers or high-end hardware configuration to execute jobs.
What is the port number for NameNode, Task Tracker and Job Tracker?
NameNode 50070
Job Tracker 50030
Task Tracker 50060
Explain about the process of inter cluster data copying.
HDFS provides a distributed data copying facility through DistCP from source to destination. If
this data copying is between two different Hadoop clusters then it is referred to as inter-cluster data copying.
DistCP requires both source and destination to have a compatible or same version of Hadoop.
How can you overwrite the replication factors in HDFS?
The replication factor in HDFS can be modified or overwritten in 2 ways1)Using the Hadoop FS Shell, replication factor can be changed per file basis using the below
command$hadoop fs setrep w 2 /my/test_file (test_file is the filename whose replication factor will be set
to 2)
2)Using the Hadoop FS Shell, replication factor of all files under a given directory can be modified
using the below command-
3)$hadoop fs setrep w 5 /my/test_dir (test_dir is the name of the directory and all the files in this
directory will have a replication factor set to 5)
Explain the difference between NAS and HDFS.
NAS runs on a single machine and thus there is no probability of data redundancy whereas HDFS
runs on a cluster of different machines thus there is data redundancy because of the replication
protocol.
NAS stores data on a dedicated hardware whereas in HDFS all the data blocks are distributed across
local drives of the machines.
In NAS data is stored independent of the computation and hence Hadoop MapReduce cannot be
used for processing whereas HDFS works with Hadoop MapReduce as the computations in HDFS are
moved to data.
Explain what happens if during the PUT operation, HDFS block is assigned a replication factor 1
instead of the default value 3.
Replication factor is a property of HDFS that can be set accordingly for the entire cluster to adjust
the number of times the blocks are to be replicated to ensure high data availability. For every block
that is stored in HDFS, the cluster will have n-1 duplicated blocks. So, if the replication factor during
the PUT operation is set to 1 instead of the default value 3, then there will be a single copy of the data.
Under these circumstances, when the replication factor is set to 1, if the DataNode holding that copy
crashes under any circumstances, then the only copy of the data will be lost.
What is the process to change the files at arbitrary locations in HDFS?
HDFS does not support modifications at arbitrary offsets in the file or multiple writers but files are
written by a single writer in append only format i.e. writes to a file in HDFS are always made at the
end of the file.
Explain about the indexing process in HDFS.
Indexing process in HDFS depends on the block size. HDFS stores the last part of the data that
further points to the address where the next part of data chunk is stored.
What is a rack awareness and on what basis is data stored in a rack?
All the data nodes put together form a storage area i.e. the physical location of the data nodes is
referred to as Rack in HDFS. The rack information i.e. the rack id of each data node is acquired by
the NameNode. The process of selecting closer data nodes depending on the rack information is
known as Rack Awareness.
The contents present in the file are divided into data block as soon as the client is ready to load the
file into the hadoop cluster. After consulting with the NameNode, client allocates 3 data nodes for
each data block. For each data block, there exists 2 copies in one rack and the third copy is present
in another rack. This is generally referred to as the Replica Placement Policy.
We have further categorized Hadoop HDFS Interview Questions for Freshers and Experienced-
The custom partitioner to the job can be added as a config file in the wrapper which runs Hadoop
MapReduce or the custom partitioner can be added to the job by using the set method of the
partitioner class.
What is the relationship between Job and Task in Hadoop?
A single job can be broken down into one or many tasks in Hadoop.
Is it important for Hadoop MapReduce jobs to be written in Java?
It is not necessary to write Hadoop MapReduce jobs in java but users can write MapReduce jobs in
any desired programming language like Ruby, Perl, Python, R, Awk, etc. through the Hadoop
Streaming API.
What is the process of changing the split size if there is limited storage space on Commodity
Hardware?
If there is limited storage space on commodity hardware, the split size can be changed by
implementing the Custom Splitter. The call to Custom Splitter can be made from the main
method.
What are the primary phases of a Reducer?
The 3 primary phases of a reducer are
1)Shuffle
2)Sort
3)Reduce
What is a TaskInstance?
The actual hadoop MapReduce jobs that run on each slave node are referred to as Task instances.
Every task instance has its own JVM process. For every new task instance, a JVM process is spawned
by default for a task.
Can reducers communicate with each other?
Reducers always run in isolation and they can never communicate with each other as per the
Hadoop MapReduce programming paradigm.
We have further categorized Hadoop MapReduce Interview Questions for Freshers and Experienced-
Hadoop Interview Questions and Answers for Freshers - Q.Nos- 2,5,6
Hadoop Interview Questions and Answers for Experienced - Q.Nos- 1,3,4,7,8,9,10
RDBMS does not have support for in-built partitioning whereas in HBase there is automated
partitioning.
RDBMS stores normalized data whereas HBase stores de-normalized data.
Explain about the different catalog tables in HBase?
The two important catalog tables in HBase, are ROOT and META. ROOT table tracks where the META
table is and META table stores all the regions in the system.
What is column families? What happens if you alter the block size of ColumnFamily on an already
populated database?
The logical division of data is represented through a key known as a column family. Column families
consist of the basic unit of physical storage on which compression features can be applied. In an
already populated database, when the block size of a column family is altered, the old data will
remain within the old block size whereas the new data that comes in will take the new block size.
When compaction takes place, the old data will take the new block size so that the existing data is
read correctly.
Explain the difference between HBase and Hive.
HBase and Hive both are completely different hadoop based technologies-Hive is a data warehouse
infrastructure on top of Hadoop whereas HBase is a NoSQL key value store that runs on top of
Hadoop. Hive helps SQL savvy people to run MapReduce jobs whereas HBase supports 4 primary
operations-put, get, scan and delete. HBase is ideal for real time querying of big data where Hive is
an ideal choice for analytical querying of data collected over period of time.
Explain the process of row deletion in HBase.
On issuing a delete command in HBase through the HBase client, data is not actually deleted from
the cells but rather the cells are made invisible by setting a tombstone marker. The deleted cells are
removed at regular intervals during compaction.
What are the different types of tombstone markers in HBase for deletion?
There are 3 different types of tombstone markers in HBase for deletion:
1) Family Delete Marker - This marker marks all columns for a column family.
2) Version Delete Marker - This marker marks a single version of a column.
3) Column Delete Marker - This marker marks all the versions of a column.
Explain about HLog and WAL in HBase.
All edits in the HStore are stored in the HLog. Every region server has one HLog. HLog contains
entries for edits of all regions performed by a particular Region Server. WAL stands for Write
Ahead Log, in which all the HLog edits are written immediately. WAL edits remain in
memory till the flush period in case of deferred log flush.
We have further categorized Hadoop HBase Interview Questions for Freshers and Experienced-
To insert only new rows, the append mode should be used in the import command; to insert new rows and also
update existing ones, the lastmodified mode should be used in the import command.
What is the standard location or path for Hadoop Sqoop scripts?
/usr/bin/Hadoop Sqoop
How can you check all the tables present in a single database using Sqoop?
The command to check the list of all tables present in a single database using Sqoop is as followsSqoop list-tables connect jdbc: mysql: //localhost/user;
How are large objects handled in Sqoop?
Sqoop provides the capability to store large sized data into a single field based on the type of data.
Sqoop supports the ability to store:
1) CLOBs - Character Large Objects
2) BLOBs - Binary Large Objects
Large objects in Sqoop are handled by importing the large objects into a file referred to as a LobFile, i.e.
Large Object File. The LobFile has the ability to store records of huge size; thus each record in the
LobFile is a large object.
Can free form SQL queries be used with Sqoop import command? If yes, then how can they be
used?
Sqoop allows us to use free form SQL queries with the import command. The import command
should be used with the -e or --query option to execute free form SQL queries. When using the
-e or --query option with the import command, the --target-dir value must be specified.
Differentiate between Sqoop and distCP.
DistCP utility can be used to transfer data between clusters whereas Sqoop can be used to transfer
data only between Hadoop and RDBMS.
What are the limitations of importing RDBMS tables into Hcatalog directly?
There is an option to import RDBMS tables into HCatalog directly by making use of the --hcatalog-database
option with the --hcatalog-table option, but the limitation is that several arguments
like --as-avrodatafile, --direct, --as-sequencefile, --target-dir and --export-dir are not supported.
We have further categorized Hadoop Sqoop Interview Questions for Freshers and Experienced-
getIncrements and getActions methods just similar to HBase sink. When the sink stops, the cleanUp
method is called by the serializer.
Explain about the different channel types in Flume. Which channel type is faster?
The 3 different built-in channel types available in Flume are:
MEMORY Channel - Events are read from the source into memory and passed to the sink.
JDBC Channel - JDBC Channel stores the events in an embedded Derby database.
FILE Channel - File Channel writes the contents to a file on the file system after reading the event
from a source. The file is deleted only after the contents are successfully delivered to the sink.
MEMORY Channel is the fastest channel among the three, however it has the risk of data loss. The
channel that you choose completely depends on the nature of the big data application and the value
of each event.
Which is the reliable channel in Flume to ensure that there is no data loss?
FILE Channel is the most reliable channel among the 3 channels JDBC, FILE and MEMORY.
Explain about the replication and multiplexing selectors in Flume.
Channel Selectors are used to handle multiple channels. Based on the Flume header value, an event
can be written to just a single channel or to multiple channels. If a channel selector is not specified
for the source then by default it is the Replicating selector. Using the replicating selector, the same
event is written to all the channels in the source's channels list. The Multiplexing channel selector is used
when the application has to send different events to different channels.
How can a multi-hop agent be set up in Flume?
Avro RPC Bridge mechanism is used to setup Multi-hop agent in Apache Flume.
Does Apache Flume provide support for third party plug-ins?
Yes. Apache Flume has a plug-in-based architecture, as it can load data from
external sources and transfer it to external destinations.
Is it possible to leverage real time analysis on the big data collected by Flume directly? If
yes, then explain how.
Data from Flume can be extracted, transformed and loaded in real-time into Apache Solr servers
using MorphlineSolrSink.
Differentiate between FileSink and FileRollSink
The major difference between HDFS FileSink and FileRollSink is that HDFS File Sink writes the events
into the Hadoop Distributed File System (HDFS) whereas File Roll Sink stores the events into the
local file system.
Hadoop Flume Interview Questions and Answers for Freshers - Q.Nos- 1,2,4,5,6,10
Hadoop Flume Interview Questions and Answers for Experienced- Q.Nos- 3,7,8,9
Hadoop Zookeeper Interview Questions and Answers
Can Apache Kafka be used without Zookeeper?
It is not possible to use Apache Kafka without ZooKeeper because if ZooKeeper is down, Kafka
cannot serve client requests.
Name a few companies that use Zookeeper.
Yahoo, Solr, Helprace, Neo4j, Rackspace
What is the role of Zookeeper in HBase architecture?
In HBase architecture, ZooKeeper is the monitoring server that provides different services like
tracking server failure and network partitions, maintaining the configuration information,
establishing communication between the clients and region servers, usability of ephemeral nodes to
identify the available servers in the cluster.
Explain about ZooKeeper in Kafka
Apache Kafka uses ZooKeeper to be a highly distributed and scalable system. ZooKeeper is used by
Kafka to store various configurations and use them across the Hadoop cluster in a distributed
manner. To achieve distributed-ness, configurations are distributed and replicated throughout the
leader and follower nodes in the ZooKeeper ensemble. We cannot directly connect to Kafka by bypassing ZooKeeper, because if ZooKeeper is down it will not be able to serve client requests.
Explain how Zookeeper works
ZooKeeper is referred to as the King of Coordination, and distributed applications use ZooKeeper to
store and facilitate important configuration information updates. ZooKeeper works by coordinating
the processes of distributed applications. ZooKeeper is a robust replicated synchronization service
with eventual consistency. A set of nodes is known as an ensemble and persisted data is distributed
between multiple nodes.
3 or more independent servers collectively form a ZooKeeper cluster and elect a master. A client
connects to any one of the servers and migrates if that particular node fails. The ensemble of
ZooKeeper nodes is alive as long as the majority of nodes are working. The master node in ZooKeeper is
dynamically selected by consensus within the ensemble, so if the master node fails then the role
of master node will migrate to another node which is selected dynamically. Writes are linear and
reads are concurrent in ZooKeeper.
List some examples of Zookeeper use cases.
Found by Elastic uses ZooKeeper comprehensively for resource allocation, leader election, high
priority notifications and discovery. The entire Found service is built up of various systems that read
and write to ZooKeeper.
Apache Kafka, which depends on ZooKeeper, is used by LinkedIn.
Storm, which relies on ZooKeeper, is used by popular companies like Groupon and Twitter.
How to use Apache Zookeeper command line interface?
ZooKeeper has a command line client support for interactive use. The command line interface of
ZooKeeper is similar to the file and shell system of UNIX. Data in ZooKeeper is stored in a hierarchy
of Znodes where each znode can contain data just similar to a file. Each znode can also have children
just like directories in the UNIX file system.
Zookeeper-client command is used to launch the command line client. If the initial prompt is hidden
by the log messages after entering the command, users can just hit ENTER to view the prompt.
What are the different types of Znodes?
There are 2 types of Znodes, namely Ephemeral and Sequential Znodes.
The Znodes that get destroyed as soon as the client that created them disconnects are referred to as
Ephemeral Znodes.
A Sequential Znode is one in which a sequential number is chosen by the ZooKeeper ensemble and
appended to the name the client assigns to the znode.
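A small sketch of creating both Znode types from Python, using the third-party kazoo client purely as an illustration (the original text does not name a client library; the host, paths and data are assumptions):
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()
# Ephemeral znode: removed automatically when this client's session ends.
zk.create("/demo/ephemeral_node", b"temp data", ephemeral=True, makepath=True)
# Sequential znode: ZooKeeper appends a monotonically increasing counter to the name.
path = zk.create("/demo/task-", b"payload", sequence=True, makepath=True)
print(path)   # e.g. /demo/task-0000000000
zk.stop()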
What are watches?
Client disconnection might be a troublesome problem, especially when we need to keep track of the
state of Znodes at regular intervals. ZooKeeper has an event system referred to as a watch, which can
be set on a Znode to trigger an event whenever it is removed, altered or any new children are created
below it.
What problems can be addressed by using Zookeeper?
In the development of distributed systems, creating own protocols for coordinating the hadoop
cluster results in failure and frustration for the developers. The architecture of a distributed system
can be prone to deadlocks, inconsistency and race conditions. This leads to various difficulties in
making the hadoop cluster fast, reliable and scalable. To address all such problems, Apache
ZooKeeper can be used as a coordination service to write correct distributed applications without
having to reinvent the wheel from the beginning.
Hadoop ZooKeeper Interview Questions and Answers for Freshers - Q.Nos- 1,2,8,9
Hadoop ZooKeeper Interview Questions and Answers for Experienced- Q.Nos-3,4,5,6,7, 10
Hadoop Pig Interview Questions and Answers
What do you mean by a bag in Pig?
A collection of tuples is referred to as a bag in Apache Pig.
Does Pig support multi-line commands?
Yes
Sometimes there is data in a tuple or bag and if we want to remove the level of nesting from that
data then Flatten modifier in Pig can be used. Flatten un-nests bags and tuples. For tuples, the
Flatten operator will substitute the fields of a tuple in place of a tuple whereas un-nesting bags is a
little complex because it requires creating new tuples.
We have further categorized Hadoop Pig Interview Questions for Freshers and Experienced-
Hadoop Interview Questions and Answers for Freshers - Q.Nos-1,2,4,7,9
Hadoop Interview Questions and Answers for Experienced - Q.Nos- 3,5,6,8,10
Hadoop Hive Interview Questions and Answers
What is a Hive Metastore?
Hive Metastore is a central repository that stores metadata in external database.
Are multiline comments supported in Hive?
No
What is ObjectInspector functionality?
ObjectInspector is used to analyze the structure of individual columns and the internal structure of
the row objects. ObjectInspector in Hive provides access to complex objects which can be stored in
multiple formats.
We have further categorized Hadoop YARN Interview Questions for Freshers and Experienced-
Hadoop Interview Questions and Answers for Freshers - Q.Nos- 2,3
Hadoop Interview Questions and Answers for Experienced - Q.Nos- 1
Hadoop Interview Questions Answers Needed
Interview Questions on Hadoop Hive
1)Explain about the different types of join in Hive.
2)How can you configure remote metastore mode in Hive?
3)Explain about the SMB Join in Hive.
4)Is it possible to change the default location of Managed Tables in Hive, if so how?
5)How does data transfer happen from Hive to HDFS?
6)How can you connect an application, if you run Hive as a server?
7)What does the overwrite keyword denote in Hive load statement?
8)What is SerDe in Hive? How can you write your own custom SerDe?
9)In case of embedded Hive, can the same metastore be used by multiple users?
Hadoop YARN Interview Questions
1)What are the additional benefits YARN brings in to Hadoop?
2)How can native libraries be included in YARN jobs?
3)Explain the differences between Hadoop 1.x and Hadoop 2.x
Or
4)Explain the difference between MapReduce1 and MapReduce 2/YARN
5)What are the modules that constitute the Apache Hadoop 2.0 framework?
Explain what JobTracker in Hadoop is. What are the actions followed by Hadoop?
In Hadoop, JobTracker is used for submitting and tracking MapReduce jobs. The JobTracker runs in its
own JVM process.
Hadoop performs the following actions:
The client application submits jobs to the JobTracker.
The JobTracker communicates with the NameNode to determine the data location.
The JobTracker locates TaskTracker nodes with available slots at or near the data.
It submits the work to the chosen TaskTracker nodes.
When a task fails, the JobTracker is notified and decides what to do next.
The TaskTracker nodes are monitored by the JobTracker.
Explain what is heartbeat in HDFS?
Heartbeat refers to a signal used between a DataNode and the NameNode, and between a TaskTracker
and the JobTracker; if the NameNode or JobTracker does not receive the heartbeat signal, it is
considered that there is some issue with the DataNode or TaskTracker.
Explain what combiners are and when you should use a combiner in a MapReduce job.
Combiners are used to increase the efficiency of a MapReduce program. The amount of data that needs
to be transferred across to the reducers can be reduced with the help of combiners. If the
operation performed is commutative and associative you can use your reducer code as a
combiner. The execution of the combiner is not guaranteed in Hadoop.
What happens when a datanode fails ?
When a datanode fails
Jobtracker and namenode detect the failure
On the failed node all tasks are re-scheduled
Namenode replicates the users data to another node
Explain what is Speculative Execution?
In Hadoop, during Speculative Execution, a certain number of duplicate tasks are launched. On
different slave nodes, multiple copies of the same map or reduce task can be executed using Speculative
Execution. In simple words, if a particular node is taking a long time to complete a task, Hadoop will
create a duplicate task on another node. The task that finishes first is retained and the duplicate tasks
that do not finish first are killed.
Explain what are the basic parameters of a Mapper?
The basic parameters of a Mapper are
LongWritable and Text
Text and IntWritable
Explain what is the function of MapReducer partitioner?
The function of the MapReduce partitioner is to make sure that all the values of a single key go to the
same reducer, which eventually helps in the even distribution of the map output over the reducers.
Explain what is difference between an Input Split and HDFS Block?
Logical division of data is known as Split while physical division of data is known as HDFS Block
Explain what happens in TextInputFormat?
In TextInputFormat, each line in the text file is a record. The value is the content of the line while the key is
the byte offset of the line. For instance, Key: LongWritable, Value: Text.
Mention the main configuration parameters that the user needs to specify to run a
MapReduce job.
The user of the MapReduce framework needs to specify:
Job's input locations in the distributed file system
Job's output location in the distributed file system
Input format
Output format
Class containing the map function
Class containing the reduce function
JAR file containing the mapper, reducer and driver classes
KeyValueInputFormat: Reads text file and parses lines into key, Val pairs. Everything up to the first
tab character is sent as key to the Mapper and the remainder of the line is sent as value to the
mapper.
Q3. What is InputSplit in Hadoop?
When a Hadoop job is run, it splits input files into chunks and assigns each split to a mapper to
process. This is called an InputSplit.
Q4. How is the splitting of file invoked in Hadoop framework?
It is invoked by the Hadoop framework by running the getSplits() method of the InputFormat class
(like FileInputFormat) defined by the user.
Q5. Consider case scenario: In M/R system, - HDFS block size is 64 MB
Input format is FileInputFormat
We have 3 files of size 64K, 65Mb and 127Mb
How many input splits will be made by Hadoop framework?
Hadoop will make 5 splits as follows:
1 split for 64K files
2 splits for 65MB files
2 splits for 127MB files
Q6. What is the purpose of RecordReader in Hadoop?
The InputSplit has defined a slice of work, but does not describe how to access it. The RecordReader
class actually loads the data from its source and converts it into (key, value) pairs suitable for
reading by the Mapper. The RecordReader instance is defined by the Input Format.
Q7. After the Map phase finishes, the Hadoop framework does Partitioning, Shuffle and sort.
Explain what happens in this phase?
Partitioning: It is the process of determining which reducer instance will receive which intermediate
keys and values. Each mapper must determine for all of its output (key, value) pairs which reducer
will receive them. It is necessary that for any key, regardless of which mapper instance generated it,
the destination partition is the same.
Shuffle: After the first map tasks have completed, the nodes may still be performing several more
map tasks each. But they also begin exchanging the intermediate outputs from the map tasks to
where they are required by the reducers. This process of moving map outputs to the reducers is
known as shuffling.
Sort: Each reduce task is responsible for reducing the values associated with several intermediate
keys. The set of intermediate keys on a single node is automatically sorted by Hadoop before they
are presented to the Reducer.
Q8. If no custom partitioner is defined in Hadoop then how is data partitioned before it is sent to
the reducer?
The default partitioner computes a hash value for the key and assigns the partition based on this
result.
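An illustrative model of that behaviour in Python (Hadoop's actual default HashPartitioner is Java code, so this is only a sketch of the idea):
def default_partition(key, num_reducers):
    # Hash the key and map it onto one of the reducer partitions;
    # the same key therefore always lands in the same partition.
    return hash(key) % num_reducers

print(default_partition("hadoop", 4))
print(default_partition("hadoop", 4))   # same key, same partition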
Q9. What is a Combiner?
The Combiner is a mini-reduce process which operates only on data generated by a mapper. The
Combiner will receive as input all data emitted by the Mapper instances on a given node. The
output from the Combiner is then sent to the Reducers, instead of the output from the Mappers.
Q10. What is JobTracker?
JobTracker is the service within Hadoop that runs MapReduce jobs on the cluster.
Q11. What are some typical functions of Job Tracker?
The following are some typical tasks of JobTracker:
It accepts jobs from clients.
It talks to the NameNode to determine the location of the data.
It locates TaskTracker nodes with available slots at or near the data.
It submits the work to the chosen TaskTracker nodes and monitors the progress of each task by
receiving heartbeat signals from the TaskTracker.
Q12. What is TaskTracker?
TaskTracker is a node in the cluster that accepts tasks like MapReduce and Shuffle operations from
a JobTracker.
Q13. What is the relationship between Jobs and Tasks in Hadoop?
One job is broken down into one or many tasks in Hadoop.
Q14. Suppose Hadoop spawned 100 tasks for a job and one of the task failed. What will Hadoop
do?
It will restart the task again on some other TaskTracker and only if the task fails more than four
(default setting and can be changed) times will it kill the job.
Q15. Hadoop achieves parallelism by dividing the tasks across many nodes; it is possible for a few
slow nodes to rate-limit the rest of the program and slow down the program. What mechanism
does Hadoop provide to combat this?
Speculative Execution.
Q16. How does speculative execution work in Hadoop?
JobTracker makes different TaskTrackers process the same input. When tasks complete, they
announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive
copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon the
tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper
completed successfully first.
Q17. Using command line in Linux, how will you
See all jobs running in the Hadoop cluster
Kill a job?
hadoop job -list
hadoop job -kill jobID
Q18. What is Hadoop Streaming?
Streaming is a generic API that allows programs written in virtually any language to be used as
Hadoop Mapper and Reducer implementations.
Q19. What is the characteristic of the Streaming API that makes it flexible to run MapReduce jobs in
languages like Perl, Ruby, Awk etc.?
Hadoop Streaming allows the use of arbitrary programs for the Mapper and Reducer phases of a
MapReduce job by having both Mappers and Reducers receive their input on stdin and emit output
(key, value) pairs on stdout.
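A minimal word-count sketch of such streaming programs in Python; the file names and the word-count task are illustrative, and the scripts would be wired into a job through the hadoop-streaming jar's -mapper and -reducer options:
#!/usr/bin/env python
# mapper.py: read raw lines on stdin, emit tab-separated (word, 1) pairs on stdout.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t1" % word)

#!/usr/bin/env python
# reducer.py: streaming sorts the mapper output by key, so all counts for a word
# arrive together and can be summed with a single running counter.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))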
Q20. What is Distributed Cache in Hadoop?
Distributed Cache is a facility provided by the MapReduce framework to cache files (text, archives,
jars and so on) needed by applications during execution of the job. The framework will copy the
necessary files to the slave node before any tasks for the job are executed on that node.
Q21. What is the benefit of Distributed cache? Why can we just have the file in HDFS and have the
application read it?
This is because the distributed cache is much faster. It copies the file to all trackers at the start of the
job. Now if the task tracker runs 10 or 100 Mappers or Reducers, it will use the same copy of the
distributed cache. On the other hand, if you put code in the file to read it from HDFS in the MR job, then
every Mapper will try to access it from HDFS; hence if a TaskTracker runs 100 map tasks then it will try
to read this file 100 times from HDFS. Also, HDFS is not very efficient when used like this.
Q.22 What mechanism does Hadoop framework provide to synchronise changes made in
Distribution Cache during runtime of the application?
This is a tricky question. There is no such mechanism. Distributed Cache by design is read only during
the time of Job execution.
Q23. Have you ever used Counters in Hadoop. Give us an example scenario?
Anybody who claims to have worked on a Hadoop project is expected to use counters.
Q24. Is it possible to provide multiple input to Hadoop? If yes then how can you give multiple
directories as input to the Hadoop job?
Yes, the input format class provides methods to add multiple directories as input to a Hadoop job.
Q25. Is it possible to have Hadoop job output in multiple directories? If yes, how?
Yes, by using the MultipleOutputs class.
Q26. What will a Hadoop job do if you try to run it with an output directory that is already
present? Will it
Overwrite it
Warn you and continue
Throw an exception and exit
The Hadoop job will throw an exception and exit.
Q27. How can you set an arbitrary number of mappers to be created for a job in Hadoop?
You cannot set it.
Q28. How can you set an arbitrary number of Reducers to be created for a job in Hadoop?
You can either do it programmatically by using method setNumReduceTasks in the Jobconf Class or
set it up as a configuration setting.
Q29. How will you write a custom partitioner for a Hadoop job?
To have Hadoop use a custom partitioner, you will have to do at minimum the following three things:
Create a new class that extends Partitioner Class