Hadoop Interview Questions IV
Looking out for Hadoop interview questions that are frequently asked by employers? Here
is the list of Hadoop interview questions which covers setting up a Hadoop cluster…
Which are the three modes in which Hadoop can be run?
The three modes in which Hadoop can be run are:
1. Standalone (local) mode
2. Pseudo-distributed mode
3. Fully distributed mode
What are the features of Standalone (local) mode?
In stand-alone mode there are no daemons; everything runs in a single JVM. It has no DFS and
uses the local file system. Stand-alone mode is suitable only for running MapReduce
programs during development. It is one of the least used environments.
What are the features of Pseudo mode?
Pseudo-distributed mode is used both for development and in the QA environment. In
pseudo-distributed mode, all the daemons run on the same machine.
Can we call VMs pseudos? (VM: Virtual Machine)
No, VMs are not pseudos. A VM is a general virtualization concept, whereas pseudo-distributed
mode is specific to Hadoop.
What are the features of Fully Distributed mode?
Fully distributed mode is used in the production environment, where we have ‘n’ number of
machines forming a Hadoop cluster and the Hadoop daemons run across that cluster.
There is one host on which the NameNode runs, other hosts on which DataNodes run, and further
machines on which TaskTrackers run. We have separate masters and separate slaves in this
distribution.
Does Hadoop follow the UNIX pattern?
Yes, Hadoop closely follows the UNIX pattern. Hadoop also has the ‘conf‘ directory as in the
case of UNIX.
In which directory is Hadoop installed?
Cloudera and Apache have the same directory structure. Hadoop is installed in
/usr/lib/hadoop-0.20/.
What are the port numbers of the NameNode, JobTracker and TaskTracker?
The default web UI port for the NameNode is 50070, for the JobTracker it is 50030 and for the
TaskTracker it is 50060.
What is the Hadoop-core configuration?
Hadoop core is configured by two XML files:
1. hadoop-default.xml, which holds the read-only default settings, and
2. hadoop-site.xml, which holds the site-specific overrides.
(In Hadoop 0.20 and later these were split into core-site.xml, hdfs-site.xml and mapred-site.xml.)
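As an illustration of how these configuration resources are consumed, the sketch below uses Hadoop's org.apache.hadoop.conf.Configuration class to load a site file and read a property; the file path and property shown are only examples, not settings required by the question.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class ConfigDemo {
    public static void main(String[] args) {
        // Configuration loads the *-default.xml resources first,
        // then applies site-specific overrides added here.
        Configuration conf = new Configuration();
        conf.addResource(new Path("/etc/hadoop/conf/core-site.xml")); // example path

        // Read a property, falling back to a default if it is not set.
        String fsName = conf.get("fs.default.name", "file:///");
        System.out.println("Default file system: " + fsName);
    }
}
```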
Big Data & Hadoop Interview Questions and Answers for Experienced
Hey, here I come with Big Data and Hadoop interview questions and answers for experienced
database developers and for fresher Big Data and Hadoop developers. Here are the most important
Hadoop and Big Data interview questions, with answers.
What is NoSQL?
NoSQL is a whole new way of thinking about a database. NoSQL is not a relational database.
The reality is that a relational database model may not be the best solution for all situations. The
easiest way to think of NoSQL is as a database which does not adhere to the traditional
relational database management system (RDBMS) structure. Sometimes you will also see it
referred to as 'not only SQL'.
Why would NoSQL be better than using a SQL Database? And how much better is it?
It would be better when your site needs to scale so massively that the best RDBMS running on
the best hardware you can afford and optimized as much as possible simply can't keep up with
the load. How much better it is depends on the specific use case (lots of update activity combined
with lots of joins is very hard on "traditional" RDBMSs) - could well be a factor of 1000 in
extreme cases.
How can we change the split size if our commodity hardware has less storage space?
If our commodity hardware has less storage space, we can change the split size by writing a
'custom splitter'. Hadoop allows this kind of customization of the input split logic, and it can be
invoked from the driver (main) method of the job.
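As a rough sketch of one simple way to influence the split size without writing a full custom InputFormat, the example below caps the maximum split size through the standard FileInputFormat helper of the newer (Hadoop 2.x) MapReduce API; the 64 MB figure and job name are just illustrative values.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitSizeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "small-split-job");

        job.setInputFormatClass(TextInputFormat.class);

        // Cap each input split at 64 MB (example value) so a single
        // map task never has to process more than that from one split.
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
    }
}
```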
Why can we not do aggregation (addition) in a mapper? Why do we require a reducer for that?
We cannot do aggregation (addition) in a mapper because sorting and grouping by key are not
done on the mapper side; they happen only on the reducer side. A separate mapper instance is
initialized for each input split, so while aggregating in a mapper we would lose the values seen
by the other mapper instances; no single mapper has a track of the rows processed elsewhere.
Only the reducer, which receives all values for a given key grouped together, can produce the
complete aggregate.
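To make this concrete, here is a minimal word count style sketch (a standard illustrative example, not taken from the question set): the mapper only emits (word, 1) pairs for its own split, and the actual addition happens in the reducer, which sees all counts for a word grouped together.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {

    // The mapper cannot total counts across the whole data set:
    // it only ever sees the records of its own input split.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE); // emit (word, 1); no addition here
                }
            }
        }
    }

    // The reducer receives all values for one key, already grouped,
    // so this is the only place the addition can be done completely.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```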
What is Streaming?
Streaming is a feature of the Hadoop framework that allows us to write MapReduce programs in
any programming language that can accept standard input and produce standard output. It could
be Perl, Python or Ruby and need not be Java. However, customization of MapReduce internals
can only be done in Java and not in any other programming language.
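To illustrate the contract a streaming mapper has to obey (read records line by line on standard input, write tab-separated key/value pairs on standard output), here is a minimal sketch; in practice such mappers are usually small Perl or Python scripts, but any program that follows this protocol works, and the tokenizing logic below is just an example.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

// A program usable as a Hadoop Streaming mapper: it reads records
// line by line from stdin and writes "key<TAB>value" lines to stdout.
public class StreamingStyleMapper {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            for (String token : line.split("\\s+")) {
                if (!token.isEmpty()) {
                    // Streaming interprets the text before the tab as the key
                    // and the text after it as the value.
                    System.out.println(token + "\t1");
                }
            }
        }
    }
}
```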
What is a Combiner?
A 'Combiner' is a mini-reducer that performs the local reduce task. It receives the input from the
mapper on a particular node and sends its output on to the reducer. Combiners improve the
efficiency of MapReduce by reducing the amount of data that has to be sent to the
reducers.
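Because the reduce logic in a sum-style job is associative, the same reducer class can typically be reused as the combiner. The sketch below shows how the combiner is wired into the job driver; the class names are reused from the word count sketch above, so they are illustrative rather than part of any real project.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class CombinerDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word-count-with-combiner");
        job.setJarByClass(CombinerDemo.class);

        job.setMapperClass(WordCountSketch.TokenMapper.class);
        // The combiner runs a local reduce on each mapper's output,
        // shrinking the data shuffled across the network to the reducers.
        job.setCombinerClass(WordCountSketch.SumReducer.class);
        job.setReducerClass(WordCountSketch.SumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
    }
}
```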
What is the difference between an HDFS Block and Input Split?
HDFS Block is the physical division of the data and Input Split is the logical division of the data.
-------------------------------------
PIG INTERVIEW QUESTIONS
Can you give us some examples of how Hadoop is used in a real-time environment?
Let us assume that we have an exam consisting of 10 multiple-choice questions and 20
students appear for that exam. Every student will attempt each question. For each question and
each answer option, a key will be generated. So we have a set of key-value pairs for all the
questions and all the answer options for every student. Based on the options that the students
have selected, you have to analyze and find out how many students have answered correctly.
This isn’t an easy task. Here Hadoop comes into picture! Hadoop helps you in solving these
problems quickly and without much effort. You may also take the case of how many students
have wrongly attempted a particular question.
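One way to realize the key/value design described above is sketched here; the input format, field positions and key layout (question id concatenated with the chosen option) are hypothetical, purely to show how such keys could be formed before the counting is done in a reducer.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Assumes (hypothetically) one CSV line per answer: studentId,questionId,chosenOption
public class ExamAnswerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text questionOption = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length == 3) {
            // Key = "questionId_chosenOption", e.g. "Q7_B"; a reducer can then
            // count how many students picked each option for each question.
            questionOption.set(fields[1] + "_" + fields[2]);
            context.write(questionOption, ONE);
        }
    }
}
```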
What is BloomMapFile used for?
The BloomMapFile is a class that extends MapFile, so its functionality is similar to MapFile.
BloomMapFile uses dynamic Bloom filters to provide a quick membership test for the keys. It is
used in the HBase table format.
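A rough sketch of using the class directly follows. The constructor signatures shown are from the older (Hadoop 1.x) API and are deprecated in later releases, and the path and key values are illustrative; keys must be appended in sorted order, as with any MapFile, and probablyHasKey() gives the fast Bloom-filter membership test.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.BloomMapFile;
import org.apache.hadoop.io.Text;

public class BloomMapFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        String dir = "/tmp/bloom-demo"; // illustrative path

        // Write a few key/value pairs; MapFile requires keys in sorted order.
        BloomMapFile.Writer writer =
                new BloomMapFile.Writer(conf, fs, dir, Text.class, Text.class);
        writer.append(new Text("apple"), new Text("1"));
        writer.append(new Text("banana"), new Text("2"));
        writer.close();

        // probablyHasKey() consults the Bloom filter: "false" is definite,
        // "true" means the key is probably present and worth a real lookup.
        BloomMapFile.Reader reader = new BloomMapFile.Reader(fs, dir, conf);
        System.out.println(reader.probablyHasKey(new Text("banana"))); // likely true
        System.out.println(reader.probablyHasKey(new Text("cherry"))); // false
        reader.close();
    }
}
```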
What is PIG?
PIG is a platform for analyzing large data sets. It consists of a high-level language for expressing
data analysis programs, coupled with infrastructure for evaluating those programs. PIG's
infrastructure layer consists of a compiler that produces sequences of MapReduce programs.
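As a small illustration of that layering, the sketch below embeds a Pig Latin script in Java via org.apache.pig.PigServer running in MapReduce mode; the input path, schema and filter condition are made up for the example, and the registered statements are what the compiler turns into a sequence of MapReduce jobs.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigEmbedDemo {
    public static void main(String[] args) throws Exception {
        // MAPREDUCE mode compiles the script below into MapReduce jobs.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Hypothetical input file and schema, purely for illustration.
        pig.registerQuery("scores = LOAD '/data/exam_scores' USING PigStorage(',') "
                + "AS (student:chararray, score:int);");
        pig.registerQuery("passed = FILTER scores BY score >= 40;");

        // Triggers planning (logical plan -> physical plan -> MapReduce jobs)
        // and writes the result to HDFS.
        pig.store("passed", "/data/exam_passed");
    }
}
```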
What is the difference between logical and physical plans?
Pig undergoes some steps when a Pig Latin Script is converted into MapReduce jobs. After
performing the basic parsing and semantic checking, it produces a logical plan. The logical plan
describes the logical operators that have to be executed by Pig during execution. After this, Pig