
1 Why is HDFS fault-tolerant?

HDFS is fault-tolerant because it replicates data across different DataNodes. By default, each block of data is replicated on three DataNodes, stored on separate machines. If one node crashes, the data can still be retrieved from the other DataNodes.

2 What is NameNode?


NameNode is the master service that stores the HDFS metadata on disk and in RAM. It holds information about the various DataNodes, the location of each block, the size of each block, etc.

3 If you have an input file of 350 MB, how many input splits would HDFS
create and what would be the size of each input split?
By default, each block in HDFS is 128 MB, and the input split size matches the block size. Every split except the last one will be 128 MB. For an input file of 350 MB, there are three input splits in total, with sizes of 128 MB, 128 MB, and 94 MB.

4 How does rack awareness work in HDFS?


HDFS rack awareness refers to the knowledge of which rack each DataNode belongs to and how the DataNodes are distributed across the racks of a Hadoop cluster; this knowledge is used when deciding where to place block replicas.

5 How do you copy data from the local system onto HDFS?
The copyFromLocal or put command, for example: hdfs dfs -put <local_path> <hdfs_path>

6 What role does the RecordReader play in a MapReduce operation?


The RecordReader communicates with the InputSplit and converts the data into key-value pairs suitable for the mapper to read.

7 What is a Combiner?
This is an optional phase; it is like a mini reducer. The combiner receives data from the map tasks, works
on it, and then passes its output to the reducer phase.

8 Name some Hadoop-specific data types that are used in a MapReduce program.

IntWritable

FloatWritable

LongWritable

DoubleWritable

BooleanWritable

9 What are the major configuration parameters required in a MapReduce program?

Input location of the job in HDFS

Output location of the job in HDFS

Input and output formats

Classes containing the map and reduce functions

JAR file containing the mapper, reducer, and driver classes

10 Can we have more than one ResourceManager in a YARN-based cluster?

Yes, Hadoop v2 allows us to have more than one ResourceManager. You can run a high-availability YARN cluster with an active ResourceManager and a standby ResourceManager, with ZooKeeper handling the coordination between them.

11 What are the different components of a Hive architecture?

User Interface

Metastore

Compiler

Execution Engine

12 What is the difference between an external table and a managed table in Hive?

External tables in Hive refer to data that sits at an existing location outside the warehouse directory.

Internal (managed) tables manage their data and move it into the Hive warehouse directory by default.

If one drops an external table, Hive deletes only the metadata of the table and does not change the table data present in HDFS. If one drops a managed table, the metadata along with the table data is deleted from the Hive warehouse directory.
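
The difference can be sketched in HiveQL. Below is a minimal illustration run through PySpark's spark.sql, assuming a SparkSession built with Hive support; the table names and the /data/sales location are hypothetical.

from pyspark.sql import SparkSession

# Hypothetical session with Hive support.
spark = SparkSession.builder.appName("hive-tables").enableHiveSupport().getOrCreate()

# Managed table: Hive owns both the metadata and the data in the warehouse directory.
spark.sql("CREATE TABLE IF NOT EXISTS managed_sales (id INT, amount DOUBLE)")

# External table: Hive tracks only the metadata; the files stay at the given location.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS external_sales (id INT, amount DOUBLE)
    LOCATION '/data/sales'
""")

# DROP TABLE removes data and metadata for managed_sales,
# but only the metadata for external_sales.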

13 What is a partition in Hive and why is partitioning required in Hive?


Partitioning is a way of grouping similar types of data together based on columns or partition keys. Each table can have one or more partition keys to identify a particular partition. Partitioning is required because it lets queries read only the relevant partitions instead of scanning the entire table.
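
As a rough illustration, the following sketch (run through PySpark's spark.sql, with hypothetical table and column names) creates a table partitioned by country.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-partitioning").enableHiveSupport().getOrCreate()

# Each distinct country value gets its own subdirectory under the table's path,
# so queries that filter on country read only the matching partitions.
spark.sql("""
    CREATE TABLE IF NOT EXISTS customers (id INT, name STRING)
    PARTITIONED BY (country STRING)
""")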

14 What is bucketing in Hive ?


Bucketing in Hive is a data-organising technique. It is used to decompose data into more manageable parts, known as buckets, which in turn improves the performance of queries.
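
A minimal sketch of a bucketed table, again via PySpark's spark.sql with hypothetical names; the rows are hashed on customer_id into a fixed number of buckets.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-bucketing").enableHiveSupport().getOrCreate()

# Rows are hashed on customer_id into 8 bucket files, which helps with sampling
# and with joins on the bucketed column.
spark.sql("""
    CREATE TABLE IF NOT EXISTS orders (order_id INT, customer_id INT)
    CLUSTERED BY (customer_id) INTO 8 BUCKETS
""")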

15 What are the key differences between Hive and Pig?


Hive uses a declarative language called HiveQL, which is similar to SQL and is mainly used for reporting.

Pig uses a high-level procedural language called Pig Latin for programming.

16 What are the different ways of executing a Pig script?


Grunt shell

Script file

17 What are the different complex data types in Pig?


Tuple

Bag

Map
18 What are the relational operators in Pig?
COGROUP

CROSS

FOREACH

JOIN

LIMIT

SPLIT

UNION

ORDER

19 What is the use of filters in Apache Pig?


The FILTER operator is used to select the required tuples from a relation based on a condition. It also allows you to remove unwanted records from the data file.

20 What are the key components of HBase?


Region Server

HMaster

ZooKeeper

21 What are column families in HBase?

Column families consist of a group of columns that are defined during table creation, and each column family has certain column qualifiers, which are separated from the column family name by a delimiter.

22 Why do we need to disable a table in HBase?

An HBase table is disabled to allow modifications to its settings. A table must be disabled before it can be dropped.

23 Can you import/export in an HBase table?

Yes, using the HBase Import utility and the HBase Export utility.


24 Write the HBase command to list the contents of a table.

scan 'table_name'

25 Write the HBase command to update the column families of a table.

alter 'table_name', 'column_family_name'

26 What are the default file formats to import data using Sqoop?

The default file formats to import data using Sqoop are the Delimited Text File format and the SequenceFile format.

27 What are the different cluster managers available in Apache Spark?

Standalone Mode
Apache Mesos
Hadoop YARN
Kubernetes
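
The cluster manager is selected by the master URL passed when the application starts. A hedged PySpark sketch follows; the host names and ports are placeholders.

from pyspark.sql import SparkSession

# Pick exactly one master URL; the commented alternatives show the other cluster managers.
spark = (SparkSession.builder
         .appName("cluster-manager-demo")
         .master("local[*]")                    # local threads, handy for testing
         # .master("spark://host:7077")         # Spark standalone cluster manager
         # .master("yarn")                      # Hadoop YARN
         # .master("mesos://host:5050")         # Apache Mesos
         # .master("k8s://https://host:6443")   # Kubernetes
         .getOrCreate())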

28 What are the types of operations supported by an RDD?

Transformations, which lazily define a new RDD from an existing one (for example, map and filter)

Actions, which trigger the computation and return a result to the driver (for example, collect and count)
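
A small PySpark sketch of the two kinds of operations; the data is illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-operations").getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize([1, 2, 3, 4])
squares = nums.map(lambda x: x * x)   # transformation: only defines a new RDD
print(squares.collect())              # action: runs the job and returns [1, 4, 9, 16]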

29 What is the function of filter()?

The filter() function is used to develop a new RDD by selecting, from the existing RDD, the elements for which the function passed as an argument returns true.
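
A minimal PySpark sketch of filter(); the data and predicate are illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("filter-demo").getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize([1, 2, 3, 4, 5, 6])
evens = nums.filter(lambda x: x % 2 == 0)  # keep only elements for which the predicate is true
print(evens.collect())                     # [2, 4, 6]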

30 What is a SparkSession?

SparkSession is the unified entry point for working with Spark, introduced in Spark 2.0; it subsumes the older SQLContext and HiveContext.
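
A typical way to obtain a SparkSession in PySpark; the application name and the input path are placeholders.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("my-app")
         .getOrCreate())

df = spark.read.json("events.json")  # hypothetical input file
df.printSchema()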

31 What is lazy evaluation in Spark?

Lazy evaluation in Spark means that transformations are not executed until an action is triggered.
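
A short PySpark sketch: the two transformations below do not run anything on their own; the job executes only when the count() action is called.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-evaluation").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(10))
doubled = rdd.map(lambda x: x * 2)          # transformation: nothing runs yet
filtered = doubled.filter(lambda x: x > 5)  # still nothing runs
print(filtered.count())                     # action: the whole chain executes here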

32 What is a Parquet file in Spark?

Parquet is a columnar storage file format optimized for use with big data processing frameworks like
Apache Spark. It provides efficient data compression and encoding schemes with enhanced performance
to handle complex nested data structures.
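
A minimal PySpark sketch of writing and reading Parquet; the output path is a placeholder.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.write.mode("overwrite").parquet("/tmp/demo_parquet")  # stored column by column, compressed
back = spark.read.parquet("/tmp/demo_parquet")
back.show()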

33 What is DataFrame in Spark?

DataFrame is a distributed collection of data organized into named columns, similar to a table in a
relational database.
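
A small PySpark sketch of building and querying a DataFrame; the column names and data are illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45)],
    ["name", "age"],
)
df.filter(df.age > 40).show()  # query it much like a table in a relational database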

34 What are the features of Spark Datasets?

Compile-time analysis
Faster Computation
Less Memory consumption
Query Optimization
Qualified Persistent storage
Single Interface for multiple languages

35 What is the difference between the reduce() and take() functions?

take(n) is an action that returns the first n elements of an RDD to the driver (the local node).

reduce() is an action that applies a binary function to the elements of an RDD repeatedly until only one value is left.
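
A short PySpark sketch contrasting the two actions; the data is illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reduce-vs-take").getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize([1, 2, 3, 4, 5])
print(nums.reduce(lambda a, b: a + b))  # 15: folds all elements into a single value
print(nums.take(3))                     # [1, 2, 3]: first three elements returned to the driver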
