
1 Why is HDFS fault-tolerant?

HDFS is fault-tolerant because it replicates data across different DataNodes. By default, each block of data is replicated on three DataNodes, stored on separate machines. If one node crashes, the data can still be retrieved from the other DataNodes.

2 What is NameNode?


NameNode is the master service that stores the HDFS metadata on disk and in RAM. It holds information about the various DataNodes, the location of each block, the size of each block, etc.

3 If you have an input file of 350 MB, how many input splits would HDFS
create and what would be the size of each input split?
By default, each block in HDFS is 128 MB, and the input split size matches the block size. Every split except the last one will be 128 MB. For an input file of 350 MB, there are three input splits in total, with sizes of 128 MB, 128 MB, and 94 MB.

4 How does rack awareness work in HDFS?


HDFS rack awareness refers to the knowledge of which rack each DataNode belongs to and how the DataNodes are distributed across the racks of a Hadoop cluster; this knowledge is used when deciding where to place block replicas.

5 How do you copy data from the local system onto HDFS?
The copyFromLocal or put command, for example: hdfs dfs -put <local_path> <hdfs_path>

6 What role does the RecordReader play in a MapReduce operation?


The RecordReader communicates with the InputSplit and converts the data into key-value pairs suitable for the mapper to read.

7 What is a Combiner?
This is an optional phase; it is like a mini reducer. The combiner receives data from the map tasks, works
on it, and then passes its output to the reducer phase.

8 Name some Hadoop-specific data types that are used in a MapReduce program.

IntWritable

FloatWritable

LongWritable

DoubleWritable

BooleanWritable

9 What are the major configuration parameters required in a MapReduce program?

Input location of the job in HDFS

Output location of the job in HDFS

Input and output formats

Classes containing the map and reduce functions

JAR file containing the mapper, reducer, and driver classes

10 Can we have more than one ResourceManager in a YARN-based cluster?

Yes, Hadoop v2 allows us to have more than one ResourceManager. You can run a high-availability YARN cluster with an active ResourceManager and a standby ResourceManager, with ZooKeeper handling the coordination between them.

11 What are the different components of a Hive architecture?

User Interface

Metastore

Compiler

Execution Engine

12 What is the difference between an external table and a managed table in Hive?

External tables in Hive refer to data that sits at an existing location outside the warehouse directory.

Internal (managed) tables manage their data and move it into the Hive warehouse directory by default.

If one drops an external table, Hive deletes only the metadata of the table and does not change the table data present in HDFS. If one drops a managed table, the metadata along with the table data is deleted from the Hive warehouse directory.
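
The difference can be sketched in HiveQL. Below is a minimal illustration run through PySpark's spark.sql, assuming a SparkSession built with Hive support; the table names and the /data/sales location are hypothetical.

from pyspark.sql import SparkSession

# Hypothetical session with Hive support.
spark = SparkSession.builder.appName("hive-tables").enableHiveSupport().getOrCreate()

# Managed table: Hive owns both the metadata and the data in the warehouse directory.
spark.sql("CREATE TABLE IF NOT EXISTS managed_sales (id INT, amount DOUBLE)")

# External table: Hive tracks only the metadata; the files stay at the given location.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS external_sales (id INT, amount DOUBLE)
    LOCATION '/data/sales'
""")

# DROP TABLE removes data and metadata for managed_sales,
# but only the metadata for external_sales.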

13 What is a partition in Hive and why is partitioning required in Hive?


Partitioning is a way of grouping similar types of data together based on columns or partition keys. Each table can have one or more partition keys to identify a particular partition. Partitioning is required because it lets queries read only the relevant partitions instead of scanning the entire table.
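
As a rough illustration, the following sketch (run through PySpark's spark.sql, with hypothetical table and column names) creates a table partitioned by country.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-partitioning").enableHiveSupport().getOrCreate()

# Each distinct country value gets its own subdirectory under the table's path,
# so queries that filter on country read only the matching partitions.
spark.sql("""
    CREATE TABLE IF NOT EXISTS customers (id INT, name STRING)
    PARTITIONED BY (country STRING)
""")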

14 What is bucketing in Hive ?


Bucketing in Hive is a data-organising technique. It is used to decompose data into more manageable parts, known as buckets, which in turn improves the performance of queries.
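
A minimal sketch of a bucketed table, again via PySpark's spark.sql with hypothetical names; the rows are hashed on customer_id into a fixed number of buckets.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-bucketing").enableHiveSupport().getOrCreate()

# Rows are hashed on customer_id into 8 bucket files, which helps with sampling
# and with joins on the bucketed column.
spark.sql("""
    CREATE TABLE IF NOT EXISTS orders (order_id INT, customer_id INT)
    CLUSTERED BY (customer_id) INTO 8 BUCKETS
""")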

15 What are the key differences between Hive and Pig?


Hive uses a declarative language called HiveQL, which is similar to SQL and is mainly used for reporting.

Pig uses a high-level procedural language called Pig Latin for programming.

16 What are the different ways of executing a Pig script?


Grunt shell

Script file

17 What are the different complex data types in Pig?


Tuple

Bag

Map
18 What are the relational operators in Pig?
COGROUP

CROSS

FOREACH

JOIN

LIMIT

SPLIT

UNION

ORDER

19 What is the use of filters in Apache Pig?


The FILTER operator is used to select the required tuples from a relation based on a condition. It also allows you to remove unwanted records from the data file.

20 What are the key components of HBase?


Region Server

HMaster

ZooKeeper

21 What are column families in HBase?

Column families consist of a group of columns that are defined during table creation, and each column family has certain column qualifiers, which are separated from the column family name by a delimiter.

22 Why do we need to disable a table in HBase?

An HBase table is disabled to allow modifications to its settings. A table must be disabled before it can be dropped.

23 Can you import/export in an HBase table?

Yes, using the HBase Import utility and the HBase Export utility.


24 Write the HBase command to list the contents of a table.

scan 'table_name'

25 Write the HBase command to update the column families of a table.

alter 'table_name', 'column_family_name'

26 What are the default file formats to import data using Sqoop?

The default file formats to import data using Sqoop are the Delimited Text File format and the SequenceFile format.

27 What are the different cluster managers available in Apache Spark?

Standalone Mode
Apache Mesos
Hadoop YARN
Kubernetes
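
The cluster manager is selected by the master URL passed when the application starts. A hedged PySpark sketch follows; the host names and ports are placeholders.

from pyspark.sql import SparkSession

# Pick exactly one master URL; the commented alternatives show the other cluster managers.
spark = (SparkSession.builder
         .appName("cluster-manager-demo")
         .master("local[*]")                    # local threads, handy for testing
         # .master("spark://host:7077")         # Spark standalone cluster manager
         # .master("yarn")                      # Hadoop YARN
         # .master("mesos://host:5050")         # Apache Mesos
         # .master("k8s://https://host:6443")   # Kubernetes
         .getOrCreate())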

28 What are the types of operations supported by an RDD?

Transformations, which lazily define a new RDD from an existing one (for example, map and filter)

Actions, which trigger the computation and return a result to the driver (for example, collect and count)
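
A small PySpark sketch of the two kinds of operations; the data is illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-operations").getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize([1, 2, 3, 4])
squares = nums.map(lambda x: x * x)   # transformation: only defines a new RDD
print(squares.collect())              # action: runs the job and returns [1, 4, 9, 16]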

29 What is the function of filter()?

The filter() function is used to develop a new RDD by selecting, from the existing RDD, the elements for which the function passed as an argument returns true.
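
A minimal PySpark sketch of filter(); the data and predicate are illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("filter-demo").getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize([1, 2, 3, 4, 5, 6])
evens = nums.filter(lambda x: x % 2 == 0)  # keep only elements for which the predicate is true
print(evens.collect())                     # [2, 4, 6]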

30 What is a SparkSession?

SparkSession is the unified entry point for working with Spark, introduced in Spark 2.0; it subsumes the older SQLContext and HiveContext.
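
A typical way to obtain a SparkSession in PySpark; the application name and the input path are placeholders.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("my-app")
         .getOrCreate())

df = spark.read.json("events.json")  # hypothetical input file
df.printSchema()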

31 What is lazy evaluation in Spark?

Lazy evaluation in Spark means that transformations are not executed until an action is triggered.
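
A short PySpark sketch: the two transformations below do not run anything on their own; the job executes only when the count() action is called.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-evaluation").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(10))
doubled = rdd.map(lambda x: x * 2)          # transformation: nothing runs yet
filtered = doubled.filter(lambda x: x > 5)  # still nothing runs
print(filtered.count())                     # action: the whole chain executes here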

32 What is a Parquet file in Spark?

Parquet is a columnar storage file format optimized for use with big data processing frameworks like
Apache Spark. It provides efficient data compression and encoding schemes with enhanced performance
to handle complex nested data structures.
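
A minimal PySpark sketch of writing and reading Parquet; the output path is a placeholder.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.write.mode("overwrite").parquet("/tmp/demo_parquet")  # stored column by column, compressed
back = spark.read.parquet("/tmp/demo_parquet")
back.show()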

33 What is DataFrame in Spark?

DataFrame is a distributed collection of data organized into named columns, similar to a table in a
relational database.
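
A small PySpark sketch of building and querying a DataFrame; the column names and data are illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45)],
    ["name", "age"],
)
df.filter(df.age > 40).show()  # query it much like a table in a relational database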

34 What are the features of Spark Datasets?

Compile-time analysis
Faster Computation
Less Memory consumption
Query Optimization
Qualified Persistent storage
Single Interface for multiple languages

35 What is the difference between the reduce() and take() functions?

take(n) is an action that returns the first n elements of an RDD to the driver (the local node).

reduce() is an action that applies a binary function to the elements of an RDD repeatedly until only one value is left.
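
A short PySpark sketch contrasting the two actions; the data is illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reduce-vs-take").getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize([1, 2, 3, 4, 5])
print(nums.reduce(lambda a, b: a + b))  # 15: folds all elements into a single value
print(nums.take(3))                     # [1, 2, 3]: first three elements returned to the driver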
