

1. What is Big data?

A) Big Data is a collection of large volumes of data that grows exponentially
with time.

2. Write data types in hadoop.

A) Integer –> IntWritable: It is used to pass integer numbers as key or
value.
Float –> FloatWritable: It is used to pass floating point numbers as
key or value.
Long –> LongWritable: It is used to store long values.
Short –> ShortWritable: It is used to store short values.
Double –> DoubleWritable: It is used to store double values.
String –> Text: It is used to pass string characters as key or value.
Byte –> ByteWritable: It is used to store a single byte value.
null –> NullWritable: It is used to pass null as a key or value.

3. List out the applications of Big Data.

A)  Applications of Big Data:
a.  Tracking Customer Spending Habit, Shopping Behaviour.
b.  Recommendation
c. Virtual Personal Assistant Tool
d. Media and Entertainment Sector

4. What is Sqoop in Hadoop?

A) Apache Sqoop is used to import data from external datastores
into the Hadoop Distributed File System (HDFS) or related Hadoop
ecosystem components like Hive and HBase.

5. List out the differences between Sqoop and Hadoop.

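A) Hadoop is a framework for distributed storage and processing of large
data sets, whereas Sqoop is a tool in the Hadoop ecosystem that is mostly
used to extract structured data from databases like Teradata, Oracle, etc.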

6. Where does the Shuffle and Sort process take place?

A) The shuffle and sort phase is done by the framework. Data from all
mappers are grouped by the key, split among reducers and sorted by the
key. Each reducer obtains all values associated with the same key.
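
The grouping, splitting and sorting described above can be sketched in
plain Python. This is only an illustration, not Hadoop's implementation:
the names partition and shuffle_and_sort are hypothetical, and a CRC32 of
the key stands in for the partitioner's hash function.

```python
import zlib
from collections import defaultdict

def partition(key, num_reducers):
    # Assumed stand-in for a hash partitioner: the same key always
    # maps to the same reducer index.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle_and_sort(mapper_outputs, num_reducers):
    """Group pairs from all mappers by key, split them among reducers,
    and sort each reducer's input by key."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for output in mapper_outputs:
        for key, value in output:
            buckets[partition(key, num_reducers)][key].append(value)
    return [sorted(bucket.items()) for bucket in buckets]

# Output of two mappers, as lists of (key, value) pairs.
mapper_outputs = [
    [("cat", 1), ("dog", 1)],
    [("dog", 1), ("ant", 1), ("cat", 1)],
]
reducer_inputs = shuffle_and_sort(mapper_outputs, num_reducers=2)
for index, part in enumerate(reducer_inputs):
    print(index, part)
```

Each key ends up in exactly one reducer's input, together with all of its
values.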

7. Indicate the differences between name node and data node.

A) The main difference between Name Node and Data Node in Hadoop is
that the Name Node is the master node in HDFS that manages the file
system metadata while the Data Node is a slave node in HDFS that
stores the actual data as instructed by the Name Node.

8. State the principle of Job tracker?

A) The Job Tracker is the service within Hadoop that farms out
MapReduce tasks to specific nodes in the cluster, ideally the nodes that
have the data, or at least are in the same rack. Client applications
submit jobs to the Job tracker. The Job Tracker submits the work
to the chosen Task Tracker nodes.

9. Name features of Task tracker?

A) A TaskTracker is a node in the cluster that accepts tasks (Map, Reduce
and Shuffle operations) from a JobTracker. Every TaskTracker is
configured with a set of slots, which indicate the number of tasks that it
can accept.

10. Write map and reduce functions.

A) The Map function takes input from the disk as <key,value> pairs,
processes them, and produces another set of intermediate <key,value>
pairs as output. The Reduce function also takes inputs as <key,value>
pairs, and produces <key,value> pairs as output.
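
As an illustration, a word count can be written as one map function and
one reduce function. The sketch below is plain Python rather than the
Hadoop Java API; map_fn, reduce_fn and run_job are hypothetical names, and
run_job only imitates what the framework does between the two phases.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(offset, line):
    """Map: (offset, line of text) -> intermediate (word, 1) pairs."""
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce: (word, all counts for that word) -> (word, total)."""
    yield word, sum(counts)

def run_job(lines):
    # Map phase. The line index is used as the key, a simplification.
    intermediate = [pair
                    for index, line in enumerate(lines)
                    for pair in map_fn(index, line)]
    # Stand-in for the framework's shuffle and sort.
    intermediate.sort(key=itemgetter(0))
    results = []
    for word, group in groupby(intermediate, key=itemgetter(0)):
        results.extend(reduce_fn(word, (count for _, count in group)))
    return results

print(run_job(["the quick fox", "the lazy dog"]))
# [('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```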

11. Label the usage of combine and split functions.

A) Combiner Function : A Combiner, also known as a semi-reducer, is an
optional class that operates by accepting the inputs from the Map class
and thereafter passing the output key-value pairs to the Reducer class. The
main function of a Combiner is to summarize the map output records
with the same key.
Split Function : The split size is user defined, and the user can control
it in the MapReduce program. One split can map to multiple blocks,
and there can be multiple splits of one block. The number of map tasks
(Mappers) is equal to the number of input splits.
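
A minimal sketch of what a combiner does to one mapper's output, written
in plain Python. The function name combine is hypothetical, and summing is
just one possible summary; it works here because addition gives the same
total whether it is applied once or in stages.

```python
from collections import Counter

def combine(map_output):
    """Summarize one mapper's records that share a key, before they
    are passed on to the reducer."""
    totals = Counter()
    for key, value in map_output:
        totals[key] += value
    return sorted(totals.items())

map_output = [("the", 1), ("fox", 1), ("the", 1), ("the", 1)]
combined = combine(map_output)
print(combined)  # [('fox', 1), ('the', 3)]
```

Four records leave the mapper as two, so less data has to be moved to the
reducer.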

12. Explain Four V’s of Big Data.

A) Volume : Volume represents the amount of data, which is growing at
an exponential rate, i.e. into Petabytes and Exabytes.
Variety : Variety refers to the heterogeneity of data types. In other
words, the data that is gathered comes in a variety of formats like
videos, audios, csv, etc.
Velocity : Velocity refers to the rate at which data is growing, which is
very fast. Today, yesterday’s data is already considered old. Nowadays,
social media is a major contributor to the velocity of growing data.
Veracity : Veracity refers to data in doubt, that is, the uncertainty of
the available data due to inconsistency and incompleteness.

13. What is Map Reduce?

A) MapReduce is a framework using which we can write applications to
process huge amounts of data, in parallel, on large clusters of
commodity hardware in a reliable manner.

14. Write down the features of HDFS.

A) Features of HDFS:
a. Highly Scalable
b. Portable
c. Distributed Data Storage
d. Replication

15. What is Flume?

A) Apache Flume is a reliable and distributed system for collecting,
aggregating and moving massive quantities of log data. It has a simple
yet flexible architecture based on streaming data flows.

16. List hadoop operation modes.

A) Hadoop mainly works in 3 different modes:
a. Standalone Mode
b. Pseudo-Distributed Mode
c. Fully-Distributed Mode

17. What is the fair scheduler?

A) Fair scheduling is a method of assigning resources to applications such
that all apps get, on average, an equal share of resources over time.
The scheduler organizes apps further into “queues”, and shares
resources fairly between these queues.

18. Explain about Blocks in HDFS.

A) A block is the smallest contiguous location on a hard drive where
data is stored. HDFS stores each file as blocks and distributes them across
the Hadoop cluster. The default size of a block in HDFS is 128 MB.
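
The arithmetic behind splitting a file into blocks can be sketched as
follows, using the 128 MB default from the answer above. The function name
block_layout is hypothetical. The last block only holds the remaining
bytes, so a small file does not use up a full 128 MB.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB default block size, in bytes

def block_layout(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file would be
    split into."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes

MB = 1024 * 1024
sizes = block_layout(300 * MB)
print([size // MB for size in sizes])  # [128, 128, 44]
```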

19. Which Command is used to list the files in Hadoop?

A) The hdfs dfs -ls command is used to list files in Hadoop, including
files in Hadoop archives. Run the hdfs dfs -ls command by specifying the
directory or archive location.

20. Write about HAR files.

A) A HAR (Hadoop Archive) is created from a collection of files by the
archiving tool, which runs a MapReduce job. This MapReduce job processes
the input files in parallel to create an archive file.

21. Compare and contrast NoSQL vs. Relational Databases.

A) SQL databases are known as relational databases, and have a table-
based data structure, with a strict, predefined schema required. NoSQL
databases, or non-relational databases, can be document based, graph
databases, key-value pairs, or wide-column stores.

NoSQL databases don’t require any predefined schema, allowing you to
work more freely with “unstructured data.” Relational databases are
vertically scalable, but usually more expensive, whereas the horizontal
scaling nature of NoSQL databases is more cost-efficient.

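22. What is serialization? Why is it needed?

A) Serialization is the process of transforming structured objects into a
byte stream for transmission over a network or for writing to persistent
storage. In Hadoop it is needed for communication between processes on
different nodes and for storing data persistently.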
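
A small sketch of the idea in Python: an integer is turned into a 4-byte,
big-endian byte stream and read back. This is the layout Java's
DataOutput.writeInt produces, which is how an IntWritable is written; the
helper names serialize_int and deserialize_int are hypothetical.

```python
import struct

def serialize_int(value):
    """Turn a 32-bit signed integer into 4 big-endian bytes."""
    return struct.pack(">i", value)

def deserialize_int(data):
    """Rebuild the integer from its 4-byte representation."""
    (value,) = struct.unpack(">i", data)
    return value

payload = serialize_int(163)
print(payload.hex())             # 000000a3
print(deserialize_int(payload))  # 163
```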

23. What is the mechanism used by HDFS to ensure data integrity?

A) Data integrity in Hadoop is achieved by maintaining a checksum of
the data written to each block. Whenever data is written to HDFS blocks,
HDFS calculates a checksum for all data written and verifies that
checksum when the data is read. A separate checksum is
created for every dfs.
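
The write-then-verify idea can be sketched in Python. This is only an
illustration under assumptions: zlib.crc32 stands in for whichever
checksum algorithm the cluster is configured with, the 512-byte chunk size
is an assumed value, and the function names are hypothetical.

```python
import zlib

CHUNK_SIZE = 512  # assumed number of data bytes covered by one checksum

def checksums(data, chunk_size=CHUNK_SIZE):
    """Compute one CRC32 checksum per chunk of data (the write side)."""
    return [zlib.crc32(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]

def verify(data, expected, chunk_size=CHUNK_SIZE):
    """Recompute checksums on read and return the indexes of chunks
    that no longer match."""
    actual = checksums(data, chunk_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]

block = bytes(range(256)) * 6   # 1536 bytes, i.e. 3 chunks
stored = checksums(block)       # kept alongside the data at write time

corrupted = bytearray(block)
corrupted[700] ^= 0xFF          # flip the bits of one byte in chunk 1

print(verify(block, stored))             # []
print(verify(bytes(corrupted), stored))  # [1]
```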

24. Briefly explain MapReduce data flow.

A) MapReduce is used to process huge amounts of data. To handle
the incoming data in a parallel and distributed form, the data has
to flow through various phases.

25. Why Is a Block in HDFS So Large?

A) HDFS blocks are much larger than disk blocks, and the reason is to
limit the cost of seeking. By making the blocks significantly larger, the
time to transfer the data from the disk becomes much longer than the time
to seek to the beginning of the block.

26. Explain about the Key Value pairs in a Hadoop MapReduce.

A) Key-value pair in MapReduce is the record entity that Hadoop
MapReduce accepts for execution. We use Hadoop mainly for data
Analysis. It deals with structured, unstructured and semi-structured data.

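27. What are the two nodes operating in an HDFS cluster?

A) The two types of nodes operating in an HDFS cluster are the Name Node
(master) and the Data Nodes (slaves). The Secondary Name Node is only a
helper that periodically checkpoints the Name Node's metadata.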

28. What is structured, semi-structured, unstructured data?

A) We can classify data as structured data, semi-structured data,
or unstructured data. Structured data resides in predefined formats
and models. Unstructured data is stored in its natural format until it's
extracted for analysis. Semi-structured data is a mix of
both structured and unstructured data.

29. What are various File Permissions in HDFS?

A) The file or directory has separate permissions for the user that
is the owner, for other users that are members of the group, and
for all other users. For files, the r permission is required to read the file,
and the w permission is required to write or append to the file.
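
The three permission sets (owner, group, others) are usually written as an
octal mode. The sketch below, with the hypothetical helper describe_mode,
shows how such a mode breaks down into r, w and x flags for each of the
three classes of user.

```python
def describe_mode(mode):
    """Render an octal mode such as 0o640 as an rwx string for the
    owner, the group, and all other users."""
    flags = "rwx"
    parts = []
    for shift in (6, 3, 0):  # owner bits, group bits, other bits
        bits = (mode >> shift) & 0b111
        parts.append("".join(flag if bits & (4 >> i) else "-"
                             for i, flag in enumerate(flags)))
    return "".join(parts)

print(describe_mode(0o640))  # rw-r-----
print(describe_mode(0o755))  # rwxr-xr-x
```

With mode 640 the owner can read and write the file, group members can
only read it, and all other users have no access.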

30. What is Hadoop streaming?

A) Hadoop streaming is a utility that comes with the Hadoop distribution.
This utility allows you to create and run Map/Reduce jobs with any
executable or script as the mapper and/or the reducer.
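
A streaming mapper and reducer are ordinary programs that read lines of
text and write tab-separated key/value lines. The sketch below assumes a
made-up input format of "year,temperature" records and finds the maximum
per year. The functions take any iterable of lines so they can be tried in
memory; in a real streaming job they would read sys.stdin and print each
result.

```python
from itertools import groupby

def mapper(lines):
    """Turn 'year,temperature' records into 'year<TAB>temperature'."""
    for line in lines:
        year, temperature = line.strip().split(",")
        yield f"{year}\t{temperature}"

def reducer(sorted_lines):
    """Expects lines sorted by key, as the reducer receives them."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for year, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{year}\t{max(int(value) for _, value in group)}"

records = ["1950,0", "1950,22", "1949,111", "1949,78"]
map_output = sorted(mapper(records))  # sorting stands in for the shuffle
result = list(reducer(map_output))
print(result)  # ['1949\t111', '1950\t22']
```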

31. Define a Block. Why is the HDFS block size larger than a disk block?

A) HDFS blocks are large compared to disk blocks, and the reason is to
minimize the cost of seeks. By making a block large enough, the time to
transfer the data from the disk can be significantly longer than the time to
seek to the start of the block.
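
A worked example of this trade-off, using illustrative numbers that are
assumptions rather than measurements: a seek time of 10 ms and a transfer
rate of 100 MB/s. The hypothetical helper below computes the block size at
which the seek takes a chosen fraction of the transfer time.

```python
def min_block_size_mb(seek_ms, transfer_mb_per_s, seek_fraction):
    """Block size (in MB) for which the seek time equals seek_fraction
    of the time spent transferring the block."""
    transfer_seconds = (seek_ms / 1000) / seek_fraction
    return transfer_seconds * transfer_mb_per_s

size = min_block_size_mb(seek_ms=10, transfer_mb_per_s=100,
                         seek_fraction=0.01)
print(round(size))  # 100
```

With these numbers, a block of about 100 MB keeps the seek at 1% of the
transfer time.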

32. Define HDFS.

A) The Hadoop Distributed File System (HDFS) is a distributed file system
designed to run on commodity hardware. ... HDFS is highly fault-tolerant
and is designed to be deployed on low-cost hardware.

33. What is GenericOptionsParser Class?

A) GenericOptionsParser is a utility class within the
org.apache.hadoop.util package. This class parses the standard command
line arguments and sets them on a configuration object, which can then be
used within the application.

34. Name the four independent entities in classic MapReduce of a job run.

A) At the highest level, there are four independent entities:
a. The client, which submits the MapReduce job.
b. The jobtracker, which coordinates the job run. The jobtracker is a
Java application whose main class is JobTracker.
c. The tasktrackers, which run the tasks that the job has been split into.
d. The distributed filesystem (normally HDFS), which is used for sharing
job files between the other entities.

35. How does the Capacity Scheduler approach differ from the Fair
Scheduler?

A) The Fair Scheduler assigns an equal amount of resources to all running
jobs. When a job completes, the freed slot is assigned to a new job with
an equal amount of resources. Here, the resources are shared between
queues. The Capacity Scheduler, on the other hand, assigns resources
based on the capacity required by the organisation.

36. Write the advantage of Lazy output.

A) FileOutputFormat subclasses will create output files (part-r-nnnnn),
even if they are empty. Some applications prefer not to create empty files,
which is where LazyOutputFormat helps.

LazyOutputFormat is a wrapper OutputFormat. It makes sure that the
output file is created only when the first record is emitted for a given
partition. To use LazyOutputFormat, call its setOutputFormatClass() method
with the JobConf and the underlying output format.

To enable LazyOutputFormat, Streaming and Pipes support a -lazyOutput
option.

37. What is the use of MultipleOutputs?

A) The MultipleOutputs class provides a facility to write Hadoop
map/reduce output to more than one folder. Basically, we can use
MultipleOutputs when we want to write outputs other than the MapReduce
job's default output, and to write the job's output to different files
specified by the user.
