
Big Data

Q1. What is Big Data?


Big Data is a collection of data that is huge in volume and keeps growing exponentially with time. Such datasets are too large and complex to be processed using traditional computing techniques. Big Data is not a single technique or tool; rather, it has become a complete subject that involves various tools, techniques, and frameworks.

Q2. What is Hadoop?


Hadoop is an open-source framework that allows you to store and process big data in
a distributed environment across clusters of computers using simple
programming models. It is designed to scale from a single server to thousands
of machines, each offering local computation and storage.
Hadoop Architecture
At its core, Hadoop has two major layers:
1. Processing/Computation layer (MapReduce)
2. Storage layer (Hadoop Distributed File System, HDFS)
Hadoop Components
1. HDFS: The primary storage layer of Hadoop. It presents a single virtual file
system over the whole cluster and scales out by adding nodes.
2. YARN (Yet Another Resource Negotiator): Responsible for providing and
managing computational resources. It comprises the Resource Manager, Node
Managers, and per-application Application Masters.
3. Hadoop Common: A collection of shared libraries and utilities that the
other Hadoop modules depend on.

Q3. What is MapReduce?


MapReduce is a processing technique and a programming model for distributed
computing, originally implemented in Java. A MapReduce algorithm contains two
important tasks, namely Map and Reduce. The Map task takes a set of data and
converts it into another set of data, where individual elements are broken down
into tuples (key/value pairs). The Reduce task takes the output from a map as
its input and combines those data tuples into a smaller set of tuples. As the
name MapReduce implies, the reduce task is always performed after the map task.
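The map and reduce phases described above can be sketched in plain Python. This is a single-process illustration of the programming model only, not how Hadoop actually distributes work; the word-count task and all names here are illustrative:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: break each document into (key, value) tuples, here (word, 1)
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # The framework groups map output by key (the "shuffle" step);
    # Reduce then combines each group into a smaller set of tuples
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

docs = ["big data is big", "data grows with time"]
counts = reduce_phase(map_phase(docs))
print(counts["big"])   # 2
print(counts["data"])  # 2
```

In a real Hadoop job, many mappers run in parallel on different blocks of the input, and the shuffle step routes each key to one of many reducers.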
Q4. What is Hadoop Streaming?
Hadoop streaming is a utility that comes with the Hadoop distribution. This utility
allows you to create and run Map/Reduce jobs with any executable or script as
the mapper and/or the reducer.
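With streaming, the mapper and reducer are ordinary scripts that read lines from stdin and write tab-separated key/value records to stdout. The sketch below shows the logic such a pair of Python scripts would implement for word counting; in a real run each function would live in its own script passed to the streaming jar's -mapper and -reducer options, and Hadoop would sort the mapper output by key before the reducer sees it (simulated here with sorted()):

```python
from itertools import groupby

def mapper(lines):
    # Emit one "word<TAB>1" record per word, as a streaming
    # mapper script would write to stdout
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_records):
    # Records arrive sorted by key; sum the counts for each
    # run of identical keys
    keyed = (record.split("\t") for record in sorted_records)
    for word, group in groupby(keyed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

records = sorted(mapper(["hello world", "hello hadoop"]))
print(list(reducer(records)))  # ['hadoop\t1', 'hello\t2', 'world\t1']
```

Because the contract is just "lines in, lines out", the same scripts can be written in any language that reads stdin, which is the point of the streaming utility.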

Q5. What is Hadoop Ecosystem?


Below are the Hadoop components that together form the Hadoop ecosystem:
HDFS -> Hadoop Distributed File System
YARN -> Yet Another Resource Negotiator
MapReduce -> Data processing using programming
Spark -> In-memory Data Processing
PIG, HIVE-> Data Processing Services using Query (SQL-like)
HBase -> NoSQL Database
Mahout, Spark MLlib -> Machine Learning
Apache Drill -> SQL on Hadoop
Zookeeper -> Cluster Coordination
Oozie -> Job Scheduling
Flume, Sqoop -> Data Ingestion Services
Solr & Lucene -> Searching & Indexing
Ambari -> Provision, Monitor and Maintain cluster
