
BDA ANSWER BANK

1. Characteristics of Big Data 5V

DEFINITION:
Big data is an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, and information privacy.

(i) Volume
The name 'Big Data' itself is related to a size which is enormous. The size of data plays a very crucial role in determining the value that can be derived from it. Also, whether particular data can actually be considered Big Data or not depends upon the volume of data. Hence, 'Volume' is one characteristic which needs to be considered while dealing with 'Big Data'.
(ii) Variety
Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also being considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining and analyzing data.
(iii) Velocity
The term 'velocity' refers to the speed of generation of data. How fast the data is generated and processed to meet the demands determines the real potential in the data. Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks and social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.
(iv) Variability
This refers to the inconsistency which can be shown by the data at times, thus hampering the process of handling and managing the data effectively.
(v) Value
After taking the four V's into account, there comes one more V, which stands for Value. Bulk data having no value is of no good to the company unless it is turned into something useful. Data in itself is of no use or importance; it needs to be converted into something valuable in order to extract information.
2. Compare Big data and Traditional data
3. Diff between NoSQL and RDBMS
4. Explain main source of Big Data in real world (Importance of Big Data)

5. Apply the MapReduce algorithm to perform matrix multiplication of given matrices
6. Explain HDFS architecture
7. Explain & draw Hadoop Ecosystem
8. Short note on MongoDB
MongoDB, the most popular NoSQL database, is an open-source document-oriented database. The term 'NoSQL' means 'non-relational': MongoDB is not based on the table-like relational database structure but provides an altogether different mechanism for the storage and retrieval of data. This format of storage is called BSON (similar to the JSON format).
A simple MongoDB document Structure:
{
title: 'Geeksforgeeks',
by: 'Harshit Gupta',
url: 'https://www.geeksforgeeks.org',
type: 'NoSQL'
}
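A document like the one above can be created and queried from application code. Below is a minimal sketch, assuming a MongoDB instance running on localhost and the pymongo driver; the database and collection names (bda_demo, articles) are illustrative and not part of the original example.

from pymongo import MongoClient

# Connect to a (hypothetical) local MongoDB server.
client = MongoClient("mongodb://localhost:27017/")
db = client["bda_demo"]          # illustrative database name
articles = db["articles"]        # illustrative collection name

# Insert a document with the same shape as the example above.
articles.insert_one({
    "title": "Geeksforgeeks",
    "by": "Harshit Gupta",
    "url": "https://www.geeksforgeeks.org",
    "type": "NoSQL",
})

# Retrieve it back by a field value; the stored BSON comes back as a Python dict.
doc = articles.find_one({"type": "NoSQL"})
print(doc["title"])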

SQL databases store data in tabular format. This data is stored in a predefined data model, which is not flexible enough for today's rapidly growing real-world applications. Modern applications are more networked, social and interactive than ever, and they store ever more data and access it at higher rates.
Relational Database Management Systems (RDBMS) are not the correct choice when it comes to handling big data, by virtue of their design, since they are not horizontally scalable. If the database runs on a single server, it will reach a scaling limit. NoSQL databases are more scalable and provide superior performance. MongoDB is one such NoSQL database: it scales by adding more and more servers and increases productivity with its flexible document model.
Features of MongoDB:
● Document Oriented: MongoDB stores the main subject in the minimal
number of documents and not by breaking it up into multiple relational
structures like RDBMS. For example, it stores all the information of a
computer in a single document called Computer and not in distinct
relational structures like CPU, RAM, Hard disk, etc.
● Indexing: Without indexing, a database would have to scan every
document of a collection to select those that match the query, which
would be inefficient. So, for efficient searching, indexing is a must, and
MongoDB uses it to process huge volumes of data in very little time
(see the sketch after this list).
● Scalability: MongoDB scales horizontally using sharding (partitioning
data across various servers). Data is partitioned into data chunks using
the shard key, and these data chunks are evenly distributed across
shards that reside across many physical servers. Also, new machines
can be added to a running database.
● Replication and High Availability: MongoDB increases the data
availability with multiple copies of data on different servers. By providing
redundancy, it protects the database from hardware failures. If one
server goes down, the data can be retrieved easily from other active
servers which also had the data stored on them.
● Aggregation: Aggregation operations process data records and return
the computed results. This is similar to the GROUP BY clause in SQL. A
few aggregation expressions are sum, avg, min, max, etc.
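To make the Indexing and Aggregation points concrete, here is a small pymongo sketch; the orders collection, its customer and amount fields, and the connection URL are assumptions made for illustration only.

from pymongo import MongoClient, ASCENDING

orders = MongoClient("mongodb://localhost:27017/")["bda_demo"]["orders"]

# Indexing: an index on "customer" lets equality queries on that field
# avoid scanning every document in the collection.
orders.create_index([("customer", ASCENDING)])

# Aggregation: group documents by customer and compute totals,
# roughly analogous to GROUP BY with SUM/AVG in SQL.
pipeline = [
    {"$group": {
        "_id": "$customer",
        "total_spent": {"$sum": "$amount"},
        "avg_order": {"$avg": "$amount"},
    }}
]
for row in orders.aggregate(pipeline):
    print(row["_id"], row["total_spent"], row["avg_order"])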

Where do we use MongoDB?


MongoDB is preferred over RDBMS in the following scenarios:
● Big Data: If you have a huge amount of data to be stored in tables, think
of MongoDB before RDBMS databases. MongoDB has built-in support
for partitioning and sharding your database.
● Unstable Schema: Adding a new column in an RDBMS is hard, whereas
MongoDB is schema-less. Adding a new field does not affect old
documents and is very easy (see the sketch after this list).
● Distributed Data: Since multiple copies of data are stored across
different servers, recovery of data is instant and safe even if there is a
hardware failure.
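The schema-less behaviour mentioned under "Unstable Schema" can be illustrated with a short pymongo sketch; the computers collection and its fields are made up for this example.

from pymongo import MongoClient

computers = MongoClient("mongodb://localhost:27017/")["bda_demo"]["computers"]

# Documents in the same collection may have different fields; adding a "gpu"
# field to new documents does not require changing the old ones.
computers.insert_one({"cpu": "i5", "ram_gb": 8})
computers.insert_one({"cpu": "i7", "ram_gb": 16, "gpu": "RTX 3060"})

print(computers.count_documents({"gpu": {"$exists": True}}))   # only the newer document matches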
9. Explain Hadoop core component in detail
10. Solve the word count problem using MapReduce
11. Explain Hadoop physical architecture and limitations of Hadoop

I. Hadoop Cluster - Architecture, Core Components and Work-flow
1. The architecture of Hadoop Cluster
2. Core Components of Hadoop Cluster
3. Work-flow of How File is Stored in Hadoop

A. Hadoop Cluster
i. A Hadoop cluster is a special type of computational cluster designed for
storing and analyzing vast amounts of unstructured data in a distributed
computing environment.
ii. These clusters run on low-cost commodity computers.
iii. Hadoop clusters are often referred to as "shared nothing" systems because
the only thing that is shared between nodes is the network that connects
them.
iv. Large Hadoop clusters are arranged in several racks. Network traffic
between different nodes within the same rack is much more desirable than
network traffic across racks.
Hadoop cluster has 3 components:
1. Client
2. Master
3. Slave
1. Client:
i. It is neither master nor slave; rather, it plays the role of loading the data into
the cluster, submitting MapReduce jobs describing how the data should be
processed, and then retrieving the data to see the response after job completion
(a minimal sketch of this client interaction is given at the end of this section).
2. Masters:
The Masters consist of 3 components: NameNode, Secondary
NameNode and JobTracker.
i. NameNode:
➢ The NameNode does NOT store the files but only the files' metadata. In a later
section we will see that it is actually the DataNode which stores the files. The
NameNode oversees the health of the DataNodes and coordinates access to the
data stored in them.
➢ The NameNode keeps track of all filesystem-related information, such as:
✓ Which section of a file is saved in which part of the cluster
✓ Last access time for the files
✓ User permissions, such as which users have access to the file
ii. JobTracker:
➢ The JobTracker coordinates the parallel processing of data using MapReduce.
It accepts MapReduce jobs submitted by clients and schedules their map and
reduce tasks on the slave nodes.
iii. Secondary NameNode:
➢ The job of the Secondary NameNode is to contact the NameNode in a periodic
manner, after a certain time interval (by default, 1 hour).
➢ The NameNode, which keeps all filesystem metadata in RAM, has no capability
of its own to merge and persist that metadata onto disk. If the NameNode crashes,
everything held in RAM is lost and there is no backup of the filesystem metadata.
➢ What the Secondary NameNode does is contact the NameNode every hour and
pull a copy of the metadata information out of it. It merges this information into a
clean file and sends it back to the NameNode, while keeping a copy for itself.
➢ Hence, the Secondary NameNode is not a backup; rather, it does the job of
housekeeping.
➢ In case of NameNode failure, the saved metadata can be used to rebuild it easily.

3. Slaves:
i. Slave nodes are the majority of machines in a Hadoop cluster and are
responsible for storing the data and processing the computations.
ii. Each slave runs both a DataNode and a TaskTracker daemon, which
communicate with their respective masters.
iii. The TaskTracker daemon is a slave to the JobTracker, and the DataNode
daemon is a slave to the NameNode.
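As a rough illustration of the Client role described above, the following Python sketch loads a file into HDFS and asks where its blocks ended up. It assumes a running Hadoop cluster with the hdfs command-line tool on the PATH; the paths and file name are illustrative.

import subprocess

def run(cmd):
    # Print and execute one HDFS shell command.
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

# The client asks the NameNode where to write; the file's blocks themselves
# are streamed to DataNodes on the slave machines.
run(["hdfs", "dfs", "-mkdir", "-p", "/user/demo"])
run(["hdfs", "dfs", "-put", "sample.txt", "/user/demo/sample.txt"])

# Listing a directory only consults the NameNode's metadata.
run(["hdfs", "dfs", "-ls", "/user/demo"])

# fsck reports which DataNodes hold the replicated blocks of the file.
run(["hdfs", "fsck", "/user/demo/sample.txt", "-files", "-blocks", "-locations"])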
12. Explain relational algebra operations using MapReduce
13. Write a note on Combiner
