Super Important Questions For BDA


1. What is Big Data? Mention its characteristics and the classification of Big Data.
Answer:- Big Data is high-volume, high-velocity and/or high-variety information asset that requires new forms of processing for enhanced decision making, insight discovery and process optimization.
The rise in technology has led to the production and storage of voluminous amounts of data. Earlier megabytes (10^6 B) were used, but nowadays petabytes (10^15 B) are used for processing, analysing new facts and generating new knowledge.
Conventional systems for storage, processing and analysis pose challenges due to the large growth in volume of data, the variety of data in various forms and formats, the complexity, the faster generation of data and the need for quick processing, analysis and usage.
The figure below shows data usage and growth. As size and complexity increase, the proportion of unstructured data types also increases.
Big Data Characteristics
Characteristics of Big Data, called the 3Vs (and sometimes 4Vs), are:
 Volume: The phrase 'Big Data' contains the term big, which is related to the size of the data and hence the characteristic. Size defines the amount or quantity of data generated from an application(s). The size determines the processing considerations needed for handling that data.

 Velocity: The term velocity refers to the speed of generation of data. Velocity is a measure of how fast the data generates and processes. To meet the demands and challenges of processing Big Data, the velocity of generation of data plays a crucial role.

 Variety: Big Data comprises a variety of data. Data is generated from multiple sources in a system. This introduces variety in data and therefore introduces 'complexity'. Data consists of various forms and formats.

 Veracity: Veracity is also considered an important characteristic, to take into account the quality of data captured, which can vary greatly, affecting its accurate analysis.

Big Data Classification

Big Data can be classified on the basis of its characteristics, which are used for designing the data architecture for processing and analytics.

2. With a neat labelled diagram, explain Big Data architecture design.
Answer: Techopedia defines Big Data
architecture as follows: "Big Data architecture is
the logical and/or physical layout structure of how
Big Data will be stored, accessed and managed
within a Big Data or IT environment.

Architecture logically defines how Big Data solution
will work, the core components (hardware, database,
software, storage) used, flow of information, security
and more."

Characteristics of Big Data make designing Big Data architecture a complex process. Further, faster additions of new technological innovations increase the complexity in design.

The requirements for offering competing products at lower costs in the market make the designing task more challenging for a Big Data architect.

Data analytics needs a number of sequential steps. The Big Data architecture design task simplifies when using the logical-layers approach.

Figure 1.2 shows the logical layers and the functions which are considered in Big Data architecture.

Five vertically aligned textboxes on the left of Figure
1.2 show the layers. Horizontal textboxes show the
functions in each layer.

Data processing architecture consists of five layers:

(i) identification of data sources,
(ii) acquisition, ingestion, extraction, pre-processing and transformation of data,
(iii) data storage in files, servers, clusters or cloud,
(iv) data processing, and
(v) data consumption by a number of programs and tools.

3. With a neat diagram, explain data store export to cloud and cloud services?

Answer:-

4. Discuss the various case studies and applications of Big Data?

Answer:- Big Data Analytics Applications and Case Studies

Many applications such as social networks and social media, cloud applications, public and commercial web sites, scientific experiments, simulators and e-government services generate Big Data.

Big Data analytics finds applications in many areas. Some of the popular ones are marketing, sales, health care, medicines, advertising, etc. The following subsections describe these use cases, applications and case studies.

Big Data in Marketing and Sales

Data are important for most aspects of marketing, sales and advertising. Customer Value (CV) depends on three factors: quality, service and price. Big Data analytics deploys large volumes of data to identify and derive intelligence, using predictive models, about individuals. The facts enable marketing companies to decide what products to sell.

A definition of marketing is the creation, communication and delivery of value to customers. Customer (desired) value means what a customer desires from a product. Customer (perceived) value means what the customer believes to have received from a product after purchase of the product. Customer value analytics (CVA) means analyzing what a customer really needs. CVA makes it possible for leading marketers, such as Amazon, to deliver consistent customer experiences. Following are the five application areas, in order of the popularity of Big Data use cases:
1. CVA using the inputs of evaluated purchase patterns, preferences, quality, price and post-sales servicing requirements
2. Operational analytics for optimizing company operations
3. Detection of frauds and compliances
4. New products and innovations in service
5. Enterprise data warehouse optimization
An example of fraud is borrowing money on already mortgaged assets. An example of timely compliance is returning the loan and interest instalments by the borrowers.
5. Define: (i) Hadoop (ii) Mesos (iii) SQL and NoSQL (with features) (iv) DDBMS (v) In-memory column and row format data
Answer:- [i] Hadoop: Hadoop is a scalable and reliable parallel computing platform. Hadoop manages Big Data distributed databases.
The figure below shows a Hadoop-based Big Data environment. Small-height cylinders represent MapReduce and big ones represent Hadoop.

[ii] Mesos: Mesos v0.9 is a resource-management platform which enables sharing of a cluster of nodes by multiple frameworks and which has compatibility with an open analytics stack [data processing (Hive, Hadoop, HBase, Storm), data management (HDFS)].
[iii] SQL & NOSQL: An RDBMS uses SQL
(Structured Query Language). SQL is a
language for viewing or changing (update,
insert or append or delete) databases. It is a

11
language for data access control, schema
creation and data modifications.
NOSQL: NoSQL databases are considered as
semi-structured data. Big Data Store uses
NoSQL.
NOSQL stands for No sQL or Not Only SQL.
The stores do not integrate with applications
using SQL.
NoSQL is also used in cloud data store.
Features of NoSQL are as follows: NOSQL or
Not Only SQclass of non-relational data
storage systems, flexible data models and
multiple schemas.
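For illustration (not part of the textbook answer), a minimal Java sketch of SQL access through the standard JDBC API; the JDBC URL, credentials, table and column names are assumptions:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class SqlAccessSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical JDBC URL and credentials; any RDBMS with a JDBC driver works similarly.
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/salesdb", "user", "password")) {

            // Viewing rows (SQL SELECT).
            try (PreparedStatement ps = con.prepareStatement(
                    "SELECT item, qty FROM sales WHERE qty > ?")) {
                ps.setInt(1, 10);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("item") + " " + rs.getInt("qty"));
                    }
                }
            }

            // Changing rows (SQL UPDATE).
            try (PreparedStatement ps = con.prepareStatement(
                    "UPDATE sales SET qty = qty + 1 WHERE item = ?")) {
                ps.setString(1, "chocolate");
                ps.executeUpdate();
            }
        }
    }
}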
[iv] DDBMS: A distributed DBMS (DDBMS) is a collection of logically interrelated databases distributed over multiple systems on a computer network. The features of a distributed database system are:
1. A collection of logically related databases.

2. Cooperation between databases in a transparent manner. Transparent means that each user within the system may access all of the data within all of the databases as if they were a single database.

3. Should be 'location independent', which means the user is unaware of where the data is located, and it is possible to move the data from one physical location to another without affecting the user.
[v] In-Memory Column Format Data
A columnar in-memory format allows faster data retrieval when only a few columns in a table need to be selected during query processing or aggregation. Data in a column are kept together in-memory in columnar format. A single memory access, therefore, loads many values of the column.
In-Memory Row Format Data
A row in-memory format allows much faster data processing during OLTP (online transaction processing). Each row record has corresponding values in multiple columns, and the values store at consecutive memory addresses in row format.

A specific day's sales of five different chocolate flavours are stored in consecutive columns starting at memory address c. A single memory access loads the values of all five flavours at successive columns during online processing.
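For illustration (an assumption-based sketch, not from the text), plain Java arrays can mimic the two layouts; the flavour values below are made up:

public class RowVsColumnSketch {
    public static void main(String[] args) {
        // Row format: one record (day) keeps its five flavour values at
        // consecutive positions, which suits OLTP-style record access.
        int[][] rowStore = {
            {12, 7, 9, 4, 15},   // day 1: sales of five flavours
            {10, 8, 6, 5, 11}    // day 2
        };
        int day1Total = 0;
        for (int v : rowStore[0]) day1Total += v;   // one row read covers a whole record

        // Column format: all values of one flavour are kept together,
        // which suits queries and aggregations that touch only a few columns.
        int[] flavour0Column = {12, 10};            // flavour 0 across all days
        int flavour0Total = 0;
        for (int v : flavour0Column) flavour0Total += v;

        System.out.println(day1Total + " " + flavour0Total);
    }
}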
6. Explain various techniques for data pre-processing?
Answer:- Data pre-processing is an important step at the ingestion layer. For example, consider the grade-point data in Example 1.8; the outlier needs to be removed. Pre-processing is a must before data mining and analytics. Pre-processing is also a must before running a Machine Learning (ML) algorithm. Analytics also needs prior screening of data quality. Data being exported to a cloud service or data store needs pre-processing for data store portability and usability in applications and services. Pre-processing needs are:
(i) dropping out-of-range, inconsistent and outlier values,
(ii) filtering unreliable, irrelevant and redundant information,
(iii) data cleaning, editing, reduction and/or wrangling,
(iv) data validation, transformation or transcoding, and
(v) ELT processing.
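A minimal Java sketch of steps (i) and (ii) above; the grade-point values and the valid range of 0 to 10 are assumptions, not taken from Example 1.8:

import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class PreprocessSketch {
    public static void main(String[] args) {
        // Raw grade points with an out-of-range outlier (99.0) and a duplicate entry.
        List<Double> raw = List.of(8.5, 7.2, 99.0, 7.2, 9.1);

        // (i) Drop out-of-range/outlier values, assuming a valid range of 0 to 10.
        // (ii) Filter redundant (duplicate) values while keeping the original order.
        Set<Double> cleaned = raw.stream()
                .filter(g -> g >= 0.0 && g <= 10.0)
                .collect(Collectors.toCollection(LinkedHashSet::new));

        System.out.println(cleaned);   // prints [8.5, 7.2, 9.1]
    }
}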

7. Difference between distributed computing, grid computing and cluster computing?
Answer:-

MODULE 2
8. Describe the Hadoop core components with a diagram?
ANSWER:- Hadoop's main components and their uses:
The diagram below depicts the basic components of the Hadoop framework developed by the Apache Software Foundation.

The framework's Hadoop core components are as follows:

1. Hadoop Common: The Hadoop Common module contains the libraries and utilities required by the other Hadoop modules. Hadoop Common, for example, includes a number of components and interfaces for distributed file systems and general input/output. Serialization, Java RPC (Remote Procedure Call), and file-based data structures are examples of this.

2. Hadoop Distributed File System (HDFS): A distributed file system based on Java that can store all types of data on cluster disks.

3. Hadoop MapReduce v1: A software programming model based on the Mapper and Reducer. The v1 framework parallelizes and batch-processes large amounts of data.

4. YARN: Compute-resource management software. The user application tasks or sub-tasks run in parallel in Hadoop, which uses scheduling and handles resource requests in distributed task execution.

5. Hadoop 2: A MapReduce system based on YARN that allows for parallel processing of large datasets as well as distributed processing of application tasks.

Figure 2.1 Core components of Hadoop
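For illustration of the Mapper and Reducer programming model named in components 3 and 5, a minimal word-count sketch using the standard Hadoop Java MapReduce API (class names and input/output paths are illustrative):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Mapper: emits (word, 1) for every word in an input line.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) { word.set(token); ctx.write(word, ONE); }
            }
        }
    }

    // Reducer: sums the counts emitted for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. an HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}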

9. Explain, using a diagram, the Hadoop ecosystem components and features?

Answer:-

Figure: Hadoop main components and ecosystem components


The Hadoop ecosystem refers to a combination of technologies. The Hadoop ecosystem consists of its own family of applications which tie up together with Hadoop. The system components support the storage, processing, access, analysis, governance, security and operations for Big Data.

The system includes the application support layer and application layer components: AVRO, ZooKeeper, Pig, Hive, Sqoop, Ambari, Chukwa, Mahout, Spark, Flink and Flume.

The four layers are as follows:

1. Distributed storage layer
2. Resource-manager layer for job or application sub-tasks scheduling and execution
3. Processing-framework layer, consisting of Mapper and Reducer for the MapReduce process-flow
4. APIs at application support layer

The codes communicate and run using MapReduce or YARN at the processing-framework layer. Reducer outputs communicate to the APIs. AVRO enables data serialization between the layers.

ZooKeeper enables coordination among layer components.
Client hosts run applications using Hadoop ecosystem projects, such as Pig, Hive and Mahout.
Hadoop uses Java programming. Such Hadoop programs run on any platform with the Java virtual-machine deployment model.
HDFS is a Java-based distributed file system that can store various kinds of data on the computers of a cluster.

10. What is HDFS? List the different commands in HDFS?

Answer:- HDFS (Hadoop Distributed File System) is a Java-based distributed file system that stores all types of data on the disks of the cluster nodes. The HDFS shell is not POSIX-compliant; thus, the shell cannot be used in exactly the same way as a Unix or Linux shell. Commands for interacting with the files in HDFS take the form /bin/hdfs dfs <args>, where args stands for the command arguments. All Hadoop commands are invoked by the bin/hadoop script. Commonly used dfs arguments include -ls, -mkdir, -put, -get, -cat, -cp, -mv and -rm. For example, hadoop fsck / -files -blocks reports on the files and blocks in the file system.
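Besides the shell, HDFS can also be accessed programmatically. A minimal sketch using the Hadoop FileSystem Java API; the paths are hypothetical and the cluster configuration (core-site.xml) is assumed to be on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSketch {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        fs.mkdirs(new Path("/user/demo"));                                   // like: hdfs dfs -mkdir
        fs.copyFromLocalFile(new Path("data.txt"), new Path("/user/demo"));  // like: hdfs dfs -put

        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {    // like: hdfs dfs -ls
            System.out.println(status.getPath() + " " + status.getLen());
        }
        fs.close();
    }
}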

11. Write a note on the YARN-based execution model?

Answer:- The figure shown below is of the YARN-based execution model. The figure shows the YARN components: Client, Resource Manager (RM), Node Manager (NM), Application Master (AM) and Containers.

The list of actions of YARN's resource allocation and scheduling functions is as follows:

• A MasterNode is made up of two parts: (i) a Job History Server and (ii)
a Resource Manager (RM).

• A Client Node submits the request for an application to the RM, which is the master. One RM exists per cluster. The RM keeps information on all the slave NMs. Information is about the location (Rack Awareness) and the number of resources (data blocks and servers) they have. The RM also renders the Resource Scheduler service that decides how to assign the resources. It, therefore, performs resource management as well as scheduling.

• Multiple NMs are in a cluster. An NM creates an AM instance (AMI) and starts it. The AMI initializes itself and registers with the RM. Multiple AMIs can be created in an AM.

• The AMI performs the role of an Application Manager (ApplM) that estimates the resource requirements for running an application program or sub-task. The ApplMs send their requests for the necessary resources to the RM. Each NM includes several containers for use by the subtasks of the application.

• NM is a slave of the infrastructure. It signals whenever it initializes. All active NMs send a controlling signal periodically to the RM, signaling their active state.

• Each NM assigns container(s) for each AMI. The container(s) assigned to an instance may be at the same NM or at another NM, and an ApplM may use just a fraction of the resources available. The ApplM, for instance, uses the assigned container(s) for running the application sub-task.

• RM allots the resources to AM, and thus to ApplMs for using assigned
containers on the same or other NM for running the application subtasks
in parallel.


Fig:- YARN Based execution model
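For illustration (not part of the textbook answer), a minimal Java sketch of a client talking to the RM through the YarnClient API; it assumes yarn-site.xml with the RM address is on the classpath and simply lists the applications the RM is tracking:

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnClientSketch {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();   // reads yarn-site.xml (RM address)
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the Resource Manager for the applications it is currently tracking.
        for (ApplicationReport report : yarnClient.getApplications()) {
            System.out.println(report.getApplicationId() + " " + report.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}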

12. Describe the Hadoop physical organization? Write the features of Hadoop?

Answer:- The conventional file system uses directories. A directory consists of folders. A folder consists of files. When data is processed, the data sources are identified by pointers to the resources. A data dictionary stores the resource pointers. Master tables of the dictionary are stored at a central location. The centrally stored tables make administration easier when the data sources change during processing.

The files, DataNodes and blocks need identification during processing at HDFS. HDFS uses NameNodes and DataNodes.

A NameNode stores the file's metadata. Metadata gives information about the user application's files, but does not participate in the computations.

The DataNode stores the actual data files in the data blocks.

A few nodes in a Hadoop cluster act as NameNodes. These nodes are termed MasterNodes or simply masters.

The masters have a different configuration, supporting high DRAM and processing power. The masters have much less local storage.

The majority of the nodes in a Hadoop cluster act as DataNodes and TaskTrackers. These nodes are referred to as slave nodes or slaves.

The slaves have lots of disk storage and moderate amounts of processing capability and DRAM. Slaves are responsible for storing the data and processing the computation tasks submitted by the clients.

Figure: The client, master NameNode, MasterNodes and slave nodes

Hadoop features:-
1. Fault-efficient, scalable, flexible and modular design which uses a simple and modular programming model. The system provides servers at high scalability. The system is scalable by adding new nodes to handle larger data. Hadoop proves very helpful in storing, managing, processing and analyzing Big Data. Modular functions make the system flexible. One can add or replace components with ease. Modularity allows replacing its components with a different software tool.

2. Robust design of HDFS: Execution of Big Data applications continues even when an individual server or cluster fails. This is because Hadoop provides backup and a data recovery mechanism. HDFS thus has high reliability.

3. Store and process Big Data: Processes Big Data of 3V characteristics.

4. Distributed cluster computing model with data locality: Processes Big Data at high speed as the application tasks and sub-tasks submit to the DataNodes. One can achieve more computing power by increasing the number of computing nodes. The processing splits across multiple DataNodes (servers), and thus gives fast processing and aggregated results.

5. Hardware fault-tolerant: A fault does not affect data and application processing. If a node goes down, the other nodes take care of the residue. This is due to multiple copies of all data blocks which replicate automatically. Default is three copies of data blocks.

6. Open-source framework: Open-source access and cloud services enable large data stores. Hadoop uses a cluster of multiple inexpensive servers or the cloud.

7. Java and Linux based: Hadoop uses Java interfaces. The Hadoop base is Linux but has its own set of shell commands support.
13. Explain any three Essential Hadoop tools?

Answer:- Using Apache Pig
Apache Pig is a high-level language that enables programmers to write
complex MapReduce transformations using a simple scripting language.

Pig Latin defines a set of transformations on a data set such as aggregate, join,
and sort.

Pig is used to extract, transform, and load (ETL) data pipelines, quick research
on raw data, and iterative data processing.

Apache Pig has several usage modes. The first is a local mode in which all
processing is done on the local machine.

The non-local (cluster) modes are MapReduce and Tez. These modes execute
the job on the cluster using either the MapReduce engine or the optimized Tez
engine.

Figure: Apache Pig Usage Modes
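For illustration (an assumption-based sketch), Pig Latin can also be embedded in Java through the PigServer class; the input file, field names and script below are made up, and local mode is used so everything runs on the local machine:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigLocalSketch {
    public static void main(String[] args) throws Exception {
        // Local mode: all processing is done on the local machine.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Hypothetical Pig Latin script: load, filter and group a sales file.
        pig.registerQuery("sales = LOAD 'sales.csv' USING PigStorage(',') "
                + "AS (item:chararray, qty:int);");
        pig.registerQuery("big = FILTER sales BY qty > 10;");
        pig.registerQuery("grouped = GROUP big BY item;");

        // Store the result; in MapReduce or Tez mode the same script runs on the cluster.
        pig.store("grouped", "grouped_sales_out");
        pig.shutdown();
    }
}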

Using Apache Hive

Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, ad hoc queries, and the analysis of large data sets using a SQL-like language called HiveQL.

Hive is considered the de facto standard for interactive SQL queries over
petabytes of data using Hadoop and offers the following features:

 Tools to enable easy data extraction, transformation, and loading (ETL)
 A mechanism to impose structure on a variety of data formats
 Access to files stored either directly in HDFS or in other data storage systems such as HBase
 Query execution via MapReduce and Tez

Hive enables users to query the data on Hadoop clusters using SQL.

Hive makes it possible for programmers who are familiar with the MapReduce
framework to add their custom mappers and reducers to Hive queries.

Hive queries can be dramatically accelerated using the Apache Tez framework
under YARN in Hadoop version 2.
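For illustration (not part of the textbook answer), a minimal Java sketch that submits a HiveQL query through the HiveServer2 JDBC driver; the host, port, credentials and table are assumptions:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; hostname, port and credentials are assumptions.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement();
             // Hypothetical table; the HiveQL is translated into MapReduce or Tez jobs.
             ResultSet rs = stmt.executeQuery(
                 "SELECT item, SUM(qty) FROM sales GROUP BY item")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " " + rs.getLong(2));
            }
        }
    }
}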

Using Apache Sqoop to Acquire Relational Data

Sqoop is a tool designed to transfer data between Hadoop and relational databases.

Sqoop can be used with any Java Database Connectivity (JDBC)-compliant database and has been tested on Microsoft SQL Server, PostgreSQL, MySQL, and Oracle.

In version 1 of Sqoop, data were accessed using connectors written for specific
databases.

Version 2 does not support specialized connectors or version 1's direct data transfer from an RDBMS to Hive or HBase, or data transfer from Hive or HBase to your RDBMS. Instead, version 2 offers more generalized ways to accomplish these tasks.
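For illustration (an assumption-based sketch), the Sqoop 1 command line can be driven from Java with ProcessBuilder; the JDBC URL, credentials, table and target directory are made up, and the sqoop binary is assumed to be on the PATH:

import java.io.IOException;

public class SqoopImportSketch {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Import one relational table into HDFS using the sqoop CLI.
        ProcessBuilder pb = new ProcessBuilder(
                "sqoop", "import",
                "--connect", "jdbc:mysql://dbhost:3306/salesdb",   // hypothetical RDBMS
                "--username", "user", "--password", "password",
                "--table", "sales",
                "--target-dir", "/user/demo/sales");
        pb.inheritIO();                       // show sqoop's output on this console
        int exit = pb.start().waitFor();
        System.out.println("sqoop exited with code " + exit);
    }
}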

14. Classify the MapReduce framework (also learn process placement) and programming model?
Answer:-
