Super Important Questions For BDA
In processing Big Data, the velocity of data generation plays a crucial role.
2. With a neat labelled diagram, explain Big Data architecture design?
Answer: Techopedia defines Big Data architecture as follows: "Big Data architecture is the logical and/or physical layout structure of how Big Data will be stored, accessed and managed within a Big Data or IT environment. Architecture logically defines how the Big Data solution will work, the core components (hardware, database, software, storage) used, the flow of information, security and more."
Five vertically aligned textboxes on the left of Figure 1.2 show the layers; horizontal textboxes show the functions in each layer.
3. With a neat diagram, explain data
store export to cloud and cloud
services?
Answer:
4. Discuss the various case studies and applications of Big Data?
4. New products and innovations in services; enterprise data warehouse optimization.
5. An example of fraud is borrowing money on already mortgaged assets. An example of timely compliance is borrowers returning the loan and interest instalments on time.
5. Define: (i) Hadoop (ii) Mesos (iii) SQL and NoSQL (with features) (iv) DDBMS (v) In-memory column and row format data
Answer:- [i] Hadoop: Hadoop is a scalable and reliable parallel computing platform. Hadoop manages Big Data distributed databases. The figure below shows a Hadoop-based Big Data environment. Small cylinders represent MapReduce and big ones represent Hadoop.
[ii] Mesos: Mesos v0.9 is a resource-management platform which enables sharing of a cluster of nodes by multiple frameworks and which has compatibility with an open analytics stack [data processing (Hive, Hadoop, HBase, Storm), data management (HDFS)].
[iii] SQL & NoSQL: An RDBMS uses SQL (Structured Query Language). SQL is a language for viewing or changing (update, insert, append or delete) databases. It is a language for data access control, schema creation and data modification.
NoSQL: NoSQL databases hold semi-structured data. Big Data stores use NoSQL.
NoSQL stands for No SQL or Not Only SQL. The stores do not integrate with applications using SQL. NoSQL is also used in cloud data stores.
Features of NoSQL are as follows: NoSQL is a class of non-relational data storage systems with flexible data models and multiple schemas.
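The contrast can be sketched in Python (a minimal illustration, not tied to any particular Big Data store; the student records below are made up): the SQL side uses the built-in sqlite3 module and needs one fixed schema, while a plain list of dicts stands in for a NoSQL document store whose records may each follow their own schema.

```python
import sqlite3

# SQL: a fixed schema must be declared before inserting rows.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE student (id INTEGER, name TEXT, grade REAL)")
db.execute("INSERT INTO student VALUES (1, 'Asha', 8.5)")
rows = db.execute("SELECT name, grade FROM student").fetchall()
print(rows)  # [('Asha', 8.5)]

# NoSQL-style document store: each record may have different fields.
documents = [
    {"id": 1, "name": "Asha", "grade": 8.5},
    {"id": 2, "name": "Ravi", "grades": [7.0, 9.1], "hostel": "B"},
]
for doc in documents:
    print(doc.get("name"), doc.get("grade", doc.get("grades")))
```

Adding a new field to the SQL table would require a schema change (ALTER TABLE); the document store accepts the extra "hostel" field without any such step.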
[iv] DDBMS: A distributed DBMS (DDBMS) manages a distributed database: a collection of logically interrelated databases at multiple systems over a computer network. The features of a distributed database system are:
1. A collection of logically related databases.
2. Each user within the system may access all of the data within all of the databases as if they were a single database.
[v] In-memory column and row format data: A specific day's sales of five different chocolate flavours are stored in consecutive columns c to c+4 in memory. A single memory access loads the values of all five flavours at successive columns during online processing.
6. Explain various techniques for data pre-processing?
Answer:- Data pre-processing is an important step at the ingestion layer. For example, consider the grade-point data in Example 1.8: the outlier needs to be removed. Pre-processing is a must before data mining and analytics. Pre-processing is also a must before running a Machine Learning (ML) algorithm. Analytics needs prior screening of data quality as well. Data being exported to a cloud service or data store also needs pre-processing.
Pre-processing is needed for data store portability and usability in applications and services. The pre-processing needs are:
(i) Dropping out-of-range, inconsistent and outlier values
(ii) Filtering unreliable, irrelevant and redundant information
(iii) Data cleaning, editing, reduction and/or wrangling
(iv) Data validation, transformation or transcoding
(v) ELT processing
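Steps (i) and (ii) above can be sketched in Python (a minimal example; the grade-point values, valid range and 2-standard-deviation threshold are assumptions for illustration, not from the text):

```python
# Illustrative grade-point records; valid grade points lie in [0.0, 10.0].
raw = [7.2, 8.1, 99.0, 7.9, -3.0, 8.1, 8.4, 35.0]

# (i) Drop out-of-range and inconsistent values.
in_range = [g for g in raw if 0.0 <= g <= 10.0]

# (i) Drop outliers: values more than 2 standard deviations from the mean.
mean = sum(in_range) / len(in_range)
std = (sum((g - mean) ** 2 for g in in_range) / len(in_range)) ** 0.5
cleaned = [g for g in in_range if std == 0 or abs(g - mean) <= 2 * std]

# (ii) Filter redundant (duplicate) entries while preserving order.
deduped = list(dict.fromkeys(cleaned))
print(deduped)  # [7.2, 8.1, 7.9, 8.4]
```

The same filtering logic would typically run in the ingestion pipeline, before the data reaches the analytics or ML stage.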
MODULE 2
8. Describe the Hadoop core components with a diagram?
Answer:- Hadoop's two main components and their uses:
The diagram below depicts the basic components of the Hadoop framework developed by the Apache Software Foundation.
systems and general input/output. Serialization, Java RPC (Remote
Procedure Call), and file-based data structures are examples of this.
Answer:-
The system includes the application support layer and application layer components: Avro, ZooKeeper, Pig, Hive, Sqoop, Ambari, Chukwa, Mahout, Spark, Flink and Flume.
2. Resource-manager layer for job or application sub-tasks scheduling and
execution
3. Processing-framework layer, consisting of Mapper and Reducer for the
MapReduce process-flow
4. APIs at the application support layer
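The Mapper/Reducer process flow in the processing-framework layer can be sketched in plain Python (a conceptual word-count, not Hadoop API code; the input lines are invented): map emits key-value pairs, a shuffle groups them by key, and reduce aggregates each group.

```python
from collections import defaultdict

def mapper(line):
    # Emit (word, 1) for every word in the input split.
    return [(word, 1) for word in line.split()]

def reducer(key, values):
    # Aggregate all counts emitted for one key.
    return key, sum(values)

lines = ["big data needs hadoop", "hadoop processes big data"]

# Map phase.
pairs = [kv for line in lines for kv in mapper(line)]

# Shuffle/sort phase: group values by key.
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

# Reduce phase.
counts = dict(reducer(k, v) for k, v in groups.items())
print(counts["big"], counts["hadoop"])  # 2 2
```

In real Hadoop, the map and reduce tasks run in parallel on the cluster and the shuffle moves data between nodes; the dataflow, however, is the same.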
Answer:- The HDFS shell is not compliant with POSIX. Thus, the shell cannot interact exactly like Unix or Linux shells. Commands for interacting with files in HDFS take the form /bin/hdfs dfs <args>, where args stands for the command arguments. All Hadoop commands are invoked by the bin/hadoop script, for example: hadoop fsck / -files -blocks.
11. Write a note on the YARN-based execution model?
• A MasterNode is made up of two parts: (i) a Job History Server and (ii) a Resource Manager (RM).
• A Client Node submits the request for an application to the RM, which is the master. One RM exists per cluster. The RM keeps information on all the slave NMs (Node Managers): their location (rack awareness) and the number of resources (data blocks and servers) they have. The RM also renders the Resource Scheduler service, which decides how to assign the resources. It therefore performs resource management as well as scheduling.
• Each NM assigns container(s) for each ApplM (Application Master). The container(s) assigned at an instance may be at the same NM or another NM; an ApplM uses just a fraction of the resources available. The ApplM uses the assigned container(s) for running the application sub-tasks.
• The RM allots the resources to the AM, and thus to the ApplMs, for using the assigned containers on the same or other NMs for running the application sub-tasks in parallel.
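The request flow above can be summarized as a toy Python simulation (the class and method names are invented for illustration; this is not the YARN API): one RM tracks the NMs and grants containers from whichever nodes have free capacity.

```python
class NodeManager:
    """A slave node: holds a fixed number of free containers."""
    def __init__(self, name, free_containers):
        self.name = name
        self.free = free_containers

    def assign(self):
        # Hand out one container if any are free, else None.
        if self.free > 0:
            self.free -= 1
            return f"container@{self.name}"
        return None

class ResourceManager:
    """One RM per cluster: tracks all NMs and schedules resources."""
    def __init__(self, node_managers):
        self.nms = node_managers

    def allocate(self, n):
        # Scheduler: grant n containers from whichever NMs have room.
        granted = []
        for nm in self.nms:
            while len(granted) < n:
                c = nm.assign()
                if c is None:
                    break
                granted.append(c)
        return granted

rm = ResourceManager([NodeManager("nm1", 2), NodeManager("nm2", 3)])
# A client submits an application; the RM allots containers for its sub-tasks.
containers = rm.allocate(4)
print(containers)  # ['container@nm1', 'container@nm1', 'container@nm2', 'container@nm2']
```

As in the description above, the granted containers may land on different NMs, and each NM keeps whatever capacity was not assigned.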
• The centrally stored tables make administration easier when the data sources change during processing.
• The files, DataNodes and blocks need identification during processing at HDFS.
A NameNode stores the file's metadata. Metadata gives information about the file of a user application, but does not participate in the computations. The DataNode stores the actual data files in the data blocks.
A few nodes in a Hadoop cluster act as NameNodes. These nodes are termed MasterNodes or simply masters.
The slaves have lots of disk storage and moderate amounts of processing capability and DRAM. Slaves are responsible for storing the data and processing the computation tasks submitted by the clients.
Hadoop features:-
22
1. Fault-efficient, scalable, flexible and modular design which uses a simple and modular programming model. The system provides servers at high scalability. The system is scalable by adding new nodes to handle larger data. Hadoop proves very helpful in storing, managing, processing and analyzing Big Data. Modular functions make the system flexible. One can add or replace components with ease. Modularity allows replacing its components with a different software tool.
7. Java and Linux based: Hadoop uses Java interfaces. The Hadoop base is Linux but has its own set of shell commands support.
13. Explain any three essential Hadoop tools?
Answer:- Using Apache Pig
Apache Pig is a high-level language that enables programmers to write
complex MapReduce transformations using a simple scripting language.
Pig Latin defines a set of transformations on a data set such as aggregate, join,
and sort.
Pig is used for extract, transform, and load (ETL) data pipelines, quick research on raw data, and iterative data processing.
Apache Pig has several usage modes. The first is a local mode in which all
processing is done on the local machine.
The non-local (cluster) modes are MapReduce and Tez. These modes execute
the job on the cluster using either the MapReduce engine or the optimized Tez
engine.
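The aggregate, join, and sort transformations that Pig Latin expresses can be mimicked in plain Python (a conceptual sketch with made-up sales data; a real Pig script would compile to MapReduce or Tez jobs on the cluster):

```python
from collections import defaultdict

sales = [("blr", 10), ("del", 5), ("blr", 7), ("del", 3)]
cities = [("blr", "Bengaluru"), ("del", "Delhi")]

# GROUP / aggregate: total sales per city code.
totals = defaultdict(int)
for code, amount in sales:
    totals[code] += amount

# JOIN: attach the city name to each aggregated row.
names = dict(cities)
joined = [(names[code], total) for code, total in totals.items()]

# ORDER: sort by total, descending.
joined.sort(key=lambda row: row[1], reverse=True)
print(joined)  # [('Bengaluru', 17), ('Delhi', 8)]
```

In Pig Latin the same pipeline would be a handful of GROUP, FOREACH, JOIN and ORDER statements, which is exactly the convenience the scripting language provides over hand-written MapReduce.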
Hive is considered the de facto standard for interactive SQL queries over
petabytes of data using Hadoop and offers the following features:
Hive enables users to query the data on Hadoop clusters using SQL.
Hive makes it possible for programmers who are familiar with the MapReduce
framework to add their custom mappers and reducers to Hive queries.
Hive queries can be dramatically accelerated using the Apache Tez framework
under YARN in Hadoop version 2.
In version 1 of Sqoop, data were accessed using connectors written for specific
databases.
your RDBMS. Instead, version 2 offers more generalized ways to accomplish
these tasks.