VTU Exam Question Paper With Solution of 18CS72 Big Data and Analytics Feb-2022 - Dr. V. Vijayalakshmi


USN

VTU Examination – February/March 2022

Sub: BIG DATA AND ANALYTICS Sub Code: 18CS72 Branch: ISE

Seventh Semester B.E Degree Examination, Feb/Mar 2022


Big Data Analytics
Module-1
1. a. Discuss the Evolution of Big Data.
Big Data is a high-volume, high-velocity and/or high-variety information asset
that requires new forms of processing to enable enhanced decision making,
insight discovery and process optimization.

b. Explain the characteristics of Big Data.


Characteristics of Big Data, called the 3Vs (a fourth V, veracity, is also used), are:
Volume - The phrase 'Big Data' contains the term big, which relates to the size of
the data and hence this characteristic. Size defines the amount or quantity of
data generated from an application or applications. The size determines the
processing considerations needed for handling that data.
Velocity - The term velocity refers to the speed of generation of data. Velocity
is a measure of how fast the data is generated and processed. To meet the
demands and challenges of processing Big Data, the velocity of data generation
plays a crucial role.
Variety - Big Data comprises a variety of data. Data is generated from
multiple sources in a system. This introduces variety into the data and therefore
introduces 'complexity'. Data exists in various forms and formats. The
variety is due to the availability of a large number of heterogeneous platforms
in the industry.
Veracity - Veracity is also considered an important characteristic; it takes into
account the quality of the data captured, which can vary greatly and affects the
accuracy of analysis.

c. With a neat block diagram, explain Data Architecture Design.


The block diagram shows the logical layers and the functions considered in a Big Data
architecture design: five vertically aligned boxes on the left show the layers,
and horizontal boxes show the functions in each layer.

Data processing architecture consists of five layers:


 Identification of data sources,
 Acquisition, ingestion, extraction, pre-processing, transformation of
data,
 Data storage at files, servers, cluster or cloud,
 Data-processing, and
 Data consumption
L1 considers the following aspects in a design:
• Amount of data needed at ingestion layer 2 (L2)
• Push from L1 or pull by L2, as per the mechanism suited to the usage
• Source data-types: Database, files, web or service
• Source formats, i.e., semi-structured, unstructured or structured.
L2 considers the following aspects:
• Ingestion and ETL processes either in real time, which means store and use
the data as generated, or in batches. Batch processing is using discrete
datasets at scheduled or periodic intervals of time.
L3 considers the following aspects:
• Data storage type (historical or incremental), format, compression, incoming
data frequency, querying patterns and consumption requirements for L4 or L5
• Data storage using the Hadoop distributed file system or NoSQL data stores such as
HBase, Cassandra or MongoDB.
L4 considers the following aspects:
• Data processing software such as MapReduce, Hive, Pig, Spark, Mahout and Spark
Streaming
• Processing in scheduled batches or real time or hybrid
• Processing as per synchronous or asynchronous processing requirements at
L5.
L5 considers the consumption of data for the following:
• Data integration
• Dataset usage for reporting and visualization
• Analytics (real time, near real time, scheduled batches), business processes (BPs),
business intelligence (BI), knowledge discovery
• Export of datasets to cloud, web or other systems.
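As a minimal illustration of how the five layers (L1-L5) fit together, the Python sketch below wires them into a tiny pipeline. The function names, the CSV input file and the use of a local file in place of HDFS/NoSQL and of a simple count in place of MapReduce/Spark are all illustrative assumptions, not part of the original answer.

```python
# Illustrative five-layer pipeline skeleton (all names and files are hypothetical).
import csv

def identify_source(path):            # L1: identification of the data source
    return open(path, newline="")

def ingest_and_clean(handle):         # L2: acquisition, ingestion, pre-processing
    return [row for row in csv.reader(handle) if row]   # drop empty rows

def store(rows, out_path):            # L3: data storage (local file stands in for HDFS/NoSQL)
    with open(out_path, "w", newline="") as f:
        csv.writer(f).writerows(rows)

def process(rows):                    # L4: data processing (a count stands in for MapReduce/Spark)
    return {"record_count": len(rows)}

def consume(result):                  # L5: data consumption - reporting/visualization
    print("Report:", result)

if __name__ == "__main__":
    src = identify_source("sales.csv")        # hypothetical input dataset
    rows = ingest_and_clean(src)
    store(rows, "sales_clean.csv")
    consume(process(rows))
```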
OR
2. a. Write notes on Analytics Scalability to Big Data and Massive Parallel
Processing Platforms.
Scalability enables an increase or decrease in the capacity of data storage, processing and
analytics. Scalability is the capability of a system to handle the workload as per the
magnitude of the work. System capability needs to grow with increased workload.
When the workload and complexity exceed the system capacity, the system is scaled up and scaled out.
 Analytics Scalability to Big Data
Vertical scalability means scaling up the given system's resources and increasing the
system's analytics, reporting and visualization capabilities. This is an additional way to solve
problems of greater complexity. Scaling up means designing the algorithm according to the
architecture that uses resources efficiently.
Horizontal scalability means increasing the number of systems working in
coherence and scaling out the workload. Processing different datasets of a large dataset
deploys horizontal scalability. Scaling out means using more resources and distributing the
processing and storage tasks in parallel. Alternative ways of scaling up and out the processing
of analytics software and Big Data analytics deploy massively parallel processing (MPP)
platforms, cloud, grid, clusters and distributed computing software.
 Massively Parallel Processing Platforms
Scaling uses parallel processing systems. Many programs are so large and/or complex that it
is impractical or impossible to execute them on a single computer
system, especially with limited computer memory. Here, it is required to scale up
the computer system or use massively parallel processing (MPP) platforms.
Parallelization of tasks can be done at several levels:
(i) distributing separate tasks onto separate threads on the same CPU,
(ii) distributing separate tasks onto separate CPUs on the same computer, and
(iii) distributing separate tasks onto separate computers.
The computational problem is broken into discrete pieces of sub-tasks that can be
processed simultaneously. The system executes multiple program instructions or sub-tasks
at any moment in time. Total time taken will be much less than with a single compute
resource.
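As a concrete sketch of level (ii) above (separate tasks on separate CPUs of the same computer), the following Python fragment breaks a computation into discrete sub-tasks and runs them in parallel with the standard multiprocessing module; the work function and the data are only stand-ins.

```python
# Sketch: break a problem into sub-tasks and process them simultaneously on multiple CPUs.
from multiprocessing import Pool

def subtask(chunk):
    # Stand-in work: sum of squares of one discrete piece of the data.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]      # four discrete pieces of the problem
    with Pool(processes=4) as pool:              # four worker processes (one per CPU core)
        partials = pool.map(subtask, chunks)     # sub-tasks execute in parallel
    print("total =", sum(partials))              # combine the partial results
```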
 Distributed Computing Model
A distributed computing model uses cloud, grid or clusters, which process and analyze big
and large datasets on distributed computing nodes connected by high-speed networks.
 Cloud Computing
Cloud computing is a type of Internet-based computing that provides shared processing
resources and data to computers and other devices on demand. One of the best
approaches for data processing is to perform parallel and distributed computing in a cloud-
computing environment. Cloud usage circumvents the single point of failure due to the failure of
one node. The cloud design performs as a whole; its multiple nodes perform automatically and
interchangeably. It offers high data security compared to other distributed technologies.
Cloud resources can be Amazon Web Services (AWS) Elastic Compute Cloud (EC2), Microsoft
Azure or Apache CloudStack, and Amazon Simple Storage Service (S3).
Cloud computing features are:
(i) on-demand service
(ii) resource pooling,
(iii) scalability,
(iv) accountability,
(v) broad network access.
Cloud services can be accessed from anywhere and at any time through the Internet. A local
private cloud can also be set up on a local cluster of computers. Cloud services are:
Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Software as a Service
(SaaS).
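As an illustration of using a cloud resource such as Amazon S3 programmatically, the sketch below uploads a dataset with the boto3 library. The bucket name, the local file name and the availability of configured AWS credentials are assumptions for the example.

```python
# Sketch: store a dataset in Amazon S3 (assumes boto3 is installed and AWS credentials are configured).
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="sales_clean.csv",         # hypothetical local dataset
    Bucket="example-analytics-bucket",  # hypothetical bucket name
    Key="datasets/sales_clean.csv",     # object key inside the bucket
)
print("uploaded")
```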
 Grid and Cluster Computing
Grid computing refers to distributed computing, in which a group of computers from several
locations are connected with each other to achieve a common task. The computer resources
are heterogeneous and geographically dispersed. A group of computers that might be spread
over remote locations comprises a grid. A grid is used for a variety of purposes, although a
single grid is, of course, dedicated to only one particular application at a time.
Cloud computing depends on sharing of resources (for example, networks, servers, storage,
applications and services) to attain coordination and coherence among resources, similar to
grid computing. Similarly, a grid also forms a distributed network for resource integration.
Drawbacks of Grid Computing: A drawback of grid computing is the single point of failure
that results from underperformance or failure of any of the participating nodes. A system's
storage capacity varies with the number of users, instances and the amount of data
transferred at a given time. Sharing resources among a large number of users helps in
reducing infrastructure costs and raising load capacities.

 Cluster Computing
A cluster is a group of computers connected by a network. The group works together to
accomplish the same task. Clusters are used mainly for load balancing. They shift processes
between nodes to keep an even load on the group of connected computers.

 Volunteer Computing
Volunteers provide computing resources to projects of importance that use resources to do
distributed computing and/or storage. Volunteer computing is a distributed computing
paradigm which uses computing resources of the volunteers. Volunteers are organizations
or members who own personal computers. Examples of such projects are science-related
projects executed by universities or academia in general.

b. Highlight Big Data Analytics applications with one case study.


Applications of Big Data Analytics:

 Big Data in Marketing and Sales


 Big Data Analytics in Detection of Marketing Frauds
 Big Data Risks
 Big Data Credit Risk Management
 Big Data and Algorithmic Trading
 Big Data and Healthcare
 Big Data in Medicine
 Big Data in Advertising
Module-2
3. a. What are the core components of Hadoop? Explain in brief its each
of its components.
b. Explain Hadoop Distributed File System.
HDFS COMPONENTS:
i. HDFS Block Replication
ii. HDFS Safe Mode
iii. Rack Awareness
iv. NameNode High Availability
v. HDFS NameNode Federation
vi. HDFS Checkpoints and Backups
vii. HDFS Snapshots
viii. HDFS NFS Gateway
 The design of HDFS is based on two types of nodes: a NameNode and multiple
DataNodes
 NameNode manages all the metadata needed to store and retrieve the actual data
from the DataNodes.
 No data is actually stored on the NameNode.
 The design is a master/slave architecture in which the master (NameNode) manages
the file system namespace and regulates access to files by clients.
 File system namespace operations such as opening, closing, and renaming files and
directories are all managed by the NameNode
 The NameNode also determines the mapping of blocks to DataNodes and handles
DataNode failures
 The NameNode manages block creation, deletion, and replication
 The slaves (DataNodes) are responsible for serving read and write requests from the
file system to the clients

i. HDFS Block Replication
When HDFS writes a file, it is replicated across the cluster. The amount of replication is
based on the value of dfs.replication in the hdfs-site.xml file. This default value can be
overruled with the hdfs dfs -setrep command. For Hadoop clusters containing more than
eight DataNodes, the replication value is usually set to 3.
ii. HDFS Safe Mode
When the NameNode starts, it enters a read-only safe mode where blocks cannot be
replicated or deleted.
iii. Rack Awareness
 Rack awareness deals with data locality.
 One of the main design goals of Hadoop MapReduce is to move the computation to the
data, assuming that most data center networks do not offer full bisection
bandwidth.
iv. NameNode High Availability
 The NameNode was a single point of failure that could bring down the entire
Hadoop cluster.
 NameNode hardware often employed redundant power supplies and storage to
guard against such problems, but it was still susceptible to other failures.
 The solution was to implement NameNode High Availability (HA) as a means to
provide true failover service.
v. HDFS NameNode Federation
Another important feature of HDFS is NameNode Federation. Older versions of HDFS
provided a single namespace for the entire cluster managed by a single NameNode.
Thus, the resources of a single NameNode determined the size of the namespace.
Federation addresses this limitation by adding support for multiple
NameNodes/namespaces to the HDFS file system.
vi. HDFS Checkpoints and Backups
 The NameNode stores the metadata of the HDFS file system in a file called
fsimage.
 File systems modifications are written to an edits log file, and at startup the
NameNode merges the edits into a new fsimage.
 The Secondary NameNode or CheckpointNode periodically fetches edits from
the NameNode, merges them, and returns an updated fsimage to the
NameNode.
vii. HDFS Snapshots
HDFS snapshots are similar to backups, but are created by administrators using the
hdfs dfs -createSnapshot command.
viii. HDFS NFS Gateway
The HDFS NFS Gateway supports NFSv3 and enables HDFS to be mounted as part of the
client's local file system. Users can browse the HDFS file system through their local file
systems on operating systems that provide an NFSv3 client.
OR
4. a. Define MapReduce Framework and its functions.
b. Write down the steps on the request to MapReduce and the types of
process in MapReduce.
c. Write short notes on Flume Hadoop Tool.

Module-3
5. a. Discuss the characteristics of NoSQL data store along with the
features in NoSQL transactions.
b. With neat diagrams, explain the following for Shared-Nothing Architecture
for Big Data tasks.
(i) Single Server Model

(ii) Sharding very large databases


(iii) Master-Slave distribution model.

(iv) Peer-to-Peer distribution model.

OR
6. a. Define key-value store with example. What are the advantages of
key-value store?
b. Write down the steps for a client to read and write values using a key-
value store. What are the typical uses of a key-value store?
Module-4
7. a. With a neat diagram, explain the process in MapReduce when a client
submits a job.

b. Explain Hive Integration and work flow steps involved with a diagram.

OR
8. a. Using HiveQL for the following:
(i) Create a table with partition.
(ii) Add, rename and drop a partition to a table.

b. What is PIG in Big Data? Explain the features of PIG.


Module-5
9. a. In Machine Learning explain linear and non-linear relationship with
essential graphs.

Linear and Non-Linear Relationship
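The original answer relies on graphs that are not reproduced here. As a minimal stand-in, the Python sketch below contrasts a linear relationship (constant rate of change, straight-line graph) with a non-linear one (changing rate of change, curved graph); the coefficients are illustrative.

```python
# Sketch: contrast a linear and a non-linear (quadratic) relationship between x and y.
import numpy as np

x = np.linspace(0, 10, 50)
y_linear = 3 * x + 2                 # linear: constant rate of change, straight-line graph
y_nonlinear = 0.5 * x ** 2 + 2       # non-linear: rate of change itself changes, curved graph

# Fitting a degree-1 polynomial recovers the slope and intercept of the linear relationship.
slope, intercept = np.polyfit(x, y_linear, 1)
print(slope, intercept)              # approximately 3.0 and 2.0
```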

b. Write the block diagram of text mining process and explain its phases.
Text mining refers to the process of deriving high-quality information from text.
Text mining (also referred to as text analytics) is an artificial intelligence (AI) technology that
uses natural language processing (NLP) to transform the free (unstructured) text in
documents and databases into normalized, structured data suitable for analysis or to drive
machine learning (ML) algorithms.
Phase 1:
Text pre-processing enables syntactic/semantic text analysis and does the following:
Text cleanup is the process of removing unnecessary or unwanted information.
Tokenization is the process of splitting the cleaned text into tokens (words) using white spaces
and punctuation marks as delimiters.
Part of Speech (POS) tagging is a method that attempts labeling of each token (word) with
an appropriate POS. Tagging helps in recognizing names of people, places, organizations and
titles.
Parsing is a method which generates a parse tree for each sentence. Parsing attempts to
infer the precise grammatical relationships between different words in a given sentence.
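A small Python sketch of these pre-processing steps, using the NLTK library (an assumed toolkit choice; the sample sentence and minimal cleanup are illustrative), is given below.

```python
# Sketch of Phase 1: cleanup, tokenization and POS tagging with NLTK.
# Assumes: pip install nltk, plus nltk.download("punkt") and
# nltk.download("averaged_perceptron_tagger") have been run once.
import nltk

text = "Big Data analytics helps enterprises discover insights quickly."
clean = text.strip()                       # text cleanup (minimal example)
tokens = nltk.word_tokenize(clean)         # tokenization on white space / punctuation
tagged = nltk.pos_tag(tokens)              # part-of-speech tag for each token
print(tagged)                              # e.g. [('Big', 'NNP'), ('Data', 'NNP'), ...]
```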
Phase 2:
Features Generation is a process which first defines features (variables, predictors). Some of
the ways of feature generation are: a text document is represented by the words it contains
(and their occurrences); document classification methods commonly use the bag-of-words
model.
Stemming identifies a word by its root. (i) It normalizes or unifies variations of the same
concept, such as speak for the three variations speaking, speaks, speaker, denoted by
[speaking, speaks, speaker → speak]. (ii) It removes plurals, normalizes verb tenses and
removes affixes.
Removing stop words from the feature space - these are common words unlikely to help
text mining. The search program tries to ignore stop words, for example a, at, for,
it, in and are.
Vector Space Model (VSM) is an algebraic model for representing text documents as vectors
of identifiers, word frequencies or terms in the document index. VSM uses the method of
term frequency-inverse document frequency (TF-IDF) and evaluates how important a
word is in a document.
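The bag-of-words representation with TF-IDF weighting and stop-word removal described above can be sketched with scikit-learn (an assumed library choice); the two sample documents are illustrative.

```python
# Sketch of Phase 2: bag-of-words features with TF-IDF weighting and stop-word removal.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the speaker speaks about big data analytics",
    "big data needs scalable storage and processing",
]
vectorizer = TfidfVectorizer(stop_words="english")   # drops common stop words such as 'the', 'and'
X = vectorizer.fit_transform(docs)                   # documents as vectors in the vector space model
print(vectorizer.get_feature_names_out())            # the generated features (terms)
print(X.toarray())                                   # TF-IDF weight of each term in each document
```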
Phase 3:
Features Selection is the process that selects a subset of features by rejecting irrelevant
and/or redundant features (variables, predictors or dimensions) according to defined criteria.
The feature selection process does the following:
Dimensionality reduction - Feature selection is one of the methods of dimensionality
reduction. The basic objective is to eliminate irrelevant and redundant
data. Redundant features are those which provide no extra information. Irrelevant features
provide no useful or relevant information in any context.
N-gram evaluation - finding the consecutive words of interest and extracting them.
Noise detection and evaluation of outliers - these methods identify unusual or
suspicious items, events or observations in the data set. This step helps in cleaning the
data.
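For the n-gram evaluation step, a short scikit-learn sketch (the library choice and the sample text are assumptions) extracts consecutive word pairs as candidate features.

```python
# Sketch of Phase 3: extracting bigrams (n = 2) as candidate features.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["big data analytics needs big data platforms"]
bigrams = CountVectorizer(ngram_range=(2, 2))     # consecutive two-word sequences only
counts = bigrams.fit_transform(docs)
print(bigrams.get_feature_names_out())            # e.g. ['analytics needs', 'big data', ...]
print(counts.toarray())                           # 'big data' occurs twice in the document
```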
Phase 4:
Data mining techniques enable insights about the structured database that resulted from
the previous phases. Examples of techniques are:
1. Unsupervised learning (for example, clustering) (i) The class labels (categories) of training
data are unknown (ii) Establish the existence of groups or clusters in the data
2. Supervised learning (for example, classification) (i) The training data is labeled indicating
the class (ii) New data is classified based on the training set.
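The two learning settings above can be sketched on TF-IDF text features; k-means clustering and naive Bayes classification are assumed, representative choices, not techniques prescribed by the original answer, and the four toy documents and their labels are illustrative.

```python
# Sketch of Phase 4: unsupervised (clustering) and supervised (classification) mining on text features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.naive_bayes import MultinomialNB

docs = ["cheap loan offer now", "meeting agenda attached",
        "win a free prize today", "project status report"]
labels = ["spam", "ham", "spam", "ham"]            # known classes for the supervised case

X = TfidfVectorizer().fit_transform(docs)

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)   # unsupervised: discover groups
print(clusters)

model = MultinomialNB().fit(X, labels)             # supervised: learn from labeled training data
print(model.predict(X[:1]))                        # classify a document against the training set
```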
Phase 5:
Analysing results: (i) Evaluate the outcome of the complete process. (ii) Interpretation of
results - if acceptable, the results obtained can be used as an input for the next set of
sequences; else, the result is discarded and one tries to understand what failed and why.
(iii) Visualization - prepare visuals from the data and build a prototype. (iv) Use
the results for further improvement in activities at the enterprise, industry or institution.
OR
10. a. Define multiple regression. Write down the examples involved in
forecasting and optimization using regression.

Least Square Estimation
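The original answer is not reproduced beyond the heading. For completeness, the following is the standard statement of the multiple regression model and its least-squares estimate, added here as a reference rather than taken from the original solution.

```latex
% Multiple regression: one dependent variable y explained by k independent variables.
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i

% Least-squares estimation: choose the coefficients that minimize the sum of squared errors;
% in matrix form, with X the design matrix (including a column of ones):
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ij} \Big)^2
            = (X^{\mathsf T} X)^{-1} X^{\mathsf T} y
```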


b. Explain the parameters in social graph network topological analysis using
centralities and PageRank.
Social Network Analysis (SNA) is the process of investigating social structures through the
use of networks and graph theory. It characterizes networked structures in terms of nodes
(individual actors, people, or things within the network) and the ties, edges, or links
(relationships or interactions) that connect them.
Examples of social structures commonly visualized through social network analysis include
social media networks, meme spread, information circulation, business networks, knowledge
networks, social networks and collaboration networks.
Nodes (A, B, C, D, E in the example) usually represent entities in the network, and can
hold self-properties (such as weight, size, position and any other attribute) and network-based
properties (such as degree - the number of neighbours, or cluster - the connected component
the node belongs to, etc.).
Edges represent the connections between the nodes, and might hold properties as well (such
as weight representing the strength of the connection, direction in case of an asymmetric relation,
or time if applicable). These two basic elements can describe multiple phenomena, such as
social connections, virtual routing networks, physical electricity networks, road networks,
biological relation networks and many other relationships.
PageRank:
The in-degree (visibility) of a link is a measure of the number of in-links from other links. The
out-degree (luminosity) of a link is the number of other links to which that link points.

Social Network as Graphs


Social networks as graphs provide a number of metrics for analysis. The metrics enable the
application of the graphs in a number of fields. Network topological analysis tools compute
the degree, closeness, betweenness, egonet, K-neighbourhood, top-K shortest paths,
PageRank, clustering and SimRank.
Centralities, Ranking and Anomaly Detection
Important metrics are degree (centrality), closeness (centrality), betweenness (centrality) and
eigenvector (centrality). Eigenvector centrality considers elements such as status, rank and other
properties. Social graph-network analytics discovers the degree of interactions, closeness,
betweenness, ranks, probabilities, beliefs and potentials.
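A minimal Python sketch of these topological metrics, using the networkx library (an assumed choice) on a toy five-node directed graph matching the A-E example above, is shown below; the edges are illustrative.

```python
# Sketch: degree, closeness, betweenness centralities and PageRank on a small social graph.
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([("A", "B"), ("B", "C"), ("C", "A"),
                  ("D", "C"), ("E", "C")])        # illustrative follow/link relationships

print(nx.in_degree_centrality(G))     # in-degree  ~ visibility (in-links to a node)
print(nx.out_degree_centrality(G))    # out-degree ~ luminosity (out-links from a node)
print(nx.closeness_centrality(G))     # how close a node is to all other nodes
print(nx.betweenness_centrality(G))   # how often a node lies on shortest paths
print(nx.pagerank(G))                 # PageRank score of each node
```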
