VTU Exam Question Paper With Solution of 18CS72 Big Data and Analytics Feb-2022 - Dr. V. Vijayalakshmi
Sub: BIG DATA AND ANALYTICS Sub Code: 18CS72 Branch: ISE
Cluster Computing
A cluster is a group of computers connected by a network. The group works together to
accomplish the same task. Clusters are used mainly for load balancing. They shift processes
between nodes to keep an even load on the group of connected computers.
Volunteer Computing
Volunteer computing is a distributed computing paradigm that uses the computing resources of volunteers. Volunteers are organizations or individuals who own personal computers and donate computing and/or storage resources to projects of importance. Typical examples are science-related projects run by universities or academia in general.
The design of HDFS is based on two types of nodes: a NameNode and multiple
DataNodes.
i. HDFS Block Replication
When HDFS writes a file, it is replicated across the cluster. The replication factor is based on the value of dfs.replication in the hdfs-site.xml file. This default value can be overridden with the hdfs dfs -setrep command. For Hadoop clusters containing more than eight DataNodes, the replication value is usually set to 3.
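As a sketch, the cluster-wide replication factor is set in hdfs-site.xml; the value 3 below matches the usual default:

```xml
<!-- hdfs-site.xml: default replication factor for new files -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```

An individual file's replication can then be changed with hdfs dfs -setrep -w 2 /path/to/file (the path is illustrative).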
ii. HDFS Safe Mode
When the NameNode starts, it enters a read-only safe mode, where blocks cannot be replicated or deleted. It exits safe mode once a sufficient percentage of the file system's blocks has been reported as available by the DataNodes.
iii. Rack Awareness
Rack awareness deals with data locality.
One of the main design goals of Hadoop MapReduce is to move the computation to the data, because most data center networks do not offer full bisection bandwidth.
iv. NameNode High Availability
The NameNode was a single point of failure that could bring down the entire
Hadoop cluster.
NameNode hardware often employed redundant power supplies and storage to
guard against such problems, but it was still susceptible to other failures.
The solution was to implement NameNode High Availability (HA) as a means to
provide true failover service.
v. HDFS NameNode Federation
Another important feature of HDFS is NameNode Federation. Older versions of HDFS
provided a single namespace for the entire cluster managed by a single NameNode.
Thus, the resources of a single NameNode determined the size of the namespace.
Federation addresses this limitation by adding support for multiple
NameNodes/namespaces to the HDFS file system.
vi. HDFS Checkpoints and Backups
The NameNode stores the metadata of the HDFS file system in a file called
fsimage.
File system modifications are written to an edits log file, and at startup the
NameNode merges the edits into a new fsimage.
The Secondary NameNode or CheckpointNode periodically fetches edits from
the NameNode, merges them, and returns an updated fsimage to the
NameNode.
vii. HDFS Snapshots
HDFS snapshots are similar to backups, but are created by administrators using the hdfs dfs -createSnapshot command (after snapshots have been allowed on a directory with hdfs dfsadmin -allowSnapshot).
viii. HDFS NFS Gateway
The HDFS NFS Gateway supports NFSv3 and enables HDFS to be mounted as part of the client's local file system. Users can browse the HDFS file system through their local file system on any operating system that provides an NFSv3-compatible client.
OR
4. a. Define MapReduce Framework and its functions.
b. Write down the steps of a request to MapReduce and the types of
processes in MapReduce.
c. Write short notes on Flume Hadoop Tool.
Module-3
5. a. Discuss the characteristics of a NoSQL data store along with the
features of NoSQL transactions.
b. With neat diagrams, explain the following for the Shared-Nothing Architecture
for Big Data tasks.
(i) Single Server Model
OR
6. a. Define key-value store with example. What are the advantages of
key-value store?
b. Write down the steps for a client to read and write values using a key-
value store. What are the typical uses of a key-value store?
Module-4
7. a. With a neat diagram, explain the process in MapReduce when a client
submits a job.
b. Explain Hive integration and the workflow steps involved, with a diagram.
OR
8. a. Using HiveQL for the following:
(i) Create a table with partition.
(ii) Add, rename and drop a partition to a table.
b. Write the block diagram of text mining process and explain its phases.
Text mining refers to the process of deriving high-quality information from text.
Text mining (also referred to as text analytics) is an artificial intelligence (AI) technology that
uses natural language processing (NLP) to transform the free (unstructured) text in
documents and databases into normalized, structured data suitable for analysis or to drive
machine learning (ML) algorithms.
Phase 1:
Text pre-processing enables syntactic/semantic text analysis and does the following:
Text cleanup is a process of removing unnecessary or unwanted information.
Tokenization is a process of splitting the cleaned-up text into tokens (words) using white spaces
and punctuation marks as delimiters.
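As a minimal sketch of this step (the regular expression here is an illustrative choice, not a standard), tokenization can be done with Python's re module:

```python
import re

def tokenize(text):
    # Split cleaned-up text into word tokens, treating white space
    # and punctuation marks as delimiters.
    return re.findall(r"[A-Za-z0-9']+", text.lower())

print(tokenize("Text mining, also called text analytics, uses NLP."))
# ['text', 'mining', 'also', 'called', 'text', 'analytics', 'uses', 'nlp']
```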
Part of Speech (POS) tagging is a method that attempts labeling of each token (word) with
an appropriate POS. Tagging helps in recognizing names of people, places, organizations and
titles.
Parsing is a method which generates a parse tree for each sentence. Parsing attempts to infer the precise grammatical relationships between different words in a given sentence.
Phase 2:
Features Generation is a process which first defines features (variables, predictors). Some of
the ways of feature generation are: a text document is represented by the words it contains
(and their occurrences). Document classification methods commonly use the bag-of-words
model.
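A minimal sketch of the bag-of-words representation, using only the Python standard library:

```python
from collections import Counter

def bag_of_words(tokens):
    # Represent a document by the words it contains and their occurrences;
    # word order is deliberately discarded.
    return Counter(tokens)

doc = ["big", "data", "needs", "big", "clusters"]
print(bag_of_words(doc)["big"])  # 2
```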
Stemming identifies a word by its root. (i) Normalizes or unifies variations of the same
concept, such as speak for three variations, i.e., speaking, speaks, speaker, denoted by
[speaking, speaks, speaker -> speak]. (ii) Removes plurals, normalizes verb tenses and
removes affixes.
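A toy illustration of suffix stripping (this crude rule set is only a sketch; production systems use a full algorithm such as the Porter stemmer):

```python
def crude_stem(word):
    # Strip a few common suffixes so variations of the same concept
    # map to one root; only the first matching suffix is removed.
    for suffix in ("ing", "ers", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print([crude_stem(w) for w in ["speaking", "speaks", "speaker"]])
# ['speak', 'speak', 'speak']
```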
Removing stop words from the feature space: these are the common words, unlikely to help
text mining. The search program tries to ignore stop words; for example, it ignores a, at, for,
it, in and are.
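A sketch of stop-word removal using the example words from the text (the stop list here is deliberately tiny; real systems use much larger lists):

```python
STOP_WORDS = {"a", "at", "for", "it", "in", "are"}

def remove_stop_words(tokens):
    # Drop common words that are unlikely to help text mining.
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(["data", "in", "a", "cluster", "at", "scale"]))
# ['data', 'cluster', 'scale']
```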
Vector Space Model (VSM) is an algebraic model for representing text documents as vectors
of identifiers, such as word frequencies or terms in the document index. VSM uses the method of
term frequency-inverse document frequency (TF-IDF) to evaluate how important a
word is in a document.
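A minimal TF-IDF sketch over a toy corpus (raw counts for TF and an unsmoothed logarithm for IDF are illustrative choices; libraries often use smoothed variants):

```python
import math

def tf_idf(docs):
    # docs: list of tokenized documents.
    # tf = term count in a document; idf = log(N / document frequency).
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return [{t: doc.count(t) * math.log(n / df[t]) for t in set(doc)}
            for doc in docs]

docs = [["big", "data", "hadoop"], ["big", "data", "hive"], ["text", "mining"]]
weights = tf_idf(docs)
# A rare term ("hadoop") outweighs a common one ("big") in the same document.
```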
Phase 3:
Features Selection is the process that selects a subset of features by rejecting irrelevant
and/or redundant features (variables, predictors or dimension) according to defined criteria.
Feature selection process does the following:
Dimensionality reduction - Feature selection is one of the methods of dimensionality
reduction. The basic objective is to eliminate irrelevant and redundant
data. Redundant features are those which provide no extra information. Irrelevant features
provide no useful or relevant information in any context.
N-gram evaluation - finding sequences of consecutive words of interest and extracting them.
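The n-gram step can be sketched as:

```python
def ngrams(tokens, n):
    # Extract every run of n consecutive words from a token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(["big", "data", "and", "analytics"], 2))
# [('big', 'data'), ('data', 'and'), ('and', 'analytics')]
```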
Noise detection and evaluation of outliers methods do the identification of unusual or
suspicious items, events or observations from the data set. This step helps in cleaning the
data.
Phase 4:
Data mining techniques enable insights about the structured database that resulted from
the previous phases. Examples of techniques are:
1. Unsupervised learning (for example, clustering) (i) The class labels (categories) of training
data are unknown (ii) Establish the existence of groups or clusters in the data
2. Supervised learning (for example, classification) (i) The training data is labeled indicating
the class (ii) New data is classified based on the training set.
Phase 5:
Analysing results: (i) Evaluate the outcome of the complete process. (ii) Interpretation of
results - if acceptable, the results obtained can be used as input for the next set of
sequences; else, the result can be discarded, and one tries to understand what and why the
process failed. (iii) Visualization - prepare visuals from data, and build a prototype. (iv) Use
the results for further improvement in activities at the enterprise, industry or institution.
OR
10. a. Define multiple regression. Write down examples involving
forecasting and optimization in regression.