
1) a) Discuss role of combiners with example.

(5)

The combiner is a mechanism that aggregates data on the map side in order to cut down the volume of data sent to the reducer. It is a map-side optimization: the combiner function is invoked with a group of map output values that share the same output key. The combiner is called while map output is written to disk, in both the spill and merge phases, and a sorting step in both phases groups the values by key before the combiner is invoked so it can aggregate them effectively. A typical example is word count, where the combiner sums the per-word counts produced by a single map task before they are shuffled to the reducers, as sketched below.
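A minimal sketch of this idea in plain Python (an in-memory simulation for a word-count job, not the actual Hadoop combiner API):

```python
from collections import defaultdict

def map_phase(line):
    # Emit (word, 1) for every word in the input line.
    return [(word, 1) for word in line.split()]

def combine(pairs):
    # Map-side aggregation: sum the values for each key locally,
    # so fewer records are written out and shuffled to the reducer.
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())

mapped = map_phase("big data big cluster data data")   # 6 records
combined = combine(mapped)                              # 3 records
print(len(mapped), "map records shrink to", len(combined), "after combining")
```

Here the combiner performs the same operation as the reducer (summing counts), which is why a reducer can often be reused as the combiner when the operation is commutative and associative.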

1) b) Explain Hadoop core Components. (5)

(List And Explain)

Hadoop Common Package

Hadoop Distributed File System (HDFS)

HDFS is a distributed file system that provides a limited interface for managing the file system, which allows it to scale and provide high throughput. HDFS creates multiple replicas of each data block and distributes them on computers throughout the cluster to enable reliable and rapid access. When a file is loaded into HDFS, it is fragmented into "blocks" of data which are replicated and stored across the cluster nodes; these cluster nodes are called DataNodes. The NameNode is responsible for storing and managing the metadata, so that when MapReduce or another execution framework asks for the data, the NameNode informs it where the required data resides.

1. HDFS creates multiple replicas of data blocks for reliability, placing them on the computer
nodes around the cluster.

2. Hadoop’s target is to run on clusters of the order of 10,000 nodes.

3. A file consists of many 64 MB blocks; a small sizing illustration is given below.
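A rough sizing illustration (the file size and replication factor here are assumed figures, not values from the question):

```python
import math

# Assumed figures: a 1 GB file, 64 MB blocks, default replication factor 3.
file_size_mb = 1024
block_size_mb = 64
replication = 3

blocks = math.ceil(file_size_mb / block_size_mb)   # 16 blocks
replicas = blocks * replication                    # 48 block copies spread over DataNodes
print(blocks, "blocks,", replicas, "block replicas stored in the cluster")
```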

MapReduce

The MapReduce algorithm aids in parallel processing and basically comprises two sequential phases: map and reduce.

1. In the map phase, a set of key–value pairs forms the input and over each key–value pair, the
desired function is executed so as to generate a set of intermediate key–value pairs.

2. In the reduce phase, the intermediate key–value pairs are grouped by key and the values are
combined together according to the reduce algorithm provided by the user. Sometimes no reduce
phase is required, given the type of operation coded by the user.

MapReduce processes are divided between two applications, JobTracker and TaskTracker at the
cluster level. JobTracker is responsible for scheduling job runs and managing computational
resources across the cluster; hence it runs on only one node of the cluster. Each MapReduce job
is split into a number of tasks which are assigned to the various TaskTrackers depending on
which data is stored on that node. So TaskTracker runs on every slave node in the cluster.
JobTracker oversees the progress of each TaskTracker as they complete their individual tasks.

The main components of MapReduce are listed below:

1. JobTracker: The JobTracker is the master which manages the jobs and resources in the cluster. It tries to schedule each map task on the TaskTracker that is running on the same DataNode as the underlying data block.

2. TaskTrackers: TaskTrackers are slaves which are deployed on each machine in the cluster.
They are responsible for running the map and reduce tasks as instructed by the JobTracker.

3. JobHistoryServer: JobHistoryServer is a daemon that saves historical information about completed tasks/applications.

Yet Another Resource Negotiator (YARN)

YARN addresses problems with MapReduce 1.0's architecture, specifically those faced by the JobTracker service. Hadoop clusters can grow to tens of thousands of nodes, and MapReduce 1.0 had issues with scalability, memory usage, synchronization, and a Single Point of Failure (SPOF). As a result, YARN became another core component of Apache Hadoop.

It splits the two major functionalities of the JobTracker, "resource management" and "job scheduling and monitoring", into two separate daemons: a global Resource Manager (RM) and a per-application ApplicationMaster (AM). Thus, instead of having a single node handle both scheduling and resource management for the entire cluster, YARN distributes this responsibility across the cluster. The RM and the NodeManagers manage the applications in a distributed manner. The RM arbitrates resources among all the applications in the system, while the per-application AM negotiates resources from the RM and works with the NodeManager(s) to execute and monitor the component tasks.

1) c) Explain sharding in database. (5)

Sharding is a method for distributing a single dataset across multiple databases, which can then be stored on multiple machines. This allows a large dataset to be split into smaller chunks stored in multiple data nodes, increasing the total storage capacity of the system. Each partition forms a shard, meaning a small part of the whole. Each shard can be located on a separate database server or at any physical location.
Need for Sharding: Consider a very large database that has not been sharded. For example, take the database of a college in which all student records (present and past) are maintained in a single database, say 100,000 records. Each time we need to find a student, the lookup may have to scan on the order of 100,000 records, which is very costly. Now consider the same student records divided into smaller data shards based on year. Each shard then holds only around 1,000-5,000 student records, so the database becomes much more manageable and the cost of each lookup drops by a large factor. This is what sharding achieves and why it is needed; a small routing sketch is given below.
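A minimal sketch of this idea, assuming range-based sharding on the year of admission (the shard names, year ranges, and in-memory lists below are purely illustrative and not tied to any particular database product):

```python
# Each list stands in for a separate database node holding one shard.
shards = {
    "shard_2018_2019": [],
    "shard_2020_2021": [],
    "shard_2022_2023": [],
}

def shard_for(year):
    # Route a record to the shard that owns its year range.
    if year <= 2019:
        return "shard_2018_2019"
    if year <= 2021:
        return "shard_2020_2021"
    return "shard_2022_2023"

def insert(student):
    shards[shard_for(student["year"])].append(student)

def find(name, year):
    # Only one shard is scanned instead of the whole student table.
    return [s for s in shards[shard_for(year)] if s["name"] == name]

insert({"name": "Asha", "year": 2021})
print(find("Asha", 2021))
```

Because the lookup is routed by year, it touches only the few thousand records of one shard rather than all 100,000 records.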

Advantages of sharding

(list)

1) d) List and explain various data stream sources. (5)

A data stream is a flow of data which arrives at uncertain intervals of time. Various data stream sources are: (Explain any five)
● Real-time stock trades
● Marketing, sales, and business analytics
● Customer/user activity
● Monitoring and reporting on internal IT systems
● Log Monitoring: Troubleshooting systems, servers, devices, and more
● SIEM (Security Information and Event Management): analyzing logs and real-time event
data for monitoring, metrics, and threat detection
● Retail/warehouse inventory: inventory management across all channels and locations, and
providing a seamless user experience across all devices
● Sensor data
● Data streams related to images
● Internet services and web services traffic

2) a) Explain working of different phases of Map Reduce with one example? (10)

MapReduce can be used to write applications that process large amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner. It is both a processing technique and a programming model for distributed computing, based on the Java programming language (the Java-based Hadoop framework). The term "MapReduce" refers to two separate and distinct tasks that Hadoop programs perform. The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed after the map job.
Discuss Each in Detail

The Map Tasks
Grouping by Key
The Reduce Tasks

Example: The figure below shows word count using the MapReduce algorithm.

Workflow
1. Splitting – The splitting parameter can be anything, e.g., splitting by space, comma, semicolon, or even by a new line ('\n'). First, in the map stage, the input data is split and distributed across the cluster.
2. Mapping – The Map function converts the elements into zero or more <key, value> pairs for every word. These pairs show how many times a word occurs: the word is the key and the count is the value.
3. Shuffling – After input splitting and mapping complete, the outputs of every map task are shuffled. This is the first step of the Reduce stage. The shuffle step ensures that the keys (Bus, Car, Train, Plane in the example) are sorted for the reduce step. This process groups the values by key into <key, value-list> pairs.
4. Reduce – The reduce tasks add all the values for each key. Similar to the map stage, all reduce tasks run at the same time and work independently. The data is aggregated and combined to deliver the desired output. The final result is a reduced set of <key, value> pairs which MapReduce, by default, stores in HDFS.
5. Combining – The last phase, where all the data (the individual result set from each node) is combined to form the final result. A small in-memory simulation of these phases is sketched below.
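A small in-memory simulation of these phases (plain Python, not an actual Hadoop job; the three input lines are illustrative):

```python
from collections import defaultdict

def map_fn(line):
    # Mapping: emit a <word, 1> pair for every word in the split.
    return [(word, 1) for word in line.split()]

def shuffle(mapped_outputs):
    # Shuffling: group all values by key into <key, value-list> pairs.
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reduce_fn(key, values):
    # Reducing: add all the values for each key.
    return key, sum(values)

# Splitting: one "split" per input line.
splits = ["Bus Car Train", "Train Plane Car", "Bus Bus Plane"]
mapped = [map_fn(split) for split in splits]
grouped = shuffle(mapped)
result = dict(reduce_fn(k, v) for k, v in grouped.items())
print(result)   # {'Bus': 3, 'Car': 2, 'Train': 2, 'Plane': 2}
```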

2) b) HDFS architecture in detail with its features. (10)

HDFS architecture

❖ The Hadoop Distributed File System (HDFS) is the primary data storage system used by
Hadoop applications. It employs a NameNode and DataNode architecture to implement a
distributed file system that provides high-performance access to data across highly
scalable Hadoop clusters.
❖ HDFS supports the rapid transfer of data between compute nodes. At its outset, it was
closely coupled with MapReduce, a programmatic framework for data processing.
❖ When HDFS takes in data, it breaks the information down into separate blocks and
distributes them to different nodes in a cluster, thus enabling highly efficient parallel
processing.
❖ Moreover, the Hadoop Distributed File System is specially designed to be highly fault
tolerant. The file system replicates, or copies, each piece of data multiple times and
distributes the copies to individual nodes, placing at least one copy on a different server
rack than the others. As a result, the data on nodes that crash can be found elsewhere
within a cluster. This ensures that processing can continue while data is recovered.

Components: (Explain all the components)


1. NameNode:
2. DataNode:
3. Secondary NameNode:

Features of HDFS(List and Discuss)

1. Cost-effective
2. Large Datasets / Variety and volume of data
3. Replication
4. Fault Tolerance and reliability
5. High availability
6. Data Integrity
7. High Throughput

3) a) List and explain Big data characteristics and types. (10)


Big data refers to the massive datasets that are collected from a variety of data sources for
business needs to reveal new insights for optimized decision making.
❖ It is a data set that is so huge and complicated that no typical data management technology can effectively store or process it.

1. Characteristics of big data(Explain each with example)

1. Volume
2. Velocity
3. Variety
4. Veracity
5. Value

2. Types of big data

There are three types of big data(Explain with example)


❖ Structured
❖ Unstructured
❖ Semi-structured

3) b) What is NoSQL? Explain graph store and column family store with relevant examples.
(10)

❖ NoSQL is a set of concepts that allows the rapid and efficient processing of data sets with
a focus on performance, reliability, and agility.
❖ NoSQL refers to all databases and data stores that are not based on the relational database
management system or RDBMS principles.
❖ In real applications, data requirements change a lot. Data is easily available from Facebook, Twitter and others. The data includes user information, social graphs, geographic location data and other user-generated content
❖ To make use of such abundant resources and data, it is necessary to work with a technology that can operate on such data. SQL databases are not ideally designed to operate on such data
❖ NoSQL databases are specially designed for operating on huge amounts of data

Graph stores

❖ A graph store is a system that contains a sequence of nodes and relationships that, when
combined, create a graph.
❖ A graph store has three data fields: nodes, relationships, and properties. Some types of
graph stores are referred to as triple stores because of their node-relationship-node
structure
❖ This pattern of architecture clearly deals with information storage management in graphs
❖ Graphs are essentially structures that represent relations between two or more objects in
some data

❖ Objects or entities are referred to as nodes and are connected by relationships known as edges. Each edge has a unique identifier
❖ Each node serves as a point of contact for the graph. This pattern is very widely used in social networks, where there are large numbers of entities and each entity has one or many characteristics that are linked by edges

❖ When stored in a graph store, two assertions about the same node (for example, that Person123 wrote a book, and that Person123 has the name "Dan") are independent and may even be stored on different systems around the world. But if the URI of the Person123 structure is the same in both assertions, your application can figure out that the author of the book has the name "Dan".
❖ Examples of graph stores are Neo4J, FlockDB, ArangoDB, OrientDB, Titan, DataStax,
Amazon Neptune, etc.
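A minimal sketch of the node-relationship-node (triple) structure described above; the URIs and property names are illustrative only, not the API of any particular graph store:

```python
# Each entry is a (node, relationship, node/value) triple.
triples = [
    ("person:123", "wrote", "book:456"),
    ("person:123", "has_name", "Dan"),
    ("book:456", "has_title", "Some Book Title"),
]

def objects(subject, relationship):
    # Follow all edges of a given type out of a node.
    return [o for s, r, o in triples if s == subject and r == relationship]

# Because both assertions share the same node identifier ("person:123"),
# the application can resolve the author's name even if the triples are
# stored on different systems.
for book in objects("person:123", "wrote"):
    print(book, "was written by", objects("person:123", "has_name")[0])
```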

Column family stores

❖ This pattern stores data in individual cells that are further grouped into columns, rather than storing data in relational tuples
❖ It offers very high performance and a highly scalable architecture
❖ Column-oriented databases operate on columns: they store vast quantities of data column-wise, and the column names and layout may diverge from one row to another

(Figure: column family and super column family structures)

❖ Each column is handled separately, but, as in a super column family, an individual column may itself contain several other columns
❖ Because data is stored column-wise, it is readily available, and queries such as count and average over a column can be performed easily
❖ Examples of column family stores are HBase, Google's BigTable, Cassandra, etc. A rough sketch of this layout follows.
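A rough sketch of the row/column-family layout mentioned above (the row keys and column names are illustrative; this is not the actual HBase or Cassandra API):

```python
# One column family: each row key maps to a set of named columns,
# and the set of columns may differ from one row to another.
student_info = {
    "row-001": {"name": "Asha",  "dept": "CS", "marks": 78},
    "row-002": {"name": "Ravi",  "marks": 85},               # no 'dept' column
    "row-003": {"name": "Meena", "dept": "IT", "marks": 91},
}

# Column-oriented queries such as COUNT or AVERAGE touch only one column.
marks = [row["marks"] for row in student_info.values() if "marks" in row]
print("count =", len(marks), ", average =", sum(marks) / len(marks))
```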

4) a) Suppose a stream consists of the integers 2,1,6,1,5,9,2,3,5. Find out the number of
distinct elements in this stream using Flajolet-Martin algorithm if the hash function is:
a. h(x)=2x+3 mod 16
b. h(x)=4x+1 mod 16
c. h(x)=5x mod 16 (10)

a) Input stream : 2,1,6,1,5,9,2,3,5


h(x)=2x+3 mod 16

Input Calculation Binary equivalent Trailing zeros

2 h(2)=2(2)+3 mod 16 = 7 mod 16 = 7 0111 0

1 h(1)=2(1)+3 mod 16 = 5 mod 16 = 5 0101 0

6 h(6)=2(6)+3 mod 16 = 15 mod 16 = 15 1111 0

1 h(1)=2(1)+3 mod 16 = 5 mod 16 = 5 0101 0

5 h(5)=2(5)+3 mod 16 = 13 mod 16 = 13 1101 0

9 h(9)=2(9)+3 mod 16 = 21 mod 16 = 5 0101 0

2 h(2)=2(2)+3 mod 16 = 7 mod 16 = 7 0111 0

3 h(3)=2(3)+3 mod 16 = 9 mod 16 = 9 1001 0

5 h(5)=2(5)+3 mod 16 = 13 mod 16 = 13 1101 0

Calculating distinct elements:

r = max [Trailing zeros] = 0
R = 2^r
R = 2^0 = 1
No. of distinct elements = 1

b) Input stream : 2,1,6,1,5,9,2,3,5


h(x)=4x+1 mod 16

Input Calculation Binary equivalent Trailing zeros

2 h(2)=4(2)+1 mod 16 = 9 mod 16 =9 1001 0

1 h(1)=4(1)+1 mod 16 = 5 mod 16 = 5 0101 0

6 h(6)=4(6)+1 mod 16 = 25 mod 16 = 9 1001 0

1 h(1)=4(1)+1 mod 16 = 5 mod 16 = 5 0101 0

5 h(5)=4(5)+1 mod 16 = 21 mod 16 = 5 0101 0

9 h(9)=4(9)+1 mod 16 = 37 mod 16 = 5 0101 0

2 h(2)=4(2)+1 mod 16 = 9 mod 16 = 9 1001 0

3 h(3)=4(3)+1 mod 16 = 13 mod 16 = 13 1101 0

5 h(5)=4(5)+1 mod 16 = 21 mod 16 = 5 0101 0

Calculating distinct elements:

r = max [Trailing zeros] = 0
R = 2^r
R = 2^0 = 1
No. of distinct elements = 1

c) Input stream : 2,1,6,1,5,9,2,3,5


h(x)= 5x mod 16

Input Calculation Binary equivalent Trailing zeros

2 h(2)=5(2) mod 16 = 10 mod 16 = 10 1010 1

1 h(1)=5(1) mod 16 = 5 mod 16 = 5 0101 0

6 h(6)=5(6) mod 16 = 30 mod 16 = 14 1110 1

1 h(1)=5(1) mod 16 = 5 mod 16 = 5 0101 0

5 h(5)=5(5) mod 16 = 25 mod 16 = 9 1001 0

9 h(9)=5(9) mod 16 = 45 mod 16 = 13 1101 0

2 h(2)=5(2) mod 16 = 10 mod 16 = 10 1010 1

3 h(3)=5(3) mod 16 = 15 mod 16 = 15 1111 0

5 h(5)=5(5) mod 16 = 25 mod 16 = 9 1001 0

Calculating distinct elements:

r = max [Trailing zeros] = 1
R = 2^r
R = 2^1 = 2
No. of distinct elements = 2

Out of the above, none of the hash functions gives an accurate count of the distinct elements in the given data stream (the stream actually contains 6 distinct values: 1, 2, 3, 5, 6, 9). A short Python check of these calculations is given below.
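The worked calculations above can be verified with a small Python sketch (a simplified, single-hash version of Flajolet-Martin, not a production estimator):

```python
def trailing_zeros(value, bits=4):
    # Trailing zeros in the 4-bit binary form of h(x) mod 16.
    if value == 0:
        return bits
    count = 0
    while value % 2 == 0:
        value //= 2
        count += 1
    return count

def flajolet_martin(stream, h):
    # Estimate of distinct elements: R = 2^r, where r = max trailing zeros.
    r = max(trailing_zeros(h(x) % 16) for x in stream)
    return 2 ** r

stream = [2, 1, 6, 1, 5, 9, 2, 3, 5]
print(flajolet_martin(stream, lambda x: 2 * x + 3))   # 1
print(flajolet_martin(stream, lambda x: 4 * x + 1))   # 1
print(flajolet_martin(stream, lambda x: 5 * x))       # 2
```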

4) b) Explain collaborative filtering system. How is it different from content based system?
(10)

Collaborative-filtering system

❖ It uses community data from peer groups for recommendations. This exhibits all those
things that are popular among the peers. Collaborative filtering systems recommend items
based on similarity measures between users and/or items. The items recommended to a
user are those preferred by similar users (community data). In this, user profile and
contextual parameters along with the community data are used by the recommender
systems to personalize the recommendation list.
❖ Instead of using features of items to determine their similarity, we focus on the similarity
of the user ratings for two items.
❖ The recommendations are based on the user's behaviour, so the user's history plays an important role. For example, if user 'A' likes 'Coldplay', 'Linkin Park' and 'Britney Spears' while user 'B' likes 'Coldplay', 'Linkin Park' and 'Taylor Swift', then they have similar interests. So there is a high probability that user 'A' would like 'Taylor Swift' and user 'B' would like 'Britney Spears'. This is how collaborative filtering works
❖ The underlying assumption of the collaborative filtering approach is that if A and B buy
similar products, A is more likely to buy a product that B has bought than a product
which a random person has bought
❖ Unlike content based, there are no features corresponding to users or items here. All we
have is the Utility Matrix. This is what it looks like:

❖ A, B, C, D are the users, and the columns represent movies. The values represent ratings
(1-5) a user has given a movie. In other cases, these values could be 0/1 depending on
whether the user watched the movie or not.

Collaborative Filtering Example


Consider a movie rating system. Rating is done on the scale of 1–5. Rating 1 denotes “dislike”
and rating 5 “love it”. In the Table below, the ratings given by Jack, Tom, Dick and Harry for
five different movies are given. Tim has seen four of those movies. Predict whether the fifth
movie is to be recommended for Tim.
Table : Movie rating by different users

Steps for finding the recommendation of Movie 5 for Tim:


1. Identify the set of users (peers) who have seen and rated the movies that Tim saw in the past.
2. The average of the peer ratings for the movies seen by Tim is used to find the similarity between Tim and the other users, in order to recommend Movie 5.
3. Use the Pearson correlation as the similarity measure.
The table below gives the similarity measure among the different users with their movie ratings; a small sketch of the Pearson computation follows it.

Table : Similarity measure among different users with their movie ratings
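A small sketch of the Pearson similarity computation used in step 3 (the rating values below are illustrative, since the original rating table is not reproduced here):

```python
from math import sqrt

def pearson(u, v):
    # Pearson correlation over the movies rated by both users.
    common = [m for m in u if m in v]
    n = len(common)
    if n == 0:
        return 0.0
    mean_u = sum(u[m] for m in common) / n
    mean_v = sum(v[m] for m in common) / n
    num = sum((u[m] - mean_u) * (v[m] - mean_v) for m in common)
    den = (sqrt(sum((u[m] - mean_u) ** 2 for m in common)) *
           sqrt(sum((v[m] - mean_v) ** 2 for m in common)))
    return num / den if den else 0.0

# Illustrative ratings on a 1-5 scale.
tim  = {"M1": 5, "M2": 4, "M3": 4, "M4": 2}
jack = {"M1": 5, "M2": 4, "M3": 5, "M4": 2, "M5": 4}
tom  = {"M1": 1, "M2": 2, "M3": 2, "M4": 5, "M5": 1}

# The peers most similar to Tim drive the prediction for Movie 5.
for name, peer in [("Jack", jack), ("Tom", tom)]:
    print(name, round(pearson(tim, peer), 3))
```

With these illustrative numbers Jack is highly similar to Tim, so Jack's high rating of Movie 5 would lead to recommending it to Tim.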
In a content-based recommendation system, by contrast, we match users to the content or items they have liked or bought. Here the attributes of the users and the products are important. The aim of content-based recommendation is to create a 'profile' for each user and each item.

5) a) For the given graph show how clique percolation method will find community. (10)

The clique percolation method is as follows:


Input: The social graph G, representing a network, and a clique size k
Output: Set of discovered communities, C
1) All k-cliques present in graph G are extracted.
2) A new clique graph GC is created:
a) Each extracted k-clique is compressed into one vertex.
b) Two vertices are connected by an edge in GC if the corresponding cliques have k - 1 common vertices.
3) Connected components in GC are identified.
4) Each connected component in GC represents a community.
5) Set C is the set of communities formed for G.

In the given example we have six 3-cliques:
a : 1,2,3
b : 1,2,8
c : 2,4,5
d : 2,4,6
e : 2,5,6
f : 4,5,6
It also has one 4-clique:
g : 2,4,5,6
We now form a clique graph GC with six vertices, each vertex representing one of these six 3-cliques. Since k = 3, we add an edge between two vertices if the corresponding cliques share at least two (k - 1) vertices. Clique a and clique b have vertices 1 and 2 in common; therefore, they are connected by an edge. Similarly we add the other edges to form the clique graph GC as shown in Fig. 11.9.
Connected components in GC are (a, b) and (c, d, e, f), and these form the communities. So, in this case the two connected components correspond to two communities:
1. c1 : (1, 2, 3, 8)
2. c2 : (2, 4, 5, 6)
Thus the community set C = {c1, c2}, where vertex 2 overlaps both communities. Vertex 7 is not part of any community as it is not part of any 3-clique. The resulting subgraphs represent the communities c1 and c2; a small check with networkx is sketched below.
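The worked example can be checked with networkx (the edge list is reconstructed from the cliques listed above; since the original figure is not reproduced here, vertex 7 is assumed to be attached by a single edge only):

```python
import networkx as nx
from networkx.algorithms.community import k_clique_communities

G = nx.Graph()
G.add_edges_from([
    (1, 2), (1, 3), (2, 3),      # clique a : 1,2,3
    (1, 8), (2, 8),              # clique b : 1,2,8 (edge 1-2 already added)
    (2, 4), (2, 5), (2, 6),      # cliques c, d, e
    (4, 5), (4, 6), (5, 6),      # clique f : 4,5,6 (and the 4-clique g)
    (3, 7),                      # vertex 7: in no triangle, so in no community
])

communities = [set(c) for c in k_clique_communities(G, 3)]
print(communities)   # [{1, 2, 3, 8}, {2, 4, 5, 6}] (order may vary)
```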

5) b) List and explain different data visualization techniques. (5)
❖ Data visualization is the technique used to deliver insights from data using visual cues such as graphs, charts, maps and many others. This is useful as it helps in intuitive and easy understanding of large quantities of data, and thereby in making better decisions about it.
Types of Data Visualizations(Discuss types)
Some of the various types of visualizations offered by R are:

Bar Plot
Histogram
Box Plot
Scatter Plot
Heat Map

5) c) What are the different data structures in R? Explain with examples. (5)

❖ A data structure is a particular way of organizing data in a computer so that it can be used
effectively. The idea is to reduce the space and time complexities of different tasks. Data
structures in R programming are tools for holding multiple values.
❖ R's base data structures are often organized by their dimensionality (1D, 2D, or nD) and whether they're homogeneous (all elements must be of identical type) or heterogeneous (the elements can be of various types). This gives rise to the six data structures which are most frequently used in data analysis.

R has many data structures which include Vector, List, Array, Matrices, Data frame and
Factors.(Discuss in details)

Vectors
Lists
Arrays
Matrices
Factors
Data frames
