
Important Questions for Sessional -1

Programme: M.Tech (CSE) Subject: Big Data Analytics

UNIT I (Introduction to Big Data)


1. Illustrate the role of Big Data in Health Care and Credit Risk Management

Big Data in Health Care: Big data is a massive amount of information on a given topic. Big
data includes information that is generated, stored, and analyzed on a vast scale — too vast
to manage with traditional information storage systems. Big data collection and analysis
enables doctors and health administrators to make more informed decisions about treatment
and services. For example, doctors who have big data samples to draw from may be able to
identify the warning signs of a serious illness before it arises. Treating disease at an early
stage can be simpler and cost less overall than treating it once it has progressed.
Advantages of using Big data in health care:
• Empowering patients to engage with their own health histories with easy-to-access
medical records
• Informing providers of patients’ ongoing health status so they can in turn assess
treatment methods faster
• Harnessing data-driven findings to predict and solve medical issues earlier than ever
before

Big Data in Credit risk management: Credit risk assessment is key to the success of fintech
companies. They need to evaluate the creditworthiness of the individuals as well as the corporations
to whom they provide credit. The main goal of credit risk assessment is to compare the risk versus
reward ratio of lending to specific individuals and companies in order to make informed credit
decisions. The role of Big Data Analytics in credit risk assessment is therefore vital. Big Data
Analytics ensures that fintech companies have all the information they need about specific
customer groups as well as individuals. Once the problem of insufficient data is solved, proper
analysis of the available information becomes essential to the process of credit risk assessment.
The ability of Big Data Analytics to assess and analyse high volumes of data in minimal time gives
an edge to fintech companies utilising this technology. Big Data Analytics not only provides all the
relevant data but also analyses it accurately to give considerably improved evaluations of
credit risk. Fintech companies can also broaden their customer base by using Big Data Analytics,
and they can use this information to minimise their risks.

2. Differentiate Structured vs Unstructured Data

Structured Data: Data that is to the point, factual, and highly organized is referred to as
structured data. It is quantitative in nature, i.e. it is related to quantities, which means it contains
measurable numerical values like numbers, dates, and times. It is easy to search and analyze
structured data. Structured data exists in a predefined format. A relational database consisting of
tables with rows and columns is one of the best examples of structured data. The query
language SQL (Structured Query Language) is used for managing structured data. Common
applications of relational databases with structured data include sales transactions, airline
reservation systems, inventory control, and others.

Unstructured Data: Unstructured files such as log files, audio files, and image files make up
unstructured data. Unstructured data is data that lacks any predefined model or format. It
requires a lot of storage space, and it is hard to maintain security in it. It cannot be represented in a
data model or schema, which is why managing, analyzing, or searching unstructured data is hard.
It resides in many different formats such as text, images, audio and video files, etc. It is qualitative in
nature and is sometimes stored in a non-relational (NoSQL) database; it is not stored in relational
databases. The amount of unstructured data is much greater than that of structured or semi-structured
data. Examples of human-generated unstructured data are text files, email, social media posts,
media files, mobile data, business application data, and others. Machine-generated unstructured data includes
satellite images, scientific data, sensor data, digital surveillance footage, and many more.
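To make the contrast concrete, here is a small illustrative sketch in Java (the Customer class and the review text are made-up examples, not any standard API): the same customer information is easy to query when it is held in typed, tabular fields, but it needs extra text processing when it is buried in free text.

// Structured: every record has the same typed fields, like one row of a relational table.
class Customer {
    int id;            // numeric value with a fixed meaning
    String name;       // text value with a fixed meaning
    String joinDate;   // date stored in a known format, e.g. "2024-01-15"

    Customer(int id, String name, String joinDate) {
        this.id = id; this.name = name; this.joinDate = joinDate;
    }
}

public class StructuredVsUnstructured {
    public static void main(String[] args) {
        // Structured record: a simple field lookup answers the question directly.
        Customer c = new Customer(101, "Asha", "2024-01-15");
        System.out.println("Structured lookup: " + c.name);

        // Unstructured record: a free-text review with no predefined schema.
        String review = "Asha wrote: the delivery was late but support was helpful.";
        // Extracting meaning requires text processing rather than a column lookup.
        System.out.println("Mentions support? " + review.contains("support"));
    }
}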

3. Describe the Characteristics of Big Data

The characteristics of Big Data can be explained with the five V’s of Big Data.
1. Volume: The name Big Data itself is related to enormous size. Big Data is a vast
'volume' of data generated daily from many sources, such as business processes,
machines, social media platforms, networks, human interactions, and many more.
Example: Facebook generates approximately a billion messages, 4.5 billion "Like"
actions, and more than 350 million new posts each day. Big data technologies can handle
such large amounts of data.
2. Variety: Big Data can be structured, unstructured, or semi-structured, and it is
collected from different sources. Data comes in varied forms such as PDFs, emails, audio,
social media posts, photos, videos, etc.
Example: Web server logs, i.e., log files created and maintained by a server that
contain a list of activities.

3. Veracity: Veracity means how reliable the data is. It defines the degree of
trustworthiness of the data. Since much of the data is unstructured, it is important to filter out
the unnecessary information and use only reliable information for processing.
Example: a huge amount of data can create confusion, whereas too little data
conveys inadequate information.

4. Value: Value is an essential characteristic of big data. It is not the sheer amount of data that is
processed or stored that matters; it is the valuable and reliable data that is stored, processed, and
analyzed.

5. Velocity: Velocity means the speed at which data is created in real time. It covers
the speed of incoming data streams, the rate of change, and bursts of activity. A primary
requirement of Big Data is to make fast-arriving data available rapidly.
Example: Big data systems can deal with the speed at which data flows from sources like application
logs, business processes, networks, social media sites, sensors, mobile devices, etc.
4. Explain the Advantages and Disadvantages of Crowdsourcing

Crowdsourcing combines the terms "crowd" and "outsourcing". When a
company/organization engages in crowdsourcing, it outsources internal work processes. It
is therefore an independent form of division of labour. This does not relate to the
outsourcing of production (classic outsourcing), but rather to corporate processes such as
the collection of ideas for new products.

Advantages:
Advantage 1: Crowdsourcing offers a high probability of success
Digital crowdsourcing platforms ensure that the outsourced company/people can work on a given
project at any place and at any time, which gives a high probability of success.
Advantage 2: Crowdsourcing saves costs and time
There is a possibility that the outsourced company charges less and brings more productivity,
which saves both time and money.
Advantage 3: Building customer contact and a database
Participants become intensively involved with the brand, a product, or an idea. It goes without
saying that this can have a positive effect on future purchasing decisions. Companies also collect
valuable data from a valuable target group that they can contact in the future.
Advantage 4: Gaining brand ambassadors - or even employees
If a company manages to inspire people with its innovation as part of its crowdsourcing project,
participants can quickly become brand ambassadors.

Disadvantages:
1. Compromised quality: When numerous people are hired to do a job, it can easily lead to a
lack of consistency, even if the people are professionals.
2. Missed deadlines: Even if the quality of the work is good, there is a chance that the job is
delivered after the deadline.
3. Confidentiality: Since sensitive information related to the project is given out to get the work
done, there is no guarantee that it will remain confidential, even if NDAs
are signed.

UNIT II (NoSQL)


1. Explain materialized views in NoSQL with examples

Materialized views are introduced to address the shortcomings of indexes. Creating indexes
on columns with high cardinality tends to result in poor performance. Materialized views
address this problem by storing preconfigured views that support queries. These are views
that are computed in advance and cached on disk. Each materialized view supports queries
based on a single column that is not part of the original primary key. Materialized views simplify
application development.
Example:
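A minimal sketch of a materialized view, using Cassandra and the DataStax Java driver; the keyspace, table, and column names (cycling, cyclist_stats, age, etc.) are made up for illustration, and recent Cassandra versions may require materialized views to be enabled in cassandra.yaml.

import com.datastax.oss.driver.api.core.CqlSession;

public class MaterializedViewExample {
    public static void main(String[] args) {
        // Assumes a Cassandra node is reachable on localhost with the default port.
        try (CqlSession session = CqlSession.builder()
                .withLocalDatacenter("datacenter1")   // adjust to your cluster's data center name
                .build()) {

            session.execute("CREATE KEYSPACE IF NOT EXISTS cycling WITH replication = "
                    + "{'class': 'SimpleStrategy', 'replication_factor': 1}");

            // Base table: the primary key is cid, so queries by age alone are not supported.
            session.execute("CREATE TABLE IF NOT EXISTS cycling.cyclist_stats ("
                    + " cid uuid PRIMARY KEY, name text, age int )");

            // Materialized view: a precomputed, disk-cached copy keyed by the
            // non-primary-key column age, so queries by age become cheap.
            session.execute("CREATE MATERIALIZED VIEW IF NOT EXISTS cycling.cyclist_by_age AS "
                    + " SELECT age, cid, name FROM cycling.cyclist_stats "
                    + " WHERE age IS NOT NULL AND cid IS NOT NULL "
                    + " PRIMARY KEY (age, cid)");

            // The view can now be queried by age, which the base table could not do efficiently.
            session.execute("SELECT name FROM cycling.cyclist_by_age WHERE age = 28");
        }
    }
}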

2. Illustrate sharding and master-slave replication in NoSQL

Sharding: Sharding is a method for distributing a single dataset across multiple databases, which
can then be stored on multiple machines. This allows for larger datasets to be split into smaller
chunks and stored in multiple data nodes, increasing the total storage capacity of the system.
Sharding is a form of scaling known as horizontal scaling or scale-out, as additional nodes are
brought on to share the load. Horizontal scaling allows for near-limitless scalability to handle big
data and intense workloads.
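A minimal sketch of the idea, not any particular database's algorithm (the three in-memory shard maps and the hash-modulo routing rule are assumptions for illustration): each key is hashed, and the hash decides which node stores the record, so data and load are spread across nodes.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ShardingSketch {
    // Three hypothetical data nodes, each holding its own slice of the key space.
    private final List<Map<String, String>> shards =
            List.of(new HashMap<>(), new HashMap<>(), new HashMap<>());

    // Routing rule: hash the key and take it modulo the number of shards.
    private int shardFor(String key) {
        return Math.floorMod(key.hashCode(), shards.size());
    }

    public void put(String key, String value) {
        shards.get(shardFor(key)).put(key, value);   // the write lands on exactly one node
    }

    public String get(String key) {
        return shards.get(shardFor(key)).get(key);   // the read goes to that same node
    }

    public static void main(String[] args) {
        ShardingSketch db = new ShardingSketch();
        db.put("user:42", "Asha");
        db.put("user:77", "Ravi");
        System.out.println(db.get("user:42"));  // routed to the shard that stored it
    }
}

Real systems typically use consistent hashing or range-based partitioning so that adding or removing a node does not reshuffle every key.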

Master-slave Replication: The master-slave technique of NoSQL data replication creates a master copy
of your database and maintains it as the key data source. Any update made to this master
copy is later transferred to the slave copies. To maintain fast performance, all read requests are
handled by the slave copies, as it would not be feasible to put all the burden on the master copy
alone. In case the master copy fails, one of the slave copies is automatically assigned as the new
master. For example, the Riak database shards the data and also replicates it based on the replication factor.
3. Write short notes on master-slave replication

The master-slave technique of NoSQL data replication creates a master copy of your database and
maintains it as the key data source. Any update made to this master copy is later
transferred to the slave copies. To maintain fast performance, all read requests are handled by the
slave copies, as it would not be feasible to put all the burden on the master copy alone. In case the
master copy fails, one of the slave copies is automatically assigned as the new master. Master-slave
replication makes one node the authoritative copy that handles writes, while the slaves synchronize
with the master and may handle reads. Master-slave replication reduces the chance of update
conflicts, whereas peer-to-peer replication avoids loading all writes onto a single server and thereby
creating a single point of failure. A system may use either or both techniques. For example, the Riak
database shards the data and also replicates it based on the replication factor.
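A minimal sketch of the read/write routing described above (the in-memory maps stand in for real database servers, and the round-robin read policy is an assumption for illustration): writes go only to the master, which propagates them to the slaves, and reads are spread across the slaves.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MasterSlaveSketch {
    private final Map<String, String> master = new HashMap<>();          // authoritative copy
    private final List<Map<String, String>> slaves = new ArrayList<>();  // read replicas
    private int nextSlave = 0;                                           // round-robin pointer

    public MasterSlaveSketch(int slaveCount) {
        for (int i = 0; i < slaveCount; i++) slaves.add(new HashMap<>());
    }

    // All writes go to the master, which then propagates them to every slave.
    public void write(String key, String value) {
        master.put(key, value);
        for (Map<String, String> slave : slaves) slave.put(key, value);  // replication step
    }

    // Reads are served by the slaves in round-robin fashion to spread the load.
    public String read(String key) {
        Map<String, String> slave = slaves.get(nextSlave);
        nextSlave = (nextSlave + 1) % slaves.size();
        return slave.get(key);
    }

    public static void main(String[] args) {
        MasterSlaveSketch store = new MasterSlaveSketch(2);
        store.write("config:theme", "dark");
        System.out.println(store.read("config:theme"));  // served from a slave copy
    }
}

In a real deployment the replication step is usually asynchronous, so a slave may briefly serve stale data, and a failover mechanism promotes a slave when the master fails.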

4. Demonstrate the architecture of Cassandra and explain its components with a neat sketch

The design goal of Cassandra is to handle big data workloads across multiple nodes without
any single point of failure. Cassandra has a peer-to-peer distributed architecture across its nodes,
and data is distributed among all the nodes in a cluster.

• All the nodes in a cluster play the same role. Each node is independent and at the
same time interconnected to other nodes.
• Each node in a cluster can accept read and write requests, regardless of where the
data is actually located in the cluster.
• When a node goes down, read/write requests can be served from other nodes in the
network.

Cassandra follows the CAP theorem, which states that a distributed system can deliver only two of the three
desired characteristics: consistency, availability, and partition tolerance. Cassandra is an
AP system, meaning it guarantees availability and partition tolerance but not strict consistency;
however, this can be tuned further via the replication factor and consistency level.
Data Replication in Cassandra:

In Cassandra, one or more of the nodes in a cluster act as replicas for a given piece of data.
If it is detected that some of the nodes responded with an out-of-date value, Cassandra will
return the most recent value to the client. After returning the most recent value, Cassandra
performs a read repair in the background to update the stale values. Cassandra uses the
Gossip Protocol in the background to allow the nodes to communicate with each other and
detect any faulty nodes in the cluster.

The key components of Cassandra are as follows −

• Node − It is the place where data is stored.


• Data center − It is a collection of related nodes.
• Cluster − A cluster is a component that contains one or more data centers.
• Commit log − The commit log is a crash-recovery mechanism in Cassandra. Every
write operation is written to the commit log.
• Mem-table − A mem-table is a memory-resident data structure. After the commit log,
the data is written to the mem-table. Sometimes, for a single column family,
there will be multiple mem-tables.
• SSTable − It is a disk file to which the data is flushed from the mem-table when its
contents reach a threshold value.
• Bloom filter − These are quick, probabilistic data structures for testing
whether an element is a member of a set. It is a special kind of cache. Bloom filters
are consulted on every read to skip SSTables that cannot contain the requested key; a
minimal sketch follows this list.
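A minimal sketch of a Bloom filter (the bit-array size and the two hash positions are arbitrary choices for illustration; Cassandra's own implementation is more sophisticated): it may report false positives but never false negatives, which is what lets a read skip SSTables that definitely do not contain the key.

import java.util.BitSet;

public class BloomFilterSketch {
    private final BitSet bits;
    private final int size;

    public BloomFilterSketch(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    // Two simple hash positions for a key (real filters use several independent hashes).
    private int h1(String key) { return Math.floorMod(key.hashCode(), size); }
    private int h2(String key) { return Math.floorMod(key.hashCode() * 31 + 17, size); }

    public void add(String key) {
        bits.set(h1(key));
        bits.set(h2(key));
    }

    // "false" means definitely absent; "true" means possibly present.
    public boolean mightContain(String key) {
        return bits.get(h1(key)) && bits.get(h2(key));
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch(1024);
        filter.add("row-key-1");
        System.out.println(filter.mightContain("row-key-1")); // true
        System.out.println(filter.mightContain("row-key-9")); // false (most likely)
    }
}

Cassandra keeps one such filter per SSTable, so a negative answer lets the read path skip that file entirely.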

UNIT III (Hadoop)


1. Describe the building blocks of Hadoop with a neat sketch

Hadoop is a framework written in Java that utilizes a large cluster of commodity hardware to
maintain and store large volumes of data. Hadoop mainly consists of 4 components:
a. MapReduce
b. HDFS(Hadoop Distributed File System)
c. YARN(Yet Another Resource Negotiator)
d. Common Utilities or Hadoop Common

MapReduce: MapReduce is a programming model and processing engine that runs on the YARN framework.
The major feature of MapReduce is that it performs distributed processing in parallel across a Hadoop
cluster, which is what makes Hadoop so fast.
HDFS: The Hadoop Distributed File System is utilized for storage in a Hadoop cluster. It
is mainly designed to work on commodity hardware (inexpensive devices), following a
distributed file system design. HDFS provides fault tolerance and high availability
to the storage layer and the other devices present in the Hadoop cluster. Its two main daemons are:

• NameNode(Master)
• DataNode(Slave)

NameNode: The NameNode works as the master in a Hadoop cluster and guides the
DataNodes (slaves).

DataNode: DataNodes work as slaves. DataNodes are mainly utilized for storing the
data in a Hadoop cluster; the number of DataNodes can range from 1 to 500 or more.

YARN: YARN is the framework on which MapReduce works. YARN performs 2 operations: job
scheduling and resource management. The purpose of the job scheduler is to divide a big task into small
jobs so that each job can be assigned to various slaves in a Hadoop cluster and processing can be
maximized.
Common utilities: These are the Java libraries and Java files that are needed by all the other
components present in a Hadoop cluster. These utilities are used by HDFS, YARN, and MapReduce
for running the cluster.

2. Demonstrate how to Analyse Weather Data Using Hadoop

Analysing climate-related data can be done using Hadoop MapReduce. Every record is filtered based
on the location id that is specified as input. From the filtered records the values of the parameters
are extracted, after which further calculations are done.

MapReduce Implementation:
A Mapper is used to process each input block, so all blocks are processed simultaneously. The Mapper
filters the records matching a particular location id or year, and all the parameters are extracted and
saved into HDFS as key-value pairs. In this Mapper phase the memory is allocated only once for
each key-value record and the memory space is reused, resulting in optimized memory allocation.
The combiner phase is used after the mapper so that local calculations, such as finding the
maximum, minimum, and average temperatures and the top hottest and coldest stations for each
parameter, can be done early, reducing network traffic and the load on the reducer.
The reducer phase is used to calculate the global maxima, minima, and hottest and coldest stations for
different parameter fields like temperature, pressure, humidity, and wind speed. The resultant data is
stored back into HDFS in sorted format.
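A minimal sketch of such a weather job, assuming each input line is a comma-separated record of the form stationId,date,temperature (the record layout, class names, and paths are assumptions for illustration): the mapper emits (stationId, temperature), the reducer class doubles as the combiner for local maxima, and the reducer computes the global maximum per station.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

    // Mapper: parse each record and emit (stationId, temperature).
    public static class TempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");   // stationId,date,temperature
            if (fields.length == 3) {
                context.write(new Text(fields[0]),
                        new IntWritable(Integer.parseInt(fields[2].trim())));
            }
        }
    }

    // Reducer (also used as combiner): keep the maximum temperature seen for a station.
    public static class MaxReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int max = Integer.MIN_VALUE;
            for (IntWritable v : values) max = Math.max(max, v.get());
            context.write(key, new IntWritable(max));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "max temperature per station");
        job.setJarByClass(MaxTemperature.class);
        job.setMapperClass(TempMapper.class);
        job.setCombinerClass(MaxReducer.class);   // local maxima cut shuffle traffic
        job.setReducerClass(MaxReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The job would be packaged into a jar and run with hadoop jar, passing the HDFS input and output directories as arguments.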
3. Write short notes on Fault Tolerance, Data Replication, and High Availability in the HDFS Architecture

Fault Tolerance: Fault tolerance refers to the ability of a system to keep operating even under
unfavorable conditions. Fault tolerance in HDFS thus refers to the working strength of the system in
unfavorable conditions. HDFS is highly fault-tolerant: it maintains the replication factor by
creating replicas of the data on other available machines in the cluster if one machine suddenly fails.
Hadoop 3 introduced erasure coding to provide fault tolerance. Erasure coding in HDFS improves
storage efficiency while providing the same level of fault tolerance and data durability as a traditional
replication-based HDFS deployment.

Data Replication: HDFS is designed to reliably store very large files across machines in a large cluster.
It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size.
The blocks of a file are replicated for fault tolerance. The block size and replication factor are
configurable per file. An application can specify the number of replicas of a file. The replication
factor can be specified at file creation time and can be changed later. Files in HDFS are write-once
and have strictly one writer at any time.
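A minimal sketch of changing the replication factor of an existing file programmatically (the file path and the factor of 3 are illustrative; the same change can be made from the command line with hdfs dfs -setrep):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS and other settings from the cluster's configuration files.
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/data/weather/2023.csv");   // hypothetical HDFS file

            // Ask the NameNode to keep 3 copies of every block of this file.
            boolean accepted = fs.setReplication(file, (short) 3);
            System.out.println("Replication change accepted: " + accepted);
        }
    }
}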

High Availability: The Hadoop framework stores replicas of the files of one machine on the other
machines present in the cluster. Under unfavorable conditions, such as the failure of a node, the client
can still easily access the data from the other nodes. This feature of Hadoop is called high availability.
In an HDFS cluster there are a number of DataNodes. At definite intervals of time, all these
DataNodes send heartbeat messages to the NameNode. If the NameNode stops receiving heartbeat
messages from any of these DataNodes, it assumes that node to be dead. It then checks the
data that was present on that node and instructs other DataNodes to create replicas of
that data. Therefore the data is always available. When a client asks for data access
in HDFS, the NameNode locates the DataNodes in which the data is readily available and
then provides the client with access to that data. The NameNode itself makes data availability easy for
clients by providing the address of the DataNode from which the user can read directly.

4. Illustrate the MapReduce Approach to Solve a Word Count Problem

Hadoop WordCount operation occurs in 3 stages:
a. Mapper Phase
b. Shuffle Phase
c. Reducer Phase
Mapper Phase Execution
The text from the input file is tokenized into words to form a key-value pair for each word
present in the input file. The key is the word from the input file and the value is '1'.
For instance, consider the sentence "An elephant is an animal". The mapper phase in the
WordCount example will split the string into individual tokens, i.e. words. In this case, the
sentence will be split into 5 tokens (one for each word), each with a value of 1, as shown below -
Key-Value pairs from Hadoop Map Phase Execution-
(an,1)
(elephant,1)
(is,1)
(an,1)
(animal,1)
Shuffle Phase Execution
After the map phase execution is completed successfully, the shuffle phase is executed automatically,
wherein the key-value pairs generated in the map phase are taken as input and then sorted in
alphabetical order. After the shuffle phase is executed for the WordCount example, the
output will look like this -
(an,1)
(an,1)
(animal,1)
(elephant,1)
(is,1)
Reducer Phase Execution
In the reduce phase, all the keys are grouped together and the values for similar keys are added up
to find the number of occurrences of a particular word. It is like an aggregation phase for the keys
generated by the map phase. The reducer phase takes the output of the shuffle phase as input and then
reduces the key-value pairs to unique keys with the values added up. In our example sentence
"An elephant is an animal", "an" is the only word that appears twice. After the execution of the reduce
phase of the MapReduce WordCount example program, "an" appears as a key only once, but with a count
of 2, as shown below -
(an,2)
(animal,1)
(elephant,1)
(is,1)
This is how the MapReduce word count program executes and outputs the number of occurrences
of a word in any given input file. An important point to note during the execution of the WordCount
example is that the mapper class in the WordCount program will execute completely on the entire
input file and not just a single sentence. The reducer execution will begin only after the mapper
phase is executed successfully.
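A minimal sketch of the WordCount job described above, using the Hadoop MapReduce Java API (class names and input/output paths are illustrative): the mapper tokenizes each line and emits (word, 1), the same reducer class is reused as a combiner, and the reducer sums the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper phase: emit (word, 1) for every token in the input line.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString().toLowerCase());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // e.g. (an, 1)
            }
        }
    }

    // Reducer phase: sum the 1s for each word after the shuffle has grouped them.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));   // e.g. (an, 2)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input text file(s) in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Run on the example sentence, this produces the (an,2), (animal,1), (elephant,1), (is,1) output shown above, since the mapper lowercases each token before emitting it.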
