
Assignment- 1&2

1. What is big data? Why do you need to analyze big data?

o Ans. Big Data is a collection of large datasets that cannot be processed using traditional
computing techniques. It is not a single technique or a tool, rather it involves many areas
of business and technology.

o Big data is data whose scale, diversity, and complexity require new architecture,
techniques, algorithms, and analytics to manage it and extract value and hidden
knowledge from it.

Big data analytics helps organizations harness their data and use it to identify new
opportunities. That, in turn, leads to smarter business moves, more efficient operations,
higher profits and happier customers. Businesses that use big data with advanced
analytics gain value in many ways, such as:

1. Reducing cost. Big data technologies like cloud-based analytics can significantly
reduce costs when it comes to storing large amounts of data (for example, a data lake).
2. Making faster, better decisions. The speed of in-memory analytics, combined with
the ability to analyze new sources of data, helps businesses make fast, informed
decisions.
3. Developing and marketing new products and services. Analytics makes it possible to
gauge customer needs and customer satisfaction, which guides the development and
marketing of new products and services.
4. Extra: Nowadays, companies use Big Data to make business more informative and
to take business decisions by enabling data scientists, analytical modelers and
other professionals to analyze large volumes of transactional data.

2. Write down any four industry examples for Big Data.


(Pick any four)

Ans. The term Big Data refers to large amounts of complex and unprocessed
data. Big data is the valuable and powerful fuel that drives the large IT industries of the 21st
century. Big data is a fast-spreading technology used in every business sector.
a. Travel and Tourism: The travel and tourism industry is a major user of Big Data. It enables
forecasting of travel facility requirements at multiple locations, improving business through
dynamic pricing, and much more.
b. Financial and banking sector: The financial and banking sectors use big data
technology extensively. Big data analytics helps banks understand customer behavior based
on investment patterns, shopping trends, motivation to invest, and inputs obtained
from personal or financial backgrounds.
c. Healthcare: Big data has started making a massive difference in the healthcare
sector. With the help of predictive analytics, medical professionals and healthcare
personnel can now provide personalized healthcare to individual patients.
d. E-commerce: E-commerce is also an application of Big Data. Maintaining
relationships with customers is essential for the e-commerce industry. E-commerce
websites use many marketing ideas to retain customers, manage transactions,
and implement better strategies and innovative ideas to improve business with Big Data.
Amazon: Amazon is a tremendous e-commerce website dealing with lots of traffic
daily. But when there is a pre-announced sale on Amazon, traffic increases rapidly and
may crash the website. So, to handle this type of traffic and data, it uses Big Data. Big
Data helps in organizing and analyzing the data for future use.

Extra:

e. Social Media: Social media is the largest data generator. Statistics show
that around 500+ terabytes of fresh data are generated from social media daily, particularly
on Facebook. The data mainly contains videos, photos, message exchanges, etc.
A single activity on a social media site generates a lot of stored data, which gets
processed when required. Because the data stored is in terabytes (TB), it takes a lot of time
to process. Big Data is the solution to this problem.
f. Telecommunication and media: Telecommunications and the multimedia sector are
major users of Big Data. Zettabytes of data are generated every day, and
handling such large-scale data requires big data technologies.
g. Government and Military: The government and military also use this technology at a high
rate. Consider the figures the government keeps on record; in the military, a
fighter plane may need to process petabytes of data.
Government agencies use Big Data to run many services: managing utilities, dealing
with traffic jams, and tackling crimes like hacking and online fraud.
Aadhar Card: The government has a record of 1.21 billion citizens. This vast data
is analyzed and stored to find things like the number of youths in the country. Some
schemes are built to target the maximum population. Such data cannot be stored in a
traditional database, so it is stored and analyzed using Big Data Analytics tools.
3. Discuss about the three(3Vs) dimensions of big data. | List the
characteristics of big data. (3Vs) | What are the 5 Vs in big data?

Ans. V1: V for Volume: The volume of data which needs to be processed is increasing
rapidly. This results in:

o More storage capacity

o More computation

o More tools and techniques

The name ‘Big Data’ itself is related to a size which is enormous.


The quantity of generated and stored data. The size of the data determines the value
and potential insight, and whether it can be considered big data or not. The size of big
data is usually larger than terabytes and petabytes.

o V2: V for Variety: The type and nature of the data. Big data draws from text, images,
audio and video; it refers to the nature of data, which may be structured, semi-structured
or unstructured.

o 1. Unstructured data: This is the data which does not conform to a data model or is
not in a form which can be used easily by a computer program. About 80-90%
data of an organization is in this format, for example, memos, chat rooms,
PowerPoint presentations, images, videos, letters, research, white papers, body
of an email, text, numerical, audio, sequences, time series, social media data,
multi-dimensional arrays, etc.

o 2. Semi-structured data: This is the data which does not conform to a data model
but has some structure. However, it is not in a form which can be used easily by a
computer program, for example, emails, XML, markup languages like HTML, etc.
Metadata for this data is available but is not sufficient.

o 3. Structured data: This is the data which is in an organized form (e.g., in rows
and columns) and can be easily used by a computer program. Relationships exist
between entities of data, such as classes and their objects. Data stored in
databases is an example of structured data.

o V3: V for Velocity: Velocity refers to the high speed of accumulation of data.
o In Big Data, data flows in at high velocity from sources like machines, networks, social
media, mobile phones, etc.
o There is a massive and continuous flow of data. This determines the potential of
the data: how fast it is generated and processed to meet the demands.
o Sampling data can help in dealing with issues like velocity.
o Example: More than 3.5 billion searches are made on Google every day.
Also, Facebook users are increasing by approximately 22% year over year.
o V4: V for Veracity:

o It refers to inconsistencies and uncertainty in data: the available data can sometimes
get messy, and its quality and accuracy are difficult to control.

o Big Data is also variable because of the multitude of data dimensions resulting
from multiple disparate data types and sources.

o Example: Data in bulk can create confusion, whereas too little data may convey only
half or incomplete information.

o V5: V for Value:

o After taking the four Vs above into account, there comes one more V, which stands for
Value! Bulk data having no value is of no good to the company unless you
turn it into something useful.

o Data in itself is of no use or importance; it needs to be converted into
something valuable to extract information. Hence, you can state that Value is the
most important of all the 5 Vs.

4. Explain the major challenges of big data.

Ans.

1. Sharing and Accessing Data:


• Sharing data can cause substantial challenges. Accessing data from public repositories
leads to multiple difficulties.
• It is necessary for the data to be available in an accurate, complete and timely manner,
because data in the company's information system can only support accurate, timely
decisions if it is available in this form.

2. Privacy and Security:

• This challenge includes sensitive, conceptual, technical as well as legal significance.


• Most organizations are unable to maintain regular checks because of the large amount of
data being generated. However, security checks and observation should be performed in
real time, because that is when they are most beneficial.
• Some information about a person, when combined with external large datasets, can reveal
facts that the person considers private and would not want others to know.
• Some organizations collect people's information to add value to their
business. This is done by deriving insights into people's lives that they are unaware of.

3. Analytical Challenges:

• There are some huge analytical challenges in big data which raise questions such as
how to deal with the problem when the data volume gets too large.

• There are two techniques through which decision making can be done:
• Either incorporate massive data volumes in the analysis.
• Or determine upfront which Big data is relevant.

4.Quality of data:

• Collecting a large amount of data and storing it comes at a cost.
• For better results and conclusions, Big Data focuses on storing quality data rather than
irrelevant data.
• This further raises questions: how can it be ensured that the data is relevant, how much
data would be enough for decision making, and whether the stored data is accurate or
not.
5. Fault tolerance:
• Fault tolerance is another technical challenge; fault-tolerant computing is
extremely hard, involving intricate algorithms.
• New technologies like cloud computing and big data are intended to ensure that
whenever a failure occurs, the damage done stays within an acceptable threshold;
that is, the whole task should not have to begin from scratch.
6. Scalability:
• Big data projects can grow and evolve rapidly. The scalability issue of Big Data has led
towards cloud computing.
• It leads to challenges like how to run and execute various jobs so that the goal of
each workload can be achieved cost-effectively.
• It also requires dealing with system failures in an efficient manner. This again raises the
big question of what kinds of storage devices should be used.

5. What are the different types of big data technologies?

Ans. Big data technologies are important in providing more accurate analysis, which may
lead to more concrete decision-making resulting in greater operational efficiencies, cost
reductions, and reduced risks for the business. To harness the power of big data, you
would require an infrastructure that can manage and process huge volumes of structured
and unstructured data in real-time and can protect data privacy and security.

There are various technologies in the market from different vendors including Amazon, IBM,
Microsoft, etc., to handle big data. While looking into the technologies that handle big data,
we examine the following two classes of technology:

Operational Big Data: These include systems like MongoDB that provide operational
capabilities for real-time, interactive workloads where data is primarily captured and stored.
NoSQL Big Data systems are designed to take advantage of new cloud computing
architectures that have emerged over the past decade to allow massive computations to be
run inexpensively and efficiently. This makes operational big data workloads much easier to
manage, cheaper, and faster to implement. Some NoSQL systems can provide insights into
patterns and trends based on real-time data with minimal coding and without the need for
data scientists and additional infrastructure.

Analytical Big Data: These include systems like Massively Parallel Processing (MPP)
database systems and MapReduce that provide analytical capabilities for retrospective and
complex analysis that may touch most or all the data. MapReduce provides a new method
of analyzing data that is complementary to the capabilities provided by SQL, and a system
based on MapReduce that can be scaled up from single servers to thousands of high- and
low-end machines. These two classes of technology are complementary and frequently
deployed together.
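
As a rough illustration of an operational (NoSQL) workload, here is a minimal sketch using the MongoDB Java sync driver, assuming a MongoDB instance running locally on the default port; the database, collection, and field names are purely hypothetical.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import static com.mongodb.client.model.Filters.eq;

public class OperationalExample {
    public static void main(String[] args) {
        // Assumes a MongoDB server running locally on the default port.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> orders =
                    client.getDatabase("shop").getCollection("orders"); // hypothetical names

            // Capture one piece of operational data (a single order).
            orders.insertOne(new Document("orderId", 1001)
                    .append("customer", "alice")
                    .append("amount", 49.99));

            // Low-latency, single-record read: the typical operational access pattern.
            Document found = orders.find(eq("orderId", 1001)).first();
            System.out.println(found != null ? found.toJson() : "not found");
        }
    }
}

Single-document writes and reads like these reflect the real-time, interactive capture-and-store pattern described above, in contrast to the retrospective whole-dataset scans of analytical systems.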
6. Explain the difference between operational and analytical system.

Ans. Operational vs. Analytical Systems:

Operational Data Systems:

• First up, Operational Data is exactly what it sounds like - data that is produced by
your organization's day to day operations. Things like customer, inventory, and
purchase data fall into this category.
• Operational Data Systems support high-volume, low-latency access through Online
Transaction Processing (OLTP) tables, where you want to create, read, update,
or delete one piece of data at a time.

Analytical Data Systems:

• Analytical Data is a little more complex and will look different for different types of
organizations; however, at its core is an organization's Operational Data.

• Analytical Data is used to make business decisions. Examples include grouping
customers for market segmentation or tracking changes in purchase volume over time.

• Every organization will have different questions to answer and different decisions to
make, so Analytical Data is not one-size-fits-all by any stretch of the imagination!

• Analytical Data is best stored in a Data System designed for heavy aggregation, data
mining, and ad hoc queries, called an Online Analytical Processing system, OLAP, or
a Data Warehouse!

To recap, Operational Data Systems, consisting largely of transactional data, are built for
quicker updates. Analytical Data Systems, which are intended for decision making, are built
for more efficient analysis.
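
To make the OLTP/OLAP contrast concrete, here is a minimal, hedged sketch using plain JDBC; the connection URL, credentials, and the orders table with its order_id, customer_id and amount columns are hypothetical placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class OltpVsOlap {
    public static void main(String[] args) throws SQLException {
        // Hypothetical connection string; replace with the real database URL and credentials.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/shop", "user", "pass")) {

            // OLTP: touch one row at a time (create/read/update/delete a single record).
            try (PreparedStatement update = conn.prepareStatement(
                    "UPDATE orders SET amount = ? WHERE order_id = ?")) {
                update.setDouble(1, 49.99);
                update.setInt(2, 1001);
                update.executeUpdate();
            }

            // OLAP: heavy aggregation over many rows to answer an analytical question.
            try (PreparedStatement report = conn.prepareStatement(
                    "SELECT customer_id, SUM(amount) AS total " +
                    "FROM orders GROUP BY customer_id ORDER BY total DESC");
                 ResultSet rs = report.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getInt("customer_id") + " -> " + rs.getDouble("total"));
                }
            }
        }
    }
}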
7. What is Hadoop architecture? | Explain the core components of Hadoop.

Ans. 1. At its core, Hadoop has two major layers namely: (a) Processing/Computation layer
(MapReduce), and (b) Storage layer (Hadoop Distributed File System).

2. The Hadoop architecture is a package of the file system, the MapReduce engine and HDFS
(Hadoop Distributed File System). The MapReduce engine can be MapReduce/MR1 or
YARN/MR2.

3. A Hadoop cluster consists of a single master and multiple slave nodes. The master node
runs the JobTracker and NameNode, whereas each slave node runs a TaskTracker and
DataNode.

Hadoop Distributed File System

1. The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop. Files
are broken into blocks and stored on nodes across the distributed architecture. HDFS follows
a master/slave architecture: a single NameNode performs the role of master, and multiple
DataNodes perform the role of slaves.

2. Both the NameNode and DataNode are capable of running on commodity machines. HDFS
is developed in Java, so any machine that supports Java can easily run the NameNode and
DataNode software.
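
As a small illustrative sketch (not part of the original answer), this is how a client program might read a file from HDFS using the standard FileSystem Java API; the NameNode address and file path below are placeholders.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; in a real cluster this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(new InputStreamReader(
                     fs.open(new Path("/data/sample.txt")), StandardCharsets.UTF_8))) {
            // The client sees one logical file; HDFS transparently reads its blocks
            // from whichever DataNodes hold them, coordinated by the NameNode.
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}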

MapReduce Layer
1. This is a framework which helps Java programs perform parallel computation on data using
key-value pairs. The Map task takes input data and converts it into a dataset which can be
computed as key-value pairs. The output of the Map task is consumed by the Reduce task,
and the output of the reducer gives the desired result.
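
A minimal word-count sketch using the standard Hadoop MapReduce Java API illustrates this key-value flow: the Map task emits (word, 1) pairs from each input line, and the Reduce task sums the counts per word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map task: converts each input line into (word, 1) key-value pairs.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce task: consumes the mapper output, grouped by word, and sums the counts.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}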

Apart from the above-mentioned two core components, Hadoop framework also includes the
following two modules:

1. Hadoop Common: These Java libraries are used to start Hadoop and are used by the other
Hadoop modules. Hadoop Common, or the common utilities, are the Java libraries and files
(scripts) needed by all the other components present in a Hadoop cluster.

These utilities are used by HDFS, YARN, and MapReduce for running the cluster. Hadoop
Common assumes that hardware failures in a Hadoop cluster are common, so they need to be
handled automatically in software by the Hadoop framework.

2. Hadoop YARN: This is a framework for job scheduling and cluster resource management.
Yet Another Resource Negotiator, as the name implies, YARN is the one who helps to manage
the resources across the clusters. In short, it performs scheduling and resource allocation for
the Hadoop System.

• It consists of three major components, i.e.

1. Resource Manager

2. Node Manager

3. Application Manager

• The Resource Manager has the privilege of allocating resources for the applications in a
system, whereas Node Managers work on the allocation of resources such as CPU,
memory and bandwidth per machine and later acknowledge the Resource Manager. The
Application Manager works as an interface between the Resource Manager and Node
Managers and performs negotiations as per the requirements of the two.
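
As a small, hedged sketch of YARN's role, the snippet below uses the YARN client API to ask the Resource Manager for reports on the running Node Managers; it assumes the cluster's yarn-site.xml is available on the classpath.

import java.util.List;

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterNodes {
    public static void main(String[] args) throws Exception {
        // Picks up the Resource Manager address from yarn-site.xml on the classpath.
        YarnConfiguration conf = new YarnConfiguration();

        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(conf);
        yarn.start();
        try {
            // The Resource Manager reports each running Node Manager's identity and
            // capacity (memory and vcores), i.e. the resources it can allocate to applications.
            List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
            for (NodeReport node : nodes) {
                System.out.println(node.getNodeId() + " capacity: " + node.getCapability());
            }
        } finally {
            yarn.stop();
        }
    }
}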

8.How does Hadoop work? Explain the advantages of Hadoop.

Ans. It is quite expensive to build bigger servers with heavy configurations to handle large-
scale processing. As an alternative, you can tie together many commodity computers, each
with a single CPU, as a single functional distributed system; practically, the clustered machines
can read the dataset in parallel and provide much higher throughput. Moreover, this is cheaper
than one high-end server. So this is the first motivational factor behind using Hadoop: it
runs across clustered, low-cost machines. Hadoop runs code across a cluster of
computers.

This process includes the following core tasks that Hadoop performs:

• Data is initially divided into directories and files. Files are divided into uniform-sized blocks of
128 MB or 64 MB (preferably 128 MB); a configuration sketch follows this list.

• These files are then distributed across various cluster nodes for further processing.

• HDFS, being on top of the local file system, supervises the processing.

• Blocks are replicated for handling hardware failure.

• Checking that the code was executed successfully.

• Performing the sort that takes place between the map and reduce stages.

• Sending the sorted data to a certain computer.

• Writing the debugging logs for each job.
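
To make the block-size and replication points above concrete, here is a hedged sketch: dfs.blocksize and dfs.replication are standard HDFS properties, though in practice they are usually set cluster-wide in hdfs-site.xml rather than in application code, and the NameNode address and file path below are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSettingsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");  // placeholder NameNode address
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024); // split files into 128 MB blocks
        conf.setInt("dfs.replication", 3);                 // keep 3 copies to survive hardware failure

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/data/example.txt"))) {
            // Each written block is replicated to 3 DataNodes chosen by the NameNode.
            out.writeUTF("hello hadoop");
        }
    }
}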

Advantages of Hadoop:

o Fast: The Hadoop framework allows the user to quickly write and test distributed systems.
It is efficient, and it automatically distributes the data and work across the machines, in
turn utilizing the underlying parallelism of the CPU cores.
o Resilient to failure: Hadoop does not rely on hardware to provide fault-tolerance and
high availability (FTHA), rather Hadoop library itself has been designed to detect and
handle failures at the application layer.
o Servers can be added or removed from the cluster dynamically and Hadoop continues
to operate without interruption.
o Another big advantage of Hadoop is that, apart from being open source, it is compatible
with all platforms since it is Java based.
o Cost Effective: Hadoop is open source and uses commodity hardware to store data, so
it is really cost-effective compared to a traditional relational database management
system.
o Scalable: Hadoop cluster can be extended by just adding nodes in the cluster.
OR

o Fast: In HDFS the data is distributed over the cluster and mapped, which helps in
faster retrieval. Even the tools to process the data are often on the same servers, thus
reducing the processing time. Hadoop is able to process terabytes of data in minutes and
petabytes in hours.

o Resilient to failure: HDFS can replicate data over the network, so if one node goes down
or some other network failure happens, Hadoop takes another copy of the data and uses
it. Normally, data is replicated three times, but the replication factor is configurable.

9.Explain the core components of Hadoop. | Combine Ans 7 & 8.

Ans. HDFS:

• HDFS is the primary or major component of Hadoop ecosystem and is responsible for
storing large data sets of structured or unstructured data across various nodes and
thereby maintaining the metadata in the form of log files.

• HDFS consists of two core components, i.e.

1. NameNode

2. DataNode

• The NameNode is the prime node; it contains metadata (data about data) and requires
comparatively fewer resources than the DataNodes that store the actual data. These
DataNodes are commodity hardware in the distributed environment, undoubtedly
making Hadoop cost-effective.

• HDFS maintains all the coordination between the clusters and hardware, thus working at
the heart of the system.

YARN:

• Yet Another Resource Negotiator, as the name implies, YARN is the one who helps to
manage the resources across the clusters. In short, it performs scheduling and resource
allocation for the Hadoop System.

• It consists of three major components, i.e.

1. Resource Manager

2. Node Manager

3. Application Manager

• The Resource Manager has the privilege of allocating resources for the applications in a
system, whereas Node Managers work on the allocation of resources such as CPU,
memory and bandwidth per machine and later acknowledge the Resource Manager. The
Application Manager works as an interface between the Resource Manager and Node
Managers and performs negotiations as per the requirements of the two.

MapReduce:

• By making use of distributed and parallel algorithms, MapReduce makes it possible
to carry out the processing logic and helps to write applications which transform big
data sets into manageable ones.

• MapReduce makes use of two functions, i.e. Map() and Reduce(), whose tasks are as follows (a small conceptual sketch follows this list):

1. Map() performs sorting and filtering of data, thereby organizing it into groups.
Map() generates a key-value-pair-based result which is later processed by the
Reduce() method.

2. Reduce(), as the name suggests, does the summarization by aggregating the
mapped data. In simple terms, Reduce() takes the output generated by Map() as input
and combines those tuples into a smaller set of tuples.
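
To illustrate the Map()/Reduce() contract without a cluster, here is a small plain-Java simulation (this is not the Hadoop API itself): map emits (word, 1) pairs, the pairs are grouped by key as in the shuffle phase, and reduce sums each group.

import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MapReduceIdea {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("big data", "big hadoop", "data data");

        // Map phase: each line becomes a stream of (word, 1) key-value pairs.
        // Shuffle + Reduce phase: pairs are grouped by key and the 1s are summed per word.
        Map<String, Integer> counts = lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" "))
                        .map(word -> new SimpleEntry<>(word, 1)))
                .collect(Collectors.groupingBy(
                        e -> e.getKey(),
                        Collectors.summingInt(e -> e.getValue())));

        System.out.println(counts); // e.g. {big=2, data=3, hadoop=1}
    }
}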

Hadoop common or Common Utilities:

Hadoop Common, or the common utilities, are the Java libraries and files (scripts) needed by
all the other components present in a Hadoop cluster.

These utilities are used by HDFS, YARN, and MapReduce for running the cluster. Hadoop
Common assumes that hardware failures in a Hadoop cluster are common, so they need to be
handled automatically in software by the Hadoop framework.

10.Explain the importance of Hadoop technology in big data analytics.

Ans. Importance of Hadoop

Hadoop is a valuable technology for big data analytics for the reasons as mentioned below:
o Stores and processes humongous data at a faster rate. The data may be structured,
semi-structured, or unstructured

o Protects applications and data processing against hardware failures. Whenever a node
goes down, processing is automatically redirected to other nodes, ensuring that
applications keep running

o Organizations can store raw data and process or filter it for specific analytic uses as and
when required

o As Hadoop is scalable, organizations can handle more data by adding more nodes into
the systems

o Supports real-time analytics, which drives better operational decision-making, as well as
batch workloads for historical analysis
