Unit-1 BDA
What is Data?
Data refers to the quantities, characters, or symbols on which operations are performed by a computer, and which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media. Data can be classified into three types:
1. Structured
2. Unstructured
3. Semi-structured
Structured
Any data that can be stored, accessed, and processed in a fixed format is termed 'structured' data.
Example: an employee table in a relational database, with fixed columns such as ID, name, and salary.
Unstructured
Any data whose form or structure is unknown is classified as unstructured data. In addition to its huge size, unstructured data poses multiple challenges in terms of processing it to derive value from it.
Semi-structured
Semi-structured data can contain both forms of data. We can see semi-structured data as structured in form, but it is actually not defined by, for example, a table definition in a relational DBMS. An example of semi-structured data is data represented in an XML file:
<rec><name>PrashantRao</name><sex>Male</sex><age>35</age></rec>
<rec><name>SeemaR.</name><sex>Female</sex><age>41</age>
</rec>
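The XML records above can be parsed with a few lines of Python using the standard library. This is a minimal sketch; the wrapping <records> root element is added here only so that the fragment forms a well-formed document.

```python
import xml.etree.ElementTree as ET

# The two <rec> elements from the notes, wrapped in a root element
# so that the fragment is a well-formed XML document.
xml_data = """<records>
  <rec><name>PrashantRao</name><sex>Male</sex><age>35</age></rec>
  <rec><name>SeemaR.</name><sex>Female</sex><age>41</age></rec>
</records>"""

root = ET.fromstring(xml_data)

# Each <rec> has tagged fields but no fixed relational schema,
# which is what makes the data semi-structured.
people = [
    {"name": rec.findtext("name"),
     "sex": rec.findtext("sex"),
     "age": int(rec.findtext("age"))}
    for rec in root.findall("rec")
]

print(people)
```

Note how the structure is self-describing (tags name the fields) even though no table definition exists anywhere.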
Volume
Variety
Velocity
Veracity
Value
Volume
Big Data itself relates to enormous size. The size of data plays a crucial role in determining its value, and whether particular data can actually be considered Big Data depends on its volume. Hence, 'Volume' is one characteristic that must be considered when dealing with Big Data solutions.
Velocity: The term 'velocity' refers to the speed at which data is generated. How fast the data is generated and processed to meet demands determines the real potential in the data.
Big Data Velocity deals with the speed at which data flows in from sources like
business processes, application logs, networks, and social media sites, sensors,
Mobile devices, etc. The flow of data is massive and continuous.
Veracity: refers to the accuracy and trustworthiness of the data. It relates to assurance of data quality, integrity, credibility, and accuracy.
Value: value is the final output derived from the data; this final value is used for further development.
3. CLOUD AND BIG DATA
1. Big Data :
Big data refers to data that is huge in size and also increasing rapidly with respect to time. Big data includes structured, unstructured, and semi-structured data. Big data cannot be stored and processed with traditional data management tools; it needs specialized big data management tools. It refers to complex and large data sets having the 5 V's (Volume, Velocity, Veracity, Value, and Variety) as information assets. It includes data storage, data analysis, data mining, and data visualization.
Examples of sources where big data is generated include social media data, e-commerce data, weather station data, IoT sensor data, etc.
Variability of Big Data: inconsistency that the data can show at times.
Benefits of Big Data:
Cost Savings
Better decision-making
Better Sales insights
Increased Productivity
Improved customer service.
Challenges of Big Data:
Incompatible tools
Security and Privacy Concerns
Need for cultural change
Rapid change in technology
Specific hardware needs.
2. Cloud Computing :
Cloud computing refers to the on-demand availability of computing resources over the internet. Examples of cloud computing vendors that provide cloud computing services are Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform, IBM Cloud Services, etc.
Characteristics of cloud computing:
On-Demand availability
Accessible through a network
Elastic Scalability
Pay as you go model
Multi-tenancy and resource pooling.
Disadvantages of cloud computing:
Vendor lock-in
Limited Control
Security Concern
Downtime due to various reason
Requires good Internet connectivity.
4. INDUSTRY EXAMPLE OF BIG DATA
Netflix, a giant streaming platform, has made it big using big data analytics. It is one of the most prominent examples of how advances in technology have helped a brand grow famous and successful.
Netflix has been using big data analytics to optimize overall quality and user experience. Through big data analytics, Netflix targets users with new offers for shows that will interest them and tailors what it shows to their preferences. Together, these efforts have led to the success of the Netflix streaming platform.
With the help of big data analytics, Netflix knows what you want and what you would like to watch next. Knowing and understanding user preferences have proven to be two pillars of Netflix's success: they reveal viewers' viewing habits, which feed the prediction system powered by the algorithms designed by its developers.
5. WEB ANALYTICS
We use web analytics to track key metrics and analyze visitors’ activity and traffic
flow. It is a tactical approach to collect data and generate reports.
We need web analytics to assess the success rate of a website and its associated business.
The primary objective of carrying out Web Analytics is to optimize the website in
order to provide better user experience. It provides a data-driven report to measure
visitors’ flow throughout the website.
Take a look at the following illustration. It depicts the process of web analytics.
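As a small illustration of the metrics web analytics computes, here is a sketch in plain Python over a hypothetical page-view log. The visitor IDs and pages are invented; a real tracker would collect far richer records (timestamps, referrers, devices, etc.).

```python
from collections import Counter

# Hypothetical page-view log: (visitor_id, page) pairs standing in for
# the raw data a web-analytics tracker would collect.
log = [
    ("v1", "/home"), ("v1", "/products"), ("v2", "/home"),
    ("v3", "/home"), ("v3", "/blog"), ("v3", "/products"),
    ("v2", "/home"), ("v4", "/home"),
]

page_views = len(log)                                   # total hits
unique_visitors = len({visitor for visitor, _ in log})  # distinct visitors
top_pages = Counter(page for _, page in log).most_common(2)

# Bounce rate: share of visitors who viewed exactly one page.
views_per_visitor = Counter(visitor for visitor, _ in log)
bounces = sum(1 for n in views_per_visitor.values() if n == 1)
bounce_rate = bounces / unique_visitors

print(page_views, unique_visitors, top_pages, bounce_rate)
```

These four numbers (traffic volume, audience size, popular content, and bounce rate) are the kind of data-driven report the paragraph above describes.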
1. Tracking customer spending habits and shopping behavior: In big retail stores (like Amazon, Walmart, Big Bazaar, etc.), the management team has to keep data on customers' spending habits (which products customers spend on, which brands they prefer, how frequently they spend), their shopping behavior, and their most-liked products (so those products can be kept in the store). The production/collection rate of a product is set based on which products are searched for or sold the most.
The banking sector uses customers' spending-behavior data to offer a particular customer a discount or cashback for buying a product they like using the bank's credit or debit card. In this way, banks can send the right offer to the right person at the right time.
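The "right offer to the right person" idea boils down to grouping transactions per customer and finding each customer's top spending category. A minimal sketch, with invented customers, categories, and amounts:

```python
from collections import defaultdict

# Hypothetical card transactions: (customer, category, amount).
transactions = [
    ("alice", "electronics", 1200.0), ("alice", "grocery", 80.0),
    ("alice", "electronics", 450.0), ("bob", "grocery", 60.0),
    ("bob", "travel", 900.0), ("bob", "grocery", 75.0),
]

# Total spend per customer per category.
spend = defaultdict(lambda: defaultdict(float))
for customer, category, amount in transactions:
    spend[customer][category] += amount

# The category each customer spends the most on -- the signal a bank
# could use to target a discount or cashback offer.
preferred = {c: max(cats, key=cats.get) for c, cats in spend.items()}
print(preferred)
```

At real scale this same group-and-aggregate pattern is exactly what a MapReduce or Spark job would perform over billions of transactions.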
2. Smart traffic systems: Data about traffic conditions on different roads is collected through cameras placed beside the roads and at the entry and exit points of the city, and through GPS devices placed in vehicles (Ola and Uber cabs, etc.).
All such data are analyzed, and jam-free or less congested, less time-consuming routes are recommended. In this way, a smart traffic system can be built in a city through big data analysis. An added benefit is that fuel consumption can be reduced.
3. Secure air traffic systems: Sensors are present at various places on an aircraft (such as the propellers). These sensors capture data like flight speed, moisture, temperature, and other environmental conditions. Based on analysis of such data, environmental parameters within the aircraft are set up and varied.
4. Self-driving cars: Big data analysis helps drive a car without human intervention. Cameras and sensors placed at various spots on the car gather data such as the size of surrounding cars, obstacles, and the distance from them.
These data are analyzed, and then various calculations are carried out: how many degrees to turn, what the speed should be, when to stop, etc. These calculations enable the car to act automatically.
5. Virtual personal assistant tools: Big data analysis helps virtual personal assistant tools (like Siri on Apple devices, Cortana on Windows, and Google Assistant on Android) answer the various questions users ask. The tool tracks the user's location, local time, season, and other data related to the question, then analyzes all of it to provide an answer.
For example, suppose a user asks, "Do I need to take an umbrella?" The tool collects data such as the user's location and the season and weather conditions there, analyzes these data to determine whether there is a chance of rain, and then provides the answer.
6. IoT
Manufacturing companies install IoT sensors in machines to collect operational data. By analyzing such data, it can be predicted how long a machine will work without problems and when it will require repair, so the company can act before the machine develops serious issues or goes down entirely. Thus, the cost of replacing the whole machine can be saved.
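The predictive-maintenance idea above can be sketched with a simple trend check over sensor readings. The readings, threshold, and smoothing window below are invented for illustration; real systems use far more sophisticated models.

```python
# Hypothetical daily vibration readings (mm/s) from an IoT sensor on a
# machine; a steady upward trend suggests wear before outright failure.
readings = [2.1, 2.2, 2.1, 2.3, 2.6, 3.0, 3.5, 4.1]
ALERT_THRESHOLD = 3.0   # assumed service limit for this machine

def moving_average(values, window=3):
    """Smooth out single-sample noise before comparing to the limit."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]

smoothed = moving_average(readings)
# Index (within the smoothed series) of the first day the averaged
# vibration exceeds the limit -- the point to schedule maintenance.
alert_day = next((i for i, v in enumerate(smoothed) if v > ALERT_THRESHOLD), None)
print(smoothed, alert_day)
```

The maintenance alert fires while the machine is still running, which is precisely how replacing the whole machine is avoided.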
Big data technology is used to handle both real-time and batch-related data, and is defined as the software utilities built for this purpose. Big data technologies include Apache Hadoop, Apache Spark, MongoDB, Apache Cassandra, Plotly, Pig, and Tableau.
Hadoop: When it comes to handling big data, Hadoop is one of the leading technologies that come into play. It is based on the MapReduce architecture and is mainly used to process information in batches. The Hadoop framework was introduced to store and process data in a distributed processing environment, running in parallel on commodity hardware with a simple programming model.
Apart from this, Hadoop is well suited to storing and analyzing data from various machines at high speed and low cost, which is why it is known as one of the core components of big data technologies. The Apache Software Foundation released Hadoop 1.0 in December 2011. Hadoop is written in the Java programming language.
Cassandra: Cassandra is one of the leading big data technologies among the top NoSQL databases. It is open source, distributed, and offers extensive wide-column storage. It is freely available and provides high availability without failure, which helps it handle data efficiently on large commodity clusters. Cassandra's essential features include fault tolerance, scalability, MapReduce support, a distributed architecture, eventual and tunable consistency, a query language, and multi-datacenter replication.
Cassandra was developed at Facebook for its inbox search feature, open-sourced in 2008, and later taken over by the Apache Software Foundation. It is written in the Java programming language.
8. INTRODUCTION TO HADOOP:
Hadoop is an open-source framework from Apache used to store, process, and analyze data of very huge volume. Hadoop is written in Java and is not used for OLAP (online analytical processing); it is used for batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn, and many more. Moreover, it can be scaled up just by adding nodes to the cluster.
History of Hadoop
Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the Google File System paper published by Google.
Hadoop Architecture
The Hadoop architecture is a package of the file system, MapReduce engine and the
HDFS (Hadoop Distributed File System). The MapReduce engine can be
MapReduce/MR1 or YARN/MR2.
A Hadoop cluster consists of a single master and multiple slave nodes. The master node includes the Job Tracker and NameNode, whereas each slave node includes a Task Tracker and DataNode.
NameNode
DataNode
Job Tracker
The role of the Job Tracker is to accept MapReduce jobs from clients and process the data by using the NameNode. In response, the NameNode provides metadata to the Job Tracker.
Task Tracker
MapReduce Layer
MapReduce comes into play when a client application submits a MapReduce job to the Job Tracker. In response, the Job Tracker sends the request to the appropriate Task Trackers. Sometimes a Task Tracker fails or times out; in such a case, that part of the job is rescheduled.
Hadoop has a distributed file system known as HDFS. HDFS splits files into blocks and sends them across the various nodes that form large clusters. Even in case of a node failure the system keeps operating, and data transfer between the nodes is facilitated by HDFS.
Advantages of HDFS:
Disadvantages of HDFS:
The biggest disadvantage is that it is not fit for small quantities of data.
It has issues related to potential instability and is restrictive and rough in nature.
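The block-splitting and replication idea behind HDFS can be sketched in a few lines of plain Python. The block size, replication factor, and node names below are invented, toy-scale values (HDFS defaults are a 128 MB block size and a replication factor of 3).

```python
from itertools import cycle

# Toy parameters standing in for HDFS defaults (128 MB blocks,
# replication factor 3).
BLOCK_SIZE = 8          # bytes, illustrative only
REPLICATION = 2
datanodes = ["node1", "node2", "node3"]

data = b"hello big data world, stored in blocks"
# Split the "file" into fixed-size blocks.
blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

# Round-robin placement: each block is copied to REPLICATION nodes,
# so losing any single node never loses a block.
nodes = cycle(datanodes)
placement = {i: [next(nodes) for _ in range(REPLICATION)]
             for i in range(len(blocks))}

print(len(blocks), placement[0])
```

In real HDFS the NameNode keeps this placement map as metadata, while the DataNodes hold the block contents themselves.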
MapReduce
Map: As the name suggests its main use is to map the input data in key-value
pairs. The input to the map may be a key-value pair where the key can be the id
of some kind of address and value is the actual value that it keeps. The Map()
function will be executed in its memory repository on each of these input key-
value pairs and generates the intermediate key-value pair which works as input
for the Reducer or Reduce() function.
Reduce: The intermediate key-value pairs that serve as input to the Reducer are shuffled, sorted, and sent to the Reduce() function. The Reducer aggregates or groups the data based on key-value pairs according to the reducer algorithm written by the developer.
Client: The MapReduce client is the one who brings the Job to the MapReduce
for processing. There can be multiple clients available that continuously send jobs
for processing to the Hadoop MapReduce Manager.
Job: The MapReduce Job is the actual work that the client wanted to do which is
comprised of so many smaller tasks that the client wants to process or execute.
Hadoop MapReduce Master: It divides the particular job into subsequent job-
parts.
Job-Parts: The task or sub-jobs that are obtained after dividing the main job. The
result of all the job-parts combined to produce the final output.
Input Data: The data set that is fed to the MapReduce for processing.
In MapReduce, we have a client. The client submits a job of a particular size to the Hadoop MapReduce Master. The MapReduce Master then divides this job into further equivalent job parts. These job parts are made available to the Map and Reduce tasks. The Map and Reduce tasks contain the program required by the use case the particular company is solving; the developer writes the logic to fulfill the industry's requirement.
The input data which we are using is then fed to the Map Task and the Map will
generate intermediate key-value pair as its output.
The output of Map i.e. these key-value pairs are then fed to the Reducer and the final
output is stored on the HDFS. There can be n number of Map and Reduce tasks made
available for processing the data as per the requirement.
The algorithms for Map and Reduce are written in a highly optimized way so that time and space complexity are kept to a minimum.
MapReduce- Example
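The flow described above can be simulated in plain Python with the classic word-count job. This is an illustrative sketch of the map, shuffle, and reduce phases, not Hadoop's actual API; the sample documents are invented.

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit an intermediate (word, 1) pair for every word."""
    return [(word, 1) for word in line.lower().split()]

def reduce_fn(word, counts):
    """Reduce: aggregate all values that share a key."""
    return word, sum(counts)

documents = ["big data needs big tools", "hadoop processes big data"]

# Map phase: every input record produces intermediate key-value pairs.
intermediate = [pair for line in documents for pair in map_fn(line)]

# Shuffle phase: group the intermediate pairs by key.
grouped = defaultdict(list)
for word, count in intermediate:
    grouped[word].append(count)

# Reduce phase: one output record per key.
result = dict(reduce_fn(w, c) for w, c in sorted(grouped.items()))
print(result)
```

In Hadoop, the map and reduce phases run in parallel across the cluster's Task Trackers, and the shuffle moves intermediate pairs between nodes; here everything runs in one process to make the data flow visible.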
Hadoop YARN
YARN stands for "Yet Another Resource Negotiator". It was introduced in Hadoop 2.0 to remove the bottleneck of the Job Tracker that was present in Hadoop 1.0. YARN was described as a "redesigned resource manager" at the time of its launch, but it has since evolved into a large-scale distributed operating system for big data processing.
The YARN architecture separates the resource-management layer from the processing layer. The responsibility of the Hadoop 1.0 Job Tracker is now split between the Resource Manager and the Application Master.
YARN also allows different data processing engines like graph processing, interactive
processing, stream processing as well as batch processing to run and process data
stored in HDFS (Hadoop Distributed File System) thus making the system much more
efficient. Through its various components, it can dynamically allocate various
resources and schedule the application processing.
For large volume data processing, it is quite necessary to manage the available
resources properly so that every application can leverage them.
YARN Features:
Advantages :
1. Flexibility: YARN offers flexibility to run various types of distributed processing
systems such as Apache Spark, Apache Flink, Apache Storm, and others. It
allows multiple processing engines to run simultaneously on a single Hadoop
cluster.
2. Resource Management: YARN provides an efficient way of managing resources
in the Hadoop cluster. It allows administrators to allocate and monitor the
resources required by each application in a cluster, such as CPU, memory, and
disk space.
3. Scalability: YARN is designed to be highly scalable and can handle thousands of
nodes in a cluster. It can scale up or down based on the requirements of the
applications running on the cluster.
4. Improved Performance: YARN offers better performance by providing a
centralized resource management system. It ensures that the resources are
optimally utilized, and applications are efficiently scheduled on the available
resources.
5. Security: YARN provides robust security features such as Kerberos
authentication, Secure Shell (SSH) access, and secure data transmission. It
ensures that the data stored and processed on the Hadoop cluster is secure.
Disadvantages :
1) Complexity: YARN adds complexity to the Hadoop ecosystem. It requires
additional configurations and settings, which can be difficult for users who are
not familiar with YARN.
2) Overhead: YARN introduces additional overhead, which can slow down the
performance of the Hadoop cluster. This overhead is required for managing
resources and scheduling applications.
3) Latency: YARN introduces additional latency in the Hadoop ecosystem. This
latency can be caused by resource allocation, application scheduling, and
communication between components.
4) Single Point of Failure: YARN can be a single point of failure in the Hadoop
cluster. If YARN fails, it can cause the entire cluster to go down. To avoid this,
administrators need to set up a backup YARN instance for high availability.
5) Limited Support: YARN has limited support for non-Java programming
languages. Although it supports multiple processing engines, some engines have
limited language support, which can limit the usability of YARN in certain
environments.
1. Hadoop
It is recognized as one of the most popular big data tools for analyzing large data sets, as the platform can distribute data across different servers. Another benefit of using Hadoop is that it can also run on cloud infrastructure.
This open-source software framework is used when the data volume exceeds the
available memory. This big data tool is also ideal for data exploration, filtration,
sampling, and summarization. It consists of four parts:
Hadoop Distributed File System: This file system, commonly known as HDFS, is a distributed file system that provides very high aggregate bandwidth across the cluster.
MapReduce: It refers to a programming model for processing big data.
YARN: All Hadoop’s resources in its infrastructure are managed and scheduled using
this platform.
Libraries: They allow other modules to work efficiently with Hadoop.
2. Apache Spark
This big data tool is the most preferred tool for data analysis over other types of
programs due to its ability to store large computations in memory. It can run
complicated algorithms, which is a prerequisite for dealing with large data sets.
Proficient at handling both batch and real-time data, Apache Spark is flexible enough to work with HDFS as well as with OpenStack Swift or Apache Cassandra. Often used as an alternative to MapReduce, Spark can run tasks up to 100x faster than Hadoop's MapReduce.
3. Cassandra
Apache Cassandra is one of the best big data tools for processing structured data sets. Open-sourced in 2008 and later developed under the Apache Software Foundation, it is recognized as the best open-source big data tool for scalability. It has proven fault tolerance on cloud infrastructure and commodity hardware, making it all the more important for big data uses.
It also offers features that few other relational or NoSQL databases provide. These include simple operations, cloud availability points, strong performance, and continuous availability as a data source, to name a few. Apache Cassandra is used by giants like Twitter, Cisco, and Netflix.
4. MongoDB
Thanks to its power to store data in documents, it is very flexible and can be easily
adapted by companies. It can store any data type, be it integer, strings, Booleans,
arrays, or objects. MongoDB is easy to learn and provides support for multiple
technologies and platforms.
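The flexibility described above can be illustrated without a running MongoDB server, using plain Python dicts as stand-in documents. The documents and the query below are invented examples; a real MongoDB deployment would express the query in its own JSON-based query language.

```python
# Illustration of the document model: two "documents" in one
# "collection" that need not share the same fields or value types.
collection = [
    {"name": "sensor-a", "readings": [21.5, 22.0, 21.8], "active": True},
    {"name": "order-17", "items": ["disk", "ram"], "total": 149,
     "priority": None},
]

# A simple query: find documents that have a "readings" field --
# roughly what find({"readings": {"$exists": True}}) would do in
# MongoDB's own query language.
with_readings = [doc for doc in collection if "readings" in doc]
print([doc["name"] for doc in with_readings])
```

Because no schema is imposed up front, integers, strings, Booleans, arrays, and nested objects can coexist in the same collection, which is what makes the document model easy for companies to adapt.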
5. HPCC
Big Data vs Cloud Computing:
01. Big data refers to data that is huge in size and increasing rapidly with respect to time. | Cloud computing refers to the on-demand availability of computing resources over the internet.
03. Volume of data, Velocity of data, Variety of data, Veracity of data, and Value of data are considered the 5 most important characteristics of big data. | On-demand availability of IT resources, broad network access, resource pooling, elasticity, and measured service are considered the main characteristics of cloud computing.
Mobile BI refers to the access and use of information via mobile devices. With the increasing use of mobile devices for business – not only in
management positions – mobile BI is able to bring business intelligence and analytics
closer to the user when done properly. Whether during a train journey, in the airport
departure lounge or during a meeting break, information can be consumed almost
anywhere and anytime with mobile BI.
Mobile phones' data storage capacity has grown in tandem with their use. You are expected to make decisions and act quickly in this fast-paced environment, and the number of businesses getting such assistance from mobile BI is growing by the day.
To expand your business or boost your business productivity, mobile BI can help, and it works with both small and large businesses. Mobile BI can help you whether you are a salesperson or a CEO. There is high demand for mobile BI in order to reduce the time to information and use the saved time for quick decision-making.
Advantages of mobile BI
1)Simple access
Mobile BI is not restricted to a single mobile device or a certain place. You can view
your data at any time and from any location. Having real-time visibility into a firm
improves production and the daily efficiency of the business. Obtaining a company's
perspective with a single click simplifies the process.
2)Competitive advantage
Many firms are seeking better and more responsive methods to do business in order to
stay ahead of the competition. Easy access to real-time data improves company
opportunities and raises sales and capital. This also aids in making the necessary
decisions as market conditions change.
3)Simple decision-making
As previously stated, mobile BI provides access to real-time data at any time and from
any location. During its demand, Mobile BI offers the information. This assists
consumers in obtaining what they require at the time. As a result, decisions are made
quickly.
4)Increase Productivity
By extending BI to mobile, the organization's teams can access critical company data
when they need it. Obtaining all of the corporate data with a single click frees up a
significant amount of time to focus on the smooth and efficient operation of the firm.
Increased productivity results in a smooth and quick-running firm.
Disadvantages of mobile BI
1)Stack of data
The primary function of mobile BI is to store data in a systematic manner and then present it to the user as required. As a result, mobile BI stores all of the information and ends up with heaps of earlier data. The corporation only needs a small portion of the previous data, but it has to store the entire data set, which piles up in the stack.
2)Expensive
Mobile BI can be quite costly at times. Large corporations can continue to pay for its expensive services, but small businesses cannot. The cost of mobile BI itself is not the whole picture: we must additionally consider the rates of the IT workers needed for the smooth operation of BI, as well as the hardware costs involved.
3)Time consuming
Businesses prefer mobile BI because it promises a quick procedure, and companies are not patient enough to wait long for data before acting on it. In today's fast-paced environment, anything that can produce results quickly is valuable. However, since the system is built from data-warehouse data, implementing BI in an enterprise can take more than 18 months.
12. CROWD SOURCING ANALYTICS
Enterprise
IT
Marketing , Education , Finance , Science and health
How To Crowdsource?
Lack of confidentiality: Asking a large group of people for suggestions brings the threat of ideas being stolen by other organizations.
Repeated ideas: Contestants in crowdsourcing competitions often submit repeated or plagiarized ideas, which wastes time, since reviewing the same ideas is not worthwhile.
13. INTER AND TRANS FIREWALL ANALYTICS
What is Firewall?
Aside from these, cloud-based firewalls are available; "firewall as a service" (FWaaS) is a common name for them. The ability to administer cloud-based firewalls from a central location is a major benefit. Like hardware firewalls, cloud-based firewalls are best known for perimeter security.
Firewall Methodologies –
3)Proxy firewalls –
These are also known as application-layer firewalls. A proxy firewall acts as an
intermediary between the original client and the server. No direct connection takes
place between the original client and the server.
The client, which would otherwise connect directly to the server to communicate with it, instead establishes a connection with the proxy server. The proxy server then establishes a connection with the server on behalf of the client. The client sends its data to the proxy server, which forwards it to the server. A proxy server can operate up to layer 7 (the application layer).
4)Transparent firewall –
By default, a firewall operates at layer 3, but the benefit of a transparent firewall is that it can operate at layer 2. It has two interfaces that act as a bridge and can be configured through a single management IP address. Also, users accessing the network will not even know that a firewall exists.
The main advantage of using a transparent firewall is that we don’t need to re-address
our networks while putting up a firewall in our network. Also, while operating at layer
2, it can still perform functions like building a stateful database, application inspection,
etc.
6)Next-Generation Firewalls –
NGFWs are third-generation security firewalls implemented in either software or hardware. They combine basic firewall features like static packet filtering and application inspection with advanced security features like an integrated intrusion prevention system. Cisco ASA with FirePOWER Services is an example of a next-generation firewall.
What exactly is the work of a firewall?
A firewall system examines network traffic according to pre-set rules. The traffic is then filtered, and any traffic coming from untrustworthy or suspect sources is blocked; the firewall accepts only traffic it has been configured to accept. Firewalls typically intercept network traffic at a computer's port, or entry point. According to pre-defined security criteria, firewalls allow or block particular data packets (units of communication carried over a digital network), so that only trusted IP addresses or sources are allowed to send traffic in.
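A minimal sketch of this rule-matching logic in Python follows. The rule set, addresses, and ports are invented for illustration; real firewalls match many more packet fields and keep connection state.

```python
import ipaddress

# Hypothetical rule table: allow HTTPS from the internal 10.0.0.0/8
# range and SSH from one office subnet; everything else is blocked.
RULES = [
    {"action": "allow", "src": "10.0.0.0/8",     "port": 443},
    {"action": "allow", "src": "192.168.1.0/24", "port": 22},
]

def check_packet(src_ip, dst_port):
    """Return the action of the first rule the packet matches."""
    for rule in RULES:
        if (ipaddress.ip_address(src_ip) in ipaddress.ip_network(rule["src"])
                and dst_port == rule["port"]):
            return rule["action"]
    return "block"   # implicit deny, as in most real firewalls

print(check_packet("10.1.2.3", 443))     # trusted source and port
print(check_packet("203.0.113.9", 443))  # untrusted source
```

The implicit "deny by default" at the end is the key design choice: traffic is allowed only when a rule explicitly says so.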
Firewalls have grown in power, and now encompass a number of built-in functions
and capabilities: