Unit-1 BDA
What is Data?
Data refers to the quantities, characters, or symbols on which operations are performed by a computer, and which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media. Data can be classified into three types:
1. Structured
2. Unstructured
3. Semi-structured
Structured
Any data that can be stored, accessed, and processed in a fixed format is termed 'structured' data.
Example: an employee table in a relational database, with fixed columns such as ID, name, and salary.
Unstructured
Any data whose form or structure is unknown is classified as unstructured data. In addition to its huge size, unstructured data poses multiple challenges in terms of processing it to derive value from it.
Semi-structured
Semi-structured data can contain both forms of data. We can see semi-structured data as structured in form, but it is actually not defined by, for example, a table definition in a relational DBMS. An example of semi-structured data is data represented in an XML file:
<rec><name>PrashantRao</name><sex>Male</sex><age>35</age></rec>
<rec><name>SeemaR.</name><sex>Female</sex><age>41</age>
</rec>
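The XML records above can be parsed with a few lines of Python using the standard library. This is a minimal sketch; the wrapping <records> root element is added here only so that the fragment forms a well-formed document.

```python
import xml.etree.ElementTree as ET

# The two <rec> elements from the notes, wrapped in a root element
# so that the fragment is a well-formed XML document.
xml_data = """<records>
  <rec><name>PrashantRao</name><sex>Male</sex><age>35</age></rec>
  <rec><name>SeemaR.</name><sex>Female</sex><age>41</age></rec>
</records>"""

root = ET.fromstring(xml_data)

# Each <rec> has tagged fields but no fixed relational schema,
# which is what makes the data semi-structured.
people = [
    {"name": rec.findtext("name"),
     "sex": rec.findtext("sex"),
     "age": int(rec.findtext("age"))}
    for rec in root.findall("rec")
]

print(people)
```

Note how the structure is self-describing (tags name the fields) even though no table definition exists anywhere.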
Volume
Variety
Velocity
Veracity
Value
Volume
Big Data itself relates to enormous size. The size of data plays a crucial role in determining its value, and whether particular data can actually be considered Big Data depends on its volume. Hence, 'Volume' is one characteristic that must be considered when dealing with Big Data solutions.
Velocity: The term 'velocity' refers to the speed at which data is generated. How fast the data is generated and processed to meet demands determines the real potential in the data.
Big Data Velocity deals with the speed at which data flows in from sources like
business processes, application logs, networks, and social media sites, sensors,
Mobile devices, etc. The flow of data is massive and continuous.
Veracity: refers to the accuracy and trustworthiness of the data. It relates to assurance of data quality, integrity, credibility, and accuracy.
Value: value is the final output derived from the data; this final value is used for further development.
3. CLOUD AND BIG DATA
1. Big Data :
Big data refers to data that is huge in size and also increasing rapidly with respect to time. Big data includes structured, unstructured, and semi-structured data. Big data cannot be stored and processed with traditional data management tools; it needs specialized big data management tools. It refers to complex and large data sets having the 5 V's (Volume, Velocity, Veracity, Value, and Variety) as information assets. It includes data storage, data analysis, data mining, and data visualization.
Examples of sources where big data is generated include social media data, e-commerce data, weather station data, IoT sensor data, etc.
Variability of Big Data: inconsistency that the data can show at times.
Benefits of Big Data:
Cost Savings
Better decision-making
Better Sales insights
Increased Productivity
Improved customer service.
Challenges of Big Data:
Incompatible tools
Security and Privacy Concerns
Need for cultural change
Rapid change in technology
Specific hardware needs.
2. Cloud Computing :
Cloud computing refers to the on-demand availability of computing resources over the internet. Examples of cloud computing vendors that provide cloud computing services are Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform, IBM Cloud Services, etc.
Characteristics of cloud computing:
On-Demand availability
Accessible through a network
Elastic Scalability
Pay as you go model
Multi-tenancy and resource pooling.
Disadvantages of cloud computing:
Vendor lock-in
Limited Control
Security Concern
Downtime due to various reason
Requires good Internet connectivity.
4. INDUSTRY EXAMPLE OF BIG DATA
Netflix, a giant streaming platform, has made it big using big data analytics. It is one of the most prominent examples of how advances in technology have helped a brand grow famous and successful.
Netflix has been using big data analytics to optimize overall quality and user experience. Through big data analytics, Netflix targets users with new offers for shows that will interest them and tailors what it shows to their preferences. Together, these efforts have led to the success of the Netflix streaming platform.
With the help of big data analytics, Netflix knows what you want and what you would like to watch next. Knowing and understanding user preferences have proven to be two pillars of Netflix's success: they reveal viewers' viewing habits, which feed the prediction system powered by the algorithms designed by its developers.
5. WEB ANALYTICS
We use web analytics to track key metrics and analyze visitors’ activity and traffic
flow. It is a tactical approach to collect data and generate reports.
We need web analytics to assess the success rate of a website and its associated business.
The primary objective of carrying out Web Analytics is to optimize the website in
order to provide better user experience. It provides a data-driven report to measure
visitors’ flow throughout the website.
Take a look at the following illustration. It depicts the process of web analytics.
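As a small illustration of the metrics web analytics computes, here is a sketch in plain Python over a hypothetical page-view log. The visitor IDs and pages are invented; a real tracker would collect far richer records (timestamps, referrers, devices, etc.).

```python
from collections import Counter

# Hypothetical page-view log: (visitor_id, page) pairs standing in for
# the raw data a web-analytics tracker would collect.
log = [
    ("v1", "/home"), ("v1", "/products"), ("v2", "/home"),
    ("v3", "/home"), ("v3", "/blog"), ("v3", "/products"),
    ("v2", "/home"), ("v4", "/home"),
]

page_views = len(log)                                   # total hits
unique_visitors = len({visitor for visitor, _ in log})  # distinct visitors
top_pages = Counter(page for _, page in log).most_common(2)

# Bounce rate: share of visitors who viewed exactly one page.
views_per_visitor = Counter(visitor for visitor, _ in log)
bounces = sum(1 for n in views_per_visitor.values() if n == 1)
bounce_rate = bounces / unique_visitors

print(page_views, unique_visitors, top_pages, bounce_rate)
```

These four numbers (traffic volume, audience size, popular content, and bounce rate) are the kind of data-driven report the paragraph above describes.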
1. Tracking customer spending habits and shopping behavior: In big retail stores (like Amazon, Walmart, Big Bazaar, etc.), the management team has to keep data on customers' spending habits (which products customers spend on, which brands they prefer, how frequently they spend), their shopping behavior, and their most-liked products (so those products can be kept in the store). The production/collection rate of a product is set based on which products are searched for or sold the most.
The banking sector uses customers' spending-behavior data to offer a particular customer a discount or cashback for buying a product they like using the bank's credit or debit card. In this way, banks can send the right offer to the right person at the right time.
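The "right offer to the right person" idea boils down to grouping transactions per customer and finding each customer's top spending category. A minimal sketch, with invented customers, categories, and amounts:

```python
from collections import defaultdict

# Hypothetical card transactions: (customer, category, amount).
transactions = [
    ("alice", "electronics", 1200.0), ("alice", "grocery", 80.0),
    ("alice", "electronics", 450.0), ("bob", "grocery", 60.0),
    ("bob", "travel", 900.0), ("bob", "grocery", 75.0),
]

# Total spend per customer per category.
spend = defaultdict(lambda: defaultdict(float))
for customer, category, amount in transactions:
    spend[customer][category] += amount

# The category each customer spends the most on -- the signal a bank
# could use to target a discount or cashback offer.
preferred = {c: max(cats, key=cats.get) for c, cats in spend.items()}
print(preferred)
```

At real scale this same group-and-aggregate pattern is exactly what a MapReduce or Spark job would perform over billions of transactions.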
2. Smart traffic systems: Data about traffic conditions on different roads is collected through cameras placed beside the roads and at the entry and exit points of the city, and through GPS devices placed in vehicles (Ola and Uber cabs, etc.).
All such data are analyzed, and jam-free or less congested, less time-consuming routes are recommended. In this way, a smart traffic system can be built in a city through big data analysis. An added benefit is that fuel consumption can be reduced.
3. Secure air traffic systems: Sensors are present at various places on an aircraft (such as the propellers). These sensors capture data like flight speed, moisture, temperature, and other environmental conditions. Based on analysis of such data, environmental parameters within the aircraft are set up and varied.
4. Self-driving cars: Big data analysis helps drive a car without human intervention. Cameras and sensors placed at various spots on the car gather data such as the size of surrounding cars, obstacles, and the distance from them.
These data are analyzed, and then various calculations are carried out: how many degrees to turn, what the speed should be, when to stop, etc. These calculations enable the car to act automatically.
5. Virtual personal assistant tools: Big data analysis helps virtual personal assistant tools (like Siri on Apple devices, Cortana on Windows, and Google Assistant on Android) answer the various questions users ask. The tool tracks the user's location, local time, season, and other data related to the question, then analyzes all of it to provide an answer.
For example, suppose a user asks, "Do I need to take an umbrella?" The tool collects data such as the user's location and the season and weather conditions there, analyzes these data to determine whether there is a chance of rain, and then provides the answer.
6. IoT
Manufacturing companies install IoT sensors in machines to collect operational data. By analyzing such data, it can be predicted how long a machine will work without problems and when it will require repair, so the company can act before the machine develops serious issues or goes down entirely. Thus, the cost of replacing the whole machine can be saved.
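The predictive-maintenance idea above can be sketched with a simple trend check over sensor readings. The readings, threshold, and smoothing window below are invented for illustration; real systems use far more sophisticated models.

```python
# Hypothetical daily vibration readings (mm/s) from an IoT sensor on a
# machine; a steady upward trend suggests wear before outright failure.
readings = [2.1, 2.2, 2.1, 2.3, 2.6, 3.0, 3.5, 4.1]
ALERT_THRESHOLD = 3.0   # assumed service limit for this machine

def moving_average(values, window=3):
    """Smooth out single-sample noise before comparing to the limit."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]

smoothed = moving_average(readings)
# Index (within the smoothed series) of the first day the averaged
# vibration exceeds the limit -- the point to schedule maintenance.
alert_day = next((i for i, v in enumerate(smoothed) if v > ALERT_THRESHOLD), None)
print(smoothed, alert_day)
```

The maintenance alert fires while the machine is still running, which is precisely how replacing the whole machine is avoided.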
Big data technology is used to handle both real-time and batch-related data, and is defined as the software utilities built for this purpose. Big data technologies include Apache Hadoop, Apache Spark, MongoDB, Apache Cassandra, Plotly, Pig, and Tableau.
Hadoop: When it comes to handling big data, Hadoop is one of the leading technologies that come into play. It is based on the MapReduce architecture and is mainly used to process information in batches. The Hadoop framework was introduced to store and process data in a distributed processing environment, running in parallel on commodity hardware with a simple programming model.
Apart from this, Hadoop is well suited to storing and analyzing data from various machines at high speed and low cost, which is why it is known as one of the core components of big data technologies. The Apache Software Foundation released Hadoop 1.0 in December 2011. Hadoop is written in the Java programming language.
Cassandra: Cassandra is one of the leading big data technologies among the top NoSQL databases. It is open source, distributed, and offers extensive wide-column storage. It is freely available and provides high availability without failure, which helps it handle data efficiently on large commodity clusters. Cassandra's essential features include fault tolerance, scalability, MapReduce support, a distributed architecture, eventual and tunable consistency, a query language, and multi-datacenter replication.
Cassandra was developed at Facebook for its inbox search feature, open-sourced in 2008, and later taken over by the Apache Software Foundation. It is written in the Java programming language.
8. INTRODUCTION TO HADOOP:
Hadoop is an open-source framework from Apache used to store, process, and analyze data of very huge volume. Hadoop is written in Java and is not used for OLAP (online analytical processing); it is used for batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn, and many more. Moreover, it can be scaled up just by adding nodes to the cluster.
History of Hadoop
Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the Google File System paper published by Google.
Hadoop Architecture
The Hadoop architecture is a package of the file system, MapReduce engine and the
HDFS (Hadoop Distributed File System). The MapReduce engine can be
MapReduce/MR1 or YARN/MR2.
A Hadoop cluster consists of a single master and multiple slave nodes. The master node includes the Job Tracker and NameNode, whereas each slave node includes a Task Tracker and DataNode.
NameNode
DataNode
Job Tracker
The role of the Job Tracker is to accept MapReduce jobs from clients and process the data by using the NameNode. In response, the NameNode provides metadata to the Job Tracker.
Task Tracker
MapReduce Layer
MapReduce comes into play when a client application submits a MapReduce job to the Job Tracker. In response, the Job Tracker sends the request to the appropriate Task Trackers. Sometimes a Task Tracker fails or times out; in such a case, that part of the job is rescheduled.
Hadoop has a distributed file system known as HDFS. HDFS splits files into blocks and sends them across the various nodes that form large clusters. Even in case of a node failure the system keeps operating, and data transfer between the nodes is facilitated by HDFS.
Advantages of HDFS:
Disadvantages of HDFS:
The biggest disadvantage is that it is not fit for small quantities of data.
It has issues related to potential instability and is restrictive and rough in nature.
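The block-splitting and replication idea behind HDFS can be sketched in a few lines of plain Python. The block size, replication factor, and node names below are invented, toy-scale values (HDFS defaults are a 128 MB block size and a replication factor of 3).

```python
from itertools import cycle

# Toy parameters standing in for HDFS defaults (128 MB blocks,
# replication factor 3).
BLOCK_SIZE = 8          # bytes, illustrative only
REPLICATION = 2
datanodes = ["node1", "node2", "node3"]

data = b"hello big data world, stored in blocks"
# Split the "file" into fixed-size blocks.
blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

# Round-robin placement: each block is copied to REPLICATION nodes,
# so losing any single node never loses a block.
nodes = cycle(datanodes)
placement = {i: [next(nodes) for _ in range(REPLICATION)]
             for i in range(len(blocks))}

print(len(blocks), placement[0])
```

In real HDFS the NameNode keeps this placement map as metadata, while the DataNodes hold the block contents themselves.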
MapReduce
Map: As the name suggests its main use is to map the input data in key-value
pairs. The input to the map may be a key-value pair where the key can be the id
of some kind of address and value is the actual value that it keeps. The Map()
function will be executed in its memory repository on each of these input key-
value pairs and generates the intermediate key-value pair which works as input
for the Reducer or Reduce() function.
Reduce: The intermediate key-value pairs that serve as input to the Reducer are shuffled, sorted, and sent to the Reduce() function. The Reducer aggregates or groups the data based on key-value pairs according to the reducer algorithm written by the developer.
Client: The MapReduce client is the one who brings the Job to the MapReduce
for processing. There can be multiple clients available that continuously send jobs
for processing to the Hadoop MapReduce Manager.
Job: The MapReduce Job is the actual work that the client wanted to do which is
comprised of so many smaller tasks that the client wants to process or execute.
Hadoop MapReduce Master: It divides the particular job into subsequent job-
parts.
Job-Parts: The task or sub-jobs that are obtained after dividing the main job. The
result of all the job-parts combined to produce the final output.
Input Data: The data set that is fed to the MapReduce for processing.
In MapReduce, we have a client. The client submits a job of a particular size to the Hadoop MapReduce Master. The MapReduce Master then divides this job into further equivalent job parts. These job parts are made available to the Map and Reduce tasks. The Map and Reduce tasks contain the program required by the use case the particular company is solving; the developer writes the logic to fulfill the industry's requirement.
The input data which we are using is then fed to the Map Task and the Map will
generate intermediate key-value pair as its output.
The output of Map i.e. these key-value pairs are then fed to the Reducer and the final
output is stored on the HDFS. There can be n number of Map and Reduce tasks made
available for processing the data as per the requirement.
The algorithms for Map and Reduce are written in a highly optimized way so that time and space complexity are kept to a minimum.
MapReduce- Example
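The flow described above can be simulated in plain Python with the classic word-count job. This is an illustrative sketch of the map, shuffle, and reduce phases, not Hadoop's actual API; the sample documents are invented.

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit an intermediate (word, 1) pair for every word."""
    return [(word, 1) for word in line.lower().split()]

def reduce_fn(word, counts):
    """Reduce: aggregate all values that share a key."""
    return word, sum(counts)

documents = ["big data needs big tools", "hadoop processes big data"]

# Map phase: every input record produces intermediate key-value pairs.
intermediate = [pair for line in documents for pair in map_fn(line)]

# Shuffle phase: group the intermediate pairs by key.
grouped = defaultdict(list)
for word, count in intermediate:
    grouped[word].append(count)

# Reduce phase: one output record per key.
result = dict(reduce_fn(w, c) for w, c in sorted(grouped.items()))
print(result)
```

In Hadoop, the map and reduce phases run in parallel across the cluster's Task Trackers, and the shuffle moves intermediate pairs between nodes; here everything runs in one process to make the data flow visible.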
Hadoop YARN
YARN stands for "Yet Another Resource Negotiator". It was introduced in Hadoop 2.0 to remove the bottleneck of the Job Tracker that was present in Hadoop 1.0. YARN was described as a "redesigned resource manager" at the time of its launch, but it has since evolved into a large-scale distributed operating system for big data processing.
The YARN architecture separates the resource-management layer from the processing layer. The responsibility of the Hadoop 1.0 Job Tracker is now split between the Resource Manager and the Application Master.
YARN also allows different data processing engines like graph processing, interactive
processing, stream processing as well as batch processing to run and process data
stored in HDFS (Hadoop Distributed File System) thus making the system much more
efficient. Through its various components, it can dynamically allocate various
resources and schedule the application processing.
For large volume data processing, it is quite necessary to manage the available
resources properly so that every application can leverage them.
YARN Features:
Advantages :
1. Flexibility: YARN offers flexibility to run various types of distributed processing
systems such as Apache Spark, Apache Flink, Apache Storm, and others. It
allows multiple processing engines to run simultaneously on a single Hadoop
cluster.
2. Resource Management: YARN provides an efficient way of managing resources
in the Hadoop cluster. It allows administrators to allocate and monitor the
resources required by each application in a cluster, such as CPU, memory, and
disk space.
3. Scalability: YARN is designed to be highly scalable and can handle thousands of
nodes in a cluster. It can scale up or down based on the requirements of the
applications running on the cluster.
4. Improved Performance: YARN offers better performance by providing a
centralized resource management system. It ensures that the resources are
optimally utilized, and applications are efficiently scheduled on the available
resources.
5. Security: YARN provides robust security features such as Kerberos
authentication, Secure Shell (SSH) access, and secure data transmission. It
ensures that the data stored and processed on the Hadoop cluster is secure.
Disadvantages :
1) Complexity: YARN adds complexity to the Hadoop ecosystem. It requires
additional configurations and settings, which can be difficult for users who are
not familiar with YARN.
2) Overhead: YARN introduces additional overhead, which can slow down the
performance of the Hadoop cluster. This overhead is required for managing
resources and scheduling applications.
3) Latency: YARN introduces additional latency in the Hadoop ecosystem. This
latency can be caused by resource allocation, application scheduling, and
communication between components.
4) Single Point of Failure: YARN can be a single point of failure in the Hadoop
cluster. If YARN fails, it can cause the entire cluster to go down. To avoid this,
administrators need to set up a backup YARN instance for high availability.
5) Limited Support: YARN has limited support for non-Java programming
languages. Although it supports multiple processing engines, some engines have
limited language support, which can limit the usability of YARN in certain
environments.
1. Hadoop
It is recognized as one of the most popular big data tools for analyzing large data sets, as the platform can distribute data across different servers. Another benefit of using Hadoop is that it can also run on cloud infrastructure.
This open-source software framework is used when the data volume exceeds the
available memory. This big data tool is also ideal for data exploration, filtration,
sampling, and summarization. It consists of four parts:
Hadoop Distributed File System: This file system, commonly known as HDFS, is a distributed file system that provides very high aggregate bandwidth across the cluster.
MapReduce: It refers to a programming model for processing big data.
YARN: All Hadoop’s resources in its infrastructure are managed and scheduled using
this platform.
Libraries: They allow other modules to work efficiently with Hadoop.
2. Apache Spark
This big data tool is the most preferred tool for data analysis over other types of
programs due to its ability to store large computations in memory. It can run
complicated algorithms, which is a prerequisite for dealing with large data sets.
Proficient at handling both batch and real-time data, Apache Spark is flexible enough to work with HDFS as well as with OpenStack Swift or Apache Cassandra. Often used as an alternative to MapReduce, Spark can run tasks up to 100x faster than Hadoop's MapReduce.
3. Cassandra
Apache Cassandra is one of the best big data tools for processing structured data sets. Open-sourced in 2008 and later developed under the Apache Software Foundation, it is recognized as the best open-source big data tool for scalability. It has proven fault tolerance on cloud infrastructure and commodity hardware, making it all the more important for big data uses.
It also offers features that few other relational or NoSQL databases provide. These include simple operations, cloud availability points, strong performance, and continuous availability as a data source, to name a few. Apache Cassandra is used by giants like Twitter, Cisco, and Netflix.
4. MongoDB
Thanks to its power to store data in documents, it is very flexible and can be easily
adapted by companies. It can store any data type, be it integer, strings, Booleans,
arrays, or objects. MongoDB is easy to learn and provides support for multiple
technologies and platforms.
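The flexibility described above can be illustrated without a running MongoDB server, using plain Python dicts as stand-in documents. The documents and the query below are invented examples; a real MongoDB deployment would express the query in its own JSON-based query language.

```python
# Illustration of the document model: two "documents" in one
# "collection" that need not share the same fields or value types.
collection = [
    {"name": "sensor-a", "readings": [21.5, 22.0, 21.8], "active": True},
    {"name": "order-17", "items": ["disk", "ram"], "total": 149,
     "priority": None},
]

# A simple query: find documents that have a "readings" field --
# roughly what find({"readings": {"$exists": True}}) would do in
# MongoDB's own query language.
with_readings = [doc for doc in collection if "readings" in doc]
print([doc["name"] for doc in with_readings])
```

Because no schema is imposed up front, integers, strings, Booleans, arrays, and nested objects can coexist in the same collection, which is what makes the document model easy for companies to adapt.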
5. HPCC
Big Data vs Cloud Computing:
01. Big data refers to data that is huge in size and increasing rapidly with respect to time. | Cloud computing refers to the on-demand availability of computing resources over the internet.
03. Volume of data, Velocity of data, Variety of data, Veracity of data, and Value of data are considered the 5 most important characteristics of big data. | On-demand availability of IT resources, broad network access, resource pooling, elasticity, and measured service are considered the main characteristics of cloud computing.
Mobile BI refers to the access and use of information via mobile devices. With the increasing use of mobile devices for business – not only in
management positions – mobile BI is able to bring business intelligence and analytics
closer to the user when done properly. Whether during a train journey, in the airport
departure lounge or during a meeting break, information can be consumed almost
anywhere and anytime with mobile BI.
Mobile phones' data storage capacity has grown in tandem with their use. You are expected to make decisions and act quickly in this fast-paced environment, and the number of businesses getting such assistance from mobile BI is growing by the day.
To expand your business or boost your business productivity, mobile BI can help, and it works with both small and large businesses. Mobile BI can help you whether you are a salesperson or a CEO. There is high demand for mobile BI in order to reduce the time to information and use the saved time for quick decision-making.
Advantages of mobile BI
1)Simple access
Mobile BI is not restricted to a single mobile device or a certain place. You can view
your data at any time and from any location. Having real-time visibility into a firm
improves production and the daily efficiency of the business. Obtaining a company's
perspective with a single click simplifies the process.
2)Competitive advantage
Many firms are seeking better and more responsive methods to do business in order to
stay ahead of the competition. Easy access to real-time data improves company
opportunities and raises sales and capital. This also aids in making the necessary
decisions as market conditions change.
3)Simple decision-making
As previously stated, mobile BI provides access to real-time data at any time and from
any location. During its demand, Mobile BI offers the information. This assists
consumers in obtaining what they require at the time. As a result, decisions are made
quickly.
4)Increase Productivity
By extending BI to mobile, the organization's teams can access critical company data
when they need it. Obtaining all of the corporate data with a single click frees up a
significant amount of time to focus on the smooth and efficient operation of the firm.
Increased productivity results in a smooth and quick-running firm.
Disadvantages of mobile BI
1)Stack of data
The primary function of mobile BI is to store data in a systematic manner and then present it to the user as required. As a result, mobile BI stores all of the information and ends up with heaps of earlier data. The corporation only needs a small portion of the previous data, but it has to store the entire data set, which piles up in the stack.
2)Expensive
Mobile BI can be quite costly at times. Large corporations can continue to pay for its expensive services, but small businesses cannot. The cost of mobile BI itself is not the whole picture: we must additionally consider the rates of the IT workers needed for the smooth operation of BI, as well as the hardware costs involved.
3)Time consuming
Businesses prefer mobile BI because it promises a quick procedure, and companies are not patient enough to wait long for data before acting on it. In today's fast-paced environment, anything that can produce results quickly is valuable. However, since the system is built from data-warehouse data, implementing BI in an enterprise can take more than 18 months.
12. CROWD SOURCING ANALYTICS
Enterprise
IT
Marketing , Education , Finance , Science and health
How To Crowdsource?
Lack of confidentiality: Asking a large group of people for suggestions brings the threat of ideas being stolen by other organizations.
Repeated ideas: Contestants in crowdsourcing competitions often submit repeated or plagiarized ideas, which wastes time, since reviewing the same ideas is not worthwhile.
13. INTER AND TRANS FIREWALL ANALYTICS
What is Firewall?
Aside from these, cloud-based firewalls are available; "firewall as a service" (FWaaS) is a common name for them. The ability to administer cloud-based firewalls from a central location is a major benefit. Like hardware firewalls, cloud-based firewalls are best known for perimeter security.
Firewall Methodologies –
3)Proxy firewalls –
These are also known as application-layer firewalls. A proxy firewall acts as an
intermediary between the original client and the server. No direct connection takes
place between the original client and the server.
The client, which would otherwise connect directly to the server to communicate with it, instead establishes a connection with the proxy server. The proxy server then establishes a connection with the server on behalf of the client. The client sends its data to the proxy server, which forwards it to the server. A proxy server can operate up to layer 7 (the application layer).
4)Transparent firewall –
By default, a firewall operates at layer 3, but the benefit of a transparent firewall is that it can operate at layer 2. It has two interfaces that act as a bridge and can be configured through a single management IP address. Also, users accessing the network will not even know that a firewall exists.
The main advantage of using a transparent firewall is that we don’t need to re-address
our networks while putting up a firewall in our network. Also, while operating at layer
2, it can still perform functions like building a stateful database, application inspection,
etc.
6)Next-Generation Firewalls –
NGFWs are third-generation security firewalls implemented in either software or hardware. They combine basic firewall features like static packet filtering and application inspection with advanced security features like an integrated intrusion prevention system. Cisco ASA with FirePOWER Services is an example of a next-generation firewall.
What exactly is the work of a firewall?
A firewall system examines network traffic according to pre-set rules. The traffic is then filtered, and any traffic coming from untrustworthy or suspect sources is blocked; the firewall accepts only traffic it has been configured to accept. Firewalls typically intercept network traffic at a computer's port, or entry point. According to pre-defined security criteria, firewalls allow or block particular data packets (units of communication carried over a digital network), so that only trusted IP addresses or sources are allowed to send traffic in.
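A minimal sketch of this rule-matching logic in Python follows. The rule set, addresses, and ports are invented for illustration; real firewalls match many more packet fields and keep connection state.

```python
import ipaddress

# Hypothetical rule table: allow HTTPS from the internal 10.0.0.0/8
# range and SSH from one office subnet; everything else is blocked.
RULES = [
    {"action": "allow", "src": "10.0.0.0/8",     "port": 443},
    {"action": "allow", "src": "192.168.1.0/24", "port": 22},
]

def check_packet(src_ip, dst_port):
    """Return the action of the first rule the packet matches."""
    for rule in RULES:
        if (ipaddress.ip_address(src_ip) in ipaddress.ip_network(rule["src"])
                and dst_port == rule["port"]):
            return rule["action"]
    return "block"   # implicit deny, as in most real firewalls

print(check_packet("10.1.2.3", 443))     # trusted source and port
print(check_packet("203.0.113.9", 443))  # untrusted source
```

The implicit "deny by default" at the end is the key design choice: traffic is allowed only when a rule explicitly says so.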
Firewalls have grown in power, and now encompass a number of built-in functions
and capabilities: