Data Analytics and Hadoop


Introduction to Data Analytics

In the world of IoT, the creation of massive amounts of data from sensors is common and one of the biggest challenges.

A great example of the deluge of data that IoT can generate is found in the commercial aviation industry and the sensors deployed throughout an aircraft. Modern jet engines are equipped with around 5,000 sensors, so a twin-engine commercial aircraft with these engines, operating an average of 8 hours a day, will generate over 500 TB of data daily. In fact, a single wing of a modern jumbo jet is equipped with 10,000 sensors.
Structured Versus Unstructured Data

Structured data follows a model or schema that defines how the data is represented and organized, which means it fits well with a traditional relational database management system (RDBMS).

Unstructured data lacks a logical schema for understanding and decoding the
data through traditional programming means. Examples of this data type include
text, speech, images, and video. As a general rule, any data that does not fit
neatly into a predefined data model is classified as unstructured data.

According to some estimates, around 80% of a business's data is unstructured.

Types of Data Analysis Results
IoT Data Analytics Challenges

Scaling problems: Because most IoT networks contain a large number of smart objects that continually send data, relational databases can grow incredibly large very quickly.
Volatility of data: With relational databases, it is critical that the schema be designed correctly from the beginning. Changing it later can slow or stop the database from operating.

Machine Learning
One of the core subjects in IoT is how to make sense of the data that is generated. Machine learning (ML) is indeed central to IoT: data collected by smart objects needs to be analyzed, and intelligent actions need to be taken based on these analyses.

Machine learning is part of a larger set of technologies commonly grouped under the term artificial intelligence (AI).

ML is a vast field but can be simply divided into two main categories:

 Supervised learning
 Unsupervised learning
Supervised Learning
In supervised learning, the machine is trained with input for which there is a known correct answer. Hundreds or thousands of labeled images are fed into the machine; this collection is called the training set.
Example: Determining whether a shape is a human or something else
(such as a vehicle, a pile of ore, a rock, etc.)
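As a minimal sketch of the idea, the following pure-Python 1-nearest-neighbor classifier pairs each training input with its known correct label and classifies a new input by the closest match. The feature values and labels are invented for illustration, not a real shape-recognition model.

```python
# Minimal supervised-learning sketch: a 1-nearest-neighbor classifier.
# The training set pairs each input with a known correct answer,
# mirroring the "human vs. something else" example above.

def nearest_neighbor(train, query):
    """Return the label of the training point closest to the query."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(train, key=lambda item: dist(item[0], query))[1]

# Training set: (feature vector, known label) pairs -- toy features only.
training_set = [
    ((1.7, 70), "human"),      # e.g. height (m), weight (kg)
    ((1.6, 55), "human"),
    ((4.0, 1500), "vehicle"),
    ((3.5, 1200), "vehicle"),
]

label = nearest_neighbor(training_set, (1.8, 80))   # -> "human"
```

A real system would use many more examples and richer features, but the principle is the same: the known answers in the training set drive the prediction.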

Unsupervised Learning
In unsupervised learning, the machine is given data without known correct answers and must discover structure in it on its own. This type of learning is unsupervised because there is no "good" or "bad" answer known in advance.
Example: Aircraft fault detection.
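A common unsupervised technique is clustering. The sketch below implements a tiny one-dimensional k-means in pure Python: no labels are given, yet the algorithm separates the readings into groups by itself, the way a fault detector might distinguish normal from unusual vibration levels. The readings are made-up values.

```python
# Minimal unsupervised-learning sketch: 1-D k-means clustering (k=2).
# No correct answers are supplied; the algorithm finds the groups itself.

def kmeans_1d(data, k=2, iters=10):
    centroids = sorted(data)[:k]          # naive initialization
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in data:                    # assign each point to nearest centroid
            i = min(range(k), key=lambda j: abs(x - centroids[j]))
            clusters[i].append(x)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

readings = [0.9, 1.0, 1.1, 0.95, 7.8, 8.1, 8.0]   # hypothetical vibration data
centroids, clusters = kmeans_1d(readings)
# The data splits into a "normal" group near 1.0 and an "unusual" group near 8.0.
```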
Neural networks
Neural networks are ML methods that mimic the way the
human brain works. When you look at a human figure,
multiple zones of your brain are activated to recognize
colors, movements, facial expressions, and so on. Your
brain combines these elements to conclude that the
shape you are seeing is human. Neural networks mimic
the same logic. The information goes through different
algorithms (called units), each of which is in charge of
processing an aspect of the information.
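A single unit can be sketched as follows: it weighs its inputs, sums them with a bias, and passes the result through an activation function. Networks of such units, arranged in layers, combine partial results (color, motion, shape) into a final decision. The weights here are arbitrary illustrative values, not a trained model.

```python
# Sketch of one neural-network unit (neuron) with a sigmoid activation.
import math

def unit(inputs, weights, bias):
    # Weighted sum of inputs plus bias...
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    # ...squashed into (0, 1) by the sigmoid activation function.
    return 1 / (1 + math.exp(-total))

activation = unit([0.5, 0.8], weights=[1.2, -0.4], bias=0.1)
```

Training a network consists of adjusting these weights and biases until the units collectively produce the desired output.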
Hadoop

 Hadoop is an open-source software platform for scalable, distributed computing.
 Hadoop provides fast and reliable analysis of both structured and unstructured data.
 The Apache Hadoop software library is essentially a framework that allows for the distributed processing of large datasets across clusters of computers using a simple programming model.
 Hadoop can scale up from single servers to thousands of machines, each offering local computation and storage.
 Hadoop was originally developed as a result of projects at Google and Yahoo!, and the original intent for Hadoop was to index millions of websites and quickly return search results for open source search engines.

Hadoop has two key elements:

Hadoop Distributed File System (HDFS): A system for storing data across multiple nodes.
MapReduce: A distributed processing engine that splits a large task into smaller ones that can be run in parallel.
Hadoop Adoption in Industry
Hadoop Distributed File System (HDFS):

 A distributed file system that provides high-throughput access to application data.
 HDFS uses a master/slave architecture in which one device (the master), termed the NameNode, controls one or more other devices (the slaves), termed DataNodes.
 It breaks data/files into small blocks (128 MB each) and stores them on DataNodes, and each block is replicated on other nodes to accomplish fault tolerance.
 The NameNode keeps track of the blocks written to the DataNodes.
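The block-splitting and replication steps above can be simulated in a few lines. This toy sketch is not the real HDFS placement policy; the block size is scaled down from 128 MB, and the node names and round-robin placement are invented for illustration.

```python
# Toy simulation of HDFS-style block placement: split a file into
# fixed-size blocks, then replicate each block onto several DataNodes.

BLOCK_SIZE = 4           # bytes here; 128 MB in a real cluster
REPLICATION = 3          # HDFS's default replication factor
DATANODES = ["dn1", "dn2", "dn3", "dn4"]

def place_blocks(data):
    # Split the file into fixed-size blocks.
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    # NameNode-style map: block id -> DataNodes holding a replica.
    block_map = {}
    for n in range(len(blocks)):
        block_map[n] = [DATANODES[(n + r) % len(DATANODES)]
                        for r in range(REPLICATION)]
    return blocks, block_map

blocks, block_map = place_blocks(b"hello hdfs world")
# 16 bytes -> 4 blocks, each stored on 3 of the 4 DataNodes.
```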
Writing a File to HDFS

NameNodes: They coordinate where the data is stored, and maintain a map of where each block of data is stored and where it is replicated.

DataNodes: These are the servers where the data is stored at the direction of the NameNode. It is common to have many DataNodes in a Hadoop cluster to store the data.
MapReduce

 A software framework for distributed processing of large data sets.
 The framework takes care of scheduling tasks, monitoring them, and re-executing any failed tasks.
 It splits the input data set into independent chunks that are processed in a completely parallel manner.
 The MapReduce framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically, both the input and the output of the job are stored in a file system.
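The classic illustration of this model is word count. The sketch below simulates the three phases in a single process: map emits (word, 1) pairs from each input chunk, a shuffle groups the pairs by key, and reduce sums the counts. In a real Hadoop job, the map and reduce calls would run in parallel across the cluster.

```python
# MapReduce illustrated with word count, simulated in-process.
from collections import defaultdict

def map_phase(chunk):
    # Emit a (word, 1) pair for every word in this chunk of input.
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    # The framework's shuffle/sort: group all values by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Sum the counts emitted for each word.
    return key, sum(values)

chunks = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
# counts["the"] == 3, counts["fox"] == 2
```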
MapReduce: Basic Concepts
YARN
Introduced with version 2.0 of Hadoop, YARN (Yet Another Resource Negotiator)
was designed to enhance the functionality of MapReduce. With the initial
release, MapReduce was responsible for batch data processing and job tracking
and resource management across the cluster. YARN was developed to take over
the resource negotiation and job/task tracking, allowing MapReduce to be
responsible only for data processing.

The Hadoop Ecosystem

Hadoop plays an increasingly big role in the collection, storage, and processing of IoT data due to its highly scalable nature and its ability to work with large volumes of data. Many organizations have adopted Hadoop clusters for storage and processing of data and have looked for complementary software packages to add additional functionality to their distributed Hadoop clusters. Since the initial release of Hadoop in 2011, many projects have been developed to add incremental functionality to Hadoop and have collectively become known as the Hadoop ecosystem.
Examples: Apache Kafka, Spark, Storm, etc.
Apache Kafka
Apache Kafka is an open-source stream-processing software platform developed by
LinkedIn and donated to the Apache Software Foundation, written in Scala and Java. The
project aims to provide a unified, high-throughput, low-latency platform for handling
real-time data feeds.
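Kafka's core abstraction is an append-only log per topic: producers append records, and each consumer tracks its own offset into the log. The toy in-memory class below only illustrates that model; real Kafka is a distributed, persistent broker, and these class and method names are invented for the sketch, not Kafka's actual client API.

```python
# Toy in-memory sketch of Kafka's log-and-offset model (illustrative only).

class Topic:
    def __init__(self):
        self.log = []                 # append-only, ordered record log

    def produce(self, record):
        self.log.append(record)       # producers append to the end

    def consume(self, offset):
        """Return all records from offset onward, plus the new offset."""
        records = self.log[offset:]
        return records, offset + len(records)

events = Topic()
events.produce({"sensor": "engine-1", "temp": 642})
events.produce({"sensor": "engine-2", "temp": 613})

records, next_offset = events.consume(offset=0)   # reads both records
```

Because each consumer keeps its own offset, many independent consumers can read the same feed at their own pace, which is what makes the platform suitable for fan-out of real-time data.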
Apache Spark
Apache Spark is an in-memory distributed data analytics platform designed to
accelerate processes in the Hadoop ecosystem. The “in-memory” characteristic
of Spark is what enables it to run jobs very quickly.

Apache Storm and Apache Flink


Apache Storm and Apache Flink are other Hadoop ecosystem projects designed
for distributed stream processing and are commonly deployed for IoT use cases.
Storm can pull data from Kafka and process it in a near-real-time fashion, and so
can Apache Flink. This space is rapidly evolving, and projects will continue to
gain and lose popularity as they evolve.
Lambda Architecture
Lambda architecture is a data-processing architecture designed to handle massive
quantities of data by taking advantage of both batch and stream-processing
methods.

 Stream layer: This layer is responsible for near-real-time processing of events.
 Batch layer: The batch layer consists of a batch-processing engine and data store.
 Serving layer: The serving layer is a data store and mediator that decides which of the ingest layers to query based on the expected result or view into the data.
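The interplay of the three layers can be sketched as plain functions: the batch layer holds complete but delayed results, the stream layer holds recent events not yet captured by a batch run, and the serving layer merges the two views to answer a query. The data values and event counts are invented for illustration.

```python
# Sketch of the Lambda architecture's layers as data plus a merge function.

batch_view = {"engine-1": 1000}    # complete count up to the last batch run
stream_view = {"engine-1": 42}     # events seen since that batch run

def serving_layer(key):
    # Merge the (delayed, complete) batch view with the (fresh, partial)
    # stream view to produce an up-to-date answer.
    return batch_view.get(key, 0) + stream_view.get(key, 0)

total = serving_layer("engine-1")   # 1000 batch + 42 streamed = 1042
```

When the next batch run completes, its output replaces the batch view and the stream view is reset, so neither layer has to be perfect on its own.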
Edge Streaming Analytics
Edge analytics is an approach to data collection and analysis in which an automated analytical computation is performed on data at a sensor, network switch, or other device instead of waiting for the data to be sent back to a centralized data store.

The key values of edge streaming analytics include the following:

 Reducing data at the edge: The aggregate data generated by IoT devices is generally in proportion to the number of devices. The scale of these devices is likely to be huge, and so is the quantity of data they generate. Passing all this data to the cloud is inefficient and is unnecessarily expensive in terms of bandwidth and network infrastructure.
 Analysis and response at the edge: Some data is useful only at the edge (such as a factory control feedback system). In cases such as this, the data is best analyzed and acted upon where it is generated.
 Time sensitivity: When a timely response to data is required, passing data to the cloud for future processing results in unacceptable latency. Edge analytics allows immediate responses to changing conditions.
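The first and third points above can be sketched together: the edge device raises an alert immediately when a threshold is crossed (local response) and forwards only a small summary to the cloud instead of every raw reading (data reduction). The threshold and readings are illustrative values.

```python
# Edge streaming analytics sketch: act locally, summarize for the cloud.

THRESHOLD = 90.0   # hypothetical alarm threshold for a sensor reading

def edge_process(readings):
    # Immediate local response: flag readings over the threshold now,
    # without waiting for a round trip to the cloud.
    alerts = [r for r in readings if r > THRESHOLD]
    # Data reduction: send one small summary instead of every reading.
    summary = {
        "count": len(readings),
        "mean": sum(readings) / len(readings),
        "max": max(readings),
    }
    return summary, alerts

summary, alerts = edge_process([71.2, 68.9, 95.4, 70.1])
# Four raw readings collapse to one summary dict; 95.4 triggers an alert.
```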
Edge Analytics Core Functions
Distributed Analytics Systems
Distributed analytics spreads data analysis workloads over multiple nodes in a
cluster of servers, rather than asking a single node to tackle a big problem. The
same algorithms run across each of the nodes, processing a subset of the data.
When the processing concludes, the data sets are aggregated, or brought back
together, to generate collective insights.
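The split-process-aggregate pattern described above can be sketched as follows. Each simulated node runs the same algorithm on its own slice of the data (sequentially here; in parallel across servers in practice), and the partial results are then combined into one collective answer.

```python
# Distributed analytics sketch: same algorithm on each node's subset,
# then aggregate the partial results.

def node_job(subset):
    # Every node runs this identical computation on its own slice.
    return sum(subset), len(subset)

def distributed_mean(data, nodes=3):
    slices = [data[i::nodes] for i in range(nodes)]    # spread the workload
    partials = [node_job(s) for s in slices]           # parallel in a real cluster
    total = sum(s for s, _ in partials)                # aggregate step
    count = sum(n for _, n in partials)
    return total / count

mean = distributed_mean([2, 4, 6, 8, 10, 12])   # same answer as a single node
```

Note that the aggregation step must match the algorithm: sums and counts combine trivially, while other statistics need more careful merging.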

Distributed Analytics Throughout the IoT System


Network Analytics
Another form of analytics that is extremely important in managing IoT systems
is network-based analytics. Unlike the data analytics systems previously
discussed that are concerned with finding patterns in the data generated by
endpoints, network analytics is concerned with discovering patterns in the
communication flows from a network traffic perspective. Network analytics has
the power to analyze details of communications patterns made by protocols and
correlate this across the network.
Flexible NetFlow Architecture
Flexible NetFlow (FNF) is a flow technology developed by Cisco Systems that is widely deployed all over the world.
Key advantages of FNF are as follows:
 Flexibility, scalability, and aggregation of flow data
 Ability to monitor a wide range of packet information and produce new
information about network behavior
 Enhanced network anomaly and security detection
 User-configurable flow information for performing customized traffic
identification and ability to focus and monitor specific network behavior
 Convergence of multiple accounting technologies into one accounting
mechanism

Flexible NetFlow (FNF) and IETF IPFIX (RFC 5101, RFC 5102) are examples of flow-monitoring protocols that are widely used in networks.
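The essence of flow-based accounting can be sketched in a few lines: packets are grouped into flows by a key (here just source, destination, and protocol) and per-flow packet and byte counters are accumulated. The field names, addresses, and the simplified key are illustrative; real FNF records are user-configurable and carry many more fields.

```python
# Sketch of flow-based accounting in the spirit of FNF/IPFIX.
from collections import defaultdict

def build_flow_cache(packets):
    # Key each packet into a flow and accumulate per-flow counters.
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for pkt in packets:
        key = (pkt["src"], pkt["dst"], pkt["proto"])   # simplified flow key
        flows[key]["packets"] += 1
        flows[key]["bytes"] += pkt["size"]
    return flows

packets = [
    {"src": "10.0.0.1", "dst": "10.0.0.2", "proto": "tcp", "size": 1500},
    {"src": "10.0.0.1", "dst": "10.0.0.2", "proto": "tcp", "size": 40},
    {"src": "10.0.0.3", "dst": "10.0.0.2", "proto": "udp", "size": 200},
]
flows = build_flow_cache(packets)
# Three packets collapse into two flows with aggregate counters.
```

Analyzing these per-flow counters, rather than every packet, is what lets network analytics correlate communication patterns across the whole network at scale.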
Flexible NetFlow overview
