
7082 CEM

Lecture 2 - Introduction to Big Data Analytics

DR. MARWAN FUAD


Siddhartha Neupane

2019-2020
2023
Outline 2

o Part 1 - Analytics

• Types of Analytics (Descriptive, Predictive, Prescriptive)


• Business Intelligence
• Data Lake
• Data Warehousing
• ETL

o Part 2 - Big Data Analytics

• Big Data
• Drivers of Big Data
• The Vs
• Big Data Characteristics
• Hadoop - Hadoop Characteristics
• Hadoop Components:
• HDFS, Name Node, Data Node

• MapReduce, How MapReduce Works.


• Hadoop Daemons - Job Tracker - Task Tracker

• YARN, Resource manager - Node Manager

• Hadoop Limitations
• Big Data Limitations
Part 1 – Analytics 3
Introduction 4

o The early definition of analytics is “a systematic analysis and interpretation


of raw data (typically using mathematical, statistical, and computational
tools) to improve our understanding of a real-world domain”.

o In recent years this definition has broadened to encompass not only the
specific techniques and approaches that transform collected data into
useful information but also the infrastructure required to make analytics
work, the various sources of data that feed into the analytical systems, the
processes through which the raw data are cleaned up and organized for
analysis, the user interfaces that make the results easy to view and simple to
understand.
Types of Analytics 5

Descriptive Analytics
o It primarily focuses on describing the past status of the domain of interest
using a variety of tools and techniques such as reporting, data
visualization, and dashboards.
o Online analytical processing (OLAP) is part of descriptive analytics that
allows users to get a multidimensional view of data.
Predictive Analytics
o Applies statistical and computational methods and models to data
regarding past and current events to predict what might happen in the
future.
Prescriptive Analytics
o Attempts to answer the question “How can we make it happen?” using
advanced optimization, simulation, and modeling tools.
Business Intelligence (BI) 6
o BI is an umbrella term that refers to the processes for collecting and analyzing
data, the technologies used in these processes, and the information obtained
from these processes with the purpose of facilitating corporate decision making.

o Analytics is at the core of BI.

o BI has four drivers:


 Optimize business operations
 Identify business risk
 Predict new business opportunities
 Comply with laws or regulatory requirements

(from https://baianat.s3.amazonaws.com)
Data Lake 7

o A data lake is a large integrated repository for internal and external data
that does not follow a predefined schema.

o Data lake has three characteristics:


 Collect everything: A data lake includes all collected raw data over a long period
of time
 Dive in anywhere: data in a data lake can be accessed by a wide variety of
organizational actors
 Flexible access: “Schema-on-read”

(from https://www.guru99.com)
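
Remark: To make "schema-on-read" concrete, here is a minimal, illustrative Python sketch (the file name, field names, and records are made up, not part of any real data lake): raw records of different shapes are dumped as-is, and a schema is only imposed when the data are read for a particular analysis.

import json

# Ingest: raw records are stored exactly as they arrive - no schema is enforced.
raw_records = [
    '{"user": "alice", "action": "click", "ts": "2023-01-05T10:00:00"}',
    '{"user": "bob", "page": "/home", "latency_ms": 120}',   # different shape
    '{"sensor": 7, "temperature": 21.5}',                    # different shape again
]
with open("data_lake_dump.jsonl", "w") as f:
    f.write("\n".join(raw_records))

# Read: the "schema" (which fields we care about) is applied only now, at read time.
wanted_fields = ("user", "action")
with open("data_lake_dump.jsonl") as f:
    for line in f:
        record = json.loads(line)
        projected = {k: record.get(k) for k in wanted_fields}  # missing fields become None
        print(projected)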
Data Warehousing 8

o Originally, it was defined as “information warehouse”, and it was used to


allow organizations to use their data archives to help them gain a business
advantage.
o Now it is defined as “A subject-oriented, integrated, time-variant, and
nonvolatile collection of data in support of management’s decision making
process”

o Subject-oriented: data are organized around the major subjects of the enterprise.
o Integrated: data come from different sources and are often inconsistent; they must be made consistent in the warehouse.
o Time-variant: data in the warehouse are valid only at some point in time.
o Nonvolatile: data are not updated in real time
Extraction, Transformation, and Loading (ETL) 9
o The data destined for enterprise data warehouses (EDW) must first be extracted from one
or more data sources, transformed into a form that is easy to analyze and consistent with
data already in the warehouse, and then finally loaded into the EDW
Extraction
o The extraction step targets one or more data sources for the EDW.

Transformation
o Transformation applies a series of rules or functions to the extracted data, which
determines how the data will be used for analysis.

Loading
o It can occur after all transformations have taken place or as part of the transformation
processing.
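
Remark: Below is a minimal, illustrative Python sketch of the three ETL steps (the source file, column names, currency rule, and warehouse table are all hypothetical, not a real EDW pipeline):

import csv
import sqlite3

# --- Extract: pull raw rows from a source system (here, a CSV export) ---
with open("sales_export.csv", newline="") as f:
    raw_rows = list(csv.DictReader(f))           # e.g. columns: date, amount, currency

# --- Transform: apply rules so the data are consistent with the warehouse ---
clean_rows = []
for row in raw_rows:
    amount = float(row["amount"])
    if row["currency"] == "USD":                 # hypothetical rule: store everything in GBP
        amount *= 0.8
    clean_rows.append((row["date"], round(amount, 2)))

# --- Load: insert the transformed rows into the (toy) warehouse table ---
edw = sqlite3.connect("edw.db")
edw.execute("CREATE TABLE IF NOT EXISTS sales (sale_date TEXT, amount_gbp REAL)")
edw.executemany("INSERT INTO sales VALUES (?, ?)", clean_rows)
edw.commit()
edw.close()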
Part 2 – Big Data Analytics (BDA) 10
Introduction 11

o In the last decade, data have been created constantly, and at an ever-
increasing rate.
o Mobile phones, social media, and imaging technologies create new data, which
must be stored somewhere for some purpose.
o Devices and sensors automatically generate diagnostic information that
needs to be stored and processed.
o Keeping up with all these data is very
challenging, yet analysing them is far
more difficult.
Drivers of Big Data 12
o In the 90s the volume of data was often measured in terabytes. Most organizations
analyzed structured data and used RDBMSs to manage it.

o The following decade saw a


proliferation of data sources
with an increase in size.
Data started to be measured in
petabytes.

o In the 2010s, the size of data that


organizations have to manage has
increased even more, with greater
diversity of data types.
Big Data 13

o Although the term Big Data has become very popular, there is in fact no
agreed-upon definition for it yet. The word "big" is very generic. How big is "big"? More
importantly, this concept is relative to time and circumstances.

o “Big Data is data whose scale, distribution, diversity, and/or timeliness


require the use of new technical architectures and analytics to enable
insights that unlock new sources of business value” (McKinsey & Co)

o This concept is actually constantly changing with the advent of new


technologies.

o Traditionally, all attributes of Big Data were referred to using a term that
starts with the letter “V” (!!!)
The 3Vs Definition 14
o This is the first and most widely used definition of Big Data. It is based on volume,
velocity, and variety.

Volume
o Refers to the size of data.

Velocity
o The speed of new data creation and growth.

Variety
o The complexity of data types and structures.

Remark: It is VERY important to understand that the concept of "big", when it comes to
data volume, is relative and NOT absolute. What is big for one machine may not be big
for another, and what was considered big in the past is not considered big now, just as
what is considered big now will very likely not be considered big in the future.

 Why "Big"? Why is it about volume and not velocity or variety?
The 4Vs Definition 15

o IBM added another attribute to Big Data. This attribute is veracity.

Veracity
o Implies the uncertainty of data.

o The reason behind adding this attribute is “in response to the quality and
source issues our clients began facing with their Big Data initiatives”
The 5Vs Definition 16

o A 5th V was proposed later, which is value.

Value
o The worth of extracted data.
The 6Vs Definition 17

o Microsoft presented a definition that added two new Vs to the 4Vs
definition (volume, velocity, variety, veracity). These two new Vs are
variability and visibility.

Variability
o Refers to the complexity of a data set (the number of variables), in
comparison with "variety", which refers to different data formats.

Visibility
o Emphasizes the need to have a full picture of the data in order to make
informed decisions.
Yet More Vs 18

o More and more Vs have


been added. The current
number (last time I checked)
is 42.
The 3²Vs Definition (1) 19

o This definition groups attributes into three domains:

1. Data Domain
The dominant V in this domain is volume; the other two attributes are velocity and
variety (the 3Vs definition)

2. BI Domain
The drivers and motivations for applying BDA to BI are value, visibility, and verdict
Verdict: the potential choice or decision that should be made by a decision maker or
committee based on the scope of the problem, available resources, and certain
computational capacity

3. Statistics Domain
The three attributes of this domain are veracity, variability, and validity
Validity: verifies that the data are logically sound; it emphasizes how to correctly
acquire data and avoid biases
The 3²Vs Definition (2) 20

o Venn diagram of the 3²Vs definition


The 3²Vs Definition (3) 21

o Semantic meaning of the 3²Vs definition


Big Data Characteristics 22
o Regardless of the Vs definitions, Big Data has certain characteristics. Here are some of
them:

1. Data are distributed across several nodes.


2. Applications are distributed to the data, not the other way around (to be explained further later)
3. As much as possible, data are processed locally on the node where they are stored

o Points 2 and 3 above are actually related; together they can be expressed as
"avoid moving data as much as possible".
Remark 23
o Contrary to common belief, Big Data can deal with structured data, not only
unstructured data. However, in most cases it is concerned with unstructured or semi-
structured data.

o The importance of this issue is actually diminishing as many RDBMS systems are starting to
incorporate concepts from MapReduce - the computational framework widely used for
Big Data (to be presented later). On the other hand, several Big Data projects, such as
Hive and Pig, have been developed to make Big Data more accessible to users of traditional
databases, which are still widely used in business, and Big Data will have to adapt to that.
Apache Hadoop 24
o Hadoop is the most popular platform of BDA.
It was created by Doug Cutting and Mike Cafarella.
Hadoop was based on a Google paper published
in 2004 and its development started in 2005.
The name “Hadoop” actually came from the name
of a yellow toy elephant that Cutting’s son had.

o In 2008, Hadoop became a top-level Apache project and was being used by several
large data companies such as Yahoo!, Facebook, and The New York Times

o Hadoop is an open source framework for writing and running distributed applications that
process large amounts of data. It provides both storage and computational capabilities.

o Hadoop is deployed on a cluster of machines. These machines are grouped in racks


Hadoop Characteristics 25

o Hadoop has several characteristics that make it popular:

1. It is an open source platform.


2. It is linearly scalable and reliable, and it tolerates hardware failure
3. It is a fault-tolerant system.
4. It is a practical platform to store and process large amounts of data
5. It leverages commodity hardware
6. It uses “schema-on-read”
7. It is the best choice for diversified data sources
Hadoop Components 26

o Hadoop has three main components:

1. HDFS: Hadoop’s file system


2. MapReduce
3. YARN: resource negotiator

(from https://www.edureka.co/)

Remark: Some references mention only two components of Hadoop: HDFS and MapReduce. In
fact these form Hadoop's kernel, which was inherited from Google's system, whose two
components were MapReduce and a predecessor of HDFS called GFS.
Hadoop Distributed File System (HDFS) (1) 27
o HDFS is the component of Hadoop that is responsible for storage. It
was adapted from GFS (Google’s file system)

o HDFS is built to support very large files. It is in fact optimized for very
large files and not small ones.

o It can store millions of files.

o It is designed to support the functionality of MapReduce, although, in


fact, it can be used independently of MapReduce to support large
datasets.

o It is based on “write once, read many times”


Hadoop Distributed File System (HDFS) (2) 28

o Each file is split into blocks of a fixed size (typically 64 MB or 128 MB)

o Each block is replicated (3 replicas by default in Hadoop).

o These replicas are distributed to different machines in the cluster


(why? See next slide)
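
Remark: A quick back-of-the-envelope sketch in Python of what block size and replication imply for storage (the file size below is made up; 128 MB blocks and 3 replicas are common defaults, but both are configurable):

import math

file_size_mb = 1024        # a 1 GB file (hypothetical)
block_size_mb = 128        # a common default block size
replication = 3            # default replication factor

blocks = math.ceil(file_size_mb / block_size_mb)
stored_copies = blocks * replication

print(f"{blocks} blocks, {stored_copies} block replicas spread over the cluster")
# -> 8 blocks, 24 block replicas spread over the cluster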
Hadoop Distributed File System (HDFS) (3) 29

o Having different replicas (copies) means multiple machine


failures are easily tolerated (Remember the fault-tolerance
characteristic)
o Having different replicas means several “versions” of data are
available for reading, so a data block can be read from the
machine closest to an application on the network, which, in turn,
means we won’t have to move the data (Remember how we
avoid moving the data). This speeds up the processing
o HDFS is designed to track and manage the number of available
replicas of a block, so if the number of copies of a block drops,
because of failure, the filesystem automatically makes a new
copy from one of the remaining replicas.
Hadoop Distributed File System (HDFS) (4) 30

o There are two kinds of nodes in the HDFS cluster:

1. Name node.
2. Data node.

Remark: In fact there are other types, but these are the main
kinds

o HDFS uses a master-slave architecture, where the name node is


the master and the data node is the slave
Hadoop Distributed File System (HDFS) (5) 31

Name Node
o It is the most important node in HDFS - the "mastermind" of the system. If it
fails, the whole system fails. In fact, in older versions of Hadoop the name
node constituted a single point of failure (SPOF), but this problem was
handled in newer versions.

o The name node stores the filesystem metadata, stores the file-to-block mapping, and
provides a global picture of the filesystem.

o HDFS has at least one name node; it can have two (active/standby) to
avoid system failure (this configuration is called HDFS High Availability)
Hadoop Distributed File System (HDFS) (6) 32

Data Node
o It is where the chunks of data – the file content
– are stored. The system has many data nodes.
The data node has direct local access to one
or more disks

o Data nodes regularly report their status to the


name node in a heartbeat. This means that,
at any given time, the name node has a
complete view of all data nodes in the cluster,
their current health (whether they are up or
down), and what blocks they have available
HDFS - write operation 33

Writing a file to HDFS


HDFS - read operation 34

Reading a file from HDFS


Hadoop Distributed File System (HDFS) (7) 35

o When an application processes a file stored in HDFS, it first queries the name
node for the block locations.

o Once the locations are known, the application contacts the data nodes
directly to access the file contents.
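
Remark: The read path described above can be sketched as a toy Python simulation (the file name, block ids, and node names are invented; a real client talks to HDFS through its RPC/WebHDFS interfaces rather than plain dictionaries):

# Toy name node: maps each file to its blocks, and each block to the data nodes holding replicas.
name_node_metadata = {
    "/logs/2023/jan.log": {
        "blk_001": ["datanode-3", "datanode-7", "datanode-9"],
        "blk_002": ["datanode-1", "datanode-3", "datanode-8"],
    }
}

# Toy data nodes: each holds the content of the block replicas assigned to it.
data_nodes = {
    "datanode-3": {"blk_001": b"first 128MB of the file...", "blk_002": b"second part..."},
    "datanode-7": {"blk_001": b"first 128MB of the file..."},
    "datanode-1": {"blk_002": b"second part..."},
}

def read_file(path):
    # Step 1: ask the name node where the blocks live (metadata only, no file content).
    block_locations = name_node_metadata[path]
    content = b""
    # Step 2: fetch each block directly from one of the data nodes that holds a replica.
    for block_id, replicas in block_locations.items():
        for node in replicas:                     # ideally the "closest" replica is chosen
            if block_id in data_nodes.get(node, {}):
                content += data_nodes[node][block_id]
                break
    return content

print(read_file("/logs/2023/jan.log"))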
MapReduce (1) 36
o MapReduce is the computing paradigm of Hadoop

o MapReduce is a distributed data processing model. It is designed to scale data


processing over multiple computing nodes

o It has two constructs: mappers and reducers. The computations are expressed in
terms of map and reduce, which manipulate key/value pairs. These two constructs
can practically implement any function, which MapReduce then executes on the
dataset in a distributed manner.

o In other words, MapReduce operates at a higher level, where the programmer
thinks in terms of functions over key/value pairs and the data flow is implicit

o The system has many mappers and at least one reducer
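
Remark: As a minimal sketch of the two constructs (plain Python, not Hadoop's actual API), here is a word-count mapper and reducer; the framework's job is to route all values with the same key to the same reducer:

def mapper(line):
    # Emit a (key, value) pair for every word in the input line.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(key, values):
    # All values for the same key arrive together; combine them into one result.
    return (key, sum(values))

# Tiny local demo of the two constructs (the real framework does this across many nodes).
pairs = [kv for line in ["Big Data is big", "data is everywhere"] for kv in mapper(line)]
groups = {}
for key, value in pairs:
    groups.setdefault(key, []).append(value)
print([reducer(k, v) for k, v in groups.items()])
# -> [('big', 2), ('data', 2), ('is', 2), ('everywhere', 1)]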


MapReduce (2) 37

o Once the programmer writes an application in the MapReduce form,


scaling the application to run over hundreds, thousands, or even tens of
thousands of machines in a cluster is a simple configuration change. This is
why this model is attractive.

o MapReduce is particularly suitable for problems that need to analyze the


whole dataset in a batch fashion

o MapReduce works well on unstructured or semi-structured data because it is


designed to interpret the data at processing time (MapReduce is based, as
we mentioned in the previous slide, on key/value pairs, which are not intrinsic
properties of the data)
MapReduce (3) 38

o MapReduce is based on a share-nothing architecture, meaning that tasks


have no dependence on one another. This will be illustrated later in an
example

o This share-nothing property enables MapReduce to run a program across


thousands or even millions of unreliable machines in parallel and to
complete a task in a very short time.
How MapReduce Works (1) 39

o The MapReduce model has five steps:


1) Splitting
2) Mapping (distribution)
3) Shuffling and sorting
4) Reducing
5) Aggregating

Remark: Some references merge some of these steps and present a simpler model
of three steps: mapping, shuffling, and reducing.

o We will illustrate how MapReduce works through a simple example of


counting the occurrences of characters (A,B,C,D) in a file. Notice that this
task is parallelizable.
How MapReduce Works (2) 40
o In the first step the file is split into chunks (three in the figure next slide).

o In the second step, each mapper generates key/value pairs according to the function
specified by the programmer. In our example, it counts the occurrences of the different
letters (A, B, C, and D) within each split file.

o The shuffling step generates intermediate key/value pairs by sorting the same
letter (key) and its quantities (values) from the different split files into one file.

o The fourth step is to merge all intermediate values associated with the same
intermediate key (A, B, C, and D).

o The final step aggregates these key/value pairs into one output file

Remark: This is a simple example to show how MapReduce works, so we didn’t discuss how
replicas of each chunk are made and distributed.
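
Remark: The five steps of this character-counting example can be simulated in a few lines of Python (everything runs in one process here, and the input string is made up; in Hadoop each split would be handled by a mapper on a different node):

from collections import defaultdict

data = "ABCDABCABBADCCBADBCA"

# 1) Splitting: the input is cut into chunks (three here, like in the figure).
splits = [data[0:7], data[7:14], data[14:20]]

# 2) Mapping: each mapper independently emits (letter, count) pairs for its own split.
def mapper(chunk):
    counts = defaultdict(int)
    for ch in chunk:
        counts[ch] += 1
    return list(counts.items())          # e.g. [('A', 2), ('B', 2), ('C', 2), ('D', 1)]

mapped = [mapper(chunk) for chunk in splits]

# 3) Shuffling and sorting: all values for the same key are brought together.
shuffled = defaultdict(list)
for pairs in mapped:
    for letter, count in pairs:
        shuffled[letter].append(count)

# 4) Reducing: each reducer sums the values for one key.
reduced = {letter: sum(counts) for letter, counts in sorted(shuffled.items())}

# 5) Aggregating: the per-key results are written to one output.
print(reduced)        # counts of A, B, C, D over the whole input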
How MapReduce Works (3) 41
o Counting frequencies of characters using MapReduce

Remark: Notice how the mappers processed their data independently; this is the share-nothing
property we mentioned previously
The Whole Picture (Simplified) 42

o HDFS/MapReduce: Illustrating example.


(Why more than one reducer?)
Hadoop Daemons (1) 43

Job Tracker
o The Job Tracker daemon is the liaison between the application and
Hadoop.
o Once the code is submitted to the cluster, the Job Tracker determines the
execution plan by determining which files to process, assigns nodes to
different tasks, and monitors all tasks as they’re running.

o If a task fails, the Job Tracker will automatically relaunch it, possibly on a
different node.

o There is only one Job Tracker daemon per Hadoop cluster.

o The Job Tracker oversees the overall execution of a MapReduce job.


Hadoop Daemons (2) 44

Task Tracker
o The Task Trackers manage the
execution of individual tasks on
each slave node.

o Each Task Tracker is responsible


for executing the individual tasks
that the Job Tracker assigns.
Hadoop Daemons (3) 45

o There is usually a single Task


Tracker per slave node

o The Task Tracker constantly


communicates with the Job
Tracker. If the Job Tracker fails
to receive a heartbeat from
a Task Tracker, it will assume
the Task Tracker has crashed
and will resubmit the corresponding
tasks to other nodes in the cluster.
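
Remark: The heartbeat/timeout idea is simple enough to sketch in Python (the timeout value and tracker names below are made up; the real Job Tracker's bookkeeping is more involved):

import time

HEARTBEAT_TIMEOUT = 30.0   # hypothetical: seconds of silence before a tracker is presumed dead

# Last time each Task Tracker was heard from (timestamps in seconds).
last_heartbeat = {"tracker-1": time.time(), "tracker-2": time.time() - 95.0}

def find_dead_trackers(last_heartbeat, now=None):
    """Return the trackers whose heartbeat is overdue; their tasks would be resubmitted elsewhere."""
    now = now or time.time()
    return [t for t, seen in last_heartbeat.items() if now - seen > HEARTBEAT_TIMEOUT]

print(find_dead_trackers(last_heartbeat))   # -> ['tracker-2']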
YARN (Yet Another Resource Negotiator) (1) 46
o In Hadoop 2.0 MapReduce underwent substantial modifications resulting in
what’s called MapReduce 2.0 or YARN.

o Rather than having a single daemon that tracks and assigns resources such
as CPU and memory and handles MapReduce-specific job tracking, these
functions are separated into two parts:
1. Resource Manager: responsible for tracking and arbitrating resources among
applications.
2. Node Manager(s): responsible for launching tasks and monitoring the resource
usage per slave node.

o The Resource Manager and the Node Manager form the data-computation
framework.
YARN (2) – Resource Manager 47
o A separate daemon responsible for creating and allocating resources to multiple
applications.

o It is the ultimate authority that arbitrates resources among all the applications in the
system.

o Instead of having one centralized Job Tracker, each application has its own “Job Tracker”
called the application master, which runs on one of the workers of the cluster.

o This way, each application master is completely isolated from the other application
masters, which makes the system more tolerant to failures.

o Also, because each application has its own “Job Tracker” (the application master), multiple
application masters can be run at once on the cluster.

o So as we can see, in YARN there is no central Job Tracker
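
Remark: A toy Python sketch of this idea: one Resource Manager arbitrates a pool of containers, and each submitted application gets its own application master (all names and numbers here are invented):

# Toy cluster: total containers the Resource Manager can hand out.
free_containers = 20
applications = {}          # application id -> its own application master state

def submit_application(app_id, containers_needed):
    """The Resource Manager arbitrates resources; each app gets its OWN application master."""
    global free_containers
    if containers_needed > free_containers:
        return f"{app_id}: queued (not enough free containers)"
    free_containers -= containers_needed
    applications[app_id] = {"master": f"app-master-for-{app_id}", "containers": containers_needed}
    return f"{app_id}: running under {applications[app_id]['master']}"

# Several applications run at once, each isolated behind its own application master.
print(submit_application("mapreduce-job-1", 8))
print(submit_application("spark-job-2", 6))
print(submit_application("giraph-job-3", 10))   # queued: only 6 containers left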


YARN (3) – Node Manager(s) 48

o A Node Manager replaces the traditional Task Tracker. However, while the
Task Tracker handles MapReduce-specific jobs, the node manager is more
generic as it launches any type of process, dictated by the application, in
an application container.

o In fact, because of its ability to run arbitrary applications, one can write non-
MapReduce applications and run them on YARN
YARN (4) 49
o In fact YARN provides a
resource management
framework suitable for any
type of distributed computing
framework

o There is a new architecture


for YARN, which separates
the resource management
from the computation model.
Such a separation enables YARN to
support a number of diverse data-intensive computing frameworks,
including Dryad, Giraph, Spark, Storm, and Tez.

(from https://hadoop.apache.org/)
Hadoop Limitations 50
o All of Hadoop's master daemons are SPOFs.
o Security is one of the weaknesses of Hadoop, although there’s a lot of work being done to
improve it
o Availability is still an issue. Again, this is an area of improvement focus
o HDFS is inefficient in handling small files
o It does not support real-time processing, only batch processing
o The share-nothing property of MapReduce makes it unsuitable for some algorithms
o There are some compatibility issues with versions of the projects in the Hadoop Ecosystem.
o Not easy to use
Big Data Limitations 51
o Some argue that Big Data is inconclusive, overstated, exaggerated, and misinformed by
the media and that data cannot speak for itself.

o Some are skeptical concerning the volume aspect of Big Data arguing that “bigger” is not
always “better”.
“The size of data should fit the research question being asked; in some cases, small is best.”
Danah Boyd et al.

o The famous "Google Flu Trends" prediction issue, where the algorithm's dynamics impacted
the users' behaviour, so the data collected were influenced by the algorithm itself.
Summary 52

o In the first part of this lecture we talked about analytics, and their different
types (descriptive, predictive, prescriptive), then we talked about BI, data
lake, data warehousing, and ETL

o In the second part we talked about BDA and the Vs definitions, then we
talked about the characteristics and drivers of Big Data. We then introduced
Hadoop, its characteristics and components: HDFS, MapReduce, and
YARN, and we talked about the components of each. Finally we talked a little
about the limitations of Hadoop and Big Data.
References (1) 54
• Big Data, Data Mining, and Machine Learning, Jared Dean (2014)

• Big Data - Principles and Best Practices of Scalable Real-Time Data Systems, Nathan Marz (2015)

• Big Data - Principles and Paradigms, R. Buyya, R. Calheiros, A. V. Dastjerdi (2016)

• Data Analytics with Spark Using Python, Jeffrey Aven (2018)

• Database Systems - A Practical Approach to Design, Implementation, and Management, 6th Edition, Thomas Connolly
and Carolyn Begg (2015)

• Data Science and Big Data Analytics - Discovering, Analyzing, Visualizing and Presenting Data, EMC Education
Services (2015)

• Hadoop in Action, Chuck Lam (2011)

• Hadoop in Practice, Alex Holmes (2012)

• Hadoop Operations, Eric Sammer (2012)

• Hadoop: The Definitive Guide, 3rd Edition, Tom White (2012)


References (2) 55
• Introducing Data Science - Big Data, Machine Learning, and More, Using Python Tools, Davy Cielen, Arno Meysman,
Mohamed Ali (2016)

• Modern Database Management, 12th Edition, Jeffrey A. Hoffer, V. Ramesh, and Heikki Topi (2016)

• Pro Apache Hadoop, 2nd Edition, S. Wadkar, M. Siddalingaiah, and J. Venner (2014)
