Lecture 2 - Introduction To Big Data Analytics
2019-2020
2023
Outline
o Part 1 - Analytics
• Big Data
• Drivers of Big Data
• The Vs
• Big Data Characteristics
• Hadoop - Hadoop Characteristics
• Hadoop Components:
• HDFS, Name Node, Data Node
• Hadoop Limitations
• Big Data Limitations
Part 1 – Analytics
Introduction
o In recent years the definition of analytics has broadened to encompass not only the specific techniques and approaches that transform collected data into useful information, but also the infrastructure required to make analytics work, the various sources of data that feed into the analytical systems, the processes through which the raw data are cleaned up and organized for analysis, and the user interfaces that make the results easy to view and simple to understand.
Types of Analytics
Descriptive Analytics
o It primarily focuses on describing the past status of the domain of interest using a variety of tools, through techniques such as reporting, data visualization, dashboards, etc.
o Online analytical processing (OLAP) is part of descriptive analytics that allows users to get a multidimensional view of data.
Predictive Analytics
o Applies statistical and computational methods and models to data
regarding past and current events to predict what might happen in the
future.
Prescriptive Analytics
o Attempts to answer the question “How can we make it happen?” using
advanced optimization, simulation, and modeling tools.
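The three types can be illustrated on a toy sales series. All numbers below are hypothetical, and the 10% stocking buffer is invented for the example; this is a sketch of the idea, not an analytics methodology.

```python
from statistics import mean

# Hypothetical monthly sales figures for the past six months.
sales = [100, 110, 125, 130, 150, 160]

# Descriptive: summarize what has already happened.
print("mean past sales:", mean(sales))

# Predictive: fit a simple linear trend and extrapolate one month ahead.
n = len(sales)
xs = range(n)
slope = sum((x - mean(xs)) * (y - mean(sales)) for x, y in zip(xs, sales)) / \
        sum((x - mean(xs)) ** 2 for x in xs)
forecast = mean(sales) + slope * (n - mean(xs))
print("forecast next month:", round(forecast, 1))  # 171.7

# Prescriptive: act on the forecast, e.g. stock 10% above predicted demand.
print("recommended stock:", round(forecast * 1.1))
```

The point of the sketch is the shift in question at each step: "what happened?", "what will happen?", and finally "what should we do about it?".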
Business Intelligence (BI)
o BI is an umbrella term that refers to the processes for collecting and analyzing
data, the technologies used in these processes, and the information obtained
from these processes with the purpose of facilitating corporate decision making.
Data Lake
o A data lake is a large integrated repository for internal and external data
that does not follow a predefined schema.
Data Warehousing
o Subject-oriented: data are organized around the major subjects of the enterprise.
o Integrated: data come from different sources, so inconsistencies among them must be resolved.
o Time-variant: data in the warehouse are valid only at some point in time.
o Nonvolatile: data are not updated in real time.
Extraction, Transformation, and Loading (ETL)
o The data destined for enterprise data warehouses (EDW) must first be extracted from one
or more data sources, transformed into a form that is easy to analyze and consistent with
data already in the warehouse, and then finally loaded into the EDW
Extraction
o The extraction step targets one or more data sources for the EDW.
Transformation
o Transformation applies a series of rules or functions to the extracted data, which determine how the data will be used for analysis.
Loading
o It can occur after all transformations have taken place or as part of the transformation
processing.
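A minimal sketch of the three ETL steps, assuming two invented source formats and using an in-memory SQLite table to stand in for the EDW; all field names and records are hypothetical.

```python
import sqlite3

# Hypothetical raw records from two source systems; field names and
# date formats differ, as they often do across sources.
SOURCE_A = [{"cust": "Alice", "amount": "120.50", "date": "2023-01-05"}]
SOURCE_B = [{"customer_name": "bob", "amt": 80.0, "day": "05/01/2023"}]

def extract():
    """Extraction: pull raw records from each targeted source."""
    return list(SOURCE_A), list(SOURCE_B)

def transform(a_rows, b_rows):
    """Transformation: map both sources onto one consistent schema."""
    rows = []
    for r in a_rows:
        rows.append((r["cust"].title(), float(r["amount"]), r["date"]))
    for r in b_rows:
        d, m, y = r["day"].split("/")  # normalize DD/MM/YYYY to ISO
        rows.append((r["customer_name"].title(), float(r["amt"]), f"{y}-{m}-{d}"))
    return rows

def load(rows):
    """Loading: insert the cleaned, consistent rows into the warehouse table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales (customer TEXT, amount REAL, sale_date TEXT)")
    db.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    return db

db = load(transform(*extract()))
print(db.execute("SELECT customer, amount, sale_date FROM sales ORDER BY customer").fetchall())
# [('Alice', 120.5, '2023-01-05'), ('Bob', 80.0, '2023-01-05')]
```

Here loading happens after all transformations; as noted above, in real pipelines it can also be interleaved with the transformation processing.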
Big Data
o In the last decade, data have been created constantly, and at an ever-
increasing rate.
o Mobile phones, social media, and imaging technologies create new data, which must be stored somewhere for some purpose.
o Devices and sensors automatically generate diagnostic information that needs to be stored and processed.
o Keeping up with all these data is very challenging, yet analysing them is far more difficult.
Drivers of Big Data
o In the 90s the volume of data was often measured in terabytes. Most organizations analyzed structured data and used RDBMSs to manage them.
o Although the term Big Data has become very popular, there is in fact no agreed-upon definition for it yet. The word “big” is very generic. How big is “big”? More importantly, this concept is time and circumstance relevant.
o Traditionally, all attributes of Big Data were referred to using terms that start with the letter “V”.
The 3Vs Definition
o This is the first and widely used definition of big data. It is based on volume,
velocity, and variety.
Volume
o Refers to the size of data.
Velocity
o The speed of new data creation and growth.
Variety
o The complexity of data types and structures.
Remark: It is VERY important to understand that the concept of “big” when it comes to data volume is relative and NOT abstract. What is big for one machine may not be big for another, and what was considered big in the past will very likely not be considered big now, just as what is considered big now will not be in the future.
Why “Big”? Why is it about volume and not velocity or variety?
The 4Vs Definition
Veracity
o Implies the uncertainty of data.
o The reason behind adding this attribute is “in response to the quality and
source issues our clients began facing with their Big Data initiatives”
The 5Vs Definition
Value
o The worth of extracted data.
The 6Vs Definition
o Microsoft presented a definition that added two new Vs to the 4Vs definition (volume, velocity, variety, veracity). These two new Vs are variability and visibility.
Variability
o Refers to the complexity of a data set (the number of variables), in comparison with “Variety”, which refers to different data formats.
Visibility
o Emphasizes the need to have a full picture of data in order to make informed decisions.
Yet More Vs
1. Data Domain
The dominant V in this domain is volume; the other two attributes are velocity and variety (the 3Vs definition).
2. BI Domain
The drivers and motivations for applying BDA to BI are value, visibility, and verdict.
Verdict: the potential choice or decision that should be made by a decision maker or committee based on the scope of the problem, available resources, and certain computational capacity.
3. Statistics Domain
The three attributes of this domain are veracity, variability, and validity.
Validity: verifying that the quality of data is logically sound. It emphasizes how to correctly acquire data and avoid biases.
The 32Vs Definition (2)
o Points 2 and 3 above are actually related; together they can be expressed as “avoid moving data as much as possible”.
Remark
o Contrary to common belief, Big Data can deal with structured data and not only unstructured data. However, in most cases it is concerned with unstructured or semi-structured data.
o The importance of this issue is actually diminishing, as many RDBMSs are starting to incorporate concepts from MapReduce - the computational framework widely used for Big Data (to be presented later). On the other hand, several Big Data projects, such as Hive and Pig, have been developed to make Big Data more applicable to traditional databases, which are still widely used in business, and Big Data will have to adapt to that.
Apache Hadoop
o Hadoop is the most popular platform for BDA. It was created by Doug Cutting and Mike Cafarella. Hadoop was based on a Google paper published in 2004, and its development started in 2005. The name “Hadoop” actually came from the name of a yellow toy elephant that Cutting’s son had.
o In 2008, Hadoop had become a top level Apache project and was being used by several
large data companies such as Yahoo!, Facebook, and The New York Times
o Hadoop is an open source framework for writing and running distributed applications that
process large amounts of data. It provides both storage and computational capabilities.
Remark: Some references mention two components of Hadoop: HDFS and MapReduce. In fact these form Hadoop’s kernel, inherited from Google’s system, which had two components: MapReduce and a predecessor of HDFS called GFS.
Hadoop Distributed File System (HDFS) (1)
o HDFS is the component of Hadoop that is responsible for storage. It
was adapted from GFS (Google’s file system)
o HDFS is built to support very large files. It is in fact optimized for very
large files and not small ones.
o HDFS has two main kinds of nodes:
1. Name node.
2. Data node.
Remark: In fact there are other types, but these are the main kinds.
Name Node
o It is the most important node in HDFS - the “master mind” of the system: if it fails, the whole system fails. In fact, in older versions of Hadoop the name node constituted a single point of failure (SPOF), but this problem was handled in newer versions.
o The name node stores filesystem metadata, stores file to block map, and
provides a global picture of the filesystem.
o HDFS has at least one name node; it can have two (active/stand-by) to avoid system failure (this release is called HDFS High Availability).
Hadoop Distributed File System (HDFS) (6)
Data Node
o It is where the chunks of data – the file content – are stored. The system has many data nodes. The data node has direct local access to one or more disks.
o When an application processes a file stored in HDFS, it first queries the name
node for the block locations.
o Once the locations are known, the application contacts the data nodes
directly to access the file contents.
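The read path described above can be sketched as a toy model: the client queries the name node's metadata for block locations, then fetches each block from the data node that holds it. The block map, node names, and functions below are invented for illustration and are not the real HDFS API.

```python
# Name node metadata: file -> ordered list of (block_id, data_node).
BLOCK_MAP = {
    "/logs/app.log": [("blk_1", "dn-1"), ("blk_2", "dn-3")],
}

# Each data node stores block contents on its own local disks.
DATA_NODES = {
    "dn-1": {"blk_1": b"first block "},
    "dn-3": {"blk_2": b"second block"},
}

def read_file(path):
    locations = BLOCK_MAP[path]        # 1. ask the name node for block locations
    return b"".join(                   # 2. contact the data nodes directly
        DATA_NODES[node][blk] for blk, node in locations
    )

print(read_file("/logs/app.log"))  # b'first block second block'
```

Note that the file contents never pass through the name node: it serves only metadata, which is why it can coordinate a very large filesystem.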
MapReduce (1)
o MapReduce is the computing paradigm of Hadoop
o It has two constructs: mappers and reducers. The computations are expressed in terms of map and reduce, which manipulate key/value pairs. These two constructs can practically implement any function, which MapReduce then executes on the dataset in a distributed manner.
o In other words, MapReduce operates at a higher level, where the programmer thinks in terms of functions over key/value pairs and the data flow is implicit.
Remark: Some references merge some of these steps and present a simpler model
of three steps: mapping, shuffling, and reducing.
o The shuffling step generates intermediate key/value pairs by sorting pairs with the same letter (key) and their quantities (values) from different split files into one file.
o The fourth step is to merge all intermediate values associated with the same
intermediate key (A, B, C, and D).
o The final step aggregates these key/value pairs into one output file
Remark: This is a simple example to show how MapReduce works, so we didn’t discuss how replicas of each chunk are made and distributed.
How MapReduce Works (3)
o Counting frequencies of characters using MapReduce
Remark: Notice how mappers processed data independently; this is the shared-nothing property we mentioned previously.
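The character-counting example can be sketched in a single process using the simplified three-step model (mapping, shuffling, reducing). This illustrates the paradigm only; it is not Hadoop's actual distributed implementation.

```python
from collections import defaultdict

def map_phase(split):
    """Mapper: emit a (character, 1) pair for every character in its split."""
    return [(ch, 1) for ch in split]

def shuffle(pairs):
    """Shuffle: group all values emitted for the same key into one list."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: aggregate each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}

# The input is divided into splits; each mapper works on its own split
# independently (the shared-nothing property mentioned above).
splits = ["ABAC", "BDCA"]
mapped = [pair for s in splits for pair in map_phase(s)]
print(reduce_phase(shuffle(mapped)))  # {'A': 3, 'B': 2, 'C': 2, 'D': 1}
```

In Hadoop the mappers would run on different nodes against different chunks of the file, and the shuffle would move intermediate pairs across the network to the reducers.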
The Whole Picture (Simplified)
Job Tracker
o The Job Tracker daemon is the liaison between the application and
Hadoop.
o Once the code is submitted to the cluster, the Job Tracker determines the
execution plan by determining which files to process, assigns nodes to
different tasks, and monitors all tasks as they’re running.
o If a task fails, the Job Tracker will automatically relaunch it, possibly on a
different node.
Task Tracker
o The Task Trackers manage the execution of individual tasks on each slave node.
YARN (1)
o Rather than having a single daemon that tracks and assigns resources such as CPU and memory and handles MapReduce-specific job tracking, these functions are separated into two parts:
1. Resource Manager: responsible for tracking and arbitrating resources among
applications.
2. Node Manager(s): responsible for launching tasks and monitoring the resource
usage per slave node.
o The Resource Manager and the Node Manager form the data-computation
framework.
YARN (2) – Resource Manager
o A separate daemon responsible for creating and allocating resources to multiple
applications.
o It is the ultimate authority that arbitrates resources among all the applications in the
system.
o Instead of having one centralized Job Tracker, each application has its own “Job Tracker”, called the application master, that runs on one of the workers of the cluster.
o This way each application master is completely isolated from other application masters, and the system becomes more tolerant to failures.
o Also, because each application has its own “Job Tracker” (the application master), multiple
application masters can be run at once on the cluster.
YARN (3) – Node Manager
o A Node Manager replaces the traditional Task Tracker. However, while the Task Tracker handles MapReduce-specific jobs, the Node Manager is more generic, as it launches any type of process, dictated by the application, in an application container.
o In fact, because of its ability to run arbitrary applications, one can write non-
MapReduce applications and run them on YARN
YARN (4)
o In fact YARN provides a resource management framework suitable for any type of distributed computing framework.
Big Data Limitations
o Some are skeptical concerning the volume aspect of Big Data, arguing that “bigger” is not always “better”.
“The size of data should fit the research question being asked; in some cases, small is best.”
Danah Boyd et al.
o The famous “Google Flu Trends prediction” issue, where the algorithm’s dynamics impacted the users’ behaviour, so the data collected were influenced by the algorithm itself.
Summary
o In the first part of this lecture we talked about analytics and their different types (descriptive, predictive, prescriptive); then we talked about BI, data lakes, data warehousing, and ETL.
o In the second part we talked about BDA and the Vs definitions; then we talked about the characteristics and drivers of Big Data. We then introduced Hadoop, its characteristics, and its components - HDFS, MapReduce, and YARN - and talked about the components of each. Finally we talked a little about the limitations of Hadoop and Big Data.