
ECS640U/ECS765P Big Data Processing
Hadoop reliability, performance and high-level frameworks
Lecturer: Ahmed M. A. Sayed
School of Electronic Engineering and Computer Science

Credit: Joseph Doyle, Jesus Carrion, Felix Cuadrado, …


Today’s Lecture Contents

● Reliability in Distributed Systems


● Reliability in Hadoop
● Performance
● High Level frameworks for other workloads (Hive, Mahout, Storm, Presto)
High Availability

Availability is the percentage of total time that a system is available for use
High availability (HA) is a characteristic of a system that aims to ensure an agreed level of operational
performance for an agreed percentage of time
For example, 99.99% availability → at most 1 hour of downtime every 10,000 hours (~14 months).
● Fault tolerance is a property of a system that allows it to continue operating in the event of a failure
HA implies there should be no single point of failure, or that the system should be fault-tolerant
● Graceful degradation means that when some components fail, the system temporarily works with worse
performance (but still works)
High Availability Measurement: Counting Nines

Percentage Uptime | Percentage Downtime | Downtime per year | Downtime per week
98%               | 2%                  | 7.3 days          | 3h 22m
99%               | 1%                  | 3.65 days         | 1h 41m
99.8%             | 0.2%                | 17h 30m           | 20m 10s
99.9%             | 0.1%                | 8h 45m            | 10m 5s
99.99%            | 0.01%               | 52.5m             | 1m
99.999%           | 0.001%              | 5.25m             | 6s
99.9999%          | 0.0001%             | 31.5s             | 0.6s
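As a quick sanity check of the figures above, a minimal Python sketch that derives downtime per year and per week from an availability percentage:

```python
def downtime(availability_percent: float) -> tuple[float, float]:
    """Return (hours of downtime per year, minutes of downtime per week)."""
    unavailable = 1 - availability_percent / 100
    per_year_hours = unavailable * 365 * 24
    per_week_minutes = unavailable * 7 * 24 * 60
    return per_year_hours, per_week_minutes

for nines in (98, 99, 99.8, 99.9, 99.99, 99.999, 99.9999):
    year_h, week_m = downtime(nines)
    print(f"{nines}% -> {year_h:.2f} h/year, {week_m:.2f} min/week")
```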


Number of Machine Restarts in a Google Data Centre

Google's machine-level failure and downtime statistics are summarized in Figures 7.2 and 7.3 of Barroso et al.
The data is based on a six-month observation of all machine restart events and their corresponding downtime,
where downtime corresponds to the entire time interval during which a machine is not available for service,
regardless of cause. These statistics cover all of Google's machines; for example, they include machines that
are in the repairs pipeline, planned downtime for upgrades, as well as all kinds of machine crashes. Much of
this downtime is due to planned events, such as software and hardware upgrades, which are necessary to keep up
with the velocity of kernel changes and to react prudently to emergent and urgent security issues. Note that
Google Cloud's Compute Engine offers live migration to keep instances running by migrating them to another
host in the same zone without requiring your VMs to be rebooted; live migration does not change any properties
of the VM itself.

[Figure 7.2: Distribution of machine restarts over six months at Google (updated in 2018) — cumulative density
of machines vs. number of restarts in six months. Figure 7.3: Distribution of machine downtime, observed at
Google over six months — cumulative density of restart events vs. minutes of downtime (log scale).]

Figure 7.2 shows the distribution of machine restart events: 50% or more of machines restart at least once a
month, on average, and approximately 5% of all machines restart more than once a week. The annualized restart
rate across all machines is 12.4, corresponding to a mean time between restarts of less than one month. The
tail is relatively long (the figure truncates the data at 11 or more restarts) due to the large population of
machines in Google's fleet. Restart statistics are key parameters in the design of fault-tolerant software
systems, but the availability picture is complete only once they are combined with downtime data.

From Barroso et al., The Datacenter as a Computer, 2nd Ed., Morgan & Claypool, 2013
Contents

● Reliability in Distributed Systems


● Reliability in Hadoop
● Performance
● High Level frameworks for other workloads (Hive, Mahout, Storm, Presto)
Error Management in Hadoop

Goal: detect errors and gracefully recover from them while not interrupting job execution (if possible)
Any Hadoop node/daemon/process can fail during a job
● Data integrity error
● Task (Map/Reduce) failure
● NodeManager failure
● ApplicationMaster failure
● ResourceManager/NameNode failure

● Same goes for Spark Framework: https://hevodata.com/learn/spark-fault-tolerance/


Any Component Could Fail
Data Integrity Errors

DataNodes calculate and verify checksums before writing


A checksum is a small fingerprint computed from data that can be used to verify whether two copies of the data are identical (and hence detect corruption)
HDFS clients verify checksum when reading data blocks
Errors are reported to the NameNode
● The block is marked as corrupt
No more clients will read that copy of the block
● The client is redirected to another copy
● A block replica is scheduled to be copied in another node
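A minimal sketch of checksum-based corruption detection, using Python's zlib.crc32 as a stand-in (HDFS itself uses CRC-based checksums computed per fixed-size chunk; the chunk size here is an illustrative assumption):

```python
import zlib

CHUNK_SIZE = 512  # bytes covered by each checksum (illustrative value)

def checksum_chunks(data: bytes) -> list[int]:
    """Compute a CRC32 checksum for each fixed-size chunk of a block (done before writing)."""
    return [zlib.crc32(data[i:i + CHUNK_SIZE])
            for i in range(0, len(data), CHUNK_SIZE)]

def verify_block(data: bytes, expected: list[int]) -> bool:
    """Re-compute checksums on read and compare against the stored ones."""
    return checksum_chunks(data) == expected

# Writing: store the block together with its checksums
block = b"some block contents " * 100
stored_checksums = checksum_chunks(block)

# Reading: a corrupted copy fails verification and would be reported to the NameNode
corrupted = bytearray(block)
corrupted[10] ^= 0xFF
print(verify_block(block, stored_checksums))             # True
print(verify_block(bytes(corrupted), stored_checksums))  # False -> redirect client, re-replicate block
```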
Task (Map or Reduce) Failure
Caused by software failures or programming bugs
● e.g., a task requires Java Version X; node has older Java Version X-1. (Version issue)

● e.g. the code throws a Java Exception


The error is reported back to the NodeManager and the task is marked as failed
Hanging tasks (e.g., infinite loops, or waiting for an input/resource that never becomes available) are detected
by the NodeManager via a TIMEOUT and then marked as failed
The ApplicationMaster, negotiating with the ResourceManager, tries to reschedule the task on a different node
There is a maximum number of attempts (4 by default) before the whole job is declared failed (see the sketch below)
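A minimal, generic sketch of this retry policy; the limit of 4 matches the default mentioned above, while the node names and the flaky task are purely hypothetical (this is not Hadoop's internal API):

```python
MAX_ATTEMPTS = 4  # matches the default of 4 attempts per task described above

def run_with_retries(run_task, nodes, max_attempts=MAX_ATTEMPTS):
    """Re-run a failed task on a different node, up to a maximum number of attempts."""
    for attempt in range(1, max_attempts + 1):
        node = nodes[(attempt - 1) % len(nodes)]  # pick a different node for each attempt
        try:
            return run_task(node)
        except Exception as error:
            print(f"attempt {attempt} on {node} failed: {error}")
    raise RuntimeError(f"task failed after {max_attempts} attempts -> job declared failed")

# Demo: a hypothetical map task that only succeeds on node3
def flaky_map_task(node):
    if node != "node3":
        raise RuntimeError("wrong Java version")  # e.g. the version issue described above
    return "map output"

print(run_with_retries(flaky_map_task, ["node1", "node2", "node3", "node4"]))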
NodeManager Failure
The NodeManager monitors the health of the hardware resources of the node and reports any problem to
the ResourceManager (RM).
The RM also detects NodeManager failures when heartbeats are no longer received from it.
The RM marks all hosted Containers on the NodeManager as killed, and reports the failure to the
ApplicationMaster
The ApplicationMaster will try to rerun all the hosted Containers in other nodes from the cluster after
negotiating with the RM
Completed Map tasks are also rescheduled to other nodes. Why?

This is because the intermediate results of Map tasks are not stored in HDFS, but in the memory or
local disk of the (now failed) node
ApplicationMaster Failure
● The ApplicationMaster also sends heartbeat messages to the ResourceManager (RM)
● The RM detects the failure of an ApplicationMaster when heartbeats are no longer received, and declares it failed
(see the timeout sketch below)
● The RM kills all the containers of the failed ApplicationMaster
● The RM starts a new ApplicationMaster instance in a different container (managed by the RM).
● For MapReduce AMs, the job history is used to recover the state of the tasks that were already run
by the (failed) ApplicationMaster, so they don't have to be re-run from scratch.
● There is a maximum number of attempts (default is 2).
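The NodeManager and ApplicationMaster failures above are detected the same way: the ResourceManager declares a component dead when its heartbeats stop arriving. A minimal sketch of that timeout logic follows; the expiry interval and component names are illustrative assumptions, not Hadoop's actual defaults:

```python
import time

HEARTBEAT_EXPIRY_SECONDS = 600  # illustrative: declare a component dead after 10 minutes of silence

class FailureDetector:
    """Track the last heartbeat of each component and report the ones that have expired."""

    def __init__(self, expiry=HEARTBEAT_EXPIRY_SECONDS):
        self.expiry = expiry
        self.last_heartbeat = {}

    def heartbeat(self, component_id):
        self.last_heartbeat[component_id] = time.monotonic()

    def expired_components(self):
        now = time.monotonic()
        return [cid for cid, ts in self.last_heartbeat.items()
                if now - ts > self.expiry]

# Usage: the RM would kill the containers of an expired ApplicationMaster and restart it,
# or re-schedule the containers of an expired NodeManager on other nodes.
detector = FailureDetector(expiry=1.0)
detector.heartbeat("node_manager_42")
time.sleep(1.5)
print(detector.expired_components())  # ['node_manager_42']
```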
ALERT - ResourceManager/NameNode failure
The ResourceManager and the NameNode are the main single points of failure in Hadoop; their failure leads to
Loss of data / loss of the progress of the tasks
The cluster stops working and no more jobs can run
The Secondary NameNode communicates periodically with the NameNode and stores a backup copy of the
index table to avoid data loss.
A ResourceManager failure is quite serious!
Store the list of ApplicationMasters (AMs) in a highly available state store backed by ZooKeeper & HDFS
Store the progress of each AM, so when the RM restarts it can restart/resume the scheduled jobs.

This can result in a long wait. Could you think of an alternative solution?
Hadoop 2.0: High Availability for ResourceManager/NameNode
As an alternative to the default setup of earlier versions:
Run 2 redundant NameNodes on different machines of the cluster: one Active and one Standby
A new daemon, the JournalNode, is introduced (3 JournalNodes run by default)
The Active NameNode writes all changes to ALL journals
Changes must be accepted by a majority of the journals (see the sketch below)
The Standby NameNode reads the changes from the journals to catch up with state updates
In this setting, no SecondaryNameNode is needed (we now have a Standby NameNode instead)
The same idea is applied to the ResourceManager:
Run a pair of ResourceManagers in an active-standby configuration

https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
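A minimal sketch of the quorum-write idea behind the JournalNodes: an edit is committed only once a majority of journals has acknowledged it. The classes and method names here are hypothetical, not the HDFS QJM API:

```python
class Journal:
    """Hypothetical stand-in for one JournalNode: it may accept or reject a write."""
    def __init__(self, healthy=True):
        self.healthy = healthy
        self.edits = []

    def append(self, edit):
        if not self.healthy:
            return False
        self.edits.append(edit)
        return True

def quorum_write(journals, edit):
    """The active NameNode writes to all journals; the edit commits only with a majority of acks."""
    acks = sum(journal.append(edit) for journal in journals)
    return acks > len(journals) // 2

journals = [Journal(), Journal(), Journal(healthy=False)]  # one of three JournalNodes is down
print(quorum_write(journals, "mkdir /data"))  # True: 2 of 3 acks is a majority
journals[1].healthy = False
print(quorum_write(journals, "rm /data"))     # False: only 1 of 3 acks, edit not committed
```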
Active-Passive vs ZooKeeper Architecture

Active-Passive: when one ResourceManager goes down, another becomes active and takes responsibility
for the cluster

ZooKeeper-based: the ResourceManager state is stored externally in ZooKeeper; one ResourceManager is in an
active state and one or more ResourceManagers are in passive (standby) mode

https://learning.oreilly.com/library/view/yarn-essentials/9781784391737/ch08.html#ch08lvl1sec46
https://go.oreilly.com/queen-mary-university-of-london/https://learning.oreilly.com/library/view/-/9781784391737/
Contents

● Reliability in Distributed Systems


● Reliability in Hadoop
● Performance
● High Level frameworks for other workloads (Hive, Mahout, Storm, Presto)

Quiz
Speedup concept
● Speedup of a parallel processing system is a function of n, the number of processors:

s(n) = time taken with 1 processor / time taken with n processors

● Speedup is problem-dependent as well as architecture-dependent
● A 100% embarrassingly (or perfectly) parallel job can be fully parallelized:

s_emb(n) = n

Could you think of examples of embarrassingly parallel jobs?


Speedup concept
● Speedup of a parallel processing system is a function of n, the number of processors:

s(n) = time taken with 1 processor / time taken with n processors

● Speedup is problem-dependent as well as architecture-dependent
● A 100% embarrassingly (or perfectly) parallel job can be fully parallelized:

s_emb(n) = n
1. Computer simulations comparing many independent scenarios
2. Hyperparameter grid search in machine learning
3. Large scale facial recognition systems
4. Brute-force searches in cryptography
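A minimal sketch of an embarrassingly parallel job in Python: each scenario is simulated independently, so the work splits across processes with no coordination. The simulate function is a hypothetical stand-in for any independent computation:

```python
from multiprocessing import Pool

def simulate(scenario: int) -> float:
    """Hypothetical independent simulation: no communication with other scenarios."""
    return sum((scenario * i) % 7 for i in range(10_000)) / 10_000

if __name__ == "__main__":
    scenarios = range(256)
    with Pool(processes=8) as pool:              # n = 8 workers
        results = pool.map(simulate, scenarios)  # each scenario runs independently
    print(len(results), "independent results")
```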
Amdahl’s Law
● In many jobs, some parts of the computation can only be executed on one processor.
● If the parts of the job that can run only on a single processor take a fraction f of the total work, then the
maximum speedup is s(n=∞) = 1/f: the speedup is limited by the parts executed on a single processor.
● Amdahl's law: if the remaining 1 − f of the work can be perfectly parallelized, then the speedup with n
processors is:

Not embarrassingly parallel:  s(n) = n / (1 + (n − 1)f)

● s(n) grows with n, but never gets larger than 1/f

For 1 − f = 0.9 (i.e., f = 0.1) and n = 256 → s(256) = 256 / (1 + 255 × 0.1) = 256 / 26.5 ≈ 9.66

Source: https://en.wikipedia.org/wiki/Amdahl%27s_law
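A minimal sketch that evaluates Amdahl's law for the worked example above and for the exercises that follow:

```python
def amdahl_speedup(n: int, f: float) -> float:
    """Speedup with n processors when a fraction f of the work is strictly serial."""
    return n / (1 + (n - 1) * f)

print(amdahl_speedup(256, 0.10))  # ~9.66  (the worked example above)
print(amdahl_speedup(8, 0.05))    # ~5.93  (exercise: 95% parallel loop on 8 CPUs)
print(1 / 0.05)                   # 20.0   (limit as n -> infinity when f = 0.05)
```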
Amdahl’s Law exercises
● 95% of a program’s execution time occurs inside a loop that can be executed in parallel.
What is the maximum speedup from a parallel version of the program executing on 8 CPUs?
Speedup = n / (1 + (n − 1)f) = 8 / (1 + 7 × 0.05) ≈ 5.93
● 5% of a parallel program’s execution time is spent within inherently sequential code.
What is the maximum speedup, regardless of how many parallel cores are used?
1/f = 1/0.05 = 20
Speedup: Ideal vs. Actual Cases
● Amdahl's argument is too simplified to apply directly to real cases
● When we run a parallel program there is, in general, communication overhead, contention and workload
imbalance among processes, which prevents achieving the ideal speedup predicted by Amdahl's law
Amdahl’s Law on Map/Reduce jobs
[Figure: anatomy of a MapReduce job. The map and reduce work is parallelizable; data transmission over the
network (the copy/shuffle) is non-parallelizable and is the bottleneck.]

Image source: Hadoop: The Definitive Guide, Tom White


Amdahl’s Law on Map/Reduce jobs
● Job setup
● Load Split
● Map
● Copy
Which tasks can be parallelized and which can not?
● Merge
● Reduce
● Write part
Amdahl’s Law on Map/Reduce jobs
● Job setup
● Load Split
● Map
● Copy
● Merge
● Reduce
● Write part

The merge depends on the network: even though the copying can happen in parallel, it is bottlenecked by the
network, so the individual merges run in parallel but may not complete at the same time
Indicators for Hadoop Job Performance
Latency is the time between the start of a job and when it starts delivering output
In Hadoop: total job execution time is the latency

Throughput of the job is measured in bytes/second (the number of output bytes generated per second)
Note that high latency can occur even when high throughput is measured, especially in a system like Hadoop,
because of the coordination overhead at the beginning of the job (see the sketch below)
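A minimal sketch of the two indicators for a single job; the figures are made-up illustrations:

```python
# Hypothetical job measurements
job_start_s = 0.0
job_end_s = 600.0               # the job ran for 10 minutes
output_bytes = 12_000_000_000   # 12 GB of output

latency_s = job_end_s - job_start_s        # in Hadoop, latency = total job execution time
throughput_bps = output_bytes / latency_s  # output bytes generated per second

print(f"latency    = {latency_s:.0f} s")
print(f"throughput = {throughput_bps / 1e6:.0f} MB/s")  # high throughput can coexist with high latency
```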
Hadoop Performance Overheads
Job setup is costly and becomes more complex the bigger the dataset is
Reading from HDFS takes up some CPU cycles
HDFS has some latency (microseconds per block read)
Concurrent read threads result in lock contention, for example when reading from FSNamesystem
Disks and the network have finite throughput (MB/sec)
Hadoop jobs are typically I/O- or network-bound, not CPU-bound

More at: http://www.slideshare.net/cloudera/hadoop-world-2011-hadoop-and-performance-todd-lipcon-yanpei-chen-cloudera
Load Balancing Problems: Data skew
Problem: not every task processes the same amount of data
● Mappers: splits are in general balanced, because the input data size is known beforehand
● Reducers: may not be balanced; it depends on the number of keys and the number of values per key
For instance, think about the Word Count example where partitioning is done by starting letter:
some letters, such as E or T, are much more common than the other letters of the alphabet
The partitioning of keys to reducers can be tuned (e.g., learned from a sample of the data) to provide a
balanced spread regardless of the skew, but this requires an initial sampling of the data → more overhead!
(See the sketch below.)
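A minimal sketch contrasting the two partitioning choices above, using the Word Count example. The word frequencies are made up for illustration; note that even hash partitioning cannot split a single very popular key across reducers:

```python
from collections import Counter

words = ["the"] * 500 + ["to"] * 300 + ["egg"] * 250 + ["and"] * 200 + ["zebra"] * 5 + ["quiz"] * 3
num_reducers = 4

def by_first_letter(word):   # naive partitioner: all 't' and 'e' words pile onto few reducers
    return ord(word[0]) % num_reducers

def by_hash(word):           # hash-style partitioner: spreads distinct keys more evenly,
    return hash(word) % num_reducers  # but a single hot key ("the") still lands on one reducer

for partition in (by_first_letter, by_hash):
    load = Counter()
    for w in words:
        load[partition(w)] += 1  # number of records each reducer would receive
    print(partition.__name__, dict(load))
```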
Languages Word Frequency Distribution

Word frequencies follow a Zipfian distribution

→ Popular words appear far more often than rare ones

→ As a result, load imbalance occurs

→ More frequent words mean more load on the reducer that receives them
Source: https://en.wikipedia.org/wiki/Zipf%27s_law
Performance Analysis of MapReduce Jobs
● Input dataset: size, number of records?
● Average number of records generated per Mapper?
● How much information is being sent over the network? Does the combiner help reduce the communicated volume?
● Number of keys/records sent to each Reducer?
● Data skew of the mapper results? Keys with too many records (i.e., popular keys)?
Contents

● Reliability in Distributed Systems


● Reliability in Hadoop
● Performance
● High Level Frameworks for Other Workloads (Hive, Mahout, Storm, Presto)

Quiz and Break


Hadoop High-Level Frameworks

There are several frameworks that build upon Hadoop to offer additional functionality, including
● Hive: SQL queries

● Mahout: machine learning

● Storm: stream processing

● Presto: another SQL engine in the spirit of Hive, designed by Facebook


Hive (https://hive.apache.org)

Enables running traditional SQL queries on big data frameworks


SQL is not ideal for every big data problem but it is a well established and widely used language
The Hive driver translates the SQL queries into MapReduce tasks on the Hadoop cluster
Requires a metadata store (or metastore) which is a repository for Hive metadata.
The metastore is divided into two pieces:
● A service which the Hive driver uses
● A storage/database for the metadata
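A minimal sketch of issuing a HiveQL query from Python, here using the third-party PyHive client as one possible route; the hostname, table and column names are made-up assumptions. The query itself is what the Hive driver compiles into MapReduce tasks on the cluster:

```python
from pyhive import hive  # third-party client: pip install pyhive

# Connect to a HiveServer2 instance (hostname, port and database are illustrative)
connection = hive.Connection(host="hive.example.org", port=10000, database="default")
cursor = connection.cursor()

# An ordinary-looking SQL query; Hive translates it into MapReduce tasks on the Hadoop cluster
cursor.execute("""
    SELECT word, COUNT(*) AS occurrences
    FROM documents
    GROUP BY word
    ORDER BY occurrences DESC
    LIMIT 10
""")

for word, occurrences in cursor.fetchall():
    print(word, occurrences)
```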
Hive (https://hive.apache.org)
Mahout (https://mahout.apache.org)
● Platform for distributed machine learning algorithms, focused primarily on linear algebra
● Core algorithms for clustering, classification and batch-based collaborative filtering are implemented on
top of Apache Hadoop using the map/reduce paradigm
● Development is currently focused primarily on Spark
Mahout (https://mahout.apache.org)
Storm (https://storm.apache.org)

● Stream-processing counterpart to Hadoop

● Users define "spouts" (information sources) and "bolts" (processing steps) for the distributed
processing of streaming data (see the conceptual sketch below)
● A topology is defined as a directed acyclic graph (DAG)
● Topologies run indefinitely until killed
● Data is processed in real time
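A minimal conceptual sketch of the spout/bolt idea in plain Python generators; this is not the Storm API (which is Java-based), and the names and sample sentences are purely illustrative:

```python
import random
import time

def sentence_spout():
    """Spout: an unbounded source that keeps emitting tuples (here, random sentences)."""
    samples = ["the cow jumped over the moon", "an apple a day", "the quick brown fox"]
    while True:
        yield random.choice(samples)
        time.sleep(0.1)

def split_bolt(sentences):
    """Bolt: transforms each incoming tuple, emitting one tuple per word."""
    for sentence in sentences:
        yield from sentence.split()

def count_bolt(words):
    """Bolt: maintains running word counts; in Storm this runs indefinitely until killed."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
        yield word, counts[word]

# Wire the DAG: spout -> split bolt -> count bolt
for i, (word, count) in enumerate(count_bolt(split_bolt(sentence_spout()))):
    print(word, count)
    if i >= 20:  # a real topology would run until explicitly killed
        break
```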
Storm (https://storm.apache.org)
[Figure: a Storm topology — spouts emit tuples into bolts, which feed further bolts.]
Presto (https://prestodb.io)

● Presto is another distributed SQL query engine in the spirit of Hive, originally designed by Facebook
● It can query a variety of data sources
● Presto, however, does not write intermediate results to the local hard disk, which results in a
significant speed improvement.
Presto (https://prestodb.io)
Contents

● Reliability in Distributed Systems


● Reliability in Hadoop
● Performance
● High Level Frameworks for Other Workloads (Hive, Mahout, Storm, Presto)

Quiz
