ECS765P - W8 - Hadoop Reliability Performance
Availability is the percentage of total time that a system is available for use.
● High availability (HA) is a characteristic of a system that aims to ensure an agreed level of operational
performance for an agreed percentage of time.
For example, achieving 99.99% availability allows only 1 hour of downtime every 10,000 hours (~14 months).
● Fault tolerance is a property of a system that allows it to continue operating in the event of a failure.
HA implies there should be no single points of failure, or that the system should be fault-tolerant.
● Graceful degradation means that when some components fail, the system temporarily continues to work,
but with reduced performance.
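The "counting nines" arithmetic above can be checked with a short sketch (the function name is illustrative):

```python
# Convert an availability percentage ("counting nines") into permitted downtime.
# Illustrative sketch; the 99.99% figure matches the example above.

def downtime_per_period(availability_pct: float, period_hours: float) -> float:
    """Hours of downtime permitted over `period_hours` at the given availability."""
    return period_hours * (1 - availability_pct / 100)

# 99.99% availability over 10,000 hours (~14 months) allows ~1 hour of downtime.
print(downtime_per_period(99.99, 10_000))
```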
High Availability Measurement: Counting Nines
[Figure 7.2: Distributions of machine restarts over six months at Google. (Updated in 2018.)]
[Figure 7.3: Distribution of machine downtime (minutes, log scale), observed at Google over six months.]
Figure 7.2 shows the distribution of machine restart events: 50% or more of machines restart at least once
a month, on average. The tail is relatively long (the figure truncates the data at 11 or more restarts) due
to the large population of machines in Google’s fleet. Approximately 5% of all machines restart more than
once a week. The annualized restart rate across all machines is 12.4, corresponding to a mean time between
restarts of about one month. Restart statistics are key parameters in the design of fault-tolerant software
systems, but the availability picture is complete only once we combine it with downtime data.
From Barroso et al., The Datacenter as a Computer, 2nd Ed., Morgan & Claypool, 2013
Contents
Goal: detect errors and gracefully recover from them while not interrupting job execution (if possible)
Any Hadoop node/daemon/process can fail during a job
● Data integrity error
● Task (Map/Reduce) failure
● NodeManager failure
● ApplicationMaster failure
● ResourceManager/NameNode failure
This is because the intermediate results from the Map tasks are not stored in HDFS but in either the
memory or the local disk of the node, so they are lost when the node fails and those tasks must be rerun.
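A minimal sketch (not the Hadoop API; all names are illustrative) of the consequence described above: any
completed map task whose output lived on the failed node must be re-scheduled.

```python
# Sketch: intermediate map output lives on the node's local disk, not in HDFS,
# so a node failure loses it and the affected map tasks must be rerun.

def tasks_to_rerun(completed_maps: dict, failed_node: str) -> list:
    """Map task IDs whose intermediate output was lost with the failed node."""
    return [task for task, node in completed_maps.items() if node == failed_node]

completed = {"map_0": "node1", "map_1": "node2", "map_2": "node1"}
print(tasks_to_rerun(completed, "node1"))  # ['map_0', 'map_2']
```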
ApplicationMaster Failure
● The ApplicationMaster also sends heartbeat messages to the ResourceManager (RM)
● The RM detects the failure of an ApplicationMaster when no heartbeat is received, and declares it failed.
The RM then kills all the containers of the failed ApplicationMaster.
● The RM starts a new ApplicationMaster instance in a different container (managed by the RM).
● For MapReduce AMs, the job history is used to recover the state of the tasks that were already run
by the (failed) ApplicationMaster, so they don’t have to be rerun from scratch.
● There is a maximum number of attempts (default is 2).
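The heartbeat-based detection described above can be sketched as follows (the timeout value and function
names are illustrative, not actual YARN configuration keys):

```python
# Sketch of heartbeat-based failure detection, as the RM does for the AM.
# The 600-second timeout is an illustrative assumption.

def is_failed(last_heartbeat: float, now: float, timeout: float = 600.0) -> bool:
    """Declare a daemon failed if no heartbeat arrived within `timeout` seconds."""
    return now - last_heartbeat > timeout

print(is_failed(last_heartbeat=0.0, now=700.0))  # True: declare the AM failed
print(is_failed(last_heartbeat=0.0, now=30.0))   # False: still alive
```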
ALERT - ResourceManager/NameNode failure
The ResourceManager and the NameNode are the main single points of failure for Hadoop, leading to:
Loss of data/progress of the tasks
The cluster stops working and no more jobs can run
The Secondary NameNode communicates periodically with the NameNode and stores a backup copy of the
index table to avoid data loss.
The ResourceManager failure is quite serious!
Store the list of ApplicationMasters (AMs) in a highly available state store backed by ZooKeeper or HDFS.
Store the progress of each AM, so when the RM restarts, it can restart/resume the scheduled jobs.
This can result in a long wait; can you think of an alternative solution?
Hadoop 2.0: High Availability for ResourceManager/NameNode
As an alternative to the default setup in earlier versions:
Run 2 redundant NameNodes on different machines of the cluster: Active and Standby.
A new daemon called the JournalNode is introduced (3 JournalNodes run by default).
The Active NameNode writes all changes to ALL journals.
Changes must be accepted by a majority of the journals.
The Standby NameNode reads the changes from the journals to catch up with state updates.
In this setting, no SecondaryNameNode is needed (we now have a Standby NameNode instead).
Same idea applied to the ResourceManager
Run a pair of resource managers in an active-standby configuration
https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
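The majority rule enforced by the JournalNodes can be sketched as follows (names are illustrative):

```python
# Sketch of the quorum rule: an edit written by the Active NameNode is
# committed only once a strict majority of the JournalNodes acknowledge it.

def committed(acks: int, journals: int = 3) -> bool:
    """An edit is durable once a strict majority of journals accept it."""
    return acks > journals // 2

print(committed(2, journals=3))  # True: 2 of 3 is a majority
print(committed(1, journals=3))  # False: must not be treated as committed
```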
Active-Passive vs ZooKeeper Architecture
https://learning.oreilly.com/library/view/yarn-essentials/9781784391737/ch08.html#ch08lvl1sec46
Contents
Quiz
Speedup concept
● Speedup of a parallel processing system is a function of n, the number of processors.
● For an embarrassingly parallel job, the speedup is ideal (linear): s_emb(n) = n
● Examples of embarrassingly parallel workloads:
1. Computer simulations comparing many independent scenarios
2. Hyperparameter grid search in machine learning
3. Large scale facial recognition systems
4. Brute-force searches in cryptography
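Such independent workloads need no coordination between tasks, which is what makes linear speedup possible.
A minimal Python sketch of the structure, where `scenario` stands in for one independent run (a thread pool
is used only to illustrate the independent-task pattern, not to demonstrate actual CPU speedup):

```python
# Independent scenarios (simulations, grid-search points, ...) can be farmed
# out to workers with no communication between them: s_emb(n) = n ideally.
from concurrent.futures import ThreadPoolExecutor

def scenario(seed: int) -> int:
    # Stand-in for one independent simulation run.
    return seed * seed

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(scenario, range(8)))
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```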
Amdahl’s Law
● In many jobs, some parts of the computation can only be executed on one processor.
● If the parts of job that can run only on a single processor take a fraction (f) of the total work, then the
maximum speedup is S(n=∞) = 1/f, which is limited by the speed of parts executed by single processor
● Amdahl’s law: if the remaining 1 - f of the work can be perfectly parallelized, then the speedup with n
processors is:
s(n) = n / (1 + (n − 1)·f)    (not embarrassingly parallel)
● s(n) grows with n, but never gets larger than 1/f
Source: https://en.wikipedia.org/wiki/Amdahl%27s_law
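The formula can be checked numerically with a minimal sketch:

```python
# Amdahl's law from the slide: f is the serial fraction, n the processor count.

def amdahl_speedup(n: int, f: float) -> float:
    return n / (1 + (n - 1) * f)

print(amdahl_speedup(8, 0.05))  # ~5.93 with 8 CPUs and f = 5%
print(1 / 0.05)                 # 20.0: the n -> infinity limit, 1/f
```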
Amdahl’s Law exercises
● 95% of a program’s execution time occurs inside a loop that can be executed in parallel.
What is the maximum speedup from a parallel version of the program executing on 8 CPUs?
Speedup = n / (1 + (n-1)·f) = 8 / (1 + 7·0.05) ≈ 5.93
● 5% of a parallel program’s execution time is spent within inherently sequential code.
What is the maximum speedup, regardless of how many parallel cores are used?
1/f = 1/0.05 = 20
Speedup: Ideal vs. Actual Cases
● Amdahl’s argument is too simplified to apply directly to real cases.
● When we run a parallel program, there are in general communication overheads, contention, and workload
imbalance among processes, which prevent achieving Amdahl’s ideal speedups.
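One simple, illustrative way to model this is to add an overhead term that grows with the number of
processes to Amdahl's denominator; the linear term c·(n−1) below is an assumption for illustration, not a
standard law:

```python
# Illustrative overhead model: per-process communication cost c lowers the
# achievable speedup below Amdahl's ideal bound. The linear c*(n-1) term is
# an assumption for this sketch.

def speedup_with_overhead(n: int, f: float, c: float) -> float:
    return n / (1 + (n - 1) * f + c * (n - 1))

for n in (1, 8, 64, 512):
    print(n, round(speedup_with_overhead(n, f=0.05, c=0.01), 2))
```

With f = 0.05 and c = 0.01 the curve saturates near 1/(f + c) ≈ 16.7, below the ideal 1/f = 20.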
Amdahl’s Law on Map/Reduce jobs
In a MapReduce job, the Map and Reduce phases are parallelizable, but the shuffle of intermediate results
over the network is not: the network is the bottleneck of the data transmission.
The merge step depends on the network, so even though the communication may happen in parallel, the
transfers are bottlenecked by the network, and the individual merges may not all complete at the same time.
Indicators for Hadoop Job Performance
Latency is the time between the start of a job and when it starts delivering output
In Hadoop: total job execution time is the latency
Throughput of the job is measured in bytes/second (the number of output bytes generated per second)
Note that high latency can occur even when high throughput is measured, especially in a system like Hadoop,
due to the overhead of coordination done at the beginning of the job.
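The distinction can be made concrete with illustrative numbers: total job time (latency) includes the setup
overhead, while throughput is simply output bytes per second over that time:

```python
# Toy latency/throughput calculation with illustrative numbers: setup overhead
# inflates the latency of the job even when the sustained throughput is high.

def throughput(output_bytes: int, job_seconds: float) -> float:
    """Bytes of output produced per second over the whole job."""
    return output_bytes / job_seconds

setup_s, processing_s = 30.0, 90.0   # coordination overhead + useful work
latency_s = setup_s + processing_s   # total job execution time = latency
print(latency_s)                     # 120.0 seconds
print(throughput(1_200_000_000, latency_s))  # 10,000,000.0 bytes/s
```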
Hadoop Performance Overheads
Job setup is costly, becomes more complex the bigger the dataset is
Reading from HDFS takes up some CPU cycles
HDFS has some latency (microseconds per block read)
Concurrent read threads result in lock contention, for example, reading from FSNamesystem
Disks or network have finite throughput (MB/sec)
Hadoop is I/O or network bound, often not CPU bound
Source: https://en.wikipedia.org/wiki/Zipf%27s_law
Performance Analysis of MapReduce Jobs
Input dataset
size, number of records?
Average number of records generated per Mapper
How much information is being sent over the network?
Does the combiner help reduce the communicated volume?
Number of keys/records sent to each Reducer
Data skew of the mapper results?
Keys with too many records (i.e., popular key)?
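The last two questions can be checked on a sample of mapper output; the threshold, sample data, and names
below are illustrative:

```python
# Sketch: count records per key on a sample of mapper output to spot skew.
# A "popular" key (holding many records) will overload a single reducer.
from collections import Counter

sample = ["the", "the", "the", "cat", "sat", "the", "on", "cat"]
counts = Counter(sample)
total = sum(counts.values())

# Flag keys holding more than 30% of the records (threshold is illustrative).
hot_keys = [k for k, c in counts.items() if c / total > 0.3]
print(hot_keys)  # ['the'] -- 4 of 8 records share one key
```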
Contents
There are several frameworks which build upon Hadoop to offer additional functionality, including:
● Hive (SQL queries on Hadoop)
Presto (https://prestodb.io)
● Presto is another variation of Hive for SQL queries, originally designed by Facebook
● It can query a variety of data sources
● Presto, however, does not write intermediate results to the local hard disk which results in a
significant speed improvement.
Contents
Quiz