YARN HA RUNBOOK FOR SLB
PROD/QA/DEV ENVIRONMENTS
© Copyright 2019, ATOS PGS sp. z o.o. All rights reserved. Reproduction in whole or in part is prohibited without the prior
written consent of the copyright owner. For any questions or remarks on this document, please contact Atos Poland Global
Services, +48 22 4446500.
OWNER
YARN HA Runbook for SLB PROD/QA/DEV environments
sys_WordMark_AT_Continued
version: 0.10
Public
document number:
Contents
1 Audience and document purpose
2 Components in scope
3 Ambari and Yarn service
3.1 Ambari database (PostgreSQL) backup and restoration
4 Resource Manager
4.1 Automatic failover for Resource Manager
5 AppTimeline server
6 Node Manager
7 Zookeeper
List of changes
Version  Date        Description                                      Author(s)
0.1      09.12.2019  Initial document structure created               Piotr Radke
1.0      12.12.2019  Document draft with the pdf documents included   Piotr Radke
                     (in case of no HDP docs available online)
1.1      13.12.2019  Final document versioning                        Marcin Niewiatowski
1 Audience and document purpose
This document has been prepared for the SLB HDP platform administrators and the ATOS team responsible for maintaining the PROD/QA and DEV environments. End users and business teams are neither participants in the process nor recipients of this document.
The document describes the configuration current at the time of writing, together with the processes and detailed steps for archiving, updating, and restoring service functionality during an HA/DR drill or a real-life incident.
The processes described in this document are based on the vendor's (Hortonworks) best practices and documentation. The linked references are an integral part of the knowledge required to operate this runbook.
While creating this runbook, the authors followed the suggestions collected in the following articles:
Part one:
https://community.cloudera.com/t5/Community-Articles/Disaster-recovery-and-Backup-best-practices-in-a-typical/ta-p/246641
and part two:
https://community.cloudera.com/t5/Community-Articles/Disaster-recovery-and-Backup-best-practices-in-a-typical/ta-p/246651
To consolidate this knowledge, Arshad Amir Jamadar opened the following vendor support case:
https://my.cloudera.com/cases/639712/comments
2 Components in scope
To cover the HA/DR processes for the YARN service in each SLB environment (PROD, QA, DEV), the following components are described:
3 Ambari and Yarn service
The Resource Manager service is managed through Ambari. This determines the configuration steps for enabling high availability for the Resource Manager service, which are described in the documentation:
https://docs.cloudera.com/HDPDocuments/Ambari-2.4.2.0/bk_ambari-user-guide/content/resource_manager_high_availability.html
3.1 Ambari database (PostgreSQL) backup and restoration
https://community.cloudera.com/t5/Community-Articles/Backing-up-the-Ambari-database-with-Postgres/ta-p/246352
https://docs.cloudera.com/HDPDocuments/Ambari-2.1.1.0/bk_ambari_reference_guide/content/_back_up_current_data.html
These folders were initially backed up on the NLXS5133 server under /home/backup, for each environment respectively. No restoration process is needed for the Ambari configuration files, because it is assumed that either:
- a new Ambari instance will be installed and the old config files can simply be compared, or
- work will continue on the current Ambari server instance, with the saved files used only as a reference to the initial configuration.
The essential Ambari configuration information is stored in two Ambari databases, named 'ambari' and 'ambarica'.
For the SLB HDP platform, the PostgreSQL DB dumps are archived on the NLXS5133 infrastructure server.
The backup files are located in subfolders named after the corresponding PostgreSQL server, for example:
/home/backup/QAbackups/nlxs5146_postgres
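As a sketch, dumps under such a folder could be produced with pg_dump. The host name, file naming, and schedule below are illustrative assumptions, not taken from the actual SLB backup job:

```shell
#!/bin/sh
# Sketch of a dump job following the /home/backup/QAbackups/<pg-host>_postgres
# convention. PG_HOST and the date-stamped file names are assumptions.
BACKUP_ROOT=/home/backup/QAbackups
PG_HOST=nlxs5146
STAMP=$(date +%Y%m%d)
DEST="$BACKUP_ROOT/${PG_HOST}_postgres"
for db in ambari ambarica; do
  # On the real server this command would be executed instead of echoed:
  #   sudo -u postgres pg_dump "$db" > "$DEST/${db}_${STAMP}.sql"
  echo "would dump $db to $DEST/${db}_${STAMP}.sql"
done
```

The plain-SQL dumps produced this way can later be replayed with `psql -f`, as the restoration steps below describe.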
1. Install a new instance of the Ambari server on a dedicated cluster node and verify that the process and the Ambari instance have started
2. Stop the Ambari server
3. Stop the Ambari agent
4. Connect to the PostgreSQL instance containing the new Ambari server databases
5. Drop the newly created ambari and ambarica databases and verify that they were dropped successfully
6. Manually create two new databases named ambari and ambarica
7. Copy the SQL backups of the old databases from the backup server to the PostgreSQL server
8. As the postgres user, restore the contents of the ambari and ambarica databases by executing the backed-up SQL files (you can use the psql -f <filename.sql> option)
9. Start the Ambari server
10. Start the Ambari agent
11. Connect to the recreated Ambari instance and verify its contents, configuration and functionality
12. Update the Ambari agents on all cluster nodes with the new Ambari server IP address/FQDN
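Steps 2 through 10 above can be sketched as a script. It only echoes the commands (a dry run), and the backup directory and dump file names are assumptions for illustration:

```shell
#!/bin/sh
# Dry-run sketch of the Ambari database restore steps (2-10).
# BACKUP_DIR and the dump file names are assumptions; replace the echo
# in run() with real command execution on the server.
set -e
BACKUP_DIR=/home/backup/QAbackups/nlxs5146_postgres
run() { echo "+ $*"; }

run ambari-server stop                                           # step 2
run ambari-agent stop                                            # step 3
for db in ambari ambarica; do
  run sudo -u postgres dropdb "$db"                              # step 5
  run sudo -u postgres createdb "$db"                            # step 6
  run sudo -u postgres psql -d "$db" -f "$BACKUP_DIR/${db}.sql"  # step 8
done
run ambari-server start                                          # step 9
run ambari-agent start                                           # step 10
```

The dry-run wrapper makes it safe to review the exact command sequence before executing it against a live PostgreSQL instance.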
4 Resource Manager
The Resource Manager is the YARN service responsible for tracking the resources in a cluster and scheduling applications. High availability for the Resource Manager is realized through an active/standby architecture: one Resource Manager is active and the others are standby, so that if anything happens to the active one, a standby can take over.
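For reference, Resource Manager HA in HDP is driven by a handful of yarn-site.xml properties, which Ambari sets during the HA enablement process. The host names and cluster id below are placeholders, not the actual SLB nodes:

```xml
<!-- Sketch of the yarn-site.xml properties behind Resource Manager HA.
     Host names and cluster id are placeholders, not the SLB cluster nodes. -->
<property><name>yarn.resourcemanager.ha.enabled</name><value>true</value></property>
<property><name>yarn.resourcemanager.ha.rm-ids</name><value>rm1,rm2</value></property>
<property><name>yarn.resourcemanager.hostname.rm1</name><value>rm-host-1.example.com</value></property>
<property><name>yarn.resourcemanager.hostname.rm2</name><value>rm-host-2.example.com</value></property>
<property><name>yarn.resourcemanager.cluster-id</name><value>yarn-cluster</value></property>
<property><name>yarn.resourcemanager.zk-address</name><value>zk1:2181,zk2:2181,zk3:2181</value></property>
```

The zk-address property is what ties Resource Manager HA to the ZooKeeper ensemble described later in this runbook.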
5 AppTimeline server
The App Timeline Server is the service responsible for storing and retrieving current and historic information about applications.
PROD: nlxs5135
QA: nlxs5139
DEV: nlxs5272
Unfortunately, the Timeline Server currently DOES NOT work in secure mode in YARN. This is why full high availability is not yet possible in YARN.
6 Node Manager
The Node Manager is an agent running on each node. It is responsible for executing part of a YARN job on its node, while other parts are executed on other nodes. Node Managers are redundant: when one of them fails, the others take over its tasks.
7 Zookeeper
In the context of the YARN service, ZooKeeper manages the Resource Manager role assignment and coordinates the distributed processes between the Resource Managers when HA is enabled.
ZooKeeper servers are replicated across the cluster (usually three instances are installed). At least 2 of the 3 must be running to guarantee ZooKeeper's standard operational status. Both Resource Managers have the ZooKeeper Failover Controller service installed (during the HA enablement process), and these services are responsible for reporting and negotiating the Resource Managers' active/standby status.
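The "2 of 3" figure above is simply the majority quorum that ZooKeeper requires. A small sketch (the quorum() helper here is ours, for illustration, not a ZooKeeper tool):

```shell
#!/bin/sh
# ZooKeeper stays operational while a strict majority (floor(N/2)+1)
# of its N servers is running; with 3 servers that is 2, so one server
# can be lost without losing the quorum.
quorum() { echo $(( $1 / 2 + 1 )); }
echo "3 servers: quorum $(quorum 3), tolerates $(( 3 - $(quorum 3) )) failure(s)"
echo "5 servers: quorum $(quorum 5), tolerates $(( 5 - $(quorum 5) )) failure(s)"
```

This is also why ensembles use odd sizes: going from 3 to 4 servers raises the quorum to 3 without tolerating any additional failures.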
The Resource Manager HA enablement has been described in the following documentation
(Chapter 5.5):
https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.5/bk_hadoop-high-availability/content/ch_HA-ResourceManager.html