YARN HA RUNBOOK FOR SLB
PROD/QA/DEV ENVIRONMENTS
© Copyright 2019, ATOS PGS sp. z o.o. All rights reserved. Reproduction in whole or in part is prohibited without the prior
written consent of the copyright owner. For any questions or remarks on this document, please contact Atos Poland Global
Services, +48 22 4446500.
OWNER
YARN HA Runbook for SLB PROD/QA/DEV environments
sys_WordMark_AT_Continued
version: 0.10
Public
document number:
Contents
1 Audience and document purpose
2 Components in scope
3 Ambari and Yarn service
3.1 Ambari database (PostgreSQL) backup and restoration
4 Resource Manager
4.1 Automatic failover for Resource Manager
5 AppTimeline server
6 Node Manager
7 Zookeeper
List of changes
Version  Date        Description                                      Author(s)
0.1      09.12.2019  Initial document structure created               Piotr Radke
1.0      12.12.2019  Document draft with the pdf documents included   Piotr Radke
                     (in case of no HDP docs available online)
1.1      13.12.2019  Final document versioning                        Marcin Niewiatowski
1 Audience and document purpose
This document has been prepared for the SLB HDP platform administrators and the ATOS team responsible for maintaining the PROD/QA and DEV environments. End users and business teams are neither participants in the process nor recipients of this document.
The document describes the configuration current at the time of writing, together with the processes and detailed steps for archiving, updating, and restoring service functionality during an HA/DR drill or a real-life incident.
The processes described in this document are based on the vendor's (Hortonworks) best practices and documentation. The linked references are an integral part of the knowledge required to operate this runbook.
While creating this runbook, the authors followed the suggestions collected in the following articles:
Part one:
https://community.cloudera.com/t5/Community-Articles/Disaster-recovery-and-Backup-best-practices-in-a-typical/ta-p/246641
and part two:
https://community.cloudera.com/t5/Community-Articles/Disaster-recovery-and-Backup-best-practices-in-a-typical/ta-p/246651
To consolidate this knowledge, Arshad Amir Jamadar opened the following vendor support case:
https://my.cloudera.com/cases/639712/comments
2 Components in scope
To cover the HA/DR processes for the YARN service in each SLB environment (PROD, QA, DEV), the following components are described:
3 Ambari and Yarn service
The Resource Manager service is managed through Ambari. This determines the configuration steps for enabling high availability for the Resource Manager service, which are described in the documentation:
https://docs.cloudera.com/HDPDocuments/Ambari-2.4.2.0/bk_ambari-user-guide/content/resource_manager_high_availability.html
3.1 Ambari database (PostgreSQL) backup and restoration
https://community.cloudera.com/t5/Community-Articles/Backing-up-the-Ambari-database-with-Postgres/ta-p/246352
https://docs.cloudera.com/HDPDocuments/Ambari-2.1.1.0/bk_ambari_reference_guide/content/_back_up_current_data.html
These folders were initially backed up on the NLXS5133 server under /home/backup, for each environment respectively. No restoration process is needed for the Ambari configuration files, because it is assumed that either:
- a new Ambari instance will be installed and the old config files can simply be compared, or
- work will continue on the current Ambari server instance, with the saved files used only as a reference to the initial configuration.
The essential Ambari configuration information is stored in two Ambari databases, named 'ambari' and 'ambarica'.
For the SLB HDP platform, the PostgreSQL DB dumps are archived on the NLXS5133 infrastructure server.
The backup files are located in subfolders named after the corresponding PostgreSQL server, for example:
/home/backup/QAbackups/nlxs5146_postgres
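As a sketch, dumps under such a folder could be produced with pg_dump. The host name, file naming, and schedule below are illustrative assumptions, not taken from the actual SLB backup job:

```shell
#!/bin/sh
# Sketch of a dump job following the /home/backup/QAbackups/<pg-host>_postgres
# convention. PG_HOST and the date-stamped file names are assumptions.
BACKUP_ROOT=/home/backup/QAbackups
PG_HOST=nlxs5146
STAMP=$(date +%Y%m%d)
DEST="$BACKUP_ROOT/${PG_HOST}_postgres"
for db in ambari ambarica; do
  # On the real server this command would be executed instead of echoed:
  #   sudo -u postgres pg_dump "$db" > "$DEST/${db}_${STAMP}.sql"
  echo "would dump $db to $DEST/${db}_${STAMP}.sql"
done
```

The plain-SQL dumps produced this way can later be replayed with `psql -f`, as the restoration steps below describe.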
1. Install a new instance of the Ambari server on a dedicated cluster node and verify that the process and the Ambari instance have started
2. Stop the Ambari server
3. Stop the Ambari agent
4. Connect to the PostgreSQL instance containing the new Ambari server databases
5. Drop the newly created ambari and ambarica databases and verify that they were dropped successfully
6. Manually create two new databases named ambari and ambarica
7. Copy the SQL backups of the old databases from the backup server to the PostgreSQL server
8. As the postgres user, restore the contents of the ambari and ambarica databases by executing the backed-up SQL files (you can use the psql -f <filename.sql> option)
9. Start the Ambari server
10. Start the Ambari agent
11. Connect to the recreated Ambari instance and verify its contents, configuration and functionality
12. Update the Ambari agents on all cluster nodes with the new Ambari server IP address/FQDN
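Steps 2 through 10 above can be sketched as a script. It only echoes the commands (a dry run), and the backup directory and dump file names are assumptions for illustration:

```shell
#!/bin/sh
# Dry-run sketch of the Ambari database restore steps (2-10).
# BACKUP_DIR and the dump file names are assumptions; replace the echo
# in run() with real command execution on the server.
set -e
BACKUP_DIR=/home/backup/QAbackups/nlxs5146_postgres
run() { echo "+ $*"; }

run ambari-server stop                                           # step 2
run ambari-agent stop                                            # step 3
for db in ambari ambarica; do
  run sudo -u postgres dropdb "$db"                              # step 5
  run sudo -u postgres createdb "$db"                            # step 6
  run sudo -u postgres psql -d "$db" -f "$BACKUP_DIR/${db}.sql"  # step 8
done
run ambari-server start                                          # step 9
run ambari-agent start                                           # step 10
```

The dry-run wrapper makes it safe to review the exact command sequence before executing it against a live PostgreSQL instance.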
4 Resource Manager
The Resource Manager is the YARN service responsible for tracking the resources in a cluster and scheduling applications. High availability for the Resource Manager is realized through an active/standby architecture: one Resource Manager is active and the others are standby, so that if anything happens to the active one, a standby can take over.
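For reference, Resource Manager HA in HDP is driven by a handful of yarn-site.xml properties, which Ambari sets during the HA enablement process. The host names and cluster id below are placeholders, not the actual SLB nodes:

```xml
<!-- Sketch of the yarn-site.xml properties behind Resource Manager HA.
     Host names and cluster id are placeholders, not the SLB cluster nodes. -->
<property><name>yarn.resourcemanager.ha.enabled</name><value>true</value></property>
<property><name>yarn.resourcemanager.ha.rm-ids</name><value>rm1,rm2</value></property>
<property><name>yarn.resourcemanager.hostname.rm1</name><value>rm-host-1.example.com</value></property>
<property><name>yarn.resourcemanager.hostname.rm2</name><value>rm-host-2.example.com</value></property>
<property><name>yarn.resourcemanager.cluster-id</name><value>yarn-cluster</value></property>
<property><name>yarn.resourcemanager.zk-address</name><value>zk1:2181,zk2:2181,zk3:2181</value></property>
```

The zk-address property is what ties Resource Manager HA to the ZooKeeper ensemble described later in this runbook.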
5 AppTimeline server
The App Timeline Server is the service responsible for storing and retrieving current and historic information about applications.
PROD: nlxs5135
QA: nlxs5139
DEV: nlxs5272
Unfortunately, the Timeline Server currently DOES NOT work in secure mode in YARN. This is why full high availability is not yet possible in YARN.
6 Node Manager
The Node Manager is an agent running on each node. It is responsible for executing part of a YARN job on its node, while other parts are executed on other nodes. Node Managers are redundant: when one of them fails, the others take over its tasks.
7 Zookeeper
In the context of the YARN service, ZooKeeper manages the Resource Manager role assignment and coordinates the distributed processes between the Resource Managers when HA is enabled.
ZooKeeper servers are replicated across the cluster (usually three instances are installed). At least 2 of the 3 must be running to guarantee ZooKeeper's standard operational status. Both Resource Managers have the ZooKeeper Failover Controller service installed (during the HA enablement process), and these services are responsible for reporting and negotiating the Resource Managers' active/standby status.
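The "2 of 3" figure above is simply the majority quorum that ZooKeeper requires. A small sketch (the quorum() helper here is ours, for illustration, not a ZooKeeper tool):

```shell
#!/bin/sh
# ZooKeeper stays operational while a strict majority (floor(N/2)+1)
# of its N servers is running; with 3 servers that is 2, so one server
# can be lost without losing the quorum.
quorum() { echo $(( $1 / 2 + 1 )); }
echo "3 servers: quorum $(quorum 3), tolerates $(( 3 - $(quorum 3) )) failure(s)"
echo "5 servers: quorum $(quorum 5), tolerates $(( 5 - $(quorum 5) )) failure(s)"
```

This is also why ensembles use odd sizes: going from 3 to 4 servers raises the quorum to 3 without tolerating any additional failures.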
The Resource Manager HA enablement has been described in the following documentation
(Chapter 5.5):
https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.5/bk_hadoop-high-availability/content/ch_HA-ResourceManager.html