YARN HA RUNBOOK FOR SLB
PROD/QA/DEV ENVIRONMENTS

© Copyright 2019, ATOS PGS sp. z o.o. All rights reserved. Reproduction in whole or in part is prohibited without the prior
written consent of the copyright owner. For any questions or remarks on this document, please contact Atos Poland Global
Services, +48 22 4446500.

AUTHOR(S) : Marcin Niewiatowski


DOCUMENT NUMBER :
VERSION : 1.0
STATUS : Final
SOURCE : Atos Poland Global Services
DOCUMENT DATE : 12 December 2019
NUMBER OF PAGES : 11

OWNER
HDFS HA Runbook for SLB PROD/QA/DEV environments

version: 0.10

Public

document number:

Contents
1 Audience and document purpose
2 Components in scope
3 Ambari and Yarn service
3.1 Ambari database (PostgreSQL) backup and restoration
4 Resource Manager
4.1 Automatic failover for Resource Manager
5 AppTimeline server
6 Node Manager
7 Zookeeper

Atos Poland Global Services


12 December 2019

List of changes
version | Date | Description | Author(s)
0.1 | 09.12.2019 | Initial document structure created | Piotr Radke
1.0 | 12.12.2019 | Document draft with the pdf documents included (in case of no HDP docs available online) | Piotr Radke
1.1 | 13.12.2019 | Final document versioning | Marcin Niewiatowski

1 Audience and document purpose

This document has been prepared for the SLB HDP platform administrators and the ATOS team
responsible for maintaining the PROD, QA and DEV environments. End users and business teams are
neither participants in the process nor intended recipients of this document.
The document describes the configuration that was current on the date of its creation, together
with the processes and detailed steps for archiving, updating and restoring service functionality
during an HA/DR drill or a real-life incident.
The processes described in this document are based on the vendor's (Hortonworks) best practices
and/or documentation. The linked vendor pages are an integral part of the knowledge required to
operate the runbook.
While creating this runbook, the authors followed the suggestions collected in the
following articles:
Part one:
https://community.cloudera.com/t5/Community-Articles/Disaster-recovery-and-Backup-best-practices-in-a-typical/ta-p/246641
Part two:
https://community.cloudera.com/t5/Community-Articles/Disaster-recovery-and-Backup-best-practices-in-a-typical/ta-p/246651
To systematize this knowledge, the following vendor support case was created by Arshad
Amir Jamadar:
https://my.cloudera.com/cases/639712/comments

2 Components in scope

To cover the HA/DR processes for the YARN service in each SLB environment (PROD, QA, DEV),
the following components are described:

1. Resource Manager – HA uses an Active/Standby architecture
2. App Timeline server – HA is not available for now
3. Node Manager – HA is achieved through redundancy
4. Zookeeper

3 Ambari and Yarn service

The Resource Manager service is managed through Ambari, which determines the configuration
steps for enabling high availability for the Resource Manager service. These steps are
described in the documentation:
https://docs.cloudera.com/HDPDocuments/Ambari-2.4.2.0/bk_ambari-user-guide/content/resource_manager_high_availability.html

3.1 Ambari database (PostgreSQL) backup and restoration


The dedicated PostgreSQL runbook covers all the technical steps and procedures related to
HA/DR in general. This part of the runbook describes only the basic Ambari DB backup.

The vendor documentation proposes the following:

https://community.cloudera.com/t5/Community-Articles/Backing-up-the-Ambari-database-with-Postgres/ta-p/246352

https://docs.cloudera.com/HDPDocuments/Ambari-2.1.1.0/bk_ambari_reference_guide/content/_back_up_current_data.html

The Ambari server configuration files are located on the Ambari server at

/etc/ambari-server/conf and /var/lib/ambari-server/.

These folders were initially backed up to the /home/backup location on the NLXS5133 server,
separately for each environment. No restoration process is needed for the Ambari configuration
files, because it is assumed that either:
- a new Ambari instance will be installed and the old config files will only be
compared against it
- or work will continue on the current instance of the Ambari server and the saved files
will be used only as a reference to the initial configuration.
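
A minimal sketch of how such a configuration backup could be taken, assuming tar with gzip support is available on the Ambari server. The function name backup_ambari_conf, the archive naming and the staging folder are hypothetical; copying the archive to the backup server is left out because the transfer method is not described in this runbook.

```shell
#!/usr/bin/env bash
# Sketch: archive the Ambari server configuration folders into one tar.gz file.
# backup_ambari_conf is a hypothetical helper, not part of Ambari.

# usage: backup_ambari_conf <destination dir> <folder> [<folder> ...]
backup_ambari_conf() {
    local dest_dir archive
    dest_dir="$1"
    shift
    archive="${dest_dir}/ambari_conf_$(date +%Y%m%d_%H%M%S).tar.gz"
    mkdir -p "$dest_dir" || return 1
    # -c create, -z gzip, -f archive file; folders are archived recursively
    tar -czf "$archive" "$@" || return 1
    # Print the archive path so the caller can copy it to the backup server
    echo "$archive"
}
```

For example, running backup_ambari_conf /tmp/ambari_conf_backup /etc/ambari-server/conf /var/lib/ambari-server as a user who can read both folders prints the path of the created archive.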

The substantive Ambari configuration data is stored in the two Ambari
databases, named ‘ambari’ and ‘ambarica’.

The tasks to perform the Ambari DB backup are:

1. Stop the existing Ambari server
2. Dump the Ambari databases ‘ambari’ and ‘ambarica’ (using the pg_dump
command) to two separate sql files
3. Start the Ambari server
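
The three backup tasks above could be scripted roughly as follows. This is a sketch under stated assumptions, not the procedure used on the SLB platform: it assumes the Ambari server and PostgreSQL run on the same host, that the script runs as root, and that the postgres OS user can write to the staging folder. DUMP_DIR, STAMP and the helper names run and backup_ambari_db are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of the three backup tasks. With DRY_RUN=1 (the default here) every
# command is only printed, so the plan can be reviewed before it is run.
DRY_RUN="${DRY_RUN:-1}"
DUMP_DIR="${DUMP_DIR:-/tmp/ambari_db_dumps}"   # hypothetical staging folder
STAMP="${STAMP:-$(date +%Y%m%d)}"              # hypothetical file name suffix

# Print the command in dry-run mode, execute it otherwise
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

backup_ambari_db() {
    local db rc=0
    run ambari-server stop || return 1
    for db in ambari ambarica; do
        # One plain-SQL dump file per database (pg_dump -f <file> <dbname>)
        run sudo -u postgres pg_dump -f "${DUMP_DIR}/${db}_${STAMP}.sql" "$db" || rc=1
    done
    # Start Ambari again even if a dump failed, then report the failure
    run ambari-server start || rc=1
    return "$rc"
}
```

Calling backup_ambari_db with the default DRY_RUN=1 only lists the four commands; setting DRY_RUN=0 executes them. The resulting sql files still have to be moved to the backup server.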

For the SLB HDP platform, the PostgreSQL DB dumps are archived on the NLXS5133
infrastructure server.

The root paths to the PROD/QA/DEV backups are:

/home/backup/PRODbackups/
/home/backup/QAbackups
/home/backup/DEVbackups

The backup files are located in the corresponding subfolders named after the PostgreSQL server,
for example:
/home/backup/QAbackups/nlxs5146_postgres

The tasks to perform the restoration are:

1. Install a new instance of the Ambari server on a dedicated cluster node and verify that the
process has started and the Ambari instance is available
2. Stop the Ambari server
3. Stop the Ambari agent
4. Connect to the PostgreSQL instance containing the new Ambari server databases
5. Drop the newly created ambari and ambarica databases and verify that they were dropped
successfully
6. Manually create two new databases named ambari and ambarica
7. Copy the sql backups of the old databases from the backup server to the PostgreSQL server
8. As the postgres user, restore the contents of the ambari and ambarica databases by
executing the backed-up sql files (you can use the psql -f <filename.sql> option)
9. Start the Ambari server
10. Start the Ambari agent
11. Connect to the recreated Ambari instance and verify its contents, configuration and
functionality
12. Update the Ambari agents on all cluster nodes with the new Ambari Server IP address/FQDN
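
Tasks 2 to 10 and task 12 of the list above could be sketched as follows. The sketch assumes that the Ambari server and PostgreSQL share a host, that the dump files were already copied to DUMP_DIR and are named ambari.sql and ambarica.sql, and that the database roles referenced in the dumps exist on the new instance. DUMP_DIR and the helper names run, restore_ambari_db and set_agent_server_host are hypothetical. The agent configuration path is the Ambari default and should be checked on the node before use.

```shell
#!/usr/bin/env bash
# Sketch of restoration tasks 2-10 and 12. DRY_RUN=1 (default) only prints.
DRY_RUN="${DRY_RUN:-1}"
DUMP_DIR="${DUMP_DIR:-/tmp/ambari_db_dumps}"   # hypothetical folder with the copied sql files

# Print the command in dry-run mode, execute it otherwise
run() {
    if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

restore_ambari_db() {
    local db
    run ambari-server stop || return 1
    run ambari-agent stop || return 1
    for db in ambari ambarica; do
        # Replace the freshly installed database with an empty one
        run sudo -u postgres psql -c "DROP DATABASE ${db};" || return 1
        run sudo -u postgres psql -c "CREATE DATABASE ${db};" || return 1
        # Replay the backed-up SQL file into the empty database
        run sudo -u postgres psql -d "$db" -f "${DUMP_DIR}/${db}.sql" || return 1
    done
    run ambari-server start || return 1
    run ambari-agent start
}

# Task 12: point an Ambari agent at the new server by rewriting the hostname
# entry of its ini file; a copy of the old file is kept as <file>.bak.
set_agent_server_host() {
    local new_host ini_file
    new_host="$1"
    ini_file="${2:-/etc/ambari-agent/conf/ambari-agent.ini}"
    sed -i.bak "s/^hostname[[:space:]]*=.*/hostname=${new_host}/" "$ini_file"
}
```

After set_agent_server_host has been run on a node, the agent on that node has to be restarted (ambari-agent restart) to pick up the new server address. The verification steps (tasks 5 and 11) remain manual.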

4 Resource Manager

The Resource Manager is the YARN component that tracks the resources in a cluster and
schedules applications. High availability for the Resource Manager is realized with an
active/standby architecture: one Resource Manager is active and the others are standby, ready to
take over if anything happens to the active one.

PROD: nlxs5135, nlxs5138

QA: nlxs5139, nlxs5141

DEV: nlxs5272, nlxs5273

4.1 Automatic failover for Resource Manager


The Zookeeper-based ActiveStandbyElector decides which Resource Manager is active. When
Zookeeper detects unresponsiveness or another failure of the active Resource Manager, another
Resource Manager is automatically elected as the active one.
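
To see which Resource Manager currently holds the active role, for example before and after a failover drill, the state of each one can be queried with yarn rmadmin -getServiceState. The sketch below assumes the Resource Manager IDs are rm1 and rm2 (the value of yarn.resourcemanager.ha.rm-ids should be checked in the cluster configuration); find_active_rm, YARN_CMD and RM_IDS are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch: ask each Resource Manager for its HA state and print the active one.
YARN_CMD="${YARN_CMD:-yarn}"      # command used to reach the YARN CLI
RM_IDS="${RM_IDS:-rm1 rm2}"       # assumed values of yarn.resourcemanager.ha.rm-ids

find_active_rm() {
    local id state
    for id in $RM_IDS; do
        # Prints "active" or "standby" for a reachable Resource Manager
        state="$($YARN_CMD rmadmin -getServiceState "$id" 2>/dev/null)" || state="unreachable"
        if [ "$state" = "active" ]; then
            echo "$id"
            return 0
        fi
    done
    echo "no active Resource Manager found" >&2
    return 1
}
```

Running find_active_rm on a cluster node with the YARN client installed prints the ID of the active Resource Manager, or returns a non-zero status if none reports itself as active.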

5 AppTimeline server

The App Timeline Server stores and retrieves current and historic information about
applications.

PROD: nlxs5135

QA: nlxs5139

DEV: nlxs5272

Unfortunately, the Timeline Server currently DOES NOT work in secure mode in YARN. This is the
reason why full high availability is not yet available in YARN.

6 Node Manager

The Node Manager is an agent deployed on each node. It is responsible for executing
a part of a YARN job on its node while other parts are executed on other nodes at the same time.
Node Managers are redundant: when one of them fails, the others take over its tasks.
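
How many Node Managers are currently available can be checked from the report printed by yarn node -list -all. The sketch below counts the nodes per state. It assumes the usual report layout, in which each node row starts with a host:port node ID followed by the node state. count_nodes_by_state and YARN_CMD are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch: count Node Managers per state from "yarn node -list -all".
YARN_CMD="${YARN_CMD:-yarn}"      # command used to reach the YARN CLI

# Prints one "<STATE> <count>" line per node state found in the report
count_nodes_by_state() {
    $YARN_CMD node -list -all 2>/dev/null |
        # Node rows are recognized by a first column ending in ":<port>"
        awk '$1 ~ /:[0-9]+$/ { n[$2]++ } END { for (s in n) print s, n[s] }' |
        sort
}
```

A growing count in any state other than RUNNING is a hint that capacity has been lost, even though running applications continue on the remaining nodes.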

7 Zookeeper
In the YARN service context, Zookeeper manages the Resource Manager role
assignment and coordinates the distributed processes between the Resource Managers while
HA is in use.
Zookeeper servers are replicated across the cluster (usually three instances are installed). At
least two of the three must be running to keep a majority quorum, which Zookeeper needs for
normal operation. Both Resource Managers run a Zookeeper-based failover elector, enabled during
the HA enablement process (in YARN this is the ActiveStandbyElector embedded in the Resource
Manager, not a separate Zookeeper Failover Controller daemon as in HDFS), and these electors
are responsible for reporting and negotiating the Resource Manager active/standby status.
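
The majority rule above (two out of three servers) can be checked with a short script. This sketch assumes the Zookeeper four-letter command ruok is enabled on the servers, that nc is installed, and that the client port is 2181. The helper names quorum_size, zk_is_ok and check_quorum are hypothetical, and the host names have to be supplied by the operator.

```shell
#!/usr/bin/env bash
# Sketch: check whether enough Zookeeper servers respond to keep a quorum.
ZK_PORT="${ZK_PORT:-2181}"        # assumed client port

# Majority quorum for an ensemble of N servers: floor(N/2) + 1
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# Succeeds if the server answers "imok" to the "ruok" four-letter command.
# Note: "imok" only shows that the process is up, not that it is serving.
zk_is_ok() {
    [ "$(echo ruok | nc -w 2 "$1" "$ZK_PORT" 2>/dev/null)" = "imok" ]
}

# usage: check_quorum <host> [<host> ...]
check_quorum() {
    local host total=0 up=0
    for host in "$@"; do
        total=$((total + 1))
        if zk_is_ok "$host"; then
            up=$((up + 1))
        fi
    done
    echo "${up}/${total} Zookeeper servers answered, quorum needs $(quorum_size "$total")"
    [ "$up" -ge "$(quorum_size "$total")" ]
}
```

Calling check_quorum with the three Zookeeper host names of an environment returns a zero status as long as at least two of them answer.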

The Resource Manager HA enablement is described in the following documentation
(Chapter 5.5):
https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.5/bk_hadoop-high-availability/content/ch_HA-ResourceManager.html
