Emergency Scaling Workflow

NOC - Emergency / Proactive Scaling

This document describes the workflow for proactive scaling of nodes for ALL customers (ACTIVE + POC).

The table below applies to both ACTIVE and POC customers.

Maintenance Category  Notice Period  Sorry App Notice  Provisioned Status  Alert Threshold  Conditions to Be Met                                         Sustained Spike
--------------------  -------------  ----------------  ------------------  ---------------  -----------------------------------------------------------  ---------------
Emergency             Immediate      Yes               Under Provisioned   80%              Proactive Scaling Trigger + AA Processing Delay + Kafka Lag  2 Hours
Emergency             Immediate      Yes               Fully Provisioned*  80%              Proactive Scaling Trigger + AA Processing Delay + Kafka Lag  2 Hours
Emergency             Immediate      Yes               Over Provisioned*   100%             Proactive Scaling Trigger + AA Processing Delay + Kafka Lag  2 Hours
Critical              48 hours       Yes               Under Provisioned   50%              Proactive Scaling Trigger                                    2 Hours
Critical              48 hours       Yes               Fully Provisioned*  80%              Proactive Scaling Trigger                                    8 Hours
Critical              48 hours       Yes               Over Provisioned*   100%             Proactive Scaling Trigger                                    8 Hours
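
The thresholds above can be read as a lookup keyed by maintenance category and provisioned status. The following is only a minimal sketch of that reading, not the NOC's actual tooling. It assumes that the alert threshold is a utilization percentage, that "sustained" means at or above the threshold for at least the listed number of hours, and that the Emergency conditions (AA Processing Delay + Kafka Lag) can be passed in as a single flag. The names `Rule`, `RULES`, and `should_scale` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    threshold_pct: int    # alert threshold (assumed: utilization %)
    sustained_hours: int  # how long the spike must be sustained


# Values copied from the table; keys are (category, provisioned status).
RULES = {
    ("emergency", "under"): Rule(80, 2),
    ("emergency", "fully"): Rule(80, 2),
    ("emergency", "over"): Rule(100, 2),
    ("critical", "under"): Rule(50, 2),
    ("critical", "fully"): Rule(80, 8),
    ("critical", "over"): Rule(100, 8),
}


def should_scale(category, status, utilization_pct, hours_sustained,
                 delay_and_lag=False):
    """Return True if the (assumed) conditions for scaling are met."""
    rule = RULES[(category, status)]
    if utilization_pct < rule.threshold_pct:
        return False
    if hours_sustained < rule.sustained_hours:
        return False
    # Emergency rows also require AA Processing Delay + Kafka Lag.
    if category == "emergency":
        return delay_and_lag
    return True
```

For example, under these assumptions `should_scale("critical", "fully", 85, 4)` is False, because the Fully Provisioned row for Critical maintenance requires an 8-hour sustained spike.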


Sorry App Status Notifications

* For Fully Provisioned / Over Provisioned environments:

If the environment sizing already supports the customer's contracted EPS/volume based on an average event size, and the environment is over-utilized for more than 8 hours (Critical maintenance), it can be oversized as follows:
Add up to 10,000 EPS, or 100% higher than contracted, whichever is lower. This can go up to 200% with approval from CloudOps Mgmt (NOC Manager/Director). Anything beyond that requires approval from Finance.
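
The oversizing limits can be illustrated with a little arithmetic. The sketch below is one possible reading, not an official policy implementation. It assumes that the percentages are measured as additional EPS relative to the contracted EPS, and that the 200% tier is not subject to the 10,000 EPS cap. The function name `approval_for_oversize` and its return labels are hypothetical.

```python
def approval_for_oversize(contracted_eps: int, extra_eps: int) -> str:
    """Return the approval level assumed to be needed for extra_eps."""
    if contracted_eps <= 0 or extra_eps < 0:
        raise ValueError("contracted_eps must be > 0 and extra_eps >= 0")

    # Without approval: 10,000 EPS or 100% of contracted, whichever is lower.
    no_approval_cap = min(10_000, contracted_eps)
    # With CloudOps Mgmt approval: up to 200% of contracted (assumed reading).
    mgmt_cap = 2 * contracted_eps

    if extra_eps <= no_approval_cap:
        return "none"
    if extra_eps <= mgmt_cap:
        return "cloudops_mgmt"
    return "finance"
```

Under these assumptions, a customer contracted for 5,000 EPS could be given up to 5,000 additional EPS without approval, while a customer contracted for 50,000 EPS would be limited to 10,000 additional EPS before CloudOps Mgmt approval is needed.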
For any rescheduling, NOC updates Sorry App with the new date and follows the same scheduling procedure. CS must inform CloudOps via the Teams channel.
The process flow for TOP 50 customers is updated (no change except for the exclusion list below).
NOC to tag the CSM for all TOP 50 customers.
TOP 50 Exclusion List: https://docs.google.com/spreadsheets/d/1Rncc1RWQDQcSwn8YfWLtq-aTHwEiMBU9ZNtyR67f-9k/edit#gid=0
