Site Reliability Engineering v2

Site Reliability
Engineering
(SRE)
2
• Name
• Total Experience
• Background – Development /
Introduction Infrastructure /
Management
• Experience on DevOps Tools, Cloud
• Your expectations from this training
2
3 3
Few Pointers
• SRE is more of a concept to implement (vs a tool). So theoretical aspects will be more.
• Many pointers will be familiar. It’s just the right use of same.
• Few best practices we already know but couldn’t implement. Now it’s the right time to
implement same. It’s a do or die situation now.
• Try to co-relate the topics with your domain and prepare notes with pointers.
• Implementation discussion is welcome for your existing teams and domains.
• Learn with - “What behavior will I change?” Learning isn’t collecting information. Learning
is changing behavior.
Module 1:
Principles & Practices
4
What is SRE?
5
Digital Transformation
6
A way Forward
Concept
Integration of digital technology into all areas of a business, fundamentally
PEOPLE
changing how you operate and deliver value to customers.
PROCESS
Successful
People Process Digital
The most important part Cultural Change Transformation
and the barrier too
SECURITY
TOOLS
Tools Security
Use technologies which Security, Legal, Compliance must
suits you the best be in center in all designs
Emerging Technologies
7
Which are helping industries in Digital Transformation
1
ML/AI Cloud
2
Better decision Making (well Keeping infra off-ground to third
ML/AI
Cloud predicted, informed) and autonomy party and move all Opex
Security
Blockchain IOT
6 Connect and Integrate whatever
DATA For secure transactions over
distributed public networks possible to automate
3
Blockchain
Automation Security toolsets
Automation Remain secure to avoid Financial,
Automation with DevOps Toolsets
IOT legal and compliance issues
– CICD, SRE, CM, Containers etc.
5
4
8
DevOps
8
9
SDLC Model
• A systems development life cycle is composed of several clearly defined and distinct work phases which
are used by systems engineers and systems developers to plan for, design, build, test, and deliver
information systems
Require-
ment
Analysis
Evaluation Design
SDLC – Life Cycle
Implementa
Testing
tion
9
10
Waterfall Model
1. Determine the Requirements
- long release cycle
2. Complete the design - A lot of WIP
- Functional silos
3. Do the coding and testing (unit
- Incredibly rigid for developing
tests)
4. Perform other tests (functional
tests, non-functional tests,
Performance testing, bug fixes
etc.)
5. At last deploy and maintain
10
11
Agile
- Shorter release cycle
- Small batch sizes (MVP)
- Cross-functional teams
- Incredibly agile
11
12
Lean Development
- Suddenly ops was the bottleneck (more release less people), again WIP is more!
12
13
DevOps
Software Development Build & Release, Testing Teams
DevOps
- Break the Silos - Involvement in the early

- Communication (not only with development stages
emails) Infrastructure, Operations and - Automation is the key
- Collaboration Support - Continuous Integration
- Trust - Continuous Deployments in
the lower environments
- Fail fast and fail often
DevOps
14
APP APP APP APP
- name:
Playbook - name:
for Playbook
webserver for
setup webserver
hosts: setup
all t
tasks: ....
- name: ....
package ....
installatio ....
n
yum:
name=yum
state=prese
nt
Waiting Waiting
....
Waiting Waiting
....
....
....
Development QA Testing Implementation & Release Infra Management

15
DevOps
• DevOps is a loose set of practices, guidelines, and culture designed to break down silos in IT development,
operations, networking, and security.
• In a DevOps approach, you improve something (often by automating it), measure the results, and share
those results with colleagues so the whole organization can improve.
• DevOps, Agile, and a variety of other business and software reengineering techniques are all examples of a
general worldview on how best to do business in the modern world. None of the elements in the DevOps
philosophy are easily separable from each other, and this is essentially by design.
• DevOps is a broad set of principles about whole-lifecycle collaboration between operations and product
development.
15
16
DevOps Principles
No More Silos
Accidents are not just a result of the isolated

Accidents Are Normal actions of
Change is anbestindividual,
when it but is small
ratherand result
frequent.
from
missing
Change
A good culture
is
safeguards
risky,can
true,
work
for
butwhen
around
the correct
things
brokenresponse
inevitably
tooling,isbut
go
to
Extreme siloization of knowledge, incentives for
wrong.
split
the
Measureup
opposite
your
your
I.e. changes
outcomes
rarely
Misconfigured
into
holds
time
smaller
true.
to time.
System,
subcomponents
Promoters
Its canbroken
be ofin
purely local optimization, and lack of collaboration
Change Should Be Gradual monitoring,
where
DevOps
the formpossible.
ofstrongly
under
NumberThen
pressure
of
emphasize
youincidents,
build
wronga faster
organizational
steady
actions
time
CICD
etc.
to
have in many cases been actively bad for business
Rooting MTTR,
pipeline
culture—rather
market, of
outlow-risk
the
SLA
than
Mistake
etc.
change
tooling—as makers
out the
of regular
and
key to punishing
success
output
them
from
in adopting
product,
createsa newdesign,
mess,
way of like
and
working.
incentives
infrastructure to changes
confuse
Tooling and Culture Are Interrelated issues,
with Automated
hide the truth,testing
and andblame
improvements.
others, all of which
are ultimately unprofitable distractions.
Measurement Is Crucial
16
17
Site Reliability Engineering
17
18
Way to SRE
Site Reliability Engineering (SRE) is a term (and associated job role) coined by Ben Treynor Sloss, a VP of engineering at
Google.
SRE is a job role, a set of practices which are known to work at ground, and some beliefs that animate those practices.
SRE is coined around Reliability of the system.
In general, an SRE has particular expertise around the availability, latency, performance, efficiency, change management,
monitoring, emergency response, and capacity planning of the service(s) they are looking after.
SRE implements interface DevOps.
SRE is hiring software engineers to run products and to create systems to accomplish the work that would otherwise be
performed, often manually, by sysadmins.
Common to all SREs is the belief in and aptitude for developing software systems to solve complex problems.
19
Way to SRE
SRE is a team of people who (a) will quickly become bored by performing tasks by hand, and (b) have the skill set necessary
to write software to replace their previously manual work, even when the solution is complicated.
SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software
expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, design
and implement automation with software to replace human labor.
By design, it is crucial that SRE teams are focused on engineering. Without constant engineering, operations load increases
and teams will need more people just to keep pace with the workload
Google places a 50% cap on the aggregate "ops" work for all SREs—tickets, on-call, manual tasks, etc. This cap ensures that
the SRE team has enough time in their schedule to make the service stable and operable via engineering tasks of
automation.
In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management,
monitoring, emergency response, and capacity planning of their service(s).
20
SRE Principles
Way to SRE
Operations Is a Software Problem
Manage by Service Level Objectives

Ideally, both product development and SRE teams
Work to Minimize Toil Cost of failure
Automation
should haveis isathedirectly
holistic
key. The proportional
view
real work
of the to
in Mean
this
stack—the
area is
Time to Repair
determining
frontend, (MTTR)
backend,
what towhich
libraries, effects
storage,product
automate, under
kernels,what and
Having similar qualified tools across organizations
Automate - What you can SRE should
Toils
developershould
conditions,
physical therefore
be This
velocity.
machine—and
and how to use team
reduced
follows
automate
no software
to it.
from theengineering
minimum
should
As well-
per and
jealously
Google
will
Definehelp
SLOs theand easy
workprocess
around same understanding and
approaches
automation
known
max
own 50%fact
single that
work solve
to be done
the
components.
can bethat
latertoilproblem.
toin
Itthe
andtheextent
product
turns
rest out
50% possible.
lifecycle
that
timeyou a
should
can
SRE/DevOps Culture adoption.
problem
a lot is
be given
get todiscovered,
more SREdone the more
Engineering
if you expensive
“blurtasks
the lines” it is have
or something
and to
Move Fast by Reducing the Cost of Failure
fix.
new. instrument JavaScript, or product developers
SREs
qualify kernels configurations.
Share Ownership with Developers
Use the Same Tooling, Regardless of Function
20
21
DevOps vs SRE
Way to SRE
• DevOps a loose generic set of principles (philosophy and culture) and SRE an advanced explicit implementation.
• Site Reliability Engineering, like DevOps, should not just be changing titles, but making definitive behavior changes,
focusing on outcomes and obviously reliability.
• Collaboration is front and center for DevOps work. An effective shared ownership model and partner team relationships
are necessary for SRE to function.
• Change management is best pursued as small, continual actions, the majority of which are ideally both automatically
tested and applied. The critical interaction between change and reliability makes this especially important for SRE.
• Measurement is absolutely key to how both DevOps and SRE work. For SRE, SLOs are dominant in determining the
actions taken to improve the service. For DevOps, the act of measurement is often used to understand what the outputs
of a process are, what the duration of feedback loops is, and so on.
• DevOps is relatively silent on how to run operations at a detailed level. While SRE talks about detailed steps of
implementations and deployments.
21
22
DevOps vs SRE
Way to SRE
• DevOps is more context-sensitive and works organization wide. SRE, on the other hand, has relatively narrowly defined
responsibilities and its remit is generally service-oriented (and end-user-oriented) rather than whole-business-oriented.
• Ultimately, implementing DevOps or SRE is a holistic act; both hope to make the whole of the team (or unit, or
organization) better, as a function of working together in a highly specific way. For both DevOps and SRE, better velocity
should be the outcome.
22
SRE Context and Successful Adoption
23
Way to SRE
Narrow, Rigid (launch-related or reliability-related) Incentives, Narrow Your Success.
A system with early SRE engagement (ideally, at design time) typically works better in production after deployment,
regardless of who is responsible for managing the service.
Don’t just allow, but actively encourage, engineers to change code and configuration when required for the product.
Support blameless postmortems. Doing so eliminates incentives to downplay or cover up a problem.
Allow support to move away from products that are irredeemably operationally difficult. The threat of support withdrawal
motivates product development to fix issues both in the run-up to support and once the product is itself supported, saving
everyone time.
Always remember – Good people will quit if they’re tasked with too much operational work and aren’t given the opportunity
to use their engineering skill set.
Consider Reliability Work as a Specialized Role.
Strive for Parity of Esteem: Career and Financial.

24
Brain - Storming
25
DevOps vs SRE
An x company wants to reduce the time to market for its new software product releases and facing below issues:
• Hardware capacity planning is a challenge

• Infra is new, yet hardware failures are more
• Lots of bugs are being identified in the products
• Releases fails on production days
• Huge Incident tickets post new releases for next few days.
Where DevOps can help in this area?
Where SRE can help in this segment?
Understand how SRE heals DevOps Failures…

Case Study – French Telecom
26
DevOps vs SRE
Identifying the DevOps Work and building DevOps Team
Identifying Reliability needs and building SRE Engineering Team
Continuous Enhancement…
Video
27
DevOps vs SRE
DevOps vs SRE (Google)
https://www.youtube.com/watch?v=uTEL8Ff1Zvk
Exercise
28
DevOps vs SRE
What we do all day? Is there a way to automate?
Is there any way to make the system more reliable?
Factored ROI?
Q&A
29
DevOps vs SRE
Q&A
Questionnaire
30
Module 2:
Service Level Objectives (SLOs) &
Error Budgets
30
Important terms
31
Way to SRE
Availability=The ability of less downtime, or the fraction of the time that a service is usable. Although 100% availability is
impossible, near-100% availability is often readily achievable
Reliability=The ability to work properly (even if some parts/components failed).
Durability=The ability of not losing data. Or the likelihood that data will be retained over a long period of time—is equally
important (alike Availability) for data storage systems.
SLA (Promise) = Service-Level Agreement is a commitment between a service provider and a client, regarding particular
aspects of the service – quality, availability, responsibilities etc. It is an explicit or implicit contract with your users that
includes consequences of meeting (or missing) the SLOs they contain.
SLO (Goal) = Service Level Objective – The Objectives (within SLA, i.e. uptime, response time) which your team must hit to
meet the SLA.
SLI (How and What) = Service Level Indicators – the Real numbers to measure your compliance against SLO. In Specific – its a
carefully defined quantitative measure of some aspect of the level of service that is provided. i.e. request latency, error rate
etc.
Important terms
32
Way to SRE
Service Level Objectives
33
Way to SRE
It’s impossible to manage a service correctly, let alone well, without understanding which behaviors really matter for that
service and how to measure and evaluate those behaviors.
We use intuition, experience, and an understanding of what users want to define service level indicators (SLIs), objectives
(SLOs), and agreements (SLAs). These measurements describe basic properties of metrics that matter, what values we
want those metrics to have, and how we’ll react if we can’t provide the expected service. Ultimately, choosing appropriate
metrics helps to drive the right action if something goes wrong, and also gives an SRE team confidence that a service is
healthy.
An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI. A natural
structure for SLOs is thus SLI ≤ target, or lower bound ≤ SLI ≤ upper bound.
Choosing an appropriate SLO is usually complex (i.e. QPS, Network Bandwidth etc.) , but sometime its straightforward too
(i.e. setting low-latency).
Choosing and publishing SLOs to users sets expectations about how a service will perform. Without an explicit SLO, users
often develop their own beliefs about desired performance, which may be unrelated to the beliefs held by the people
designing and operating the service.
SLI – Indicators in Practice
34
Way to SRE
SRE doesn’t typically get involved in constructing SLAs, however, get involved in helping to avoid triggering the
consequences of missed SLOs.
What Do You and Your Users Care About?

• User-facing serving systems, such as the Shakespeare search frontends, generally care about availability,
latency, and throughput. In other words: Could we respond to the request? How long did it take to respond?
How many requests could be handled?
• Storage systems often emphasize latency, throughputs, IOPS, availability, and durability.
• Big data systems, such as data processing pipelines, tend to care about throughput and end-to-end
latency (How much data is being processed?, and how long it takes to process?).
• All systems should care about correctness: was the right answer returned, the right data retrieved, the right
analysis done?
Collecting Indicators
Aggregate the metric for better usage (Average out, Instantaneous usages)
Standardize Indicators (Over a period of time i.e. average packets per minute)
Collect Indicators at server as well as Client end
Objectives in Practice (SLO)
Define Objectives (99% (averaged over 1 minute) of Get RPC calls will complete in less than 100 ms)
Choose realistic targets, which are simple & minimum-required and always keep a refine strategy/scope.
Control Measures
35
Way to SRE
SLIs and SLOs are crucial elements in the control loops used to manage systems:
• Monitor and measure the system’s SLIs.

• Compare the SLIs to the SLOs, and decide whether or not action is needed.
• If action is needed, figure out what needs to happen in order to meet the target.
• Take that action.
Always remember:
• Publishing SLOs sets expectations for system behavior
• Keep margins - Using a tighter internal SLO than the SLO advertised to users gives you room to respond to
chronic problems before they become visible externally.
• If your service’s actual performance is much better than its stated SLO, users will come to rely on its current
performance. You can avoid over-dependence by deliberately taking the system offline occasionally (Google’s
Chubby service introduced planned outages in response to being overly available).
Understanding how well a system is meeting its expectations helps decide whether to invest in making the system faster,
more available, and more resilient. Alternatively, if the service is doing fine, perhaps staff time should be spent on other
priorities, such as paying off technical debt, adding new features, or introducing other products.
Monitoring in place
36
Way to SRE
Error Budget and policies
37
Way to SRE
The SLO is a target percentage, and the error budget is 100% minus the SLO. For example, if you have a 99.9% success
ratio SLO, then a service that receives 3 million requests over a four-week period had a budget of 3,000 (0.1%) errors over
that period. If a single outage is responsible for 1,500 errors, that error costs 50% of the error budget.
Once you have an SLO, you can use the SLO to derive an error budget. In order to use this error budget, you need a policy
outlining what to do when your service runs out of budget.
When we talk about enforcing an error budget policy, we mean that once you exhaust your error budget (or come close to
exhausting it), you should do something in order to restore stability to your system
Common owners and actions might include:

• The development team gives top priority to bugs relating to reliability issues over the past four weeks.
• The development team focuses exclusively on reliability issues until the system is within SLO. This responsibility comes
with high-level approval to push back on external feature requests and mandates.
• To reduce the risk of more outages, a production freeze halts certain changes to the system until there is sufficient error
budget to resume changes.
38
Case Study – Genpact
39
SLO & Error Budget
Pain Areas…
Penalties due to SLA Miss…
Setting SLO for Application uptime and performance.
Tracking SLIs
Video
40
SLO & Error Budget
SLA, SLO and SLI (Google)
https://www.youtube.com/watch?v=tEylFyxbDLE
Risks and Error Budgets (Google)
https://www.youtube.com/watch?v=y2ILKr8kCJU
Exercise
41
SLO & Error Budget
Define 3 SLI/SLO for your current Application contract.
Have We Defined Right SLOs and Monitoring right SLIs?
Do we Just work with Availability Monitoring or Performance Monitoring too?

Q&A
42
SLO & Error Budget
Q&A
Questionnaire
43
Module 3:
Reducing Toils
43
Toils
44
Way to SRE
“If a human operator needs to touch your system during normal operations, you have a bug.”
Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of
enduring value, and that scales linearly as a service grows.
Google’s SRE organization has an advertised goal of keeping operational work (i.e., toil) below 50% of each SRE’s time.
At least 50% of each SRE’s time should be spent on engineering project work that will either reduce future toil or add service
features. Feature development typically focuses on improving reliability, performance, or utilization, which often reduces toil
as a second-order effect.
It is equally important to calculate toils and time spent on same, over a given period and keep aligning the team towards
engineering tasks.
Engineering work is novel and essentially requires human judgment. It produces a permanent improvement in your service
and is guided by a strategy. It is frequently creative and innovative, taking a design-driven approach to solving a problem—the
more generalized, the better.
Toils
45
A Must know
Manual
This includes work such as manually running a script that automates some task. Running a script may be quicker than
manually executing each step in the script, but the hands-on time a human spends running that script (not the elapsed time) is
still toil time.
Repetitive
If you’re performing a task for the first time ever, or even the second time, this work is not toil. Toil is work you do over and
over. If you’re solving a novel problem or inventing a new solution, this work is not toil.
Automatable
If a machine could accomplish the task just as well as a human, or the need for the task could be designed away, that task is
toil. If human judgment is essential for the task, there’s a good chance it’s not toil.
Tactical
Toil is interrupt-driven and reactive, rather than strategy-driven and proactive. Handling pager alerts is toil. We may never be
able to eliminate this type of work completely, but we have to continually work toward minimizing it.
No enduring value
If your service remains in the same state after you have finished a task, the task was probably toil. If the task produced a
permanent improvement in your service, it probably wasn’t toil, even if some amount of grunt work—such as digging into
legacy code and configurations and straightening them out—was involved.
SRE tasks
46
Way to SRE
Engineering work helps your team or the SRE organization handle a larger service, or more services, with the same level of
staffing.
Software engineering
Involves writing or modifying code, in addition to any associated design and documentation work. Examples include writing
automation scripts, creating tools or frameworks, adding service features for scalability and reliability, or modifying
infrastructure code to make it more robust.
Systems engineering
Involves configuring production systems, modifying configurations, or documenting systems in a way that produces lasting
improvements from a one-time effort. Examples include monitoring setup and updates, load balancing configuration, server
configuration, tuning of OS parameters, and load balancer setup. Systems engineering also includes consulting on
architecture, design, and productionization for developer teams.
Toil
Work directly tied to running a service that is repetitive, manual, etc.
Overhead
Administrative work not tied directly to running a service. Examples include hiring, HR paperwork, team/company meetings,
bug queue hygiene, snippets, peer reviews and self-assessments, and training courses.
Why Toils are bad?
47
Way to SRE
Career stagnation
Your career progress will slow down or grind to a halt if you spend too little time on projects.
Low morale
People have different limits for how much toil they can tolerate, but everyone has a limit. Too much toil leads to burnout,
boredom, and discontent.
Slows progress
Excessive toil makes a team less productive. A product’s feature velocity will slow if the SRE team is too busy with manual
work and firefighting to roll out new features promptly.
Sets precedent
If you’re too willing to take on toil, your Dev counterparts will have incentives to load you down with even more toil,
sometimes shifting operational tasks that should rightfully be performed by Devs to SRE.
Promotes attrition
Even if you’re not personally unhappy with toil, your current or future teammates might like it much less. If you build too much
toil into your team’s procedures, you motivate the team’s best engineers to start looking elsewhere for a more rewarding job.
How to Reduce Toils?
48
Way to SRE
Identify Toils and try to reduce same to the level you can. Let's take an example:
Identify what your team members are involved into at 80% of the on-job time.
Check if same can be automated.
If yes, then automate same with some tools, else if not identify what else can be done to improve the process.
Keep improving the existing state and service.
Set goals for Engineering tasks too, for e.g. increasing Internal SLO from 99.9 to 99.95%. Identifying the ways and
implementing the procedure for same.
49
Toils worth
to
automate?
Case Study – Reducing Toils
50
Reducing Toils
One of my client “X” work in Contact Center field, where they deploy the Contact Center services for end clients and manage
it for them.
Now for every new client deploying and building the infrastructure was a very hectic task (contains 15+ Servers with 10+
microservice, multiple LBs, Cache servers, DBs, Security Implementations). Similarly increasing the existing client
environment was very difficult and time-consuming tasks. Even the hardware capacity planning started becoming challenging.
80-90% of the teams (including Developers) were involved in deployment of new services or expansion of environment and
issues handling for existing clients was becoming difficult. Teams were in pain with repeated tasks and pressure they were
going through.
• Company took the hard decision and migrated to Cloud services to avoid hardware bottlenecks.
• Terraform (DevOps IaC) tool was used to automate the deployment. Now the same deployment of infra, which was taking
1 month to design and get ready, is getting up and running in less than an hour.
• Same team members are free from pressure and happy investing their time to enhance the features, reducing bugs and
automating the environment further to next levels.
• Even during pandemic, they thrive with 200% increase in customer on-boarding, without any hastle.
Video
51
Reducing Toils
Pragmatic Automation
https://www.youtube.com/watch?v=oDcjAcFTFC0
Exercise
52
Reducing Toils
Do you foresee any toils in your team?
If yes, benefits of Reducing Toil?
How same can be reduced?
Is daily mails, project Reports toil?
Considered ROI ???

Q&A
53
Reducing Toils
Q&A
Questionnaire
54
Module 4:
Monitoring & Service Level Indicators
54
Monitoring
55
Way to SRE
Monitoring
Collecting, processing, aggregating, and displaying real-time quantitative data about a system, such as query counts and
types, error counts and types, processing times, and server lifetimes.
White-box monitoring
Monitoring based on metrics exposed by the internals of the system, including logs, interfaces like the Java Virtual Machine
Profiling Interface, or an HTTP handler that emits internal statistics.
Black-box monitoring
Testing externally visible behavior as a user would see it.
Alert
A notification intended to be read by a human and that is pushed to a system such as a bug or ticket queue, an email alias, or a
pager. Respectively, these alerts are classified as tickets, email alerts,22 and pages.
Root cause
A defect in a software or human system that, if repaired, instills confidence that this event won’t happen again in the same
way.
SLI’s -Service Level Indicators
56
Monitoring
SRE doesn’t typically get involved in constructing SLAs, however, get involved in helping to avoid triggering the
consequences of missed SLOs. And to achieve same having proper monitoring with right SLIs is very important.
What Do You and Your Users Care About?

• User-facing serving systems, Could we respond to the request? How long did it take to respond? How many
requests could be handled?
• Storage systems often emphasize latency, availability, and durability.
• Big data systems, such as data processing pipelines, tend to care about throughput and end-to-end
latency (How much data is being processed?, and how long it takes to process?).
• All systems should care about correctness: was the right answer returned, the right data retrieved, the right
analysis done?
Collecting Indicators
Aggregate the metric for better usage (Average out, Instantaneous usages)
Standardize Indicators (Over a period of time i.e. average packets per minute)
Collect Indicators at server as well as Client end
E.g. Application latency for 99.9% users in last 5 min should be less than 100ms.
Why Monitor?
57
Way to SRE
Analyzing long-term trends
Comparing over time
Alerting
Building dashboards
Conducting ad hoc retrospective analysis

Monitoring with proper SLI Metric
58
Way to SRE
Always make sure to select the right Service Level Indicator Metric to track and alert.
Set alerts (with respective criticality, pager, Email etc) for all your SLO targets and also creating simple and meaning full
dashboards for a higher visibility.
Four golden signals to track:

Latency
Traffic
Errors
Saturation (IO, Memory, CPU etc)
SLO Improvements 59
Way to SRE
VALET
Google summed up our new SLOs into a handy acronym: VALET.
Volume (traffic)
How much business volume can my service handle?
Availability
Is the service up when I need it?
Latency
Does the service respond fast when I use it?
Errors
Does the service throw an error when I use it?
Tickets
Does the service require manual intervention to complete my request?
Use Telemetry tools to collect SLI metrics from remote servers into a centralized Monitoring server to generate graphs.
Case Study - Genpact
60
Monitoring and SLI
At Client “X”, we configured autoscaling with value CPU percentage > 80% and made the system up and running in
production. Even the performance/Load Test was done and successful.
But after 6 month, during a heavy peak load, system didn’t autoscaling and got crashed. During RCA identification, we found
that it was a DISK IO full issue, for which we have not monitored the systems and no alert/autoscaling setup on same.
We modified the HDD to SSD on the server for fixing the issues and also enabled monitoring and autoscaling for the DISK
environment.
Video
61
Monitoring and SLI
SLI/SLO and reliability Deep Dive
https://www.youtube.com/watch?v=dplGoewF4DA
62
High Availability and Capacity Planning
62
High Availability
63
A Must know
Serving millions of request by a single server is not possible, even it is a supercomputer. Hence, we need Horizontal Scaling
(adding more servers to handle the requests).
Traffic load balancing is the solution to heavy traffic management, which is distributing traffic across multiple network links,
datacenters, and machines in an "optimal" fashion.
Multiple factors, which affects HA:
• The hierarchical level at which we evaluate the problem (global versus local)
• The technical level at which we evaluate the problem (hardware versus software)
• The nature of the traffic we’re dealing with
Nature of requests and handling techniques plays a useful role here.

High Availability techniques
64
A Must know
Techniques to handle the High availability/DR:
• Clustering
• Load Balancing with VIP
• Microservice architectures ( with Containerization Approach)
• Passive DR Sites
• Load Balancing Using DNS
• Content Delivery Networks (for low latency)

Thinknyx Technologies 65
High Availability & Burst handling

Way to SRE
Cloud Premises
Autoscaling Group
Application
Server Application
Server
Application
Server
65
LB for High Availability

Way to SRE
66
HA in AWS
Way to SRE
67
Disaster Recovery
Way to SRE
X – Cloud X – Cloud
DC 1 Operations On-Prem, Cloud as DR
DC 2 Application is hosted in self-managed Datacenter and
Backup
APP Backup Backup is hosted on Cloud
Operations in Cloud, Cloud as DR

Application is hosted in Cloud Datacenter and Backup is
hosted in Second Region/datacenter in same cloud.
Y – Cloud Operations in Cloud, Third-party DR

DC 1 Application is hosted in Cloud Datacenter and Backup is
APP hosted on Third party cloud service provider or may be
Backup on third party backup service provider.
On-Prem
68
Business Continuity
Way to SRE
1 Physical Location 2 Logical Location
Consider all natural Never put al your eggs

disasters and their in single bucket.
range, before finalizing
a DR location.
3 Who’ll do what 4 DR Drill
Declarations of Role & Test BC/DR at least

Responsibilities, annually.
Emergency process,
Backup ready
5 Priorities
Define application criticality and failover
priorities in advance. Health and Human safety
should be primary concern.
69
Exercise
70
Monitoring and SLI
What do you monitor now and what all reliability aspects you considered?
Performance Monitoring in place?
DR Monitoring /Activation in place?
What we can monitor, where and how?
Risk Factors
Q&A
71
Monitoring and SLI
Q&A
Did you consider Monitoring importance in HA/DR?

72
Module 5:
SRE Tools & Automation
72
Automation
73
Way to SRE
Automation is the key for any organization to thrill.
For SRE, automation is a force multiplier, not a panacea. Of course, just multiplying force does not naturally change the
accuracy of where that force is applied: doing automation thoughtlessly can create as many problems as it solves.
Consistency A Platform
Faster Repairs Faster Actions
Time Saving Ease and Effectiveness

Automation Focus
74
Way to SRE
SRE has a number of philosophies and products in the domain of automation, some of which look more like generic rollout
tools without particularly detailed modeling of higher-level entities, and some of which look more like languages for describing
service deployment (and so on) at a very abstract level.
Some use cases:

• User account creation
• Cluster turnup and turndown for services
• Software or hardware installation preparation and decommissioning
• Rollouts of new software versions
Count is endless, its just identifying the priority and keep automating tasks one by one.
End Goal is to create Autonomous system, which runs and manages on its own. For e.g. A system should not just trigger alerts
and try to make the services up on same system, it should do the failover on its own to another better available system, if
services are not coming up on same server.
Automation system must be secure and reliable. Automation works at scale, so destruction will also be at scale, if something
goes wrong.
Automation Hierarchy
75
Way to SRE
Now a days, tools are available in market to automate majorities of tasks and events what we want to manage; yet there can
be few things which needs customized automation. For same we can follow custom paths. For example a database failover
automation evaluation path for Autonomous environment:
1) No automation
Database master is failed over manually between locations.
2) Externally maintained system-specific automation

An SRE has a failover script in his or her home directory.
3) Externally maintained generic automation

The SRE adds database support to a "generic failover" script that everyone uses.
4) Internally maintained system-specific automation

The database ships with its own failover script.
5) Systems that don’t need any automation (autonomous system)

The database notices problems, and automatically fails over without human intervention.
SRE hates manual operations, so they obviously try to create systems that don’t require them. However, sometimes manual
operations are unavoidable (DR activation, Production push etc).
Secure Automation
76
Way to SRE
Unsecure Automation can be dangerous too. Learn from an example of “CODESPACES” and other clients where AWS AK/SK
was leaked, and disaster happened.
Think about keeping Username and password in your container Images? Who all in your organization do have access to these
image?
Keeping API keys in Code and code on Github/bitbucket?
Zero touch automation is the final goal for an SRE. We have to consider Security also into it at every layer, as multiple tools get
involved.
We have to replace our use of sshd with an authenticated, ACL-driven, RPC-based Local Admin Daemon, also known as
Admin Servers, which had permissions to perform those local changes. As a result, no one could install or modify a server
without an audit trail.
CIA terms are important to implement in automation too.

Automation Tools
77
Way to SRE
Though there are not defined set of tools for any SRE, but it is always better to have universal tools in the basket. Few
Categories and tools which are well known to the market for their well-known results in the area, are as below:
Version Control System: TFVC, Git (Gitlab, Github, Bitbucket, Azure DevOps)
Pipelining for CICD: Jenkins, Azure DevOps, TeamCity, bamboo
Automated Deployment: Octopus Deploy, UrbanDeploy
Configuration Management: Ansible, Chef, puppet, Saltstack
Container and Orchestration: Docker (Kubernetes, Docker Swarm, Openshift)
Automation oriented languages: Python, Java
DevOps Infrastructure as Code: Terraform, Cloud Formation, Azure Templates
Continuous Testing: Jmeter, Sonarqube, Selenium

Video
78
SRE tools and Automation
SREcon19 Asia/Pacific - Ironies of Automation (Microsoft)
https://www.youtube.com/watch?v=U3ubcoNzx9k
Case Study
79
Amazon has done the automation for Leasing servers on Rent and metering same for usage and gradually over a period of
time, it became Cloud with 100s of service provisioned/Released via self-service portal.
50 million changes into AWS Cloud happened in 2016 in 1 year, which is 1 change in production per second.
Distributed system with APIs and Queues are the best way to scale with automation.
Exercise
80
Automation “Greatest Hits” – Uber, Airbnb, Ola, Olx, AWS …
How much automation you have and what can be automated?
Its as simple as:
“Anything that you do more than twice has to be automated.”
-Adam Stone, CEO, D-Tools

Q&A
81
Q&A
Questionnaire
82
Module 6:
Anti-Fragility & Learning from Failure
82
Anti-fragility: Learning from Failure 83
Way to SRE
“The cost of failure is education.” Devin Carraway
Anti-fragility is all about understanding disorder and using it to your advantage. It is a property of systems in which they
increase in capability to thrive as a result of stressors, shocks, volatility, noise, mistakes, faults, attacks, or failures.
Postmortems are an essential tool for SRE (to make the system resilient and reliable).
When an incident occurs, we fix the underlying issue, and services return to their normal operating conditions. Unless we
have some formalized process of learning from these incidents in place, they may reoccur.
A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and
the follow-up actions to prevent the incident from recurring.
The primary goals of writing a postmortem are to ensure that the incident is documented, that all contributing root cause(s)
are well understood, and, especially, that effective preventive actions are put in place to reduce the likelihood and/or impact
of recurrence.
Anti-fragility: Shifting the organizational balance 84
Way to SRE
The antifragile loves randomness and uncertainty.
Anti-fragility is a concept which encompasses the idea that things need chaos and disorder in order to thrive and flourish.
Whatever doesn’t kill us makes us stronger, pushing the notion that we shouldn’t construct our lives or our plans against
randomness and misfortune, rather, we should adopt anti-fragility as a means of maneuvering through disorder.
Similarly, we should plan our system to withstand failures (planned/unplanned). We can even plan/plot the failures in our
system to understand the withstanding capability/Anti-fragility of our system.
We can have unplanned downtimes and activities, which can simulate failure to learn from it and make our system more
robust. For example – pulling a network cable of server or shutting down the UPS to understand the impact, etc… But we
must first understand the error budget and failure cost before we plan for such failure activities. Such activities definitely
add learning and robustness to our system, but we must keep a balance between error budget and enhancements.
Postmortem Culture : Learning from Failure 85
Way to SRE
The postmortem process does present an inherent cost in terms of time or effort, so we can be deliberate in choosing when
to write one. Teams have some internal flexibility, but common postmortem triggers include:
• User-visible downtime or degradation beyond a certain threshold
• Data loss of any kind
• On-call engineer intervention (release rollback, rerouting of traffic, etc.)
• A resolution time above some threshold
• A monitoring failure (which usually implies manual incident discovery)
• Stakeholder request a postmortem for an event

Blameless Postmortems 86
Way to SRE
“Blameless postmortems” is a Principle of SRE culture.
For a postmortem to be truly blameless, it must focus on identifying the contributing causes of the incident without
indicting any individual or team for bad or inappropriate behavior.
A blamelessly written postmortem assumes that everyone involved in an incident had good intentions and did the right
thing with the information they had.
If a culture of finger pointing and shaming individuals or teams for doing the "wrong" thing prevails, people will not bring
issues to light for fear of punishment.
When postmortems shift from allocating blame to investigating the systematic reasons why an individual or team had
incomplete or incorrect information, effective prevention plans can be put in place.
You can’t "fix" people, but you can fix systems and processes to better support people making the right choices when
designing and maintaining complex systems.
Best Practice 87
Way to SRE
Avoid Blame and Keep It Constructive.
Collaborate and Share Knowledge
No Postmortem Left Unreviewed
Introduce a Postmortem Culture
Visibly Reward People for Doing the Right Thing
Ask for Feedback on Postmortem Effectiveness
Continuous improvement
Postmortem should have clearly defined ownership, priority, preventive actions and Action taken
Case Study
88
Anti-Fragility
Netflix Simian Army
https://netflixtechblog.com/the-netflix-simian-army-16e57fbab116
https://github.com/netflix/chaosmonkey
Failures
89
Anti-Fragility
Do Failure is really bad – for organization and individuals ?
Consider Elon Musk and Colonel Sanders ☺

Exercise
90
Anti-Fragility
Share some example of problem ticket within your team where you were involved and had lot of incidents be
of a single root cause.
Share the Incidents, Business Impact, Root Cause, Course of Actions done to resolve same and whose mist
was this, if it was a configuration issue.

Q&A
91
Anti-Fragility
Q&A
Questionnaire
92
Module 7:
Organizational Impact of SRE
92
Organizations Embracing SRE 93
Way to SRE
Availability
Continuous
Capacity Planning
Improvements
Reliability
Velocity Happy Customers
Cost Effectiveness
due to less failure
Typical ORG Chart 94
Way to SRE
Dev/Prod Team SRE Team Operations
Manager
Site Reliability Site Reliability DB Systems

Dev Q&A
Engineers Engineers Manager Manager
Manager Manager
(TL) (TL)
DB DB OS OS
Dev Dev QA QA Specialized Specialized Specialized
Admins Admins Admins Admins
Admin Admin Reliability Reliability Reliability
Engineers Engineers Engineers
SRE Responsibilities 95
Way to SRE
Tasks SRE Team

Architecture Design Approvals and Consultaning RC
Instrumentation, Metrics, and Monitoring CI
Maintaining SLI CI
SLO / SLI track and management CR
Handling Incidents CI
R epeated Incidents and Problem Management RA
Capacity Planning CR
Change Management CI
Critical/Large Scale Changes CR
Performance: availability, latency, and efficiency R
Automation RA
Innovation RA
R elease Management C
R elease R epeated failures R
Test Management C
Test R epeated Failures R
Standardization of Tools/Softwares/Process/Technologies RA
Supporting Presales and Sales CR
Training other Team Members R
SRE Engagement and Adoption 96
Way to SRE
SRE seeks production responsibility for important services for which it can make concrete contributions to reliability. SRE is
concerned with several aspects of a service, which are collectively referred to as production. These aspects include the
following:
• System architecture and interservice dependencies
• Instrumentation, metrics, and monitoring
• Emergency response
• Capacity planning
• Change management
• Performance: availability, latency, and efficiency
When SREs engage with a service, they aim to improve it along all of these axes, which makes managing production for the
service easier.
SRE & Scale 97
Way to SRE
The bigger the Operations/Systems, the more autonomous systems it should be.
If 10 engineers handles 100 hundred servers, we shouldn’t need 100 engineers to handle 1000 servers.
SRE is all about automation, improvements and reliability in the system.
Bigger environment means more effectiveness from SRE.
As the need for manual tasks reduces over time due to automation, yet enhancement in autonomous system and further
improvement in same is a continuous process.
Testing
98
Way to SRE
“If you haven't tried it, assume it's broken.”
One key responsibility of Site Reliability Engineers is to quantify confidence in the systems they maintain. SREs perform this
task by adapting classical software testing techniques to systems at scale.
Testing is the mechanism we use to demonstrate specific areas of equivalence when changes occur. Each test that passes
both before and after a change reduces the uncertainty for which the analysis needs to allow. Thorough testing helps us
predict the future reliability of a given site with enough detail to be practically useful.
Passing a test or a series of tests doesn’t necessarily prove reliability. However, tests that are failing generally prove the
absence of reliability.
How failures are being measured:
Mean Time to Repair (MTTR) measures how long it takes the operations team to fix the bug, either through a rollback or
another action.
Mean Time Between Failures (MTBF) measures time - for how long the service worked well post a failure condition.
SW Testing Classification
99
Manual Static
Automated Dynamic
Testing Type Testing Methods
Unit Testing
Black Box
Integration Testing
White Box
System Testing
Grey Box
Acceptance testing
Testing Approach
Testing Levels
99
Managing Incidents
100
Way to SRE
Effective incident management is key to limiting the disruption caused by an incident and restoring normal business
operations as quickly as possible.
As SRE you are also supposed to be on-call (limited efforts again) and handle the incidents.
When on-call, an engineer is available to perform operations on production systems within minutes, according to the paging
response times agreed to by the team and the business system owners. Typical values are 5 minutes for user-facing or
otherwise highly time-critical services, and 30 minutes for less time-sensitive systems.
Google strongly believe that invest at least 50% of SRE time into engineering: of the remainder, no more than 25% can be
spent on-call, leaving up to another 25% on other types of operational, nonproject work.
The most important on-call resources are:

• Clear escalation paths
• Well-defined incident-management procedures
• A blameless postmortem culture
Emergency Response
101
Way to SRE
“Things break; that’s life.”
How employees responds to an emergency, show the process and long-term health of the organization. Organization long-
run depends on this one factor very well in IT industry.
What to Do When Systems Break

First of all, don’t panic!
If you feel overwhelmed, pull in more people.
Follow the Incident response process.
Take a deep breath and try to understand the situation, failure cause or relate sources in case of multiple failures.
Test-Induced Emergency
Change-Induced Emergency
Process-Induced Emergency
Some important pointers:

• Keep a History of Outages
• Ask the Big, Even Improbable, Questions: What If…?
• Encourage Proactive Testing
Videos
102
Organizational Impact
A history of SRE at Uber:
https://www.youtube.com/watch?v=qJnS-EfIIIE
Case Study - OBS
103
Orange Business Services – The Flexible Engine

Exercise
104
Why do you want to adopt SRE? Who in your organization currently provides SRE?
Your organizational plan for SRE?

Q&A
105
Q&A
Questionnaire
106
Module 8:
SRE, Other Frameworks,
& The Future
106
Transforming Culture 107
Way to SRE
Site Reliability Engineering (SRE) proclaims many advantages for distributed systems. It improves infrastructure
automation, increases reliability, and transforms incident management.
Instead of taking individual at centre, we have a specialized team in centre which is a centre of collaboration and
communication in the organization.
Embracing Risk
Learning From Failure
Better collaboration and communication
Automation in centre – which benefits complete organization
Standardization of tools / technologies / process
Centralize documentation
Consultation and Trainings

SRE with Other frameworks 108
Way to SRE
SRE works well with all major existing process and culture concepts, like:
DevOps
Agile
Scrum
Lean
ITIL
PMP
Its while many of above are majorly conceptual stuff, SRE is having those concepts implemented on ground with practical
work.
It’s a path to create a stress-free autonoums reliable environment with tremendous velocity.
SRE Evolution 109
Way to SRE
Google coined the term “site reliability engineer” in 2003, but it certainly has existed for decades more in different forms —
disaster recovery and production testers.
Ways the SRE Approach is Evolving:

1. Increased Adoption
2. Larger, Diversified SRE Departments
3. New Testing Tactics emerge – e.g., Chaos Monkey
4. Businesses Rely on SREs to Mitigate Risk
Currently SRE approach is widely being adopted by organizations to achieve high uptime and stability for the application, as
even 1 minute of downtime costs millions of $ to many MNCs.
Videos
110
SRE and other frameworks
A Look at ITIL4 & SRE
https://www.youtube.com/watch?v=vFyPXIsUEhE
Case Study – VictorOps
111
Victor Ops
Exercise
112
Where do you see SRE future heading?
Sketch board your understanding of SRE and Requirements for the job role.
Q&A
113
Q&A
Questionnaire
114
THANK YOU

Site Reliability Engineering v2

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Site Reliability Engineering v2

Uploaded by

Copyright:

Available Formats

Site Reliability

• Experience on DevOps Tools, Cloud

• Your expectations from this training

• Implementation discussion is welcome for your existing teams and domains.

Which are helping industries in Digital Transformation

- Shorter release cycle

- Small batch sizes (MVP)

Software Development Build & Release, Testing Teams

- Break the Silos - Involvement in the early

APP APP APP APP

Development QA Testing Implementation & Release Infra Management

Accidents are not just a result of the isolated

Site Reliability Engineering

SRE is coined around Reliability of the system.

SRE implements interface DevOps.

Operations Is a Software Problem

Manage by Service Level Objectives

Use the Same Tooling, Regardless of Function

Narrow, Rigid (launch-related or reliability-related) Incentives, Narrow Your Success.

Support blameless postmortems. Doing so eliminates incentives to downplay or cover up a problem.

Consider Reliability Work as a Specialized Role.

Strive for Parity of Esteem: Career and Financial.

• Hardware capacity planning is a challenge

Where DevOps can help in this area?

Where SRE can help in this segment?

Understand how SRE heals DevOps Failures…

Identifying the DevOps Work and building DevOps Team

Identifying Reliability needs and building SRE Engineering Team

DevOps vs SRE (Google)

What we do all day? Is there a way to automate?

Is there any way to make the system more reliable?

Reliability=The ability to work properly (even if some parts/components failed).

What Do You and Your Users Care About?

• Monitor and measure the system’s SLIs.

Common owners and actions might include:

SLO & Error Budget

Penalties due to SLA Miss…

Setting SLO for Application uptime and performance.

SLO & Error Budget

SLA, SLO and SLI (Google)

Risks and Error Budgets (Google)

SLO & Error Budget

Define 3 SLI/SLO for your current Application contract.

Have We Defined Right SLOs and Monitoring right SLIs?

Do we Just work with Availability Monitoring or Performance Monitoring too?

SLO & Error Budget

Check if same can be automated.

Keep improving the existing state and service.

Do you foresee any toils in your team?

If yes, benefits of Reducing Toil?

How same can be reduced?

Is daily mails, project Reports toil?

Considered ROI ???

What Do You and Your Users Care About?

Analyzing long-term trends

Comparing over time

Conducting ad hoc retrospective analysis

Four golden signals to track:

Monitoring and SLI

Monitoring and SLI

SLI/SLO and reliability Deep Dive

High Availability and Capacity Planning

Multiple factors, which affects HA:

• The nature of the traffic we’re dealing with