
Nine developer enablement practices to achieve DevOps at enterprise scale

CHRISTIAN OESTREICH
Director of Software Architecture & Distinguished Engineer
About the Author
CHRISTIAN OESTREICH
Christian Oestreich has over 20 years of experience as a senior developer,
systems architect, and engineering team leader for Fortune 500
companies. He has expertise in leading large development teams in a
variety of software development methodologies, using a myriad of
different client and server technologies.
Table of Contents
The problem: A massively scaling dev team—but not enough ops talent
The situation today: A 25:1 dev-to-ops ratio
How we got to where we are today: The Metrics-Driven Mindset
One: Start with a performance requirements questionnaire
Two: Bake ops tasks into project bootstrapping
Three: Build libraries to accelerate code instrumentation
Four: Create patterns for custom metrics
Five: Integrate reporting agents into hosts and containers
Six: Automate platform compliance checks
Seven: Flow data into a single collection stream
Eight: Create a single source of truth for dashboards
Nine: Make software performance visible to all
Conclusion
The problem: A massively scaling
dev team—but not enough ops talent
In 2015, I found myself in a tough but familiar situation (and one that many readers will recognize). Our team had a huge amount of development to do on an ambitious new web platform, and we were hiring developers by the dozen. But we couldn’t possibly scale the ops team to support the growth of the development team. To put it simply, we couldn’t find and hire good ops engineers at the same rate that we could find and hire developers.

How were we going to support our rapidly expanding development team? We needed to find creative ways to empower our developers to write high quality code, and then deploy, monitor, and remediate problems themselves.


The situation today: A 25:1 dev-to-ops ratio


In my 20 years of software development experience, I’ve typically observed dev-to-ops ratios of between 6:1 and 8:1. Fast-forward to today: our organization’s dev-to-ops ratio is 25:1. We have more than 500 engineers who have delivered over 150 production microservices. Yet only a small fraction of our engineers are focused on architecture, operations, and governance. For every 25 developers focused on feature delivery and defect triage, there is only one engineer dedicated to ops, architecture, and governance.

How we got to where we are today: The Metrics-Driven Mindset
How did we achieve a 25:1 dev-to-ops ratio on a team of more than 500 engineers? A big factor was what I call our “metrics-driven mindset.” I often pose this (rhetorical) question:

Why build software if you can’t (1) measure its effectiveness and (2) react to issues in real time?

To me, this is the core reason why developers (along with product managers and executives) need to constantly think about the key metrics that measure their code’s success—and also why ops and architecture teams need to make it easy for developers to collect and report on those metrics. Too often, metrics and monitoring are an afterthought to software delivery when, in reality, they need to be a first-order consideration.

A well-planned metrics and monitoring strategy yields higher quality code, lower support costs, and more self-sufficient development teams. But first, ops and architecture teams need to empower developers with the right tools and processes to effectively monitor their applications. In the pages that follow, I describe nine practices we’ve developed that were critical to our success:


1. Start with a performance requirements questionnaire

2. Bake ops tasks into project bootstrapping

3. Build libraries to accelerate code instrumentation

4. Create patterns for custom metrics

5. Integrate reporting agents into hosts and containers

6. Automate platform compliance checks

7. Flow data into a single collection stream

8. Create a single source of truth for dashboards

9. Make software performance visible to all

These practices help ensure that developers are focused on writing high-quality code that meets business requirements and not worrying about ops tasks. Below, I describe these principles in more detail and explain how we implemented them in our environment.


01 Start with a performance requirements questionnaire


Almost as soon as the first container starts, system and application metrics scale beyond the limits of human comprehension. This leads to an overwhelming amount of data and noise that can distract developers. So we created a questionnaire that asks developers to describe the top metrics they need to track in order to ensure their code is functioning. We encourage developers to work with business partners to identify a handful of key indicators per system.

Our performance requirements questionnaire

As part of the questionnaire, developers must also think about alerting. I firmly believe that if you’re going to monitor a metric, you should also alert on it. A dashboard is fine, but if something isn’t right, somebody should get a notification. Therefore, this questionnaire also forces developers and product teams to go beyond metrics and think about key performance indicators (KPIs) and service level indicators (SLIs).


As a side note, KPIs and SLIs capture critical business or system functions in context, whereas metrics represent point-in-time data without the necessary context. The key difference is in defining what is “normal.” For example, the count of records processed over the last 60 minutes is a metric: it doesn’t provide any business context. The percentage of records processed successfully over the last 60 minutes is a more useful indicator.

Ideally, developers fill out this questionnaire along with their counterparts on the Product Team before they begin writing code. The understanding is that these metrics and alerts are going to be visible to everybody (including executives). This exercise should inform how developers write the code. Then, when developers are getting ready to deliver their code, the architecture team uses this questionnaire during our governance check. We verify (1) that the developers are going to be accountable for the things they said were important in the questionnaire; and (2) that dashboards and alerts have been checked into the repository.
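To make the distinction concrete, here is a minimal sketch using Micrometer (the metrics library that ships with Spring Boot). The counter names and the processing class are hypothetical; the point is that the raw counts are metrics, while the success percentage derived from them is the indicator worth alerting on.

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;

// Hypothetical instrumentation: raw counts are metrics; the success
// percentage derived from them is the business-facing indicator.
public class RecordProcessingMetrics {

    private final Counter attempted;
    private final Counter succeeded;

    public RecordProcessingMetrics(MeterRegistry registry) {
        this.attempted = registry.counter("records.processed.attempted");
        this.succeeded = registry.counter("records.processed.succeeded");
    }

    public void process(String payload) {
        attempted.increment();
        // ... business logic for one record (omitted) ...
        succeeded.increment(); // only reached when processing succeeds
    }
}
```

The success-rate indicator is then defined in the monitoring platform as succeeded divided by attempted over a time window, which is exactly the kind of KPI the questionnaire asks teams to name up front.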

02 Bake ops tasks into project bootstrapping


I’ve found that DevOps teams are often focused more on automating code delivery pipelines while pushing off post-production monitoring and alerting until later. But these “later” stages of the DevOps lifecycle are also important to automate. It’s not efficient to think about monitoring and alerting after the fact. On our team, monitoring and alerting are baked into the earliest stages of the software delivery lifecycle.

We’re a Java shop, specifically using the Spring web framework and Spring Boot, an open source, Java-based framework used to create microservices. We forked and customized the Spring Initializr (a lightweight quickstart generator for Spring projects). In our forked version of Spring Initializr, we added additional automation for key ops tasks. When developers click the microservice generation button, it creates GitHub build streams, metrics dashboards, and alert templates.


Our customized Microservice Generator automates code and ops tasks

So when developers want to bootstrap a project, they simply click through the Spring Initializr, as most of the manual microservice setup steps are automated. We really want developers to be focused on writing high quality code, rather than worrying about ops tasks.
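As a rough sketch of the kind of ops automation such a generator might perform, the snippet below creates a starter dashboard for a newly generated service by posting a template to the Datadog dashboards API. The endpoint and headers follow Datadog’s public API, but the service name, template payload, and class are hypothetical and purely illustrative.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical bootstrap step: give every newly generated microservice a
// starter dashboard so monitoring exists before the first line of business code.
public class DashboardBootstrapper {

    private static final String DASHBOARD_API = "https://api.datadoghq.com/api/v1/dashboard";

    private final HttpClient http = HttpClient.newHttpClient();
    private final String apiKey;
    private final String appKey;

    public DashboardBootstrapper(String apiKey, String appKey) {
        this.apiKey = apiKey;
        this.appKey = appKey;
    }

    public void createStarterDashboard(String serviceName) throws Exception {
        // Minimal illustrative payload; a real template would contain the
        // standard KPI/SLI widgets every new service starts with.
        String body = """
            {
              "title": "%s - starter dashboard",
              "layout_type": "ordered",
              "widgets": [
                {"definition": {"type": "timeseries",
                                "requests": [{"q": "avg:jvm.heap_memory{service:%s}"}]}}
              ]
            }
            """.formatted(serviceName, serviceName);

        HttpRequest request = HttpRequest.newBuilder(URI.create(DASHBOARD_API))
                .header("Content-Type", "application/json")
                .header("DD-API-KEY", apiKey)
                .header("DD-APPLICATION-KEY", appKey)
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IllegalStateException("Dashboard creation failed: " + response.body());
        }
    }
}
```

The same idea applies to alert templates and GitHub build streams: the generator makes the relevant API calls once, so developers never have to do that wiring by hand.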

03 Build libraries to accelerate code instrumentation


While it’s important to capture low-level application runtime metrics, we don’t want developers spending precious time creating that instrumentation. Instead, we’d prefer that developers focus on the business metrics that matter.

We built a library that all Java applications in our stack use. If a developer is using the library, application runtime metrics will be auto-instrumented. The common Java library frees developers to concentrate on instrumenting key business metrics, which are generally custom application metrics (in our case, things like health data intake processing and success rate).

All repositories, regardless of business function, must contain documentation in the form of AsciiDoc. This documentation is required to provide context, sample usage, and SLA requirements. The build pipeline will generate and deploy this documentation to GitHub Pages for the repository. This system provides developers a valuable starting point and puts the documentation at the source, so people aren’t scouring wikis for obscure Word or Excel documents.
requirements. The build pipeline will generate

Example documentation providing context on how to use our custom @SLA annotation

04 Create patterns for custom metrics


We aim to provide turnkey solutions which allow developers to be naïve to the world of ops. However, it is necessary for developers to instrument any custom or application-specific metrics; they just have to think about the specific key metrics they need to monitor for their application. If the developers can instrument that small number of KPIs and SLIs, everything else should be delivered and built for them.


But first, developers need training and enablement.

We require all developers to complete a microservice bootcamp. This bootcamp contains all the coding patterns, including how to create KPIs, SLIs, and custom metrics, that developers need to get started when building services. This bootcamp is “intra-sourced,” and we encourage active contributions, such as lessons learned and new microservice patterns.

The bootcamp materials are currently written in AsciiDoc, an excellent vehicle for linking and embedding source files to provide additional context and clarity. Because Java files are embedded directly, the examples and solutions in the docs have to compile and build successfully before new documentation can be deployed. This ensures that example code will not be stale or broken.

During the bootcamp, we also introduce our custom libraries and our AOP (aspect-oriented programming) annotations, which, in our experience, produce the most useful and simple high-level KPIs and SLIs. However, we also strongly encourage individual team creativity when devising the metrics and visualizations that will work best for their services.

05 Integrate reporting agents into hosts and containers


As developers get ready to deploy their code, it’s vital to ensure that the metrics will actually
get collected. To streamline the deployment process for developers, we automatically integrate
reporting agents into containers and hosts.

Hosts: Any host created inside our cloud environment already has metric collection and aggregation baked in at the OS layer.

Containers: All of our containers share a common base container, which has an agent already baked in. As a result, any new containers will inherit the reporting agent. From there, we simply drop in the application runtimes, and all of the monitoring and logging is enabled.


06 Automate platform compliance checks


On our team, we’ve had a major push on not just software quality, but platform quality as well. High standards for platform compliance remove many potential headaches for both dev and ops.

It's important to monitor what tools your devs are using before they deploy

With so many components and developers, it’s very easy to have configuration and versioning drift. To resolve this problem, we created an automated compliance checker that checks the status of roughly 20 framework dependencies, verifies the build quality, and ensures that devs are using up-to-date versions of the web framework, library, containers, and other components. This final, pre-deployment step helps developers avoid running stale or insecure legacy items in production.

The compliance checker generates a compliance score and a green or red result. If green, the developer is clear to push to production. If red, the developer understands that the microservice is no longer compliant with our framework standards. Updating everything to the latest versions is a click of a button, as shown below.


The compliance checker interface makes it easy to see what’s out of compliance and make the
required updates.
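A toy sketch of the core version check, comparing each service’s declared dependency versions against a required-minimum list. The dependency names and versions are purely illustrative; the real checker covers roughly 20 dependencies plus build quality.

```java
import java.util.Map;

// Hypothetical compliance check: every tracked dependency must be at or
// above the minimum version mandated by the platform team.
public class ComplianceChecker {

    // Minimum allowed versions (illustrative values only).
    private static final Map<String, String> REQUIRED_MINIMUMS = Map.of(
            "spring-boot", "2.7.0",
            "common-metrics-lib", "1.4.0",
            "base-container", "3.1.0");

    public boolean isCompliant(Map<String, String> declaredVersions) {
        return REQUIRED_MINIMUMS.entrySet().stream().allMatch(required -> {
            String declared = declaredVersions.get(required.getKey());
            return declared != null
                    && compareVersions(declared, required.getValue()) >= 0;
        });
    }

    // Compares dotted numeric versions, e.g. "2.10.1" vs "2.7.0".
    private static int compareVersions(String left, String right) {
        String[] l = left.split("\\.");
        String[] r = right.split("\\.");
        for (int i = 0; i < Math.max(l.length, r.length); i++) {
            int li = i < l.length ? Integer.parseInt(l[i]) : 0;
            int ri = i < r.length ? Integer.parseInt(r[i]) : 0;
            if (li != ri) {
                return Integer.compare(li, ri);
            }
        }
        return 0;
    }
}
```

A green result corresponds to isCompliant returning true for every tracked component; anything else turns the service red until the versions are bumped.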

07 Flow data into a single collection stream


Like most large enterprises, we have several operations and monitoring platforms that we need to unify. Having many systems throwing out metrics and alerts is inefficient: it leads to too many false positives and a lot of time wasted switching between collection points and searching for answers. Therefore, we ingest all streams into a centralized platform, which emits alerts.

In our unified data platform, teams can filter by environment (production, staging, or testing). Teams can also decide which data is meaningful, mute extraneous sources, and alert only on key data. Plus, our ability to combine and correlate monitoring data allows us to confidently diagnose problems—and also provides additional context to troubleshoot them.
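One way to make that environment filtering possible is to tag every metric with its environment at the source. The document doesn’t spell out the mechanism, so the sketch below is an assumption based on Spring Boot and Micrometer; the property name is hypothetical.

```java
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.actuate.autoconfigure.metrics.MeterRegistryCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Tag every metric the service emits with its environment so the unified
// platform can filter and alert on production, staging, or testing separately.
@Configuration
public class EnvironmentTagConfiguration {

    @Bean
    public MeterRegistryCustomizer<MeterRegistry> environmentTag(
            @Value("${deployment.environment:testing}") String environment) {
        return registry -> registry.config().commonTags("env", environment);
    }
}
```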


08 Create a single source of truth for dashboards


With 500 developers empowered to create dashboards and alerts, we needed to create consistent standards of quality control. Too many dashboards can be a problem: before you know it, you could have 500 dashboards, though few of them are meaningful. Rather than creating many less-than-useful dashboards, we prefer developers to create a limited number of useful dashboards.

To avoid dashboard overload, we export all dashboards into a GitHub repository we call “Datadog Ops,” which serves as the single source of truth for dashboards. While individuals are able to tinker with dashboards, every week a Jenkins job uses the Datadog API to auto-deploy the original dashboards back out of the repository. This mechanism keeps dashboards focused and clean.

The Datadog Ops repository helps us cut down on noise and maintain focus, while allowing developers to experiment and create dashboards on the fly. The repository also instantiates our entire monitoring and alerting strategy as code, and ensures that we have a way to automatically recreate our entire dashboard and monitoring infrastructure.

However, we do give developers the opportunity to permanently modify their dashboards. We periodically export “experimental” dashboards and email their creators with this message: “Please enter this dashboard into the Datadog Ops repository if you’d like to replace the existing production dashboard.”
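Here is a minimal sketch of that weekly restore step, assuming the Datadog v1 dashboards API; the repository layout, file naming, and class are hypothetical, and the real job is triggered from Jenkins.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical weekly job: re-apply every dashboard definition stored in the
// "Datadog Ops" repository so ad-hoc edits made in the UI are rolled back.
public class DashboardRestoreJob {

    private final HttpClient http = HttpClient.newHttpClient();
    private final String apiKey = System.getenv("DD_API_KEY");
    private final String appKey = System.getenv("DD_APP_KEY");

    public void restoreAll(Path dashboardsDir) throws Exception {
        // Each file is named <dashboard-id>.json and contains the exported definition.
        try (Stream<Path> files = Files.list(dashboardsDir)) {
            for (Path file : files.filter(f -> f.toString().endsWith(".json")).toList()) {
                String id = file.getFileName().toString().replace(".json", "");
                String body = Files.readString(file);

                HttpRequest request = HttpRequest.newBuilder(
                        URI.create("https://api.datadoghq.com/api/v1/dashboard/" + id))
                        .header("Content-Type", "application/json")
                        .header("DD-API-KEY", apiKey)
                        .header("DD-APPLICATION-KEY", appKey)
                        .PUT(HttpRequest.BodyPublishers.ofString(body))
                        .build();

                HttpResponse<String> response =
                        http.send(request, HttpResponse.BodyHandlers.ofString());
                System.out.println(id + " -> HTTP " + response.statusCode());
            }
        }
    }
}
```

Because the repository is the source of truth, the same job doubles as disaster recovery for the entire dashboard setup.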

09 Make software performance visible to all


We have large TVs around the office that display our Datadog dashboards. We don’t put low-level infrastructure or application runtime metrics on these. Rather, these TVs are the place for business-relevant KPIs and SLIs. The high visibility makes everyone care a lot more about monitoring their applications. It also creates accountability for teams and encourages collaboration across teams (such as business and technology) and across levels (from front-line engineers to the C-suite).


A FEW ADDITIONAL TACTICAL TIPS:

·· Rotating screens
I recommend limiting the number of displays (or rotating onscreen visualizations). We want to emphasize the few business-impacting metrics that matter, so adding more data doesn’t necessarily help.

·· Information display
The less clutter on the screen the better; generally, large, color-coded numbers (red/green or other complementary colors) are easiest to see from a distance. Lines are great where variance is expected.

·· Global & team-specific screens
Include global business KPI boards on all TVs, but allow teams to rotate in team-specific boards on their TVs.

Conclusion
The metrics-driven mindset has enabled the team to deliver higher quality software faster. Our very lean ops and architecture team has driven the practices described above, freeing developers to focus on writing good code and instrumenting key business metrics, rather than worrying about ops tasks. In addition, the metrics-driven mindset gives devs ownership over their microservices and encourages devs to identify and troubleshoot issues themselves.

