
Nine developer enablement practices to achieve DevOps at enterprise scale

CHRISTIAN OESTREICH
Director of Software Architecture & Distinguished Engineer
About the Author
CHRISTIAN OESTREICH
Christian Oestreich has over 20 years of experience as a senior developer,
systems architect, and engineering team leader for Fortune 500
companies. He has expertise in leading large development teams in a
variety of software development methodologies, using a myriad of
different client and server technologies.
Table of Contents
The problem: A massively scaling dev team—but not enough ops talent
The situation today: A 25:1 dev-to-ops ratio
How we got to where we are today: The Metrics-Driven Mindset
One: Start with a performance requirements questionnaire
Two: Bake ops tasks into project bootstrapping
Three: Build libraries to accelerate code instrumentation
Four: Create patterns for custom metrics
Five: Integrate reporting agents into hosts and containers
Six: Automate platform compliance checks
Seven: Flow data into a single collection stream
Eight: Create a single source of truth for dashboards
Nine: Make software performance visible to all
Conclusion
The problem: A massively scaling
dev team—but not enough ops talent
In 2015, I found myself in a tough but familiar situation (and one that many readers will recognize). Our team had a huge amount of development to do on an ambitious new web platform, and we were hiring developers by the dozen. But we couldn’t possibly scale the ops team to support the growth of the development team. To put it simply, we couldn’t find and hire good ops engineers at the same rate that we could find and hire developers.

How were we going to support our rapidly expanding development team? We needed to find creative ways to empower our developers to write high quality code, and then deploy, monitor, and remediate problems themselves.


The situation today: A 25:1 dev-to-ops ratio


In my 20 years of software development experience, I’ve typically observed dev-to-ops ratios of between 6:1 and 8:1. Fast-forward to today: our organization’s dev-to-ops ratio is 25:1. We have more than 500 engineers who have delivered over 150 production microservices. Yet only a small fraction of our engineers are focused on architecture, operations, and governance. For every 25 developers focused on feature delivery and defect triage, there is only one engineer dedicated to ops, architecture, and governance.

How we got to where we are today: The Metrics-Driven Mindset
How did we achieve a 25:1 dev-to-ops ratio on a team of more than 500 engineers? A big factor was what I call our “metrics-driven mindset.” I often pose this (rhetorical) question:

Why build software if you can’t (1) measure its effectiveness and (2) react to issues in real time?

To me, this is the core reason why developers (along with product managers and executives) need to constantly think about the key metrics that measure their code’s success—and also why ops and architecture teams need to make it easy for developers to collect and report on those metrics. Too often, metrics and monitoring are an afterthought to software delivery when, in reality, they need to be a first-order consideration.

A well-planned metrics and monitoring strategy yields higher quality code, lower support costs, and more self-sufficient development teams. But first, ops and architecture teams need to empower developers with the right tools and processes to effectively monitor their applications. In the pages that follow, I describe nine practices we’ve developed that were critical to our success:


1. Start with a performance requirements questionnaire

2. Bake ops tasks into project bootstrapping

3. Build libraries to accelerate code instrumentation

4. Create patterns for custom metrics

5. Integrate reporting agents into hosts and containers

6. Automate platform compliance checks

7. Flow data into a single collection stream

8. Create a single source of truth for dashboards

9. Make software performance visible to all

These practices help ensure that developers are focused on writing high-quality code that meets business requirements and not worrying about ops tasks. Below, I describe these principles in more detail and explain how we implemented them in our environment.


01 Start with a performance requirements questionnaire


Almost as soon as the first container starts, system and application metrics scale beyond the limits of human comprehension. This leads to an overwhelming amount of data and noise that can distract developers. So we created a questionnaire that asks developers to describe the top metrics they need to track in order to ensure their code is functioning. We encourage developers to work with business partners to identify a handful of key indicators per system.

Our performance requirements questionnaire

As part of the questionnaire, developers must also think about alerting. I firmly believe that if you’re going to monitor a metric, you should also alert on it. A dashboard is fine, but if something isn’t right, somebody should get a notification. Therefore, this questionnaire also forces developers and product teams to go beyond metrics and think about key performance indicators (KPIs) and service level indicators (SLIs).


As a side note, KPIs and SLIs capture critical business or system functions in context, whereas metrics represent point-in-time data without the necessary context. The key difference is in defining what is “normal.” For example, the count of records processed over the last 60 minutes is a metric: it doesn’t provide any business context. The percentage of records processed successfully over the last 60 minutes is a more useful indicator.

Ideally, developers fill out this questionnaire along with their counterparts on the Product Team before they begin writing code. The understanding is that these metrics and alerts are going to be visible to everybody (including executives). This exercise should inform how developers write the code. Then, when developers are getting ready to deliver their code, the architecture team uses this questionnaire during our governance check. We verify (1) that the developers are going to be accountable for the things they said were important in the questionnaire; and (2) that dashboards and alerts have been checked into the repository.
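To make the distinction concrete, here is a minimal sketch using Micrometer (the metrics library that ships with Spring Boot). The counter names and the processing class are hypothetical; the point is that the raw counts are metrics, while the success percentage derived from them is the indicator worth alerting on.

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;

// Hypothetical instrumentation: raw counts are metrics; the success
// percentage derived from them is the business-facing indicator.
public class RecordProcessingMetrics {

    private final Counter attempted;
    private final Counter succeeded;

    public RecordProcessingMetrics(MeterRegistry registry) {
        this.attempted = registry.counter("records.processed.attempted");
        this.succeeded = registry.counter("records.processed.succeeded");
    }

    public void process(String payload) {
        attempted.increment();
        // ... business logic for one record (omitted) ...
        succeeded.increment(); // only reached when processing succeeds
    }
}
```

The success-rate indicator is then defined in the monitoring platform as succeeded divided by attempted over a time window, which is exactly the kind of KPI the questionnaire asks teams to name up front.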

02 Bake ops tasks into project bootstrapping


I’ve found that DevOps teams are often focused more on automating code delivery pipelines while pushing off post-production monitoring and alerting until later. But these “later” stages of the DevOps lifecycle are also important to automate. It’s not efficient to think about monitoring and alerting after the fact. On our team, monitoring and alerting are baked into the earliest stages of the software delivery lifecycle.

We’re a Java shop, specifically using the Spring web framework and Spring Boot, an open source, Java-based framework used to create microservices. We forked and customized the Spring Initializr (a lightweight quickstart generator for Spring projects). In our forked version of Spring Initializr, we added additional automation for key ops tasks. When developers click the microservice generation button, it creates GitHub build streams, metrics dashboards, and alert templates.


Our customized Microservice Generator automates code and ops tasks

So when developers want to bootstrap a project, they simply click through the Spring Initializr, as most of the manual microservice setup steps are automated. We really want developers to be focused on writing high quality code, rather than worrying about ops tasks.
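As a rough sketch of the kind of ops automation such a generator might perform, the snippet below creates a starter dashboard for a newly generated service by posting a template to the Datadog dashboards API. The endpoint and headers follow Datadog’s public API, but the service name, template payload, and class are hypothetical and purely illustrative.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical bootstrap step: give every newly generated microservice a
// starter dashboard so monitoring exists before the first line of business code.
public class DashboardBootstrapper {

    private static final String DASHBOARD_API = "https://api.datadoghq.com/api/v1/dashboard";

    private final HttpClient http = HttpClient.newHttpClient();
    private final String apiKey;
    private final String appKey;

    public DashboardBootstrapper(String apiKey, String appKey) {
        this.apiKey = apiKey;
        this.appKey = appKey;
    }

    public void createStarterDashboard(String serviceName) throws Exception {
        // Minimal illustrative payload; a real template would contain the
        // standard KPI/SLI widgets every new service starts with.
        String body = """
            {
              "title": "%s - starter dashboard",
              "layout_type": "ordered",
              "widgets": [
                {"definition": {"type": "timeseries",
                                "requests": [{"q": "avg:jvm.heap_memory{service:%s}"}]}}
              ]
            }
            """.formatted(serviceName, serviceName);

        HttpRequest request = HttpRequest.newBuilder(URI.create(DASHBOARD_API))
                .header("Content-Type", "application/json")
                .header("DD-API-KEY", apiKey)
                .header("DD-APPLICATION-KEY", appKey)
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IllegalStateException("Dashboard creation failed: " + response.body());
        }
    }
}
```

The same idea applies to alert templates and GitHub build streams: the generator makes the relevant API calls once, so developers never have to do that wiring by hand.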

03 Build libraries to accelerate code instrumentation


While it’s important to capture low-level application runtime metrics, we don’t want developers spending precious time creating that instrumentation. Instead, we’d prefer that developers focus on the business metrics that matter.

We built a library that all Java applications in our stack use. If a developer is using the library, application runtime metrics will be auto-instrumented. The common Java library frees developers to concentrate on instrumenting key business metrics, which are generally custom application metrics (in our case, things like health data intake processing and success rate).

All repositories, regardless of business function, must contain documentation in the form of AsciiDoc. This documentation is required to provide context, sample usage, and SLA requirements. The build pipeline will generate and deploy this documentation to GitHub Pages for the repository. This system provides developers a valuable starting point and puts the documentation at the source, so people aren’t scouring wikis for obscure Word or Excel documents.
requirements. The build pipeline will generate

Example documentation providing context on how to use our custom @SLA annotation

04 Create patterns for custom metrics


We aim to provide turnkey solutions which allow developers to be naïve to the world of ops. However, it is necessary for developers to instrument any custom or application-specific metrics; they just have to think about the specific key metrics they need to monitor for their application. If the developers can instrument that small number of KPIs and SLIs, everything else should be delivered and built for them.


But first, developers need training and enablement.

We require all developers to complete a microservice bootcamp. This bootcamp contains all the coding patterns, including how to create KPIs, SLIs, and custom metrics, that developers need to get started when building services. This bootcamp is “intra-sourced,” and we encourage active contributions, such as lessons learned and new microservice patterns.

The bootcamp materials are currently written in AsciiDoc, an excellent vehicle for linking and embedding source files to provide additional context and clarity. Because Java files are embedded directly, the examples and solutions in the docs have to compile and build successfully before new documentation can be deployed. This ensures that example code will not be stale or broken.

During the bootcamp, we also introduce our custom libraries and our AOP (aspect-oriented programming) annotations, which, in our experience, produce the most useful and simple high-level KPIs and SLIs. However, we also strongly encourage individual team creativity when devising the metrics and visualizations that will work best for their services.

05 Integrate reporting agents into hosts and containers


As developers get ready to deploy their code, it’s vital to ensure that the metrics will actually
get collected. To streamline the deployment process for developers, we automatically integrate
reporting agents into containers and hosts.

Hosts: Any host created inside our cloud environment already has metric collection and aggregation baked in at the OS layer.

Containers: All of our containers share a common base container, which has an agent already baked in. As a result, any new containers will inherit the reporting agent. From there, we simply drop in the application runtimes, and all of the monitoring and logging is enabled.


06 Automate platform compliance checks


On our team, we’ve had a major push on not just software quality, but platform quality as well. High standards for platform compliance remove many potential headaches for both dev and ops.

It's important to monitor what tools your devs are using before they deploy

With so many components and developers, it’s very easy to have configuration and versioning drift. To resolve this problem, we created an automated compliance checker that checks the status of roughly 20 framework dependencies, verifies the build quality, and ensures that devs are using up-to-date versions of the web framework, library, containers, and other components. This final, pre-deployment step helps developers avoid running stale or insecure legacy items in production.

The compliance checker generates a compliance score and a green or red result. If green, the developer is clear to push to production. If red, the developer understands that the microservice is no longer compliant with our framework standards. Updating everything to the latest versions is a click of a button, as shown below.


The compliance checker interface makes it easy to see what’s out of compliance and make the
required updates.
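A toy sketch of the core version check, comparing each service’s declared dependency versions against a required-minimum list. The dependency names and versions are purely illustrative; the real checker covers roughly 20 dependencies plus build quality.

```java
import java.util.Map;

// Hypothetical compliance check: every tracked dependency must be at or
// above the minimum version mandated by the platform team.
public class ComplianceChecker {

    // Minimum allowed versions (illustrative values only).
    private static final Map<String, String> REQUIRED_MINIMUMS = Map.of(
            "spring-boot", "2.7.0",
            "common-metrics-lib", "1.4.0",
            "base-container", "3.1.0");

    public boolean isCompliant(Map<String, String> declaredVersions) {
        return REQUIRED_MINIMUMS.entrySet().stream().allMatch(required -> {
            String declared = declaredVersions.get(required.getKey());
            return declared != null
                    && compareVersions(declared, required.getValue()) >= 0;
        });
    }

    // Compares dotted numeric versions, e.g. "2.10.1" vs "2.7.0".
    private static int compareVersions(String left, String right) {
        String[] l = left.split("\\.");
        String[] r = right.split("\\.");
        for (int i = 0; i < Math.max(l.length, r.length); i++) {
            int li = i < l.length ? Integer.parseInt(l[i]) : 0;
            int ri = i < r.length ? Integer.parseInt(r[i]) : 0;
            if (li != ri) {
                return Integer.compare(li, ri);
            }
        }
        return 0;
    }
}
```

A green result corresponds to isCompliant returning true for every tracked component; anything else turns the service red until the versions are bumped.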

07 Flow data into a single collection stream


Like most large enterprises, we have several operations and monitoring platforms that we need to unify. Having many systems throwing out metrics and alerts is inefficient: it leads to too many false positives and a lot of time wasted switching between collection points and searching for answers. Therefore, we ingest all streams into a centralized platform, which emits alerts.

In our unified data platform, teams can filter by environment (production, staging, or testing). Teams can also decide which data is meaningful, mute extraneous sources, and alert only on key data. Plus, our ability to combine and correlate monitoring data allows us to confidently diagnose problems—and also provides additional context to troubleshoot them.
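One way to make that environment filtering possible is to tag every metric with its environment at the source. The document doesn’t spell out the mechanism, so the sketch below is an assumption based on Spring Boot and Micrometer; the property name is hypothetical.

```java
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.actuate.autoconfigure.metrics.MeterRegistryCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Tag every metric the service emits with its environment so the unified
// platform can filter and alert on production, staging, or testing separately.
@Configuration
public class EnvironmentTagConfiguration {

    @Bean
    public MeterRegistryCustomizer<MeterRegistry> environmentTag(
            @Value("${deployment.environment:testing}") String environment) {
        return registry -> registry.config().commonTags("env", environment);
    }
}
```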


08 Create a single source of truth for dashboards


With 500 developers empowered to create dashboards and alerts, we needed to create consistent standards of quality control. Too many dashboards can be a problem: before you know it, you could have 500 dashboards, though few of them are meaningful. Rather than creating many less-than-useful dashboards, we prefer developers to create a limited number of useful dashboards.

To avoid dashboard overload, we export all dashboards into a GitHub repository we call “Datadog Ops,” which serves as the single source of truth for dashboards. While individuals are able to tinker with dashboards, every week a Jenkins job uses the Datadog API to auto-deploy the original dashboards back out of the repository. This mechanism keeps dashboards focused and clean.

The Datadog Ops repository helps us cut down on noise and maintain focus, while allowing developers to experiment and create dashboards on the fly. The repository also instantiates our entire monitoring and alerting strategy as code, and ensures that we have a way to automatically recreate our entire dashboard and monitoring infrastructure.

However, we do give developers the opportunity to permanently modify their dashboards. We periodically export “experimental” dashboards and email their creators with this message: “Please enter this dashboard into the Datadog Ops repository if you’d like to replace the existing production dashboard.”
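Here is a minimal sketch of that weekly restore step, assuming the Datadog v1 dashboards API; the repository layout, file naming, and class are hypothetical, and the real job is triggered from Jenkins.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical weekly job: re-apply every dashboard definition stored in the
// "Datadog Ops" repository so ad-hoc edits made in the UI are rolled back.
public class DashboardRestoreJob {

    private final HttpClient http = HttpClient.newHttpClient();
    private final String apiKey = System.getenv("DD_API_KEY");
    private final String appKey = System.getenv("DD_APP_KEY");

    public void restoreAll(Path dashboardsDir) throws Exception {
        // Each file is named <dashboard-id>.json and contains the exported definition.
        try (Stream<Path> files = Files.list(dashboardsDir)) {
            for (Path file : files.filter(f -> f.toString().endsWith(".json")).toList()) {
                String id = file.getFileName().toString().replace(".json", "");
                String body = Files.readString(file);

                HttpRequest request = HttpRequest.newBuilder(
                        URI.create("https://api.datadoghq.com/api/v1/dashboard/" + id))
                        .header("Content-Type", "application/json")
                        .header("DD-API-KEY", apiKey)
                        .header("DD-APPLICATION-KEY", appKey)
                        .PUT(HttpRequest.BodyPublishers.ofString(body))
                        .build();

                HttpResponse<String> response =
                        http.send(request, HttpResponse.BodyHandlers.ofString());
                System.out.println(id + " -> HTTP " + response.statusCode());
            }
        }
    }
}
```

Because the repository is the source of truth, the same job doubles as disaster recovery for the entire dashboard setup.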

09 Make software performance visible to all


We have large TVs around the office that display our Datadog dashboards. We don’t put low-level infrastructure or application runtime metrics on these. Rather, these TVs are the place for business-relevant KPIs and SLIs. The high visibility makes everyone care a lot more about monitoring their applications. It also creates accountability for teams and encourages collaboration across teams (such as business and technology) and across levels (from front-line engineers to the C-suite).


A FEW ADDITIONAL TACTICAL TIPS:

·· Rotating screens
I recommend limiting the number of displays (or rotating onscreen visualizations). We want to emphasize the few business-impacting metrics that matter, so adding more data doesn’t necessarily help.

·· Information display
The less clutter on the screen the better; generally, large, color-coded numbers (red/green or other complementary colors) are easiest to see from a distance. Lines are great where variance is expected.

·· Global & team-specific screens
Include global business KPI boards on all TVs, but allow teams to rotate in team-specific boards on their TVs.

Conclusion
The metrics-driven mindset has enabled the team to deliver higher quality software faster. Our very lean ops and architecture team has driven the practices described above, freeing developers to focus on writing good code and instrumenting key business metrics, rather than worrying about ops tasks. In addition, the metrics-driven mindset gives devs ownership over their microservices and encourages devs to identify and troubleshoot issues themselves.

