BIRMINGHAM—MUMBAI
Datadog Cloud Monitoring Quick Start Guide
Copyright © 2021 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or
transmitted in any form or by any means, without the prior written permission of the publisher,
except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information
presented. However, the information contained in this book is sold without warranty, either express
or implied. Neither the author(s), nor Packt Publishing or its dealers and distributors, will be held
liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies
and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing
cannot guarantee the accuracy of this information.
ISBN 978-1-80056-873-0
www.packt.com
To my mother, Annie KS, and to the memory of my father, Kurian TT, for
their sacrifices and perseverance to raise me and my siblings and equip us
to navigate the real world. To my wife, Vinaya, for her unconditional love
and the constant encouragement to finish this work.
– Thomas Theakanath
Contributors
About the author
Thomas Kurian Theakanath has over 20 years of experience in software development and
IT production engineering and is currently focused on cloud engineering and DevOps.
He has architected, developed, and rolled out automation tools and processes at multiple
companies in Silicon Valley.
He graduated from Calicut University in India majoring in production engineering
and he did his master's in industrial management at the Indian Institute of Technology,
Mumbai. He has worked at big corporations such as Yahoo! and SAP Ariba and has
helped multiple start-ups get started with their DevOps practice. He has led and
mentored engineering teams to the successful completion of large DevOps and
monitoring projects.
He contributes regularly to tech blogs on the latest trends and best practices in his areas of
expertise.
Originally from Kochi, India, Thomas now lives in the San Francisco Bay Area with his
wife and three children. An avid outdoor enthusiast, he spends his free time running and
hiking the trails and hills of the Bay Area with his friends and family and volunteering for
his favorite charities.
About the reviewer
Carlos Ijalba is an IT professional with more than 26 years of experience in the IT
industry in infrastructure architecture, consultancy, systems installation, administration,
monitoring, training, disaster recovery, and documentation. He has worked for
everything from small family companies with multidisciplinary environments to big
international companies with large IT departments and high technical
specialization. He has extensive
multiplatform experience, having worked with Unix, NonStop, IBM i, Windows, Linux,
VMware, AWS, and DevOps. He holds certifications in Unix, Oracle HW, VMware, AWS
Architecture, and is currently preparing for Kubernetes and other AWS certifications.
Carlos has lived in the UK for 10 years, spent 5 years in Andorra, and he now lives
in Spain.
To my wife, Mar: thank you for your unconditional love and support,
especially on the many nights doing this book's technical review. Thank you
for sharing your life with me.
Table of Contents

Preface

Section 1: Getting Started with Datadog

Chapter 1: Introduction to Monitoring
    Technical requirements
    Why monitoring?
    Proactive monitoring
        Implementing a comprehensive monitoring solution
        Setting up alerts to warn of impending issues
        Having a feedback loop
    Monitoring use cases
        All in a data center
        Application in a data center with cloud monitoring
        All in the cloud
    Monitoring terminology and processes
        Host
        Agent
        Metrics
        Up/down status
        Check
        Threshold
        Monitor
        Alert
        Alert recipient
        Severity level
        Notification
        Downtime
        Event
        Incident
        On call
        Runbook
    Types of monitoring
        Infrastructure monitoring
        Platform monitoring
        Application monitoring
        Business monitoring
        Last-mile monitoring
        Log aggregation
        Meta-monitoring
        Noncore monitoring
    Overview of monitoring tools
        On-premises tools

Chapter 2: Deploying the Datadog Agent
    Technical requirements
    Installing the Datadog Agent
        Runtime configurations
        Steps for installing the agent
        Agent components
        Agent as a container
    Deploying the agent – use cases
        All on the hosts
        Agent on the host monitoring containers
        Agent running as a container
    Advanced agent configuration
    Best practices
    Summary

Chapter 3: The Datadog Dashboard
    Technical requirements
    Infrastructure List
    Events
    Metrics Explorer
    Dashboards
    The main Integrations menu
        Integrations
        APIs
        Agent
        Embeds
    Monitors
        Creating a new metric monitor
        Advanced features
    Summary

Chapter 4: Account Management
    Technical requirements
    Managing users
    Granting custom access using roles
    Setting up organizations
    Implementing Single Sign-On
    Managing API and application keys
    Tracking usage
    Best practices
    Summary

Chapter 5: Metrics, Events, and Tags
    Technical requirements
    Understanding metrics in Datadog
        Metric data
        Flush time interval
        Metric type
        Metric unit
        Query
    Tagging Datadog resources
        Defining tags
        Tagging methods
        Customizing host tags
        Tagging integration metrics
        Tags from microservices
        Filtering using tags
    Defining custom metrics
    Monitoring event streams
        Searching events
        Notifications for events
        Generating events
    Best practices
    Summary

Chapter 6: Monitoring Infrastructure
    Technical requirements
    Inventorying the hosts
        CPU usage
        Load averages
        Available swap
        Disk latency
        Memory breakdown
        Disk usage
        Network traffic
    Listing containers
    Viewing system processes
    Monitoring serverless computing resources
    Best practices
    Summary

Chapter 7: Monitors and Alerts
    Technical requirements
    Setting up monitors
    Managing monitors
    Distributing notifications
    Configuring downtime
    Best practices
    Summary

Section 2: Extending Datadog

Chapter 8: Integrating with Platform Components
    Technical requirements
    Configuring an integration
    Tagging an integration
    Reviewing supported integrations
    Implementing custom checks
    Best practices
    Summary

Chapter 9: Using the Datadog REST API
    Technical requirements
    Scripting Datadog
        Monitors
        Host tags

Chapter 10: Working with Monitoring Standards
    Technical requirements
    Monitoring networks using SNMP
    Cassandra as a Java application
    Using Cassandra integration
    Accessing the Cassandra JMX interface

Chapter 11: Integrating with Datadog
    Technical requirements
    Using client libraries
        REST API-based client libraries
        DogStatsD client libraries
    Evaluating community projects
        dog-watcher by Brightcove
        kennel
        Managing monitors using Terraform
        Ansible modules and integration
    Developing integrations
        Prerequisites
        Setting up the tooling
        Creating an integration folder
        Running tests
        Building a configuration file
        Building a package
        Deploying an integration
    Best practices
    Summary

Section 3: Advanced Monitoring

Chapter 12: Monitoring Containers
    Technical requirements
    Collecting Docker logs
    Monitoring Kubernetes
        Installing the Datadog Agent
    Using Live Containers
    Viewing logs using Live Tail
    Searching container data
    Best practices
    Summary

Chapter 13: Managing Logs Using Datadog
    Technical requirements
    Collecting logs
        Shipping logs from containers
        Collecting logs from public cloud services

Chapter 14: Miscellaneous Monitoring Topics
    Technical requirements
    Application Performance Monitoring (APM)
        Sending traces to Datadog
        Profiling an application
        Service Map
    Implementing observability
    Synthetic monitoring
    Security monitoring
        Sourcing the logs
        Defining security rules
        Monitoring security signals
    Best practices
    Summary

Other Books You May Enjoy

Index
Preface
Monitoring software applications used to be an afterthought: when an outage happened,
a check was put in place to monitor the functionality of the software or the status of the
part of the infrastructure where the software system ran. Such monitoring was never
comprehensive, and the monitoring infrastructure grew organically. The first-generation
monitoring tools were designed only to support this simple, reactive approach to rolling
out monitoring.
The widespread use of the public cloud as a computing platform, the ongoing trend of
deploying applications as microservices, and the delivery of Software as a Service (SaaS)
with 24x7 availability have made operating software systems far more complex than it
used to be. Proactive monitoring has become an essential part of DevOps culture:
rolling out monitoring iteratively and incrementally is no longer an option, because
users expect software systems to always be available. Software systems are now designed
with observability in mind so that their workings can be exposed and monitored well.
Both system and application logs are aggregated and analyzed for operational insights.
There is a large repository of monitoring applications now available to meet these new
monitoring requirements both as on-premises and SaaS-based solutions. Some tools
focus on a specific area of monitoring and others try to cater to all types of monitoring.
It is common to use three or four different monitoring applications to cover all aspects of
monitoring. Datadog, a SaaS monitoring solution, is emerging as a monitoring platform
that can be used to build a comprehensive monitoring infrastructure in an enterprise. This
book aims to introduce Datadog to a variety of audiences and help you to roll out Datadog
as a proactive monitoring solution.
If you are using the digital version of this book, we advise you to type the code yourself
or access the code via the GitHub repository (link available in the next section). Doing
so will help you avoid any potential errors related to the copying and pasting of code.
Conventions used
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names,
filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles.
Here is an example: "The main Datadog Agent configuration file, datadog.yaml, can be
updated to meet your specific monitoring requirements."
A block of code is set as follows:
init_config:

instances:
  - url: "unix://var/run/docker.sock"
    new_tag_names: true
When we wish to draw your attention to a particular part of a code block, the relevant
lines or items are set in bold.
Bold: Indicates a new term, an important word, or words that you see onscreen. For
example, words in menus or dialog boxes appear in the text like this. Here is an example:
"The corresponding host will be listed on the dashboard under Infrastructure | Host Map
and Infrastructure List if the agent is able to connect to the backend."
Get in touch
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book
title in the subject of your message and email us at customercare@packtpub.com.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes
do happen. If you have found a mistake in this book, we would be grateful if you would
report this to us. Please visit www.packtpub.com/support/errata, select your
book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet,
we would be grateful if you would provide us with the location address or website name.
Please contact us at copyright@packt.com with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in
and you are interested in either writing or contributing to a book, please visit authors.
packtpub.com.
Reviews
Please leave a review. Once you have read and used this book, why not leave a review on
the site that you purchased it from? Potential readers can then see and use your unbiased
opinion to make purchase decisions, we at Packt can understand what you think about
our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
Section 1:
Getting Started
with Datadog
This part of the book will get you started with Datadog by going over the core features of
Datadog and how to use them. With that information, you will be able to roll out Datadog
in an organization. You will also understand the basic concepts of monitoring software
systems and the infrastructure they run on and be able to relate them to the Datadog
features.
This section comprises the following chapters:
• Chapter 1, Introduction to Monitoring
• Chapter 2, Deploying the Datadog Agent
• Chapter 3, The Datadog Dashboard
• Chapter 4, Account Management
• Chapter 5, Metrics, Events, and Tags
• Chapter 6, Monitoring Infrastructure
• Chapter 7, Monitors and Alerts

1
Introduction to Monitoring
In this chapter, we will cover the following topics:
• Why monitoring?
• Proactive monitoring
• Monitoring use cases
• Monitoring terminology and processes
• Types of monitoring
• Overview of monitoring tools
Technical requirements
There are no technical requirements for this chapter.
Why monitoring?
Monitoring is a generic term in the context of operating a business. As part of effectively
running a business, various elements of the business operations are measured for their
health and effectiveness, and the outputs of such measurements are compared against the
business goals. When such efforts are made periodically and systematically, they can be
called monitoring.
Monitoring can be done directly or indirectly, and voluntarily or as dictated by some law.
Successful businesses are good at tracking growth by using various metrics and taking
corrective actions to fine-tune their operations and goals as needed.
The previous statements are common knowledge and applicable to any successful
operation, even if it is a one-time event. However, there are a few important keywords we
already mentioned – metrics, measurement, and goal. A monitoring activity is all about
measuring something and comparing that result against a goal or a target.
While it is important to keep track of every aspect of running a business, this book
focuses on monitoring software application systems running in production. Information
Technology (IT) is a key element of running businesses, and that involves operating
software systems. With software available as a service through the internet (commonly
referred to as Software as a Service, or just SaaS), most software users don't need to run
software systems in-house, and thus have no need to monitor them.
However, SaaS providers need to monitor the software systems they run for their
customers. Big corporations such as banks and retail chains may still have to run software
systems in-house due to the non-availability of the required software service in the cloud
or due to security or privacy reasons.
The primary focus of monitoring a software system is to check its health. By keeping
track of the health of a software system, it is possible to determine whether the system is
available or its health is deteriorating.
If it is possible to catch the deteriorating state of a software system ahead of time, it might
be possible to fix the underlying issue before any outage happens, which would ensure
business continuity. Such a method of proactive monitoring that provides warnings well
in advance is the ideal method of monitoring. But that is not always possible. There must
also be processes in place to deal with the outage of software systems.
Though we discussed monitoring software systems in broad terms earlier, a closer look
makes it clear that monitoring involves the three main components mentioned previously.
The information, measured using related metrics, is different for each category.
The monitoring of these three components constitutes core monitoring. There are many
other aspects, both internal to the system, such as its health, and external, such as
security, that would make the monitoring of a software system complete. We will look
at all the major aspects of software system monitoring in general in this introductory
chapter, and at how they are supported by Datadog in later chapters.
Proactive monitoring
Technically, monitoring is not part of a software system running in production. The
applications in a software system can run without any monitoring tools rolled out. As
a best practice, software applications must be decoupled from monitoring tools anyway.
This scenario sometimes results in taking software systems to production with minimal
or no monitoring, which would eventually result in issues going unnoticed or, in the
worst-case scenario, in users of those systems discovering those issues while using the
software service. Such situations are bad for the business.
One of the mitigating steps taken in response to a production issue is adding some
monitoring so that the same issue will be caught and reported to the operations team.
Such a reactive approach usually increases the coverage of monitoring organically, but
without following a monitoring strategy. While such an approach will sometimes help to
catch issues, there is no guarantee that an organically grown monitoring infrastructure
will be capable of checking the health of the software system and warning about it so
that remediation steps can be taken proactively to minimize outages in the future.
Proactive monitoring refers to rolling out monitoring solutions for a software system
to report on issues with the components of the software system, and the infrastructure
the system runs on. Such reporting can help with averting an impending issue by taking
mitigating steps manually or automatically. The latter method is usually called
self-healing, a highly desirable end state of monitoring, but hard to implement.
The key aspects of a proactive monitoring strategy are as follows.
The following figure illustrates how the software application and monitoring tool are
running from two data centers, which ensures the availability of the software system:
A cloud-based monitoring tool can also do the following:
• Monitor cloud resources that are hard to monitor using standard monitoring tools
• Be used as a secondary monitoring system, mainly to cover infrastructure monitoring
• Monitor the rest of the monitoring infrastructure as a meta-monitoring tool
The scenarios we discussed here were simplified to explain the core concepts. In real
life, a monitoring solution is rolled out with multiple tools and some of them might be
deployed in-house and others might be cloud-based. When such complex configurations
are involved, not losing track of the main objectives of proactive monitoring is the key
to having a reliable monitoring system that can help to minimize outages and provide
operational insights that will contribute to fine-tuning the application system.
Monitoring terminology and processes
Host
A host used to mean a physical server during the data center era. In the monitoring world,
it usually refers to a device with an IP address. That covers a wide variety of equipment
and resources – bare-metal machines, network gear, IoT devices, virtual machines, and
even containers.
Some of the first-generation monitoring tools, such as Nagios and Zenoss, are built around
the concept of a host, meaning everything that can be done on those platforms must be
tied to a host. Such restrictions are relaxed in new-generation monitoring tools such as
Datadog.
Agent
An agent is a service that runs alongside the application software system to help with
monitoring. It runs various tasks for the monitoring tools and reports information back to
the monitoring backend.
Agents are installed on the hosts where the application system runs. An agent could be
a simple process running directly on the operating system, or a microservice. Datadog
supports both options; when the application software is deployed as microservices, the
agent is also deployed as a microservice.
Metrics
Metrics in monitoring refers to a time-bound measurement of some information that
provides insight into the workings of the system being monitored. Familiar examples
include the available disk space on the root partition of a host and its CPU usage.
The important thing to note here is that a metric is time-bound and its value will change.
For that reason, the total disk space on the root partition is not considered a metric.
A metric is measured periodically to generate time-series data. We will see that this
time-series data can be used in various ways: to plot charts on dashboards, to analyze
trends, and to set up monitors.
A wide variety of metrics, especially those related to infrastructure, are generated by
monitoring tools. There are options to generate your own custom metrics too.
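Custom metrics are often pushed through the agent's DogStatsD interface, which accepts a plain-text line format over UDP. The following is only a sketch of that line format; the metric name and tag are illustrative, and in practice you would use a DogStatsD client library rather than building payloads by hand:

```python
import time

def gauge_payload(name, value, tags=None):
    """Build a DogStatsD-style gauge line: "name:value|g|#tag1,tag2"."""
    line = f"{name}:{value}|g"
    if tags:
        line += "|#" + ",".join(tags)
    return line

# A metric is time-bound: pairing each sample with a timestamp and
# collecting samples periodically is what produces time-series data.
point = (int(time.time()), gauge_payload("disk.root.free_gb", 3.2, ["host:web-01"]))
print(point[1])
```

Sampling this gauge on a schedule yields exactly the kind of time-series data described above.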
Up/down status
A metric measurement can have a range of values, but up or down is a binary status. These
are a few examples:
• A process is up or down
• A website is up or down
• A host is pingable or not
Tracking the up/down status of various components of a software system is core to all
monitoring tools and they have built-in features to check on a variety of resources.
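At its core, an up/down check reduces a probe to a binary result. The following sketch attempts a TCP connection to a service; the host and port are whatever resource you want to probe:

```python
import socket

def is_port_up(host, port, timeout=2.0):
    """Binary up/down status: True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Monitoring tools offer equivalents of this out of the box for processes, websites, and ping reachability.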
14 Introduction to Monitoring
Check
A check is used by the monitoring system to collect the value of metrics. When it is done
periodically, time-series data for that metric is generated.
While time-series data for standard infrastructure-level metrics is available out of the box
in monitoring systems, custom checks, which involve some scripting, can be implemented
to generate custom metrics.
The active/passive (or pull/push) model of data collection is standard across all monitoring
systems. The method used depends on the type of metrics a monitoring system collects.
You will see in later chapters that Datadog supports both methods.
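In the pull model, a check is just a routine that samples a metric value on demand; scheduling it periodically yields the time series described earlier. A minimal sketch (the metric name is illustrative):

```python
import shutil

def disk_free_check(path="/"):
    """Sample available disk space once; run on a schedule to build a time series."""
    usage = shutil.disk_usage(path)
    return {"metric": "system.disk.free_gb", "value": usage.free / 1e9}
```

A real agent check would also attach tags and report the sample to the monitoring backend.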
Threshold
A threshold is a fixed value in the range of values possible for a metric. For example, on
a root partition with a total disk space of 8 GB, the available disk space could be anything
from 0 GB to 8 GB. A threshold in this specific case could be 1 GB, which could be set as
an indication of low storage on the root partition.
There could be multiple thresholds defined too. In this specific example, 1 GB could be
a warning threshold and 500 MB could be a critical or high-severity threshold.
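Threshold evaluation is a simple comparison. This sketch uses the 1 GB warning and 500 MB critical values from the example above:

```python
def disk_space_status(free_gb, warning_gb=1.0, critical_gb=0.5):
    """Classify available root-partition space against two thresholds."""
    if free_gb <= critical_gb:
        return "critical"
    if free_gb <= warning_gb:
        return "warning"
    return "ok"
```

Checking the more severe threshold first ensures a value below both thresholds is reported as critical, not merely warning.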
Monitor
A monitor looks at the time-series data generated for a metric and alerts the alert
recipients if the values cross the thresholds. A monitor can also be set up for the up/down
status, in which case it alerts the alert recipients if the related resource is down.
Alert
An alert is produced by a monitor when the associated check determines that the metric
value crosses a threshold set on the monitor. A monitor could be configured to notify the
alert recipients of the alerts.
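Putting the last few terms together: a monitor consumes a metric's time-series data and produces an alert state when a threshold is crossed. A toy sketch (real monitors evaluate windows of data, not just the latest point):

```python
def evaluate_monitor(series, threshold):
    """series: list of (timestamp, value) pairs for one metric.
    Returns "Alert" when the latest value drops below the threshold."""
    latest_value = series[-1][1]
    return "Alert" if latest_value < threshold else "OK"

# Free space on the root partition falling below a 1 GB threshold:
states = [evaluate_monitor([(0, 5.0), (60, v)], 1.0) for v in (2.5, 0.8)]
```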
Alert recipient
The alert recipient is a user in the organization who signs up to receive alerts sent out by a
monitor. An alert could be received by the recipient through one or more communication
channels such, as E-Mail, SMS, and Slack.
Severity level
Alerts are classified by the seriousness of the issue they unearth about the software
system, which is indicated by an appropriate severity level. The response to an alert is
tied to the severity level of the alert.
A sample set of severity levels could consist of Informational, Warning, and Critical. For
example, with our example of available disk space on the root partition, the monitor could
be configured to alert as a warning at 30% available disk space and as critical at 20%.
As you can see, setting up severity levels for an increasing level of seriousness would
provide the opportunity to catch issues and take mitigative actions ahead of time, which is
the main objective of proactive monitoring. Note that this is possible in situations where
a system component degenerates over a period of time.
A monitor that tracks an up/down status cannot provide any advance warning, so the
mitigative action would be to bring up the related service as soon as possible. However, in
a real-life scenario, there should be multiple monitors so that at least one of them can
catch an underlying issue ahead of time. For example, having no disk space on the root
partition can stop some services, and monitoring the available space on the root partition
would help prevent those services from going down.
Notification
A message sent out as part of an alert specific to a communication platform such as email
is called a notification. There is a subtle difference between an alert and a notification,
but at times they are considered the same. An alert is a state within the monitoring
system that can trigger multiple actions such as sending out notifications and updating
monitoring dashboards with that status.
Traditionally, email distribution groups used to be the main communication method used
to send out notifications. Currently, there are much more sophisticated options, such as
chat and texting, available out of the box on most monitoring platforms. Also, escalation
tools such as PagerDuty could be integrated with modern monitoring tools such as
Datadog to route notifications based on severity.
Downtime
The downtime of a monitor is a time window during which alerting is disabled on that
monitor. Usually, this is done for a temporary period while some change to the underlying
infrastructure or software component is rolled out, during which time monitoring on that
component is irrelevant. For example, a monitor that tracks the available space on a disk
drive will be impacted while the maintenance task to increase the storage size is ongoing.
Monitoring platforms such as Datadog support this feature. The practical use of this
feature is to avoid receiving notifications from the impacted monitors. By integrating
a CI/CD pipeline with the monitoring application, the downtimes could be scheduled
automatically as a prerequisite for deployments.
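A CI/CD pipeline scheduling downtime would, in essence, send the monitoring backend a scope and a time window. The sketch below only builds such a window; the field names are illustrative, not the exact Datadog API schema:

```python
import time

def downtime_window(duration_minutes, scope):
    """Compute the start/end window a downtime-scheduling API call would take."""
    start = int(time.time())
    return {"scope": scope, "start": start, "end": start + duration_minutes * 60}

# Silence monitors scoped to one host for a 30-minute deployment:
window = downtime_window(30, "host:web-01")
```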
Event
An event published by a monitoring system usually provides details of a change that
happened in the software system.
Note that such events do not demand immediate action; they are informational. That's
how an event differs from an alert: a critical alert is actionable, but there is no severity
level attached to an event, so it is not actionable. Events are recorded in the monitoring
system and are valuable information when triaging an issue.
Incident
When a product feature is not available to users, it is called an incident. An incident
usually occurs when some outage happens in the infrastructure, be it hardware or
software, but not always; it could also happen due to external network or internet-related
access issues, though such issues are uncommon.
The process of handling incidents and mitigating them is an area by itself and is not
generally considered part of core monitoring. However, monitoring and incident
management are intertwined.
On call
The critical alerts sent out by monitors are responded to by an on-call team. Though the
actual coverage can vary based on the Service-Level Agreement (SLA) requirements of the
application being monitored, on-call teams are usually available 24x7.
In a mature service engineering organization, three levels of support would be available,
where an issue is escalated from L1 to L3:
• The L1 support team consists of product support staff who are knowledgeable about
the applications and can respond to issues using runbooks.
• The L2 support team consists of Site Reliability Engineers (SREs) who might also
rely on runbooks, but they are also capable of triaging and fixing the infrastructure
and software components.
• The L3 support team would consist of the DevOps and software engineers who
designed and built the infrastructure and software system in production. Usually,
this team gets involved only to triage issues that are not known.
Runbook
A runbook provides on-call support personnel with steps to respond to an alert
notification. The steps might not always provide a resolution to the reported issue;
a runbook could be as simple as an instruction to escalate the issue to an engineering
point of contact for investigation.
Types of monitoring
There are different types of monitoring. All types of monitoring that are relevant to
a software system must be implemented to make it a comprehensive solution. Another
aspect to consider is the business need of rolling out a certain type of monitoring.
For example, if customers of a software service insist on securing the application they
subscribe to, the software provider has to roll out security monitoring.
(The discussion on types of monitoring in this section originally appeared in the article
Proactive Monitoring, published by me on DevOps.com.)
Infrastructure monitoring
The infrastructure that runs the application system is made up of multiple components:
servers, storage devices, load balancers, and so on. Checking the health of these devices is
the most basic requirement of monitoring. The popular monitoring platforms support this
feature out of the box, and very little customization is required beyond setting up the right
alerting thresholds on the related metrics.
Platform monitoring
An application system is usually built using multiple third-party tools. Checking the
health of these application components is important too. Most of these tools provide an
interface, mainly via a REST API, that can be leveraged to implement plugins on the
main monitoring platform.
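For example, a plugin could poll a component's REST health endpoint and reduce the response to a status plus details. A minimal sketch (the endpoint URL and response shape are illustrative):

```python
import json
import urllib.request

def check_health(url, timeout=3.0):
    """Return (up, details) from a component's JSON health endpoint."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200, json.load(resp)
    except (OSError, ValueError):
        # Connection failures, HTTP errors, and malformed JSON all count as down.
        return False, {}
```

The returned tuple can then feed a custom check that reports an up/down status to the monitoring platform.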
Application monitoring
Having a healthy infrastructure and platform is not good enough for an application to
function correctly. Defective code from a recent deployment or third-party component
issues or incompatible changes with external systems can cause application failures.
Application-level checks can be implemented to detect such issues. As mentioned earlier,
a functional or integration test would unearth such issues in a testing/staging
environment, and an equivalent of that should be implemented in the production
environment also.
The implementation of application-level monitoring could be simplified by building
hooks or API endpoints in the application. In general, improving the observability of the
application is the key.
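One common hook is a health endpoint that aggregates application-level dependency checks into a single response. A sketch of just the aggregation step, with hypothetical check names:

```python
def health_report(checks):
    """Run named dependency checks (callables returning True/False) and
    aggregate them into the payload a /health endpoint could serve."""
    results = {name: bool(fn()) for name, fn in checks.items()}
    return {"healthy": all(results.values()), "checks": results}

# Example: the app is healthy only if both its database and cache are reachable.
report = health_report({"database": lambda: True, "cache": lambda: True})
```

A monitoring platform can then poll the endpoint and alert whenever the aggregated status is unhealthy.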
Monitoring is usually an afterthought, and the need for such instrumentation is
overlooked during the design phase of an application. The participation of the DevOps
team in design reviews improves the operability of a system, and planning for
application-level monitoring in production is one area where DevOps can provide inputs.
Business monitoring
The application system runs in production to meet certain business goals. You could
have an application that runs flawlessly on healthy infrastructure, but the business might
still not be meeting its goals. It is important to provide that feedback to the business at the
earliest opportunity so that corrective actions can be taken, which might trigger
enhancements of the application features and/or the way the business is run using the
application.
These efforts should only complement the more complex BI-based data analysis methods
that could provide deeper insights into the state of the business. Business-level monitoring
can be based on transactional data readily available in the data repositories and the data
aggregates generated by BI systems.
Both application- and business-level monitoring are company-specific, and plugins have
to be developed for such monitoring requirements. Implementing a framework to access
standard sources of information such as databases and REST APIs from the monitoring
platform could minimize the requirement of building plugins from scratch every time.
Types of monitoring 21
Last-mile monitoring
A monitoring platform deployed in the same public cloud or data center environment as
where the applications run cannot check the end user experience. To address that gap,
there are several SaaS products available on the market, such as Catchpoint and Apica.
These services are backed by actual infrastructure to monitor the applications in
specific geographical locations. For example, if you are keen on knowing how your mobile
app performs on iPhones in Chicago, that could be tracked using the service provider's
testing infrastructure in Chicago.
Alerts are set up on these tools to notify the site reliability engineering team if the
application is not accessible externally or if there are performance issues with the
application.
Log aggregation
In a production environment, a huge amount of information is logged in various log files
by the operating system, platform components, and applications. These files get some
attention when issues happen and are normally ignored otherwise. Traditional monitoring
tools such as Nagios couldn't handle constantly changing log files beyond alerting on
certain patterns.
The advent of log aggregation tools such as Splunk changed that situation. Using
aggregated and indexed logs, it is possible to detect issues that would have gone unnoticed
earlier. Alerts can be set up based on the info available in the indexed log data. For
example, Splunk provides a custom query language to search indexes for operational
insights. Using APIs provided by these tools, the alerting can actually be integrated with
the main monitoring platform.
To leverage the aggregation and indexing capabilities of these tools, structured data
outputs can be generated by the application or scripts that will be indexed by the log
aggregation tool later.
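For example, a script could emit its log lines as JSON so that the aggregation tool can index each field individually; the service and event field names in this sketch are illustrative:

```shell
# Emit a structured (JSON) log line; a log aggregation tool can index each
# field individually. The service and event field values are illustrative.
log_event() {
  printf '{"ts":"%s","level":"%s","service":"checkout","event":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2"
}

log_event "INFO" "order_placed"
```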
Meta-monitoring
It is important to make sure that the monitoring infrastructure itself is up and running.
Disabling alerting during a deployment and forgetting to re-enable it later is a common
oversight in operations. Such missteps are hard to monitor, and only improvements in the
deployment process can address them.
Let's look at a couple of popular methods used in meta-monitoring:
22 Introduction to Monitoring
Pinging hosts
If there are multiple instances of the monitoring application running, or if there is a
standby node, then cross-checks can be implemented to verify the availability of hosts
used for monitoring. In AWS, CloudWatch can be used to monitor the availability of an
EC2 node.
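A cross-check of this kind can be as simple as a scripted ping; the host value in the following sketch is illustrative and would, in practice, be the standby or peer monitoring node:

```shell
# Cross-check that a peer monitoring host is reachable.
# The host value is illustrative; in practice, it would be the standby node.
host="127.0.0.1"
if ping -c 1 -W 2 "$host" > /dev/null 2>&1; then
  echo "monitoring host $host is reachable"
else
  echo "ALERT: monitoring host $host is unreachable"
fi
```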
Noncore monitoring
The monitoring types that have been discussed thus far make up the components of a core
monitoring solution. You will see most of these monitoring categories in a comprehensive
solution. There are more monitoring types that are highly specialized and that would be
important components in a specific business situation.
Security monitoring
Security monitoring is a vast area by itself, and there are specialized tools available
for it, such as Security Information and Event Management (SIEM) tools. However, that
is slowly changing, and general-purpose monitoring tools, including Datadog, have started
offering security features to be more competitive in the market. Security monitoring
usually covers these aspects:
As you can see, these objectives might not strictly be covered by the core monitoring
concepts we have discussed thus far. We'll have to bring in a new set of terminology and
concepts to understand them better, and we will look at those details later in the book.
Overview of monitoring tools 23
On-premises tools
This group of monitoring applications have to be deployed on your infrastructure to run
alongside the application system. Some of these tools might also be available as an SaaS,
and that will be mentioned where needed.
The objective here is to introduce the landscape of the monitoring ecosystem to
newcomers to the area and show how varied it is.
Nagios
Nagios is a popular, first-generation monitoring application that is well known for
monitoring systems and network infrastructure. Nagios is general-purpose, open source
software that has both free and licensed versions. It is highly flexible software that can
be extended using the hundreds of widely available plugins. Also, writing plugins and
deploying them to meet custom monitoring requirements is relatively easy.
Zabbix
Zabbix is another popular, first-generation monitoring application that is open source and
free. It's a general-purpose monitoring application like Nagios.
TICK Stack
TICK stands for Telegraf, InfluxDB, Chronograf, and Kapacitor. These open source
software components make up a highly distributed monitoring application stack and
it is one of the popular new-generation monitoring platforms. While first-generation
monitoring tools are basically monolithic software, new-generation platforms are divided
into components that make them flexible and highly scalable. The core components of the
TICK Stack perform these tasks:
• Telegraf: Generates metrics time-series data.
• InfluxDB: Stores time-series monitoring data for it to be consumed in various ways.
• Chronograf: Provides a UI for metrics time-series data.
• Kapacitor: Sets monitors on metrics time-series data.
Prometheus
Prometheus is a popular, new-generation, open source monitoring tool that collects
metric values by scraping the target systems; that is, it relies on active checks, or the
pull method, as we discussed earlier. Prometheus-based monitoring is composed of
multiple components, including the Prometheus server itself, exporters that expose
metrics from target systems, and Alertmanager, which handles alerting.
The ELK Stack
The ELK Stack (Elasticsearch, Logstash, and Kibana) components are open source
software, and free versions are available. SaaS versions of the stack are also available
from multiple vendors as a licensed software service.
Splunk
Splunk is a pioneering licensed product with a large install base in the log aggregation
category of monitoring applications.
Zenoss
Zenoss is a first-generation monitoring application like Nagios and Zabbix.
Cacti
Cacti is a first-generation monitoring tool primarily known for network monitoring.
Its features include automatic network discovery and network map drawing.
Sensu
Sensu is a modern monitoring platform that recognizes the dynamic nature of
infrastructure at various levels. Using Sensu, monitoring requirements can be
implemented as code, a feature that makes it stand out in a market with a large number
of competing monitoring products.
Sysdig
The Sysdig platform offers standard monitoring features available with a modern
monitoring system. Its focus on microservices and security makes it an important product
to consider.
AppDynamics
AppDynamics is primarily known as an Application Performance Monitoring (APM)
platform. However, its current version covers standard monitoring features as well.
That said, tools like this are usually an add-on to a more general-purpose monitoring
platform.
SaaS solutions
Most new-generation monitoring tools such as Datadog are primarily offered as
monitoring services in the cloud. What this means is that the backend of the monitoring
solution is hosted in the cloud, while its agent service runs on-premises to collect metrics
data and ship it to the backend. Some tools are available both on-premises and as a
cloud service.
Sumo Logic
Sumo Logic is primarily a SaaS offering for log aggregation and searching. However,
its impressive security-related features also allow it to be used as a Security Information
and Event Management (SIEM) platform.
New Relic
Like AppDynamics, New Relic was initially known primarily as an APM platform, but it
also supports standard monitoring features.
Dynatrace
Dynatrace is also a major player in the APM space, like AppDynamics and New Relic.
Besides having the standard APM features, it also positions itself as an AI-driven tool that
correlates monitoring events and flags abnormal activities.
Catchpoint
Catchpoint is an end user experience monitoring or last-mile monitoring solution. By
design, such a service needs to be third-party provided as the related metrics have to be
measured close to where the end users are.
There are several product offerings in this type of monitoring. Apica and Pingdom are
other well-known vendors in this space.
Cloud-native tools
Popular public cloud platforms such as AWS, Azure, and GCP offer a plethora of services
and monitoring is just one of them. Actually, there are multiple services that could be used
for monitoring purposes. For example, AWS offers CloudWatch, which is primarily an
infrastructure and platform monitoring service, and there are services such as GuardDuty
that provide sophisticated security monitoring options.
Cloud-native monitoring services are yet to be widely used as general-purpose
monitoring solutions outside of their own cloud platforms, even though Google's
operations suite and Azure Monitor are full-featured monitoring platforms.
However, when it comes to monitoring a cloud-specific compute, storage, or networking
service, a cloud-native monitoring tool might be better suited. In such scenarios, the
integration provided by the main monitoring platform can be used to consolidate
monitoring in one place.
AWS CloudWatch
AWS CloudWatch provides infrastructure-level monitoring for the cloud services offered
on AWS. It could be used as an independent platform to augment the main monitoring
system or be integrated with the main monitoring system.
Google operations
This monitoring service available on GCP (formerly known as Stackdriver) is a full-stack,
API-based monitoring platform that also provides log aggregation and APM features.
Azure Monitor
Azure Monitor is also a full-stack monitoring platform, like the operations suite on GCP.
Summary
In this chapter, you learned why monitoring a software system is important for operating
the business. You were also introduced to various types of monitoring, real-life use cases
of monitoring, popular monitoring tools on the market, and, most importantly, the core
monitoring concepts and terminology that are used across all monitoring tools.
While this chapter provided a comprehensive overview of monitoring concepts, tools, and
the market, the next chapter will introduce you to Datadog specifically and provide details
on its agent, which has to be installed on your infrastructure to start using Datadog.
2
Deploying the
Datadog Agent
In the previous chapter, we learned that the cornerstone of a monitoring tool is the group
of metrics that helps to check the health of the production system. The primary tasks of
the monitoring tool are to collect metric values periodically as time series data and to alert
on issues based on the thresholds set for each metric.
The common method used by monitoring tools to collect such data is to run an agent
process close to where the software application runs, be it on a bare-metal server, virtual
machine, or container. This would enable the monitoring agent to collect metric values
directly by querying the software application and the infrastructure where it runs.
Datadog collects such data in various ways and, like other monitoring tools, it provides
an agent. The agent gathers monitoring data from the local environment and uploads it
to the Datadog SaaS backend in the cloud. In this chapter, we will learn how the Datadog
Agent is configured to run in production environments.
This chapter will cover the following topics:
Technical requirements
To try out the examples mentioned in this book, you need to have the following tools and
resources:
• An Ubuntu 18.04 environment with Bash shell. The Datadog Agent can be
installed on a variety of operating systems and Ubuntu is chosen only as a sample
environment.
• A Datadog account and a user with admin-level access.
• Docker
Runtime configurations
There are multiple ways you can deploy the Datadog Agent in runtime environments to
collect events and data, and such configurations depend largely on how the applications
are deployed. For example, if all the applications run directly on the host, then the
Datadog Agent is run directly on the host as well. Let's look at the common runtime
configurations.
The Datadog Agent can be configured to run in three different ways locally, as illustrated
in the following diagrams. In all three cases, the agent also collects data on infrastructure
health in addition to collecting application-specific metrics:
• As a service on the host monitoring services: In this configuration, the applications
run directly on the host, and the Datadog Agent runs alongside them as a service:
Figure 2.1 – The Datadog Agent as a service on the host monitoring services
• As a service on the Docker host monitoring application containers: In this case,
the software application is deployed as containers on the Docker host and the
Datadog Agent runs directly on the host, monitoring the health of the containers
and the application:
Figure 2.2 – The Datadog Agent as a service on the host monitoring microservices
• As a container on the Docker host monitoring application containers: In this
configuration, both the Datadog Agent and the application are run in containers on
the Docker host:
Figure 2.4 – The Datadog Agent communicates with its SaaS backend
In all three configurations and their variations, the Datadog Agent should be able
to connect to the Datadog SaaS backend through the company network and the internet
to upload the metrics collected locally. Therefore, configuring the company network
firewall to enable the traffic going out from the Datadog Agent is a prerequisite for it to
be operational. While this network access is allowed by default in most environments, in
some restrictive situations, configuring the network suitably is a requirement for rolling
out Datadog.
In general, if the application software is deployed as microservices, it is better to
also deploy the Datadog Agent as a microservice. Likewise, in a non-microservice
environment, the Datadog Agent is run directly on the hosts. Maintenance tasks such
as version upgrades are very easy if the agent is deployed as a microservice, which is the
preferred method in a compatible environment.
Before an agent can be installed, you should sign up for a Datadog account. Datadog
allows you to try out most of its features for free for a 2-week trial period. Once you have
access to an account, you will get access to an API key for that account. When an agent
is installed, the API key has to be specified and that's how the Datadog SaaS backend
correlates the agent traffic to a customer account.
For the sample steps, we will use Datadog Agent 7. Older versions are also supported, and
version-specific steps can be found in the official documentation.
On Linux distributions, the installation step involves just one command that can be
executed from the command line. To obtain the command, you can follow these steps on
the Datadog dashboard:
1. Navigate to the Integrations | Agent page.
2. On the left pane, the target operating system can be selected to view the command
that can be used to install the agent on that platform:
Basically, this command sets two environment variables pointing to the Datadog
Agent version to be installed and the API key, downloads the install script, and
executes it.
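For reference, at the time of writing, the one-line install command for Agent 7 resembled the following; the API key is a placeholder, and you should always copy the current command from the dashboard:

```shell
# One-line install for Datadog Agent 7 on Linux (API key is a placeholder).
DD_AGENT_MAJOR_VERSION=7 DD_API_KEY=<YOUR_API_KEY> \
  bash -c "$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script.sh)"
```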
Installation of the Datadog Agent directly on a host machine is as simple as that. We will
see how it can be deployed as a container later.
Once the agent is installed successfully, it will try to connect to the Datadog backend.
The corresponding host will be listed on the dashboard under Infrastructure | Host Map
and Infrastructure List if the agent is able to connect to the backend. This is a method to
quickly verify whether an agent is operational at any time.
On Linux platforms, the logs related to the installation process are available in the
dd_agent.log file, which can be found in the current directory where the install script
is run. If the installation process fails, this log will provide pointers on what has gone wrong.
The agent log files are available under the /var/log/datadog directory.
As mentioned earlier, the steps for installing the Datadog Agent on a specific operating
system can be obtained by navigating to the Integrations | Agent window. The supported
operating systems and platforms are listed on the left pane, and by clicking on the
required one, you can get the steps, as shown in Figure 2.6 for Ubuntu.
Agent components
The Datadog Agent is a service that is composed of multiple component processes doing
specific tasks. Let's look at those in detail to understand the workings of the Datadog
Agent.
On Ubuntu systems, the Datadog Agent service is named datadog-agent. The runtime
status of this service can be checked and maintained using the service system command,
like any other service.
The /etc/datadog-agent directory has all the configuration files related to the
Datadog Agent running on that machine. The /etc/datadog-agent/datadog.yaml
YAML file is the main configuration file. If any change is made to this file, the
datadog-agent service needs to be restarted for those changes to take effect.
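For example, after editing datadog.yaml, the service can be restarted and checked as follows (assuming an Ubuntu host with the agent already installed):

```shell
# Restart the agent so configuration changes take effect, then check its status.
sudo service datadog-agent restart
sudo service datadog-agent status
```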
The /etc/datadog-agent/conf.d/ directory contains configuration files related
to the integrations that are run on that host. We will see the configuration requirements
for integrations and how they are installed in Chapter 9, Integrating with Platform
Components, which is dedicated to discussing integrating Datadog with cloud platform
applications.
There are three main components in the Datadog Agent service:
• Collector: As the name suggests, the Collector collects the system metrics every 15
seconds. The collection frequency can be different for other metric types, and the
Datadog documentation provides that information.
• Forwarder: The metrics collected locally are sent over HTTPS to the Datadog
backend by the Forwarder. To optimize the communication, the metrics collected
are buffered in memory prior to shipping them to the backend.
• DogStatsD: StatsD is a general-purpose metrics collection and aggregation
daemon that runs on port 8125 by default. StatsD is a popular interface offered
by monitoring tools for integrating with external systems. DogStatsD is an
implementation of StatsD by Datadog and it is available as a component of the
Datadog Agent. We will see later in this book how StatsD can be used to implement
lightweight but very effective integrations.
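As a preview, publishing a custom metric to DogStatsD requires nothing more than a UDP datagram in the StatsD format; the metric name and tag in this sketch are illustrative:

```shell
# Publish a custom gauge metric to the local DogStatsD daemon over UDP
# port 8125. The metric name and tag are illustrative.
payload="custom.app.queue_depth:42|g|#env:prod"
echo -n "$payload" > /dev/udp/127.0.0.1/8125 2>/dev/null || true
echo "sent: $payload"
```

Because UDP is fire-and-forget, the sender does not block even if the agent is down, which keeps this kind of instrumentation lightweight.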
Besides these three components, there are optional processes that can be started by
specifying them in the datadog.yaml file:
• APM agent: This process is needed to support the APM feature and it should be run
if the APM feature is used.
• Process agent: To collect details on the live processes running on the host, this
component of the Datadog Agent process needs to be enabled.
• Agent UI: The Datadog Agent also provides a UI component that runs directly
on the host where the Datadog Agent is running. This is not a popular option;
the information about a host is usually looked up on the main dashboard, which
provides complete insight into your infrastructure and the applications running on
it. However, it could be used for ad hoc purposes, for example, troubleshooting on
consumer platforms such as macOS and Windows.
Agent as a container
As mentioned earlier, the Datadog Agent can be installed as a container on a Docker host.
Though the actual options might differ, the following is a sample command that explains
how the Datadog Agent is started up as a container:
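A typical invocation (a sketch assuming Agent 7; the container name is illustrative and the API key is a placeholder) looks like the following:

```shell
# Run the Datadog Agent as a container; mounts give it visibility into the
# host's processes, cgroups, and the Docker daemon.
docker run -d --name dd-agent \
  -e DD_API_KEY=<YOUR_API_KEY> \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  -v /proc/:/host/proc/:ro \
  -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
  datadog/agent:7
```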
The Docker image of Datadog Agent 7 is pulled from Docker Hub in this example.
When the agent is installed directly on the host, you have seen that datadog.yaml
is used to set the configuration items. With a Docker image, that option is not directly
available; instead, any custom changes are made by setting the corresponding environment
variables. In this example, api_key is set by passing the DD_API_KEY environment
variable. In a Kubernetes cluster, the Datadog Agent is installed as a DaemonSet, and that
configuration ensures that an agent container is deployed on every node in the cluster.
DD_API_KEY is specified as a Kubernetes secret. Datadog provides multiple templates
for creating the Kubernetes manifest that can be used for deploying Datadog in your
cluster, and kubectl is used to configure and deploy the Datadog Agent.
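A minimal sketch of that workflow, with illustrative secret and manifest names, might look like this:

```shell
# Store the API key as a Kubernetes secret (the secret name is illustrative).
kubectl create secret generic datadog-secret --from-literal api-key=<YOUR_API_KEY>

# Deploy the agent DaemonSet from a manifest prepared using Datadog's
# templates (the filename is illustrative).
kubectl apply -f datadog-agent.yaml

# Verify that one agent pod is scheduled per node.
kubectl get daemonset datadog-agent
```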
• The Datadog Agent can be baked into the machine image used to boot up a bare-
metal machine or spin up a virtual machine. For example, in AWS, the agent can be
preinstalled and preconfigured for the target environment in the Amazon Machine
Image (AMI).
• Use an orchestration and configuration management tool such as Ansible to
deploy the agent on multiple machines in parallel so that the deployment task
scales operationally.
In a public cloud environment, the preferred method is always using a machine image
because the hosts can be spun up and shut down on demand using features such as
autoscaling. In such scenarios, a semi-automated method such as using Ansible is
not viable. However, Ansible can be used to generate machine images and related
configuration tasks.
The easiest way to start monitoring containers running on the host, in this configuration,
is to enable Docker integration. Though the specific steps to do that could be slightly
different on different target operating system platforms, the following example of enabling
it on Ubuntu 18.04 provides the general steps involved:
1. On the Datadog UI, navigate to Integrations | Integrations, and then search for
Docker:
init_config:

instances:
  - url: "unix://var/run/docker.sock"
    new_tag_names: true
To verify whether the Docker integration works on a host, you can look up the
Containers dashboard from the Infrastructure main menu, as shown in the
following screenshot:
The complete list of Docker-specific metrics is available under the Metrics tab on the
Docker integration page, as shown in the following screenshot:
6. In the output, look for the integration of interest. In this example, we looked at
Docker and Redis, and the following entries for those appear in the output:
docker
------
  Instance ID: docker [OK]
  Configuration Source: file:/etc/datadog-agent/conf.d/docker.d/conf.yaml
  Total Runs: 1
  Metric Samples: Last Run: 21, Total: 21
  Events: Last Run: 0, Total: 0
  Service Checks: Last Run: 1, Total: 1
  Average Execution Time : 63ms
  Last Execution Date : 2020-08-28 07:52:01.000000 UTC
  Last Successful Execution Date : 2020-08-28 07:52:01.000000 UTC

redisdb (3.0.0)
---------------
  Instance ID: redisdb:fbcb8b58205c97ee [OK]
  Configuration Source: file:/etc/datadog-agent/conf.d/redisdb.d/conf.yaml
  Total Runs: 1
  Metric Samples: Last Run: 33, Total: 33
  Events: Last Run: 0, Total: 0
  Service Checks: Last Run: 1, Total: 1
  Average Execution Time : 91ms
  Last Execution Date : 2020-08-28 07:52:08.000000 UTC
  Last Successful Execution Date : 2020-08-28 07:52:08.000000 UTC
  metadata:
    version.major: 2
    version.minor: 8
    version.patch: 23
    version.raw: 2.8.23
    version.scheme: semver
If there are issues, you will see error messages under the related sections.
The deployment strategy involves baking the Datadog Agent and the configuration files
relevant to your environment into a machine image, such as an AMI in AWS. To get all
the relevant metrics published to Datadog from the containers running on a host,
a customized Datadog Agent configuration file and configuration files for the various
integrations, similar to those in the examples, are needed.
Best practices
As you have seen, there are multiple ways to install and configure the Datadog Agent,
and, for someone new to Datadog, it could be daunting to determine how the agent can
be rolled out and fine-tuned efficiently to meet the monitoring requirements. However,
a few best practices stand out; let's summarize them here:
• If the agent is installed on the host, plan to include it in the machine image used to
spin up or boot the host.
• Set up Ansible playbooks or similar tooling to make ad hoc changes to the Datadog
Agent on the host. In some complex infrastructure environments, especially where
bare-metal servers are used, rebuilding machine images is not practical, so such
in-place changes might be needed.
• When containers are to be monitored, plan to deploy the agent also as a container.
• Plan to collect tags from underlying infrastructure components such as Docker and
Kubernetes by suitably configuring the agent.
Summary
The Datadog Agent can be used for monitoring both classic and microservices-based
environments that are built on a variety of cloud platforms and operating systems. To
collect and publish monitoring metrics into its SaaS backend, an agent needs to be run on
the local environment. The agent could be run directly on the host machine, as a container
on a Docker host, or as a service in a microservice orchestration framework such as
Kubernetes. This chapter looked at various configuration options available for deploying
the Datadog Agent and typical use cases.
In the next chapter, we will look at key features of the Datadog UI. Though most of the
changes can be done using APIs, the Datadog UI is a handy tool for both users and
administrators to get a view into Datadog's backend, especially the custom dashboards
that provide visual insights into the state of infrastructure and the application software
system.
3
The Datadog
Dashboard
In the previous chapter, we looked at the various ways the Datadog agents running on
your infrastructure upload monitoring metrics to Datadog's SaaS backend. The Datadog
dashboard provides an excellent view of that data, and it can be accessed using popular
web browsers. The administrators use the dashboard to perform a variety of tasks –
managing user accounts and their privileges, creating operational dashboards, enabling
integrations, and creating monitors are typical examples. The dashboard provides a chat
window for the users to contact customer support as well.
While the dashboard has a large number of features, some of them are more important
than others, and we will study them in detail in this chapter. These are the important
dashboard features:
• Infrastructure List
• Events
• Metrics Explorer
• Dashboards
• Integrations
• Monitors
• Advanced features
Technical requirements
To try out the examples mentioned in this book, you need to have the following tools and
resources:
• An Ubuntu 18.04 environment with Bash shell. The examples might work on
other Linux distributions as well, but suitable changes must be made to the
Ubuntu-specific commands.
• A Datadog account and a user with admin-level access.
• A Datadog Agent running at host level or as a microservice depending on the
example, pointing to the Datadog account.
Infrastructure List
Once an agent is up and running on a host, the agent starts reporting into the
Datadog backend in the cloud. The host will get added to the infrastructure lists once
communication between the agent running on it and the backend is successfully
established.
The infrastructure lists are available underneath the Infrastructure main menu. The most
important ones are Host Map and Infrastructure List:
By clicking on a specific block, you can view monitoring details on the related host such as
the hostname, the tags defined at the host level, various system-level information, and the
list of services being monitored, as shown in the following screenshot:
Events
Most monitoring tools only focus on collecting and reporting time-series metrics data. In
comparison, Datadog reports on events too, and that is one of its attractive features. These
are system-level events such as the restarting of a Datadog agent and the deployment of a
new container:
The Events dashboard can be accessed using the Events main menu option on the
Datadog UI. You can directly add an update to the events stream using this dashboard
and also comment on an already posted event. This social media-inspired feature is
useful in situations where you need to communicate or add clarity about some system
maintenance-generated events to remote teams.
The events listed on the dashboard can be filtered using various options, including search.
Additionally, there is an option to aggregate related events, which is useful in adding
brevity to the events listing.
Metrics Explorer
The metrics collected by Datadog can be viewed and searched for from the Metrics
Explorer dashboard, which can be accessed from the main Metrics menu:
Dashboards
A Datadog dashboard is a visual tool that can be used to track and analyze metrics. It
offers a rich set of features that can be used to build custom dashboards, in addition to the
dashboards available out of the box.
The Dashboards main menu option has two options, Dashboard List and New
Dashboard:
The graphs in a Timeboard dashboard share the same time frame, which makes it useful
for comparing multiple metrics side by side and, therefore, for triaging issues. A
Screenboard dashboard doesn't have that time constraint, and disparate objects can be
assembled on it. In that sense, a Screenboard type dashboard is more suitable for
building a traditional monitoring dashboard.
Now, let's look at how a Timeboard type dashboard is created to explain the dashboard
creation process.
The first step is to set the name and type of the dashboard, as shown in the previous
screenshot.
In the next step, you should see the Add graph option. By clicking on it, you will get the
option to add different types of widgets, as shown in the following screenshot:
There are several options available to narrow down the features you want to implement
for a graph. In the graph shown in the following screenshot, any available metric can be
picked to plot the graph, and the scope can be filtered using the choices available in the
Graph your data | from dropdown. In this example, the system.cpu.user metric has been
selected, and it has been filtered down to a single host, i-021b5a51fbdfe237b:
The configuration options available for sharing a graph are self-explanatory. Essentially, it
will generate a JSON code snippet that can be used to embed the graph on a web page.
Integrations
As shown in the following screenshot, underneath this tab, the Datadog integrations with
third-party tools are available out of the box and can be viewed and searched for:
Additionally, this interface provides a list of metrics that will be available through this
integration. The following screenshot shows the metrics that will be published by the
NGINX integration:
APIs
Underneath the APIs tab on the Integrations dashboard, you can find all of the resources
you need to integrate your application with Datadog. Datadog provides APIs and
gateways to interact with its backend programmatically, and a mature monitoring setup
will leverage such options to integrate with Datadog in order to add custom metrics and
features that might not be available out of the box:
• API Keys: As you saw in the previous chapter, Deploying the Datadog Agent, the
API key is the information that correlates an agent with your account, and it
is a mandatory configuration item that needs to be set in the Datadog Agent
configuration file.
• Application Keys: A script or an application needs to authenticate with the Datadog
backend before it can start executing APIs to do something, such as publishing an
event or a metric. For authentication, an API key-application key pair is needed,
and that can be generated using the dashboard.
• Events API Emails: This is an email gateway through which events can be published
to your account and posted on the Events dashboard.
• Client Token: A client token is used to publish events and ship logs from a web or
mobile application running inside a Real User Monitoring (RUM) setup.
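To make the roles of the two keys concrete, here is a minimal sketch of a direct API call prepared with Python's standard library. The /api/v1/validate endpoint and the DD-API-KEY and DD-APPLICATION-KEY header names come from Datadog's public REST API; validation itself only needs the API key, and the application key is included here just to illustrate where it goes. The request is only built, not sent:

```python
# Sketch: how an API key/application key pair authenticates a direct call
# to the Datadog REST API. The key values are placeholders.
import urllib.request

def build_validate_request(api_key: str, app_key: str) -> urllib.request.Request:
    """Prepare (but do not send) a request that checks whether an API key is valid."""
    return urllib.request.Request(
        "https://api.datadoghq.com/api/v1/validate",
        headers={
            "DD-API-KEY": api_key,          # identifies the organization
            "DD-APPLICATION-KEY": app_key,  # identifies the user
        },
    )

req = build_validate_request("<DD_API_KEY>", "<DD_APPLICATION_KEY>")
```

Sending the prepared request with `urllib.request.urlopen(req)` would return a small JSON body indicating whether the key is valid.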
Agent
Underneath the Agent tab, you can find all the information needed to install a Datadog
agent for a target platform. Usually, a command line to install the agent is available on this
page.
Embeds
We have learned how a new dashboard can be created and shared in the Dashboards
section. A dashboard can also be embedded on a web page using the related JSON code.
Under this option, the shared dashboards are listed as follows:
Monitors
We already discussed the basic concepts of monitors in Chapter 1, Introduction to
Monitoring. Datadog provides an advanced implementation of monitors with a variety of
features. The monitors can be created and maintained from the dashboard manually. The
following screenshot provides a list of options available under this menu item:
• Manage Monitors: Under this tab, all the available monitors are listed, and
monitors can be selected from that list for update.
• Triggered Monitors: Essentially, a monitor is defined by setting thresholds on a
metric, and it triggers when a threshold value is reached. Under this menu, the
resulting triggered events are listed.
• New Monitor: A new monitor can be created manually by following the workflow
provided here. Although a monitor is usually tied to a metric, Datadog provides a
variety of monitors. We will discuss this, in more detail, in Chapter 8, Monitors and
Alerts. However, in the next section, we will learn how a simple monitor can be set
up to help you to gain an understanding of the basics.
• Manage Downtime: During a downtime window, a monitor will not trigger even
if it hits a threshold. The classic use of downtime is to silence the monitors during
some maintenance or deployment activity. Under this tab, downtimes can be
scheduled.
We have looked at all the common options that you can use on the Datadog UI. The
interface offers many more options covering the full set of features offered by
Datadog. In the following section, we will briefly look at the advanced features that
you can access from the Datadog UI.
Advanced features
There are many more menu options in the Datadog dashboard than the ones we have
looked at so far. However, the options already covered are the core features that you
will use on a regular basis, especially if you are a DevOps engineer responsible for
maintaining the Datadog account. Let's take a look at some of the remaining options
and learn what exactly each covers:
• Security: The Security Monitoring option can be accessed from this menu. We will
discuss this, in more detail, in Chapter 14, Miscellaneous Monitoring Topics.
• UX Monitoring: Synthetic tests allow you to monitor the User Experience (UX)
in a simulated manner, using different methods. We will discuss this feature in
Chapter 13, Managing Logs Using Datadog.
• Real User Monitoring: RUM measures the actual UX directly with the help of
probes embedded in the application. This is the type of last-mile monitoring that we
discussed in Chapter 1, Introduction to Monitoring.
We have looked at the menu options available on the Datadog dashboard that relate
to important Datadog features. Familiarizing yourself with these menu options is
important because the dashboard is the main interface through which users interact
with the Datadog application on a regular basis.
Summary
In this chapter, we looked at the Datadog dashboard menu options that are central to how
users interact with the Datadog backend. A lot of features available on the dashboard,
such as the creation and maintenance of monitors, can be automated. However, the
dashboarding feature in Datadog is one of the best in the industry, and creating custom
operational and management reporting dashboards is very easy. The features we discussed
here can be used by a Datadog user, especially an administrator, on a regular basis.
In the next chapter, we will learn how to manage your Datadog account. Again, the
dashboard is the main interface used by administrators for this purpose.
4
Account
Management
In the last chapter, we looked at the main features of the Datadog user interface. The
account management features are also part of the user interface, but they warrant a
separate discussion because account management is administrative in nature and not
all Datadog users will access those features.
Datadog supports Single Sign-On (SSO) and it provides key-based API support. It also
supports multiple organizations within a customer account that can be used to build
isolated monitoring environments, sometimes to meet a compliance requirement for
separating development and production accounts, or for isolating client accounts hosted
by a SaaS provider.
When a user is created, the privileges for that user can be set using the default roles
available out of the box, or more finely using custom roles that an administrator can
set up. Within a Datadog account, multiple organizations can be created to group or
partition the environments that need to be monitored. While Datadog provides a native
authentication system for users to log in, it also supports SSO. API and application key
pairs can be created for Datadog users, and a key pair can be used to access Datadog
programmatically.
This chapter covers all such account management features that are available on the
Datadog UI and specifically the following:
• Managing users
• Granting custom access using roles
• Setting up organizations
• Implementing Single Sign-On (SSO)
• Managing API and application keys
• Tracking usage
Technical requirements
There are no technical requirements for this chapter.
Managing users
After Datadog is rolled out into an environment that mainly involves deploying the
agent, the users can be given access to Datadog interfaces that provide insight into the
infrastructure and the application software system. Though users can get API access to
Datadog, they primarily use the UI to consume the information available.
A user can be added by sending out an invitation by email, which could be done from the
main menu item Team on the Datadog dashboard.
The following screenshot is a sample Team window:
Users can be invited using a form available at the bottom of the Team window, as
shown here:
• Datadog Admin Role: As the name indicates, a user with this role is an administrator
and that user can manage other users and has access to billing information.
• Datadog Standard Role: A user with this role has full access to various
Datadog features.
• Datadog Read Only Role: As the name suggests, a user with this role has
read-only access to various features. Typically, this is the access assigned to most
users in Datadog organizations that cover production environments.
As you have seen in this section, access for a user can be granted using one or more
predefined roles. It is also possible to define user access more finely using custom
roles, so let's see how that can be accomplished.
Granting custom access using roles
While assigning one or more predefined roles might be enough in most access
management use cases, users can be given more specific privileges using custom roles.
Custom roles are set up by an administrator and can then be assigned to users.
Here are a few steps you need to follow to create a custom role:
Important note
An existing custom role can be updated following similar steps as well.
In this section, we have seen how user access can be finely defined using custom roles
if predefined roles don't serve the access control requirements. The resources and users
can be grouped under an organization in a Datadog account so let's see how that could
be configured.
Setting up organizations
It is possible to set up multiple organizations in the same Datadog account. This
multi-organization feature is not enabled by default, and you must request it from
customer support. Multiple child organizations are useful in cases where monitoring
needs to be isolated for parts of the infrastructure and the applications running
on them.
An organization can be tied to a sub-domain for better identification. This feature
also needs to be enabled by customer support. Once that is available, you can access an
organization using a specific URL such as https://prod.datadoghq.com. In this
case, the sub-domain prod points to an organization.
Now, let's see how an organization can be set up:
1. After customer support enables the multi-organization feature, you will be able to
see the New Organization feature as in the following screenshot:
3. The Datadog UI will switch to the new organization upon its creation. You can
switch between organizations by just clicking on the corresponding label under the
SWITCH ORGANIZATION section, as shown in the following screenshot:
Let's look at a few key single sign-on features that Datadog provides:
Though the details can be different, a SAML-based SSO setup would require these
generic steps:
1. Set up the third-party SSO provider as the SAML identity provider (IdP). This
is done on the third-party side. Download the IdP metadata file after completing
this setup.
2. Using the Configure SAML menu option on the Datadog UI, configure SSO mainly
by uploading the IdP metadata created in the last step:
Identity Provider (IdP)-initiated: In this case, the login is initiated by the identity
provider, the third-party platform. With this option, a Datadog user starts the
login process from a dashboard provided by the third party, having already logged
into the third-party platform.
Service Provider (SP)-initiated: A Datadog user initiates the login process
using a Datadog URL such as https://app.datadoghq.com/account/
login, which takes the user to the third-party platform for authentication.
Now let's see how SSO is configured with Google as the IdP provider:
1. Google provides steps to generate the IdP metadata for Datadog. Follow the steps
available at https://support.google.com/a/answer/7553768 to
generate it in the Google Admin app and download the IdP metadata file.
2. In the Datadog UI, upload the IdP metadata file from the Configure SAML page as
in the following screenshot:
3. Once the IdP metadata file is uploaded and loaded, click the Enable button to
enable SAML as in the following screenshot:
5. For the preceding workflow to succeed, the user needs to already be set up in
Datadog. Datadog provides an option called Just-in-Time Provisioning, which is
essentially the automatic creation of users whose logins originate from whitelisted
domains. This avoids the need to provision a user explicitly in Datadog. The
configuration can be done as in the following screenshot:
The API key is associated with the organization and the application key is associated with
a user account.
The keys can be set up from the Integrations | APIs page:
from datadog import initialize

options = {
    'api_key': '<DD_API_KEY>',
    'app_key': '<DD_APPLICATION_KEY>'
}
initialize(**options)
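Once a key pair is initialized, scripts can publish events, metrics, and other resources programmatically. As an illustration of what such a call carries, here is a sketch of the JSON body for posting an event, assuming the field names of the public v1 events endpoint; the title, text, and tags are placeholders:

```python
# Sketch of the JSON body behind a programmatic event post.
# Field names follow the public v1 events endpoint; values are placeholders.
import json
import time

def event_payload(title, text, tags=None, priority="normal"):
    return {
        "title": title,
        "text": text,
        "tags": tags or [],
        "priority": priority,               # "normal" or "low"
        "date_happened": int(time.time()),  # event timestamp (epoch seconds)
    }

body = json.dumps(event_payload("Deployment started",
                                "Rolling out app1 v2.3",
                                tags=["env:prod", "team:sre"]))
```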
There are other, less commonly used options available to access Datadog indirectly for
retrieving information and publishing metrics:
• Events API Email: On the APIs page, you can also set up an @dtdg.co email
account to publish events by sending emails to it.
• Client Token: A client token is used to instrument web and mobile applications
for them to send logs and events to the Datadog backend. This is primarily used to
gauge the last-mile user experience.
Datadog provides various options to track your usage of the Datadog service, so let's
look at the important features related to that in the next section.
Tracking usage
Like other SaaS services, the use of Datadog tends to be elastic as it's typically deployed on
infrastructure provisioned in public clouds.
Important note
Please note that Datadog could be deployed in a classic, bare-metal
environment as well, but such use cases are becoming less common.
As the billing is tied to on-demand usage, it's important for an administrator to track the
usage pattern:
1. Use the Plan & Usage menu option under your user profile to get to the details of
the plan you have subscribed to and your usage of Datadog as a service, as shown
in the following screenshot:
Figure 4.15 – Plan & Usage page with details of the plan
3. Under the Usage tab, the details of Datadog usage can be viewed as in the
following screenshot:
From this page, you can find out how many machines Datadog has been monitoring over
a specific period of time, which has a bearing on the billing. Also, you can understand
how many containers Datadog has been monitoring:
• Under the Billing History tab, as the name suggests, the billing history and related
receipts can be viewed and downloaded.
• Under the Multi-Org tab, you can view aggregated usage and related trends. This
is very useful if you have partitioned your use of Datadog into multiple
organizations. For example, if you navigate to Multi-Org Usage | Long-Term
Trends, you can see various usage trends, as shown in the following screenshot:
Figure 4.17 – Overall usage trend charts under the Multi-Org tab
4. Under the Multi-Org Usage | Month-to-Date Usage tab, you can also see a
summary of itemized usage per organization as in the following screenshot:
We have looked at multiple features and options available for managing the Datadog
account set up for a company and its Datadog users. Now let's look at the best practices
for managing a Datadog account.
Best practices
Here are some of the best practices related to managing accounts and users:
Summary
Account and user management is an administrative task and regular users will not get into
it except for tasks such as creating application keys. The multi-organizations feature can be
used to partition the account into logical units, and we looked at sample scenarios where
they are useful. Datadog supports industry-standard SSO options using SAML and we
learned how to implement those. There are multiple methods available for programmatic
access and the primary mechanism is using a key pair, and we went over the steps for
generating them. Fine-grained access to Datadog for a user can be implemented using
custom user roles and we learned the steps to configure those for a user.
This is the last chapter in the first part of the book, which provided an overview of
monitoring in general and an introduction to Datadog and getting started with it
in particular. In the subsequent parts and chapters of the book, we will discuss how
monitoring features are implemented and used in Datadog in detail. In the next chapter,
we will learn more about metrics, events, and tags – three important concepts that
Datadog heavily depends on.
5
Metrics, Events,
and Tags
In the first part of the book, we discussed general monitoring concepts and how to get
started with Datadog, including installing the Agent and navigating the main menu
options on the Datadog UI, the main interface available to the end users. In the first
chapter, we also discussed metrics as a central concept in all modern monitoring
applications, being used to measure the health and the state of software systems. Also, we
looked at tags as a means to group and filter metrics data and other information such as
events generated by the monitoring systems, especially Datadog.
In this chapter, we will explore in detail how metrics and tags, two important constructs
that Datadog heavily depends on, are implemented. Metrics are the basic entities that
are used for reporting monitoring information in Datadog. Datadog also uses tags to
organize metrics and other types of monitoring information such as events and alerts. It
is worth discussing metrics and tags together, as appropriate tagging of metrics is
essential to making sense of the large volume of metrics time series data that will be
available in a typical Datadog account.
While metrics, a central concept in any monitoring system, help to measure the health
of a system continuously, an event captures an incident that occurs in a system. The
crashing of a process, the restarting of a service, and the reallocation of a container are
examples of system events. A metric is measured at a specific time interval and has a
numeric value associated with it, but an event is not periodic in nature and provides only
a status. Tagging can be used to organize, group, and search for events, just as it is used
with metrics.
In this chapter, we will discuss metrics, events, and tags in detail. Specifically, we will cover
these topics:
Technical requirements
To try out the examples mentioned in this book, you need to have the following
tools and resources:
- A Datadog account and a user with admin-level access.
- A Datadog Agent running at host level or as a microservice depending on the example,
pointing to the Datadog account.
The preceding example demonstrates how metrics and tags work in tandem to draw
insights from a vast cache of time series metrics data published to Datadog. The metrics
as implemented in Datadog are far more elaborate than the basic concepts we discussed in
Chapter 1, Introduction to Monitoring; let's look at some of the concepts related to metrics
as they are defined in Datadog.
Metric data
A metric data value is a measure of a metric at a point in time. For example, the
system.memory.free metric tracks the free memory available on a host. Datadog
measures that metric and reports the value periodically, and that value is the metric data
value. A series of such measurements generates a time series data stream, as seen plotted
in the example in Figure 5.1.
Metric type
The main metric types are count, rate, and gauge, and they differ in terms of how metric
data points are processed to publish as metric values:
• Count: The metric data received during the flush time interval is added up to be
returned as a metric value.
• Rate: The total of metric data received during the flush time interval is divided by
the number of seconds in the interval to be returned as a metric value.
• Gauge: The latest value received during the flush time interval is returned as the
metric value.
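The differences between the three types can be sketched in a few lines, assuming a flush interval measured in seconds:

```python
# Sketch of how the three main metric types aggregate the data points
# received during one flush interval.
def aggregate(points, metric_type, interval=10):
    if metric_type == "count":
        return sum(points)             # total of all points in the interval
    if metric_type == "rate":
        return sum(points) / interval  # total normalized per second
    if metric_type == "gauge":
        return points[-1]              # the latest value wins
    raise ValueError(metric_type)

points = [1, 2, 3]
# count -> 6, rate -> 0.6 (over a 10-second interval), gauge -> 3
```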
Metric unit
A metric unit group is a set of similar measurement units. For example, the bytes group
covers storage and memory measurements and consists of units such as bit, byte, kibibyte,
mebibyte, gibibyte, and so on. The time group consists of units from a nanosecond to a
week. A full list of metric units is available in the Datadog documentation (https://
docs.datadoghq.com/developers/metrics/units/) for reference.
Query
A query returns values from a time series metric dataset, for the given filter conditions and
a time window. For example, the time series data related to the system.memory.free
metric will be reported for all the machines where the Datadog Agent runs.
If you want to monitor the metric only for a specific host during the last one hour, the
Datadog-defined host tag can be used to narrow down the machine to a specific host,
and you can also specify a time range. The ways to specify the query parameters would
depend on the interface that you would use to run the query. For example, you have seen
how a query works in the Metrics Explorer window on the Datadog dashboard where
those parameters are specified visually.
Datadog identifies the following parts in a query:
• Scope: By specifying the appropriate tags, the results of a query can be narrowed
down. In the last example, the host tag set the scope to a single host.
• Grouping: A metric such as system.disk.free would have more than one
time series dataset for the same host, as one would be generated for each disk, and
typically there would be multiple disks on a host. Assume that you are interested
in monitoring the total disk space available on all the web hosts. If those hosts are
tagged as web hosts using a custom tag such as host_type having the value web,
then it could be used to set the scope and the host tag could be used for grouping.
• Time aggregation and rollup: To display the data returned by a query, the data
points are aggregated so that they fit the display. Datadog returns about 150 data
points for a time window. If the time window is large, the data points are aggregated
into time buckets, and that process is called rollup.
The metrics that are available in Datadog out of the box are generated by integrations
that are activated as part of installing the Datadog Agent. The main integrations in this
category that provide metrics for infrastructure monitoring are the following:
A. System: Generates CPU-, load-, memory-, swap-, IO-, and uptime-related
metrics, identified by patterns such as system.cpu.*, system.load.*, and
system.mem.*
B. Disk: Generates disk storage-related metrics, identified by the system.disk.* pattern
C. Directory: Generates directory- and file-related metrics, identified by the
system.disk.directory.* pattern
D. Process: Generates process-specific compute metrics, identified by the
system.processes.* pattern
The integration pages on the Datadog UI provide a complete list of metrics available
through these integrations. The metadata associated with a metric can be viewed and
some of the settings can be edited in the Metrics Summary window:
• Navigate to Metrics | Metrics Summary, where all the metrics are listed. A specific
metric can be searched using its name or tags associated with it. Look at the
following screenshot for the example of searching for Redis metrics:
Figure 5.2 – Searching for Redis metrics in the Metrics Summary interface
• The metadata of a specific metric can be viewed by clicking on any metric of
interest. In the following screenshot, the summary of the redis.mem.used
metric is provided:
Defining tags
The following are the main rules and best practices for defining a tag:
• A tag should start with a letter and can contain the following characters:
alphanumeric, underscores, minuses, colons, periods, and slashes.
• A tag should be in lowercase letters. If uppercase letters are used, they will be
converted into lowercase. The auto-conversion of the case can create confusion, and
so it is advisable to define tags only in lowercase.
• A tag can be up to 200 characters long.
• The common pattern of a tag is in <KEY>:<VALUE> format, which prepares a
tag to be used in different parts of a query as we saw earlier. There are reserved
keywords for tags, such as host, process, and env, that Datadog associates
special meanings with, and their use should be in line with the related meanings.
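The rules above can be sketched as a small normalization helper. The lowercase conversion and 200-character cap follow the rules as stated; replacing disallowed characters with underscores is an assumption made here for illustration:

```python
# Sketch of the tag rules: lowercase the tag, substitute characters outside
# the allowed set (alphanumerics, underscores, minuses, colons, periods,
# slashes), and cap the length at 200. The underscore substitution is an
# assumption for illustration.
import re

DISALLOWED = re.compile(r"[^a-z0-9_\-:./]")

def normalize_tag(tag: str) -> str:
    return DISALLOWED.sub("_", tag.lower())[:200]

normalize_tag("Env:Prod")   # lowercased to "env:prod"
normalize_tag("team name")  # the space becomes "team_name"
```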
Tagging methods
Datadog offers multiple ways to tag metrics; the two most common are adding tags in
the main Agent configuration file and adding them in the configuration files of individual
integrations. Both of these methods are applied at the Datadog Agent level. Tagging can
also be done from the Datadog UI, using the Datadog API, and as part of a DogStatsD
integration.
In the Datadog Agent config file, datadog.yaml, the config item tags can be used to add
tags as follows:
tags:
- "KEY1:VALUE1"
- "KEY2:VALUE2"
Tags can be added in the same format in the configuration file of an individual
integration:
tags:
- "KEY1:VALUE1"
- "KEY2:VALUE2"
For example, for the Redis integration, this change is done in
/etc/datadog-agent/conf.d/redisdb.d/config.yaml.
In addition to tags applied from the configuration files, some integrations also supply tags
by inheriting them from the source applications and platforms. The best example is the
extraction of AWS tags by Datadog integrations for related AWS services.
The use of tagging goes beyond just filtering metrics. The tags could also be applied to
other Datadog resources for filtering and grouping, as is done with metrics data. The
following are the main Datadog resources that could be tagged:
• Events
• Dashboards
• Infrastructure: mainly hosts, containers, and processes
• Monitors
The use of tags in Datadog is widespread, and the preceding list covers only the important
resources. When the count of a specific resource type such as events is very large, it's
impossible to trace a specific instance without the help of a tag. For example, the events
associated with a specific type of host could be pulled by tagging those events suitably.
Wherever there is a need to look for a subset of information, tags are used as the keywords
to search for related resources and metrics data. That means the monitoring data that
has been published to Datadog must be tagged well so that subsets of that data can be
extracted easily.
Datadog and the various third-party product integrations it provides generate the bulk of
the metrics that we use in real life. In the next section, we will discuss custom metrics and
how they can be published to Datadog.
Custom tags can be associated with the preceding set of metrics by defining them in
the relevant configuration files. Some integrations also offer to inherit tags from source
applications and platforms.
In addition to this group of metrics that are available out of the box or that can be enabled
easily, there are multiple options available in Datadog to define custom metrics and tags.
This is one of the features that makes Datadog a powerful monitoring platform that can be
fine-tuned for your specific requirements.
• Name: A name should not be more than 200 characters long, should begin with a
letter, and should contain only alphanumeric characters, underscores, and periods.
Unlike tags, metric names are case-sensitive.
• Value and timestamp: The metric value is published with a timestamp.
• Metric type: As discussed earlier, the main types are count, rate, and gauge.
• Interval: This sets the flush time interval for the metric.
• Tags: The tags specifically set for this metric.
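The attributes listed above map almost one-to-one onto the body of a metric submission. As a sketch, assuming the field names of the public v1 series endpoint and a hypothetical metric name, the payload looks like this:

```python
# Sketch: building the body of a custom metric submission to the v1 series
# endpoint. Field names follow the public API; the metric name and tag
# values are placeholders.
import json
import time

def series_payload(name, value, metric_type="gauge", interval=None, tags=None):
    point = [int(time.time()), value]  # value paired with its timestamp
    series = {
        "metric": name,                # <=200 chars, case-sensitive
        "points": [point],
        "type": metric_type,           # count, rate, or gauge
        "tags": tags or [],
    }
    if interval is not None:
        series["interval"] = interval  # flush time interval in seconds
    return {"series": [series]}

body = json.dumps(series_payload("mycompany.app1.queue_depth", 42,
                                 tags=["env:dev"]))
```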
• Datadog Agent checks: The Datadog Agent can be configured to run custom
scripts as checks, and that interface can be leveraged to publish custom metrics. We
will learn more about implementing custom checks in Chapter 8, Integrating with
Platform Components.
• Datadog REST API: Datadog resources can be viewed, created, and managed using
the REST API Datadog provides. The custom metrics and tags can be handled that
way as well. This is mainly useful when Datadog is not running close to a software
system that needs to be monitored using Datadog. We will learn how to use the
Datadog REST API in Chapter 9, Using the Datadog REST API.
• DogStatsD: This is a Datadog implementation of the StatsD interface available for
publishing monitoring metrics. We will discuss this more in Chapter 10, Working
with Monitoring Standards.
• PowerShell: This provides an option to submit metrics from Microsoft platforms.
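Of these options, DogStatsD is the lightest-weight: a client sends a plain-text datagram over UDP (port 8125 by default). As a sketch of that wire format, with a placeholder metric name and tags, a gauge submission can be assembled like this:

```python
# Sketch of the DogStatsD plain-text datagram format:
#   metric.name:value|type|#tag1:val1,tag2:val2
# where "g" marks a gauge and "c" a count. The name and tags are placeholders.
def dogstatsd_datagram(name, value, metric_type="g", tags=None):
    datagram = f"{name}:{value}|{metric_type}"
    if tags:
        datagram += "|#" + ",".join(tags)
    return datagram

dogstatsd_datagram("mycompany.app1.queue_depth", 42, tags=["env:dev"])
# -> "mycompany.app1.queue_depth:42|g|#env:dev"
```

In practice, a client would send this string over a UDP socket to the local Agent; the fire-and-forget nature of UDP keeps instrumentation cheap.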
The Datadog UI has a dashboard for the event stream, and we will see how the events
are listed on it and what event details are published. In a large-scale environment, the
event stream volume can be quite large, and you may have to search for events of interest.
While events are largely informational, it might make sense to be notified about certain
categories of events, such as the stopping of an important service. Events are usually
generated by Datadog, but Datadog also offers the feature to generate custom events as an
integration option.
Let's see how Datadog handles events out of the box using its event stream dashboard.
• By source: On the left pane of the dashboard, the FROM section lists the sources
of the events.
• By priority: On the left pane of the dashboard, the PRIORITY section lists
the priorities.
• By severity: On the left pane of the dashboard, the STATUS section lists the
severity of the events.
By selecting one or more event types described here, the events listed on the event stream
dashboard can be filtered to see the specific ones you need to look at. The related events
are aggregated if the Aggregate related events option is checked on the dashboard as in
the following screenshot:
Figure 5.5 – The Aggregate related events option on the event stream dashboard
In the next section, we will see how specific events can be located on the event
stream dashboard.
Searching events
In a large-scale environment where the Datadog Agent runs on several hundred hosts
monitoring a variety of microservices and applications running on them, there will
be scores of events published to the event stream dashboard every minute. In such a
situation, manually looking through the event stream is not viable, and standard search
methods have to be used to locate events of interest. Datadog provides multiple options
for searching and filtering to get the correct set of events.
As we have seen in the previous section, you can specify a time window in the Show field
to look at events only from a specific time period. As shown in the following screenshot,
the time window can be specified in a variety of ways:
There is a full text search option available, which you can use to look for events using
keywords. In the following example, the ntp keyword is used to list only the related events:
• Filters: In the previous section, we saw that the event sources are listed under the
FROM section on the dashboard. They can also be specified in a query using the
format sources:<source_name_1>,<source_name_2>, in which case the search is
run for events from source_name_1 or source_name_2. Similarly, you can specify
tags, hosts, status (error, warning, or success), and priority (low or normal) as
filters in the query.
• Boolean operators: The OR, AND, and NOT Boolean operators can be used to
combine basic search conditions. See the following examples:
ntp OR nagios will search for either one of the ntp and nagios keywords.
tags:region:us-west-2 AND environment:production will
look for events tagged with the key values region:us-west-2 and
environment:production.
* NOT "ntp" will list all the events that do not contain the ntp keyword.
Now let's see how we can get notified of events.
As you can see, an event notification can be sent to individual emails, all, and Datadog
Support. All the Datadog users in the organization will be notified of the event if all is
chosen. A support ticket will be created if Datadog Support is selected.
The event stream works with the following integrations, through which event details
can be forwarded to the related tools:
• Slack: Post a notification to a Slack channel by specifying the address in this format:
@slack-<slack_account>-<slack-channel>.
• Webhook: Send a notification to a webhook by just specifying @webhook_name.
(webhook_name must be configured in Datadog for this to work, and we will see
how to do this in Chapter 9, Using the Datadog REST API.)
• PagerDuty: By specifying @pagerduty, the event details can be forwarded as an
alert to PagerDuty.
In the next section, we will see how custom events are created and added to the
event stream.
Generating events
Thus far, we have discussed how to view, search, and escalate events that are generated by
Datadog. As with metrics and tags, custom events can also be created in Datadog. A
primary use of this feature is to turn the event stream dashboard into a communication
hub for Site Reliability Engineering (SRE) and production engineering teams.
A common scenario is deploying to an environment monitored by Datadog. During
deployment, multiple services and processes will be restarted, and there will be a large
volume of events. Posting a deployment as an event with details might help to alleviate the
concerns of any other teams monitoring the environment but not actively participating in
the deployment process.
An event can be posted from the event stream dashboard as a status update. Look at the
following example:
We will learn how to use the Datadog API in detail in Chapter 9, Using the Datadog REST
API. Posting an event to the event stream using the Datadog API will be done there as
an example.
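As a preview, here is a sketch of what such a deployment event payload might look like. The title, text, tags, and alert_type field names follow the public v1 events API; the tag scheme and message wording are illustrative assumptions:

```python
def deployment_event(service: str, version: str, env: str) -> dict:
    """Build a payload for Datadog's events endpoint (POST /api/v1/events).

    The service/version/env tag scheme shown here is an example
    convention, not a Datadog requirement.
    """
    return {
        "title": f"Deployment started: {service} {version}",
        "text": (
            f"Deploying {service} {version} to {env}. "
            "Service restarts and related events are expected "
            "until this completes."
        ),
        "tags": [f"service:{service}", f"version:{version}", f"env:{env}"],
        # "info" avoids triggering alert-colored entries in the stream.
        "alert_type": "info",
    }
```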
Now, let's look at the best practices around defining and using metrics, events, and tags
in Datadog.
Best practices
The following are the best practices related to creating and maintaining metrics and tags:
• Make sure that all the tags available through various integrations are enabled. This
mainly involves inheriting tags from source platforms such as public cloud services,
Docker, and Kubernetes. When complex applications are part of your environment,
it is better to have more tags to improve traceability.
• Add more tags from the Datadog Agent and integrations to partition your metrics
data easily and to track the environment, services, and owners efficiently.
• Have a namespace schema for your custom metrics using periods so that they can
be grouped and located easily – for example, mycompany.app1.role1.*.
• Format the names and values of metrics and tags according to the guidelines.
Datadog silently makes changes to names and values if their format is not
compliant. Such altered names could cause confusion on the ground as they will be
different from the expected values.
• If some incident happens and not much information is available immediately from
monitoring alerts, query the event stream for clues.
• Post custom messages from the dashboard that might help with some ongoing activity,
such as a deployment, especially if you operate in a distributed team environment.
• Consider posting events to the event stream from automation processes if they will
have some impact on the resources that Datadog monitors.
• Use the notification option to raise tickets with Datadog Support if an event
warrants any digging by them.
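As a sketch of the namespace guideline above, here is a helper that assembles period-delimited metric names. The normalization shown (lowercase, non-alphanumerics to underscores) is a simplified assumption, not Datadog's exact rules; as noted, Datadog silently alters non-compliant names on its side:

```python
import re

def metric_name(*parts: str) -> str:
    """Join namespace components into a period-delimited metric name,
    for example mycompany.app1.role1.request_latency.

    Each component is lowercased and non-alphanumeric characters are
    replaced with underscores (a simplified stand-in for the official
    naming guidelines).
    """
    cleaned = [re.sub(r"[^a-z0-9_]", "_", part.lower()) for part in parts]
    return ".".join(cleaned)
```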
Summary
Metrics are central to a modern monitoring system and Datadog is no exception. Datadog
offers a rich set of options to generate and use metrics. Though tags have general-purpose
uses across all Datadog resources, their main use is to group and filter massive amounts of
metrics data.
In this chapter, we learned how metrics and tags are defined and how metrics and other
resources are grouped and searched for using tags. Besides Datadog-defined metrics and
tags, there are multiple ways to add custom metrics and tags, and some of those methods
were discussed in this chapter.
Technical requirements
To try out the examples mentioned in this book, you need to have the following tools
installed and resources available:
By choosing one or more tags in the Filter by field, only the related hosts will be listed, as
shown in the following screenshot:
In this case, the tag that was selected to group the hosts is metadata_platform. Here,
we can see that the five Linux hosts are grouped together and that the lone Mac (darwin)
host is shown separately from the Linux group.
Important note
The Group hosts by tags option only uses the key of a tag, but the Filter by
option requires its value also.
You have already seen that there are multiple ways to list and locate a host on the Host Map
dashboard. If you would like to look at a specific host, you can click on it, which will provide
you with a zoomed-in version of it in another window containing host-specific details:
• Apps: The links to details about the integrations that have been implemented for
this host are listed in the hexagonal shape.
• Agent: The version of the agent.
• System: The operating system-level details are provided here, which can be drilled
down into multiple levels.
• Container: If containers are running on the host, the related information will be
available here.
• Metrics: Host-level metrics that were published recently are listed here.
• Tags: The host-level tags are listed here. While the host tag pointing to the name
of the host is available by default, more host-level tags can be set using the tags
option in the Datadog agent configuration file.
At the top of this interface, the host name and host aliases are provided. Also, links to the
following dashboards with some relation to the host are available at the same location:
• Dashboard: This is a dashboard that's specific to the host with charts on important
host-level metrics. We will look at this dashboard soon.
• Networks: The dashboard for Network Performance Monitoring. This feature
needs to be enabled in the Datadog agent for this to be available.
• Processes: If the Live Processes feature is enabled from the Datadog agent, this
dashboard will be populated with details about the processes running on the host.
We will look at it in detail in the System processes section.
• Containers: This link will lead you to the dashboard where containers running on
the host are listed.
Now, let's look at some of these detailed dashboards that we can access from the Host
Map zoom-in window.
By navigating to Apps | Agent, you can pull up a dashboard with Datadog agent-specific
metrics listed, as shown in the following screenshot:
By navigating to Apps | system, you will see a dashboard containing charts about the
core infrastructure metrics for the host, such as CPU usage, load averages, disk usage and
latency, memory availability and usage, and network traffic. The following screenshot
provides such a sample dashboard:
For each category of system metrics, a widget is available on the host dashboard with
related metrics charted. Using the menu options available in the top-right corner, it's
possible to drill down further into these widgets.
By clicking on the left button, the widget can be viewed in full-screen mode with more
options. The middle button pulls up a menu with options that are related to managing the
chart. The right button provides details on the metrics that are being used on the chart. A
screenshot of these menus, with the middle button highlighted, is as follows:
CPU usage
By hovering over the CPU usage chart, the related metrics can be displayed on the
dashboard, as shown in the following screenshot:
These are the main CPU usage-related metrics and their meanings:
By tracking these metrics, you will get an overall picture of CPU usage on the host,
and by building monitors around them, the tracking can be automated.
Load averages
The load on a host that runs a UNIX-based operating system such as Linux or macOS
is measured in terms of the number of processes that are running or waiting to run.
Usually, the load is tracked as the average of that measurement over the last
1 minute, 5 minutes, and 15 minutes. That's why a UNIX command such as uptime
returns three measurements for load averages, as shown in the following example:
$ uptime
12:18 up 2 days, 3:29, 4 users, load averages: 1.45 1.71 1.76
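The three load-average figures in output like this can be extracted with a short sketch; it handles both the macOS "load averages:" and Linux "load average:" spellings:

```python
import re

def parse_load_averages(uptime_output: str) -> tuple:
    """Extract the 1-, 5-, and 15-minute load averages from `uptime` output.

    Accepts both "load averages: 1.45 1.71 1.76" (macOS) and
    "load average: 1.45, 1.71, 1.76" (Linux).
    """
    match = re.search(
        r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)",
        uptime_output,
    )
    if not match:
        raise ValueError("no load averages found in uptime output")
    return tuple(float(value) for value in match.groups())
```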
The load averages data for the host are plotted on the Load Averages chart on the host
dashboard, as shown in the following screenshot:
By hovering over the chart, the averages for 1-minute, 5-minute and 15-minute loads can
be displayed, as shown in the preceding screenshot. The important metrics in this category
are the following, all of which have self-documented names:
In this section, we looked at system load-related metrics. Now, let's look at swap
memory metrics.
Available swap
Swap space on a Linux host allows the inactive pages in memory to be moved to disk
to make room in the RAM. The following system metrics are plotted on this chart on
the host dashboard:
The swap memory metrics can be viewed on the dashboard, as shown in the following
screenshot:
There are more swap-related metrics that are published by Datadog besides those being
used on this chart, and they could be used on custom charts and monitors as well.
Disk latency
The disk latency of a storage device is determined by the delay in getting an I/O request
processed, and that also includes the waiting time in the queue. This is tracked by the
system.io.await metric:
Memory breakdown
Though Datadog publishes a large set of metrics related to the available memory and its
usage on the host, this widget charts only two metrics: system.memory.usable and
a diff of system.memory.total and system.memory.usable, which provides an
estimate of the reserved memory:
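The reserved-memory estimate described above is a simple subtraction, sketched here for clarity:

```python
def reserved_memory(total_bytes: float, usable_bytes: float) -> float:
    """Estimate reserved memory as charted in the Memory breakdown widget:
    system.memory.total minus system.memory.usable."""
    return total_bytes - usable_bytes
```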
In this section, we have looked at important memory-related metrics. In the next section,
we'll look at how disk storage usage can be monitored using the related metrics.
Disk usage
Tracking the use of storage on disks attached to a host is one of the most common aspects
of infrastructure monitoring. This widget charts the disk usage for every device by using
the system.disk.in_use metric, which is the amount of disk space that's being used
as a fraction of the total available on the device:
• system.disk.free: The amount of disk space that's free on the device in bytes
• system.disk.total: The total amount of disk space on the device in bytes
• system.disk.used: The amount of disk space being used on the device in bytes
• system.fs.inodes.free: The number of inodes free on the device
• system.fs.inodes.total: The total number of inodes on the device
• system.fs.inodes.used: The number of inodes being used on the device
• system.fs.inodes.in_use: The number of inodes being used as a fraction
of the total available on the device
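The in_use metrics in this list are fractions derived from the corresponding used and total values; a minimal sketch:

```python
def fraction_in_use(used: float, total: float) -> float:
    """Used amount as a fraction of the total, mirroring how
    system.disk.in_use and system.fs.inodes.in_use relate to their
    used/total counterparts."""
    if total <= 0:
        raise ValueError("total must be positive")
    return used / total
```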
In this section, we looked at various metrics that help us monitor storage disk usage. In
the next section, we'll review the metrics available for monitoring network traffic.
Network traffic
The inbound and outbound network traffic on the host is tracked by the
system.net.bytes_rcvd and system.net.bytes_sent metrics, respectively. The following
screenshot shows the network traffic widget that's available on the host dashboard:
Listing containers
A microservice is deployed using one or more containers running on different hosts. If
a Datadog agent runs on those hosts and it's been configured to detect those containers,
Datadog will list them on the Containers dashboard. The Datadog agent configuration
changes that are needed for discovering containers running on a host were discussed in
Chapter 2, Deploying a Datadog Agent.
To get to the Containers dashboard, you can navigate to Infrastructure | Containers.
From there, you will be able to see the dashboard shown in the following screenshot:
The processes running on a host are not reported by the Datadog agent by default, so
you need to enable that feature. To do that, add the following option to the
datadog.yaml Datadog agent configuration file and restart the agent service:
process_config:
  enabled: 'true'
Usually, this configuration option is available but commented out in the agent
configuration file, so you just need to uncomment these lines before restarting the agent.
Thus far, we've mentioned real and virtual machines and containers as the building blocks
of infrastructure. In the next section, we will see how Datadog can help with monitoring
a serverless computing environment.
Best practices
These are the best practices that should be followed for monitoring infrastructure:
• Datadog provides lots of metrics that cover various aspects of compute, storage, and
network monitoring out of the box. Make sure that you only focus on the important
metrics from each category.
• Containers are becoming core constructs of the infrastructure that hosts modern
applications. If microservices are used, it is important to monitor the containers and
obtain insights that could be used to fine-tune the performance and availability of
applications.
• Aggregating the processes would make it easy to compare runtime environments,
and that feature must be enabled.
• Though there is no infrastructure to manage in serverless computing, Datadog
provides options for monitoring the application functions that should be enabled
for tracking the related metrics.
Summary
In a traditional computing scenario where hosts are involved, infrastructure monitoring
is the core of monitoring. Hosts include both bare-metal and virtual machines. With
the increased use of microservices to deploy applications, the container has become an
important element in building infrastructure. Datadog provides features that support
both host- and container-level monitoring. The Datadog feature of aggregating processes
from the hosts makes comparing runtime environments at the host level easy. Though no
infrastructure is involved, Datadog can be used to monitor serverless computing resources
such as AWS Lambda functions. In this chapter, you learned how various components of
the infrastructure that run a software application system are monitored using Datadog.
These infrastructure components could be in traditional data centers or public clouds,
and they could be virtual machines or containers.
Though metrics can be published to Datadog or generated by it, without monitors
and alerts, they are of limited value. In the next chapter, we will learn how
monitors and alerts are set up in Datadog.
7
Monitors and Alerts
In the last chapter, we learned how infrastructure is monitored using Datadog. The
modern, cloud-based infrastructure is far more complex and virtual than the data center-
based, bare-metal compute, storage, and network infrastructure. Datadog is designed to
work with cloud-centric infrastructure, and it meets most of the infrastructure monitoring
needs out of the box, be it a bare-metal or public cloud-based infrastructure.
A core requirement of any monitoring application is to notify you about an ongoing issue,
ideally before that issue results in a service outage. In previous chapters, we discussed
metrics and how they are generated, viewed, and charted on dashboards. An important
use of metrics is to predict an upcoming issue. For example, by tracking the
system.disk.free metric on a storage device, it is easy to send a notification when it
drops to a certain point. By combining it with the system.disk.total metric, it's also
possible to track the available storage as a percentage.
A monitor typically tracks a time-series value of a metric, and it sends out a notification
when the metric value is abnormal with reference to a threshold during a specified time
window. The notification that it sends out is called a warning or alert notification. Such
notifications are commonly referred to as alerts. Thresholds for warning and critical
statuses are set on the monitor. For example, from the disk storage metrics mentioned
earlier, the percentage of free storage available can be calculated. The warning threshold
could be set at 30 percent and the critical threshold could be set at 20 percent.
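The threshold logic just described can be sketched as follows. The 30 and 20 percent thresholds mirror the example, and the status names are illustrative:

```python
def storage_status(free_bytes: float, total_bytes: float,
                   warn_pct: float = 30.0, critical_pct: float = 20.0) -> str:
    """Classify the free-storage percentage against warning and critical
    thresholds, as in the system.disk.free / system.disk.total example."""
    pct_free = 100.0 * free_bytes / total_bytes
    if pct_free <= critical_pct:
        return "ALERT"
    if pct_free <= warn_pct:
        return "WARN"
    return "OK"
```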
In this chapter, we will learn about monitors and alerts and how they are implemented in
Datadog, in detail. Specifically, we will cover the following topics:
• Setting up monitors
• Managing monitors
• Distributing notifications
• Configuring downtime
Technical requirements
To try out the examples mentioned in this book, you need to have the following
tools installed and resources available:
• A Datadog account and a user with admin-level access.
• A Datadog Agent running at host level or as a microservice depending on the
example, pointing to the Datadog account.
Setting up monitors
In a generic monitoring system, a monitor is usually tied to metrics or events, and
Datadog covers these and much more. There are multiple monitor types based on the data
sources defined in Datadog. Some of the most important ones include the following:
• Metric: As mentioned at the beginning of this chapter, metrics are the most
common type of information used to build monitors. A metric type monitor is
based on the user-defined thresholds set for the metric value.
• Event: This monitors the system events tracked by Datadog.
• Host: This checks whether the Datadog agent on the host reports into the Datadog
Software as a Service (SaaS) backend.
• Live process: This monitors whether a set of processes at the operating system level
are running on one or a group of hosts.
• Process check: This monitors whether a process tracked by Datadog is running on
one or a group of hosts.
• Network: These monitors check on the status of the TCP/HTTP endpoints.
• Custom check: These monitors are based on the custom checks run by the
Datadog agent.
Now, we will look at how monitors of these types are created and what sort of
instrumentation needs to be done to support them on the Datadog agent side.
To create a new monitor, on the Datadog dashboard, navigate to Monitors | New
Monitor, as shown in the following screenshot:
By clicking on this option, you will get to an elaborate form where all the information
needed for the monitor will be provided to create it. As there are several options available
when setting up a monitor, this form is rather long. We will view it in the following
screenshots:
Figure 7.3 – The New Monitor form – selecting the detection method and metric
The first step is to select the detection method. In this example, as shown in Figure 7.3, the
detection method is selected as Threshold Alert. This is the usual detection method in
which the metric value is compared with a static threshold value to trigger an alert.
The following list highlights other detection methods that are available:
• Change Alert: This compares the current metric value with a past value in the time
series.
• Anomaly Detection: This detects any abnormality in the metric time series data
based on past behavior.
• Outliers Alert: This alerts you about the unusual behavior of a member in a group.
For example, increased usage of memory on a specific host in a cluster of hosts
grouped by a tag.
• Forecast Alert: This is similar to a threshold alert; however, a forecasted metric
value is compared with the static threshold.
The selection of the monitor type is important, as the behavior and use of a monitor
largely depend on the type of monitor.
Important note
Note that only the threshold alert is a standard method that you can find in
other monitoring systems; the rest are Datadog-specific. In a generic monitoring
system, implementing these detection methods would usually require some
customization, whereas Datadog supports them out of the box.
In the second step, the metric used for the monitor is selected, and filter conditions are
added using tags to define the scope of the monitor. Let's take a look at each field, in this
section, to understand all of the available options:
• Metric: The metric used for the monitor is selected here. It should be one of the
metrics that is reported by the Datadog agent.
• from: Using available tags, the scope of the monitor is defined. In this example, by
selecting the host and device_name tags, the monitor is defined specifically for a
disk partition on a host.
• excluding: Tags could be selected here to exclude more entities explicitly.
If the filter condition (set in the from and excluding fields) returns more than one set
of metric values, such as system.disk.free values from multiple hosts and devices,
one of the aggregate functions, average, maximum, minimum, or sum, will be selected
to specify which value can be used for metric value comparisons. Picking the correct
aggregate function and tags for grouping is important. If the filter condition returns only
one value already, then this setting is irrelevant.
The alert triggered by this monitor could be configured as Simple Alert or Multi Alert
(in Figure 7.3, Simple Alert is selected). When the scope of the monitor is a single source,
such as a storage device on a host, a simple alert is chosen. If there are multiple sources
involved and the Simple Alert option is chosen for the monitor, an aggregated alert
notification is sent out for all the sources, and an alert notification that is specific to each
source is sent out if the Multi Alert option is chosen.
In this section of the form, the configurations mentioned earlier are all done under the
Edit tab. By clicking on the Source tab, you can view the configurations you have done in
the underlying Datadog definition language, as shown in the following example:
avg:system.disk.free{device_name:xvda1,host:i-021b5a51fbdfe237b}
In the third part of this form, primarily, threshold values and related configurations for
the monitor are set, as shown in the following screenshot:
• Data window: This ranges from 1 minute to 1 day, and there is also a custom option.
Metric values from this time window are checked with reference to the nature of the
occurrence setting.
• Nature of the occurrence: on average, at least once, at all times,
or in total. As the options suggest, the metric values from the specified time
window are checked for possible issues, warnings, or alerts, based on this setting.
The monitor is marked as recovered from either the warning or alert state when the
related conditions are no longer satisfied. However, you can add optional thresholds that
are specific to recovery using Alert recovery threshold and Warning recovery threshold.
A full window of data can be set as Require or Do not require. If Require is selected, the
monitor will wait for a full window of data to run the checks. This option must be selected
if the source is expected to report metric values regularly. Select Do not require if the
expectation is to run the checks with whatever data points are available during the data
window specified on the monitor.
Besides the warning and alert notifications, the monitor can also report on missing data.
If the Notify option is selected, a notification will be sent by the monitor when the metric
values are not reported during the specified time window in minutes. This option is
chosen when the sources are expected to report metric values during normal operational
conditions, and a no data alert indicates some issue in the application system monitored
by Datadog. If the sources selected in the monitor are not expected to report metric
values regularly, choose the Do not notify option. The no data alert could be set to resolve
automatically after a specified time window. Usually, this is set to Never unless there is a
special case of it being automatically resolved.
Using the Delay evaluation option, the metric values used to evaluate various thresholds
set in the monitor can be offset by a specified number of seconds. For example, if a
60-second evaluation delay is specified, at 10:00, the checks for the monitor will be run
on the data reported from 9:54 to 9:59 for a 5-minute data window. In a public cloud
environment, Datadog recommends as much as a 15-minute delay to reliably ensure that
the metric values are available.
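The delay arithmetic in this example can be sketched as follows; the function names and return shape are illustrative:

```python
from datetime import datetime, timedelta

def evaluation_window(now: datetime, window_minutes: int,
                      delay_seconds: int) -> tuple:
    """Compute the (start, end) interval a monitor evaluates when an
    evaluation delay is applied: the whole data window is shifted back
    by the delay."""
    end = now - timedelta(seconds=delay_seconds)
    start = end - timedelta(minutes=window_minutes)
    return start, end
```

With a 60-second delay and a 5-minute window at 10:00, this yields the 9:54 to 9:59 interval described above.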
Based on the thresholds set for the monitor, the live status is charted at the top of the
form, and the following screenshot shows how the sample monitor depicts the status it
determined:
The notification can have a title and body, just like an email message. Template variables
and conditionals could also be used to make them dynamic. The following screenshot
indicates what an actual notification template might look like:
• value: This is the metric value that will trigger the alert.
• threshold: This is the threshold for triggering an alert that the metric value is
compared against.
• warn_threshold: This is similar to the threshold but set for triggering a warning.
• last_triggered_at: This details the time when the alert was triggered in UTC.
If the Multi Alert option is selected, the values of the tags used to group the alerts will
be available as TAG_KEY.name in the notification message. Click on the Use message
template variables help link to know which tags are available for use.
Using conditionals, custom notifications can be sent out based on the nature of the alert
triggered. In the example message template in Figure 7.7, you can see that part of the
message is specific to a warning or an alert. The following is a list of the main conditionals
that can be used in the message template:
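As an illustration, a message body using the standard Datadog conditionals {{#is_alert}}, {{#is_warning}}, and {{#is_recovery}} might look like this; the message text and recipient address are placeholders:

```
{{#is_alert}}
Disk space is critically low: {{value}} bytes free (threshold: {{threshold}}).
{{/is_alert}}
{{#is_warning}}
Disk space is running low: {{value}} bytes free (warning threshold: {{warn_threshold}}).
{{/is_warning}}
{{#is_recovery}}
Disk space is back to normal.
{{/is_recovery}}
@ops-team@example.com
```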
Please refer to the Datadog documentation at https://docs.datadoghq.com/monitors/
notifications for a full set of template variables and how they are used.
The notification message could be formatted as well, and the details of the markdown
format can be viewed by following the Markdown Formatting Help link on the form.
The recipients of the notification can be specified using @, as mentioned in the example
template. Usually, an email address or distribution list follows @. The notification message
can be tested by clicking on the Test Notifications button at the bottom of the form, as
shown in Figure 7.9. You should see a pop-up window where one or more test scenarios
can be selected, and the test messages will be sent out to recipients accordingly.
You can tag a new monitor to help with searching for and grouping monitors later. The tag
monitor-type with a value infra has been added to the sample monitor by typing in
the tag key and value in the Tags field.
The renotify if the monitor has not been resolved option can be used to escalate an issue
if no action is taken to resolve the state in the application system that triggered the alert.
An additional message template must be added to the monitor if the renotify option is
selected, as shown in the following screenshot:
Figure 7.9 – The New Monitor form – the Notify your team and Save monitor
The recipient list specified by @ in the message template in section 4 automatically shows
up in this section and vice versa.
It is possible to configure the monitor to notify the alert recipients of any changes to the
monitor definition itself using the notify alert recipients when this alert is modified option.
Access to edit the monitor can also be restricted using the restrict editing of this monitor
to its creator or administrators option.
The monitor we built in the preceding example is a metric type monitor, which is the most
common type of monitor available on all monitoring platforms, including Datadog. The
creation of all of these kinds of monitors is similar, and so, we will not look at every step
that is needed to set up the remaining monitor types.
Let's take a look at the main features of other monitor types and the scenarios in which
they might be useful:
• Host: This monitor simply checks whether the Datadog agent running on the
host reports into the SaaS backend. In infrastructure monitoring, pinging a host
is a basic monitoring activity, and in Datadog, this is implemented by using the
heartbeats sent by the agent to the backend. Multiple hosts can be monitored from
one monitor by selecting the hosts using the host tag.
• Event: In this monitor type, Datadog provides a keyword-based search of events
during a time window and allows you to set warning and alert thresholds based on
the number of events found in the search result. Setting up an event monitor would
be useful if it was also possible to track an issue based on the event descriptions
posted to Datadog.
• Composite: Composite monitors are built by chaining existing monitors using
logical operators AND (&&), OR (||), and NOT (!). For example, if a, b, and c
are existing monitors, you can create a composite monitor using the following
condition:
( a || b ) && !c
• Process Check: This monitor type is similar to the Live Process monitor in terms
of monitoring the processes running on hosts. However, a Process Check
monitor can only monitor the processes covered by the Datadog agent check
process.up; it is, on the other hand, more organized. The processes to be monitored
using this type of monitor must be defined in the conf.d/process.d/conf.yaml file
on the Datadog agent side.
• Network: This monitor type, essentially, monitors the TCP and HTTP endpoints
defined in the TCP and HTTP checks configured on the Datadog agent side. The
TCP checks are defined in the conf.d/tcp_check.d/conf.yaml file, and the
HTTP checks are defined in the conf.d/http_check.d/conf.yaml file. The
HTTP check also covers SSL certificate-related verifications for an HTTPS URL.
The network checks return OK, WARN, or CRITICAL. The threshold is set for how
many such status codes must be returned consecutively for the monitor to trigger a
warning or alert.
• Integration: Third-party applications, such as Docker and NGINX, provide
integrations with Datadog that are available out of the box, and they only need to
be enabled when required. Once enabled and configured by the Datadog agent,
the integrations will publish domain-specific metrics and status check results into
Datadog, and these will be available to monitors of this type.
When a monitor of this kind is created, either the integration-specific metric or a
status check can be used as the source of information to be tracked by the monitor.
• Custom Check: When a certain status cannot be checked using various information
reported by a Datadog agent out of the box with or without some configuration
changes, or by the use of integrations, custom scripts can be written and deployed
with the Datadog agent. We will discuss the details of doing that in Chapter 8,
Integrating with Platform Components.
This monitor type is similar to the integration monitor, in which a custom check is
selected for the status check.
• Watchdog: A Watchdog monitor will report on any abnormal activity in the system
monitored by Datadog. Datadog provides various problem scenarios that cover both
computational infrastructures and integrations.
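The composite condition ( a || b ) && !c from the Composite monitor example above maps directly onto Boolean logic; a minimal sketch, where True means the underlying monitor is in the alert state:

```python
def composite_status(a: bool, b: bool, c: bool) -> bool:
    """Evaluate the example composite condition ( a || b ) && !c.

    The composite monitor alerts when a or b is alerting, unless c
    is also alerting.
    """
    return (a or b) and not c
```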
We have looked at how to set up new monitors at great length, and we have also reviewed
different types of monitors that can cater to all possible requirements you might have.
In the next section, we will learn how these monitors can be maintained in a large-scale
environment, where some automation might also be required.
Managing monitors
By navigating to Monitors | Manage Monitors on the Datadog dashboard, you can list all
of the existing monitors, as shown in the following screenshot:
Distributing notifications
Let's recap some of the concepts that we have discussed in this chapter to reiterate the
related workflows. A monitor triggers a warning or an alert when some threshold or
state that the monitor is tracking in the system has been reached. Notifications about this
change in status, from OK to Warning or Alert, and then recovering back to OK, can be
sent out to different communication platforms such as email, Slack, Jira, and PagerDuty.
These notifications can also be posted to any system that supports Webhooks.
We have already learned that just by prefixing a personal or group email address with @,
the notifications could be forwarded to them. It's always a best practice to forward these
notifications to a group email or a distribution list, as they must be addressed by someone
on the team.
The integrations with other tools facilitate the distribution and escalation of alert
notifications for systematic tracking and the closure of issues. Let's take a look at some of
the important tools:
• Jira: Atlassian's Jira is a popular issue tracking tool. By enabling this integration,
a Jira ticket can be created for every alert notification the monitors generate.
Additionally, the creation of the Jira ticket is tracked in Datadog as an event.
• PagerDuty: Tools such as PagerDuty are used by support teams to escalate issues
systematically. An alert notification needs to be distributed to the right resource in
order to triage the related issue, and if that person is not available, another resource
must be notified or the issue has to be escalated. PagerDuty is good at taking care of
such tasks, and it can be configured to implement complex escalation needs.
• Slack: IRC-style team communication platforms such as Slack have been very
popular, and they cover many messaging requirements that email used to address
before. Once integrated, many Datadog tasks, such as muting monitors, can be
initiated from a Slack channel. These options are available in addition to the core
feature of forwarding the alert notifications to Slack channels.
• Webhooks: While Datadog supports integrations with popular issue tracking and
communication platforms such as PagerDuty, Jira, and Slack, it also provides a
custom integration option using webhooks. If specified in the alert notification
template, a webhook will post alert-related JSON data into the target application. It's
up to that application to consume the alert data and take any action based on that.
In Chapter 8, Integrating with Platform Components, we will learn, in detail, how
webhooks are set up in Datadog for integration with other applications.
Monitors can send out lots of notifications via multiple channels based on the integrations
in place, and it might be necessary to mute the monitors at times, in situations such
as maintenance or deployment. In the next section, you will learn how that can be
accomplished.
Configuring downtime
Using the options provided by Datadog, a group of monitors can be muted during a
scheduled period. Typically, such downtime is needed when you make a change to the
computing infrastructure or deploy some code. Until such changes are completed, and
the impacted environments are stabilized, it might not make much sense to monitor the
software system in production. Additionally, by disabling the monitors, you can avoid
receiving alert notifications in emails, texts, and even phone calls depending on what
integrations are in place.
To schedule downtime, navigate to Monitors | Manage Downtime, and you should see a
form, as shown in the following screenshot, where the existing schedules will be listed:
By clicking on the Schedule Downtime button, you can add a new downtime schedule, as
follows:
• The downtime can be defined for all monitors or a group of monitors identified by
names or tags.
• The downtime can be one-time or recurring. A recurring downtime is useful when
some of the services being monitored will be unavailable periodically, such as some
applications that are shut down during the weekend.
• A notification message template and recipient list can be specified, similar to alert
notifications, to send out notifications about the downtime.
We have looked at the process of configuring the downtime of a monitor, and that
concludes this chapter. Next, let's take a look at the best practices related to setting up and
maintaining monitors and alerts.
Best practices
Now, let's review the best practices related to monitors and alerts:
• Define a comprehensive list of monitors that will cover all aspects of the working of
the software system that Datadog is monitoring.
• It's possible that you might only be using Datadog to address a certain part of the
monitoring requirement, and, in that case, make sure that the monitors defined
cover your area of focus.
• To avoid alert fatigue, make sure that all alert notifications are actionable, and
the remediation steps are documented in the notification itself or in a runbook.
Continue fine-tuning the thresholds and data window size until you start getting
credible alert notifications. If monitors are too sensitive, they can start generating
noise.
• If no amount of fine-tuning can stop a monitor from sending out too many alerts, consider deleting it, and plan to monitor the related scenario in some other reasonable way.
• Integrate monitors with your company's choice of tools for incident tracking,
escalation, and team communication. Email notifications must be a backup, and
those notifications must be sent out to user groups and not individuals. Email
alerts should also serve as a historical record that can be used as evidence for
postmortems (or the Root Cause Analysis or RCA of production incidents) and
third-party audits such as those required for SOC 2 Type 2 compliance.
• Back up monitors as code by exporting their JSON definitions and maintaining them in a source control system such as Git. If possible, manage monitors as code by defining them in Terraform, Ansible, or Datadog's JSON format.
• Socialize the downtime feature and practice it while doing maintenance. Without
that, the alert notifications can become a nuisance, and the credibility of the
monitoring system itself can take a hit.
Summary
In this chapter, you have learned how to create a new Datadog monitor based on
the variety of information available in Datadog about the software system and the
infrastructure that Datadog monitors. Additionally, we learned how those monitors could
be maintained manually or maintained as code using the options available. We looked
at different methods available to integrate the monitors with communication tools to
distribute the alert notifications widely. Finally, you learned how the downtime feature
could be used effectively during maintenance and shutdown periods.
We already discussed a number of third-party tools in the context of using or monitoring
them with Datadog. In the next chapter, we will learn how such integrations are used in
Datadog and how custom integrations can be rolled out.
Section 2:
Extending Datadog
Datadog has many standard monitoring features that are available out of the box, immediately upon successfully installing agents in an application environment. However, to
roll out a comprehensive monitoring solution, Datadog needs to be extended by adding
custom metrics and features. The Datadog platform provides multiple integration options
to extend it to implement advanced monitoring requirements. This part of the book
explores all such options in detail.
This section comprises the following chapters:
In this chapter, we will learn about the details of platform monitoring using the integration options available in Datadog. Specifically, we will cover these topics:
• Configuring an integration
• Tagging an integration
• Reviewing supported integrations
• Implementing custom checks
Technical requirements
To try out the examples mentioned in this book, you need to have the following tools
installed and resources available:
- An Ubuntu 18.04 environment with the Bash shell. The examples might work on other Linux distributions as well, but suitable changes must be made to the Ubuntu-specific commands.
- A Datadog account and a user with admin-level access.
- A Datadog Agent, running at the host level or as a microservice depending on the example, pointing to the Datadog account.
Configuring an integration
Datadog provides integrations with most of the popular platform components, and they need to be enabled as required. We will walk through the general steps involved in enabling an integration with an example.
The available integrations are listed on the Integrations dashboard as in the following
screenshot, and it's directly accessible from the main menu:
As you can see, in Figure 8.1, the third-party software components are listed on the
Integrations dashboard. Using the search option available at the top of this dashboard, the
integrations that are already installed can be filtered out. Also, the available integrations
can be looked up using keywords.
By clicking on a specific integration listing, you can get all the details related to that
integration. For example, the following screenshot provides such details for integration
with NGINX, a popular web server:
Now let's see how the NGINX integration is configured and enabled for a specific NGINX instance; this will demonstrate the general steps involved in configuring any integration.
The first step is to enable the integration in your account and that can be accomplished by
just clicking the Install button on the integration listing as you can see in Figure 8.3.
In the sample case here, we will try to enable the integration for an open source version
of an NGINX instance running on an AWS EC2 host that runs on the Linux distribution
Ubuntu 16.04. Note that the actual configuration steps would also differ depending on
the operating system where the platform component and Datadog agent run, and these
steps are specific to the environment described previously:
1. Make sure that both the Datadog Agent and NGINX services are running on the
host. This could be checked as follows on Ubuntu 16.04:
$ sudo service nginx status
2. You will see a status similar to the following if the service runs OK:
Active: active (running) since Mon 2021-01-04 03:50:39
UTC; 1min 7s ago
(Only the line corresponding to service status is provided here for brevity.)
Let's do the configuration changes needed on the NGINX side first.
3. Check whether the stub status module, which is needed for the integration to work, is installed with the NGINX instance:
$ nginx -V 2>&1| grep -o http_stub_status_module
http_stub_status_module
location /nginx_status {
    # Choose your status module
    # stub_status is provided by the http_stub_status_module verified above
    stub_status;
    access_log off;
    allow 127.0.0.1;
    deny all;
}
7. To check if the integration works correctly, you can look at the related information
in the Datadog Agent status:
$ sudo datadog-agent status
Getting the status from the agent.
nginx (3.8.0)
-------------
Instance ID: nginx:16eb944e0b242d7 [OK]
Configuration Source: file:/etc/datadog-agent/conf.d/
nginx.d/conf.yaml
Total Runs: 5
Metric Samples: Last Run: 7, Total: 35
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 1, Total: 5
Average Execution Time : 4ms
Last Execution Date : 2021-01-04 04:00:29.000000
UTC
Last Successful Execution Date : 2021-01-04
04:00:29.000000 UTC
metadata:
version.major: 1
version.minor: 10
version.patch: 3
version.raw: 1.10.3 (Ubuntu)
version.scheme: semver
Tagging an integration
The metrics published by an integration can be tagged at the integration level for better visibility in the Datadog resources where those metrics will eventually be used. We will see how that's done, with a view to implementing best practices for effectiveness. We
already learned in Chapter 2, Deploying Datadog Agent, and, in Chapter 5, Metrics, Events,
and Tags, that host-level, custom tags can be added by using the tags configuration item
in the datadog.yaml file. Custom tags added using this option will be available on all
the metrics generated by various integrations running on that host. The tagging option
is available at the integration level also, and the related tags will be applied only on the
integration-specific metrics.
In the use case that was mentioned in the previous section related to using NGINX for
different roles, this multi-level tagging method will be useful to filter the NGINX metrics
originating from multiple hosts. For example, at the host level, it can be identified as a
web server or proxy server with the tag role:proxy-server or role:web-server,
and at the integration level, more tags can be applied indicating the specific name of
the component with the tag component:nginx. Note that this approach provides
the flexibility to track the role of platform components such as HAProxy, NGINX, and
Apache that could be used for a variety of HTTP and proxy serving roles in the entire
application system.
Now, let's see how this tagging strategy we just discussed could be implemented in the
sample case from the last section.
In the datadog.yaml file, the following tag is added:
tags:
- role:proxy-server
In the NGINX integration's conf.yaml file, the following integration-level tags are added:
tags:
- component:nginx
- nginx-version:open-source
Note that the additional tag nginx-version will help to identify what kind of NGINX
is used. To start applying these tags to the metrics, the Datadog Agent has to be restarted.
After that, you can filter the metrics using these tags as in the following examples.
In the first example, shown in Figure 8.5, you can see that an integration metric, nginx.net.connections, is tagged with the host-level tag role:proxy-server. Note that all the host-level metrics and the metrics from other integrations will also be tagged in the same fashion. For example, the system-level metric system.mem.used will also carry the role:proxy-server tag:
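Once both levels of tags are in place, they can be combined in a metric query to narrow down, say, the NGINX connections coming only from proxy servers (a sketch in Datadog's metric query syntax):

```
avg:nginx.net.connections{role:proxy-server,component:nginx} by {host}
```

The same pattern works for any integration metric once the tags are applied.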
Reviewing supported integrations
In the last two sections, we have learned the basics of how to enable an integration and
how to tag the metrics published by an integration. With that basic understanding, we will
dig deeper into the broader picture of the Datadog support for third-party applications
and how they are integrated and organized. We will also look at some of the important
integrations that we will see ourselves using most of the time with Datadog as they are the
common platform components used to build a variety of software application systems.
• Agent-based: In the example we saw earlier in this chapter, the Datadog Agent
configuration had to be updated to enable the integration. That is required because
the platform component, NGINX in the example, runs on a host. Even when the component runs as a microservice, an agent is still needed to monitor that environment. Essentially, the integration in such cases is managed by the local agent. The Datadog
Agent is shipped with official integrations and they are readily available as we saw in
the example on NGINX. There are more integrations available that are community-
built and they can be checked out at https://github.com/DataDog/
integrations-extras on GitHub. Each integration is documented in its own
README file there.
• Crawler-based: A software platform is not always built using components running
only locally. Some services could be cloud-based and Datadog provides integrations
with popular services such as GitHub, Slack, and PagerDuty. In such cases,
credentials to access those service accounts are provided to Datadog and it will
crawl the account to report metrics related to the given service account.
• Custom integration: Custom integration options are extensive with Datadog
and they are general-purpose in nature and not specific to integrating platform
components. The option to run custom checks from a Datadog Agent is the easiest
option available to integrate a platform component for which official support is not
available or adequate. We will see how that could be implemented in a sample later
in this section. The following are the other options that can be used for rolling out a
custom integration.
• Use the Datadog API: One of the major attractions of using Datadog for
monitoring is its extensive REST API support. While you need a skilled team to roll
out integrations using the Datadog API, having that option makes your monitoring
infrastructure extensible and flexible.
• Build your own integration: Datadog provides a developer toolkit to build your
own integration following the Datadog norms. The details can be checked out at
https://docs.datadoghq.com/developers/integrations/new_
check_howto.
• Publish metrics using StatsD: We will look at this generic integration option in
detail in Chapter 10, Working with Monitoring Standards.
Now let's look at some of the important integrations that are shipped with the Datadog
Agent and available for users to enable and use:
• Integrations with the public cloud: Datadog supports integrations with major
public cloud platforms such as AWS, Azure, and GCP and the individual services
offered on those platforms. These crawler-based integrations require access to your
public cloud account and that can be provided in different ways.
• Microservices resources: Both Docker and Kubernetes are key components
in building a microservices infrastructure. Integrations for both these platform
components and related products are supported by Datadog.
• Proxy and HTTP services: Apache, Tomcat, NGINX, HAProxy, Microsoft IIS, and
Memcached are popular in this category and integrations are available for these
components.
• Messaging services: Popular messaging software and cloud services such as
RabbitMQ, IBM MQ, Apache Active MQ, and Amazon MQ are supported.
• RDBMS: Almost every popular Relational Database Management System
(RDBMS), such as Oracle, SQL Server, MySQL, PostgreSQL, and IBM DB2, is
supported. Monitoring databases is an important requirement as databases are
central to many software applications. These integrations supply a variety of metrics
that could be used to monitor the workings and performance of databases.
• NoSQL and Big Data: NoSQL databases are widely used in big data and cloud-
based applications due to their flexibility and scalability. Popular software such as
Redis, Couchbase, Cassandra, MongoDB, Hadoop, and related products from this
category are supported by Datadog.
• Monitoring tools: It's common to have multiple monitoring tools in use as part of
rolling out a comprehensive monitoring solution for a target environment. In such
a scenario, Datadog will be one of the services in the mix, and it's a good platform
to aggregate inputs from other monitoring systems due to its superior UI and
dashboarding features. Datadog also provides integrations with other monitoring
tools to facilitate that consolidation. Currently, integrations for monitoring
applications such as Catchpoint, Elasticsearch, Nagios, New Relic, Pingdom,
Prometheus, Splunk, and Sumo Logic are available.
Implementing custom checks
Let's call this custom check custom_nginx. The configuration steps largely follow those we did for enabling the NGINX integration earlier. In this case, the configuration directory and related resources have to be created:
1. Create a configuration directory and set up a configuration file for the check:
$ cd /etc/datadog-agent/conf.d
$ sudo mkdir custom_nginx.d
2. Create a custom_nginx.yaml file in the new directory and save the following
string in it:
instances: [{}]
3. Create the check script custom_nginx.py in the /etc/datadog-agent/checks.d directory with the following code:

from datadog_checks.base import AgentCheck
from datadog_checks.base.utils.subprocess_output import get_subprocess_output

class NginxErrorCheck(AgentCheck):
    def check(self, instance):
        # Run ls -al on the error log and capture its output
        file_info, err, retcode = get_subprocess_output(
            ["ls", "-al", "/var/log/nginx/error.log"],
            self.log, raise_on_empty_output=True)
        # The fifth field of the ls output is the file size in bytes
        file_size = int(file_info.split()[4])
        self.gauge("kurian.nginx.error_log.size", file_size,
                   tags=['component:nginx'])
Besides the template requirements of the script, it does the following tasks:
A. Runs the ls -al command on the /var/log/nginx/error.log file
B. Parses the file size from the command output
C. Reports the file size as a metric value to Datadog with the component:nginx
tag applied
4. Restart the Datadog Agent to enable the custom check. To check if it runs
successfully, you can run the status check command and look for the status
related to the custom check:
$ sudo datadog-agent status
Running Checks
==============
custom_nginx (1.0.0)
--------------------
Instance ID: custom_nginx:d884b5186b651429 [OK]
Configuration Source: file:/etc/datadog-agent/
conf.d/custom_nginx.d/custom_nginx.yaml
Total Runs: 1
Metric Samples: Last Run: 1, Total: 1
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 0, Total: 0
Average Execution Time : 2ms
Last Execution Date : 2021-01-04 10:03:18.000000
UTC
Last Successful Execution Date : 2021-01-04
10:03:18.000000 UTC
5. Once you have verified the working of the custom check on the server side, you can
expect the custom metric to be available on the Datadog UI, and it can be verified as
in the following screenshot:
The dashboard also tracks the NGINX integration as one of the apps on the host
dashboard.
Now let's look at the best practices related to the topics that we have covered in this
chapter in the following section.
Best practices
Between the Datadog-provided integrations and the hooks it provides to roll out your own custom integrations, many options are available to you, so it's better to follow best practices than to implement something that works but is suboptimal:
• Explore all the Datadog-provided integrations fully and check whether you could
meet the monitoring requirements using those. Custom code and configurations are
costly to develop, error-prone, and hard to deploy and maintain, in the context of
monitoring, and you should consider writing custom code as the last resort.
• If Datadog-supported integrations are not readily available, check the large collection of community-maintained integrations.
• If you need to tweak a community-maintained integration to get it working for you, consider contributing to that project and committing the changes publicly, as that will help you obtain useful feedback from the Datadog community.
• Come up with a strategy for naming tags and custom metrics before you start using
them with integrations. Systematic naming of metrics with appropriate namespaces
will help to organize and aggregate them easily.
• Maintain the custom code and configurations used for enabling and implementing
integrations in the source code control system as a backup and, optionally, use that
as the source for automated provisioning of Datadog resources using tools such as
Terraform and Ansible. This best practice is not specific to integrations; it has to
be followed whenever custom code and configurations are involved in setting up
anything.
• In a public cloud environment, the host-level configurations needed for enabling
integrations must be baked into the machine image. For example, in AWS, such
configurations and custom code, along with the Datadog Agent software, can be
rolled out as part of the related AMI used for spinning up a host.
Summary
We have looked at both Datadog-supplied integrations and the options to implement
integrations on your own. A Datadog environment that monitors a large-scale production
environment would use a mixed bag of out-of-the-box integrations and custom checks.
Though it's easy to roll out custom checks in Datadog, it is advised to look at the total cost
of doing so. In this chapter, you have learned how to select the right integrations and how
to configure them. Also, you learned how to implement custom checks when warranted, in the absence of an out-of-the-box integration.
Continuing with the discussion on extending Datadog beyond the out-of-the-box features
available to you, in the next chapter, we will look at how the Datadog API can be used to
access Datadog features and use them for implementing custom integrations.
9
Using the Datadog
REST API
In the previous chapter, you learned how platform components, mainly made up of
third-party software products and cloud services, are integrated with Datadog, by way
of Datadog-supported integrations and custom checks. The main objective of those
integrations is to monitor third-party tools used in the application stacks from Datadog.
The integration with Datadog can be done in the other direction also. Tools and scripts
can use Datadog HTTP REST APIs to access the Datadog platform programmatically. For
example, if you need to post a metric value or an event to Datadog from an application,
that can be done using the related REST APIs.
The Datadog REST API set is a comprehensive programmatic interface to access the
Datadog monitoring platform. The APIs can be used to post custom metrics, create
monitors and dashboards, tag various resources, manage logs, and create and manage
users and roles. Essentially, the APIs cover anything that you can do from the Datadog UI or by making configuration changes at the agent level.
In this chapter, you will learn the basics of using Datadog APIs from command-line tools
and programming languages, with the help of tutorials. Specifically, the following topics
are covered:
• Scripting Datadog
• Reviewing Datadog APIs
• Programming with Datadog APIs
Technical requirements
To try out the examples in this book, you need to have the following tools installed and
resources available:
Scripting Datadog
In this section, you will learn how to make calls to Datadog APIs using the curl
command-line tool and how to use the APIs from Python. An important prerequisite to
access the Datadog platform programmatically is to set up user access for that purpose.
While authentication is done with the use of dedicated user credentials or SAML when
the Datadog UI is accessed, a pair of application and API keys is used with Datadog APIs.
Let's see how those keys are set up.
By navigating to Team | Application Keys, a new application key pair can be created on
the Application Keys page as shown in the following screenshot:
By navigating to Integrations | APIs on the Datadog dashboard, you can get to the APIs
page where the API key can be created or an existing one can be copied, as shown in the
following screenshot:
curl
In the following example, we will see how a simple curl API call can be used to query for the hosts monitored by Datadog:

$ curl -s -X GET "https://app.datadoghq.com/api/v1/hosts" \
-H "Content-Type: application/json" \
-H "DD-API-KEY: <YOUR-API-KEY>" \
-H "DD-APPLICATION-KEY: <YOUR-APPLICATION-KEY>" \
| python -m json.tool

As you can see, the JSON output is verbose, and it is usually meant for automated processing rather than manual consumption of any sort:
"metrics": {
"cpu": 15.446295,
"iowait": 0,
"load": 1.0077666
},
"mute_timeout": null,
"name": "thomass-mbp-2.lan",
"sources": [
"agent"
],
"tags_by_source": {
"Datadog": [
"host:thomass-mbp-2.lan"
]
},
"up": true
}
],
"total_matching": 1,
"total_returned": 1
}
(Only an excerpt of the output is provided here. The full version can be found in the GitHub
repository for this chapter.)
The verbose result provides detailed information about the hosts where the Datadog agents run.
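Automated processing of this response is straightforward once it is parsed from JSON. In the full response, the per-host entries sit under a host_list key, alongside the total_matching and total_returned counts shown in the excerpt. As a minimal sketch (the second host in the sample data is made up for illustration), the names of hosts reporting up can be extracted like this:

```python
def up_host_names(hosts_response):
    """Return the names of hosts reporting 'up' from a parsed
    /api/v1/hosts response."""
    return [host["name"]
            for host in hosts_response.get("host_list", [])
            if host.get("up")]

# A response shaped like the excerpt above:
sample = {
    "host_list": [
        {"name": "thomass-mbp-2.lan", "up": True, "sources": ["agent"]},
        {"name": "db-1.example.com", "up": False, "sources": ["agent"]},
    ],
    "total_matching": 2,
    "total_returned": 2,
}
print(up_host_names(sample))  # ['thomass-mbp-2.lan']
```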
Though this is a simple API call, there are multiple things you can learn about the
Datadog APIs and curl from this example, so let's look at those one by one.
Not only does the command-line tool curl need to be installed in the local environment,
typically in some Unix shell depending on the operating system that you are working on,
but Python must also be available. Python is needed because the output from the API call
is piped into the Python module json.tool, which formats the output to look better. So,
it's optional in this case, but you will need Python to run the other sample programs.
Let's look at each piece in the curl call:
• The -s switch passed to curl silences the tool from outputting messages about its
own working. That is a good practice when the output is supposed to be parsed by
another tool or code to avoid mixing it with the result from the API call.
• -X GET is the HTTP verb or method used and the -X option of curl is used for
specifying that. The GET method allows the reading of resources and it's the default
method when making REST API calls from any tool, including curl. So, in this case,
there is no need to use -X GET as GET is the default verb. The other important
methods are POST (for creating new resources), PUT (for updating existing resources), and DELETE (for deleting existing resources). We will see the use of all these methods in this chapter; note that POST and PUT are sometimes used interchangeably.
• https://app.datadoghq.com is the URL to access the Datadog backend.
• /api/v1/hosts is the API endpoint that is used to list the hosts. An API
endpoint corresponds to a resource that could be accessed via the REST API. The
HTTP method used along with the API endpoint determines the nature of the
action. (These conventions are not always followed strictly.) For example, GET
returns details about the existing hosts, and a POST or PUT call could be used to
make some change to the same resource.
• The -H option of curl lets you pass in an HTTP header as part of the API call.
In this example, three such headers, Content-Type, DD-API-KEY, and
DD-APPLICATION-KEY, are passed. Practically, the headers can be considered
inputs to the API call. With the POST and PUT methods, data can be passed to the
call as input using the -d option of curl (which is akin to passing in input for a web
form), but with a GET call, a header is the only option. In this case, the Content-Type header indicates that the data exchanged with the API is in JSON format.
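These same pieces carry over directly when the call is made from code rather than curl. Here is a minimal sketch of a helper that assembles the three headers described above (the key values are placeholders):

```python
def datadog_headers(api_key, app_key):
    """Build the HTTP headers required for a Datadog API call."""
    return {
        "Content-Type": "application/json",
        "DD-API-KEY": api_key,
        "DD-APPLICATION-KEY": app_key,
    }

headers = datadog_headers("<YOUR-API-KEY>", "<YOUR-APPLICATION-KEY>")
print(sorted(headers))  # ['Content-Type', 'DD-API-KEY', 'DD-APPLICATION-KEY']
```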
Now, let's make the same API call with a different set of options to illustrate the usage of curl with the Datadog REST API:

$ curl -s -i "https://app.datadoghq.com/api/v1/hosts" \
-H "Content-Type: application/json" \
-H "DD-API-KEY: <YOUR-API-KEY>" \
-H "DD-APPLICATION-KEY: <YOUR-APPLICATION-KEY>"
HTTP/2 200
date: Sun, 24 Jan 2021 22:48:05 GMT
content-type: application/json
content-length: 2687
vary: Accept-Encoding
pragma: no-cache
cache-control: no-cache
set-cookie: DD-PSHARD=134; Max-Age=604800; Path=/; expires=Sun,
31-Jan-2021 22:48:05 GMT; secure; HttpOnly
x-dd-version: 35.3760712
x-dd-debug: V1SoipvPhHDSfl6sDy+rFcFwnEIiS7Q6PT
TTTi5csh65nTApZwN4YpC1c2B8H0Qt
x-content-type-options: nosniff
strict-transport-security: max-age=15724800;
content-security-policy: frame-ancestors 'self'; report-uri
https://api.datadoghq.com/csp-report
x-frame-options: SAMEORIGIN
In this version of the curl call, the output is not formatted, but a very useful curl option, -i, is used. It includes the response headers in the output, which can be used to process the output better. An important piece of header information, available in the first line of the output, is the HTTP status code, HTTP/2 200. A status code in the 200 range indicates that the API call was successful. Checking the HTTP status code is important for an automated script so that it can take appropriate action if the REST API call fails. The 200 range of codes indicates various success statuses, the 300 range relates to URL redirection, the 400 range points to issues with client calls, such as bad URLs and authentication failures, and the 500 range indicates server-side issues. In general, looking at the information available in the headers of an API call result is important to make your script robust.
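The status code ranges just described can be wrapped into a small helper that an automation script might use before acting on a response (a sketch; the category labels are my own):

```python
def classify_status(code):
    """Map an HTTP status code to the broad categories described above."""
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirection"
    if 400 <= code < 500:
        return "client error"   # bad URL, bad credentials, and so on
    if 500 <= code < 600:
        return "server error"
    return "unknown"

print(classify_status(200))  # success
print(classify_status(404))  # client error
```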
Now let's see how a POST API call, which makes a change to a resource, can be done using curl. In the following example, the selected host is muted programmatically to stop it from sending out any alert notifications:

$ curl -s -i -X POST "https://app.datadoghq.com/api/v1/host/thomass-mbp-2.lan/mute" \
-H "Content-Type: application/json" \
-H "DD-API-KEY: <YOUR-API-KEY>" \
-H "DD-APPLICATION-KEY: <YOUR-APPLICATION-KEY>" \
-d @mute.json
HTTP/2 200
date: Mon, 25 Jan 2021 01:21:59 GMT
content-type: application/json
content-length: 74
vary: Accept-Encoding
pragma: no-cache
cache-control: no-cache
set-cookie: DD-PSHARD=134; Max-Age=604800; Path=/; expires=Mon,
01-Feb-2021 01:21:59 GMT; secure; HttpOnly
x-dd-version: 35.3760712
x-dd-debug: HbtaOKlJ6OCrx9tMXO6ivMTrEM+g0c93H
Dp08trmOmgdHozC5J+vn10F0H4WPjCU
x-content-type-options: nosniff
strict-transport-security: max-age=15724800;
content-security-policy: frame-ancestors 'self'; report-uri
https://api.datadoghq.com/csp-report
x-frame-options: SAMEORIGIN
{"action":"Muted","downtime_id":1114831177,"hostname":"thomass-
mbp-2.lan"}
The following are the new points you need to note from the preceding example:
• POST is used as the API call method using the curl option -X.
• The API endpoint /api/v1/host/thomass-mbp-2.lan/mute contains the
hostname that is changed by the API call and the action taken.
• The input is provided using the curl option -d. The @ symbol indicates that the
string it precedes is the name of a file and the input must be read from it. Without
the @ qualifier, the string is considered a literal input. See the content of the input
file mute.json used in the sample program:
{
"action": "Muted",
"end": 1893456000,
"message": "Muting my host using curl"
}
The input parameters are specific to an API endpoint and the required information
must be provided.
• The JSON message from the Datadog backend is the last part of the output. While
that would provide some indication of the outcome of an API call, its success must
be determined in conjunction with the HTTP status code as you learned from the
previous example:
HTTP/2 200
{"action":"Muted","downtime_id":1114831177,"hostname":"th
omass-mbp-2.lan"}
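Putting these pieces together in code, the endpoint URL and the input body for a mute call could be assembled as follows (a sketch; the hostname and message are sample values, and end is a Unix epoch timestamp, as in mute.json above):

```python
import json
import time

def build_mute_request(hostname, message, duration_s=3600):
    """Assemble the URL and JSON body for the host mute endpoint."""
    url = f"https://app.datadoghq.com/api/v1/host/{hostname}/mute"
    body = json.dumps({
        "message": message,
        "end": int(time.time()) + duration_s,  # when the mute expires
    })
    return url, body

url, body = build_mute_request("thomass-mbp-2.lan", "Muting for deployment")
print(url)  # https://app.datadoghq.com/api/v1/host/thomass-mbp-2.lan/mute
```

Computing the end timestamp relative to the current time avoids hardcoding a far-future epoch value in the input file.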
If you look up the same host on the Datadog UI, you can verify that it's muted as shown in
the following screenshot:
Python
Now, let's see how Datadog REST API calls can be made from Python. While the utility
of curl cannot be discounted, serious automation projects tend to use a full-featured
programming language such as Python to build programs used in production. Other programming languages, such as Go, Java, and Ruby, are also supported by Datadog through language-specific wrappers for the REST APIs.
In the following sample Python program, a custom event is posted to the Datadog
backend:
# post-event.py
from datadog import initialize, api

options = {
    "api_key": "cd5bc9603bc23a2d97beb118b75f7b11",
    "app_key": "21f769cbd8f78e158ad65b5879a36594c77eb076",
}
initialize(**options)

# Post a custom event to the Datadog events stream
api.Event.create(
    title="Test event from post-event.py",
    text="A custom event posted using the Datadog REST API",
    tags=["source:api-sample"])
The test program is self-explanatory. All the inputs needed are hardcoded in it, including
the key pair needed for user authentication. In a real-life production program, the keys
would be parameterized for flexibility and security, and anticipated exceptions would be
handled for robustness.
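One way to parameterize the keys is to read them from the environment instead of hardcoding them. The following is a sketch; the environment variable names are illustrative, not a Datadog convention:

```python
import os

def load_datadog_keys(environ=os.environ):
    """Fetch the API and application keys from the environment and fail
    with a clear error when either is missing."""
    try:
        return {"api_key": environ["DD_API_KEY"],
                "app_key": environ["DD_APP_KEY"]}
    except KeyError as missing:
        raise RuntimeError("missing environment variable: %s" % missing)

options = load_datadog_keys({"DD_API_KEY": "xxx", "DD_APP_KEY": "yyy"})
print(sorted(options))
```

The returned dictionary can be passed straight to initialize(**options) as in the sample program.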
You need Python and the Datadog client library installed in your local environment to
run this program. The Datadog client library can be installed as a Python module using
the pip utility, which is normally installed with Python:

$ pip install datadog

The preceding sample Python code can be saved in post-event.py and run by
invoking it with Python as follows:
$ python post-event.py
The success of running this program can also be verified on the Events dashboard of the
Datadog UI, as shown in the following screenshot:
Dashboards
The dashboard tasks that you perform from the Datadog UI can also be done using APIs.
The following are some of the important API endpoints that cover the entire life cycle of
a dashboard:
• Creating a dashboard
• Listing existing dashboards
• Getting details about a dashboard
• Updating and deleting an existing dashboard
• Sending invitations to share a dashboard with other users
• Revoking the sharing of a dashboard
Reviewing Datadog APIs 181
Downtime
Downtime is set on a monitor to stop it from sending out alert notifications. As discussed
earlier, such configurations are needed for some operational reasons, such as when
pushing code to a production environment. The life cycle of downtime, starting from
scheduling through cancelation, can be managed using related APIs.
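As a sketch of what scheduling a downtime involves, the request body can be assembled as follows. The field names follow the mute.json example shown earlier plus the downtime API's scope/start/end fields; treat the exact schema as an assumption to verify against the API reference:

```python
import json
import time

def downtime_payload(scope, duration_hours, message):
    """Build a request body for scheduling a downtime window starting now.
    Field names are assumed from the Datadog downtime API."""
    now = int(time.time())
    return {
        "scope": scope,
        "start": now,
        "end": now + duration_hours * 3600,
        "message": message,
    }

body = downtime_payload("host:thomass-mbp-2.lan", 2, "Deploying to production")
print(json.dumps(body, sort_keys=True))
```

The resulting dictionary could then be posted to the downtime-scheduling endpoint, in the same way mute.json was posted with curl earlier.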
Events
In the previous section, you saw that events can be posted to the Datadog events stream by
using an API call. APIs are also available to get the details of an event and to query events
using filters such as tags.
Hosts
Details about the hosts monitored by Datadog can be gathered using APIs, and hosts can
be muted and unmuted, as you saw in the earlier curl example.
Metrics
Using the Datadog APIs, a variety of tasks related to metrics can be performed.
The APIs to post and query metrics data are widely used for integrating with Datadog.
As Datadog has excellent charting, dashboarding, and monitoring management features,
making monitoring data available in the form of time-series data on the Datadog platform
is very attractive.
In the next section, you will learn how custom metrics data is published to Datadog and
how it can be used later for building useful monitors.
Monitors
Monitors watch metrics data and send notifications based on the thresholds set. The
entire life cycle of a monitor can be managed using APIs.
In the next section, you will learn how to use some of the specific API endpoints related to
managing monitors.
Host tags
You have already learned that a tag is an important resource type in Datadog for
organizing and filtering information, especially metrics data. Datadog provides excellent
API support for applying and managing host-level tags.
In general, Datadog API endpoints to manage resources provide the option to apply
tags to them. Also, tags can be used as one of the filtering options with the APIs used for
querying these resources. You will learn how to do that in sample programs in the next
section.
We have only looked at the important resources and related API endpoints in this section.
To obtain the most complete and latest information on Datadog APIs, the starting point is
the official Datadog APIs page at https://docs.datadoghq.com/api/.
The next section is a tutorial to explain the use of Datadog APIs further by using sample
Python programs.
Programming with Datadog APIs 183
The problem
For this tutorial, let's assume that you are maintaining an e-commerce site and need to
monitor the performance of the business on an hourly basis, something management
might be interested in tracking. There is a custom program that queries the hourly order
count from the company's order management system and posts the metric data to
Datadog, and it is scheduled to run every hour.
Once the metrics data is available in Datadog, it can be used to build monitors and
dashboards. Also, whenever the hourly run of the custom program completes, it posts an
event to the Datadog events stream indicating the completion of the hourly job. This
ensures that if something goes wrong with the scheduled job, you can get details about it
from the events stream.
# post-metric-and-event.py
import sys
import time
from datadog import initialize, api
options = {
    "api_key": "cd5bc9603bc23a2d97beb118b75f7b11",
    "app_key": "21f769cbd8f78e158ad65b5879a36594c77eb076",
}
initialize(**options)
# The hourly order count arrives as a command-line argument
orders_count = int(sys.argv[1])
ts = time.time()
tag_list = ["product:cute-game1", "fulfillment_type:download"]
# Post metric
api.Metric.send(
    metric='mycompany.orders.hourly_count',
    points=(ts, orders_count),
    tags=tag_list)
# Post event
title = "Posted hourly sales count"
text = ("Hourly sales count has been queried from order fulfillment "
        "system and posted to Datadog for tracking.\nHourly Count: %s" % orders_count)
api.Event.create(title=title, text=text, tags=tag_list)
Let's look at this program closely. The first part handles authentication and API client
initialization, which you have already seen in the first Python sample program. The
orders_count value is passed into this program as a parameter, shown as
SALES_ORDERS_COUNT on the command line, which should be replaced with a real
number when the program is executed. In real life, another program would compute that
number and pass it on to this Python program. The sales order count could also be
computed within the Python program itself, in which case there would be no need to pass
in orders_count as a parameter.
The current timestamp is stored in a variable and used later when publishing the
time-series metrics data, as follows:
ts=time.time()
ts stores the Unix timestamp that pertains to the current time, which is passed along with
the metric value:
tag_list=["product:cute-game1","fulfillment_type:download"]
tag_list sets up an array of tags that are applied to the metric data posted to Datadog.
The following is the API call that posts the metric data to Datadog:
api.Metric.send(
metric='mycompany.orders.hourly_count',
points=(ts,orders_count),
tags=tag_list)
metric is the name of the metric; it is not created explicitly – posting it with
a data point is enough. points has to be a tuple consisting of the timestamp and a scalar
value that represents the metric value at the point in time given by the timestamp.
The metric value must be a number, which is why, earlier in the sample program,
orders_count was converted to an integer from the value passed in from the
command line:

orders_count=int(sys.argv[1])
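That conversion and the (timestamp, value) pairing can be captured in a small helper. This is a sketch; the function name is made up for illustration:

```python
import time

def hourly_count_point(raw_count):
    """Command-line arguments arrive as strings; the metric value must be
    numeric, so convert it before pairing it with a Unix timestamp."""
    return (time.time(), int(raw_count))

ts, value = hourly_count_point("120")
print(value)
```

The resulting tuple has exactly the shape expected by the points parameter of api.Metric.send.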
The second part of the program is setting the text for the event and posting it to Datadog.
After this program is executed, the results can be verified on the Datadog UI.
The metric and the metric time-series data can be looked up using Metrics Explorer as in
the following screenshot:
You can visually verify that the details posted from the program appear in the events
stream as expected. If the hourly aggregation of sales count fails for some reason, that
status could be posted to the events stream as well, and that would be a good piece of
information for those who would triage the failure.
Creating a monitor
Let's set up a monitor programmatically using the custom metric just created. Suppose
management needs to know if the hourly order count falls below 100. A monitor can be
set up to alert in that simple scenario:
# create-monitor.py
from datadog import initialize, api

options = {
    "api_key": "cd5bc9603bc23a2d97beb118b75f7b11",
    "app_key": "21f769cbd8f78e158ad65b5879a36594c77eb076",
}
initialize(**options)

tag_list = ["product:cute-game1"]

# Options that fine-tune the monitor; an illustrative minimal set is used here
monitor_options = {
    "notify_no_data": False,
    "thresholds": {"critical": 100},
}

api.Monitor.create(
    type="metric alert",
    query="avg(last_5m):avg:mycompany.orders.hourly_count{*} < 100",
    name="Hourly order count fell below 100",
    message="The order count dropped dramatically during the last hour. "
            "Check if everything is alright, including infrastructure",
    tags=tag_list,
    options=monitor_options
)
The first part of this Python program is like the Python programs you have seen earlier.
The main thing to look at is the call to api.Monitor.create. This API takes several
options that finely configure a monitor; for clarity, only a minimal set of parameters is
used in this example.
If the program is executed successfully, the new monitor will be listed on the Monitors
dashboard in the Datadog UI.
To generate data that triggers an alert, run the post-metric-and-event.py
program a few times with a sales count under 100. After waiting for 5 minutes, you will
see the newly created monitor turn red, indicating its critical status, as shown in the
following screenshot:
Figure 9.8 – The monitor "Hourly order count fell below 100" triggers an alert
Creating monitors programmatically is usually needed as part of some provisioning
process in which the newly provisioned resources must be monitored using related
metrics data.
# query-events.py
import time
import json
from datadog import initialize, api

options = {
    "api_key": "cd5bc9603bc23a2d97beb118b75f7b11",
    "app_key": "21f769cbd8f78e158ad65b5879a36594c77eb076",
}
initialize(**options)

end_time = time.time()
start_time = end_time - 500

result = api.Event.query(
    start=start_time,
    end=end_time,
    priority="normal",
    tags=["fulfillment_type:download"],
    unaggregated=True
)

print(json.dumps(result, indent=4))
The script is self-explanatory: start and end, two parameters of the api.Event.query
API, set the timeframe for the events to be considered, and further filtering is done
using the tag fulfillment_type:download, which is one of the tags applied to
the custom events posted earlier. Basically, this program will be able to locate the recent
events published by the custom program.
The last line of the program prints the result in a highly readable format as follows:
$ python query-events.py
{
"events": [
{
"alert_type": "info",
"comments": [],
"date_happened": 1611721000,
"device_name": null,
"host": "thomass-mbp-2.lan",
"id": 5829380167378283872,
"is_aggregate": false,
"priority": "normal",
"resource": "/api/v1/events/5829380167378283872",
"source": "My Apps",
"tags": [
"fulfillment_type:download",
"product:cute-game2"
],
"text": "Hourly sales count has been queried from order fulfillment system and posted to Datadog for tracking.\nHourly Count: 300",
"title": "Posted hourly sales count",
"url": "/event/event?id=5829380167378283872"
}
]
}
As you can see in the output, the JSON result contains all the attributes of the event that
can be viewed on the Events dashboard, and more. Typically, a program like this would be
part of another automation in which the events are queried to track the status of the
scheduled program that posts the hourly sales order count as a metric to the Datadog
platform.
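For instance, a freshness check over the query result could flag a missed hourly run. The helper below is a sketch that operates on the result structure shown above; the function name and grace period are illustrative:

```python
import time

def hourly_job_ran(result, title="Posted hourly sales count", max_age=3900):
    """Return True if an event with the given title happened within the last
    max_age seconds (an hour plus a small grace period). The result structure
    is the one returned by api.Event.query."""
    now = time.time()
    for event in result.get("events", []):
        if event.get("title") == title and now - event.get("date_happened", 0) <= max_age:
            return True
    return False

# A result shaped like the sample output above, with a recent timestamp:
recent = {"events": [{"title": "Posted hourly sales count",
                      "date_happened": int(time.time()) - 600}]}
print(hourly_job_ran(recent))
```

A scheduler or monitor wrapper could call this periodically and raise an alert when it returns False.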
Though the examples we looked at in this section are simple, they can easily be expanded
into useful programs for implementing various integration and automation requirements.
Next, let's look at the best practices related to using the Datadog APIs.
Best practices
We reviewed the Datadog APIs and learned the basics of how they are called from curl
and Python. Now, let's look at the best practices for using the APIs to automate
monitoring tasks.
Summary
In this chapter, you mainly learned how to use curl and Python to interact with the
Datadog platform using Datadog APIs. Also, we looked at major categories of APIs
provided by Datadog. The important thing to remember here is that almost anything you
can do on the Datadog UI can be performed programmatically using an appropriate API.
We will continue looking at the Datadog integration options and in the next chapter, you
will learn specifically about some important monitoring standards that are implemented
by all major monitoring applications, including Datadog.
10
Working with Monitoring Standards
You have already seen how Datadog-supplied integrations and REST APIs are useful
in extending Datadog's features. In this chapter, we will explore more options for
extending these features, which are essentially implementations of standards in the
monitoring space, and how they can be used for rolling out your custom monitoring
requirements.
These are the topics that are covered in this chapter: monitoring networks using SNMP, consuming application metrics using JMX, and working with the DogStatsD interface.
Technical requirements
To try out the examples mentioned in this book, you need to have the following tools
installed and resources available:
• An Ubuntu 18.04 environment with a Bash shell. The examples might work on
other Linux distributions also, but suitable changes must be made to the Ubuntu-
specific commands.
• A Datadog account with admin-level access.
• The Datadog Agent running at the host level or as a microservice, depending on the
example, pointing to the Datadog account.
• curl.
• Python 2.7, Python 3.8, or higher
As it's not easy to find a network device for testing, especially when you are in a cloud
environment, we will run the SNMP service on a virtual machine and look at the
metrics using snmpwalk. Also, the steps to enable SNMP integration in Datadog will be
explained using that environment.
You need an Ubuntu host to try out the following SNMP tutorial. These steps were tested
on an AWS EC2 node running Ubuntu 18.04:
4. These steps are enough to have an SNMP service up and running locally, and you
can use snmpwalk to scan the service:

$ snmpwalk -mALL -v2c -cpublic localhost
HOST-RESOURCES-MIB::hrSystemInitialLoadDevice.0 = INTEGER: 393216
HOST-RESOURCES-MIB::hrSystemNumUsers.0 = Gauge32: 1
HOST-RESOURCES-MIB::hrSystemProcesses.0 = Gauge32: 103
HOST-RESOURCES-MIB::hrSystemMaxProcesses.0 = INTEGER: 0
• The -mALL option directs snmpwalk to look at all the MIBs available. On the
Ubuntu host, those files are stored under the /usr/share/snmp/mibs directory.
• -v2c points to the version of the SNMP protocol to be used. There are three
versions and in Datadog it defaults to v2, the same as what's used here.
• -cpublic indicates that the community string used for the scan is public. A
community string in v1 and v2 of the SNMP protocol is like a passphrase. In v3,
that is replaced with credential-based authentication.
Now, let's see how the SNMP integration in Datadog can be configured to access the
SNMP service running on the Ubuntu host:
1. On the host running the Datadog Agent, create the integration configuration file
from the example provided:

$ cd /etc/datadog-agent/conf.d/snmp.d
$ cp conf.yaml.example conf.yaml
2. Edit conf.yaml: under the instances section, point ip_address to
127.0.0.1, which corresponds to localhost.
3. Set community_string to public.
4. Add two custom tags so the metrics published by SNMP integration can easily be
tracked:
tags:
- integration:snmp
- metrics-type:ubuntu-snmp
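Putting steps 2 to 4 together, the relevant portion of conf.yaml would look roughly like the following sketch; the example file contains many more commented options, and the indentation here is illustrative:

```yaml
init_config:

instances:
  - ip_address: 127.0.0.1
    community_string: public
    tags:
      - integration:snmp
      - metrics-type:ubuntu-snmp
```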
5. Restart the Datadog Agent for these changes to take effect. Then check the status
and verify that the Datadog Agent can query the SNMP service locally:

$ sudo systemctl restart datadog-agent
$ sudo datadog-agent status
snmp (3.5.3)
------------
  Instance ID: snmp:98597c6009cbe92e [OK]
  Configuration Source: file:/etc/datadog-agent/conf.d/snmp.d/conf.yaml
  Total Runs: 96
  Metric Samples: Last Run: 2, Total: 192
  Events: Last Run: 0, Total: 0
  Service Checks: Last Run: 1, Total: 96
  Average Execution Time : 7ms
  Last Execution Date : 2021-01-31 08:16:21.000000 UTC
  Last Successful Execution Date : 2021-01-31 08:16:21.000000 UTC
The metrics that get published to Datadog from this integration could be looked up in the
Metrics Explorer, as in the following screenshot:
Multiple network devices can be specified under the instances section of the SNMP
integration configuration file and the metrics could be differentiated using appropriate
tags. The Datadog Agent can be configured to scan an entire network subnet for
devices that are SNMP-enabled. For the details and configuration, consult the official
documentation at https://docs.datadoghq.com/network_performance_monitoring/devices/setup.
In the next section, we will discuss JMX, which is an important area related to the
monitoring of Java applications, and you will learn how Datadog is integrated with it.
In production, Cassandra is run on a cluster of machines that might span multiple data
centers for reliability. To demonstrate its JMX features, we will run it in a single node
using a binary tarball that you can download from https://cassandra.apache.org/download/,
where you can also pick up the latest stable version:
2. Extract the tarball and change directory to where it's extracted.
3. Make sure Java 8 is available in the environment. If not, install it. On an Ubuntu
host, this can be done by installing from a package.
4. Start the Cassandra service:

$ bin/cassandra
That's pretty much it if everything goes smoothly and if this instance is not intended for
use with an application. You could use this service for testing Datadog's JMX support.
Now, let's look at the runtime details of Cassandra and the corresponding JVM to
understand how JMX is configured to work. To see the status of the Cassandra service,
you can use nodetool, and you could also look at the log for any issues:
$ bin/nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
UN  127.0.0.1  70.71 KiB  256     100.0%            03262478-23a3-4bd4-97f1-bc2c837ad650  rack1
$ tail -f logs/system.log
One important thing to note here is the way the Cassandra Java application was started
up with JMX enabled. If you look at the command line of the running Java process, you
will find the JMX-related options, including the JMX port, which defaults to 7199.
We have looked at how Datadog can be configured to consume application metrics via the
JMX interface using the example of Cassandra in this section. You will learn more about
JMX-based integration in the next section.
1. On the Cassandra host, there should also be a Datadog Agent running. Under the
/etc/datadog-agent/conf.d/cassandra.d directory, use the example
configuration file to set up the configuration file for Cassandra integration:
$ cp conf.yaml.example conf.yaml
The configuration file copied from the example is good to enable the integration.
Note that port 7199 and the metrics specified in the configuration file for collecting
data are compatible with the JMX. Also, the nature of this integration is very clear
from the following setting in the configuration file:
init_config:
  ## @param is_jmx - boolean - required
  ## Whether or not this file is a configuration for a JMX integration
  #
  is_jmx: true
2. After setting up the configuration file, restart the Datadog Agent service:

$ sudo systemctl restart datadog-agent
3. Check the status of the Cassandra integration, specifically that listed under the
JMXFetch section:
========
JMXFetch
========
Initialized checks
==================
cassandra
instance_name : cassandra-localhost-7199
message : <no value>
metric_count : 230
service_check_count : 0
status : OK
The metrics fetched by this integration can be looked up in the Metrics Explorer. For
example, see how one of the metrics fetched from the JMX interface is looked up and
charted on the Metrics Explorer:
1. Before the generic JMX integration is enabled, disable the Cassandra integration
by renaming the conf.yaml configuration file at /etc/datadog-agent/
conf.d/cassandra.d to conf.yaml.BAK.
2. Create the conf.yaml configuration file from the example file,
conf.yaml.example, available at /etc/datadog-agent/conf.d/jmx.d.
You can verify that the JMX port specified in the configuration file is 7199:
instances:
-
## @param host - string - optional - default:
localhost
3. Add these tags to the configuration file so you can filter the metrics later using
them:
tags:
- integration:jmx
- metrics-type:default
4. Restart the Datadog Agent service for the changes to take effect.
5. Check the status of JMXFetch to make sure that the integration is working as
expected:
========
JMXFetch
========
Initialized checks
==================
jmx
instance_name : jmx-localhost-7199
message : <no value>
metric_count : 27
service_check_count : 0
status : OK
Failed checks
=============
no checks
Note that Cassandra was mentioned under the JMXFetch status in the previous example
and this time it is jmx instead.
With the JMX integration working, you will be able to look up any default JMX metrics in
the Metrics Explorer as in the following screenshot:
conf:
  - include:
      domain: org.apache.cassandra.db
      type:
        - Caches
This is just a sample set of the metrics available. Cassandra publishes a large set of metrics
that you can look up in /etc/datadog-agent/conf.d/cassandra.d/metrics.yaml.
The point to note here is that for Datadog to collect application metrics from the
JMX interface, they must be indicated in the configuration file as you have done in the
sample.
Also, set the value of the metrics-type tag to cassandra for the easy filtering of
metrics collected by Datadog from this example. As usual, for these configuration changes
to take effect, you need to restart the Datadog Agent.
You will be able to look up the new metrics collected by Datadog on the Metrics Explorer,
as in the following screenshot:
Working with the DogStatsD interface 205
=========
DogStatsD
=========
Event Packets: 0
Event Parse Errors: 0
Metric Packets: 5,107,230
Metric Parse Errors: 0
Service Check Packets: 0
Service Check Parse Errors: 0
Udp Bytes: 329,993,779
Udp Packet Reading Errors: 0
Udp Packets: 5,107,231
Uds Bytes: 0
Uds Origin Detection Errors: 0
Uds Packet Reading Errors: 0
Uds Packets: 0
The preceding steps describe the process to access JMX metrics from a Java application.
While some applications, such as Cassandra, provide application-specific integration to
access JMX metrics, this generic integration is enough to consume JMX metrics from any
Java application.
Publishing metrics
Custom metric values can be published to the Datadog backend using various means so
that they can be used in monitors and dashboards. As an example, an arbitrary metric
value can be sent to the DogStatsD service running on localhost as in the following
example:
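A minimal sketch of such a send over raw UDP follows; the metric name and tag are made up for illustration, and the default DogStatsD port 8125 is assumed:

```python
import socket

# DogStatsD accepts plain UDP datagrams in the StatsD wire format:
#   <metric.name>:<value>|<type>[|#tag1,tag2]
# 'g' marks a gauge value; no authentication is involved.
payload = "statsd_test.arbitrary_metric:42|g|#metric-source:manual"
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(payload.encode("ascii"), ("127.0.0.1", 8125))
sock.close()
print("sent: " + payload)
```

Because UDP is connectionless, the send succeeds whether or not an agent is listening, which is also why delivery should not be assumed for critical data.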
The test metric value sent to the DogStatsD service will be consumed by Datadog as a
backend application and that metric value can be looked up in the Metrics Explorer, as in
the following screenshot:
Figure 10.5 – Looking up a metric value sent via DogStatsD in the Metrics Explorer
Values for a variety of metric types, such as count, gauge, set, and histogram, can be
posted to the Datadog backend using the DogStatsD interface. The following Python code
example demonstrates how such a time series dataset could be published:
# post-metric-dogstatsd.py
import time
import random
from datadog import initialize, statsd

options = {
    'statsd_host': '127.0.0.1',
    'statsd_port': 8125
}
initialize(**options)

while(1):
    # Get a random number to mimic the outdoor temperature.
    temp = random.randrange(50, 70)
    statsd.gauge('statsd_test.outdoor_temp', temp,
                 tags=["metric-source:dogstatsd"])
    time.sleep(10)
The program picks a random number between 50 and 70 and posts it as the outdoor
temperature with the tag metric-source:dogstatsd. It runs in an infinite loop,
posting the metric every 10 seconds to mimic a time-series dataset.
An important thing you should note here is that there is no authentication required to
post data. As long as the DogStatsD service port (by default 8125) is open on the host
where the Datadog Agent (DogStatsD is part of the Datadog Agent) runs, the preceding
sample program can post the metrics. While this offers flexibility, it could also be a
security hole if the network where the Datadog Agents run is not hardened.
The data posted into the Datadog backend can be looked up in the Metrics Explorer
using the metric name and tags if needed, as shown in the following screenshot:
Figure 10.6 – Time series metric values posted via the DogStatsD interface
StatsD is a general-purpose interface for publishing only metrics. However, Datadog's
implementation, DogStatsD, has extensions to support Datadog-specific resources such
as events and service checks. Those additional features are useful if DogStatsD is used as
a hub for posting status and updates to the Datadog backend from your integration,
especially if your application uses DogStatsD as the primary channel for publishing
monitoring information.
Posting events
Posting an event to the Datadog events stream is simple and is illustrated in the following
Python program:
# post-event-dogstatsd.py
from datadog import initialize, statsd
options = {
    'statsd_host': '127.0.0.1',
    'statsd_port': 8125
}
initialize(**options)
# Post an event to the events stream via DogStatsD
statsd.event(
    title="Posting events via DogStatsD",
    text="A test event posted using the DogStatsD interface")
This program will post an event to the events stream as shown in the following screenshot:
Best practices
Now, let's look at the best practices in making use of monitoring standards and the
support provided by Datadog:
• When monitoring network devices using SNMP integration, the number of metrics
that you must handle can be overwhelming. So, it's important to identify a few key
metrics that can track performance and proactively identify issues, and implement
monitors using those.
• JMX can be used to manipulate the workings of an application, but such
capabilities shouldn't be exercised from the monitoring infrastructure side because
monitoring is essentially a read-only activity. In other words, monitoring
applications don't usually initiate corrective actions because monitoring is not
considered part of the application system, and the non-availability of a monitoring
tool should not hamper the workings of the main application it monitors.
• StatsD is designed to handle only metrics, which can be consumed by applications
such as Datadog. If there are multiple monitoring tools in your environment and
Datadog is just one of them, it's better to publish only metrics via DogStatsD, to
maintain the flexibility of moving data between multiple systems.
• As DogStatsD doesn't require any authentication besides access to the service port,
the environment where DogStatsD runs, a host or container, should be secured well
so no unauthorized posting of information can happen.
Summary
In this chapter, you have learned about three important monitoring standards, SNMP,
JMX, and StatsD, that help with integrating network devices and custom applications
in Datadog. As these are general-purpose standards, they are supported by most of the
popular monitoring tools. The general pattern is to stick with the standard features and
avoid vendor-specific extensions, as that approach keeps your monitoring compatible
with other monitoring tools as well.
In the next chapter, the last chapter of the integration part of the book, we will discuss
how integration is handled directly from custom applications. There are both official and
community-developed programming libraries available to integrate applications directly
with Datadog. We will explore how some of those libraries and the Datadog REST APIs
are used to integrate custom applications with Datadog.
11
Integrating with Datadog
In the previous chapter, you learned about some of the important monitoring standards
and how they are implemented in Datadog, with the objective of extending the features of
the Datadog monitoring platform. So far, in this part of the book, we have been looking
only at extending the monitoring capabilities of an organization with a focus on Datadog.
Integration with Datadog can happen both ways: in addition to populating Datadog with
monitoring-related data for use with various Datadog features, the information available
in Datadog can also be utilized by other internal applications.
To roll out such general-purpose integrations between applications, a rich set of APIs
should be available. We have already seen that the Datadog REST API is a comprehensive
programming interface that other applications can use to access the Datadog platform to
publish and extract information. We have also looked at DogStatsD as one of the methods
to publish information to the Datadog platform, and we will learn more about how that
interface can be used from other applications. We will also review other methods, which
are mainly community-based efforts that are not officially shipped with the Datadog
product suite but are very useful in rolling out custom integrations.
In this chapter, you will learn about the commonly used client libraries for Datadog
integration, evaluate some community projects, and see how custom integrations are
developed.
Technical requirements
To implement the examples given in this chapter, you need to have an environment with
the following tools installed:
Once installed, REST API-specific calls can be made from a program by importing the
api module, and DogStatsD-specific calls by importing the statsd module into the
Python environment.
The code repository related to this client library is available on GitHub at https://
github.com/DataDog/datadog-api-client-java. Review the documentation
available there to understand how this Java library can be built and installed in Java
development environments using build tools such as Maven and Gradle.
<dependency>
<groupId>com.datadoghq</groupId>
<artifactId>java-dogstatsd-client</artifactId>
<version>2.11.0</version>
</dependency>
Once the preceding dependency is added to the Maven configuration, the StatsD APIs
can be called from the Java application and the application can be built.
datadog-go
datadog-go is a DogStatsD client library for the Go programming language and it is
maintained by Datadog at https://github.com/DataDog/datadog-go on
GitHub. As with other official DogStatsD client libraries, this also supports events and
service checks in addition to metrics.
The library officially supports Go versions 1.12 and above. The details of the installation
and usage of the library are available in the code repository on GitHub.
dogstatsd-ruby
dogstatsd-ruby is a DogStatsD client library for the Ruby programming language and
it's maintained by Datadog at https://github.com/DataDog/dogstatsd-ruby
on GitHub. As with other official DogStatsD client libraries, this also supports events and
service checks in addition to metrics.
The details of the installation and usage of the library are available with the code
repository on GitHub. Full API documentation is available at
https://www.rubydoc.info/github/DataDog/dogstatsd-ruby/master/Datadog/Statsd.
Evaluating community projects 217
For a complete and up-to-date list of client libraries, check out the official compilation at
https://docs.datadoghq.com/developers/libraries/.
In this section, you have become familiar with two groups of client libraries that could
be used with major programming languages to access Datadog resources. Those client
libraries are useful in building Datadog-specific features from a program or a script from
the ground up, as part of integrating with the Datadog SaaS backend. In the next section,
we will look at some of the tools that provide either well-integrated features or building
blocks to develop such integrations.
dog-watcher by Brightcove
In a large-scale environment with several dashboards and monitors built on Datadog for
operational use, maintaining them could quickly become a major chore. This Node.js
utility can be used to take backup of Datadog dashboards and monitors in JSON format
and save it into a Git repository. Such backups are also very useful in recreating similar
resources in the same account or elsewhere.
The utility needs to be run as a Node.js service. The code and the details of configuring
it to run are available on GitHub at https://github.com/brightcove/dog-watcher. It can be
scheduled to take periodic backups, or to take a backup whenever the Datadog resources
being tracked change.
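The idea behind dog-watcher can be sketched in a few lines of Python (this is not dog-watcher itself; the resource dictionaries below stand in for API responses that a real script would fetch from the Datadog REST API before committing the files to Git):

```python
import json
import tempfile
from pathlib import Path

def save_backup(resources, outdir):
    """Write each Datadog resource (dashboard or monitor) to its own
    JSON file, ready to be committed to a Git repository."""
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    for res in resources:
        path = out / f"{res['type']}-{res['id']}.json"
        # sort_keys keeps diffs stable between backup runs
        path.write_text(json.dumps(res, indent=2, sort_keys=True))
    return sorted(p.name for p in out.iterdir())

# Placeholder resources standing in for Datadog API responses
resources = [
    {"type": "monitor", "id": 101, "name": "High CPU"},
    {"type": "dashboard", "id": 7, "title": "Web tier"},
]
with tempfile.TemporaryDirectory() as d:
    print(save_backup(resources, d))  # ['dashboard-7.json', 'monitor-101.json']
```

Writing one file per resource with stable key ordering is what makes the Git history useful: each change to a dashboard or monitor shows up as a readable diff.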
kennel
kennel is a utility developed in Ruby that can be used to manage Datadog monitors,
dashboards, and Service Level Objectives (SLOs) as code. Managing all kinds of
infrastructure resources as code is a DevOps tenet and this tool is useful in implementing
that. The code and detailed documentation on the utility are available on GitHub at
https://github.com/grosser/kennel.
• Users
• Roles
• Metrics
• Monitors
• Dashboards
• Downtimes
• Integrations with public cloud platforms
Developing integrations
In Chapter 8, Integrating with Platform Components, you learned how to configure an
integration. Datadog ships official integrations with a lot of third-party applications that
are used to build the cloud platform where a software application runs. The best thing
about using an official integration is that the metrics specific to that integration will be
available for use in dashboards, monitors, and other Datadog resources after minimal
configuration.
Datadog lets you build custom integrations that work exactly like the official ones.
Doing so requires DevOps expertise, especially coding skills in Python, and the procedure
Datadog lays out for building an Agent-compatible integration takes effort to learn.
However, it might make sense to build an integration for the
following reasons:
In this section, you will learn the steps to build an integration from scratch. Building
a full-fledged integration is beyond the scope of this chapter, but you will learn the general
steps to do so with some hands-on work.
Prerequisites
Some setup is needed in your local development environment for the integration to be
developed, tested, built, packaged, and deployed:
This will install a lot of things in the local Python 3 environment and the output will
look like the following if everything goes well:
Successfully installed PyYAML-5.3.1 Pygments-2.8.0
appdirs-1.4.4 atomicwrites-1.4.0 attrs-20.3.0 bcrypt-
3.2.0 bleach-3.3.0 cached-property-1.5.2 certifi-
2020.12.5 cffi-1.14.5 chardet-4.0.0 colorama-0.4.4
coverage-5.5 cryptography-3.4.6 datadog-checks-dev-9.0.0
distlib-0.3.1 distro-1.5.0 docker-4.4.4 docker-
compose-1.28.5 dockerpty-0.4.1 docopt-0.6.2 docutils-0.16
filelock-3.0.12 idna-2.10 in-toto-1.0.1 iniconfig-1.1.1
iso8601-0.1.14 jsonschema-3.2.0 keyring-22.3.0
markdown-3.3.4 mock-4.0.3 packaging-20.9 paramiko-2.7.2
pathspec-0.8.1 pip-tools-5.5.0 pkginfo-1.7.0 pluggy-
0.13.1 psutil-5.8.0 py-1.10.0 py-cpuinfo-7.0.0
pycparser-2.20 pygal-2.4.0 pygaljs-1.0.2 pynacl-1.4.0
pyparsing-2.4.7 pyperclip-1.8.2 pyrsistent-0.17.3 pytest-
6.2.2 pytest-benchmark-3.2.3 pytest-cov-2.11.1 pytest-
mock-3.5.1 python-dateutil-2.8.1 python-dotenv-0.15.0
readme-renderer-29.0 requests-2.25.1 requests-
toolbelt-0.9.1 rfc3986-1.4.0 securesystemslib-0.20.0
The commands and output provided here were verified on Unix-like systems such as
Linux and macOS. Development can be done on Windows as well; refer to the official
documentation for platform-specific directions.
For the purpose of describing the steps better, let's assume that you have an application
named CityWeather that supplies a bunch of weather-related information for a specific city
at any time of day. The objective of the integration is to get some of that weather info to be
published into Datadog.
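The heart of such an integration is its check logic, sketched below in plain Python for illustration. A real Agent check subclasses AgentCheck from the datadog_checks.base package and submits values with methods such as self.gauge(); the CityWeather API and the metrics shown here are entirely hypothetical:

```python
def fetch_city_weather(city):
    # Stand-in for a call to the hypothetical CityWeather API.
    return {"temperature_c": 21.5, "humidity_pct": 40, "wind_kph": 12.0}

def collect_metrics(city):
    """Turn a CityWeather response into (metric_name, value, tags) tuples,
    the shape an Agent check would submit as gauges."""
    data = fetch_city_weather(city)
    tags = [f"city:{city}"]
    return [(f"cityweather.{key}", value, tags) for key, value in data.items()]

for name, value, tags in collect_metrics("san_francisco"):
    print(name, value, tags)
```

In the real check, each tuple would become a self.gauge(name, value, tags=tags) call inside the check() method that the Agent invokes on every collection interval.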
The dry run of creating the directory can be done as follows:
$ ddev create -n CityWeather
The output would show a hierarchical list of directories and files in the scaffolding; for
brevity, it is not provided here.
The directory structure can be created by running the same command without the -n
option. Though the name used has mixed-case characters, the top-level directory name
will be all lowercase. So, you can change into the newly created directory as follows:
$ cd cityweather
• README.md: The README file used for documenting the new integration
in Git. The template provided has the correct headers and formatting for the
documentation to be standard.
• manifest.json: The manifest describing the integration and file locations.
• tests: The directory where unit and integration tests will be configured and
maintained.
• metadata.csv: Maintains the list of all collected metrics.
• tox.ini: tox is used to run the tests. Make sure that the Python versions
specified in this configuration file match the Python versions
available locally. For example, if Python 2.7 and Python 3.9 are used, the content of
this file would look like the following, and you would make changes as needed:
[tox]
minversion = 2.0
skip_missing_interpreters = true
basepython = py39
envlist = py{27,39}

[testenv]
ensure_default_envdir = true
envdir =
    py27: {toxworkdir}/py27
    py39: {toxworkdir}/py39
dd_check_style = true
usedevelop = true
platform = linux|darwin|win32
deps =
    datadog-checks-base[deps]>=6.6.0
    -rrequirements-dev.txt
passenv =
    DOCKER*
    COMPOSE*
commands =
    pip install -r requirements.in
    pytest -v {posargs}
Now, with the tooling for developing an integration in place locally, let's try running the
rest of the steps using an existing integration that is available in the repository. Note that
there are about 100 extra integrations available in this repository, and you can use them
by building the corresponding package, a trick that you will learn soon.
For the purposes of testing and deployment practice, let's select an integration available
for Zabbix. This is a popular on-premises monitoring tool and is widely used in both data
centers and cloud platforms. Many companies that are migrating to using Datadog might
have Zabbix installations to deal with, and a more practical strategy would be to integrate
with Zabbix rather than trying to replace it, with a focus on rolling out Datadog for
monitoring any new infrastructure. In such scenarios, you will see Datadog and Zabbix
(or another on-premises monitoring application) running side by side.
Running tests
For the code developed for the integration, both unit and integration tests can be run
locally with the help of Docker. In the case of Zabbix integration, Zabbix will be run
locally as a microservice on Docker and the tests will be run against that instance. The
Zabbix deployment details are provided in a docker-compose.yml file.
Follow these steps to deploy Zabbix and test the integration:
5. The output is verbose and it would end with messages similar to the following,
indicating the success of the tests:
py39: commands succeeded
style: commands succeeded
congratulations :)
Passed!
Next, let's look at how the integration code can be packaged for installation.
Building a package
To deploy the integration, it needs to be packaged into a wheel, a format used for
packaging and distributing Python programs. The wheel will contain only the files needed
for the integration to work, not most of the source files used to build it.
The integration can be built as follows:
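With datadog-checks-dev installed, the build is typically invoked from the integrations-extras checkout like this (this exact invocation is an assumption; run ddev --help to confirm the current syntax):

```shell
$ ddev release build zabbix
```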
Building `zabbix`...
'build/lib' does not exist -- can't clean it
'build/bdist.macosx-10.13-x86_64' does not exist -- can't clean
it
'build/scripts-3.9' does not exist -- can't clean it
warning: no previously-included files matching '__pycache__'
found anywhere in distribution
Build done, artifact(s) in: /Users/thomastheakanath/dd/integrations-extras/zabbix/dist
Success!
The wheel file can be found at the location mentioned in the build command output:
$ ls dist/*.whl
dist/datadog_zabbix-1.0.0-py2.py3-none-any.whl
Next, let's look at how the integration is installed using the package just built.
Deploying an integration
If the Datadog Agent runs locally, you can install the integration by pointing to the wheel
built in the last step. To install it elsewhere, the wheel must first be copied to each target
machine. Once the wheel file is available locally, it can be installed as follows:
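The installation is typically done with the Agent's own CLI, along these lines (the subcommand and the wheel path shown are assumptions; adjust them for your Agent version and file location):

```shell
$ sudo -u dd-agent datadog-agent integration install -w ./dist/datadog_zabbix-1.0.0-py2.py3-none-any.whl
```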
This makes the integration available to the Datadog Agent, to be enabled like an
official integration shipped with the Datadog Agent. Now, if you check under the conf.d
directory of the Datadog Agent home directory, you can see that zabbix.d is listed as
follows:
$ ls zabbix.d
conf.yaml.example
As with any other standard integration, to enable it, conf.yaml needs to be created from
the sample file provided and the Datadog Agent service needs to be restarted.
The complete procedure to build a Datadog integration is officially documented at
https://docs.datadoghq.com/developers/integrations/new_check_howto. Refer to that
documentation for finer details and updates. Now, let's look at the best practices related
to the topics you have just covered.
Best practices
The availability of client libraries and the option to build custom integrations add a lot
of flexibility to your toolbox for integrating Datadog with another application or even
a batch job. However, there are certain best practices that you need to look at before
starting to implement automation or customization using one or more of those options:
• If you can choose the programming language, pick a language that is popular
and well supported, such as Python for scripting and Java for enterprise
applications. If the application to be integrated runs primarily on Microsoft
Windows platforms, choosing C# would be wise.
• Choose a client library that is officially maintained. It's a no-brainer – you need
to rely on a library that will keep up with the enhancements made to the Datadog
platform and REST API.
• Plan to manage Datadog resources as code using Terraform. Ansible can help there
too, but its support for Datadog is limited as of now.
• If you build an integration and it has external use, publish it to Datadog's
integrations-extras repository on GitHub. Usage by others can bring valuable
feedback and fixes that make it more robust and useful.
Following these best practices and related patterns will help you choose the right approach
for implementing your integration requirements.
Summary
By now, you should be aware of all the important integration options available in Datadog
for a variety of use cases; let's recap what we covered in this chapter specifically.
Many Datadog client libraries targeting popular programming languages are available,
both officially maintained and at the community level. There are two types of client
libraries – ones that provide a language wrapper for the Datadog REST API, and libraries
that support interfacing with Datadog via the StatsD-compatible DogStatsD
service. Also, there are community-level efforts to integrate with Datadog that are
available on GitHub.
There are other types of Datadog client libraries that are not discussed here, such as APM
and distributed tracing libraries, libraries that support serverless computing resources
such as AWS Lambda, and client libraries that specifically target the log management
feature of Datadog. The usage of these libraries is not different from how the core API
and DogStatsD libraries are used, and you should check those out if you have any related
use cases.
With this chapter, Part 3 of this book, Extending Datadog, has been completed. In the next
part of the book, you will learn more advanced monitoring concepts and features that are
implemented by Datadog. We looked at monitoring microservices briefly earlier, and, in
the next chapter, you will learn more about monitoring microservices, especially how this
is done in an environment orchestrated by Kubernetes.
Section 3:
Advanced
Monitoring
Technical requirements
To try out the examples mentioned in this chapter, you need to have an environment
where the Datadog Agent is deployed on either Docker or Kubernetes. The examples were
developed in the following environments:
• Docker: Docker Desktop 3.2.2 running on a MacBook Pro. Any environment that
has the latest version of Docker would be compatible.
• Kubernetes: Single-node minikube v1.18.1 running on a MacBook Pro. To
try out the examples, any Kubernetes environment that has kubectl v1.20.0 or
higher can be used.
The examples might even work with older versions of Docker and Kubernetes; you are
encouraged to try out the examples on their respective most recent versions.
Note that DATADOG-API-KEY must be replaced with a valid API key associated with a
Datadog user account.
Upon successfully running the Datadog Agent container, you should be able to verify it
from the command line as follows:
$ docker ps
CONTAINER ID   IMAGE                           COMMAND   CREATED        STATUS                  PORTS                NAMES
2689f209fc16   gcr.io/datadoghq/agent:latest   "/init"   29 hours ago   Up 29 hours (healthy)   8125/udp, 8126/tcp   datadog-agent
There could also be other containers running on the host, but look for the Datadog Agent
container labeled datadog-agent.
To check whether the Datadog Agent service is able to collect logs from other containers,
you can run any other containers that generate logs. For the example here, we can try
running an NGINX container as follows:
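The command was likely of the following form (the container name web and the host port 8080 come from the description below; the rest is an assumption):

```shell
$ docker run -d --name web -p 8080:80 nginx
```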
This command will run an NGINX container labeled web, a name that we can use later in
the Datadog dashboards to locate this container. Also, the container port will be mapped
to port 8080 on the Docker host machine.
The output of the preceding command is not provided here for brevity, but you can check
if it's up and running as follows:
$ docker ps
CONTAINER ID   IMAGE                           COMMAND                  CREATED        STATUS                  PORTS                  NAMES
53cd9cae287d   nginx                           "/docker-entrypoint.…"   28 hours ago   Up 28 hours             0.0.0.0:8080->80/tcp   web
2689f209fc16   gcr.io/datadoghq/agent:latest   "/init"                  29 hours ago   Up 29 hours (healthy)   8125/udp, 8126/tcp     datadog-agent
Look at the name of the container and the port mapping; this enables you to access the
NGINX service on port 8080 on the host machine, even though the service runs on port
80 in the container.
Also, you can access the NGINX service in a web browser using the URL
http://localhost:8080 (or using an IP address or CNAME in place of localhost if the service
is accessed from a remote host). The service can also be accessed using a command-line
tool such as curl. Do this a few times to generate some logs and see them being collected
by the Datadog Agent.
Now let's see how the logs can be viewed using the Datadog UI. By navigating to
Infrastructure | Containers, you can get to the Containers dashboard, as shown
in the following screenshot:
By clicking on the link Open in Log Explorer, as shown in Figure 12.3, the Log Explorer
dashboard can be launched, and the interface will be as in the following screenshot:
FROM nginx:latest
6. Verify whether you can access the container on port 8080 on the Docker host:
$ curl -I http://localhost:8080
HTTP/1.1 200 OK
Server: nginx/1.19.8
Date: Wed, 17 Mar 2021 05:55:45 GMT
Content-Type: text/html
Content-Length: 612
Last-Modified: Tue, 09 Mar 2021 15:27:51 GMT
Connection: keep-alive
ETag: "604793f7-264"
Accept-Ranges: bytes
If all the previous steps were successful, then you can check whether Datadog is capturing
the logs.
To verify that the logs are collected, the container logs can be looked up on the Datadog
Containers dashboard as you have seen earlier:
In this section, you have learned that the containers running on a Docker host can be
monitored using Datadog at both the infrastructure and application levels, and specifically
how to collect application logs from a container. Kubernetes is becoming the default
platform for deploying microservices, and in the next section, you will learn how
to use Datadog to monitor containers that run on Kubernetes.
Monitoring Kubernetes
Docker can be used for both packaging and running microservices, and you have seen
examples of that in the previous section. However, that is only one of the packaging
solutions available for microservices. Kubernetes is a platform for running microservices
that are packaged using tools such as Docker. It provides a vast number of features to
orchestrate and maintain the deployment of a microservices-based software system.
Practically, it can be considered an operating system for microservices.
Kubernetes environments can be set up on a wide variety of infrastructures, starting from
your laptop for testing purposes through clusters of several machines in a data center.
However, the most popular option to run Kubernetes is using the managed services
available on public cloud platforms such as Elastic Kubernetes Service (EKS) on AWS,
Google Kubernetes Engine (GKE), and Azure Kubernetes Service (AKS). Regardless of
the underlying infrastructure, Kubernetes can be accessed and used uniformly using tools
such as kubectl. In this section, you will learn how to deploy the Datadog Agent to
monitor containers running in a Kubernetes environment.
To test the steps provided here, a minikube-based Kubernetes environment that runs
on a personal computer was used. The details of setting up minikube can be found here:
https://minikube.sigs.k8s.io/docs/start/. The steps to deploy Datadog
are applicable in any Kubernetes environment and you can try them out anywhere
regardless of the underlying Kubernetes infrastructure.
Monitoring Kubernetes 239
Monitoring Kubernetes has two parts – monitoring the Kubernetes cluster itself, and
monitoring the microservices that run in the containers orchestrated by Kubernetes. The
first is infrastructure monitoring and the latter is application monitoring. For Datadog to
access the related monitoring information, the Datadog Agent must be installed as one
of the containers in the Kubernetes cluster, supported by Kubernetes resources defined
for that.
2. Add the following code snippet to the end of the clusterrole.yaml file. This
might not be needed in the latest versions of Kubernetes:
- apiGroups:
  - "apps"
  resources:
  - deployments
  - replicasets
  - pods
  - nodes
  - services
  verbs:
  - list
  - get
  - watch
$ wget https://raw.githubusercontent.com/DataDog/datadog-agent/master/Dockerfiles/manifests/rbac/serviceaccount.yaml
$ kubectl apply -f serviceaccount.yaml
$ kubectl get serviceaccounts |grep datadog-agent
datadog-agent 1 23h
8. Update the sample Datadog Agent manifest with the following changes.
In the section for the datadog-agent secret resource, update the api-key
field with the base64-encoded value of a valid API key. The encoding can be done
in different ways, including with online tools, but it can be done reliably on the
command line using openssl if that is available in your
working environment:
$ echo -n 'API-KEY' | openssl base64
Y2Q1YmM5NjAzYmMyM2EyZDk3YmViMTxyzajc1ZjdiMTE=
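If openssl is not available, the same base64 encoding can be done with a few lines of Python (API-KEY is a placeholder for a real key):

```python
import base64

api_key = "API-KEY"  # placeholder; substitute a real Datadog API key
encoded = base64.b64encode(api_key.encode("utf-8")).decode("ascii")
print(encoded)  # QVBJLUtFWQ==
```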
9. Add the following environment variables to the container section for the Agent:
env:
  - name: DD_KUBELET_TLS_VERIFY
    value: "false"
  - name: DD_SITE
    value: "datadoghq.com"
  - name: DD_LOGS_ENABLED
    value: "true"
  - name: DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL
    value: "true"
  - name: DD_LOGS_CONFIG_K8S_CONTAINER_USE_FILE
    value: "true"
The Kubernetes resources that have been created in the previous steps could be looked up
and verified using Kubernetes Dashboard as well. For example, if all the preceding steps
were successful, you will be able to see the datadog-agent Pod listed with the status as
Running, as in the following screenshot:
By clicking on the container of interest, logs and metrics related to that container can be
viewed on the Containers dashboard as you learned in the previous section, including
the access available to the Log Explorer dashboard.
The option to view Kubernetes platform resources such as Pods, Deployments,
ReplicaSets, Services, and Nodes is similar to that available on Kubernetes Dashboard.
However, having those resources tracked by Datadog provides the option to monitor
those resources using Datadog too.
We have already learned how Datadog can be used to monitor live containers with
the help of integrations with Docker and Kubernetes. In the next section, you will learn
more about that feature in the context of Kubernetes.
• The Datadog Agent container should have the following environment variable
defined:
- name: DD_ORCHESTRATOR_EXPLORER_ENABLED
  value: "true"
• The Datadog Agent ClusterRole should have permissions set for Live
Containers to collect information on Kubernetes resources. That requirement was
discussed in the previous section, when setting up the clusterrole.yaml file.
• The process agent container must be run with the following environment
variables set:
- name: DD_ORCHESTRATOR_EXPLORER_ENABLED
  value: "true"
- name: DD_ORCHESTRATOR_CLUSTER_ID
  valueFrom:
    configMapKeyRef:
      name: datadog-cluster-id
      key: id
- name: DD_CLUSTER_NAME
  value: "<YOUR_CLUSTER_NAME>"
The cluster name can be obtained from the Kubernetes configuration file.
The Live Tail option is available on the Containers dashboard as in the following
screenshot:
Figure 12.7 – The Live Tail option to view logs in real time
Log Explorer also has a Live Tail option available as shown in the following screenshot:
Best practices
The following are some of the best practices to be followed while monitoring containers
using Datadog:
• Run the Datadog Agent as a container for the easy discovery of application
containers and for flexibility. Even if it seems the Datadog Agent has to be run at the
host level for some reason, running it as a container on the same host might still be
acceptable, considering the operational benefits it brings.
• In a Kubernetes environment, don't try to access container logs directly via Docker
integration; instead, install the Datadog Agent on the Kubernetes cluster and
configure it to collect logs.
• Though kubectl and Kubernetes Dashboard can be used to view Kubernetes
cluster resources, making those available in Datadog will help to increase the
visibility of their availability and health.
Summary
In this chapter, you have learned how containers can be monitored using Datadog with
the help of integrations available for Docker and Kubernetes. You have also learned how
to search the container information and container logs once such information is collected
by Datadog. After reading this chapter and trying out the samples provided in it, you are
prepared to use Datadog to monitor containers that run in both Docker and Kubernetes
environments.
Log aggregation, indexing, and search is a major area in monitoring, with major product
offerings in the industry such as the ELK Stack, Splunk, and Sumo Logic.
Datadog also offers a solution, and you will learn about it in the next chapter.
13
Managing Logs Using
Datadog
The logs generated by the operating system, the various platform components, and the
application services contain a lot of information regarding the state of the infrastructure
as well as the workings of the applications running on it. Managing all logs at a central
repository and analyzing that for operational insights and monitoring purposes is
an important area in monitoring. It usually involves the collection, aggregation, and
indexing of logs. In Chapter 1, Introduction to Monitoring, this monitoring type was briefly
discussed. In Chapter 12, Monitoring Containers, you learned how logs from containers are
published to Datadog for aggregation and indexing for facilitating searches.
Some of the popular monitoring product offerings in this area are ELK Stack (Elasticsearch,
Logstash, and Kibana), Splunk, and Sumo Logic. Now, Datadog also provides this feature
and you have seen Log Explorer, a frontend to that feature, in the last chapter.
In this chapter, we will explore Datadog's log aggregation and indexing feature in detail.
Specifically, we will look at the following topics:
• Collecting logs
• Processing logs
• Archiving logs
• Searching logs
Technical requirements
To try out the examples mentioned in this chapter, you need to have the following tools
installed and resources available:
Collecting logs
The first step in any log management application is to collect the logs in a common storage
repository for analyzing them later and archiving them for the records. That effort involves
shipping the log files from machines and services where they are available to the common
storage repository.
The following diagram provides the workflow of collecting and processing the logs and
rendering the aggregated information to end users. The aggregated information could be
published as metrics, which could be used for setting up monitors. That is the same as
using metrics to set up monitors in a conventional monitoring application:
• Public cloud services: Public cloud services such as AWS S3 and RDS are very
popular, especially if the production infrastructure is built predominantly using
public cloud services. The logs from those services can be shipped into Datadog
using the related Datadog integrations available.
• Microservices: The Datadog integrations available for Docker and Kubernetes can
be used to ship these logs into Datadog. We have already seen examples of doing
that in the last chapter.
• Hosts: Traditionally, logs are available as files on a disk location on the host, bare-
metal or virtual. A Datadog Agent can be configured to ship local log files to the
Datadog backend.
Now, let's see the details of how logs are collected and shipped by Datadog, in the most
common use cases that are summarized above.
• CloudFormation and Kinesis Firehose-based options are available to roll out the
automation to collect logs from AWS services.
• On Azure, the automation to collect logs is based on Azure Event Hub and
Azure Function.
• Datadog integration is available on GCP to collect logs from the services offered on
that platform.
Instrumenting a Docker image might not always be possible, as some images could be
supplied by third parties, or making such changes might not be viable operationally. In
such scenarios, the environment variable DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL
can be set in the Datadog Agent runtime environment to collect logs from all
the containers running on that host. If any logs need to be excluded from aggregation,
Datadog has options to filter them out before shipping them for processing, and we
will look at that later in this section.
The configuration required to ship logs from services running in a Kubernetes cluster
to the Datadog backend is similar to how that is done on a Docker host:
Next, let's look at what configuration is needed for collecting logs when the Datadog
Agent is run at the host level.
- type: file
  service: webapp
  path: /var/log/nginx/access.log
  source: nginx
- type: file
  service: webapp
  path: /var/log/nginx/error.log
  source: nginx
The Datadog Agent must be restarted in order for these configuration changes to
take effect.
Filtering logs
As you have seen already, it's pretty easy to configure the Datadog Agent or integration to
collect logs from any environment. However, shipping all the information tracked in the
logs to Datadog might not be a good idea for a variety of reasons, including the following:
• Security concerns
• Restrictions imposed by agreements with customers
• Regulatory requirements
• Compliance controls in place
Therefore, it may be necessary to filter out information that is not useful for monitoring
or that is not allowed to be shared with a wider audience. Such filtering can also minimize
storage utilization and data transfer to the Datadog backend.
Two predefined rule types, include_at_match and exclude_at_match, can be used to
filter logs at the collection phase. These rules work with a regular expression: if a log
entry matches the regular expression used with the rule, that entry is included or
excluded, depending on the type of rule.
In the following example, log entries starting with the string k8s-log are ignored:
logs:
  - type: file
    path: "</PATH/TO/LOGFILE>"
    service: "webapp"
    source: "nginx"
    log_processing_rules:
      - type: exclude_at_match
        name: exclude_k8s_nginx_log
        pattern: ^k8s-log
Multiple include_at_match rules are allowed for the same log file, and they result
in an AND condition. This means that all of the rules must be satisfied in order for
a log entry to be collected.
To implement an OR rule, the conditions must be specified in the same expression using
the | symbol, as in the following example:
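Such a rule might look like the following sketch (the rule name is illustrative; check the log collection documentation for the exact syntax):

```yaml
log_processing_rules:
  - type: include_at_match
    name: include_warning_or_error
    pattern: ^(warning|err)
```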
This will result in any line beginning with warning or err being collected by the
Datadog Agent.
As you have seen earlier, the filtering rules can be implemented in the Docker runtime
using labels, and in Kubernetes using annotations.
log_processing_rules:
  - type: mask_sequences
    pattern: "^(User:.*), SSN:(.*), (.*)$"
    replace_placeholder: "${1}, SSN:[redacted], ${3}"
The preceding sample rule will replace the field containing SSN information in a log entry
with the string SSN:[redacted]. This is done by splitting the log entry into three
parts and reassembling it using the format mentioned in replace_placeholder. The
content matched within each () construct in pattern would be available as a numbered
variable in replace_placeholder.
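The substitution behaves like an ordinary regular-expression replacement. The following Python sketch reproduces the effect of the rule above on a sample log entry (the log format and SSN value are made up for illustration):

```python
import re

pattern = r"^(User:.*), SSN:(.*), (.*)$"
replacement = r"\1, SSN:[redacted], \3"  # Python's \1 plays the role of ${1}

line = "User:jdoe, SSN:123-45-6789, city:San Francisco"
print(re.sub(pattern, replacement, line))
# User:jdoe, SSN:[redacted], city:San Francisco
```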
As mentioned earlier, these rules have to be added to the datadog.yaml file at the
host level and similar rules can be implemented in Docker runtime using labels, and
in a Kubernetes cluster using annotations.
You have learned how Datadog collects logs from different environments and stores
that centrally for processing to derive operational insights in the form of metrics and to
generate search indexes. In the next section, we will see how the logs are processed by
Datadog and discuss the resources involved in that process.
Processing logs
Once the logs are collected by Datadog, they will be available on Log Explorer to view and
search. The logs in the structured JSON format are processed by Datadog automatically.
The unstructured logs can be processed further and analytical insights extracted. Datadog
uses Pipelines and Processors to process the incoming logs.
To view the pipelines available, open the Logs | Configuration menu option from the
main menu. The pipelines are listed under the Pipelines tab, as shown in the following
sample screenshot:
In the following screenshot, the processors used in the Apache httpd sample integration
pipeline are listed:
{
  "http": {
    "status_code": 200,
    "auth": "frank",
    "method": "GET",
    "url": "/apache_pb.gif",
    "version": "1.0"
  },
  "network": {
    "bytes_written": 2326,
    "client": {
      "ip": "127.0.0.1"
    }
  },
  "date_access": 1468407336000
}
All the predefined pipelines can be looked up using the Browse Pipeline Library link on
the Pipelines tab in the main Log dashboard, as shown in Figure 13.2.
Log-based metrics can be generated based on a query. To set a new metric, navigate to
Logs | Generate Metrics | New Metric. The main part is to provide a query that will
define the new metric well. A sample screenshot of the Generate Metric window is
provided as follows:
Archiving logs
Having logs at a central location is itself a significant advantage for a business, as access to
the collected logs is simplified and logs from multiple sources can easily be correlated and
analyzed. For monitoring and reporting on aggregated information, recent logs are good
enough, and there is no need to retain old logs for that purpose. However, for compliance
purposes and future audits, businesses may need to retain logs for longer periods. As old,
raw logs are not needed for active use, they can be archived, with the option to retrieve
them on demand.
Datadog provides archival options with public cloud storage services as the backend
storage infrastructure. To set up an archive for a subset of logs collected by Datadog, the
general steps are as follows:
• Set up an integration with a cloud service: This step requires setting up an integration
with a public cloud storage service: AWS S3, Azure Storage, or Google Cloud
Storage.
• Create a storage bucket: This storage bucket will store the logs.
• Set permissions: Set permissions on the storage bucket so that Datadog can store
and access the logs there.
• Route logs to the bucket: In this step, create a new archive in Datadog and point it
to the new storage bucket set up in the previous step. This option is available under
the Logs | Archives tab.
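For example, with AWS S3, the permissions step typically amounts to granting the role used by the Datadog AWS integration access to the bucket. The following IAM policy is a sketch; the bucket name is a placeholder, and the exact set of actions required should be taken from the official archive documentation:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DatadogUploadAndRehydrateLogArchives",
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-datadog-log-archive",
        "arn:aws:s3:::my-datadog-log-archive/*"
      ]
    }
  ]
}
```

With this policy attached to the integration role, Datadog can both write archived logs to the bucket and read them back during rehydration.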
These steps differ considerably depending on the public cloud storage service used for
storing the archives, and those details can be found in the official documentation available
at https://docs.datadoghq.com/logs/archives.
When the archived logs need to be loaded back into Datadog, usually for an audit or root
cause analysis (such events are rare in real life), the Rehydrate from Archives option can
be used. Navigate to Logs | Rehydrate from Archives.
In the next section, we will explore how the logs collected by Datadog can be searched,
an important feature that is popular with operations teams.
Searching logs
To search the logs, navigate to Logs | Search and the search window should look like the
sample interface in the following screenshot:
Built-in keywords such as host or source can be used as a search term by using the
autocomplete option in the search field. You just need to click in the search field to
see all the terms available to use, as shown in the following screenshot:
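A few sample queries illustrate the common search syntax; the facet names shown here, such as @http.status_code, depend on how your logs are parsed and are illustrative:

```
source:nginx status:error                 Error-status logs from the nginx integration
host:web-01 service:checkout              Logs from one service on a specific host
@http.status_code:[500 TO 599]            Range query on a parsed log attribute
service:checkout -status:info             Exclude a term with the minus operator
```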
The saved searches can be listed and rerun using the Views link, as shown in the
same screenshot, in Figure 13.7.
You have learned how to search the logs aggregated by Datadog using the Search
interface on the Log Explorer dashboard, with directions on how to use the keywords
and operators. This concludes the chapter, so let's now look at the best practices
and a summary of log management in Datadog.
Best practices
There are certain patterns of best practices in log management. Let's see how they can be
rolled out using related Datadog features:
• Plan to collect as many logs as possible. It is better to stop collecting some of the
logs later if they turn out not to be useful.
• Where possible, especially with application logs whose formatting you control,
make the format of the logs parsing-friendly.
• Consider generating new logs specifically so that metrics can be derived from
them. Such efforts have been found to be very useful in generating data for reporting.
• Make sure that sensitive information in logs is redacted before allowing Datadog
to collect them.
• Implement the redaction of sensitive information from the logs instead of filtering
out those log entries. However, do filter out log entries that are not useful, so that the
volume of logs handled by Datadog is kept to a minimum.
• Create a library of searches and publish it for general use. Complex queries are
hard to create, and sharing them makes the team more efficient.
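The redaction and filtering practices above can be implemented at collection time in the Agent's log configuration. The following conf.d snippet is a sketch: the file path, service name, and patterns are illustrative, while the rule types (mask_sequences and exclude_at_match) are part of the Agent's log_processing_rules feature:

```yaml
logs:
  - type: file
    path: /var/log/myapp/app.log
    service: myapp
    source: python
    log_processing_rules:
      # Redact anything that looks like a credit card number
      - type: mask_sequences
        name: mask_credit_cards
        replace_placeholder: "[REDACTED]"
        pattern: (?:\d[ -]*?){13,16}
      # Drop noisy health-check entries entirely
      - type: exclude_at_match
        name: drop_health_checks
        pattern: GET /healthz
```

Redaction happens before the logs leave the host, so sensitive values never reach the Datadog backend, and excluded entries reduce the ingested log volume.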
Summary
In this chapter, you have learned how Datadog collects logs, processes them, and provides
access to various kinds of aggregated information as well as the raw logs. Datadog also facilitates
archiving of the logs with support for public cloud storage services. Using Log Explorer,
you can search the entire set of active logs and valid searches can be saved for future use.
In the next chapter, the final chapter of this book, we will discuss a number of advanced
Datadog features that we haven't touched on yet.
14
Miscellaneous Monitoring Topics
The core monitoring features, as they are implemented in Datadog, have been discussed
up to this point in the book. In this chapter, you will learn about some of the monitoring
features that have become available on the Datadog monitoring platform relatively
recently. These features, especially Application Performance Monitoring (APM), security
monitoring, and synthetic monitoring, are usually addressed by dedicated applications.
AppDynamics in APM, various Security Information and Event Management (SIEM)
applications in security monitoring, and Catchpoint in synthetic monitoring are examples
of dedicated monitoring applications in the respective areas. With these features available
on the Datadog platform, it is becoming a one-stop destination for all the monitoring
requirements.
In this chapter, you will learn about the following topics:
Technical requirements
To try out the examples mentioned in this book, you need to have the following tools
installed and resources available:
• An Ubuntu 18.04 Linux environment with Bash shell. Other Linux distributions can
be used, but make suitable changes to any Ubuntu-specific commands.
• A Datadog account and user with admin-level access.
• A Datadog Agent running, at the host level or as a microservice depending on the
example, pointing to the Datadog account.
• curl and wget.
As you can see, these are broad areas, and every APM solution has its own way of
implementing these features and more. You will also learn that observability and synthetic
monitoring, two topics that will be discussed in dedicated sections later, are also related
to APM. In the remainder of this section, we will see what APM features are available
in Datadog and try to relate those to the broad categories mentioned in the preceding list,
as much as possible.
2. Download the Java library for Datadog tracing using wget or curl:
$ wget -O dd-java-agent.jar https://dtdg.co/latest-java-tracer
The sample commands are based on the assumption that the user is in the directory
where the Cassandra installation was extracted, such as /home/ubuntu/apache-cassandra-3.11.10.
Make suitable changes to the paths to match the actual installation.
The preceding sample steps outline how tracing is enabled for a Java application
that runs at the host level. Similar steps are followed for instrumenting services
running in the Docker and Kubernetes runtime environments, but with differences specific
to the related platform. Also, note that the steps are specific to the programming language
used to build the application. All those permutations are covered in the official
documentation, starting at https://docs.datadoghq.com/tracing/.
Once tracing is enabled for the applications as outlined above, you can view the traces on the
APM dashboard, as in the following screenshot, by navigating to APM | Traces:
Profiling an application
Using the Continuous Profiler feature of Datadog APM, resource usage and I/O
bottlenecks can be traced to the application code by drilling down to the class, method,
and line number. Instrumentation is also required to enable this feature, so let's try
that with the Cassandra application.
The steps to instrument an application for profiling are similar to those used for
tracing. In fact, the latest tracing agent also supports profiling, and both features
can be enabled using the following command line:
$ export JVM_OPTS="-javaagent:/home/ubuntu/dd-java-agent.jar -Ddd.service=cassandra -Ddd.profiling.enabled=true"
$ bin/cassandra
As you can observe in the command line, the system property dd.profiling.enabled
is set to true to enable the profiling feature, in addition to tracing.
The profiling-related reports can be viewed on the Continuous Profiler dashboard by
navigating to APM | Profiles, as shown in the following screenshot:
On the preceding dashboard, under the Performance, Analysis, Metrics, and Runtime
Info tabs, a large amount of runtime information regarding the application is available
for triaging performance issues and fine-tuning application performance and security
in general.
Service Map
A Service Map provides a pictorial representation of the services running in a runtime
environment, such as a host or Kubernetes cluster, with the interaction between the services
mapped out. The Service Map can be accessed by navigating to APM | Service Map and the
dashboard will be rendered with the Services tab open, as in the following screenshot:
Implementing observability
Observability refers to the processes and methods involved in making the working of
an application system more transparent and measurable. Increased observability of an
application system will make it more monitoring-friendly. Observability is a property of
the application system itself, while monitoring is an act that leverages that property for
operational requirements.
Observability is relatively new to the monitoring vocabulary, but it has been repurposed
from system control theory. The concept of observability was introduced by Hungarian
American engineer Rudolf E. Kálmán for linear dynamic systems, and it states that
observability is a measure of how well internal states of a system can be inferred from
knowledge of the system's external outputs. In the context of monitoring, the external
outputs could be various metrics, logs, and traces. So, a monitoring tool with observability
features should help with generating and analyzing various outputs related to making the
working of an application system more transparent, implemented using a set of methods,
processes, and dashboards for analysis. Such features would help with increasing the
observability of the application system that the monitoring tool monitors.
While it's obvious that adding observability is important for better monitoring, you may
be wondering why observability is a relatively new term in the monitoring vocabulary. One of the
reasons is that modern application systems and the infrastructure they run on are far more
complex now. Gone are the days when monolithic applications ran on a few bare-metal
machines in a data center. The applications are highly distributed, they run on public
clouds, and are managed by complex orchestration tools such as Kubernetes and Spinnaker.
While the modern application systems are cutting-edge and far more flexible and scalable,
the increased complexity of application systems reduces the overall observability. In such
a scenario, deliberate steps need to be taken to improve observability, and that's why
commercial monitoring tools are now shipped with such features.
Metrics, logs, and traces are generally considered the three pillars of observability.
Throughout this book, you have seen that Datadog features center around metrics and
tags. Datadog monitoring features generate a variety of metrics out of the box. Datadog
also provides options to create custom metrics and tags that will add more visibility to the
working of an application or an infrastructure component. As seen in Chapter 13, Managing
Logs Using Datadog, Datadog's log management features are comprehensive in terms of
managing a variety of logs, including those from the microservices platforms such as Docker
and Kubernetes. Earlier in this chapter, in the section on APM, you have seen how Datadog
could be used to generate, collect, and analyze traces from the applications.
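As a small illustration of the metrics pillar, an application can push a custom metric to the local Agent's DogStatsD endpoint using a plain-text datagram in the metric.name:value|type|#tags format. The metric name, value, and tags below are made up for the example:

```shell
# Build a DogStatsD datagram: a gauge named app.cart.size with value 12 and two tags
datagram="app.cart.size:12|g|#env:dev,service:checkout"
echo "$datagram"

# With a Datadog Agent listening on UDP port 8125, it could be sent like this:
# echo -n "$datagram" | nc -u -w1 127.0.0.1 8125
```

Because DogStatsD speaks this simple text protocol over UDP, instrumenting an application for custom metrics needs nothing more than the ability to format and send such datagrams, either directly or via a client library.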
While Datadog has all the nuts and bolts necessary for implementing observability in an
application, it's largely a custom effort that must be done for every application system.
The application onboarding and deployment processes must be enhanced to include
the instrumentation steps necessary for adding observability. Let's see which of those
instrumentations are required in Datadog:
It's evident that the ability of Datadog to consolidate all three pillars of observability
– metrics, logs, and traces – is a major strength of that platform. To make use of the
related out-of-the-box features, the application build and deployment processes must
be enhanced and tooled so logs and traces are published to the Datadog backend
automatically and seamlessly.
In the next section, we will look at synthetic monitoring, which refers to the testing of an
application live in production using requests and actions that simulate the user experience.
Synthetic monitoring
In synthetic monitoring, the utilization of an application is simulated, typically by a
robotic user, and the data collected from such simulations forms the basis of actionable
steps, such as triggering an alert on an application performance attribute or the availability
of the application itself. Generally, the following are the application states that can be
monitored by using synthetic monitoring tools and methods:
• The application is available in all respects. This might involve checking multiple
web or API endpoints of the application.
• The application performs well in terms of how quickly it responds to user requests.
• The application can execute the business transactions as designed.
• The third-party components used in building the application system are functioning.
• The desired performance of the application is achieved cost-effectively and within
budget.
Most of these aspects are related to measuring user experience in terms of a set of metrics,
and those metric values are generated by the robotic usage of the application. The robotic
access could be local to where the application is hosted, in a data center or public cloud, or
close to where actual users are located. The latter aspect of monitoring is known as last-mile
monitoring, one of the types of monitoring that we discussed in Chapter 1, Introduction to
Monitoring. There are dedicated SaaS monitoring solutions that are available for addressing
last-mile monitoring requirements. With Datadog's synthetic monitoring features, it is also
possible to address typical last-mile monitoring requirements.
The synthetic monitoring feature of Datadog provides a variety of checks that will
simulate the end user experience. The tests are launched from Datadog-managed locations
around the world, and you have the option to configure from where such tests originate.
The following are the categories of checks that can be performed using Datadog's synthetic
monitoring:
• DNS: Checks whether domains are resolved and how quickly they are resolved in
regions where users are located
• ICMP: Pings hosts that have ICMP enabled to check their availability
and access from where the users are located
• TCP: Verifies access to service ports, such as HTTP (80), HTTPS (443), and SSH
(22), and any custom port
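Tests like these can also be managed through the Synthetics API rather than the UI. As a sketch, the body of a TCP API test might look like the following; the host, locations, and threshold are placeholders, and the exact field names should be checked against the Synthetics API reference:

```
{
  "name": "SFTP port reachable",
  "type": "api",
  "subtype": "tcp",
  "config": {
    "request": { "host": "sftp.example.com", "port": 22 },
    "assertions": [
      { "type": "responseTime", "operator": "lessThan", "target": 2000 }
    ]
  },
  "locations": ["aws:us-east-1", "aws:eu-west-1"],
  "options": { "tick_every": 300 }
}
```

Managing tests as payloads like this makes it practical to version-control synthetic checks alongside the application code they exercise.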
Now, let's see how some of these checks can be configured in Datadog. The synthetic
monitoring-related options are accessible from the main menu, UX Monitoring. Navigate
to UX Monitoring | New Test and then click on the Get Started link, and you will be
presented with the options as shown in the following screenshot:
Click on New API Test and select the TCP tab to get to the form where a TCP test can
be configured. The first part of the form is shown in the following screenshot, where the
name of the test and the information about the SFTP server can be provided:
Using the Test URL link on the form, basic access to the server on port 22 can be
verified. Access from specific regions can then be added to the test, as shown in the
following screenshot:
The other types of API tests – HTTP, DNS, SSL, and ICMP – can be configured by
following similar steps, selecting the related tab on the New API Test form.
In a real-life scenario, there will be multiple tests such as this configured to verify that
various components and workflows of an application are available to the end user in the
regions where they are located. The availability status dashboard will look like the one
in the following screenshot, and it can be accessed by navigating to UX Monitoring |
Synthetic Tests:
The overall uptime status will be presented on this dashboard and the results can be
filtered using different conditions, such as the region of test origination. The individual
tests are listed at the bottom of this dashboard and by clicking a specific item, the details
of that test can be viewed and updated, as shown in the following screenshot:
Figure 14.10 – Synthetic test simulating browsers and devices used for access
To complete the creation of this test, the browsers, devices, and locations need to be
selected. Depending on the browser that you are using to access the Datadog dashboard,
a Datadog-supplied browser plugin has to be installed as well for recording the website
workflow that needs to be simulated by the test. You will get an option to record the
workflow and save it as part of creating the test.
Once the test results are in, you can view them on the dashboard, as in the following
sample screenshot:
Security monitoring
Cybersecurity is a far more important area to cover in the cloud environment
because the application needs to be accessed via the internet and, in most cases, the
application itself is hosted in a public cloud. Relying on the security of running an
application in your own data center and making it accessible only on your private network
is no longer an option. The infrastructure and applications that are exposed to external
attacks should be protected and hardened, and there is an ecosystem of software applications
and services addressing a plethora of cybersecurity issues.
Another aspect is the requirement to meet security, privacy, and compliance
standards for doing business, especially if the application caters to the healthcare
or financial industries. Compliance requirements are dictated by laws applicable in
the jurisdiction of doing business, and security standards are demanded by customers.
Compliance requirements are usually audited by third-party service providers in that space,
and those requirements must be monitored and recorded as evidence for the auditors.
Datadog's Security and Compliance Monitoring feature has multiple options that can
be rolled out in an organization to address common cybersecurity and compliance
requirements. Also, Datadog enjoys the advantage of being a unified platform for various
monitoring types. By combining the analysis of logs and traces that are sourced from the
infrastructure and applications, and the powerful monitoring features such as alerting
and event management, Datadog can also be configured as a SIEM platform. Having
a SIEM tool is usually a requirement to demonstrate that an organization has a sound
cybersecurity practice.
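Detection rules in the security product are driven by the same log search syntax seen earlier in this chapter. For example, a rule flagging failed console logins in AWS might be built around a query such as the following; the facet names assume the CloudTrail integration and are illustrative:

```
source:cloudtrail @evt.name:ConsoleLogin @responseElements.ConsoleLogin:Failure
```

Combined with a threshold (for instance, several matches from the same IP within a few minutes), such a query becomes an actionable security signal rather than a one-off search.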
Now, let's review the security features currently available on the Datadog platform and
look at the general steps involved in starting to use those features. The security options
can be accessed from the main Security menu, as in the following screenshot:
Now, let's look at how information is fed into Datadog for security analysis and use the
insights for hardening the infrastructure and applications.
The integration methods are specific to each product, and by selecting a product listed on
the preceding dashboard, the installation procedure can be viewed.
In this section, we took a general overview of the security features available on the
Datadog platform. It's an area that is still a work in progress, but the general direction it has
taken is very encouraging in terms of the usability of the available features and the simplicity
of integration with sources of security information. In the next section, let's look at the best
practices related to the topics that have been discussed in this chapter.
Best practices
You have learned about advanced monitoring, APM, and the security features available on
the Datadog platform, so now let's look at the related best practices in those areas:
• The instrumentation for generating application traces and profiling for APM must
be incorporated in the build and deployment process.
• Define a complete set of application metrics for each service and expose those for
easy consumption by Datadog.
• Plan to collect all application logs and, if needed, define new ones so that the
application state can be observed easily.
• Determine the geographical locations of the users of the application for fine-tuning
synthetic tests.
• Publish the list of supported devices and browsers. Based on the information
available in synthetic monitoring reports, fine-tune the application for compatibility
with access devices and browsers.
• For effective security monitoring, define custom security rules that are relevant
to the organization. Disable out-of-the-box rules that might generate spurious
messages.
Summary
In this chapter, you have learned about some of the monitoring features that are relatively
new on the Datadog platform and that continue to evolve. Observability and
synthetic monitoring were discussed along with APM, but the related tools and concepts
are generic enough to be applicable in a wider monitoring context.
With this chapter, the book is concluded and the recommended next step for you is to roll
out Datadog in your environment to acquire expertise. With features available to cover
almost every monitoring type, such as infrastructure monitoring, log aggregation and
indexing, last-mile monitoring, APM, and security monitoring, Datadog is one of the most
comprehensive monitoring platforms available on the market. One of its major attractions
is its ability to unify monitoring with the help of a variety of monitoring features available
on the platform and its ability to correlate information across products.
We wish you good luck with rolling out proactive monitoring using Datadog, which is an
excellent choice for this purpose.