
Monitoring and Observability for Data Platform

Prepared by Albert Lewandowski, DevOps @GetInData


Table of contents

Monitoring and observability in practice
Observability foundations
Site Reliability Engineering
Business value delivered by monitoring solution
Real Life Scenarios - Use cases
Observing batch applications
Monitor quality of processes
Continuously processing platform in real-time
Observing services and infrastructure
Real Life Scenarios - infrastructure
Kubernetes
Real Life Scenarios - services
Core Hadoop components monitoring
Flink Jobs Monitoring
Hive Monitoring
Kafka Monitoring
Spark Jobs Monitoring
SQL Databases Monitoring
Solution architecture
Monitoring and observability
Prometheus High Availability setup
Prometheus with Cortex
Log analytics
Applications log analytics
Business logs analytics
How to implement a monitoring platform for observability?
About GetInData

Monitoring and observability in practice

We need to start by describing what monitoring and observability of a platform mean, because there are a lot of different definitions. Monitoring describes the process of gathering metrics about the IT environment and running applications, and observing system performance. Observability, in turn, is about measuring how well the internal states of a system can be inferred from knowledge of its external outputs (a definition borrowed from control theory). It may sound a bit abstract due to its mathematical origin, but it translates easily into IT use cases.

We can start by considering the use case of a large and complex Hadoop cluster that consists of several dozen nodes. Using the available monitoring tools incorrectly, by analyzing too many data points, would cause unnecessary alerts and false positives, so we would not have a clear view of the situation. We can call this low observability of the infrastructure. If we want to achieve high observability, we need to provide well-matched metrics and correctly set up alerts. The goal is to deliver information about the current status of each component.

A good example could be even a simple data processing job written in Spark or Flink that rewrites data from location A to B. Gathering its metrics and setting up alerts, or creating a dashboard with a simple runtime visualization, are quite simple tasks. However, to achieve observability we should collect metrics about the amount of processed data, JVM statistics and some metrics about the infrastructure under the hood. Real-life data pipelines are more complex, so we suggest thinking about observing each part of the system.

The biggest challenge in making a system observable is that the effort grows with the system's complexity. Higher complexity means more interactions between components and more logs and metrics to analyse. It requires experience with handling such challenges in the Big Data world and understanding which indicators should be observed in each situation.

Observability is quite similar to DevOps. It is not limited to technology; it also covers organizational culture and approach. Besides, the concept of observability is prominent in the DevOps approach because it states that monitoring goals are not limited to collecting and processing logs and metrics. The system should deliver information about its state so that it becomes observable. That is what we call observability. A good synonym would be "understandable for users".

However, what is the difference between monitoring and observability? Monitoring is about obtaining information from applications and machines (metrics and logs), while observability describes the state of the platform and should help in meeting all business and technical requirements such as the SLA (Service Level Agreement - particular aspects of the service, such as quality, availability and responsibilities, agreed between the service provider and the service user). Monitoring is, in fact, a part of observability.

Observability foundations


Observability requires three things to get started. The first is metrics, because we need to monitor IT infrastructure and services. Data points can be counters, gauges or any other metric type. The second relies on log data, which tells the story of a process and becomes necessary in a DevOps-oriented environment. The last one is about capturing how users engage with our systems and which actions can cause errors or bottlenecks.

Site Reliability Engineering

If you think of DevOps like an interface in a programming language, class SRE implements DevOps.

When we talk about monitoring and observability, it is important to mention SRE (Site Reliability Engineering). SRE is the discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems. Reducing organizational silos and measuring everything are just some of the many practices promoted by SRE that help in implementing a complex monitoring system.

SRE shares ownership with developers by using the same tools and techniques across the stack, and has a formula for balancing accidents and failures against new releases. Moreover, it encourages "automating this year's job away" and minimizing manual systems work to focus on efforts that bring long-term value to the system. It shows that operations is a software problem, and defines prescriptive ways for measuring availability, uptime, outages, toil, etc.

Business value delivered by monitoring solution

Monitoring and observing a data platform is necessary if we want to have continuously working processes with reduced downtime. An observable environment means easier configuration and management of IT infrastructure, because we understand how each component works and how they interact with each other.

An understanding of running processes helps in delivering business value. Issues can be prevented and solved before something goes wrong, which means avoiding downtimes and reducing the costs of such incidents.

The SLA is a contract in which the service provider promises customers a certain level of service availability and performance. The SLO is a goal that the service provider wants to reach. The SLI is a measurement the service provider uses to track the goal.

There are multiple benefits of implementing a complex monitoring solution:

• Avoid downtimes of the platform - prevent issues and react immediately if a triggered action is not enough.
• Meet business and technical requirements, and validate the SLA, SLO and SLI directly in the monitoring platform - monitor all indicators and check that everything is OK.
• Observability into service health - on-call operators can easily find the context when they need it.
• Knowledge about dependencies between different components and applications - find the root cause of any issue quickly, analyse it in a post-mortem and apply enhancements.
• Resilient platform with self-sustaining triggers - add triggers based on metrics that can run tasks to keep the platform working continuously.

Real Life Scenarios - Use cases

Observing batch applications


Have a flexible solution that takes care of monitoring and observing any application that you submit to the cluster.

Technical background:
• On-premise Hadoop cluster
• Data Engineers run Spark batch jobs
• Data Analysts use Hive queries

Spark jobs expose metrics directly to a module that enables Prometheus to pull them (a minimal push-based sketch for short-lived batch jobs follows below). Developers are interested in checking job log files for any outages, what kinds of errors appear and whether a given search phrase is present. Analysts would like to have a stable tool and run Hive queries that can take a long time to process. We need to make sure there are enough resources and that a query does not interfere with other applications.
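Since batch jobs are short-lived, one common pattern is to push a few summary metrics to a Prometheus Pushgateway when the job finishes and let Prometheus scrape the gateway. The sketch below is a minimal illustration of that pattern in Python with the prometheus_client library; the Pushgateway address, job name and metric names are assumptions, not part of the original setup.

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def report_batch_metrics(records_written: int, duration_s: float) -> None:
    """Push summary metrics of a finished batch job to a Pushgateway."""
    registry = CollectorRegistry()
    Gauge("batch_records_written", "Records written by the batch job",
          registry=registry).set(records_written)
    Gauge("batch_duration_seconds", "Wall-clock duration of the batch job",
          registry=registry).set(duration_s)
    # The job name becomes the Pushgateway grouping key, so different jobs stay separate.
    push_to_gateway("pushgateway:9091", job="spark_copy_a_to_b", registry=registry)

# Example usage at the end of a Spark batch job:
# report_batch_metrics(records_written=output_df.count(), duration_s=runtime_seconds)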

Monitor quality of processes


Be sure that all ETL pipelines work as expected and that you will not lose data.

Technical background:
• Platform built in the public cloud with S3 storage
• A great number of Hive queries
• ETL pipelines are necessary for the next steps of the processing layer

We can expose results from Hive queries to validate the quality of the data by using a custom Prometheus exporter, visualize them in Grafana and send alerts if there are any issues. We should also monitor logs to detect warnings and errors, to prevent downtimes in the next steps of the pipeline.
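A custom exporter of this kind can be very small. The sketch below, using the prometheus_client library, exposes a data-quality gauge over HTTP for Prometheus to scrape. The count_invalid_rows helper, table names and port are placeholders; in a real setup the validation query would run against Hive (for example via PyHive).

import time
from prometheus_client import Gauge, start_http_server

INVALID_ROWS = Gauge("hive_invalid_rows", "Rows failing a data-quality rule", ["table"])

def count_invalid_rows(table: str) -> int:
    # Placeholder: replace with a real validation query against Hive,
    # e.g. SELECT count(*) FROM <table> WHERE <quality rule is violated>.
    return 0

if __name__ == "__main__":
    start_http_server(8000)              # Prometheus scrapes http://<host>:8000/metrics
    while True:
        for table in ("events", "orders"):
            INVALID_ROWS.labels(table=table).set(count_invalid_rows(table))
        time.sleep(300)                  # re-run the validation every 5 minutes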

Continuously processing platform in real-time

Analyze real-time streams to prevent downtimes and make everything fully automated.

Technical background:
• Containerized environment on Kubernetes
• Flink jobs processing several hundred thousand events per second
• Events are consumed from Kafka topics

The first layer is focused on Kubernetes and its components. It is necessary to observe the resources and condition of the cluster, because we must detect issues to prevent downtimes. The second layer relies on the metrics. Targets need to be discovered, which can be done with Prometheus's built-in service discovery, so we do not have to worry about missing any new applications. The third layer consists of log files that need to be scraped from all pods and labeled correctly to enable easy searching for the exact pod. Flink jobs are restarted in case of any failure, while alerts are sent in case of any lag in Kafka, which requires human action (a minimal sketch of such a lag check follows below).
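In production this kind of check is usually expressed as a Prometheus alerting rule handled by Alertmanager, but the logic is easy to show in a few lines of Python against the Prometheus HTTP API. The metric name below comes from a typical Kafka exporter and the threshold is arbitrary, so treat both as assumptions.

import requests

PROMETHEUS = "http://prometheus:9090"
# The metric name depends on the Kafka exporter in use; this one is only an example.
QUERY = "sum by (consumergroup) (kafka_consumergroup_lag)"
LAG_THRESHOLD = 10_000

def consumer_groups_with_high_lag() -> list[str]:
    """Return consumer groups whose total lag exceeds the threshold."""
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    samples = resp.json()["data"]["result"]
    return [s["metric"]["consumergroup"]
            for s in samples if float(s["value"][1]) > LAG_THRESHOLD]

if __name__ == "__main__":
    for group in consumer_groups_with_high_lag():
        print(f"ALERT: consumer group {group} is lagging")   # or call a webhook / pager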

Observing services and infrastructure

Start from the foundation of any platform: IT infrastructure and core services.

Technical background:
• Hadoop cluster in the public cloud
• Hive Metastore, Ranger and Ambari, backed by a SQL database provided by the cloud platform
• NiFi is responsible for ingesting data

Here we take advantage of mixed monitoring and export the built-in metrics from the default cloud monitoring to the target solution. This allows all metrics and logs to be stored in one place. Monitoring the cloud components is necessary when services use them for storing data, because those services would stop without access to them. Each flow in NiFi can be observed, because it may have an impact on the next steps in the processing platform.

Real Life Scenarios - infrastructure

Kubernetes
It is highly recommended to monitor our Kubernetes cluster when it is a vital part of the production analytics solution (a minimal health-check sketch follows the list below).

Metrics:

• Performance overview to validate whether our cluster has enough resources, by analysing CPU, RAM and disk usage and utilization, and observing Go process metrics.
• Status of services like the Flink operator or Spark operator, to check that our processes are up and that there is no delay in processing data.
• Pod and application metrics - custom metrics, like network health checks.
• Node performance and node logs.
• Such data should be enriched with the metrics of the servers on which Kubernetes runs.
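In practice these signals usually come from kube-state-metrics and the node exporter scraped by Prometheus. The sketch below only illustrates the underlying checks with the official Kubernetes Python client: it lists nodes that are not Ready and pods that are not Running or Succeeded. It is a simplified illustration, not a replacement for a metrics pipeline.

from kubernetes import client, config

def unhealthy_workloads() -> tuple[list[str], list[str]]:
    """Return (nodes that are not Ready, pods that are not Running/Succeeded)."""
    config.load_kube_config()        # use config.load_incluster_config() inside a pod
    v1 = client.CoreV1Api()

    bad_nodes = [
        node.metadata.name
        for node in v1.list_node().items
        if not any(c.type == "Ready" and c.status == "True"
                   for c in node.status.conditions)
    ]
    bad_pods = [
        f"{pod.metadata.namespace}/{pod.metadata.name} ({pod.status.phase})"
        for pod in v1.list_pod_for_all_namespaces().items
        if pod.status.phase not in ("Running", "Succeeded")
    ]
    return bad_nodes, bad_pods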

Real Life Scenarios - services

Monitoring of core Hadoop components

Here we talk about HDFS and YARN. They are important because HDFS is responsible for storing data, while YARN is responsible for managing cluster resources and submitted jobs. A small sketch after the metrics list shows how one of these indicators can be read.

Metrics:

• State of HDFS - it can be checked by reading the number of under-replicated blocks (it should be equal to 0), NameNode performance and what replication looks like.
• YARN status, to check that it is up and that there are no issues with submitting applications.
• NodeManagers' status and resource usage, to keep all worker nodes available. A lost NodeManager can cause issues with running applications.
• YARN queues - we can observe the status of each queue and the number of running, submitted or failed applications to check that everything is OK.
• JVM metrics like Java heap usage, garbage collection overhead or the number of active threads, to check that Java processes work smoothly.
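The under-replicated blocks counter mentioned above is exposed by the NameNode over its JMX HTTP endpoint, and dedicated exporters (for example the JMX exporter) normally turn it into a Prometheus metric. The sketch below reads the raw value directly; the NameNode host and port (9870 on Hadoop 3, 50070 on Hadoop 2) are assumptions about the cluster.

import requests

NAMENODE_JMX = "http://namenode:9870/jmx"   # adjust host and port to your cluster

def under_replicated_blocks() -> int:
    """Read the UnderReplicatedBlocks counter from the NameNode JMX endpoint."""
    params = {"qry": "Hadoop:service=NameNode,name=FSNamesystem"}
    beans = requests.get(NAMENODE_JMX, params=params, timeout=10).json()["beans"]
    return int(beans[0]["UnderReplicatedBlocks"])

if __name__ == "__main__":
    print(f"Under-replicated blocks: {under_replicated_blocks()}")  # should stay at 0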

Flink Jobs Monitoring


Each component of the Flink jobs should be monitored, enriched with information about the data source and the target to which Flink writes. It is especially important to check the progress of the job to prevent downtimes (a small sketch of a job-status check follows the list).

Metrics:

• Progress of the job - it is required to check the progress of the Flink job, how it works and whether there is any lag, which matters in real-time processing. Watermark progress and latency metrics provide this knowledge.
• Overall performance - we can enrich our metrics by counting the number of processed events, which is also useful for predicting resource usage by Flink jobs.
• JVM metrics, to check that Java processes work smoothly.
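Flink can report these metrics straight to Prometheus through its metric reporters, but its REST API is also handy for simple status checks. The sketch below lists jobs that are not in the RUNNING state; the JobManager address is an assumption.

import requests

FLINK_REST = "http://jobmanager:8081"    # default Flink REST / web UI port

def jobs_not_running() -> list[str]:
    """Return names and states of Flink jobs that are not in the RUNNING state."""
    jobs = requests.get(f"{FLINK_REST}/jobs/overview", timeout=10).json()["jobs"]
    return [f'{job["name"]} ({job["state"]})'
            for job in jobs if job["state"] != "RUNNING"]

if __name__ == "__main__":
    for job in jobs_not_running():
        print(f"Flink job not running: {job}")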

Hive Monitoring

Hive queries can run for a long time, so we need to make sure that they can finish with no delays.

Metrics:

• It should be enriched with the monitoring of HDFS and YARN.
• Hive Metastore status and its connection to the database - Hive will not run without the Metastore.
• Data quality and Hive queries, checked by running validation queries that expose information about the data.
• JVM metrics, to check that Java processes work smoothly.

Kafka Monitoring


Kafka is used for sending data, so any issue can cause lag in data processing or the loss of some events (a minimal partition-health sketch follows the list).

Metrics:

• The number of sent events, to check that there are no issues with data sources or with consuming messages.
• Broker state - we check Kafka performance and its state.
• State of the partitions - it is necessary to observe the number of partitions, their leaders and election processes to avoid any offline partitions.
• Additional information like the metrics of each server on which Kafka is running: disk usage, CPU utilization and RAM usage.
• JVM metrics, to check that Java processes work smoothly.
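Partition health is usually exposed to Prometheus by a Kafka exporter or the JMX exporter, but the same information can be read directly from cluster metadata. The sketch below, using the confluent_kafka AdminClient, lists partitions whose in-sync replica set is smaller than the full replica set; the bootstrap address is an assumption.

from confluent_kafka.admin import AdminClient

def under_replicated_partitions(bootstrap: str = "kafka:9092") -> list[str]:
    """Partitions whose in-sync replica set is smaller than the replica set."""
    admin = AdminClient({"bootstrap.servers": bootstrap})
    metadata = admin.list_topics(timeout=10)
    problems = []
    for topic in metadata.topics.values():
        for partition_id, partition in topic.partitions.items():
            if len(partition.isrs) < len(partition.replicas):
                problems.append(f"{topic.topic}-{partition_id}")
    return problems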

Spark Jobs Monitoring


We need to check that all our Spark jobs work as expected. There are different kinds of Spark jobs, such as batch or streaming, but they require a similar set of metrics to provide high performance and no downtime.

Metrics:

• Task progress, to detect low performance.
• Executor status and how tasks process data. We can also observe the number of processed events.
• Additional information about the data source and the data storage where Spark saves information.
• JVM metrics, to check that Java processes work smoothly.

SQL Databases Monitoring


It is necessary to monitor SQL databases. They are used as the metastores for many services and they can also be used for storing data. A small connection-usage sketch follows the list.

Metrics:

• A performance overview is helpful to observe the usage of the database and plan upgrades if needed.
• The number of connections from processes shows us whether everything is OK and whether there are still connections available in the database.
• The number of running queries.
• Replication status has to be observed when we run a cluster, and we need to validate that this process works smoothly. Any issues can cause downtimes in the future and we may lose high availability.
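Dedicated exporters (such as postgres_exporter or mysqld_exporter) usually provide these metrics, but the connection check is simple enough to show directly. The sketch below assumes a PostgreSQL metastore and an illustrative DSN.

import psycopg2

def connection_usage(dsn: str = "host=db dbname=metastore user=monitor") -> float:
    """Fraction of PostgreSQL connection slots currently in use."""
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT count(*) FROM pg_stat_activity")
            used = cur.fetchone()[0]
            cur.execute("SHOW max_connections")
            limit = int(cur.fetchone()[0])
    finally:
        conn.close()
    return used / limit

# Example: alert when more than 80% of the connection slots are taken.
# if connection_usage() > 0.8: ...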

Solution architecture

Monitoring and observability



At GetInData we have prepared an architecture for monitoring all components and observing how the processes are working, which has been deployed and verified in many production environments. A monitoring system is a must in any data platform or IT service. It provides information about the current situation, and some issues can even be resolved automatically. We can trigger actions in case of problems, for example restarting a Flink job when it fails, based on the metric that shows the number of Task Managers. If the restart is not successful, then we get an alert. As described in the book Site Reliability Engineering:

Alerts signify that a human needs to take action immediately in response to something that is either happening or about to happen, in order to improve the situation.

The GetInData monitoring platform is based on Prometheus, a Cloud Native Computing Foundation project, which is a monitoring system and time-series database. It supports a multi-dimensional data model (time series defined by a metric name and a set of key/value dimensions) and a flexible query language to leverage this dimensionality, and targets are discovered via service discovery or static configuration. It pulls metrics from targets and supports various applications, services and platforms. It also works very well in containerized environments.

Alerts are handled by Alertmanager, a Prometheus component. It takes care of deduplicating and grouping alerts, and routing them to the correct receiver integrations such as email, Slack, Mattermost, PagerDuty or OpsGenie. It also takes care of silencing and inhibition of alerts. The latter is a concept of suppressing notifications for certain alerts if corresponding alerts have already been fired.

Metrics visualisation is provided by Grafana. It allows you to query, visualize and understand metrics. Grafana also makes it possible to define alerts on the visualised metrics. The user interface delivers a great user experience - we can add panels with information from different data sources to one dashboard, which provides great value in terms of monitoring and observability.

This stack provides flexibility, is based on an easily scalable time-series database and consists of open-source components. Prometheus delivers a High Availability mode in two variants and can be extended with cold storage if there is a need to store metrics older than 30 days.

Prometheus High Availability setup

Prometheus with Cortex

Log analytics


Moreover, GetInData delivers a platform for log analytics. We need to start with some questions to explain why and for what monitoring systems with log analytics can be used. The first thing is about performing complex monitoring of each process in our platform. When talking about Big Data solutions, it is imperative to check that all real-time processing jobs work as expected, because we have to act quickly if there are any issues. It is also important to validate how any changes in the code help in the processing part.

Here we can talk about processing jobs that run on Apache Spark and Apache Flink. The first part of the monitoring process is focused on getting metrics, like the number of processed events, JVM statistics or the number of used Task Managers. The second is about log analytics. We want to detect any warnings or errors in the log files and analyze them later during a post-mortem, or use them to find any invalid data sources. Moreover, we can set up alerts based on the log files, which can be really helpful for detecting issues, even with related components.

There is also a need to provide all log files in real time, because any lag in sending them can cause problems and would not provide the required effect for IT and software developers. In the case of a Flink job, we want to check that all triggers work as expected, and if not, we need to find the reason in the log files. We also want to find values in logs later by looking for an exact phrase.

Applications log analytics


Application log analytics is useful for all developers and operations, because they can debug and check what is happening in the pipelines. There are a lot of solutions based on the Elastic Stack, but we recommend using Loki in these use cases. Loki requires far fewer resources, is designed for tailing log files from any machine - virtual machine, Docker container, Kubernetes pod or bare-metal server - and can be used as a data source in Grafana.

As previously mentioned, our solution is based on Loki. It is a horizontally scalable, highly available, multi-tenant log aggregation system inspired by Prometheus. It is designed to be very cost effective and easy to operate. It does not index the contents of the logs, but rather a set of labels for each log stream.

Compared to other log aggregation systems, Loki:

• does not provide full-text indexing on logs. By storing compressed, unstructured logs and only indexing metadata, Loki is simpler to operate and cheaper to run.
• indexes and groups log streams using the same labels you’re already using with Prometheus, enabling you to seamlessly switch between metrics and logs.
• is an especially good fit for storing Kubernetes Pod logs. Metadata such as Pod labels is automatically scraped and indexed.
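Label-based selection plus line filtering is also how Loki is queried programmatically. The sketch below calls Loki's HTTP query_range endpoint to pull recent lines containing "ERROR" for one application; the Loki address and the app label are assumptions about how the log streams are labeled.

import time
import requests

LOKI = "http://loki:3100"

def recent_errors(app: str = "flink-job", minutes: int = 15) -> list[str]:
    """Fetch recent log lines containing 'ERROR' for one application label."""
    end = time.time_ns()
    start = end - minutes * 60 * 1_000_000_000
    params = {
        "query": f'{{app="{app}"}} |= "ERROR"',   # LogQL: label selector + line filter
        "start": start,
        "end": end,
        "limit": 100,
    }
    resp = requests.get(f"{LOKI}/loki/api/v1/query_range", params=params, timeout=10)
    resp.raise_for_status()
    streams = resp.json()["data"]["result"]
    return [line for stream in streams for _, line in stream["values"]]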


Moreover, logs gathered by Loki can be visualized directly in Grafana. This means we can create panels with metrics and logs in one dashboard, to make our platform easily observable and to detect any dependencies.

Loki requires few resources to get running and can be set up in a high availability configuration. It also supports horizontal scaling.


Loki is a great choice if you want to:

• gather logs from applications
• store log files when you do not need to query very old data
• set up a log analytics platform with few resources
• have metrics and logs in one Grafana dashboard

PROS:
• Low technical requirements
• Support for adding dashboards in Grafana
• Simple configuration
• Open-source solution

CONS:
• Indexes only labels, which makes running queries not the fastest
• HA setup requires S3 storage and a key-value database

Business logs analytics


The Elastic Stack is well known for its various implementations. Elasticsearch, a distributed, open-source search and analytics engine, provides data indexing; once indexed, the data can be queried by users to retrieve complex summaries. From Kibana, its visualisation tool, users can create powerful visualizations of their data, share dashboards and manage the Elastic Stack.

Logs are delivered to Elasticsearch by Logstash. It is used to aggregate and process data - we can enrich and transform it before it is indexed into Elasticsearch. Information from log files can be read and shipped by Filebeat.

Elasticsearch is fast in processing queries, which is crucial in analyzing business logs. It indexes all log files, so it can deliver results in near real time with low latency.
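Data indexed by Logstash or Filebeat can also be queried programmatically. The sketch below uses the official elasticsearch Python client (8.x style) to fetch the newest error-level documents; the index pattern and field names depend on the ingest pipeline, so they are only illustrative.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://elasticsearch:9200")

def recent_error_logs(index: str = "logstash-*", size: int = 50) -> list[dict]:
    """Return the newest documents whose log level is ERROR."""
    response = es.search(
        index=index,
        query={"match": {"level": "ERROR"}},
        sort=[{"@timestamp": {"order": "desc"}}],
        size=size,
    )
    return [hit["_source"] for hit in response["hits"]["hits"]]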


The ELK stack meets your requirements if you want to:

• run a great number of queries on logs
• gather business logs for business users
• ingest data from different sources, not only from log files

PROS:
• Great query performance
• Distributed system out of the box
• Open-source solution

CONS:
• Requires powerful resources (in comparison to Loki)
• Data can be visualised only in Kibana

How to implement a monitoring platform for observability?

At GetInData we believe that implemented software should always be a complete solution providing the required usability for business users, technical users and the IT team.

Firstly, technical and business requirements need to be defined to enable preparing everything needed to achieve observability of the platform. Secondly, it is important to decide on the target groups of users who will use the platform. Thirdly, all activities should be followed by a decision about implementing it as a dedicated system or as an extension to the present one.

Metrics and log files from all components and applications have to be delivered to one central store to make them useful for any user involved in the project. This can be achieved with Prometheus and its exporters for any Big Data services, data processing jobs or queries, and even for checking the status of the IT infrastructure. Log files are read and sent to the destination for running queries by a scalable and reliable service.

It is valuable to have the opportunity to query logs to debug a Spark job, or to find out whether one exact record was processed as expected or not. Such a tool might be helpful for everyone in the team - it is a tool for validation and checking.

Alerts and logs provide automated triggers, which means that most issues can be solved by the platform itself with no need for human action. Less downtime and a faster response time in an emergency situation give the confidence that all data will be processed and we will deliver business value all the time without any issues.

High observability of the system means that the status of each part of the data platform is easy to understand, and it helps in maintaining it.

About GetInData
GetInData is a data analytics company founded in 2014 by ex-Spotify data engineers. From day one we have focused on Big Data projects. We bring together a group of the best and most experienced experts in Poland working with the cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets.

Our experts have vast production experience in implementing Big Data projects for Polish as well as foreign companies including, among others, Spotify, Play, Truecaller, Kcell, Acast, Allegro, ING, Agora, Synerise, StepStone, iZettle and many others from the pharmaceutical, media, finance and FMCG industries.

Have a project to discuss?
Contact us!
We will reach out to schedule a call and settle the next steps.

GetInData Sp. z o.o. Sp. komandytowa


ul. Puławska 39/20,
02-508 Warszawa
Poland

hello@getindata.com

www.getindata.com
