Observability for Data Platform
Observability in practice
…dependency on the system's complexity. Higher complexity means more interactions between components and more logs and metrics to analyse. It requires experience with handling such challenges in the Big Data world and an understanding of which indicators should be observed in each situation.
Observability requires three things to get started. The first is metrics, because we need to monitor IT infrastructure and services; data points could be counters, gauges or any other metric type. The second relies on log data: it is the story of the process and becomes necessary in a DevOps-oriented environment. The last one is about capturing how users engage with our systems and which actions can cause errors or bottlenecks.
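To illustrate the first pillar, here is a minimal sketch of the two basic metric types using the Python prometheus_client library; the metric names, port and values are illustrative, not part of the text:

```python
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

# Counter: a value that only ever goes up (e.g. processed events).
EVENTS_PROCESSED = Counter(
    "events_processed", "Total number of events processed"
)
# Gauge: a value that can go up and down (e.g. queue depth).
QUEUE_DEPTH = Gauge(
    "ingest_queue_depth", "Current number of events waiting in the queue"
)

if __name__ == "__main__":
    start_http_server(8000)          # expose /metrics for Prometheus to pull
    while True:
        EVENTS_PROCESSED.inc()                    # counters only increase
        QUEUE_DEPTH.set(random.randint(0, 100))   # gauges move both ways
        time.sleep(1)
```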
SRE shares ownership with developers by using the same tools and techniques across the stack, and has a formula for balancing accidents and failures against new releases. Moreover, it encourages "automating this year's job away" and minimizing manual systems work to focus on efforts that bring long-term value to the system. It shows that operations is a software problem, and defines prescriptive ways for measuring availability, uptime, outages, toil, etc.
…business value. Any issues can be prevented and solved before something goes wrong, which means avoiding downtimes and reducing the costs of such incidents.
There are multiple benefits of implementing a comprehensive monitoring solution:
• Avoid downtimes of the platform
• Prevent issues and react immediately if a triggered action is not enough
• Meet business and technical requirements, and validate SLA, SLO and SLI directly in the monitoring platform
SLA is a contract in which the service provider promises customers a given level of service availability and performance. SLO is a goal that the service provider wants to reach. SLI is a measurement the service provider uses to track the goal.
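To make the three definitions concrete, here is a small worked example in Python; the 99.9 % target and the request counts are illustrative numbers, not figures from the text:

```python
# Worked example: turning an SLO into an error budget.
# Assumed target: 99.9 % availability over a 30-day window.
slo = 0.999
window_minutes = 30 * 24 * 60            # 43,200 minutes in 30 days

error_budget_minutes = (1 - slo) * window_minutes
print(f"Allowed downtime: {error_budget_minutes:.1f} min/month")  # 43.2 min

# The SLI is the measured counterpart, e.g. the ratio of good requests:
good_requests, total_requests = 999_532, 1_000_000
sli = good_requests / total_requests
print(f"Measured SLI: {sli:.4%}, SLO met: {sli >= slo}")
```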
Use cases
Technical background:
• On-premise Hadoop cluster
• Data Engineers run Spark Batch jobs
• Data Analysts run Hive queries
Spark jobs would expose metrics directly to a module that enables Prometheus to pull them. Developers are interested in checking job log files for any outages: what kinds of errors occur and whether a given search phrase appears. Analysts would like a stable tool for running Hive queries, which can take a long time to process. We need to make sure there are enough resources and that a query does not interfere with other applications.
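For a short-lived batch job, one common pattern is to push final metrics to a Prometheus Pushgateway, from which Prometheus then pulls. A minimal PySpark sketch; the job name, input path and gateway address are assumptions:

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-batch").getOrCreate()

# Run the batch step and record how many rows it produced.
row_count = spark.read.parquet("/data/events/2021-01-01").count()

registry = CollectorRegistry()
rows = Gauge("batch_rows_processed",
             "Rows processed by the daily batch job",
             registry=registry)
rows.set(row_count)

# Push once at the end of the run; Prometheus scrapes the gateway.
push_to_gateway("pushgateway:9091", job="daily-batch", registry=registry)
```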
Technical background:
• Platform built in the public cloud with S3 storage
• Great number of Hive queries
• ETL pipelines are necessary for the next steps of the processing layer
We can collect data by using a custom Prometheus Exporter, visualize it in Grafana and send alerts if there are any issues. We should also monitor logs to detect warnings and errors, to prevent downtimes in the next steps of the pipeline.
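A custom exporter is usually just a small process exposing a /metrics endpoint. A minimal sketch for the S3-based pipeline above; the metric name, port and freshness check are hypothetical:

```python
import time

from prometheus_client import Gauge, start_http_server

# Hypothetical check: age of the newest object in the landing location.
DATA_FRESHNESS = Gauge(
    "etl_input_freshness_seconds",
    "Seconds since the newest object arrived in the input location",
)

def newest_object_age_seconds():
    # Placeholder: a real exporter would call the S3 API here
    # (e.g. boto3 list_objects_v2) and compare object timestamps.
    return 120.0

if __name__ == "__main__":
    start_http_server(9200)          # Prometheus scrapes this port
    while True:
        DATA_FRESHNESS.set(newest_object_age_seconds())
        time.sleep(30)
```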
Technical background:
Infrastructure
Start from the foundation of any platform: IT infrastructure and
core services.
Technical background:
• Hadoop cluster in the public cloud
• Hive Metastore, Ranger and Ambari Admin managed by
• SQL database provided by the cloud platform
• NiFi is responsible for ingesting data
Kubernetes
It is highly recommended to monitor our Kubernetes cluster when
it is a vital part of the production analytics solution.
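Typically this is done by deploying kube-state-metrics and node exporters and letting Prometheus scrape them. As a minimal illustration only, the official Kubernetes Python client can also check node health directly; the kubeconfig setup is an assumption:

```python
from kubernetes import client, config

# Assumes a reachable kubeconfig; code running inside the cluster
# would use config.load_incluster_config() instead.
config.load_kube_config()
v1 = client.CoreV1Api()

# Report every node whose Ready condition is not True.
for node in v1.list_node().items:
    for cond in node.status.conditions or []:
        if cond.type == "Ready" and cond.status != "True":
            print(f"Node {node.metadata.name} is not ready: {cond.reason}")
```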
Metrics:
Services
Metrics:
Each component of the Flink jobs should be monitored, so the monitoring has to be enriched with information about the job's data source and the target to which Flink writes. It is especially important to check the progress of the job to prevent downtimes, as in the sketch below.
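Job state and progress can be read through the REST API exposed by the Flink JobManager; a minimal sketch, where the hostname is an assumption:

```python
import requests

FLINK = "http://jobmanager:8081"   # JobManager REST endpoint (assumed host)

# /jobs/overview lists every job together with its current state.
overview = requests.get(f"{FLINK}/jobs/overview", timeout=5).json()
for job in overview["jobs"]:
    if job["state"] != "RUNNING":
        print(f"Job {job['name']} ({job['jid']}) is {job['state']}")
```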
Metrics:
Hive Monitoring
Hive queries can run for a long time, so we need to make sure that they can finish without delays.
Metrics:
Kafka is used for sending data, so any issue can cause lag in data processing or the loss of some events.
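Consumer lag, the distance between the latest offset and the committed offset, is the key symptom here. A minimal sketch using the kafka-python client; the topic, group and broker names are assumptions:

```python
from kafka import KafkaConsumer, TopicPartition

# Lag per partition = latest offset minus committed offset.
consumer = KafkaConsumer(group_id="etl-consumers",
                         bootstrap_servers="kafka:9092")

for partition in consumer.partitions_for_topic("events"):
    tp = TopicPartition("events", partition)
    latest = consumer.end_offsets([tp])[tp]
    committed = consumer.committed(tp) or 0
    print(f"partition {partition}: lag = {latest - committed}")
```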
Metrics:
It is necessary to monitor SQL databases. They are used as the metastores for many services and they can be used for storing data.
Metrics:
Alertmanager takes care of deduplicating, grouping, and routing alerts to the correct receiver integrations such as email, Slack, Mattermost, PagerDuty, or OpsGenie. It also takes care of silencing and inhibition of alerts. The latter is a concept of suppressing notifications for certain alerts if corresponding alerts have already been fired.
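Alerts usually come from Prometheus alerting rules, but they can also be posted straight to Alertmanager's HTTP API, which is handy for testing a routing or inhibition setup. A sketch; the host, alert name and label values are assumptions:

```python
from datetime import datetime, timedelta, timezone

import requests

# Fire a test alert; Alertmanager groups and routes it by its labels.
now = datetime.now(timezone.utc)
alerts = [{
    "labels": {"alertname": "HiveQueryStuck", "severity": "warning"},
    "annotations": {"summary": "Query running longer than 2h"},
    "startsAt": now.isoformat(),
    "endsAt": (now + timedelta(minutes=15)).isoformat(),
}]
requests.post("http://alertmanager:9093/api/v2/alerts",
              json=alerts, timeout=5)
```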
Moreover, GetInData delivers the platform for log analytics. We need to start with some questions to explain why, and for what, monitoring systems with log analytics can be used. The first thing is about performing comprehensive monitoring of each process in our platform. When talking about Big Data solutions, it is imperative to check that all real-time processing jobs work as expected, because we have to act quickly if there are any issues. It is also important to validate how any changes in the code affect the processing.
Here we can talk about processing jobs that run on Apache Spark and Apache Flink. The first part of the monitoring process is focused on getting metrics, like the number of processed events, JVM statistics or the number of Task Managers used. The second is about log analytics: we want to detect any warnings or errors in the log files and analyze them later during a post mortem, or use them to find any invalid data sources. Moreover, we can set up alerts based on the log files, which could be really helpful for detecting issues, even with related components.
There is also a need to provide all log files in real time, because any lag in sending them can cause problems and would not give IT and software developers the required effect. In the case of a Flink job, we want to check that all triggers work as expected, and if not, we need to find the reason in the log files. We also want to find values in the logs later by looking for an exact phrase.
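Once the logs are indexed, phrase search becomes a short query. A sketch with the elasticsearch-py client; the endpoint, index pattern, field name and phrase are assumptions:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://elasticsearch:9200")   # assumed endpoint

# Look for an exact phrase in the Flink job logs.
resp = es.search(index="flink-logs-*", query={
    "match_phrase": {"message": "Could not forward element"}
})
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["@timestamp"], hit["_source"]["message"])
```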
…and can be used as the data source in a Grafana dashboard, to make our platform easily observable and to detect any dependencies.
PROS:
• Simple configuration
• Open-source solution
CONS:
The Elastic Stack is well known for its various implementations. Elasticsearch, a distributed, open-source search and analytics engine, provides data indexing; once indexed, the data can be queried by users to retrieve complex summaries. From Kibana, a visualisation tool, users can create powerful visualizations of their data, share dashboards, and manage the Elastic Stack.
PROS:
• Open-source solution
CONS:
How to build a monitoring platform for observability?
Metrics and log files from all components and applications have to be delivered to one central store to make them useful for every user involved in the projects. This can be delivered with Prometheus and its exporters for any Big Data services, data processing jobs or queries, and even for checking the status of the IT infrastructure. Log files would be read and sent by a scalable and reliable service to a destination where queries can be run against them.
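As an illustration of that last step, a naive log shipper can be just a loop that tails a file and forwards new lines; in practice a dedicated agent such as Filebeat or Fluentd does this job. The endpoint and file path below are assumptions:

```python
import time

import requests

# Tail a log file and forward each new line to a central collector.
ENDPOINT = "http://log-collector:8080/ingest"   # hypothetical endpoint

with open("/var/log/app/job.log") as f:
    f.seek(0, 2)                  # start reading from the end of the file
    while True:
        line = f.readline()
        if line:
            requests.post(ENDPOINT, json={"line": line.rstrip()}, timeout=5)
        else:
            time.sleep(0.5)       # wait for new lines to be written
```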
About GetInData
GetInData is a data analytics company founded in 2014 by ex-Spotify data engineers. From day one we have focused on Big Data projects. We bring together a group of the best and most experienced experts in Poland working with cloud and open-source Big Data technologies to help companies build scalable data architectures and implement advanced analytics over large data sets.
hello@getindata.com
www.getindata.com