
Databricks usage and cost analysis

George Kozlov · 5 min read · Jun 8, 2022

Modern data-driven applications require a modern approach to data processing and transformation, and at People.ai, we rely on Databricks for a number of our data processing workloads.

Databricks has been a reliable partner for us, providing a cloud-based orchestration platform for our Spark-driven data processing. We use Databricks primarily for production data analysis (so-called all-purpose and light jobs); however, we also find the platform helpful for product proof-of-concept solutions.


Problem
Historically, Databricks focused on Spark and its runtime, not paying much attention to easily digestible metrics and usage stats, which leads to complications and confusion for hyper-growth companies. Understanding job costs and overall spending with Databricks becomes vital with the growing demand from product and engineering teams.

Databricks offers flexible tools to spin up clusters and run all kinds of data processing jobs in the cloud; however, it can get out of control quickly. In other words, there is no mechanism in place to calculate the efficiency of a created cluster, as there are many dependencies on the actual user code deployed to the cluster, the amount and type of data, and so on. As a result, users can overestimate the required compute power and waste processing resources without clear visibility.

Another critical problem is that, compared to Snowflake or similar data processing platforms, Databricks is an orchestration platform that operates within the customer's cloud and spins up the compute instances there (we use AWS, and I will refer to it throughout the article). That means you pay for both the compute power and Databricks itself, and measuring the combined cost is incredibly difficult. This article shows our approach to measuring Databricks usage and cost, along with cost optimization recommendations.

Solution
Databricks offers detailed billable usage reports that you can download and analyze using the “root” (admin) account. The reports are available as CSV files with details of each job run and the allocated AWS instance type requested by the user. Databricks also supports exporting the reports to S3 storage and provides an API so that you can automate report ingestion; that's what we're going to do.
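For orientation, here is a minimal sketch of pulling the billable usage CSV through the account-level API. The endpoint path, parameters, and basic-auth scheme follow the Databricks billable usage download API as I understand it; verify them against the current account API docs, and treat the account ID and credentials as placeholders.

```python
import csv
import io

import requests

# Placeholders -- substitute your own account ID and account-admin credentials.
ACCOUNT_ID = "<databricks-account-id>"
AUTH = ("<account-admin-email>", "<password>")

# Billable usage download endpoint (verify against the current account API docs).
url = f"https://accounts.cloud.databricks.com/api/2.0/accounts/{ACCOUNT_ID}/usage/download"
params = {"start_month": "2022-05", "end_month": "2022-06", "personal_data": "false"}

resp = requests.get(url, params=params, auth=AUTH)
resp.raise_for_status()

# The response body is a CSV document, one row per cluster usage record.
rows = list(csv.DictReader(io.StringIO(resp.text)))
print(f"Fetched {len(rows)} usage records")
```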

For the automation, we rely on the Workato platform and build a scheduled job that retrieves the report from Databricks, parses the CSV file, “pivots” the raw report by the job name, and generates the output report for analysis. The tricky part of that puzzle is the cost approximation that we do on the fly based on the “on-demand” cost of the AWS instance in the specific region and the known price of the Databricks Unit (DBU). To do the cost calculation, we need to know the following information:


The exact type of AWS instance and target region;

The precise price of DBU for all-purpose, compute, and light jobs;

The accurate price of AWS instance by its type.

Note: Unfortunately, it’s not possible to determine the exact price of an AWS
instance as it depends on your plan, possible Enterprise Discount Program
(EDP), reserved instances, and even spot instances if requested. Databricks
doesn’t provide details on an allocated instance except for instance type, so in
our approximation, we rely on on-demand prices and apply an EDP discount.
That approach doesn’t give you an exact cost of the job; however, it provides
visibility on approximate cost so that the engineering team can extrapolate it
and adjust the clusters if needed.
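To make the approximation concrete, here is a minimal sketch of the per-row calculation. The 10% EDP discount, the sample prices, and the sample usage row are illustrative assumptions, not our actual rates:

```python
def approximate_cost(record: dict, ec2_hourly_price: float, dbu_price: float,
                     edp_discount: float = 0.10) -> float:
    """Approximate cost of one billable-usage row: discounted on-demand AWS
    cost plus the DBU cost. The 10% EDP discount is an illustrative default."""
    aws_cost = float(record["machineHours"]) * ec2_hourly_price * (1 - edp_discount)
    dbu_cost = float(record["dbus"]) * dbu_price
    return aws_cost + dbu_cost

# Hypothetical usage row and prices (on-demand $/hour and $/DBU), for illustration.
row = {"machineHours": "2.5", "dbus": "7.5",
       "sku": "STANDARD_JOBS_COMPUTE", "clusterNodeType": "i3.xlarge"}
print(approximate_cost(row, ec2_hourly_price=0.312, dbu_price=0.15))
```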

For data aggregation on the Workato platform, we use a Collection application that supports loading data into a virtual SQL-like table and running SQL queries against the loaded dataset (Workato uses SQLite for collections behind the scenes). Generally speaking, we generate four virtual tables (collections) and query the data via an SQL query with joins between the tables to get the report. Here is a simple database diagram of the implemented solution.


usage: workspaceId (varchar), timestamp (timestamp), clusterId (varchar), clusterName (varchar), clusterNodeType (varchar), clusterOwnerUserId (varchar), clusterCustomTags (varchar), sku (varchar), dbus (varchar), machineHours (varchar), clusterOwnerUserName (varchar), tags (varchar)

jobs: jobId (varchar), jobName (varchar), ec2InstanceModel (varchar), creatorEmail (varchar)

ec2_prices: type (varchar), price (decimal)

dbu_prices: sku (varchar), price (decimal)

Because DBU prices are static and depend on your agreement with Databricks, we load them into collections via static JSON objects. The same applies to AWS prices, as we operate in a single region for our production environment. I was thinking about requesting AWS instance prices via an API; however, I simplified the automation as prices are mostly static, and the list is easy to maintain manually. That might be a good improvement, though.
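For reference, the static objects are simply lists of rows matching the two price collections above. The SKU names and prices below are placeholders for illustration, not the rates from our agreement:

```python
# Example static price lists in the shape of the dbu_prices and ec2_prices
# collections. SKU names and prices are illustrative placeholders.
DBU_PRICES = [
    {"sku": "STANDARD_ALL_PURPOSE_COMPUTE", "price": 0.40},
    {"sku": "STANDARD_JOBS_COMPUTE", "price": 0.15},
    {"sku": "STANDARD_JOBS_LIGHT_COMPUTE", "price": 0.07},
]
EC2_PRICES = [
    {"type": "i3.xlarge", "price": 0.312},
    {"type": "r5.2xlarge", "price": 0.504},
]
```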


The detailed usage report doesn't contain job details; it lists the tasks executed by each job run and the id of the job associated with it. To enrich the resulting report with job-level details, we retrieve all jobs via the Databricks Jobs API.

Note: That step is essential if you are interested in actual job names, as they are not available in the report; otherwise, it can be skipped.
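A minimal sketch of that lookup is below, building a job_id → (name, creator) map with the Jobs API 2.1 list endpoint. The workspace URL and token are placeholders, and the pagination parameters should be checked against the current Jobs API docs:

```python
import requests

# Placeholders -- substitute your workspace URL and a personal access token.
WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"
HEADERS = {"Authorization": "Bearer <personal-access-token>"}

def list_jobs(page_size: int = 25) -> dict:
    """Return a map of job_id -> (job name, creator) from the Jobs 2.1 API."""
    jobs, offset = {}, 0
    while True:
        resp = requests.get(
            f"{WORKSPACE_URL}/api/2.1/jobs/list",
            headers=HEADERS,
            params={"limit": page_size, "offset": offset},
        )
        resp.raise_for_status()
        payload = resp.json()
        for job in payload.get("jobs", []):
            jobs[str(job["job_id"])] = (job["settings"]["name"],
                                        job.get("creator_user_name", ""))
        if not payload.get("has_more"):
            break
        offset += page_size
    return jobs
```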

As the final step, we query the data via an SQL query and generate a CSV report. Here is the general shape of the query that we use for data aggregation:
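Since Workato keeps collections in SQLite, the same aggregation can be sketched locally as below. The table and column names follow the diagram above, while the join from usage to jobs through a JobId tag and the 10% discount factor are illustrative assumptions rather than our exact production query:

```python
import sqlite3

# Empty stand-ins for the four Workato collections; in the real recipe these
# are populated from the usage CSV, the Jobs API, and the static price lists.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE usage      (tags TEXT, sku TEXT, dbus TEXT, machineHours TEXT,
                         clusterNodeType TEXT);
CREATE TABLE jobs       (jobId TEXT, jobName TEXT, creatorEmail TEXT);
CREATE TABLE dbu_prices (sku TEXT, price REAL);
CREATE TABLE ec2_prices (type TEXT, price REAL);
""")

AGGREGATION_SQL = """
SELECT
    j.jobName,
    j.creatorEmail,
    SUM(CAST(u.dbus AS REAL) * d.price)                AS dbu_cost,
    SUM(CAST(u.machineHours AS REAL) * e.price * 0.9)  AS aws_cost
FROM usage u
JOIN jobs j       ON j.jobId = CAST(json_extract(u.tags, '$.JobId') AS TEXT)  -- assumed join key
JOIN dbu_prices d ON d.sku  = u.sku
JOIN ec2_prices e ON e.type = u.clusterNodeType
GROUP BY j.jobName, j.creatorEmail
ORDER BY dbu_cost DESC;
"""

for job_name, creator, dbu_cost, aws_cost in conn.execute(AGGREGATION_SQL):
    print(job_name, creator, round(dbu_cost + aws_cost, 2))
```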


Note: Databricks returns tags as a JSON object, and you can extract any JSON
component using the json_extract function supported by SQLite.

What next?
There is no limit to perfection, right? I was thinking about many possible outcomes for that automation, and here is a list of possible improvements:
Retrieve AWS instance prices via API;

Upload the resulting report to Google Docs right away;

Run the automation periodically and write data incrementally to Snowflake;

Build a dynamic report using Tableau on top of the Snowflake table.


Conclusion
It all depends on your requirements and goals. We generate the report monthly and review it with the Engineering team to find out how efficiently resources are utilized and how we can optimize jobs and clusters to save money on data processing. The team reviews the Databricks cost associated with each job and the AWS resources cost, as AWS costs become even more critical for periodically changing jobs and instance types. The total cost gives you better visibility into the changes deployed by the product engineering team and even the cost of the new features run via Databricks. Here is a masked example of the report generated by the automation.

You can find sample automation here. Feel free to modify it according to
your requirements.
