Databricks usage and cost analysis | by George Kozlov | Medium
https://medium.com/@progeorgek/databricks-usage-and-cost-analysis-e974e380916a
03/11/2023 13:54
Problem
Historically, Databricks focused on Spark and its runtime, paying little
attention to easily digestible metrics and usage stats, which leads to
complications and confusion for fast-growing companies. Understanding
job costs and overall spending with Databricks becomes vital with the
growing demand from product and engineering teams.
Databricks offers flexible instruments to spin up clusters and run all
kinds of data processing jobs in the cloud; however, it can get out of
control quickly. In other words, there is no mechanism in place to
calculate the efficiency of a created cluster, as there are many
dependencies on the actual user code deployed to the cluster, the amount
and type of data, and so on. As a result, users can overestimate the
required resources.
Solution
Databricks offers detailed billable usage reports that you can download
and analyze using the “root” (admin) account. The reports are available
as CSV files with details of each job run and the AWS instance type
requested by the user. Databricks also supports exporting the reports to
S3 storage and provides an API so that you can automate report
ingestion; that’s what we’re going to do.
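A minimal sketch of that ingestion step is shown below. The download endpoint, host, and parameter names here are assumptions based on the account-level usage download API and should be checked against your own account; the CSV column names in the sample are also illustrative.

```python
import csv
import io
import urllib.request


def parse_usage_rows(csv_text):
    """Parse a Databricks billable-usage CSV export into a list of dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def download_usage(account_id, token, start_month, end_month):
    """Fetch the billable-usage CSV from the Databricks account API.

    NOTE: the URL and query parameters are an assumption; verify them
    against the account API documentation for your deployment.
    """
    url = (
        "https://accounts.cloud.databricks.com/api/2.0/accounts/"
        f"{account_id}/usage/download"
        f"?start_month={start_month}&end_month={end_month}"
    )
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")


# Illustrative sample of what a parsed report row might look like:
sample = (
    "workspaceId,timestamp,sku,dbus,clusterNodeType\n"
    "123,2023-01-01,JOBS_COMPUTE,1.5,m5.xlarge\n"
)
rows = parse_usage_rows(sample)
```

Parsing into dicts keyed by the CSV header keeps the rest of the pipeline independent of column order.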
The precise price of a DBU for all-purpose, compute, and light jobs;
Note: Unfortunately, it’s not possible to determine the exact price of an AWS
instance, as it depends on your plan, a possible Enterprise Discount Program
(EDP), reserved instances, and even spot instances if requested. Databricks
doesn’t provide details on the allocated instance beyond its type, so in
our approximation we rely on on-demand prices and apply an EDP discount.
That approach doesn’t give you the exact cost of a job; however, it provides
visibility into the approximate cost so that the engineering team can extrapolate
it and adjust the clusters if needed.
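The approximation described above can be sketched as a small helper. This is not the author’s exact formula; the function name, parameters, and the choice to apply the EDP discount only to the AWS portion are assumptions you should adjust to your agreement.

```python
def estimate_job_cost(dbus, dbu_price, hours, node_count,
                      on_demand_price, edp_discount=0.0):
    """Approximate job cost: DBU charge plus discounted on-demand EC2 charge.

    ASSUMPTION: the EDP discount is applied only to the AWS portion here;
    your agreement may discount DBUs as well.
    """
    dbu_cost = dbus * dbu_price
    aws_cost = hours * node_count * on_demand_price * (1 - edp_discount)
    return dbu_cost + aws_cost


# Illustrative numbers: 10 DBUs at $0.15, 4 nodes for 2 hours at the
# m5.xlarge on-demand rate, with a 10% EDP discount.
cost = estimate_job_cost(dbus=10, dbu_price=0.15, hours=2,
                         node_count=4, on_demand_price=0.192,
                         edp_discount=0.10)
```

Because both inputs are approximations, the result is best treated as a relative signal for comparing clusters rather than an invoice-grade figure.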
(Diagram: schemas of the usage and jobs collections. Recoverable fields:
clusterNodeType varchar, clusterOwnerUserId varchar, clusterCustomTags varchar,
sku varchar, dbus varchar, price decimal, tags varchar.)
Because DBU prices are static and depend on your agreement with
Databricks, we load them into collections as static JSON objects. The same
applies to AWS prices, as we operate in a single region for our production
environment. I considered requesting AWS instance prices via an API;
however, I simplified the automation since prices are primarily static
and the list is easy to maintain manually. That might be a good
improvement, though.
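Loading the static price lists might look like the following sketch. The table names, columns, and all price values are illustrative; real values come from your Databricks agreement and the AWS on-demand price sheet for your region.

```python
import sqlite3

# Hypothetical static price lists kept as JSON-style objects in the repo.
DBU_PRICES = [
    {"sku": "JOBS_COMPUTE", "price": 0.15},
    {"sku": "ALL_PURPOSE_COMPUTE", "price": 0.55},
]
AWS_PRICES = [
    {"instance_type": "m5.xlarge", "price_per_hour": 0.192},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dbu_prices (sku TEXT PRIMARY KEY, price REAL)")
conn.execute(
    "CREATE TABLE aws_prices (instance_type TEXT PRIMARY KEY, price_per_hour REAL)"
)
# Named-parameter inserts map each JSON object onto its row directly.
conn.executemany("INSERT INTO dbu_prices VALUES (:sku, :price)", DBU_PRICES)
conn.executemany(
    "INSERT INTO aws_prices VALUES (:instance_type, :price_per_hour)", AWS_PRICES
)
```

Keeping the lists as plain objects in source control makes the occasional manual price update a one-line diff.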
The detailed usage report doesn’t contain job details, and it lists the tasks
executed by each job whenever it is executed and the id of the job
associated with it. To amplify the result report with the job-level details,
we retrieve all jobs via Jobs API from Databricks.
Note: That step is essential if you are interested in actual job names as they are
not available in the report; however, it can be skipped.
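A sketch of that retrieval, assuming token authentication against the Jobs API 2.1 `jobs/list` endpoint with offset pagination (newer API versions also offer token-based pagination):

```python
import json
import urllib.request


def fetch_jobs(host, token, page_size=25):
    """Page through the Databricks Jobs API (2.1) and return all jobs."""
    jobs, offset = [], 0
    while True:
        url = f"{host}/api/2.1/jobs/list?limit={page_size}&offset={offset}"
        req = urllib.request.Request(
            url, headers={"Authorization": f"Bearer {token}"}
        )
        with urllib.request.urlopen(req) as resp:
            page = json.load(resp)
        jobs.extend(page.get("jobs", []))
        if not page.get("has_more"):
            return jobs
        offset += page_size


def job_names_by_id(jobs):
    """Map job_id -> job name so usage rows can be labeled."""
    return {j["job_id"]: j["settings"]["name"] for j in jobs}


# Illustrative payload shaped like a jobs/list response item:
sample_jobs = [{"job_id": 42, "settings": {"name": "nightly-etl"}}]
names = job_names_by_id(sample_jobs)
```

The id-to-name map is the only piece the report enrichment actually needs, so it is worth separating from the network call.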
As the final step, we query the data via SQL to aggregate it and generate
a CSV report.
Note: Databricks returns tags as a JSON object, and you can extract any JSON
component using the json_extract function supported by SQLite.
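An aggregation in that spirit might look like the sketch below. It is not the original post’s query; the table layout, tag key, and grouping are assumptions, but it shows `json_extract` pulling a component out of the tags column, as the note describes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE usage (jobId TEXT, sku TEXT, dbus REAL, price REAL, tags TEXT)")
conn.execute("CREATE TABLE jobs (jobId TEXT, name TEXT)")
conn.execute(
    "INSERT INTO usage VALUES "
    "('1', 'JOBS_COMPUTE', 4.0, 0.60, '{\"team\": \"data\"}')"
)
conn.execute("INSERT INTO jobs VALUES ('1', 'nightly-etl')")

# Sum DBUs and cost per job, pulling a hypothetical 'team' tag out of the
# JSON tags column via SQLite's json_extract (requires the JSON1 extension).
query = """
SELECT j.name,
       json_extract(u.tags, '$.team') AS team,
       SUM(u.dbus)  AS total_dbus,
       SUM(u.price) AS total_cost
FROM usage u
JOIN jobs j ON j.jobId = u.jobId
GROUP BY j.name, team
ORDER BY total_cost DESC
"""
rows = conn.execute(query).fetchall()
```

Writing `rows` out with the `csv` module then yields the final report.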
What next?
There is no limit to perfection, right? I considered many possible
extensions for this automation; here is a list of possible improvements:
Retrieve AWS instance prices via API;
Snowflake;
You can find sample automation here. Feel free to modify it according to
your requirements.