Get Started With Databricks For Machine Learning

Databricks Academy
2023
©2023 Databricks Inc. — All rights reserved

Learning goals
Upon completion of this content, you should be able to:
1 Explain fundamental concepts about using the Databricks Lakehouse Platform for machine learning.
2 Perform basic notebook tasks using the Databricks Lakehouse Platform.
3 Store and manage data in the Lakehouse for machine learning tasks.
4 Create and use a baseline model using AutoML.
5 Create and use a feature store table for model training.
6 Track, register, and manage the stage of a model with MLflow.

Prerequisites/Technical Considerations
Things to keep in mind before you work through this course
Prerequisites
1 Intermediate-level knowledge of Python
2 Basic knowledge of data science and machine learning topics such as regression/classification models and model evaluation metrics
3 Basic knowledge of a machine learning library such as scikit-learn
Technical Considerations
1 A cluster running on DBR ML 13.3+
2 Unity Catalog enabled workspace
3 Model Serving enabled workspace

Databricks Lakehouse Fundamentals:
Databricks Fundamentals

Learning objectives
Things you’ll be able to do after completing this lesson
• Identify Databricks as the Lakehouse Platform.
• Describe core services of the Databricks Lakehouse Platform for different personas.
• Identify different types of assets in the Databricks Workspace.
• Navigate throughout different sections of the Workspace.
• Perform common actions available in the Workspace.
• Describe Databricks Repos and its features.

Databricks Overview
What is Databricks?
The Lakehouse Company
• Inventor and pioneer of the data lakehouse
• 5000+ global employees
• $1B+ in revenue
• $3B in investment
• Gartner-recognized Leader in Database Management Systems and in Data Science and Machine Learning Platforms
• Creator of (project logos not captured in the text)

Databricks Lakehouse Platform
Platform layers, top to bottom:
• Data Warehousing, Data Engineering, Data Streaming, Data Science and ML
• Unity Catalog: Fine-grained governance for data and AI
• Delta Lake: Data reliability and performance
• Cloud Data Lake: All structured and unstructured data
Simple: Unify your data warehousing and AI use cases on a single platform
Open: Built on open source and open standards
Multicloud: One consistent data platform across clouds

Databricks ecosystem

The lakehouse is for ALL data practitioners
Machine Learning Practitioners, Data Engineers, Data Analysts, Data Governance
Databricks Lakehouse
Data Engineering workloads on Databricks
• Simplifies data engineering with a curated data lake approach through Delta Lake
• Data orchestration through Databricks Workflows
• Delta Live Tables manage your full data pipelines
Data Analysts workloads on Databricks
• Great performance and concurrency for BI and SQL workloads on Delta Lake
• Native SQL interface for analysts
• Support for BI tools to directly query your most recent data in Delta Lake
ML & Data Science workloads on Databricks
Machine Learning
• Model registry, reproducibility, productionization with MLflow
• Leverages Delta Lake for reproducibility
• AutoML for citizen data scientists
Data Science
• Collaborative notebooks and dashboards for interactive analysis
• Native support for Python, Java, R, Scala
• Delta Lake data natively supported

Lakehouse Governance with Unity Catalog
Govern and manage all data assets

• Warehouse, Tables, Columns
• Data Lake, Files
• Machine Learning Models
• Dashboards and Notebooks
Capabilities
• Data lineage
• Attribute-based access control
• Security policies
• Auditing
• Data sharing
Demo: Exploring the Workspace

Demo
High-level steps
Overview of the UI
• Landing page
• Navigation
Workspace
• Creating and managing assets
• Search assets
• Repos
• Clone a repo
• Pull/push changes

Working with Notebooks

Learning objectives
• Describe Databricks Notebooks as the most common interface for data engineers when working with Databricks.
• Recognize common use cases for data engineers when working with Notebooks.
• Describe a Databricks cluster.
• Describe the basic cloud-based compute structure of Databricks.

Compute Resources

Clusters
Overview
Clusters are made up of one or more virtual machine (VM) instances.
A cluster distributes workloads across workers:
• Driver coordinates activities of executors
• Workers run tasks composing a Spark job
Diagram: workloads (Notebook, Job, DBSQL) run on a cluster made up of one Driver VM instance and several Worker VM instances.
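The driver and worker roles described above can be illustrated with a minimal sketch. This is not Spark and not a Databricks API: threads from the Python standard library stand in for worker VM instances, and the names run_task and run_job are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(partition):
    """Worker-side task: sum one partition of the data."""
    return sum(partition)


def run_job(data, num_workers=3):
    """Driver-side logic: split the data, hand partitions to workers, combine results."""
    # One partition per worker; slicing with a stride keeps partitions balanced.
    partitions = [data[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(run_task, partitions))
    return sum(partial_sums)


total = run_job(list(range(1, 101)))
print(total)  # 5050
```

The driver never computes on the data itself; it only partitions the work and merges the partial results, which is the coordination role the slide assigns to it.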

Clusters
Overview
Three main compute types:
• All-purpose clusters for interactive development
• Job clusters for automating workloads
• SQL Warehouses (Serverless): instant compute to run DBSQL queries and dashboards

Cluster Mode
Single node: Low-cost single-instance cluster catering to single-node machine learning workloads and lightweight exploratory analysis
Standard (Multi Node): Default mode for workloads developed in any supported language (requires at least two VM instances)

Databricks Runtimes
DB Runtime: A set of core components that run on Databricks clusters
Standard: Apache Spark and many other components and updates to provide an optimized big data analytics experience
Photon: An optional add-on to optimize Spark queries (e.g. SQL, DataFrame)
Machine Learning: Adds popular machine learning libraries like TensorFlow, Keras, PyTorch, and XGBoost.

ML Runtime
Pre-built machine learning infrastructure
Databricks Machine Learning Runtime

• Optimized and pre-configured ML Frameworks
• Turnkey distributed ML
• Built-in AutoML
• GPU support out of the box
• Built-in ML Frameworks and Model Explainability
• Built-in support for distributed Training
• Built-in support for AutoML and Hyperparameter Tuning
• Built-in support for Hardware Accelerators

Notebooks

Databricks Notebooks
Collaborative, reproducible, and enterprise ready
Multi-Language: Use Python, SQL, Scala, and R, all in one Notebook
Reproducible: Automatically track version history, and use git version control with Repos
Visualizations: Built-in visualizations and support for the most popular visualization libraries (e.g. matplotlib, ggplot)
Collaborative: Real-time co-presence, co-editing, and commenting
Adaptable: Install standard libraries and use local modules
Enterprise Ready: Enterprise-grade access controls, identity management, and auditability

Ideal for exploratory data analysis
Native tools for visualizing and understanding data in ML workflow
• Create interactive charts to visualize data in the Notebook with only two clicks
• Summarize a data set’s essential properties and statistics in a data profile with the push of a button

Right tool for quick development
Multi-language support, use standard libraries and custom modules
• Mix and match languages based on use case and preferred workflow, choosing from Python, SQL, Scala, and R
• Install Python libraries for a notebook without affecting other users with %pip
• Import local modules using arbitrary file support when working in Repos
Demo: Working with Notebooks

Demo
High-level steps
Compute
• Configure and launch a cluster for ML
Notebooks
• UI Walkthrough
• Using multiple languages
• Working with Markdown
• Data visualization
• Table
• Graphs
• Data Profiler

Data Storage and Management

Learning objectives
• Describe that data is stored in cloud object storage locations and accessed via Databricks.
• Explain the benefits of data storage in the data lakehouse architecture
across roles and Databricks services.
• Identify Delta Lake as the optimized storage layer that provides the
foundation for data storage for the data lakehouse.
• Describe Unity Catalog as a centralized governance solution in
Databricks.
• Explain the three-tier namespace and its levels.

[Diagram: Databricks architecture; only the label "Control plane" is recoverable]
Delta Lake
Open-source, default storage format on Databricks
• Delta Lake is an open-source project.
• It is the default format for the tables created in Databricks.
• Delta Lake is the optimized storage layer that provides the foundation for storing data and tables in the Databricks Lakehouse Platform.
• Designed to improve data reliability, quality, and performance in data lakes.

Delta Lake brings ACID to object storage
A C I D: ATOMICITY, CONSISTENCY, ISOLATION, DURABILITY
• Atomicity means all transactions either succeed or fail completely
• Consistency guarantees relate to how a given state of the data is observed by simultaneous operations
• Isolation refers to how simultaneous operations conflict with one another. The isolation guarantees that Delta Lake provides do differ from other systems
• Durability means that committed changes are permanent

Delta Lake features
Key features
• Unified batch and streaming
• Automatic schema validation
• Support for upserts using the merge operation
• Update your table schema without rewriting data
• Track row-level changes with Change Data Feed
• Query previous versions of a table based on version number or timestamp
• Performance optimization with ZORDER and OPTIMIZE
• Supports multiple programming languages like Python, Scala, and SQL
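To make "all-or-nothing commits" and "querying previous versions by version number" concrete, here is a toy in-memory model. It is only an illustration: Delta Lake itself does not store full snapshots like this, and the class name VersionedTable and its methods are hypothetical.

```python
class VersionedTable:
    """Toy append-only table: each commit stores a full snapshot of the rows."""

    def __init__(self):
        self._snapshots = [[]]  # version 0 is the empty table

    def commit(self, new_rows):
        """Append rows as one all-or-nothing commit and return the new version number."""
        # Validation happens before anything is written, so a bad row
        # leaves the table exactly as it was.
        validated = [dict(row) for row in new_rows]
        self._snapshots.append(self._snapshots[-1] + validated)
        return len(self._snapshots) - 1

    def read(self, version=None):
        """Read the latest snapshot, or an earlier one by version number."""
        if version is None:
            version = len(self._snapshots) - 1
        return list(self._snapshots[version])


table = VersionedTable()
v1 = table.commit([{"id": 1, "label": "a"}])
v2 = table.commit([{"id": 2, "label": "b"}])
print(len(table.read()), len(table.read(version=v1)))  # 2 1
```

Reading with version=v1 returns the table as it was after the first commit, even though a later commit added more rows.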

Delta’s rich ecosystem of connectors
Categories shown: Cloud platforms, API languages, SQL engines, ETL and streaming engines
Names recoverable from the slide: Google DataProc, Azure Synapse, AWS Redshift Spectrum, AWS Athena, Power BI, Scala, Ruby, Python, Rust

Data ingestion and transformation for ML
An example data ingestion workflow

Data and AI Governance with Unity Catalog

Today, data and AI governance is complex
Data Consumers (data analyst, data engineer, ML engineer, applications) ask:
• “Where to discover the datasets, models, notebooks, dashboards?”
• “Can I trust the data and ML models?”
The Data Governance Team asks:
• “How to secure these assets?”
• “Who is accessing these assets and how?”
• “Are we meeting the regulatory compliance?”
Each asset type carries its own permissions:
• Data lake: Permissions on files
• Data warehouse: Permissions on tables, rows and columns
• ML Models: Permissions on ML models, features
• BI dashboards: Permissions on reports, dashboards

Unity Catalog (UC)
Unified governance for data and AI
• Unified visibility into data and AI
• Single permission model for data and AI
• AI-powered monitoring and observability
• Open data sharing
Databricks Unity Catalog
Capabilities: Discovery, Access Controls, Lineage, Data Sharing, Monitoring, Auditing
Assets: Tables, Files, Models, Notebooks, Dashboards

Key Capabilities of UC
Governance model:
• Unified governance across clouds
• Centralized metadata and user management
• Centralized access controls
• Grant or revoke permission to data and AI assets using the UI or the API:
  GRANT … ON … TO …
  REVOKE … ON … FROM …
Diagram: one Unity Catalog is shared by multiple Databricks Workspaces and governs Catalogs, Databases (schemas), Tables, Views, Storage credentials, and External locations.

The three-level namespace of UC
How to use UC
Diagram: a Databricks Account holds one or more (Unity) Metastores, and each metastore is assigned to Databricks Workspaces.
Hierarchy inside a metastore:
• Catalog
• Schema (Database)
• Managed Table, External Table, View, Model, …
A table is addressed by all three levels:
SELECT * FROM catalog1.database1.table1;
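As a small illustration of the three levels in a name such as catalog1.database1.table1, the sketch below splits a fully qualified name into catalog, schema, and table. The helper parse_table_name is hypothetical and deliberately simple: it assumes plain identifiers and does not handle quoted names that contain dots.

```python
from typing import NamedTuple


class TableName(NamedTuple):
    catalog: str
    schema: str
    table: str


def parse_table_name(full_name):
    """Split 'catalog.schema.table' into its three levels."""
    parts = full_name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.schema.table, got {full_name!r}")
    return TableName(*parts)


name = parse_table_name("catalog1.database1.table1")
print(name.catalog, name.schema, name.table)  # catalog1 database1 table1
```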

Demo: Data Storage and Management

Demo
High-level steps
Data Storage and Management

• Ingest data and create a Delta table
• View and manage tables
• Performance optimization for Delta
• Manage permissions with Unity Catalog

Databricks for Machine Learning:
Introduction to Databricks for Machine Learning

Learning objectives
• Describe MLflow as an open source platform for managing the end-to-end machine learning lifecycle that’s built into Databricks.
• Describe MLflow Experiments as a tool for tracking model
development runs and comparing the resulting model parameters
and metrics.
• Describe the Model Registry as a centralized model store for
managing models’ full lifecycle, including versioning and annotating.
• [Extra] Describe AutoML and its features.

Databricks supports both coding and low-coding users
Low code ML with AutoML: UI based ML development with a glass box approach
Multi-language Notebooks: Co-edit Notebooks in Python, R, Scala, and SQL

Databricks Machine Learning
A data-native and collaborative solution for the full ML lifecycle
Diagram components:
• Collaborative Multi-Language Notebooks
• AutoML
• Data Prep, Data Versioning, Feature Store
• Model Training, Model Tuning, Runtime and Environments
• Batch Scoring, Online Serving, Monitoring, Jobs and API Automation
• MLOps / Governance powered by (logo not captured)
• Open Multi-Cloud Data Lakehouse Foundation with (logo not captured)

AutoML

AutoML
Rapid, simplified machine learning for everyone
Quick-start ML initiatives: Accelerate your time to production, save weeks on ML projects
Auto-generated notebooks: Ensure best practices. Customize baseline models with your domain expertise
Wide range of problems: Solve classification, regression, and forecasting problems from a variety of ML libraries
Steps shown in the diagram:
• Select and input dataset (UI or API)
• Automated Data Prep
• Automated Feature Engineering
• Automated Training and Model Selection
• Automated Hyperparameter Tuning
• Explore Generated Artifacts and Notebooks
• Deploy
• Monitor

Databricks AutoML
A glass-box solution empowering data teams without taking away control
• UI and API to start AutoML training
• Auto-created MLflow Experiment to track models and metrics
• Easily deploy to Model Registry
• Auto-generated Data Exploration notebook: Understand and debug data quality and preprocessing
• Auto-generated notebooks with source code: Iterate further on models from AutoML, adding your expertise

AutoML solves two key pain points
Quickly Verify the Predictive Power of a Dataset
Diagram: Marketing Team, Dataset, Data Science Team
“Can this dataset be used to predict customer churn?”
Get a Baseline Model to Guide Project Direction
Diagram: Data Science Team, Dataset, Baseline Model
“What direction should I go in for this ML project and what benchmark should I aim to beat?”

MLflow

Core Machine Learning Issues
Modern ML lifecycle comes with many challenges
• Keeping track of experiments or model development

• Reproducing code
• Comparing models
• Standardization of packaging and deploying models
MLflow addresses these issues.

MLflow
What is MLflow?
• Open-source platform for the machine learning lifecycle
• Operationalizing machine learning
• Developed by Databricks
• Pre-installed on the Databricks Runtime for ML

MLflow Components
The four components of MLflow
• Tracking: Record and query experiments: code, data, config, results
• Projects: Packaging format for reproducible runs on any platform
• Models: General model format that supports diverse deployment tools
• Model Registry: Centralized and collaborative model lifecycle management
APIs: CLI, Python, R, Java, REST

Model Tracking and Auto-logging using MLflow
Ensure reproducibility
Track ML development with one line of code: parameters, metrics, data lineage, model, and environment.
mlflow.autolog()
Inspect, Visualize and Compare Metrics
Model, environment, and artifacts
Auto-generated Data Exploration Notebook

MLflow Model Registry
Features and Architecture
• Collaborative, centralized model hub
• Allows versioning of ML artifacts
• Facilitates experimentation, testing, and production
• Integrates with approval and governance workflows
• Audit log of stage transitions and requests, approval workflow for stage transitions
• Helps in automation through CI/CD integration
Diagram: the Tracking Server (Parameters, Metrics, Artifacts, Models) feeds the Model Registry, where Data Scientists and Deployment Engineers manage model versions (v1, v2, v3) across the stages Staging, Production, and Archived.
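The idea of versions moving through stages with an audit log can be sketched with a toy in-memory registry. This is not the MLflow API: the class ToyModelRegistry and its methods are hypothetical, and the initial stage name "None" is an assumption added for the example.

```python
STAGES = ("None", "Staging", "Production", "Archived")


class ToyModelRegistry:
    """In-memory stand-in for a model registry: versions, stages, and an audit log."""

    def __init__(self):
        self.versions = {}   # version number -> current stage
        self.audit_log = []  # (version, from_stage, to_stage) tuples

    def register(self):
        """Register a new model version; versions are numbered 1, 2, 3, ..."""
        version = len(self.versions) + 1
        self.versions[version] = "None"
        return version

    def transition(self, version, to_stage):
        """Move a version to another stage and record the change."""
        if to_stage not in STAGES:
            raise ValueError(f"unknown stage: {to_stage}")
        from_stage = self.versions[version]
        self.versions[version] = to_stage
        self.audit_log.append((version, from_stage, to_stage))


registry = ToyModelRegistry()
v1 = registry.register()
v2 = registry.register()
registry.transition(v1, "Production")
registry.transition(v2, "Staging")
registry.transition(v1, "Archived")    # retire the old version
registry.transition(v2, "Production")  # promote the new one
print(registry.versions)  # {1: 'Archived', 2: 'Production'}
```

Because every transition is appended to audit_log, the full history of which version held which stage can be replayed later, which is the role the slide gives to the audit log of stage transitions.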

Demo: Experimentation with AutoML

Demo
High-level steps
Create an Experiment
• Create and run an AutoML experiment
• View the best model
Model Registry
• Register the best model to Model Registry
• Manage model stages

Databricks for Machine Learning:
End-to-End ML on the Lakehouse

Learning objectives
• Compare and contrast model governance solutions with and without Unity Catalog.
• Describe the Databricks Feature Store as a centralized repository that enables data scientists to find and share features.
• Describe Workflows as a capability to productionize data workflows.
• Describe Jobs as a simple solution to schedule and automate one or more tasks.
• Describe Databricks’ built-in model serving capabilities with real-time inference, streaming, and batch.

Databricks Machine Learning
A data-native and collaborative solution for the full ML lifecycle
Diagram components:
• Collaborative Multi-Language Notebooks
• AutoML
• Data Prep, Data Versioning, Feature Store
• Model Training, Model Tuning, Runtime and Environments
• Batch Scoring, Online Serving, Monitoring, Jobs and API Automation
• MLOps / Governance powered by (logo not captured)
• Open Multi-Cloud Data Lakehouse Foundation with (logo not captured)

MLOps: End-to-End workflow
Stages:
• Data Prep & Featurization: ETL + EDA, Feature Engineering with Koalas
• Build baseline Model with AutoML: MLflow autologging, Hyperopt + Spark
• Promote Best Run to Registry: Annotate model. Request transition to staging
• Setup MLflow Model webhook: Slack notifications, Trigger Testing Jobs
• Automated Model testing: Schema, Demographic accuracy, Docs & artifacts… Approve/reject request
• Schedule Monthly Retrain Job: Databricks Job
• Run inferences: Load model, Batch or realtime
Diagram labels: Feature Store, Tracking, Model Registry, STAGING Request, Webhook triggers test, Approved (Move to STAGING), Rejected, Realtime HTTP inference
Personas: Data Scientist, ML Engineer, Data Engineer

Feature Store

Feature Store
How do feature stores help?
• Feature stores provide a centralized repository for managing and serving machine learning (ML) features.
• Feature stores provide auditing and logging capabilities to track who accessed or modified features.
• Feature stores help handle the scaling requirements of feature storage, retrieval, and serving, ensuring that ML pipelines can operate efficiently.
• Feature stores allow reusing features across projects, reducing duplication.
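The reuse idea can be sketched as a key-based lookup: features are computed once, stored in a table keyed by an entity id, and joined onto label rows when a training set is built. This is a plain-Python illustration, not the Databricks Feature Store API; the function build_training_set and all data values below are hypothetical.

```python
def build_training_set(labels, feature_table, key, feature_names):
    """Join label rows with features looked up by primary key."""
    features_by_key = {row[key]: row for row in feature_table}
    training_rows = []
    for label_row in labels:
        features = features_by_key.get(label_row[key])
        if features is None:
            continue  # no features recorded for this key; skip the row
        row = dict(label_row)
        row.update({name: features[name] for name in feature_names})
        training_rows.append(row)
    return training_rows


# Hypothetical feature table, shared by any project that needs customer features.
customer_features = [
    {"customer_id": 1, "avg_basket": 32.5, "visits_30d": 4},
    {"customer_id": 2, "avg_basket": 12.0, "visits_30d": 1},
]
# Hypothetical labels for one modeling task.
labels = [
    {"customer_id": 1, "churned": 0},
    {"customer_id": 2, "churned": 1},
    {"customer_id": 3, "churned": 0},
]
training = build_training_set(labels, customer_features, "customer_id", ["avg_basket"])
print(training)
```

A second project could call the same function with a different feature_names list and reuse the same table, instead of recomputing its own version of each feature.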

Why would you need a feature store?
Basic Motivations
Discovery
Multiple Data Scientists are trying to solve similar modeling tasks and come up with different definitions
of the same features. How can I find the features?
Lineage
Model governance requires documentation of the features used to train a model, as well as the
upstream lineage of a feature to reliably use it. How is it computed, and who owns it?
Skew
When multiple teams manage feature computation and ML models in production, minor yet significant
skew in upstream data at the input of a feature pipeline can be very hard to detect and fix.
Online Serving
During exploration and model experimentation phases, features are implemented in frameworks that do not scale to production.

Databricks Feature Store
Feature Definitions
● Define reusable, shareable featurization logic
Feature Tables
● Represent features as tables that can be queried from any language
● SQL, ACLs, versions, and performance optimizations
Diagram: Feature 1 and Feature 2 definitions; Customer Features and Item Features tables; arrows labeled save, snapshot, load, and publish; consumers Training Data Set Creation, Batch Scoring, and Online Serving; Databricks Model Serving REST Endpoint.
Model Deployment

Model Serving Modes
Serving models for batch, streaming, real-time, and edge inference
Batch
• High latency
• Leverages databases or object storage
• Fast retrieval of stored predictions
Streaming
• Stream processing
• Moderately fast scoring on new data
Real Time
• Low latency scoring
• High availability
• Usually using REST (containers, K8s)
Embedded (Edge)
• Special case deployments
• Limited connectivity with cloud services
Diagram labels: Model training, Model Registry, Delta Lake / Feature Store

Challenges with building Real-time ML Systems
Most ML models don’t get into production
ML infrastructure is hard: Real-time ML systems require fast and scalable serving infrastructure, which is costly to build and maintain
Deploying real time models needs disparate tools: Data teams use diverse tools to develop models. Customers use separate platforms for data, ML, and Serving, adding complexity and cost
Operating production ML requires expert resources: Steep learning curve of deployment tools. Model deployment is bottlenecked by limited engineering resources, limiting the ability to scale

Databricks Model Serving
World class model scoring and deployment options
• Multiple model scoring and deployment choices
• Leading multi-cloud inference provider giving the customer the choice of what, where, and when they will score their model
• Ultra low latency real-time model serving

Model deployment with Model Serving
Flexible deployment at any scale
Batch scoring
One-click deployment of models
from the Model Registry to scalable
compute clusters for batch scoring
Online scoring
One-click deployment of models to
REST endpoints for auto-scaling low
latency scoring

Core Features of Model Serving
Support real-time production ML workloads
Real Time
● Low overhead latency: <100ms
● Throughput: 3K+ QPS
● Availability: 99.5%
● Scalable: Automatically scales up/down to handle bursty traffic
● Secure: PrivateLink and IP-allowlist
Lakehouse Unified
● Feature Store Integrated: Automated online lookups
● MLflow Integrated: Fast & easy model deployment
● Quality & Diagnostics: Payload logging to Delta
● Unified governance: Manage data & AI with UC
Simplified Deployment
● Simple: Endpoints UI and API for simple deployment
● Flexible: Traffic splitting for staged roll-out and A/B testing
● Manageable: Endpoints observability with built-in metrics and export options
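Traffic splitting for a staged roll-out can be pictured as percentage-based routing between model versions. The sketch below is one possible way to do this with a stable hash; it is not how Databricks Model Serving implements it, and the function route_request, the version names, and the 90/10 split are all hypothetical.

```python
import hashlib


def route_request(request_id, traffic_split):
    """Pick a model version for a request according to percentage weights.

    traffic_split maps version name -> integer percentage; percentages must sum to 100.
    """
    if sum(traffic_split.values()) != 100:
        raise ValueError("traffic percentages must sum to 100")
    # Hash the request id into a stable bucket in [0, 100).
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    upper = 0
    for version, percent in traffic_split.items():
        upper += percent
        if bucket < upper:
            return version
    raise AssertionError("unreachable when percentages sum to 100")


split = {"current": 90, "candidate": 10}
counts = {"current": 0, "candidate": 0}
for i in range(10_000):
    counts[route_request(f"request-{i}", split)] += 1
print(counts)
```

Hashing the request id, rather than drawing a random number, means the same request id is always routed to the same version, which keeps A/B comparisons consistent across retries.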

Orchestration with Workflows

Workflows is a fully-managed cloud-based general-purpose task orchestration service for the entire Lakehouse.
Workflows is a service for data engineers, data scientists, and analysts to build reliable data, analytics and AI workflows on any cloud.

Workflows Features
Orchestrate Anything Anywhere: Run diverse workloads for the full data and AI lifecycle, on any cloud. Orchestrate:
• Notebooks
• Delta Live Tables
• Jobs for SQL
• ML models, and more
Fully Managed: Remove operational overhead with a fully managed orchestration service, enabling you to focus on your workflows, not on managing your infrastructure
Simple Workflow Authoring: An easy point-and-click authoring experience for all your data teams, not just those with specialized skills

Workflow features
Key features
Databricks Workflows offers:
• Monitoring and debugging
  • Repair only failed tasks and sub-tasks
  • Reduces the time and resources required to recover from unsuccessful job runs
• Access Control
  • Manage access across different teams
• Scheduling
  • Run jobs immediately or periodically
• Alerts
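"Repair only failed tasks" can be illustrated with a toy job runner that reuses earlier successes and re-runs only what failed or was skipped. This is a sketch of the idea, not the Databricks Jobs API; run_job, the task names, and the simulated failure are hypothetical.

```python
def run_job(tasks, dependencies, actions, previous_results=None):
    """Run tasks in order, skipping ones that already succeeded in a previous run.

    tasks: task names in a valid execution order.
    dependencies: task name -> list of upstream task names.
    actions: task name -> zero-argument callable; raising means the task failed.
    previous_results: results of an earlier run, used for a repair run.
    """
    results = {}
    for task in tasks:
        if previous_results and previous_results.get(task) == "success":
            results[task] = "success"  # reuse the earlier success, do not re-run
            continue
        if any(results[up] != "success" for up in dependencies.get(task, [])):
            results[task] = "skipped"  # an upstream task failed or was skipped
            continue
        try:
            actions[task]()
            results[task] = "success"
        except Exception:
            results[task] = "failed"
    return results


calls = {"ingest": 0, "featurize": 0, "train": 0}
state = {"featurize_should_fail": True}


def ingest():
    calls["ingest"] += 1


def featurize():
    calls["featurize"] += 1
    if state["featurize_should_fail"]:
        raise RuntimeError("simulated transient failure")


def train():
    calls["train"] += 1


tasks = ["ingest", "featurize", "train"]
dependencies = {"featurize": ["ingest"], "train": ["featurize"]}
actions = {"ingest": ingest, "featurize": featurize, "train": train}

first_run = run_job(tasks, dependencies, actions)
state["featurize_should_fail"] = False  # the underlying problem is fixed
repair_run = run_job(tasks, dependencies, actions, previous_results=first_run)
print(first_run)
print(repair_run)
print(calls)  # ingest ran once; only featurize and train ran in the repair
```

The saving comes from the first task: ingest succeeded in the first run, so the repair run does not pay for it again.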
Example Workflow
• Data ingestion funnel (e.g. Auto Loader, DLT)
• Data filtering, quality assurance, transformation (e.g. DLT, SQL, Python)
• ML feature extraction (e.g. MLflow)
• Persisting features and training prediction model

Demo: End-to-End ML on the Lakehouse

Demo
High-level steps
End-to-end ML
• Create a feature store table
• Train and track a model with MLflow
• Register a model to Model Registry
• Transition model to next stage
• Use model for batch inference
• Automate inference with Workflows

Course Summary and Next Steps

Extra Resources

Feature Store
The first Feature Store codesigned with a Data and MLOps Platform
Diagram: the Feature Store consists of a Feature Registry and a Feature Provider, with Batch (high throughput) and Online (low latency) access.
Feature Registry
● Discoverability and Reusability
● Versioning
● Upstream and downstream Lineage
Co-designed with (logo not captured):
● Open format
● Built-in data versioning and governance
● Native access through PySpark, SQL, etc.
Feature Provider
● Batch and online access to Features
● Feature lookup packaged with Models
● Simplified deployment process
Co-designed with (logo not captured):
● Open model format that supports all ML frameworks
● Feature version and lookup logic hermetically logged with Model
