Get Started With Databricks For Machine Learning
Databricks for
Machine Learning
Databricks Academy
2023
3. Store and manage data in the Lakehouse for machine learning tasks.
Databricks
Fundamentals
$1B+
in revenue
Databricks Lakehouse
• Simple: Unify your data warehousing and AI use cases on a single platform (Data Warehousing, Data Engineering, Data Streaming, Data Science and ML)
• Unity Catalog: Fine-grained governance for data and AI
• Open: Built on open source and open standards
• Delta Lake: Data reliability and performance
• Multicloud: One consistent data platform across clouds
• Cloud Data Lake: All structured and unstructured data
©2023 Databricks Inc. — All rights reserved
Data Engineering workloads on Databricks
Data Science
• Collaborative notebooks and
dashboards for interactive analysis
• Native support for Python, Java, R, Scala
• Delta Lake data natively supported
Capabilities
• Data lineage
• Attribute-based access control
• Security policies
• Auditing
• Data sharing
Demo:
Exploring the
Workspace
Overview of the UI
• Landing page
• Navigation
Workspace
• Creating and managing assets
• Search assets
• Repos
• Clone a repo
• Pull/push changes
Working with
Notebooks
• All-purpose clusters for interactive development
[Diagram: a Notebook attached to a cluster of VM instances, including a Worker]
• Built-in ML Frameworks and Model Explainability
• Built-in support for distributed Training
• Built-in support for AutoML and Hyperparameter Tuning
• Built-in support for Hardware Accelerators
AutoML
• Multi-Language: Use Python, SQL, Scala, and R, all in one Notebook
• Reproducible: Automatically track version history, and use git version control with Repos
• Visualizations: Built-in visualizations and support for the most popular visualization libraries (e.g. matplotlib, ggplot)
• Collaborative: Real-time co-presence, co-editing, and commenting
• Adaptable: Install standard libraries and use local modules
• Enterprise Ready: Enterprise-grade access controls, identity management, and auditability
Working with
Notebooks
Compute
• Configure and launch a cluster for ML
Notebooks
• UI Walkthrough
• Using multiple languages
• Working with Markdown
• Data visualization
• Table
• Graphs
• Data Profiler
[Diagram: governance for data and AI assets]
• Assets and tools shown: Azure Synapse, Rust, AWS Athena, ML Models, Applications, BI dashboards
• ML engineer; Permissions on reports, dashboards
• “Are we meeting the regulatory compliance?”
• Governance functions: Access Controls, Data Sharing, Discovery, Lineage, Monitoring, Auditing
Governance model: Unity Catalog
• Unified governance across clouds
• Centralized metadata and user management
• Centralized access controls
• Grant or revoke permission to data and …
GRANT … ON … TO …
REVOKE … ON … FROM …
[Diagram labels: Databricks Workspace, Databricks Workspace]
[Diagram: Unity Catalog object hierarchy]
• Databricks Account: (Unity) Catalog Metastore, assigned to Databricks Workspaces
• Metastore > Catalog > Schema (Database)
• Schema (Database) contains: Managed Table, External Table, View, Model, …
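The GRANT and REVOKE pattern and the catalog, schema, table hierarchy above can be illustrated with a small sketch that only builds the SQL text. The helper names, the `main.ml.customer_features` table, and the `data-scientists` group are hypothetical and do not come from the slides.

```python
# Minimal sketch: build Unity Catalog style GRANT / REVOKE statements.
# All object and principal names below are illustrative assumptions.

def full_name(catalog: str, schema: str, obj: str) -> str:
    """Three-level namespace: catalog.schema.object, each part backtick-quoted."""
    return ".".join(f"`{part}`" for part in (catalog, schema, obj))


def grant(privilege: str, table: str, principal: str) -> str:
    """Return a GRANT statement for one privilege on one table."""
    return f"GRANT {privilege} ON TABLE {table} TO `{principal}`"


def revoke(privilege: str, table: str, principal: str) -> str:
    """Return the matching REVOKE statement."""
    return f"REVOKE {privilege} ON TABLE {table} FROM `{principal}`"


table = full_name("main", "ml", "customer_features")
print(grant("SELECT", table, "data-scientists"))
print(revoke("SELECT", table, "data-scientists"))
```

In a Databricks notebook, strings like these would typically be executed through the predefined `spark` session, for example with `spark.sql(...)`; that step is left out here so the sketch runs anywhere.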
Introduction to
Databricks for
Machine Learning
AutoML
• Quickly Verify the Predictive Power of a Dataset (Marketing Team, Dataset, Data Science Team)
• Get a Baseline Model to Guide Project Direction (Data Science Team, Dataset, Baseline Model)
Model Registry (v2)
mlflow.autolog()
• Track ML development with one line of code: parameters, metrics, data lineage, model, and environment.
• Model, environment, and artifacts
• Auto-generated Data Exploration Notebook
Experimentation
with AutoML
Create an Experiment
• Create and run an AutoML experiment
• View the best model
Model Registry
• Register the best model to Model Registry
• Manage model stages
End-to-End ML
on the Lakehouse
[Diagram: end-to-end ML pipeline]
• Data Prep & Featurization: ETL + EDA, Feature Engineering with Koalas
• Build baseline Model with AutoML: MLflow autologging, Hyperopt + Spark
• Promote Best Run to Registry: Annotate model. Request transition to staging
• Automated Model testing: Schema, Demographic accuracy, Docs & artifacts… Approve/reject request
• Run inferences: Load model. Batch or realtime
Components shown: Feature Store, Tracking, Model Registry
Staging flow: STAGING Request, Webhook triggers test, then either Approved (Move to STAGING) or Rejected
Realtime HTTP inference
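The request, test, approve or reject flow above can be sketched as a small gate in plain Python. This is not the MLflow or Databricks API: `ModelVersion`, `request_transition`, and the two checks are hypothetical names, and the checks stand in for the schema and documentation tests that a webhook would trigger.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ModelVersion:
    """Hypothetical stand-in for a registered model version."""
    name: str
    version: int
    description: str = ""
    input_schema: Dict[str, str] = field(default_factory=dict)
    stage: str = "None"
    history: List[str] = field(default_factory=list)


def has_schema(mv: ModelVersion) -> bool:
    """Check that an input schema was recorded."""
    return bool(mv.input_schema)


def has_docs(mv: ModelVersion) -> bool:
    """Check that the version carries a description."""
    return bool(mv.description.strip())


def request_transition(
    mv: ModelVersion,
    target_stage: str,
    checks: List[Callable[[ModelVersion], bool]],
) -> bool:
    """Run every check; move to target_stage only if all of them pass."""
    mv.history.append(f"requested {target_stage}")
    failed = [check.__name__ for check in checks if not check(mv)]
    if failed:
        mv.history.append("rejected: " + ", ".join(failed))
        return False
    mv.stage = target_stage
    mv.history.append(f"approved: moved to {target_stage}")
    return True


documented = ModelVersion("churn", 2, "Baseline from AutoML", {"age": "double"})
undocumented = ModelVersion("churn", 3)

approved = request_transition(documented, "Staging", [has_schema, has_docs])
rejected = request_transition(undocumented, "Staging", [has_schema, has_docs])
print(approved, documented.stage)
print(rejected, undocumented.history)
```

Keeping the checks as plain functions makes it easy to add further tests, such as accuracy on demographic slices, without changing the gate itself.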
Discovery
Multiple Data Scientists are trying to solve similar modeling tasks and come up with different definitions
of the same features. How can I find the features?
Lineage
Model governance requires documentation of the features used to train a model, as well as the
upstream lineage of a feature to reliably use it. How is it computed, and who owns it?
Skew
When multiple teams manage feature computation and ML models in production, minor yet significant
skew in upstream data at the input of a feature pipeline can be very hard to detect and fix.
Online Serving
During exploration and model experimentation phases features are implemented in frameworks that do
not scale to production.
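As one illustration of the skew problem described above, a feature's serving-time values can be compared with its training-time values. The sketch below uses a simple mean-shift measure from the standard library; the sample values, the threshold of 3.0, and the function name are assumptions chosen for the example, and production systems usually rely on richer drift statistics.

```python
from statistics import mean, stdev
from typing import Sequence


def mean_shift(train: Sequence[float], serve: Sequence[float]) -> float:
    """Distance between serving and training means, in training standard deviations."""
    return abs(mean(serve) - mean(train)) / stdev(train)


# Illustrative values for one feature (for example, customer age).
train_age = [31, 35, 29, 40, 33, 37, 30, 36]
serve_age = [52, 49, 55, 47, 51, 50, 53, 48]

THRESHOLD = 3.0  # assumed alerting threshold, not a recommended value

shift = mean_shift(train_age, serve_age)
print("skew suspected:", shift > THRESHOLD)
```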
Batch Scoring
[Diagram: Customer Features and Item Features, with save and load steps for batch scoring, and a REST Endpoint]
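A rough sketch of batch scoring against stored features: customer and item features are looked up by key, joined into one row, and passed to a model. The in-memory dictionaries, the feature names, and the rule-based `model` function are stand-ins invented for this example; they are not the Feature Store API or a trained model.

```python
from typing import Dict, List, Tuple

# Stand-ins for feature tables keyed by customer and item id.
customer_features: Dict[str, Dict[str, float]] = {
    "c1": {"avg_basket": 42.0},
    "c2": {"avg_basket": 8.5},
}
item_features: Dict[str, Dict[str, float]] = {
    "i1": {"price": 10.0},
    "i2": {"price": 60.0},
}


def build_row(customer_id: str, item_id: str) -> Dict[str, float]:
    """Join the two feature sets for one (customer, item) pair."""
    return {**customer_features[customer_id], **item_features[item_id]}


def model(row: Dict[str, float]) -> float:
    """Placeholder for a model loaded from a registry: 1.0 if the item looks affordable."""
    return 1.0 if row["price"] <= row["avg_basket"] else 0.0


def score_batch(pairs: List[Tuple[str, str]]) -> List[float]:
    """Score many pairs in one pass, as a batch job would."""
    return [model(build_row(c, i)) for c, i in pairs]


scores = score_batch([("c1", "i1"), ("c1", "i2"), ("c2", "i1")])
print(scores)
```

A REST endpoint would call the same `build_row` and `model` steps for a single pair per request, which is why looking features up by key matters for low-latency serving.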
Model Deployment
Streaming
• Stream processing
• Moderately fast scoring on new data
Real Time
• Low latency scoring
• High availability
• Usually using REST (containers, K8s)
[Diagram labels: Delta Lake / Feature Store, Model training, Model Registry]
• Real-time ML systems require fast and scalable serving infrastructure, which is costly to build and maintain
• Data teams use diverse tools to develop models. Customers use separate platforms for data, ML, and Serving, adding complexity and cost
• Steep learning curve of deployment tools. Model deployment is bottlenecked by limited engineering resources, limiting the ability to scale
Batch scoring
One-click deployment of models
from the Model Registry to scalable
compute clusters for batch scoring
Online scoring
One-click deployment of models to
REST endpoints for auto-scaling low
latency scoring
Workflows is a fully-managed cloud-based general-purpose task orchestration service for the entire Lakehouse.
[Diagram: Lakehouse Platform (Data Warehousing, Data Engineering, Data Streaming, Data Science and ML), with Unity Catalog: Fine-grained governance for data and AI]
Workflows is a service for data …
ML feature extraction
E.g. MLflow
End-to-End ML
on the Lakehouse
End-to-end ML
• Create a feature store table
• Train and track a model with MLflow
• Register a model to Model Registry
• Transition model to next stage
• Use model for batch inference
• Automate inference with Workflows
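The steps above form a dependency chain, which is the kind of structure a task orchestrator resolves. The sketch below orders hypothetical task names with Python's standard `graphlib` module (Python 3.9 or later); it shows only the ordering idea and is not the Databricks Workflows API.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on. Task names are illustrative.
dag = {
    "create_feature_table": set(),
    "train_and_track_model": {"create_feature_table"},
    "register_model": {"train_and_track_model"},
    "transition_stage": {"register_model"},
    "batch_inference": {"transition_stage", "create_feature_table"},
}

# static_order() yields tasks so that every dependency comes first.
order = list(TopologicalSorter(dag).static_order())

for step, task in enumerate(order, start=1):
    print(step, task)
```

In a scheduled job, only the last task would usually rerun on each new batch of data, while the earlier tasks rerun when features or the model need refreshing.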
Feature Store
• Feature Registry
• Feature Provider: Batch (high throughput), Online (low latency)