
Get Started with

Databricks for
Machine Learning

Databricks Academy
2023

©2023 Databricks Inc. — All rights reserved


Learning goals
Upon completion of this content, you should be able to:

1 Explain fundamental concepts about using the Databricks Lakehouse Platform for machine learning.

2 Perform basic notebook tasks using the Databricks Lakehouse Platform.

3 Store and manage data in the Lakehouse for machine learning tasks.

4 Create and use a baseline model using AutoML.

5 Create and use a feature store table for model training.

6 Track, register, and manage the stage of a model with MLflow.

©2023 Databricks Inc. — All rights reserved


Prerequisites/Technical Considerations
Things to keep in mind before you work through this course

Prerequisites

1 Intermediate-level knowledge of Python

2 Basic knowledge of data science and machine learning topics such as regression/classification models and model evaluation metrics

3 Basic knowledge of a machine learning library such as scikit-learn

Technical Considerations

1 A cluster running DBR ML 13.3+

2 Unity Catalog-enabled workspace

3 Model Serving-enabled workspace

©2023 Databricks Inc. — All rights reserved


Databricks Lakehouse Fundamentals:

Databricks
Fundamentals

Databricks Academy
2023

©2023 Databricks Inc. — All rights reserved


Learning objectives
Things you’ll be able to do after completing this lesson

• Identify Databricks as the Lakehouse Platform


• Describe core services of the Databricks Lakehouse Platform for
different personas.
• Identify different types of assets in the Databricks Workspace.
• Navigate throughout different sections of the Workspace.
• Perform common actions available in the Workspace.
• Describe Databricks Repos and its features.

©2023 Databricks Inc. — All rights reserved


Databricks Overview
What is Databricks?

• 5,000+ global employees
• $1B+ in revenue
• $3B in investment
• Inventor and pioneer of the data lakehouse
• Creator of Apache Spark, Delta Lake, and MLflow
• Gartner-recognized Leader in Database Management Systems and in Data Science and Machine Learning Platforms
• "The Lakehouse Company"
©2023 Databricks Inc. — All rights reserved


Databricks
Lakehouse Platform
The platform stack:
• Workloads: Data Warehousing, Data Engineering, Data Streaming, Data Science and ML
• Unity Catalog: fine-grained governance for data and AI
• Delta Lake: data reliability and performance
• Cloud Data Lake: all structured and unstructured data

Why it matters:
• Simple: unify your data warehousing and AI use cases on a single platform
• Open: built on open source and open standards
• Multicloud: one consistent data platform across clouds

©2023 Databricks Inc. — All rights reserved


Databricks ecosystem

©2023 Databricks Inc. — All rights reserved


The lakehouse is for ALL data practitioners

Machine Learning Practitioners • Data Engineers • Data Analysts • Data Governance

Databricks Lakehouse
©2023 Databricks Inc. — All rights reserved
Data Engineering workloads on Databricks

• Simplifies data engineering with a curated data lake approach through Delta Lake

• Data orchestration through Databricks Workflows

• Delta Live Tables manage your full data pipelines

©2023 Databricks Inc. — All rights reserved
Data Analysts workloads on Databricks

• Great performance and concurrency for BI and SQL workloads on Delta Lake

• Native SQL interface for analysts

• Support for BI tools to directly query your most recent data in Delta Lake
©2023 Databricks Inc. — All rights reserved
ML & Data Science workloads on
Databricks
Machine Learning
• Model registry, reproducibility,
productionization with MLflow
• Leverages Delta Lake for reproducibility
• AutoML for citizen data scientists

Data Science
• Collaborative notebooks and
dashboards for interactive analysis
• Native support for Python, Java, R, Scala
• Delta Lake data natively supported

©2023 Databricks Inc. — All rights reserved


Lakehouse Governance with Unity Catalog

Govern and manage all data assets


• Warehouse, Tables, Columns
• Data Lake, Files
• Machine Learning Models
• Dashboards and Notebooks

Capabilities
• Data lineage
• Attribute-based access control
• Security policies
• Auditing
• Data sharing
©2023 Databricks Inc. — All rights reserved
Demo:

Exploring the
Workspace

Databricks Academy
2023

©2023 Databricks Inc. — All rights reserved


Demo
High-level steps

Overview of the UI
• Landing page
• Navigation
Workspace
• Creating and managing assets
• Search assets
• Repos
• Clone a repo
• Pull/push changes

©2023 Databricks Inc. — All rights reserved


Databricks Lakehouse Fundamentals:

Working with
Notebooks

Databricks Academy
2023

©2023 Databricks Inc. — All rights reserved


Learning objectives
Things you’ll be able to do after completing this lesson

• Describe Databricks Notebooks as the most common interface for data engineers when working with Databricks.
• Recognize common use cases for data engineers when working with Notebooks.
• Describe Databricks clusters.
• Describe the basic cloud-based compute structure of Databricks.

©2023 Databricks Inc. — All rights reserved


Compute Resources

©2023 Databricks Inc. — All rights reserved


Clusters
Overview

Clusters are made up of one or more virtual machine (VM) instances and distribute workloads across workers:

• The driver coordinates the activities of the executors
• Workers run the tasks composing a Spark job

Workloads such as notebooks, jobs, and DBSQL queries run on the cluster (a driver VM instance plus worker VM instances).

©2023 Databricks Inc. — All rights reserved


Clusters
Overview

Three main compute types:

• All-purpose clusters for interactive development

• Job clusters for automating workloads

• SQL Warehouses (Serverless): instant compute to run DBSQL queries and dashboards

©2023 Databricks Inc. — All rights reserved


Cluster Mode

Single node: a low-cost single-instance cluster catering to single-node machine learning workloads and lightweight exploratory analysis.

Standard (multi-node): the default mode for workloads developed in any supported language (requires at least two VM instances).

©2023 Databricks Inc. — All rights reserved


Databricks Runtimes
Databricks Runtime: a set of core components that run on Databricks clusters

• Standard: Apache Spark and many other components and updates that provide an optimized big data analytics experience

• Photon: an optional add-on that optimizes Spark queries (e.g. SQL, DataFrame)

• Machine Learning: adds popular machine learning libraries like TensorFlow, Keras, PyTorch, and XGBoost

©2023 Databricks Inc. — All rights reserved


ML Runtime
Pre-built machine learning infrastructure

Databricks Machine Learning Runtime


• Optimized and pre-configured ML Frameworks
• Turnkey distributed ML
• Built-in AutoML
• GPU support out of the box

Includes built-in ML frameworks and model explainability, built-in support for distributed training, built-in AutoML and hyperparameter tuning, and built-in support for hardware accelerators.

©2023 Databricks Inc. — All rights reserved


Notebooks

©2023 Databricks Inc. — All rights reserved


Databricks Notebooks
Collaborative, reproducible, and enterprise ready

• Multi-Language: use Python, SQL, Scala, and R, all in one Notebook

• Reproducible: automatically track version history, and use git version control with Repos

• Visualizations: built-in visualizations and support for the most popular visualization libraries (e.g. matplotlib, ggplot)

• Collaborative: real-time co-presence, co-editing, and commenting

• Adaptable: install standard libraries and use local modules

• Enterprise Ready: enterprise-grade access controls, identity management, and auditability

©2023 Databricks Inc. — All rights reserved


Ideal for exploratory data analysis
Native tools for visualizing and understanding data in ML workflow

• Create interactive charts to visualize data in the Notebook with only two clicks

• Summarize a data set's essential properties and statistics in a data profile with the push of a button

©2023 Databricks Inc. — All rights reserved


Right tool for quick development
Multi-language support, use standard libraries and custom modules

• Mix and match languages based on use case and preferred workflow, choosing from Python, SQL, Scala, and R

• Install Python libraries for a notebook without affecting other users with %pip

• Import local modules using arbitrary file support when working in Repos (see the sketch below)
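A minimal notebook-cell sketch of both points, assuming a hypothetical library version and a hypothetical local module path inside the repo:

%pip install scikit-learn==1.3.0   # notebook-scoped: affects only this notebook

# In a later cell, with arbitrary-file support in Repos, a module checked into
# the repo (e.g. ./utils/preprocessing.py — hypothetical) can be imported directly:
from utils.preprocessing import clean_features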
©2023 Databricks Inc. — All rights reserved
Demo:

Working with
Notebooks

Databricks Academy
2023

©2023 Databricks Inc. — All rights reserved


Demo
High-level steps

Compute
• Configure and launch a cluster for ML
Notebooks
• UI Walkthrough
• Using multiple languages
• Working with Markdown
• Data visualization
• Table
• Graphs
• Data Profiler

©2023 Databricks Inc. — All rights reserved


Databricks Lakehouse Fundamentals:

Data Storage and


Management

Databricks Academy
2023

©2023 Databricks Inc. — All rights reserved


Learning objectives
Things you’ll be able to do after completing this lesson

• Describe how data is stored in cloud object storage locations and accessed via Databricks.
• Explain the benefits of data storage in the data lakehouse architecture
across roles and Databricks services.
• Identify Delta Lake as the optimized storage layer that provides the
foundation for data storage for the data lakehouse.
• Describe Unity Catalog as a centralized governance solution in
Databricks.
• Explain the three-tier namespace and its levels.

©2023 Databricks Inc. — All rights reserved


[Architecture diagram: Databricks control plane]

©2023 Databricks Inc. — All rights reserved


Delta Lake
Open-source, default storage format on Databricks

• Delta Lake is an open-source project.


• It is the default format for the tables created
in Databricks.
• Delta Lake is the optimized storage layer that
provides the foundation for storing data and
tables in the Databricks Lakehouse Platform.
• Designed to improve data reliability, quality, and
performance in data lakes.

©2023 Databricks Inc. — All rights reserved


Delta Lake brings ACID to object storage

ACID: Atomicity, Consistency, Isolation, Durability

• Atomicity means all transactions either succeed or fail completely

• Consistency guarantees relate to how a given state of the data is observed by simultaneous operations

• Isolation refers to how simultaneous operations conflict with one another; the isolation guarantees that Delta Lake provides do differ from other systems

• Durability means that committed changes are permanent

©2023 Databricks Inc. — All rights reserved


Delta Lake features
Key features

• Unified batch and streaming
• Automatic schema validation
• Supports upserts using the MERGE operation
• Update your table schema without rewriting data
• Track row-level changes with Change Data Feed
• Query previous versions of a table based on version number or timestamp
• Performance optimization with ZORDER and OPTIMIZE
• Supports multiple programming languages like Python, Scala, and SQL (see the sketch below)
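A minimal PySpark sketch of a few of these features, assuming a Databricks notebook where spark is predefined; the table and column names (events, event_date) are hypothetical:

from delta.tables import DeltaTable

# Hypothetical updates to upsert into an existing Delta table named "events"
updates_df = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "event_type"])

(DeltaTable.forName(spark, "events").alias("t")
    .merge(updates_df.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Time travel: read an earlier version of the table
previous = spark.read.option("versionAsOf", 3).table("events")

# Compact files and co-locate frequently filtered columns
spark.sql("OPTIMIZE events ZORDER BY (event_date)")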

©2023 Databricks Inc. — All rights reserved


Delta’s rich ecosystem of connectors

• Cloud platforms: Google DataProc, Azure Synapse
• API languages: Scala, Ruby, Python, Rust
• SQL engines: AWS Redshift Spectrum, AWS Athena
• ETL, streaming engines, and BI tools: e.g. Power BI

©2023 Databricks Inc. — All rights reserved


Data ingestion and transformation for ML
An example data ingestion workflow
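For illustration, a hedged Auto Loader sketch of one possible ingestion step; the storage paths and target table name are hypothetical:

raw_stream = (spark.readStream
    .format("cloudFiles")                                  # Auto Loader source
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/tmp/schemas/raw_events")
    .load("/mnt/raw/events"))

(raw_stream.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/raw_events")
    .trigger(availableNow=True)                            # process available files, then stop
    .toTable("main.default.raw_events"))                   # written as a Delta table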

©2023 Databricks Inc. — All rights reserved


Data and AI
Governance with
Unity Catalog

©2023 Databricks Inc. — All rights reserved


Today, data and AI governance is complex
Data consumers (data analysts, data engineers, ML engineers, applications) ask:
• "Where do I discover the datasets, models, notebooks, and dashboards?"
• "Can I trust the data and ML models?"

The data governance team manages permissions across many asset types:
• Permissions on files in the data lake
• Permissions on tables, rows, and columns in the data warehouse
• Permissions on ML models and features
• Permissions on reports, dashboards, and BI applications

and asks:
• "How do we secure these assets?"
• "Who is accessing these assets and how?"
• "Are we meeting regulatory compliance?"
©2023 Databricks Inc. — All rights reserved


Unity Catalog (UC)
Unified governance for data and AI

• Unified visibility into data and AI
• Single permission model for data and AI
• AI-powered monitoring and observability
• Open data sharing

Databricks Unity Catalog capabilities: Discovery, Access Controls, Lineage, Monitoring, Auditing, Data Sharing

Governed assets: Tables, Files, Models, Notebooks, Dashboards

©2023 Databricks Inc. — All rights reserved


Key Capabilities of UC

Governance model:
• Unified governance across clouds
• Centralized metadata and user management: one Unity Catalog metastore can serve multiple Databricks workspaces
• Centralized access controls
• Grant or revoke permission to data and AI assets using the UI or the API (GRANT … ON … TO …, REVOKE … ON … FROM …); see the sketch below

Governed securables include catalogs, databases (schemas), tables, views, storage credentials, and external locations.
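A minimal sketch of granting access from a notebook; the catalog, schema, table, and group names are hypothetical:

spark.sql("GRANT USE CATALOG ON CATALOG ml_catalog TO `data-scientists`")
spark.sql("GRANT USE SCHEMA ON SCHEMA ml_catalog.features TO `data-scientists`")
spark.sql("GRANT SELECT ON TABLE ml_catalog.features.customer_features TO `data-scientists`")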

©2023 Databricks Inc. — All rights reserved


The three-level namespace of UC
How to use UC

A Unity Catalog metastore is assigned to a Databricks account and attached to one or more Databricks workspaces. Within the metastore, assets are organized in three levels:

• Catalog
• Schema (database)
• Tables (managed or external), views, and models

Referencing a table therefore uses all three parts:

SELECT * FROM catalog1.database1.table1;
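A short sketch of how the three-level namespace is used in practice; the names are hypothetical:

# Fully qualified: catalog.schema.table
df = spark.table("ml_catalog.churn.customers")

# Or set defaults and use shorter names
spark.sql("USE CATALOG ml_catalog")
spark.sql("USE SCHEMA churn")
df = spark.table("customers")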

©2023 Databricks Inc. — All rights reserved


Demo:

Data Storage and


Management

Databricks Academy
2023

©2023 Databricks Inc. — All rights reserved


Demo
High-level steps

Data Storage and Management


• Ingest data and create a Delta table
• View and manage tables
• Performance optimization for Delta
• Manage permissions with Unity Catalog

©2023 Databricks Inc. — All rights reserved


Databricks for Machine Learning:

Introduction to
Databricks for
Machine Learning
Databricks Academy
2023

©2023 Databricks Inc. — All rights reserved


Learning objectives
Things you’ll be able to do after completing this lesson

• Describe MLflow as an open source platform for managing the


end-to-end machine learning lifecycle that’s built into Databricks.
• Describe MLflow Experiments as a tool for tracking model
development runs and comparing the resulting model parameters
and metrics.
• Describe the Model Registry as a centralized model store for
managing models’ full lifecycle, including versioning and annotating.
• [Extra] Describe AutoML and its features.

©2023 Databricks Inc. — All rights reserved


Databricks supports both coding and low-coding users

• Low-code ML with AutoML: UI-based ML development with a glass-box approach

• Multi-language Notebooks: co-edit Notebooks in Python, R, Scala, and SQL

©2023 Databricks Inc. — All rights reserved


Databricks Machine Learning
A data-native and collaborative solution for the full ML lifecycle

• Collaborative multi-language Notebooks
• AutoML
• Full lifecycle: data prep, model training, model tuning, runtimes and environments, batch scoring, online serving
• Data versioning, Feature Store, jobs and API automation, monitoring
• MLOps / governance powered by MLflow
• Open, multi-cloud data lakehouse foundation with Delta Lake

©2023 Databricks Inc. — All rights reserved


AutoML
Rapid, simplified machine learning for everyone

Why AutoML:
• Quick-start ML initiatives: accelerate your time to production and save weeks on ML projects
• Auto-generated notebooks: ensure best practices and customize baseline models with your domain expertise
• Wide range of problems: solve classification, regression, and forecasting problems from a variety of ML libraries

Typical flow: select and input a dataset (UI or API) → automated data prep → automated feature engineering → automated training and model selection → automated hyperparameter tuning → explore generated artifacts and notebooks → deploy → monitor
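A hedged sketch of starting an AutoML classification run from the Python API; the training table, target column, and time limit are hypothetical:

from databricks import automl

summary = automl.classify(
    dataset=spark.table("ml_catalog.churn.training_data"),
    target_col="churned",
    timeout_minutes=30,
)

# Inspect the best trial; its generated notebook and model can be explored or registered
print(summary.best_trial.model_path)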

©2023 Databricks Inc. — All rights reserved


Databricks AutoML
A glass-box solution empowering data teams without taking away control

• UI and API to start AutoML training

• Auto-created MLflow Experiment to track models and metrics

• Auto-generated notebooks with model source code, so you can iterate further on AutoML's models and add your expertise

• Auto-generated data exploration notebook to understand and debug data quality and preprocessing

• Easily deploy to Model Registry
©2023 Databricks Inc. — All rights reserved


AutoML solves two key pain points

1. Quickly verify the predictive power of a dataset. A marketing team hands a dataset to the data science team: "Can this dataset be used to predict customer churn?"

2. Get a baseline model to guide project direction. The data science team builds a baseline model from the dataset: "What direction should I go in for this ML project, and what benchmark should I aim to beat?"
©2023 Databricks Inc. — All rights reserved


MLflow

©2023 Databricks Inc. — All rights reserved


Core Machine Learning Issues
Modern ML lifecycle comes with many challenges

• Keeping track of experiments or model development


• Reproducing code
• Comparing models
• Standardization of packaging and deploying models

MLflow addresses these issues.

©2023 Databricks Inc. — All rights reserved


MLflow
What is MLflow?

• An open-source platform for managing the machine learning lifecycle
• Operationalizes machine learning
• Developed by Databricks
• Pre-installed on the Databricks Runtime for ML
• Includes Tracking, Projects, Models, and the Model Registry

©2023 Databricks Inc. — All rights reserved


MLflow Components
The four components of MLflow

• Tracking: record and query experiments — code, data, config, results

• Projects: packaging format for reproducible runs on any platform

• Models: general model format that supports diverse deployment tools

• Model Registry: centralized and collaborative model lifecycle management

APIs: CLI, Python, R, Java, REST


©2023 Databricks Inc. — All rights reserved
Model Tracking and Autologging using MLflow
Ensure reproducibility

Track ML development with one line of code — mlflow.autolog() — which logs the model, environment, parameters, metrics, artifacts, and data lineage.

Inspect, visualize, and compare metrics across runs, and review the auto-generated data exploration notebook.
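A minimal sketch of autologging around a scikit-learn model, assuming training/test splits (X_train, y_train, X_test, y_test) already exist:

import mlflow
from sklearn.ensemble import RandomForestClassifier

mlflow.autolog()  # one line: parameters, metrics, artifacts, and the model are logged

with mlflow.start_run(run_name="baseline-rf"):
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))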

©2023 Databricks Inc. — All rights reserved


MLflow Model Registry
Features and Architecture
The tracking server records parameters, metrics, artifacts, and models; the Model Registry builds on it:

• Collaborative, centralized model hub

• Allows versioning of ML artifacts (v1, v2, v3, …)

• Facilitates experimentation, testing, and production hand-off between data scientists and deployment engineers

• Manages stages such as Staging, Production, and Archived

• Integrates with approval and governance workflows

• Audit log of stage transitions and requests; approval workflow for stage transitions

• Helps in automation through CI/CD integration
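A hedged sketch of registering a tracked model and transitioning its stage; the run ID placeholder and model name are hypothetical:

import mlflow
from mlflow.tracking import MlflowClient

# Register the model logged under a tracked run
model_version = mlflow.register_model("runs:/<run_id>/model", "churn_classifier")

# Transition the new version to Staging
client = MlflowClient()
client.transition_model_version_stage(
    name="churn_classifier",
    version=model_version.version,
    stage="Staging",
)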

©2023 Databricks Inc. — All rights reserved


Demo:

Experimentation
with AutoML

Databricks Academy
2023

©2023 Databricks Inc. — All rights reserved


Demo
High-level steps

Create an Experiment
• Create and run an AutoML experiment
• View the best model
Model Registry
• Register the best model to Model Registry
• Manage model stages

©2023 Databricks Inc. — All rights reserved


Databricks for Machine Learning:

End-to-End ML
on the Lakehouse

Databricks Academy
2023

©2023 Databricks Inc. — All rights reserved


Learning objectives
Things you’ll be able to do after completing this lesson

• Compare and contrast model governance solutions with and without


Unity Catalog.
• Describe the Databricks Feature Store as a centralized repository that
enables data scientists to find and share features.
• Describe Workflows as a capability to productionize data workflows.
• Describe Jobs as a simple solution to schedule and automate one or
more tasks.
• Describe Databricks' built-in model serving capabilities with real-time inference, streaming, and batch.

©2023 Databricks Inc. — All rights reserved


Databricks Machine Learning
A data-native and collaborative solution for the full ML lifecycle

• Collaborative multi-language Notebooks
• AutoML
• Full lifecycle: data prep, model training, model tuning, runtimes and environments, batch scoring, online serving
• Data versioning, Feature Store, jobs and API automation, monitoring
• MLOps / governance powered by MLflow
• Open, multi-cloud data lakehouse foundation with Delta Lake

MLOps — an end-to-end workflow
MLOps - End 2 End workflow
Setup MLFlow Model
webhook Schedule Monthly
Slack notifications, Trigger Retrain Job
Testing Jobs Databricks Job

Data Prep & Build baseline Model Promote Best Run to Automated Model testing Run inferences
Featurization with AutoML Registry Schema, Demographic Load model
ETL + EDA, Feature MLflow autologging, Annotate model. Request accuracy, Docs & artifacts… Batch or
Engineering with Koalas Hyperopt +Spark transition to staging Approve/reject request realtime

Approved.
Move to
STAGING STAGING
Tracking
Feature Store STAGING Request
Model Registry Rejected
Webhook triggers test

...
Realtime
HTTP inference

Data Scientist ML Engineer Data Engineer

©2023 Databricks Inc. — All rights reserved


Feature Store

©2023 Databricks Inc. — All rights reserved


Feature Store
How do feature stores help?

• A feature store provides a centralized repository for managing and serving machine learning (ML) features.

• Feature stores provide auditing and logging capabilities to track who accessed or modified features.

• A feature store helps handle the scaling requirements of feature storage, retrieval, and serving, ensuring that ML pipelines can operate efficiently.

• A feature store allows reusing features across projects, reducing duplication. (A hedged client sketch follows below.)
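A hedged sketch of creating a feature table with the Feature Store Python client; the table name, keys, and upstream DataFrame are hypothetical:

from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

fs.create_table(
    name="ml.customer_features",              # hypothetical database.table name
    primary_keys=["customer_id"],
    df=customer_features_df,                  # a Spark DataFrame computed upstream
    description="Aggregated customer behavior features",
)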

©2023 Databricks Inc. — All rights reserved


Why would you need a feature store?
Basic Motivations

Discovery

Multiple Data Scientists are trying to solve similar modeling tasks and come up with different definitions
of the same features. How can I find the features?

Lineage

Model governance requires documentation of the features used to train a model, as well as the
upstream lineage of a feature to reliably use it. How is it computed, and who owns it?

Skew

When multiple teams manage feature computation and ML models in production, minor yet significant
skew in upstream data at the input of a feature pipeline can be very hard to detect and fix.

Online Serving

During the exploration and model experimentation phases, features are implemented in frameworks that do not scale to production.

©2023 Databricks Inc. — All rights reserved


Databricks Feature Store
• Feature definitions: define reusable, shareable featurization logic

• Feature tables: represent features as tables (e.g. customer features, item features) that can be queried from any language, with SQL access, ACLs, versions, and performance optimizations

• Training data set creation: snapshot feature tables to build training sets

• Batch scoring: load feature tables for high-throughput scoring

• Online serving: publish features to an online store for low-latency lookups behind Databricks Model Serving REST endpoints

©2023 Databricks Inc. — All rights reserved
Model Deployment

©2023 Databricks Inc. — All rights reserved


Model Serving Modes
Serving models for batch, streaming, real-time and, edge inference

• Batch: high latency; leverages databases or object storage; fast retrieval of stored predictions

• Streaming: stream processing; moderately fast scoring on new data

• Real time: low-latency scoring; high availability; usually served over REST (containers, Kubernetes)

• Embedded (edge): special-case deployments with limited connectivity to cloud services

Models are trained on Delta Lake / Feature Store data and deployed from the Model Registry.

©2023 Databricks Inc. — All rights reserved


Challenges with building Real-time ML Systems
Most ML models don’t get into production

• ML infrastructure is hard: real-time ML systems require fast and scalable serving infrastructure, which is costly to build and maintain

• Deploying real-time models needs disparate tools: data teams use diverse tools to develop models, and customers use separate platforms for data, ML, and serving, adding complexity and cost

• Operating production ML requires expert resources: deployment tools have a steep learning curve, and model deployment is bottlenecked by limited engineering resources, limiting the ability to scale
©2023 Databricks Inc. — All rights reserved


Databricks Model Serving

World-class model scoring and deployment options:

• Multiple model scoring and deployment choices

• Leading multi-cloud inference provider, giving the customer the choice of what, where, and when they will score their model

• Ultra-low-latency real-time model serving
©2023 Databricks Inc. — All rights reserved


Model deployment with Model Serving
Flexible deployment at any scale

• Batch scoring: one-click deployment of models from the Model Registry to scalable compute clusters for batch scoring

• Online scoring: one-click deployment of models to REST endpoints for auto-scaling, low-latency scoring (see the sketch below)
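A hedged sketch of batch scoring a registered model as a Spark UDF; the model name, stage, input table, and feature columns are hypothetical:

import mlflow

predict = mlflow.pyfunc.spark_udf(spark, model_uri="models:/churn_classifier/Staging")

feature_columns = ["age", "tenure", "plan"]   # hypothetical feature columns
scored = (spark.table("ml_catalog.churn.customers")
          .withColumn("churn_prediction", predict(*feature_columns)))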

©2023 Databricks Inc. — All rights reserved


Core Features of Model Serving
Support real-time production ML workloads

Real Time
• Low overhead latency: <100 ms
• Throughput: 3K+ QPS
• Availability: 99.5%
• Scalable: automatically scales up/down to handle bursty traffic
• Secure: PrivateLink and IP allowlist

Lakehouse Unified
• Feature Store integrated: automated online lookups
• MLflow integrated: fast and easy model deployment
• Quality and diagnostics: payload logging to Delta
• Unified governance: manage data and AI with Unity Catalog

Simplified Deployment
• Simple: Endpoints UI and API for simple deployment
• Flexible: traffic splitting for staged roll-out and A/B testing
• Manageable: endpoint observability with built-in metrics and export options
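A hedged sketch of querying a Model Serving endpoint over REST; the workspace URL, endpoint name, token, and record fields are placeholders:

import requests

response = requests.post(
    "https://<workspace-url>/serving-endpoints/churn-classifier/invocations",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={"dataframe_records": [{"age": 42, "tenure": 12, "plan": "basic"}]},
)
print(response.json())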

©2023 Databricks Inc. — All rights reserved


Orchestration with
Workflows

©2023 Databricks Inc. — All rights reserved


Databricks Workflows
Databricks Workflows

Workflows is a fully managed, cloud-based, general-purpose task orchestration service for the entire Lakehouse. It is a service for data engineers, data scientists, and analysts to build reliable data, analytics, and AI workflows on any cloud, sitting alongside the rest of the Lakehouse Platform (Unity Catalog for fine-grained governance, Delta Lake for data reliability and performance, and the cloud data lake for all structured and unstructured data).

©2023 Databricks Inc. — All rights reserved


Workflows Features

• Orchestrate anything anywhere: run diverse workloads for the full data and AI lifecycle, on any cloud — notebooks, Delta Live Tables, jobs for SQL, ML models, and more

• Fully managed: remove operational overhead with a fully managed orchestration service, enabling you to focus on your workflows rather than on managing your infrastructure

• Simple workflow authoring: an easy point-and-click authoring experience for all your data teams, not just those with specialized skills

©2023 Databricks Inc. — All rights reserved


Workflow features
Key features

Databricks Workflows offers:

• Monitoring and debugging
• Repair of only the failed tasks and sub-tasks within a job run, which reduces the time and resources required to recover from unsuccessful job runs
• Access control: manage access across different teams
• Scheduling: run jobs immediately or periodically
• Alerts

(A hedged sketch of defining a job's tasks via the API follows below.)
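A hedged sketch of defining a two-task job with the Databricks SDK for Python; the job name, notebook paths, cluster ID, and exact field names may differ by SDK version:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

job = w.jobs.create(
    name="nightly-churn-scoring",
    tasks=[
        jobs.Task(
            task_key="featurize",
            notebook_task=jobs.NotebookTask(notebook_path="/Repos/ml/featurize"),
            existing_cluster_id="<cluster-id>",
        ),
        jobs.Task(
            task_key="score",
            depends_on=[jobs.TaskDependency(task_key="featurize")],
            notebook_task=jobs.NotebookTask(notebook_path="/Repos/ml/score"),
            existing_cluster_id="<cluster-id>",
        ),
    ],
)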
©2023 Databricks Inc. — All rights reserved
Example Workflow

1. Data ingestion funnel — e.g. Auto Loader, DLT

2. Data filtering, quality assurance, transformation — e.g. DLT, SQL, Python

3. ML feature extraction — e.g. MLflow

4. Persisting features and training the prediction model

©2023 Databricks Inc. — All rights reserved


Demo:

End-to-End ML
on the Lakehouse

Databricks Academy
2023

©2023 Databricks Inc. — All rights reserved


Demo
High-level steps

End-to-end ML
• Create a feature store table
• Train and track a model with MLflow
• Register a model to Model Registry
• Transition model to next stage
• Use model for batch inference
• Automate inference with Workflows

©2023 Databricks Inc. — All rights reserved


Course Summary
and Next Steps

Databricks Academy
2023

©2023 Databricks Inc. — All rights reserved


Extra Resources

©2023 Databricks Inc. — All rights reserved


Feature Store
The first feature store co-designed with a data and MLOps platform

The Feature Store combines a Feature Registry with a Feature Provider that serves features for both batch (high-throughput) and online (low-latency) access.

Feature Registry
• Discoverability and reusability
• Versioning
• Upstream and downstream lineage

Feature Provider
• Batch and online access to features
• Feature lookups packaged with models
• Simplified deployment process

Co-designed with Delta Lake
• Open format
• Built-in data versioning and governance
• Native access through PySpark, SQL, etc.

Co-designed with MLflow
• Open model format that supports all ML frameworks
• Feature version and lookup logic hermetically logged with the model
©2023 Databricks Inc. — All rights reserved
