1
Introduction
Chris Deotte
● Senior Data Scientist at NVIDIA
● 4x Kaggle Grandmaster with 2 years of data science competitions
● PhD. Mathematics and Computational Science from UCSD
● https://www.kaggle.com/cdeotte
Gilberto Titericz
● Senior Data Scientist at NVIDIA
● 3x Kaggle Grandmaster with 11 years of data science competitions
● MSc. Electronics from UTFPR
● https://www.kaggle.com/titericz
Jiwei Liu
● Senior Data Scientist and Software Developer at NVIDIA RAPIDS cuML
● 2x Kaggle Grandmaster
● PhD. Electrical and Computer Engineering from University of Pittsburgh
● https://www.kaggle.com/jiweiliu
Benedikt Schifferer
● Deep Learning Engineer for Recommender Systems at NVIDIA
● Built recommender systems for 2 years at a German e-commerce company
● MSc. Data Science from Columbia University
● https://www.linkedin.com/in/benedikt-schifferer/
2
GOALS
Improve model performance for Tabular datasets
3
Experimentation Pipeline for Tabular Datasets
Input → Feature Engineering → Model → Evaluation
(repeated over many iterations)
4
Accelerating pipelines enables more experiments
NVIDIA @ RecSys Challenge 2020
Illustrative day as a data scientist with and without GPU acceleration
Source: https://medium.com/rapids-ai/rapids-accelerates-data-science-end-to-end-afda1973b65d
Source: https://github.com/rapidsai/deeplearning/blob/main/RecSys2019/RAPIDS%20RecSys%20Challenge%202019.pdf
5
Source: https://github.com/rapidsai/deeplearning/blob/main/RecSys2020/RecSysChallenge2020.pdf
Overview Feature Types (bold techniques in focus)
Numeric
Examples:
● Price
● Delivery time
● Avg. reviews
Techniques:
● Binning
● Normalization
● Gauss Rank
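The three numeric techniques named above can be sketched in a few lines. This is a minimal illustration in plain Python, not the tutorial's GPU code: the function names, the bin edges, and the example prices are all hypothetical, and Gauss Rank is implemented here in one common form (rank the values, map ranks into (0, 1), apply the inverse normal CDF).

```python
from statistics import NormalDist, mean, pstdev

def bin_values(values, edges):
    """Binning: return, for each value, how many edges it reaches or exceeds."""
    return [sum(v >= e for e in edges) for v in values]

def normalize(values):
    """Normalization: shift to zero mean and scale to unit variance."""
    mu, sigma = mean(values), pstdev(values)  # assumes values are not all equal
    return [(v - mu) / sigma for v in values]

def gauss_rank(values):
    """Gauss Rank: replace each value by the normal quantile of its rank."""
    n = len(values)
    # Ties are broken by position here, which is a simplification.
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order):
        q = (rank + 0.5) / n  # strictly inside (0, 1)
        out[i] = NormalDist().inv_cdf(q)
    return out

prices = [5.0, 20.0, 100.0, 1000.0, 12.5]    # hypothetical prices
print(bin_values(prices, edges=[10, 50, 500]))
print(normalize(prices))
print(gauss_rank(prices))
```

Gauss Rank depends only on the ordering of the values, so the outlier price of 1000 ends up no further from the centre than the smallest price does, whereas plain normalization keeps it far out in the tail.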
6
Some Models
Support Vector Machines Tree Based Deep Learning
7
Dataset of the Tutorial
Dataset: Amazon Review Dataset - Category Electronics
URL: https://jmcauley.ucsd.edu/data/amazon/
Events: Ratings
Timeframe: Jun 1999 - July 2014
Goal:
Positive target: Rating>=4
Negative target: Rating<=3
Dataset split:
Training: June 1999 - May 2014 (1.6 million samples)
Validation: June 2014 (43k samples)
Test: July 2014 (33k samples)
Features:
● userID, productID
● price
● timestamp
● category
● brand
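The target definition and the time-based split described above could be expressed roughly as follows. This sketch assumes that timestamp is stored as Unix seconds and uses hypothetical function names; the tutorial notebooks may prepare the data differently.

```python
from datetime import datetime, timezone

def make_target(rating):
    """Positive target (1) for Rating>=4, negative target (0) for Rating<=3."""
    return 1 if rating >= 4 else 0

def assign_split(timestamp):
    """Assign a row to train/valid/test by the month of its timestamp.

    Assumes `timestamp` is Unix seconds (an assumption of this sketch).
    """
    d = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    if (d.year, d.month) < (2014, 6):
        return "train"   # June 1999 - May 2014
    if (d.year, d.month) == (2014, 6):
        return "valid"   # June 2014
    return "test"        # July 2014, the last month in the data

# Hypothetical example row: a 5-star rating from mid-June 2014.
ts = datetime(2014, 6, 15, tzinfo=timezone.utc).timestamp()
print(make_target(5), assign_split(ts))
```

Splitting by time rather than at random keeps the validation and test months strictly later than the training data, so evaluation mimics predicting future ratings.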
8
Performance improvement of 11% with Target Encoding
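For readers new to the technique, Target Encoding replaces a categorical value (for example a brand) with the average target observed for that value in the training data. The sketch below is one common smoothed variant written in plain Python; the smoothing weight, the function names, and the toy data are assumptions made for illustration and are not taken from the slides.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets, smoothing=20.0):
    """Learn a smoothed mean target per category from training rows only."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    # Rare categories are pulled toward the global mean by `smoothing`.
    encoding = {
        c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
        for c in counts
    }
    return encoding, global_mean

def apply_target_encoding(categories, encoding, global_mean):
    """Encode categories; values unseen in training get the global mean."""
    return [encoding.get(c, global_mean) for c in categories]

# Hypothetical toy data: brand per row and the binary target.
brands = ["a", "a", "a", "b"]
targets = [1, 1, 0, 0]
enc, prior = fit_target_encoding(brands, targets, smoothing=1.0)
print(apply_target_encoding(["a", "b", "c"], enc, prior))
```

The encoding is fitted on training rows only and then applied to validation and test rows. In practice the training rows themselves are often encoded out-of-fold to limit target leakage, which this sketch omits.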
9
The RAPIDS suite of open source software libraries gives
you the freedom to execute end-to-end data science and
analytics pipelines entirely on NVIDIA GPUs.
10
The Evolution of Data Processing
Faster Data Access, Less Data Movement
11
Open Source Software Has Democratized Data Science
Highly Accessible, Easy to Use Tools Abstract Complexity
12
Accelerated Data Science with RAPIDS
Powering Popular Data Science Ecosystems with NVIDIA GPUs
13
Minor Code Changes for Major Benefits
Abstracting Accelerated Compute through Familiar Interfaces
14
Lightning-Fast End-to-End Performance
Reducing Data Science Processes from Hours to Seconds
● 16 A100s provide more power than 100 CPU nodes
● 70x faster performance than a similar CPU configuration
● 20x more cost-effective than a similar CPU configuration
*CPU approximated as n1-highmem-8 (8 vCPUs, 52 GB memory) on Google Cloud Platform. TCO calculations are based on cloud instance costs.
15
Improving Performance Over Time
Holistic Development Yields Long-Term Results
● 2.5 years between the initial and latest RAPIDS release
● 3x faster performance than RAPIDS 0.2 on DGX-2
● 7x better TCO with improved compute performance
16
A RAPIDLY GROWING RAPIDS ECOSYSTEM
SUPPORTED & USED BY A WIDE VARIETY OF INNOVATIVE PARTNERS
CONTRIBUTORS
ADOPTERS
OPEN SOURCE
17
Deploy RAPIDS, Everywhere
Easily Accelerate Your Workloads in Your Environment
GET SOFTWARE
CLOUD PROVIDERS
18
HANDS-ON LABS
19
LAB Details
● Use !nvidia-smi to check GPU memory; we are using 1x NVIDIA Tesla T4 with 16 GB of memory
● Shut down notebooks at the end to free GPU memory
● Some notebooks restart the kernel to free GPU memory and reset the dataframe
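As an alternative to reading the !nvidia-smi table by eye, the memory figures can be queried from Python. This is a minimal sketch with hypothetical helper names; it relies on nvidia-smi's --query-gpu and --format options and returns an empty list when the tool is missing or fails.

```python
import shutil
import subprocess

def parse_memory(csv_text):
    """Parse 'used, total' pairs in MiB, one line per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(x) for x in line.split(","))
        rows.append((used, total))
    return rows

def gpu_memory():
    """Return [(used_mib, total_mib), ...] or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return []
    return parse_memory(result.stdout)

for used, total in gpu_memory():
    print(f"GPU memory: {used} MiB used of {total} MiB")
```

Running this before and after shutting down a notebook shows whether its GPU memory was actually released.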
20
Start the Lab!
21
SUMMARY
Check out:
RAPIDS.AI https://github.com/rapidsai/
cuDF https://github.com/rapidsai/cudf
cuML https://github.com/rapidsai/cuml