
Learn How to Create Features from Tabular Data and Accelerate Your Data Science Pipeline
Gilberto Titericz, Jiwei Liu, Chris Deotte, Benedikt Schifferer
Introduction

Chris Deotte
● Senior Data Scientist at NVIDIA
● 4x Kaggle Grandmaster with 2 years of data science competition experience
● PhD in Mathematics and Computational Science from UCSD
● https://www.kaggle.com/cdeotte

Gilberto Titericz
● Senior Data Scientist at NVIDIA
● 3x Kaggle Grandmaster with 11 years of data science competition experience
● MSc in Electronics from UTFPR
● https://www.kaggle.com/titericz

Jiwei Liu
● Senior Data Scientist and Software Developer at NVIDIA RAPIDS cuML
● 2x Kaggle Grandmaster
● PhD in Electrical and Computer Engineering from the University of Pittsburgh
● https://www.kaggle.com/jiweiliu

Benedikt Schifferer
● Deep Learning Engineer for Recommender Systems at NVIDIA
● Built recommender systems for 2 years at a German e-commerce company
● MSc in Data Science from Columbia University
● https://www.linkedin.com/in/benedikt-schifferer/
GOALS
● Improve model performance for tabular datasets
● Learn and apply best practices for preprocessing and feature engineering
● Accelerate training pipelines
● Leverage GPU acceleration for quick experimentation
Experimentation Pipeline for Tabular Datasets

Input → Feature Engineering → Model → Evaluation (iterations feed back into feature engineering)

Developing a high-accuracy model requires a cycle of experiments that iterate on feature engineering, model types, and architectures.
Accelerating pipelines enables more experiments

● RecSys 2020: 28x faster than optimized CPU code
● RecSys 2019: 15x faster than initial CPU code
● Less computation time enables more experiments

[Figures: NVIDIA @ RecSys Challenge 2019 and 2020; an illustrative day as a data scientist with vs. without GPU acceleration]

Sources:
https://medium.com/rapids-ai/rapids-accelerates-data-science-end-to-end-afda1973b65d
https://github.com/rapidsai/deeplearning/blob/main/RecSys2019/RAPIDS%20RecSys%20Challenge%202019.pdf
https://github.com/rapidsai/deeplearning/blob/main/RecSys2020/RecSysChallenge2020.pdf
Overview of Feature Types (bold techniques are the focus of this tutorial)

Categorical
● Examples: User ID / Item ID, Brand, Main Category
● Feature engineering: Target Encoding, Count Encoding, Categorify + combining categories

Unstructured list
● Examples: Keywords, Subcategories, Colors
● Feature engineering: Target Encoding, Count Encoding, Categorify

Numeric
● Examples: Price, Delivery time, Avg. reviews
● Feature engineering: Binning, Normalization, Gauss Rank

Timestamp
● Examples: Timestamp
● Feature engineering: Extract month, weekday, weekend, hour

Timeseries
● Examples: Events in order, Time since last event
● Feature engineering: # of events in past X, difference in time (lag)

Image
● Examples: Product image
● Feature engineering: Extract latent representation with deep learning

Text
● Examples: Description
● Feature engineering: Extract latent representation with deep learning, term frequency-inverse document frequency (TF-IDF)

Social graph
● Examples: Follower/following graph
● Feature engineering: Link analysis

Geo location
● Examples: Addresses
● Feature engineering: Distances to points of interest
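A few of the techniques in the table can be sketched in a handful of pandas lines (toy data and hypothetical column names; with RAPIDS, cuDF exposes essentially the same API on the GPU):

```python
import pandas as pd

# Toy interaction log (hypothetical column names for illustration)
df = pd.DataFrame({
    "brand": ["sony", "acme", "sony", "sony", "bose"],
    "ts": pd.to_datetime(
        ["2014-06-02 09:30", "2014-06-07 18:00", "2014-06-10 12:15",
         "2014-06-14 23:45", "2014-06-15 07:00"]),
})

# Count Encoding: replace a category by its frequency in the data
df["brand_count"] = df.groupby("brand")["brand"].transform("count")

# Categorify: map each category string to a contiguous integer id
df["brand_id"] = df["brand"].astype("category").cat.codes

# Timestamp features: month, weekday, weekend flag, hour
df["month"] = df["ts"].dt.month
df["weekday"] = df["ts"].dt.weekday
df["is_weekend"] = (df["ts"].dt.weekday >= 5).astype("int8")
df["hour"] = df["ts"].dt.hour
```

Each derived column can be fed directly to a tree-based model; the same pattern scales to the full dataset.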
Some Models

Support Vector Machines
● Types: SVM
● Components: maximizes the distance between the decision boundary and the data
● May require normalization
● Good for high-dimensional data

Tree Based
● Types: CatBoost, XGBoost, LightGBM
● Components: defines splits by information gain per feature
● Does not require normalization of input features
● Note: XGBoost cannot handle raw categorical features

Deep Learning
● Types: Wide And Deep, DeepFM, DLRM
● Components: embedding layers, feed-forward layers
● Requires normalization of input features
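The normalization and categorical notes above can be illustrated with a short numpy sketch (all data here is synthetic): z-score numeric inputs for SVMs and neural networks, and integer-encode raw category strings before feeding them to XGBoost:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic numeric feature on a large scale (e.g. price in cents)
price = rng.normal(loc=5000.0, scale=1200.0, size=1000)

# Z-score normalization for SVM / deep learning inputs;
# tree-based models (XGBoost, LightGBM, CatBoost) split on thresholds
# and can consume `price` unscaled.
price_norm = (price - price.mean()) / price.std()

# XGBoost cannot handle raw categorical strings:
# map each unique category to an integer code first
brands = np.array(["sony", "acme", "sony", "bose"])
uniques, brand_codes = np.unique(brands, return_inverse=True)
# brand_codes is now a numeric column usable by any of the models above
```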
Dataset of the Tutorial

Dataset: Amazon Review Dataset - Category Electronics
URL: https://jmcauley.ucsd.edu/data/amazon/

Events: Ratings
Timeframe: June 1999 - July 2014

Goal:
Positive target: rating >= 4
Negative target: rating <= 3

Dataset split:
Training: June 1999 - May 2014 (1.6M samples)
Validation: June 2014 (43k samples)
Test: July 2014 (33k samples)

Baseline: ~80% of events are high ratings

Features:
● userID, productID
● price
● timestamp
● category
● brand
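The labeling and time-based split above can be sketched in pandas on a toy stand-in for the ratings table (column names here are assumptions, not the dataset's exact schema):

```python
import pandas as pd

# Toy stand-in for the Amazon Electronics ratings table
df = pd.DataFrame({
    "rating": [5, 3, 4, 1, 5],
    "ts": pd.to_datetime(
        ["2013-11-02", "2014-05-20", "2014-06-03", "2014-06-28", "2014-07-10"]),
})

# Binary target: ratings of 4 or 5 are positive, 3 or below negative
df["target"] = (df["rating"] >= 4).astype("int8")

# Time-based split mirroring the slide: train through May 2014,
# validate on June 2014, test on July 2014
train = df[df["ts"] < "2014-06-01"]
valid = df[(df["ts"] >= "2014-06-01") & (df["ts"] < "2014-07-01")]
test = df[df["ts"] >= "2014-07-01"]
```

Splitting by time rather than at random mirrors how the model will be used: predicting future events from past ones.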
Performance improvement of 11% with Target Encoding

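As a sketch of how such a target-encoded feature is built: replace each category by the smoothed mean of the target over that category, computed on training data only to avoid leakage (column names and the smoothing constant are illustrative):

```python
import pandas as pd

train = pd.DataFrame({
    "brand": ["sony", "sony", "acme", "acme", "acme", "bose"],
    "target": [1, 1, 0, 1, 0, 1],
})
valid = pd.DataFrame({"brand": ["sony", "acme", "unseen"]})

smooth = 2.0                      # strength of the prior (a tunable choice)
prior = train["target"].mean()    # global positive rate

stats = train.groupby("brand")["target"].agg(["mean", "count"])
# Shrink per-category means toward the global prior for rare categories
te = (stats["mean"] * stats["count"] + prior * smooth) / (stats["count"] + smooth)

train["brand_te"] = train["brand"].map(te)
valid["brand_te"] = valid["brand"].map(te).fillna(prior)  # unseen -> prior
```

Smoothing keeps rare categories from getting extreme encodings; in practice the training-set encoding is often computed out-of-fold as a further guard against leakage.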
The RAPIDS suite of open source software libraries gives
you the freedom to execute end-to-end data science and
analytics pipelines entirely on NVIDIA GPUs.

The Evolution of Data Processing
Faster Data Access, Less Data Movement

Hadoop Processing, Reading from Disk:
HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train

CPU-Based Spark In-Memory Processing (25-100x improvement):
HDFS Read → Query → ETL → ML Train
Less code, language flexible, primarily in-memory

Traditional GPU Processing (5-10x improvement):
HDFS Read → GPU Read → Query → CPU Write → GPU Read → ETL → CPU Write → GPU Read → ML Train
More code, language rigid, ML substantially on GPU

RAPIDS (50-100x improvement):
Arrow Read → Query → ETL → ML Train
Same code, language flexible, primarily on GPU

Why use GPUs? GPUs are built for intensive parallel processing. As datasets continue to grow, data scientists are limited by the sequential nature of CPU compute. GPUs provide the power and parallelism necessary for today's data science.
Open Source Software Has Democratized Data Science
Highly Accessible, Easy-to-Use Tools Abstract Complexity

With the rise of data science, businesses, researchers, and individuals are using data to solve a wide variety of challenges across many problem domains.

Tools like pandas, Apache Spark, and scikit-learn have made applying data science easier, facilitating further data-driven evolution.

While these tools have fostered accessibility in data science, their reliance on traditional CPU processing creates bottlenecks that increase cycle time, operational cost, and overhead.

Typical CPU-based stack (backed by CPU memory, scaled out with Apache Spark / Dask):
● Pre-processing: pandas
● Machine learning: scikit-learn
● Graph analytics: NetworkX
● Deep learning: TensorFlow, PyTorch, MxNet
● Visualization: matplotlib
Accelerated Data Science with RAPIDS
Powering Popular Data Science Ecosystems with NVIDIA GPUs

RAPIDS abstracts the complexities of accelerated data science by building upon popular Python and Java libraries, enabling users to see benefits immediately.

RAPIDS utilizes NVIDIA CUDA primitives for low-level compute optimization and exposes GPU parallelism and high-bandwidth memory speed through user-friendly interfaces like pandas, scikit-learn, Apache Spark, or Dask.

With Spark or Dask, RAPIDS can scale out to multi-node, multi-GPU clusters to power through big data processes.

GPU-accelerated stack (backed by GPU memory, scaled out with Spark / Dask):
● Pre-processing: cuIO & cuDF
● Machine learning: cuML
● Graph analytics: cuGraph
● Deep learning: TensorFlow, PyTorch, MxNet
● Visualization: cuXfilter, pyViz, Plotly
Minor Code Changes for Major Benefits
Abstracting Accelerated Compute through Familiar Interfaces

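As an illustration of the point above, the same dataframe code runs under pandas on CPU or cuDF on GPU; typically only the import changes. The sketch below uses pandas so it runs without a GPU, with the cuDF variant shown as a comment:

```python
import pandas as pd
# With RAPIDS, the only change needed is the import:
# import cudf as pd

df = pd.DataFrame({
    "user": ["a", "b", "a", "c", "b", "a"],
    "spend": [10.0, 5.0, 7.0, 3.0, 8.0, 2.0],
})

# Identical groupby/aggregation code on CPU (pandas) or GPU (cuDF);
# sort_index makes the result order deterministic in both libraries
per_user = df.groupby("user")["spend"].sum().sort_index()
```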
Lightning-Fast End-to-End Performance
Reducing Data Science Processes from Hours to Seconds

RAPIDS delivers massive speed-ups across the end-to-end data science lifecycle. Conducting benchmarks in a commercial cloud environment, we're able to get incredible performance running a common ML model training pipeline.

Between loading and cleansing data, engineering features, and training a classifier using a 200GB CSV dataset, a RAPIDS-based pipeline completed these operations in just over two minutes. The same process takes two and a half hours on a similar CPU configuration.

● 16 A100s provide more power than 100 CPU nodes
● 70x faster performance than a similar CPU configuration
● 20x more cost-effective than a similar CPU configuration

*CPU approximate to n1-highmem-8 (8 vCPUs, 52GB memory) on Google Cloud Platform. TCO calculations based on Cloud instance costs.
Improving Performance Over Time
Holistic Development Yields Long-Term Results

As RAPIDS has matured, performance has continued to improve, drastically reducing cycle time and costs for data science operations. Since its initial release in fall 2018, RAPIDS has improved performance nearly 3x by thinking about software and hardware holistically.

Through whole-stack innovation, RAPIDS continues to improve as NVIDIA hardware and software advance. As new, more powerful GPU architectures arrive, RAPIDS makes it simple for data scientists to realize the benefits of innovative hardware.

● 2.5 years between the initial and latest RAPIDS release
● 3x faster performance than RAPIDS 0.2 on DGX-2
● 7x better TCO with improved compute performance
A RAPIDLY GROWING RAPIDS ECOSYSTEM
Supported and used by a wide variety of innovative partners: contributors, adopters, and open source projects.
Deploy RAPIDS, Everywhere
Easily Accelerate Your Workloads in Your Environment

GET SOFTWARE: GitHub, Anaconda, NGC, Docker

CLOUD PROVIDERS

CLOUD SERVICES: Azure Machine Learning, Cloud Dataproc
HANDS-ON LABS

LAB Details
● Use !nvidia-smi to check GPU memory - we are using 1x NVIDIA Tesla T4 with 16 GB of memory.
● Shut down notebooks at the end to free GPU memory.
● Some notebooks restart the kernel to free GPU memory and reset the dataframe.
Start the Lab!

SUMMARY

Check out:
RAPIDS.AI https://github.com/rapidsai/
cuDF https://github.com/rapidsai/cudf
cuML https://github.com/rapidsai/cuml