
Learn How to Create Features from Tabular Data and Accelerate Your Data Science Pipeline
Gilberto Titericz, Jiwei Liu, Chris Deotte, Benedikt Schifferer
Introduction

Chris Deotte
● Senior Data Scientist at NVIDIA
● 4x Kaggle Grandmaster with 2 years of data science competition experience
● PhD in Mathematics and Computational Science from UCSD
● https://www.kaggle.com/cdeotte

Gilberto Titericz
● Senior Data Scientist at NVIDIA
● 3x Kaggle Grandmaster with 11 years of data science competition experience
● MSc in Electronics from UTFPR
● https://www.kaggle.com/titericz

Jiwei Liu
● Senior Data Scientist and Software Developer at NVIDIA RAPIDS cuML
● 2x Kaggle Grandmaster
● PhD in Electrical and Computer Engineering from the University of Pittsburgh
● https://www.kaggle.com/jiweiliu

Benedikt Schifferer
● Deep Learning Engineer for Recommender Systems at NVIDIA
● Built recommender systems for 2 years at a German e-commerce company
● MSc in Data Science from Columbia University
● https://www.linkedin.com/in/benedikt-schifferer/
GOALS
● Improve model performance for tabular datasets
● Learn and apply best practices for preprocessing and feature engineering
● Accelerate training pipelines
● Leverage GPU acceleration for quick experimentation
Experimentation Pipeline for Tabular Datasets

Input → Feature Engineering → Model → Evaluation (iterations feed back into feature engineering)

Developing a high-accuracy model requires a cycle of experiments that iterate on feature engineering, model types, and architectures.
Accelerating pipelines enables more experiments

● RecSys 2020: 28x faster than optimized CPU code
● RecSys 2019: 15x faster than initial CPU code
● Less computation time enables more experiments

[Figures: NVIDIA @ RecSys Challenge 2019 and 2020; an illustrative day as a data scientist with vs. without GPU acceleration]

Sources:
https://medium.com/rapids-ai/rapids-accelerates-data-science-end-to-end-afda1973b65d
https://github.com/rapidsai/deeplearning/blob/main/RecSys2019/RAPIDS%20RecSys%20Challenge%202019.pdf
https://github.com/rapidsai/deeplearning/blob/main/RecSys2020/RecSysChallenge2020.pdf
Overview of Feature Types (bold techniques are the focus of this tutorial)

Categorical
● Examples: User ID / Item ID, Brand, Main Category
● Feature engineering: Target Encoding, Count Encoding, Categorify + combining categories

Unstructured list
● Examples: Keywords, Subcategories, Colors
● Feature engineering: Target Encoding, Count Encoding, Categorify

Numeric
● Examples: Price, Delivery time, Avg. reviews
● Feature engineering: Binning, Normalization, Gauss Rank

Timestamp
● Examples: Timestamp
● Feature engineering: Extract month, weekday, weekend, hour

Timeseries
● Examples: Events in order, Time since last event
● Feature engineering: # of events in past X, difference in time (lag)

Image
● Examples: Product image
● Feature engineering: Extract latent representation with deep learning

Text
● Examples: Description
● Feature engineering: Extract latent representation with deep learning, term frequency-inverse document frequency (TF-IDF)

Social graph
● Examples: Follower/following graph
● Feature engineering: Link analysis

Geo location
● Examples: Addresses
● Feature engineering: Distances to points of interest
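A few of the techniques in the table can be sketched in a handful of pandas lines (toy data and hypothetical column names; with RAPIDS, cuDF exposes essentially the same API on the GPU):

```python
import pandas as pd

# Toy interaction log (hypothetical column names for illustration)
df = pd.DataFrame({
    "brand": ["sony", "acme", "sony", "sony", "bose"],
    "ts": pd.to_datetime(
        ["2014-06-02 09:30", "2014-06-07 18:00", "2014-06-10 12:15",
         "2014-06-14 23:45", "2014-06-15 07:00"]),
})

# Count Encoding: replace a category by its frequency in the data
df["brand_count"] = df.groupby("brand")["brand"].transform("count")

# Categorify: map each category string to a contiguous integer id
df["brand_id"] = df["brand"].astype("category").cat.codes

# Timestamp features: month, weekday, weekend flag, hour
df["month"] = df["ts"].dt.month
df["weekday"] = df["ts"].dt.weekday
df["is_weekend"] = (df["ts"].dt.weekday >= 5).astype("int8")
df["hour"] = df["ts"].dt.hour
```

Each derived column can be fed directly to a tree-based model; the same pattern scales to the full dataset.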
Some Models

Support Vector Machines
● Types: SVM
● Components: maximizes the distance between the decision boundary and the data
● May require normalization
● Good for high-dimensional data

Tree Based
● Types: CatBoost, XGBoost, LightGBM
● Components: defines splits by information gain per feature
● Does not require normalization of input features
● Note: XGBoost cannot handle raw categorical features

Deep Learning
● Types: Wide And Deep, DeepFM, DLRM
● Components: embedding layers, feed-forward layers
● Requires normalization of input features
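The normalization and categorical notes above can be illustrated with a short numpy sketch (all data here is synthetic): z-score numeric inputs for SVMs and neural networks, and integer-encode raw category strings before feeding them to XGBoost:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic numeric feature on a large scale (e.g. price in cents)
price = rng.normal(loc=5000.0, scale=1200.0, size=1000)

# Z-score normalization for SVM / deep learning inputs;
# tree-based models (XGBoost, LightGBM, CatBoost) split on thresholds
# and can consume `price` unscaled.
price_norm = (price - price.mean()) / price.std()

# XGBoost cannot handle raw categorical strings:
# map each unique category to an integer code first
brands = np.array(["sony", "acme", "sony", "bose"])
uniques, brand_codes = np.unique(brands, return_inverse=True)
# brand_codes is now a numeric column usable by any of the models above
```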
Dataset of the Tutorial

Dataset: Amazon Review Dataset - Category Electronics
URL: https://jmcauley.ucsd.edu/data/amazon/

Events: Ratings
Timeframe: June 1999 - July 2014

Goal:
Positive target: rating >= 4
Negative target: rating <= 3

Dataset split:
Training: June 1999 - May 2014 (1.6M samples)
Validation: June 2014 (43k samples)
Test: July 2014 (33k samples)

Baseline: ~80% of events are high ratings

Features:
● userID, productID
● price
● timestamp
● category
● brand
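The labeling and time-based split above can be sketched in pandas on a toy stand-in for the ratings table (column names here are assumptions, not the dataset's exact schema):

```python
import pandas as pd

# Toy stand-in for the Amazon Electronics ratings table
df = pd.DataFrame({
    "rating": [5, 3, 4, 1, 5],
    "ts": pd.to_datetime(
        ["2013-11-02", "2014-05-20", "2014-06-03", "2014-06-28", "2014-07-10"]),
})

# Binary target: ratings of 4 or 5 are positive, 3 or below negative
df["target"] = (df["rating"] >= 4).astype("int8")

# Time-based split mirroring the slide: train through May 2014,
# validate on June 2014, test on July 2014
train = df[df["ts"] < "2014-06-01"]
valid = df[(df["ts"] >= "2014-06-01") & (df["ts"] < "2014-07-01")]
test = df[df["ts"] >= "2014-07-01"]
```

Splitting by time rather than at random mirrors how the model will be used: predicting future events from past ones.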
Performance improvement of 11% with Target Encoding

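As a sketch of how such a target-encoded feature is built: replace each category by the smoothed mean of the target over that category, computed on training data only to avoid leakage (column names and the smoothing constant are illustrative):

```python
import pandas as pd

train = pd.DataFrame({
    "brand": ["sony", "sony", "acme", "acme", "acme", "bose"],
    "target": [1, 1, 0, 1, 0, 1],
})
valid = pd.DataFrame({"brand": ["sony", "acme", "unseen"]})

smooth = 2.0                      # strength of the prior (a tunable choice)
prior = train["target"].mean()    # global positive rate

stats = train.groupby("brand")["target"].agg(["mean", "count"])
# Shrink per-category means toward the global prior for rare categories
te = (stats["mean"] * stats["count"] + prior * smooth) / (stats["count"] + smooth)

train["brand_te"] = train["brand"].map(te)
valid["brand_te"] = valid["brand"].map(te).fillna(prior)  # unseen -> prior
```

Smoothing keeps rare categories from getting extreme encodings; in practice the training-set encoding is often computed out-of-fold as a further guard against leakage.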
The RAPIDS suite of open source software libraries gives
you the freedom to execute end-to-end data science and
analytics pipelines entirely on NVIDIA GPUs.

The Evolution of Data Processing
Faster Data Access, Less Data Movement

Hadoop Processing, Reading from Disk:
HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train

CPU-Based Spark In-Memory Processing (25-100x improvement):
HDFS Read → Query → ETL → ML Train
Less code, language flexible, primarily in-memory

Traditional GPU Processing (5-10x improvement):
HDFS Read → GPU Read → Query → CPU Write → GPU Read → ETL → CPU Write → GPU Read → ML Train
More code, language rigid, ML substantially on GPU

RAPIDS (50-100x improvement):
Arrow Read → Query → ETL → ML Train
Same code, language flexible, primarily on GPU

Why use GPUs? GPUs are built for intensive parallel processing. As datasets continue to grow, data scientists are limited by the sequential nature of CPU compute. GPUs provide the power and parallelism necessary for today's data science.
Open Source Software Has Democratized Data Science
Highly Accessible, Easy-to-Use Tools Abstract Complexity

With the rise of data science, businesses, researchers, and individuals are using data to solve a wide variety of challenges across many problem domains.

Tools like pandas, Apache Spark, and scikit-learn have made applying data science easier, facilitating further data-driven evolution.

While these tools have fostered accessibility in data science, their reliance on traditional CPU processing creates bottlenecks that increase cycle time, operational cost, and overhead.

Typical CPU-based stack (backed by CPU memory, scaled out with Apache Spark / Dask):
● Pre-processing: pandas
● Machine learning: scikit-learn
● Graph analytics: NetworkX
● Deep learning: TensorFlow, PyTorch, MxNet
● Visualization: matplotlib
Accelerated Data Science with RAPIDS
Powering Popular Data Science Ecosystems with NVIDIA GPUs

RAPIDS abstracts the complexities of accelerated data science by building upon popular Python and Java libraries, enabling users to see benefits immediately.

RAPIDS utilizes NVIDIA CUDA primitives for low-level compute optimization and exposes GPU parallelism and high-bandwidth memory speed through user-friendly interfaces like pandas, scikit-learn, Apache Spark, or Dask.

With Spark or Dask, RAPIDS can scale out to multi-node, multi-GPU clusters to power through big data processes.

GPU-accelerated stack (backed by GPU memory, scaled out with Spark / Dask):
● Pre-processing: cuIO & cuDF
● Machine learning: cuML
● Graph analytics: cuGraph
● Deep learning: TensorFlow, PyTorch, MxNet
● Visualization: cuXfilter, pyViz, Plotly
Minor Code Changes for Major Benefits
Abstracting Accelerated Compute through Familiar Interfaces

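As an illustration of the point above, the same dataframe code runs under pandas on CPU or cuDF on GPU; typically only the import changes. The sketch below uses pandas so it runs without a GPU, with the cuDF variant shown as a comment:

```python
import pandas as pd
# With RAPIDS, the only change needed is the import:
# import cudf as pd

df = pd.DataFrame({
    "user": ["a", "b", "a", "c", "b", "a"],
    "spend": [10.0, 5.0, 7.0, 3.0, 8.0, 2.0],
})

# Identical groupby/aggregation code on CPU (pandas) or GPU (cuDF);
# sort_index makes the result order deterministic in both libraries
per_user = df.groupby("user")["spend"].sum().sort_index()
```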
Lightning-Fast End-to-End Performance
Reducing Data Science Processes from Hours to Seconds

RAPIDS delivers massive speed-ups across the end-to-end data science lifecycle. Conducting benchmarks in a commercial cloud environment, we're able to get incredible performance running a common ML model training pipeline.

Between loading and cleansing data, engineering features, and training a classifier using a 200GB CSV dataset, a RAPIDS-based pipeline completed these operations in just over two minutes. The same process takes two and a half hours on a similar CPU configuration.

● 16 A100s provide more power than 100 CPU nodes
● 70x faster performance than a similar CPU configuration
● 20x more cost-effective than a similar CPU configuration

*CPU approximate to n1-highmem-8 (8 vCPUs, 52GB memory) on Google Cloud Platform. TCO calculations based on Cloud instance costs.
Improving Performance Over Time
Holistic Development Yields Long-Term Results

As RAPIDS has matured, performance has continued to improve, drastically reducing cycle time and costs for data science operations. Since its initial release in fall 2018, RAPIDS has improved performance nearly 3x by thinking about software and hardware holistically.

Through whole-stack innovation, RAPIDS continues to improve as NVIDIA hardware and software advance. As new, more powerful GPU architectures arrive, RAPIDS makes it simple for data scientists to realize the benefits of innovative hardware.

● 2.5 years between the initial and latest RAPIDS release
● 3x faster performance than RAPIDS 0.2 on DGX-2
● 7x better TCO with improved compute performance
A RAPIDLY GROWING RAPIDS ECOSYSTEM
Supported and used by a wide variety of innovative partners: contributors, adopters, and open source projects.
Deploy RAPIDS, Everywhere
Easily Accelerate Your Workloads in Your Environment

GET SOFTWARE: GitHub, Anaconda, NGC, Docker

CLOUD PROVIDERS

CLOUD SERVICES: Azure Machine Learning, Cloud Dataproc
HANDS-ON LABS

LAB Details
● Use !nvidia-smi to check GPU memory - we are using 1x NVIDIA Tesla T4 with 16 GB of memory.
● Shut down notebooks at the end to free GPU memory.
● Some notebooks restart the kernel to free GPU memory and reset the dataframe.
Start the Lab!

SUMMARY

Check out:
RAPIDS.AI https://github.com/rapidsai/
cuDF https://github.com/rapidsai/cudf
cuML https://github.com/rapidsai/cuml