Deep Learning over Big Data Stacks: Challenges and A Comprehensive Study on HPC Clusters

Big Data and Deep Learning
• Deep Learning is a sub-set of Machine Learning
– But, it is perhaps the most radical and revolutionary subset
• Deep Learning is going through a resurgence
– Model: Excellent accuracy for deep/convolutional neural networks
– Data: Public availability of versatile datasets like MNIST, CIFAR, and ImageNet
– Capability: Unprecedented computing and communication capabilities: Multi-/Many-Core, GPGPUs, Xeon Phi, InfiniBand, RoCE, etc.
• Big Data has become one of the most important elements in business analytics
– The demand for getting Big Value out of Big Data to drive revenue is continuously growing
(Figures: MNIST handwritten digits; Deep Neural Network. Courtesy: http://www.zdnet.com/article/caffe2-deep-learning-wide-ambitions-flexibility-scalability-and-advocacy/)
Deep Learning over Big Data (DLoBD)
• Deep Learning over Big Data (DLoBD) is one of the most efficient analysis paradigms
• Benefits of the DLoBD approach
– Easily build a powerful data analytics pipeline, or enhance existing workflows (see the sketch after this list)
• E.g., Flickr DL/ML Pipeline, “How Deep Learning Powers Flickr”, http://bit.ly/1KIDfof
• Pipeline: (1) Prepare Datasets @Scale → (2) Deep Learning @Scale → (3) Non-deep learning analytics @Scale → (4) Apply ML model @Scale
– Better data locality
– Efficient resource sharing and cost effective => Leverage existing Big Data deployments
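A minimal Python sketch of the four-stage DLoBD pipeline referenced above, assuming each stage is a local function standing in for a distributed job (e.g., a Spark or Hadoop stage in a real Big Data deployment); the function names and the toy "features" are illustrative, not from the slides:

```python
# Minimal sketch of the 4-stage DLoBD pipeline (illustrative only).
# Each stage is a plain Python function standing in for a distributed job.

def prepare_datasets(raw_records):
    """(1) Prepare Datasets @Scale: clean and normalize raw records."""
    return [r.strip().lower() for r in raw_records if r]

def deep_learning(examples):
    """(2) Deep Learning @Scale: learn feature representations.
    Here we simply fake 'features' as token counts."""
    return [{"length": len(e.split())} for e in examples]

def non_deep_analytics(features):
    """(3) Non-deep learning analytics @Scale: e.g., a simple threshold rule."""
    return [{"feature": f, "label": int(f["length"] > 3)} for f in features]

def apply_model(scored):
    """(4) Apply ML model @Scale: route records based on predicted label."""
    return [s for s in scored if s["label"] == 1]

if __name__ == "__main__":
    raw = ["A deep learning pipeline over big data  ", "short text", ""]
    out = apply_model(non_deep_analytics(deep_learning(prepare_datasets(raw))))
    print(out)
```
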
The 3 V’s of Big Data: Deep Learning Challenges
• Managing Big Data Volume: Large number of examples (inputs), a large variety of class types (outputs), and very high dimensionality (attributes)
– Distributed frameworks over parallelized machines
– CPU/GPU SIMD for training speed with accuracy; support both model and data parallelism (a toy data-parallel sketch follows this list)
• Data incompleteness and unlabeled or noisy labels
– E.g., the 80 million tiny images database: low-resolution color images over 79,000 search terms
– Need for more efficient cost functions or semi-supervised strategies
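A toy NumPy sketch of the data-parallelism point above, assuming a simple linear-regression model: one mini-batch is split across simulated workers, each worker computes a local gradient, and the averaged gradient updates the shared parameters. The worker count, model, and learning rate are illustrative assumptions, not from the slides.

```python
import numpy as np

# Toy data-parallel SGD step: split one mini-batch across "workers",
# compute per-worker gradients for a linear model, then average them.

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 10))            # one mini-batch of 64 examples
true_w = rng.normal(size=10)
y = X @ true_w + 0.01 * rng.normal(size=64)

w = np.zeros(10)                         # shared model, replicated on every worker
n_workers, lr = 4, 0.1

def local_gradient(Xs, ys, w):
    """Gradient of mean squared error on one worker's shard."""
    return 2.0 * Xs.T @ (Xs @ w - ys) / len(ys)

for step in range(100):
    shards_X = np.array_split(X, n_workers)
    shards_y = np.array_split(y, n_workers)
    grads = [local_gradient(Xs, ys, w) for Xs, ys in zip(shards_X, shards_y)]
    w -= lr * np.mean(grads, axis=0)     # all-reduce (average), then update

print("parameter error:", np.linalg.norm(w - true_w))
```
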
The 3 V’s of Big Data: Deep Learning Challenges
• Managing Big Data Variety: Multiple modalities
– Data source variety: audio streams, graphics and animations, unstructured text, etc.
– Step 1: Learn data representations from each individual data source using DL
– Step 2: Learn shared representations capable of capturing correlations across multiple modalities (a simplified sketch of this two-step idea follows this list)
– E.g., multimodal Deep Boltzmann Machine (DBM) [Srivastava and Salakhutdinov]
• multiple stacked RBMs for each modality
• an additional layer of binary hidden units on top for joint representation
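The multimodal DBM above is a generative, RBM-based model; as a much simpler illustration of the two-step idea (modality-specific representations first, then a shared joint layer), here is a feed-forward NumPy sketch with fixed random weights. It mirrors only the architecture shape, not DBM training; all layer sizes and weights are illustrative assumptions.

```python
import numpy as np

# Simplified illustration of the two-step multimodal idea:
#   Step 1: modality-specific representations (random weights stand in for trained RBMs)
#   Step 2: a shared joint layer over both modalities.
# This is NOT a Deep Boltzmann Machine; it only mirrors the architecture shape.

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Fake inputs: an "image" modality (256-dim) and a "text" modality (100-dim)
image = rng.normal(size=256)
text = rng.normal(size=100)

# Step 1: modality-specific hidden layers
W_img = rng.normal(scale=0.1, size=(128, 256))
W_txt = rng.normal(scale=0.1, size=(128, 100))
h_img = sigmoid(W_img @ image)
h_txt = sigmoid(W_txt @ text)

# Step 2: joint layer over the concatenated modality-specific representations
W_joint = rng.normal(scale=0.1, size=(64, 256))
joint = sigmoid(W_joint @ np.concatenate([h_img, h_txt]))

print("joint representation shape:", joint.shape)
```
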
The 3 V’s of Big Data: Deep Learning Challenges
• Managing Big Data Velocity: Data is being generated at extremely high speed and needs to be processed in a timely manner
– Need for Online Learning approaches
– Limited progress with Online DL over conventional neural networks
– Mini-batches with SGD for a good balance between computer memory and running time (a small streaming sketch follows this list)
• A further challenge with high velocity is that data are often non-stationary
– Temporal locality of data => significant degree of correlation
– Ability to learn the data as a stream
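A small sketch of the mini-batch SGD point above on a simulated non-stationary stream: each arriving mini-batch updates the model once and is then discarded, so memory stays bounded while the model tracks drift in the data distribution. The logistic model, drift schedule, and hyperparameters are illustrative assumptions.

```python
import numpy as np

# Online mini-batch SGD over a (simulated) non-stationary stream:
# each incoming mini-batch updates the model once, so memory stays bounded
# and the model can follow drift in the underlying distribution.

rng = np.random.default_rng(2)
dim, batch_size, lr = 5, 32, 0.5
w = np.zeros(dim)
true_w = rng.normal(size=dim)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for t in range(500):
    if t % 100 == 0 and t > 0:
        true_w += 0.5 * rng.normal(size=dim)          # distribution drift (non-stationary)
    X = rng.normal(size=(batch_size, dim))            # one mini-batch from the stream
    y = (X @ true_w > 0).astype(float)                # labels from the current distribution
    grad = X.T @ (sigmoid(X @ w) - y) / batch_size    # logistic-loss gradient
    w -= lr * grad                                    # single update, then the batch is discarded

print("final weights:", np.round(w, 2))
```
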
