Intel AI Everywhere

A New Paradigm for AI
Productivity & Performance

Seetha . Nookala
Senior Director
Datacenter Graphics, HPC/AI ,APAC & Japan
December 2023
AI Exponentially Increasing In
GPT4+
>1 Trillion
15,000x
Model Size
Increase In 5 Years Megatron-Turing

530B
GPT-3
175B
Transformer BERT GPT-2 GPT-2 8B TS Turing-NLG
65M 340M 1.5B 8.3B 11B 17B
Mid 2017 2018 2019 Mid 2019 Late 2019 2020 Mid 2020 Late 2021 2022
Source: Amazon Web Services , 2022

– Cost of AI Today
GPT-3 GPT-4
Training Cost
(3,640 petaFLOPS-days) costs if trained on Google TPU v3 (450,000 petaFLOPS-days) , 7,600 GPUs running for a year
ChatGPT Bing AI Chatbot

Inferencing Cost
to process prompts per month with 100 million active users Bing AI Chatbot to serve responses to all Bing users
Source: CSET, CNBC

Driving Will
Unlock New Levels of
Small AI Models
Smartphones & Mobile Devices Smart Cameras & Surveillance Industrial IoT Autonomous Vehicles Edge Servers & Gateways
Facial Recognition Object Detection Predictive Maintenance Object Detection Local Decision-Making
Voice Assistants Anomaly Detection Quality Control Driver Monitoring Data Filtering
Drones & UAVs Smart Home Devices Wearables Robots Retail and Customer Analysis
Obstacle Avoidance Thermostats Health Monitoring Navigation Foot Traffic Analysis
Aerial Data Processing Home Security Gesture Control Object Manipulation Personalized Marketing
Agriculture Sensors & Healthcare Environmental Monitoring Smart Grids & Energy Sports and Entertainment
Equipment Management
Livestock Monitoring Remote Patient Monitoring Air Quality Sensing Power Distribution Player and Audience Analytics
Precision Farming Medical Imaging Earthquake Detection Grid Security AR Headsets

Large Unlocking the AI Continuum Small
From the Largest Data Center Scale AI, to the Smallest AI That Fits On Any Device
Bringing AI
Large Unlock the AI Continuum Small
Novel Applications
Training & Streamline the AI Workflow Inference &

Fine-Tuning Deployment
AI Software
Cloud
Simplify the AI Infrastructure Client
Scalable Systems & Solutions
AI Specific Accelerate the Workload General

Purpose
Foundational Silicon & Software
Approach
Unlocking the AI Continuum Novel Applications Across All Industries
FSI Healthcare Retail Automotive Public Sector …
Streamline the AI Workflow AI Software
Open Productive Accessible
Simplify the AI Infrastructure Scalable Systems & Solutions

Infrastructure Scalability Solutions
Data Center AI Systems AI PC
Accelerate the Workload Foundational Silicon & Software

Bringing AI
Novel Applications

AI Software
Cloud
Simplify the AI Infrastructure Client

Purpose
AI Workflow Today
Training Fine-Tuning Deployment Inference
Proprietary Fragmented Complex
Limiting Customization and Challenges Integrating Various Barrier For Developers

Hindering Innovation AI Frameworks Seeking to Harness the Full
and Libraries Potential of AI
Streamline the AI Workflow
Training Fine-Tuning Deployment Inference
DirectML
SigOpt Neural Compressor

AutoML

Ecosystem Engagement
Pre-configured AI tool
Solutions containers Selector Industry & Solutions High-Touch
Academia Marketplace Support
Optimized
OpenVINO Developer Training
Extensions
Tooling
Modin CNVRG Centers of Documentation
MLOPS training & Tutorials
Excellence
AI Reference Hugging Face Summits &

References Kits Collaboration Training Videos Liftoff Program
Hackathons
Bringing AI
Novel Applications

AI Software

Cloud Infrastructure Scalability Solutions Client
Data Center AI systems AI PC

Purpose
Age of the AI PC
AI Assistants Know
Elevated Your Daily Context
Video
Enhanced Collaboration
Audio Effects & Streaming
Creator & More Creative,

Gaming Productive, & Across Everything
Effects Collaborative You Do
Enhancements Everything
Bringing AI
Novel Applications

AI Software

Cloud Infrastructure Scalability Solutions Client

Purpose
Scalable Systems for Simple AI Infrastructures
State of the 100B+ Up to 100B Up to 20B <1B
Large Small
Art Model Parameters Parameters Parameters Parameters
Mainstream
Training & Baseline Endpoint Inference &
Fine-Tuning Training Peak Inference Inference / Deployment
Inference Inference
Fine-Tuning
Multi-Node Multi-GPU
Cluster & Data Deployment or Multi-Socket Single- Socket
Center Scale per Rack CPU CPU AI PC
Bringing AI
Novel Applications
Training & Inference &

Fine-Tuning Training Fine-Tuning Deployment Inference Deployment
Cloud
Simplify AI Infrastructure Client

General
AI Specific
Purpose
Intel® Gaudi® 2 AI Accelerator
Proven § The ONLY alternative to H100 for training LLMs based on MLPerf
Performance § Trained GPT-3* model in 311 minutes on 384 Intel Gaud2 accelerators
§ Intel Gaudi2 accelerators with FP8 estimated to deliver price-performance

Price >H100
Performance
§ ~2x price-performance to A100
§ 95% linear scaling on MLPerf GPT-3 training benchmark

Scalability
§ Access large Intel Gaudi2 cluster on the Intel Developer Cloud
§ Software optimized for deep learning training and inference

Ease of
Use § PyTorch, Hugging Face, Optimum Library optimizations
Performance metrics based on MLPerf Training 3.0 benchmark. For configuration details, see the results published by MLPCommons..
GPT-3 model tested on MLPerf Training 3.0 consisted of representative 1% slice of the entire GPT-3 model Performance expectations for Gaudi2 with FP8 based on Intel internal evaluation June 2023
Price-performance claim based on comparable pricing of Intel Gaudi server and Nvidia A100 server and MLPerf Inference 3.1 Results. , Aug 2023. See Supermicro for server pricing. Price-performance claim based on
significant pricing differential between Intel Gaudi2 and Nvidia H100 server, MLPerf Training 3.0 Results, May 2023 and internal estimates of performance advancement with FP8. See Supermicro for server pricing. Results
Code Transitioning
Performant AI Code with Minimal Changes
Across Generations & Architectures

Intel® Gaudi® 2 Intel® Gaudi® 3 AI Next Gen GPU (Codenamed
AI Accelerator Accelerator Falcon Shores)
Powered by Transitioning Into a

Single Software Environment
Gaudi Software Suite Unified Programming Model
Fine Tuning
Fine-tune with Fine-tune On Intel® Xeon®,

Intel® Gaudi® 2 Processor Exploiting Its Industry-leading
When Optimal Speed is Desired Ubiquity In the Data Center
6
Hugging Face Evaluations Substantiate Intel® Gaudi® 2

Accelerator LLM Performance vs. Nvidia A100 and H100
4
Relative performance (Higher is better)

2.4x 2.5x
2
1.4x
0
NVIDIA T5 – 3B Bridge NVIDIA Bridge
A100 Samples/s Tower H100 Tower
BS = 16
Visit https://habana.ai/habana-claims-validation for workloads and configurations. Results may vary.

https://huggingface.co/blog/habana-gaudi-2-benchmark
https://huggingface.co/blog/bridgetower
Out-of-the Box Intel® Xeon® Fine Tuning
Optimized Models & Spaces
100k’s
Dolly LLAMA2 MPT LDM3D Whisper Mode
BIOGPT 1.5 Billion Parameter and

Intel Optimized Hugging Face Libraries & Tools GPT-J 6 Billion Parameter Model
140
Transformers Diffusers Accelerate PEFT Optimum 120
Time-to-train in minutes
Fine Tuning Use Cases Fine Tuning at Scale Efficient Fine Performance
Tuning Optimization 100
80
60
Foundational Stack 40
+ IPEX ITREX 20
+ IDEX 0
BIOTGPT 1.5 billion paramater GPT-J 6 billion parameter model
Model
Inference
Novel Applications

Cloud

General
AI Specific
Purpose
6
Energy Efficiency
5 Throughput-per-Watt on BLOOMZ 176B Inference is 1.79x
better than H100; 1.61x better than A100
4
3 2.84x 2.89x
Relative performance (Higher is better)

2
1.42x
0
Stable Diffusion BLOOMz 176B BLOOM 7B
NVIDIA Latency Inference
A100 BS=1, BF16
BS =8; fp32 BS=1, BF16
https://huggingface.co/blog/habana-gaudi-2-benchmark
https://huggingface.co/blog/habana-gaudi-2-bloom
Visit https://habana.ai/habana-claims-validation/for workloads and configuration regarding power consumption claims. Results may vary.
Inference on GPT-J
Intel Gaudi 2 Accelerator with FP8
• Near-parity* on GPT-J with H100
• Outperformed A100 by 2.4x (Server) and 2x
(Offline)
• Achieved 99.9% accuracy with FP8
GPT-J On MLPerf Inference Benchmark
Source: MLPerf Inference 3.1 Data Center Benchmark Results: https://mlcommons.org/en/inference-datacenter-31/

Intel® Gaudi® 2 AI accelerator on GPT-J Vs H100 with 1.09x (Server) and 1.28 (Offline). Results may vary
LLaMA2 7B : Intel Xeon 4th Gen 8480 1S P90 Latency
Batch Size 1, Beam Width 4, PyTorch + IPEX
(Lower is better)
100
90
80
70
Token latency(ms)
60
50
40
30
20
10
0
32 Input Token 128 Input Token 1024 Input Token 2016 Input Token
• Use any popular industry standard AI libraries
BF16 INT8
• Intel AI Platform validated with over 300 inference models
• One socket of 4th Gen Intel® Xeon® processors can run LLaMa2 chatbots in under 100ms 2nd token latency
Results shown for bare metal.

On 1 Socket Intel® Xeon®
Inference Across Multiple Products Scalable Processor
153
134
Llama 2 Next Token Latency ( Lower is Better) 117119 114
Greedy Mode, Mixed Precision On 1 Tile (out of 2 tiles per card) Intel® 93 94
(bfloat16), BS = 1, 256 Output Tokens Data Center GPU Max 1550
79
74 72 75
64 66
59
Latency (ms)
33.8
31.6
29.2 29.5
42 44
20.4
17.6 18.9
15.5 16.2 15.2 15.6 17.2
12.2
9 9.5 10.3
7B 13B 7B 13B 7B 13B 7B 13B

1x Gaudi2® Intel® Data Center GPU Max 1550 single tile (1 card Intel® 4th Gen Xeon® 8480 1 Intel® Xeon® Max 9480 1 Socket
has 2 tiles) Socket
Intel® Gaudi2 ®AI Accelerator
Intel Data Center GPU Max 1550 Intel Xeon Scalable Processor
For configuration details see: https://habana.ai/habana-claims-validation/ and www.intel.com/PerformanceIndex

Bringing AI
Large Scale
AI Models
Unlock the AI Continuum Small AI
Models
Novel Applications

Cloud
Accelerate the Workload Foundational Silicon

General
AI Specific
Purpose
Approach
Unlocking the AI Continuum Novel Applications Across All Industries
FSI Healthcare Retail Automotive Public Sector …
Streamline the AI Workflow AI Software

Infrastructure Scalability Solutions

Thank You!
Notices and
Disclaimers
For notices, disclaimers, and details about performance
claims, visit www.intel.com/PerformanceIndex or scan
the QR code:
© Intel Corporation. Intel, the Intel logo, and other Intel marks are
trademarks of Intel Corporation or its subsidiaries. Other names
and brands may be claimed as the property of others.

Intel AI Everywhere

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Intel AI Everywhere

Uploaded by

Copyright:

Available Formats

A New Paradigm for AI

Productivity & Performance

Increase In 5 Years Megatron-Turing

Source: Amazon Web Services , 2022

ChatGPT Bing AI Chatbot

Source: CSET, CNBC

Obstacle Avoidance Thermostats Health Monitoring Navigation Foot Traffic Analysis

Precision Farming Medical Imaging Earthquake Detection Grid Security AR Headsets

Training & Streamline the AI Workflow Inference &

AI Specific Accelerate the Workload General

FSI Healthcare Retail Automotive Public Sector …

Streamline the AI Workflow AI Software

Open Productive Accessible

Simplify the AI Infrastructure Scalable Systems & Solutions

Accelerate the Workload Foundational Silicon & Software

Training & Streamline the AI Workflow Inference &

AI Specific Accelerate the Workload General

Training Fine-Tuning Deployment Inference

Proprietary Fragmented Complex

Limiting Customization and Challenges Integrating Various Barrier For Developers

SigOpt Neural Compressor

Open Productive Accessible

AI Reference Hugging Face Summits &

Training & Streamline the AI Workflow Inference &

Simplify the AI Infrastructure Scalable Systems & Solutions

AI Specific Accelerate the Workload General

Creator & More Creative,

Training & Streamline the AI Workflow Inference &

Simplify the AI Infrastructure Scalable Systems & Solutions

AI Specific Accelerate the Workload General

Training & Inference &

Accelerate the Workload Foundational Silicon & Software

§ Intel Gaudi2 accelerators with FP8 estimated to deliver price-performance

§ 95% linear scaling on MLPerf GPT-3 training benchmark

§ Software optimized for deep learning training and inference

Across Generations & Architectures

Powered by Transitioning Into a

Fine-tune with Fine-tune On Intel® Xeon®,

Hugging Face Evaluations Substantiate Intel® Gaudi® 2

Relative performance (Higher is better)

Visit https://habana.ai/habana-claims-validation for workloads and configurations. Results may vary.

BIOGPT 1.5 Billion Parameter and

Transformers Diffusers Accelerate PEFT Optimum 120

Training & Inference &

Accelerate the Workload Foundational Silicon & Software

Relative performance (Higher is better)

GPT-J On MLPerf Inference Benchmark

Source: MLPerf Inference 3.1 Data Center Benchmark Results: https://mlcommons.org/en/inference-datacenter-31/

Results shown for bare metal.

Llama 2 Next Token Latency ( Lower is Better) 117119 114

7B 13B 7B 13B 7B 13B 7B 13B

For configuration details see: https://habana.ai/habana-claims-validation/ and www.intel.com/PerformanceIndex

Training & Inference &

Accelerate the Workload Foundational Silicon

FSI Healthcare Retail Automotive Public Sector …

Streamline the AI Workflow AI Software

Open Productive Accessible

Simplify the AI Infrastructure Scalable Systems & Solutions

Accelerate the Workload Foundational Silicon & Software

You might also like