Intel AI Everywhere
[Chart: language model size grew ~15,000x from mid-2017 to 2022: Transformer (65M parameters, mid 2017), BERT (340M, 2018), GPT-2 (1.5B, 2019), GPT-2 8B (8.3B, mid 2019), T5 (11B, late 2019), Turing-NLG (17B, 2020), GPT-3 (175B, mid 2020).]
Training GPT-3 (3,640 petaFLOPS-days) carries substantial compute costs if trained on Google TPU v3. Serving is larger still: an estimated 450,000 petaFLOPS-days, roughly 7,600 GPUs running for a year, to process prompts per month with 100 million active users; the Bing AI chatbot would need similar scale to serve responses to all Bing users.
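As a rough sanity check on the figures above, the petaFLOPS-days-to-GPU-years conversion can be sketched in a few lines. The sustained per-GPU throughput (~160 TFLOPS) is an assumption for illustration only, not a number from the slide:

```python
# Back-of-the-envelope conversion from petaFLOPS-days to GPU-years.
# Assumption (not from the slide): each GPU sustains ~160 TFLOPS.
SUSTAINED_TFLOPS_PER_GPU = 160  # hypothetical sustained throughput

def gpu_years(petaflops_days: float) -> float:
    """Convert a petaFLOPS-days compute budget into GPU-years."""
    tflops_days = petaflops_days * 1_000            # 1 petaFLOPS = 1,000 TFLOPS
    gpu_days = tflops_days / SUSTAINED_TFLOPS_PER_GPU
    return gpu_days / 365

print(round(gpu_years(3_640)))     # GPT-3 training budget
print(round(gpu_years(450_000)))   # monthly serving estimate
```

Under that assumption, 450,000 petaFLOPS-days works out to roughly 7,700 GPU-years, consistent with the "7,600 GPUs running for a year" figure.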
Edge AI applications:
- Smartphones & Mobile Devices: Facial Recognition, Voice Assistants
- Smart Cameras & Surveillance: Object Detection, Anomaly Detection
- Industrial IoT: Predictive Maintenance, Quality Control
- Autonomous Vehicles: Object Detection, Driver Monitoring
- Edge Servers & Gateways: Local Decision-Making, Data Filtering
- Drones & UAVs: Aerial Data Processing
- Smart Home Devices: Home Security
- Wearables: Gesture Control
- Robots: Object Manipulation
- Retail and Customer Analysis: Personalized Marketing
- Agriculture Sensors & Equipment Management: Livestock Monitoring
- Healthcare: Remote Patient Monitoring
- Environmental Monitoring: Air Quality Sensing
- Smart Grids & Energy: Power Distribution
- Sports and Entertainment: Player and Audience Analytics
Simplify the AI Infrastructure: Cloud, Client, Scalable Systems & Solutions
[Slide: AI PC experiences (with DirectML): AI Assistants, Elevated Video, Enhanced Collaboration, Audio Effects & Enhancements, Know Your Daily Context, Streaming Everything.]
Bringing AI Everywhere
Unlock the AI Continuum: from large models to small, from novel applications to mainstream.
[Diagram: the AI continuum spans from training & fine-tuning at cluster & data-center scale, through baseline training and peak inference on multi-node multi-GPU deployments (or per rack), to fine-tuning and inference on multi-socket and single-socket CPUs, and endpoint inference/deployment on the AI PC.]
Performance metrics based on the MLPerf Training 3.0 benchmark. For configuration details, see the results published by MLCommons. The GPT-3 model tested in MLPerf Training 3.0 consisted of a representative 1% slice of the entire GPT-3 model. Performance expectations for Gaudi2 with FP8 are based on Intel internal evaluation, June 2023. Price-performance claim based on comparable pricing of the Intel Gaudi2 server and the Nvidia A100 server and MLPerf Inference 3.1 results, Aug 2023; see Supermicro for server pricing. Price-performance claim based on the significant pricing differential between the Intel Gaudi2 and Nvidia H100 servers, MLPerf Training 3.0 results, May 2023, and internal estimates of performance advancement with FP8; see Supermicro for server pricing. Results may vary.
Code Transitioning
Performant AI Code with Minimal Changes
[Chart: throughput (samples/s) for T5-3B and BridgeTower, BS = 16, compared against NVIDIA A100 and H100; up to 1.4x shown.]
[Chart: time-to-train in minutes for Dolly, LLaMA2, MPT, LDM3D, and Whisper models.]
Fine-Tuning Use Cases: Fine-Tuning at Scale, Efficient Fine-Tuning, Performance Optimization
[Chart: fine-tuning performance on BioGPT (1.5 billion parameters) and GPT-J (6 billion parameters), comparing the foundational stack, + IPEX, + ITREX, and + IDEX.]
Inference
Energy Efficiency
Throughput-per-watt on BLOOMZ 176B inference is 1.79x better than H100 and 1.61x better than A100.
[Chart: throughput-per-watt relative to NVIDIA A100: 2.84x on Stable Diffusion (BS = 8, FP32), 2.89x on BLOOMZ 176B (BS = 1, BF16), 1.42x on BLOOM 7B (BS = 1, BF16).]
https://huggingface.co/blog/habana-gaudi-2-benchmark
https://huggingface.co/blog/habana-gaudi-2-bloom
Visit https://habana.ai/habana-claims-validation/ for workloads and configuration regarding power consumption claims. Results may vary.
Inference on GPT-J
Intel Gaudi 2 Accelerator with FP8
• Near-parity* on GPT-J with H100
• Outperformed A100 by 2.4x (Server) and 2x (Offline)
• Achieved 99.9% accuracy with FP8
[Chart: token latency (ms) across 32, 128, 1024, and 2016 input tokens.]
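The 99.9% accuracy figure reflects how little precision FP8 gives up on these workloads. A toy illustration of E4M3-style rounding follows; this is a simplified sketch that ignores subnormals, saturation to the format's maximum, and NaN encoding, and is not Gaudi's actual implementation:

```python
import math

def quantize_e4m3(x: float, man_bits: int = 3) -> float:
    """Round x to a simplified FP8 E4M3 grid: 1 sign, 4 exponent, 3 mantissa bits.
    Ignores subnormals, saturation, and NaN for brevity."""
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    m, e = math.frexp(abs(x))        # abs(x) = m * 2**e with m in [0.5, 1)
    m, e = m * 2.0, e - 1            # renormalize so m is in [1, 2)
    e = max(min(e, 8), -6)           # representable exponent range for bias 7
    step = 2.0 ** -man_bits          # mantissa resolution: 1/8
    m_q = round(m / step) * step     # round mantissa to 3 fractional bits
    return sign * m_q * 2.0 ** e

# With a 3-bit mantissa, relative rounding error stays within ~1/16 of the value:
print(quantize_e4m3(3.14159))
```

Weight and activation tensors quantized this way stay close enough to their BF16 values that end-task accuracy is nearly unchanged, which is the effect the slide's 99.9% figure describes.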
• Use any popular industry-standard AI library (BF16, INT8)
• Intel AI platform validated with over 300 inference models
• One socket of 4th Gen Intel® Xeon® processors can run LLaMa2 chatbots in under 100 ms second-token latency
[Chart: latency (ms) on one tile (of 2 tiles per card) of the Intel® Data Center GPU Max 1550; greedy mode, mixed precision (bfloat16), BS = 1, 256 output tokens.]
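Second-token latency, the metric cited above, is the time to produce the second output token and is a proxy for steady-state decode speed (the first token also absorbs prompt processing). It can be measured generically by timing successive decode steps; here `fake_decode_step` is a hypothetical stand-in, and a real harness would call the model's single-step generation instead:

```python
import time

def measure_token_latencies(decode_step, n_tokens: int = 4) -> list[float]:
    """Time each decode step in milliseconds.
    latencies[0] is first-token latency; latencies[1] is second-token latency."""
    latencies = []
    for _ in range(n_tokens):
        t0 = time.perf_counter()
        decode_step()                               # one token's worth of work
        latencies.append((time.perf_counter() - t0) * 1000.0)
    return latencies

def fake_decode_step():
    """Stand-in for a real model's single-token decode (hypothetical)."""
    time.sleep(0.01)                                # pretend decode takes ~10 ms

lat = measure_token_latencies(fake_decode_step)
print(f"2nd token latency: {lat[1]:.1f} ms")
```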
© Intel Corporation. Intel, the Intel logo, and other Intel marks are
trademarks of Intel Corporation or its subsidiaries. Other names
and brands may be claimed as the property of others.