
A New Paradigm for AI

Productivity & Performance


Seetha Nookala
Senior Director
Datacenter Graphics, HPC/AI, APAC & Japan
December 2023
AI Model Size Increasing Exponentially
15,000x increase in model size in 5 years:

- Transformer (mid 2017): 65M parameters
- BERT (2018): 340M
- GPT-2 (2019): 1.5B
- GPT-2 8B (mid 2019): 8.3B
- T5 (late 2019): 11B
- Turing-NLG (2020): 17B
- GPT-3 (mid 2020): 175B
- Megatron-Turing (late 2021): 530B
- GPT-4+ (2022): >1 trillion

Source: Amazon Web Services, 2022
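The 15,000x figure follows directly from the parameter counts above. A quick sketch of the arithmetic (note the >1 trillion GPT-4-class figure is the slide's estimate, not a published number):

```python
# Arithmetic behind the "15,000x in 5 years" claim, using the parameter
# counts listed above. The >1 trillion GPT-4-class figure is the slide's
# estimate, not a published number.
model_params = {
    "Transformer (mid 2017)": 65e6,
    "GPT-3 (mid 2020)": 175e9,
    "Megatron-Turing (late 2021)": 530e9,
    "GPT-4-class (2022)": 1e12,
}
baseline = model_params["Transformer (mid 2017)"]
for name, params in model_params.items():
    print(f"{name}: {params / baseline:,.0f}x vs. the 2017 Transformer")
```

A trillion-parameter model is roughly 1e12 / 65e6 ≈ 15,385x the original Transformer, which rounds to the slide's ~15,000x.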


Cost of AI Today

Training cost:
- GPT-3 (3,640 petaFLOPS-days): cost if trained on Google TPU v3
- GPT-4 (450,000 petaFLOPS-days): 7,600 GPUs running for a year

Inferencing cost:
- ChatGPT: processing prompts each month for 100 million active users
- Bing AI Chatbot: serving responses to all Bing users

Source: CSET, CNBC
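The petaFLOPS-day budgets above translate into fleet time with simple arithmetic. In the sketch below, the sustained per-GPU throughput (160 TFLOP/s) is a hypothetical placeholder chosen for illustration; it roughly reproduces the slide's "7,600 GPUs for a year" framing for GPT-4:

```python
# Back-of-envelope conversion of the slide's petaFLOPS-day training
# budgets into accelerator-fleet time. The sustained per-GPU throughput
# below (160 TFLOP/s) is a hypothetical placeholder, not a measured value.

def gpu_days(pflops_days: float, sustained_flops_per_gpu: float) -> float:
    """Single-GPU days needed: total FLOPs / FLOPs one GPU does per day."""
    total_flops = pflops_days * 1e15 * 86_400
    return total_flops / (sustained_flops_per_gpu * 86_400)

sustained = 160e12  # hypothetical sustained FLOP/s per accelerator
gpt3_gpu_days = gpu_days(3_640, sustained)    # GPT-3: 3,640 pFLOPS-days
gpt4_gpu_days = gpu_days(450_000, sustained)  # GPT-4: 450,000 pFLOPS-days

# Spread across 7,600 GPUs, GPT-4-scale training takes roughly a year:
print(f"{gpt4_gpu_days / 7_600:.0f} days on 7,600 GPUs")
```

Under this assumption the 450,000 petaFLOPS-day budget works out to about 2.8 million single-GPU days, or roughly 370 days on a 7,600-GPU fleet.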


Small AI Models Will Unlock New Levels of Capability

- Smartphones & Mobile Devices: Facial Recognition, Voice Assistants
- Smart Cameras & Surveillance: Object Detection, Anomaly Detection
- Industrial IoT: Predictive Maintenance, Quality Control
- Autonomous Vehicles: Object Detection, Driver Monitoring
- Edge Servers & Gateways: Local Decision-Making, Data Filtering
- Drones & UAVs: Obstacle Avoidance, Aerial Data Processing
- Smart Home Devices: Thermostats, Home Security
- Wearables: Health Monitoring, Gesture Control
- Robots: Navigation, Object Manipulation
- Retail and Customer Analysis: Foot Traffic Analysis, Personalized Marketing
- Agriculture Sensors & Equipment: Livestock Monitoring, Precision Farming
- Healthcare: Remote Patient Monitoring, Medical Imaging
- Environmental Monitoring: Air Quality Sensing, Earthquake Detection
- Smart Grids & Energy Management: Power Distribution, Grid Security
- Sports and Entertainment: Player and Audience Analytics, AR Headsets


Unlocking the AI Continuum
From the Largest Data Center Scale AI to the Smallest AI That Fits on Any Device

Bringing AI Everywhere
- Unlock the AI Continuum (Large to Small): Novel Applications
- Streamline the AI Workflow (Training & Fine-Tuning to Inference & Deployment): AI Software
- Simplify the AI Infrastructure (Cloud to Client): Scalable Systems & Solutions
- Accelerate the Workload (AI-Specific to General Purpose): Foundational Silicon & Software
Approach
Unlocking the AI Continuum

- Novel Applications Across All Industries: FSI, Healthcare, Retail, Automotive, Public Sector, …
- Streamline the AI Workflow with AI Software: Open, Productive, Accessible
- Simplify the AI Infrastructure with Scalable Systems & Solutions: Infrastructure (Data Center), Scalability (AI Systems), Solutions (AI PC)
- Accelerate the Workload with Foundational Silicon & Software

AI Workflow Today
Training → Fine-Tuning → Deployment → Inference

- Proprietary: limiting customization and hindering innovation
- Fragmented: challenges integrating various AI frameworks and libraries
- Complex: a barrier for developers seeking to harness the full potential of AI
Streamline the AI Workflow
Training → Fine-Tuning → Deployment → Inference

Tooling across the workflow: DirectML, SigOpt, Neural Compressor, AutoML, OpenVINO, Modin, cnvrg.io MLOps

Open, Productive, Accessible

Ecosystem engagement:
- Pre-configured solutions, containers, AI tool selector, solutions marketplace, high-touch support
- Industry & academia collaboration, centers of excellence, Hugging Face collaboration
- Optimized OpenVINO extensions, developer tooling, AI reference kits
- Developer training, documentation & tutorials, training videos, summits & hackathons, Liftoff program
Age of the AI PC

- AI Assistants That Know Your Daily Context
- Elevated Video & Audio: Enhanced Collaboration Effects & Streaming
- Creator & Gaming Effects Enhancements

More Creative, Productive, & Collaborative Across Everything You Do
Scalable Systems for Simple AI Infrastructures

- State of the Art (100B+ parameters): Training & Fine-Tuning; Multi-Node Cluster & Data Center Scale
- Large (up to 100B parameters): Baseline Training, Peak Inference, Fine-Tuning; Multi-GPU Deployment per Rack or Multi-Socket CPU
- Mainstream (up to 20B parameters): Mainstream Inference; Single-Socket CPU
- Small (<1B parameters): Endpoint Inference & Deployment; AI PC
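The sizing guidance above can be expressed as a small helper. The tier names and parameter cut-offs come from the table; the function itself is illustrative, not an Intel API:

```python
# A minimal helper mapping model size onto the deployment tiers in the
# table above. Tier names and parameter cut-offs follow the slide; the
# function itself is illustrative, not an Intel API.
def deployment_tier(params: int) -> str:
    if params >= 100_000_000_000:      # State of the Art: 100B+
        return "multi-node cluster, data center scale"
    if params >= 20_000_000_000:       # Large: up to 100B
        return "multi-GPU per rack or multi-socket CPU"
    if params >= 1_000_000_000:        # Mainstream: up to 20B
        return "single-socket CPU"
    return "AI PC / endpoint"          # Small: <1B

print(deployment_tier(175_000_000_000))  # GPT-3 scale
print(deployment_tier(7_000_000_000))    # 7B-class model
print(deployment_tier(500_000_000))      # sub-1B model
```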
Intel® Gaudi® 2 AI Accelerator

Proven Performance
§ The only alternative to H100 for training LLMs, based on MLPerf
§ Trained the GPT-3* model in 311 minutes on 384 Intel Gaudi 2 accelerators

Price Performance
§ Intel Gaudi 2 accelerators with FP8 estimated to deliver better price-performance than H100
§ ~2x price-performance vs. A100

Scalability
§ 95% linear scaling on the MLPerf GPT-3 training benchmark
§ Access to large Intel Gaudi 2 clusters on the Intel Developer Cloud

Ease of Use
§ Software optimized for deep learning training and inference
§ PyTorch, Hugging Face, and Optimum library optimizations

Performance metrics based on the MLPerf Training 3.0 benchmark; for configuration details, see the results published by MLCommons. *The GPT-3 model tested in MLPerf Training 3.0 consisted of a representative 1% slice of the entire GPT-3 model. Performance expectations for Gaudi 2 with FP8 based on Intel internal evaluation, June 2023. Price-performance claim vs. A100 based on comparable pricing of Intel Gaudi and Nvidia A100 servers and MLPerf Inference 3.1 results, August 2023; see Supermicro for server pricing. Price-performance claim vs. H100 based on the significant pricing differential between Intel Gaudi 2 and Nvidia H100 servers, MLPerf Training 3.0 results, May 2023, and internal estimates of performance advancement with FP8; see Supermicro for server pricing. Results may vary.
Code Transitioning
Performant AI Code with Minimal Changes Across Generations & Architectures

Intel® Gaudi® 2 AI Accelerator → Intel® Gaudi® 3 AI Accelerator → Next-Gen GPU (codenamed Falcon Shores)

Powered by the Gaudi Software Suite, transitioning into a single software environment with a unified programming model.

Fine-Tuning
- Fine-tune with the Intel® Gaudi® 2 processor when optimal speed is desired
- Fine-tune on Intel® Xeon®, exploiting its industry-leading ubiquity in the data center

Hugging Face Evaluations Substantiate Intel® Gaudi® 2 Accelerator LLM Performance vs. Nvidia A100 and H100

Relative performance (higher is better):
- T5-3B (samples/s, BS=16): 2.4x vs. NVIDIA A100
- BridgeTower: 2.5x vs. NVIDIA A100, 1.4x vs. NVIDIA H100

Visit https://habana.ai/habana-claims-validation for workloads and configurations. Results may vary.
https://huggingface.co/blog/habana-gaudi-2-benchmark
https://huggingface.co/blog/bridgetower
Out-of-the-Box Intel® Xeon® Fine-Tuning

100k's of optimized models & spaces: Dolly, LLaMA2, MPT, LDM3D, Whisper, BioGPT, GPT-J, and more

Intel-optimized Hugging Face libraries & tools: Transformers, Diffusers, Accelerate, PEFT, Optimum
Covering fine-tuning use cases, fine-tuning at scale, efficient fine-tuning, and performance optimization

[Chart: time-to-train in minutes for the BioGPT 1.5-billion-parameter model and the GPT-J 6-billion-parameter model, comparing the foundational stack against stacks with IPEX, ITREX, and IDEX optimizations added]
Inference
Energy Efficiency

Throughput-per-watt on BLOOMZ 176B inference is 1.79x better than H100 and 1.61x better than A100.

[Chart: relative throughput-per-watt vs. NVIDIA A100 (higher is better), showing 2.84x, 2.89x, and 1.42x across Stable Diffusion (BS=8, FP32), BLOOMZ 176B (BS=1, BF16), and BLOOM 7B (BS=1, BF16) latency inference]

https://huggingface.co/blog/habana-gaudi-2-benchmark
https://huggingface.co/blog/habana-gaudi-2-bloom
Visit https://habana.ai/habana-claims-validation/ for workloads and configuration regarding power consumption claims. Results may vary.
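Throughput-per-watt, the metric behind these efficiency claims, is simply delivered throughput divided by average board power. A sketch of the arithmetic; all numbers below are hypothetical placeholders, not measurements of any accelerator:

```python
# Throughput-per-watt is delivered throughput divided by average board
# power. All numbers below are hypothetical placeholders used only to
# show the arithmetic, not measurements of any accelerator.
def throughput_per_watt(tokens_per_sec: float, avg_power_w: float) -> float:
    return tokens_per_sec / avg_power_w

accel_a = throughput_per_watt(tokens_per_sec=90.0, avg_power_w=600.0)
accel_b = throughput_per_watt(tokens_per_sec=59.0, avg_power_w=700.0)
print(f"A is {accel_a / accel_b:.2f}x more efficient than B")
```

Note that a device with lower raw throughput can still win on this metric if its power draw is proportionally lower.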
Inference on GPT-J
Intel Gaudi 2 Accelerator with FP8
• Near-parity* with H100 on GPT-J
• Outperformed A100 by 2.4x (Server) and 2x (Offline)
• Achieved 99.9% accuracy with FP8

GPT-J on the MLPerf Inference Benchmark

Source: MLPerf Inference 3.1 Data Center Benchmark Results: https://mlcommons.org/en/inference-datacenter-31/

*Intel® Gaudi® 2 AI accelerator vs. H100 on GPT-J: 1.09x (Server) and 1.28x (Offline). Results may vary.
LLaMA2 7B: Intel Xeon 4th Gen 8480, 1 Socket, P90 Latency
Batch Size 1, Beam Width 4, PyTorch + IPEX

[Chart: P90 token latency in ms (lower is better) for BF16 and INT8 at 32, 128, 1024, and 2016 input tokens]

• Use any popular industry-standard AI libraries
• Intel AI platform validated with over 300 inference models
• One socket of 4th Gen Intel® Xeon® processors can run LLaMA2 chatbots with under 100 ms second-token latency

Results shown for bare metal.
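Per-token latency percentiles like the P90 numbers above are typically collected by timing individual decode steps and taking a percentile over many runs. A minimal sketch, with a stub standing in for the real model call (e.g. a PyTorch + IPEX LLaMA2 forward pass):

```python
# Sketch of collecting a P90 per-token latency. `generate_next_token`
# is a stub standing in for a real model's decode step (e.g. a
# PyTorch + IPEX forward pass); the timing harness is the point here.
import time

def generate_next_token() -> str:
    time.sleep(0.001)  # stub: pretend the decode step takes ~1 ms
    return "tok"

def p90_next_token_latency_ms(iters: int = 50) -> float:
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        generate_next_token()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return samples[int(0.9 * len(samples)) - 1]  # 90th-percentile sample

latency = p90_next_token_latency_ms()
print(f"P90 next-token latency: {latency:.1f} ms")
```

Reporting a high percentile rather than the mean is the usual choice for chatbot serving, since tail latency is what users notice.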


Inference Across Multiple Products

Llama 2 Next-Token Latency (lower is better)
Greedy mode, mixed precision (bfloat16), BS=1, 256 output tokens. GPU results on 1 tile (out of 2 tiles per card) of the Intel® Data Center GPU Max 1550; CPU results on 1 socket of an Intel® Xeon® Scalable processor.

[Chart: next-token latency in ms for Llama 2 7B and 13B on 1x Intel® Gaudi® 2 AI Accelerator, Intel® Data Center GPU Max 1550 (single tile; 1 card has 2 tiles), Intel® 4th Gen Xeon® 8480 (1 socket), and Intel® Xeon® Max 9480 (1 socket)]

For configuration details see: https://habana.ai/habana-claims-validation/ and www.intel.com/PerformanceIndex



Thank You!
Notices and Disclaimers
For notices, disclaimers, and details about performance claims, visit www.intel.com/PerformanceIndex.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are
trademarks of Intel Corporation or its subsidiaries. Other names
and brands may be claimed as the property of others.
