Vector IRAM: A Microprocessor Architecture for Media Processing
Christoforos E. Kozyrakis
kozyraki@cs.berkeley.edu
CS252 Graduate Computer Architecture February 10, 2000
Outline
Motivation for IRAM
  technology trends
  design trends
  application trends
Vector IRAM
  instruction set
  prototype architecture
  performance
2/10/2000
C.E. Kozyrakis, U.C. Berkeley
Page 2
Processor-DRAM Gap (latency)
[Figure: performance (log scale, 1 to 1000) versus time, 1980 to 2000. CPU performance ("Moore's Law") improves 60%/yr.; DRAM performance improves 7%/yr. The processor-memory performance gap grows 50%/year.]
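The 50% per year figure follows from the two growth rates on the chart. A minimal check, assuming both rates compound annually (the function name is illustrative and not from the slides):

```python
def annual_gap_growth(cpu_rate: float, dram_rate: float) -> float:
    """Return the yearly growth of the CPU/DRAM performance ratio.

    Rates are fractional improvements per year (0.60 means 60%/yr).
    Annual compounding is an assumption; the slide does not state it.
    """
    return (1.0 + cpu_rate) / (1.0 + dram_rate) - 1.0


# Rates taken from the chart: CPU 60%/yr, DRAM 7%/yr.
gap = annual_gap_growth(cpu_rate=0.60, dram_rate=0.07)
print(f"gap grows about {gap:.1%} per year")
```

The ratio 1.60 / 1.07 is just under 1.5, which is where the rounded 50% comes from.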
Processor-DRAM Tax
[Figure: bar chart of transistor counts (Million Transistors), split into logic and memory, for Intel PIII Xeon, MIPS R12000, HP PA-8500, Sun Ultra-2, PowerPC G4, IBM Power3, AMD Athlon, and Alpha 21264.]
Power Consumption
[Figure: performance (SPECfp95, 0 to 60) versus power (W, 0 to 80) for Alpha 21264, AMD Athlon, IBM Power3, PowerPC G4, Sun Ultra-2, HP PA-8500, MIPS R12000, and Intel PIII Xeon.]
Other Design Challenges

Interconnect scaling problems
  multiple cycles to go across the chip
  difficult to achieve single-cycle result forwarding
  need to add extra pipeline stages at the cost of power, complexity, and branch and load-use latency
Design complexity of high-end CPUs
  4 to 5 years from scratch to chips for new superscalar architectures
  >100 engineers
  >50% of resources go to design verification
Complexity vs. Performance Gains

                          R5000       R10000         R10K/R5K
Clock Rate                200 MHz     195 MHz        1.0x
On-Chip Caches            32K/32K     32K/32K        1.0x
Instructions/Cycle        1 (+ FP)    4              4.0x
Pipe stages               5           5-7            1.2x
Model                     In-order    Out-of-order   ---
Die Size (mm2)            84          298            3.5x
  wo cache, TLB           32          205            6.3x
Development (man years)   60          300            5.0x
SPECint_base95            5.7         8.8            1.6x
Future microprocessor applications

Multimedia applications
  image/video processing, voice/pattern recognition, 3D graphics, animation, digital music, encryption, etc.
  narrow data types, streaming data, real-time requirements
Mobile and embedded environments
  notebooks, PDAs, digital cameras, cellular phones, pagers, game consoles, cars, etc.
  small devices, limited chip count, limited power/energy budget
Significantly different environment from the desktop/workstation model
Requirements on microprocessors (1)

High performance for multimedia:
  real-time performance guarantees
  support for continuous media data types
  exploit fine-grain parallelism
  exploit coarse-grain parallelism
  exploit high instruction reference locality
  code density
  high memory bandwidth
Average vs. real time performance ...

[Figure: distribution of inputs (0% to 45%) over performance, from worst case to best case, for several candidate designs. Which one is the best? Statistical: average: C. Real time: worst: A.]
Requirements on microprocessors (2)

Low power and energy consumption
  energy efficiency for long battery life
  power efficiency for system cost reduction (cooling system, packaging, etc.)
Design scalability
  performance scalability
  physical design scalability
  design complexity, verification complexity
  immunity to interconnect scaling problems
  locality of interconnect, tolerance to latency
System-on-a-chip (SoC)
  highly integrated system
  low system chip count
The IRAM vision statement

Microprocessor & DRAM on a single chip:
  on-chip memory latency 5-10X, bandwidth 50-100X
  improve energy efficiency 2X-4X (no off-chip bus)
  serial I/O 5-10X vs. buses
  smaller board area/volume
  adjustable memory size/width
[Figure: a conventional system (processor with caches and L2 cache from a logic fab, connected over buses to DRAM from a DRAM fab, plus I/O) next to an IRAM chip that combines the processor, DRAM, and I/O on one die built in a DRAM fab.]
Vector IRAM
Vector processing
  high performance for media processing
  low power/energy for processor control
  modularity, low complexity
  scalability
  well-understood software development
Embedded DRAM
  high bandwidth for vector processing
  low power/energy for memory accesses
  modularity, scalability
  small system size
IRAM ISA summary

Full vector instruction set with
  32 vector registers, 32 vector flag registers
  support for multiple data types (64b, 32b, 16b, 8b)
  support for strided and indexed memory accesses
  support for auto-increment addressing
  support for DSP operations (multiply-add, saturation, etc.)
  support for conditional execution
  support for software speculation
  support for fast reductions and butterfly permutations
  support for virtual memory
  restartable arithmetic (FP & integer) exceptions
Implemented as a coprocessor extension to MIPS64 ISA (coprocessor 2)

Vector architectural state

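[Figure: vector architectural state.
  Virtual processors ($vlr): VP0, VP1, ..., VP($vlr-1)
  General purpose registers (32): vr0, vr1, ..., vr31, each element $vpw wide
  Flag registers (32): vf0, vf1, ..., vf31, 1b per element
  Control registers: vcr0, vcr1, ..., vcr31, 64b
  Scalar registers: vs0, vs1, ..., vs31, 64b]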
Fixed-point Multiply-add
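[Figure: fixed-point multiply-add datapath. Inputs x and y (n/2 bits each) are multiplied into an n-bit product, which is shifted right and rounded; the result is added to z (n bits) with saturation, giving an n-bit output.]
Multiply halves & shift instruction provides support for any fixed-point format
Precision is equal to the datatype width; multiplier inputs have half the width
Uniform, simple support for all datatypes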
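The datapath above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the VIRAM instruction definition: it assumes signed two's-complement operands and round-to-nearest with ties rounded up, neither of which the slide specifies, and the names `saturate` and `fixed_madd` are hypothetical.

```python
def saturate(value: int, bits: int) -> int:
    """Clamp value to the signed two's-complement range of the given width."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def fixed_madd(x: int, y: int, z: int, n: int, shift: int) -> int:
    """Compute sat_n(z + round((x * y) >> shift)) for n-bit data.

    x and y are n/2-bit inputs, so their product fits in n bits.
    The shift amount selects the fixed-point format.
    """
    half = n // 2
    if saturate(x, half) != x or saturate(y, half) != y:
        raise ValueError("multiplier inputs must fit in n/2 bits")
    product = x * y
    if shift > 0:
        # Assumed rounding mode: add half an LSB, then arithmetic shift right.
        product = (product + (1 << (shift - 1))) >> shift
    return saturate(z + product, n)


print(fixed_madd(100, 50, 0, n=16, shift=7))      # plain multiply-add
print(fixed_madd(100, 50, 32760, n=16, shift=7))  # result clamps at the 16-bit maximum
```

Changing only `shift` moves the binary point, which is the sense in which one instruction covers any fixed-point format.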
VIRAM-1 prototype
Design Overview
64b MIPS scalar core
  coprocessor interface
  16KB I/D caches
Memory system
  8 2MByte eDRAM banks
  single sub-bank per bank
  256-bit synchronous interface, separate I/O signals
  20ns cycle time, 6.6ns column access
  crossbar interconnect for 12.8 GB/sec per direction
  no caches
Vector unit
  8KByte vector register file
  support for 64b, 32b, and 16b data types
  2 arithmetic (1 FP), 2 flag processing, 1 load-store units
  4 64-bit datapaths per unit
  DRAM latency included in vector pipeline
  4 addresses/cycle for strided/indexed accesses
  2-level TLB
Network interface
  user-level message passing
  dedicated DMA engines
  4 100MByte/s links
Vector Unit Pipeline Structure

Single-issue, in-order pipeline
  each instruction can specify up to 128 operations and occupy a functional unit for 8 cycles
DRAM latency is included in the execution pipeline (delayed pipeline)
  deep pipeline design, but no caches needed to avoid stalls
  worst-case DRAM latency does not cause pipeline stalls
Address decoupling buffer
  buffers memory addresses in the presence of conflicts (indexed/strided accesses)
  memory conflicts do not stall pipeline
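The 128-operation and 8-cycle figures can be reproduced from numbers given elsewhere in the slides (8KByte vector register file, 32 vector registers, 4 64-bit datapaths per unit). The sketch below assumes the register file is split evenly across the 32 registers and that narrow elements are packed into each 64-bit datapath; the function names are illustrative.

```python
VRF_BYTES = 8 * 1024   # vector register file size (Design Overview)
NUM_VREGS = 32         # vector registers (IRAM ISA summary)
DATAPATHS = 4          # 64-bit datapaths per functional unit
DATAPATH_BITS = 64


def max_vector_length(width_bits: int) -> int:
    """Elements per vector register, assuming an even split of the register file."""
    bits_per_register = VRF_BYTES * 8 // NUM_VREGS
    return bits_per_register // width_bits


def occupancy_cycles(width_bits: int) -> int:
    """Cycles one full-length instruction keeps a functional unit busy."""
    ops_per_cycle = DATAPATHS * (DATAPATH_BITS // width_bits)
    return max_vector_length(width_bits) // ops_per_cycle


for width in (64, 32, 16):
    print(width, max_vector_length(width), occupancy_cycles(width))
```

Under these assumptions the maximum of 128 operations corresponds to 16-bit elements, and the occupancy stays at 8 cycles for every supported width because narrower elements raise both the element count and the operations per cycle.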
Non-Delayed Pipeline
[Figure: non-delayed pipeline. The scalar stages F D X M W feed three vector pipelines: VLOAD (memory access, then VW), VALU (VR X1 X2 ... XN VW), and VSTORE (VR), executing a vld, vadd, vst sequence. DRAM latency: >=20ns. The long load->ALU RAW hazard exposes the full DRAM latency.]
Tolerating Memory Latency: Delayed Pipeline

[Figure: delayed pipeline. Same stages and vld, vadd, vst sequence as the non-delayed pipeline, with DELAY stages added to the vector pipeline to cover the DRAM latency (>20ns). The load->ALU RAW hazard sees only the functional unit latency (short).]
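The difference between the two pipelines can be illustrated with a toy issue-timing model. Everything in it is an assumption made for illustration: one instruction issues per cycle, in order; each instruction waits until its operands are ready at its read stage; the delayed pipeline moves the read stage of ALU and store instructions back by exactly the memory latency; and the latencies (8 and 2 cycles) are made up. It ignores vector length, chaining, and bank conflicts.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Op:
    name: str
    kind: str                   # "load", "alu", or "store"
    deps: Tuple[int, ...] = ()  # indices of earlier ops whose results are read


def issue_times(ops: List[Op], mem_latency: int, alu_latency: int,
                delayed: bool) -> List[int]:
    """Return the issue cycle of each op for a single-issue, in-order pipeline."""
    delay = mem_latency if delayed else 0
    issue: List[int] = []
    ready: List[int] = []  # cycle at which each op's result is available
    for op in ops:
        # Loads read their addresses at issue; other ops read after the delay.
        read_offset = 0 if op.kind == "load" else delay
        earliest = issue[-1] + 1 if issue else 0
        for d in op.deps:
            earliest = max(earliest, ready[d] - read_offset)
        issue.append(earliest)
        if op.kind == "load":
            ready.append(earliest + mem_latency)
        elif op.kind == "alu":
            ready.append(earliest + read_offset + alu_latency)
        else:
            ready.append(earliest + read_offset)
    return issue


# Two iterations of the vld / vadd / vst sequence shown in the figures.
loop = [
    Op("vld", "load"), Op("vadd", "alu", (0,)), Op("vst", "store", (1,)),
    Op("vld", "load"), Op("vadd", "alu", (3,)), Op("vst", "store", (4,)),
]
non_delayed = issue_times(loop, mem_latency=8, alu_latency=2, delayed=False)
delayed = issue_times(loop, mem_latency=8, alu_latency=2, delayed=True)
print("non-delayed:", non_delayed)
print("delayed:    ", delayed)
```

In this model the non-delayed vadd waits the full memory latency behind each vld, while in the delayed version the only remaining stall is the short ALU latency in front of each vst. The cost, also visible in the model, is that any load depending on an ALU result would see the longer latency instead.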

Clustered VLSI Design

[Figure: clustered VLSI design. Four identical 64b lanes (256b in total) sit beside a shared control block. Each lane contains a crossbar interface (Xbar I/F), Integer Datapath 0, vector registers, flag registers & datapath, FP datapaths, Integer Datapath 1, and a second Xbar I/F.]
VIRAM-1 Floorplan
[Figure: VIRAM-1 floorplan. DRAM banks 0, 2, 4, and 6 run along one edge and DRAM banks 1, 3, 5, and 7 along the opposite edge. Between them sit the network interface (NI), the MIPS core, vector lanes 0 and 1, the vector control block (CTL), and vector lanes 2 and 3.]
Prototype Summary
Technology:
  0.18um eDRAM CMOS process (IBM)
  6 layers of copper interconnect
  1.2V and 1.8V power supply
Memory: 16 MBytes
Clock frequency: 200MHz
Power: 2 W for vector unit and memory
Transistor count: ~140 million
Peak performance:
  GOPS w. multiply-add: 3.2 (64b), 6.4 (32b), 12.8 (16b)
  GOPS wo. multiply-add: 1.6 (64b), 3.2 (32b), 6.4 (16b)
  GFLOPS: 1.6 (32b)
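One way to reproduce the peak numbers is from the unit counts in the Design Overview (2 arithmetic units of which 1 handles FP, 4 64-bit datapaths per unit) and the 200MHz clock. The sketch assumes narrow elements are packed into each 64-bit datapath, that a multiply-add counts as two operations, and that the FP figure counts single operations; these are interpretations, not statements from the slides.

```python
CLOCK_MHZ = 200
ARITH_UNITS = 2      # arithmetic functional units
FP_UNITS = 1         # of which one supports floating point
DATAPATHS = 4        # 64-bit datapaths per unit
DATAPATH_BITS = 64


def peak_gops(width_bits: int, multiply_add: bool) -> float:
    """Peak integer/fixed-point throughput in GOPS for a given element width."""
    ops_per_cycle = ARITH_UNITS * DATAPATHS * (DATAPATH_BITS // width_bits)
    if multiply_add:
        ops_per_cycle *= 2  # assumed: one multiply-add counts as two operations
    return ops_per_cycle * CLOCK_MHZ / 1000


def peak_gflops(width_bits: int = 32) -> float:
    """Peak floating-point throughput in GFLOPS on the single FP unit."""
    return FP_UNITS * DATAPATHS * (DATAPATH_BITS // width_bits) * CLOCK_MHZ / 1000


for width in (64, 32, 16):
    print(width, peak_gops(width, True), peak_gops(width, False))
print("GFLOPS (32b):", peak_gflops(32))
```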
Kernels Performance
Kernel                 Peak Perf.    Sustained Perf.   % of Peak
Image Composition      6.4 GOPS      6.40 GOPS         100.0%
iDCT                   6.4 GOPS      1.97 GOPS         30.7%
Color Conversion       3.2 GOPS      3.07 GOPS         96.0%
Image Convolution      3.2 GOPS      3.16 GOPS         98.7%
Integer MV Multiply    3.2 GOPS      2.77 GOPS         86.5%
Integer VM Multiply    3.2 GOPS      3.00 GOPS         93.7%
FP MV Multiply         1.6 GFLOPS    1.40 GFLOPS       87.5%
FP VM Multiply         1.6 GFLOPS    1.59 GFLOPS       99.6%
AVERAGE                                                86.6%
Note: simulations did not include memory optimizations (address decoupling, small-stride optimizations, address hashing) or fixed-point multiply-add integer datapaths
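The percentage column is sustained divided by peak, and the AVERAGE row is the mean of the listed percentages. A short check, with the values copied from the slide (the variable names are illustrative):

```python
# (kernel, peak, sustained, listed % of peak); GOPS or GFLOPS as on the slide.
kernels = [
    ("Image Composition",   6.4, 6.40, 100.0),
    ("iDCT",                6.4, 1.97, 30.7),
    ("Color Conversion",    3.2, 3.07, 96.0),
    ("Image Convolution",   3.2, 3.16, 98.7),
    ("Integer MV Multiply", 3.2, 2.77, 86.5),
    ("Integer VM Multiply", 3.2, 3.00, 93.7),
    ("FP MV Multiply",      1.6, 1.40, 87.5),
    ("FP VM Multiply",      1.6, 1.59, 99.6),
]

# Recompute each percentage from the rounded peak and sustained figures.
recomputed = {name: 100.0 * sustained / peak for name, peak, sustained, _ in kernels}
average_listed = sum(pct for _, _, _, pct in kernels) / len(kernels)
largest_gap = max(abs(recomputed[name] - pct) for name, _, _, pct in kernels)

for name, _, _, pct in kernels:
    print(f"{name:20s} listed {pct:5.1f}%  recomputed {recomputed[name]:5.1f}%")
print(f"average of listed percentages: {average_listed:.1f}%")
```

The recomputed values agree with the slide to within a few tenths of a percentage point; the largest difference is for FP VM Multiply (about 99.4% recomputed against 99.6% listed), presumably because the sustained figure shown is itself rounded.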
Comparisons
Kernels: Image Composition, iDCT, Color Conversion, Image Convolution
VIRAM: 0.13 1.18 0.78 5.49
MMX: 3.75 (3.2x) 8.00 (10.2x) 5.49 (4.5x)
VIS: 2.22 (17.0x) 6.19 (5.1x)
TMS320C82: 5.70 (7.6x) 6.50 (5.3x)
All numbers in cycles/pixel
MMX, VIS, and TMS results assume all data in L1 cache
FFT Performance
[Figure: FFT execution time (microseconds, 0 to 200) versus size (#points in FFT: 128, 256, 512, 1024), with fixed-point (16 bit) and floating-point (32 bit) series. Labeled points: Pentium/200: 151 us; TMS320C67x: 124 us; PPC604e: 87 us; TigerSHARC: 41 us; VIRAM: 37 us; CRI Pulsar: 27.9 us; Wildstar: 25 us; CRI Pathfinder-1: 22.3 us.]
Note: simulations performed with unscheduled fixed-point code
Motion Estimation Performance
Size              VIRAM-1 (cycles)    MMX (cycles)
QCIF (176x144)    7.1x10^6 (4.6x)     3.3x10^7
CIF (352x288)     2.8x10^7 (5.0x)     1.4x10^8
Note: MMX results assume all data in L1 cache
Overall Performance of H.263
Akiyo (12.95 kbit/s): 23.5 fps
Mom (16.25 kbit/s): 22.7 fps
Hall (20.47 kbit/s): 22.7 fps
Foreman (65.52 kbit/s): 20.9 fps
Average encoding speed for H.263 on VIRAM for standard MPEG test sequences, using exhaustive search for motion estimation and LLM for DCT.
Note: simulations did not include memory optimizations (address decoupling, small-stride optimizations, address hashing) or fixed-point multiply-add integer datapaths
Summary Class Project Suggestions

Architecture comparisons & applications
  information retrieval
  signal processing apps
  neural nets training
Multimedia application analysis
  operand reuse patterns
  branch behavior
  data/value locality and memory access patterns
Low power/energy architectures
  energy-exposed ISA design
  compilation for low energy
  speculation use for power reduction
