Vector IRAM: A Microprocessor Architecture For Media Processing
Christoforos E. Kozyrakis
kozyraki@cs.berkeley.edu
CS252 Graduate Computer Architecture February 10, 2000
Outline
Motivation for IRAM
  technology trends
  design trends
  application trends
Vector IRAM
  instruction set
  prototype architecture
  performance
[Figure: performance (log scale, 1 to 1000) versus time; the CPU curve ("Moore's Law") improves at 60%/yr.]
2/10/2000 C.E. Kozyrakis, U.C. Berkeley Page 3
Processor-DRAM Tax
[Figure: bar chart of transistor count (million transistors, axis 0 to 30), split into logic and memory, for Intel PIII Xeon, MIPS R12000, HP PA-8500, Sun Ultra-2, PowerPC G4, IBM Power3, AMD Athlon, and Alpha 21264.]
Power Consumption
[Figure: scatter plot of performance (Spec95FP, 0 to 60) versus power (W, 0 to 80) for Alpha 21264, AMD Athlon, IBM Power3, PowerPC G4, Sun Ultra-2, HP PA-8500, MIPS R12000, and Intel PIII Xeon.]
Which one is the best?
  Statistical (average): C
  Real time (worst case): A
[Figure: performance versus inputs, with best case, average, and worst case marked.]
Design scalability
  performance scalability
  physical design scalability
    design complexity, verification complexity
  immunity to interconnect scaling problems
    locality of interconnect, tolerance to latency
System-on-a-chip (SoC)
  highly integrated system
  low system chip-count
[Figure: block diagram with Proc, Bus, I/O, and DRAM blocks; the labels "DRAM" and "fab" appear on the diagram.]
Vector IRAM
Vector processing
  high-performance for media processing
  low power/energy for processor control
  modularity, low complexity
  scalability
  well understood software development
Embedded DRAM
  high bandwidth for vector processing
  low power/energy for memory accesses
  modularity, scalability
  small system size
Control Regs: vcr0, vcr1, ..., vcr31 (64b each)
Scalar Regs: vs0, vs1, ..., vs31 (64b each)
Fixed-point Multiply-add
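[Figure: datapath in two steps. Mul & Shift Right & Round: x and y (n/2 bits each) are multiplied into an n-bit product, which is shifted and rounded. Add & Sat: the result is added to z (n bits) and saturated to n bits.]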
Multiply halves & shift instruction provides support for any fixed-point format
Precision is equal to the datatype width; multiplier inputs have half the width
Uniform, simple support for all datatypes
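The multiply, shift, round, add, and saturate steps on this slide can be modeled in a few lines of Python. This is a minimal sketch rather than the VIRAM hardware definition: the function names are hypothetical, and signed two's-complement operands and round-to-nearest (adding half an LSB before the shift) are assumptions made for illustration.

```python
def saturate(value, bits):
    """Clamp value to the signed two's-complement range of `bits` bits."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def mul_shift_round(x, y, n, shift):
    """Multiply two signed n/2-bit values, shift right, round to nearest."""
    half = n // 2
    lo, hi = -(1 << (half - 1)), (1 << (half - 1)) - 1
    if not (lo <= x <= hi and lo <= y <= hi):
        raise ValueError("multiplier inputs must fit in n/2 bits")
    product = x * y  # the product of two n/2-bit values fits in n bits
    if shift > 0:
        # Assumed rounding rule: add half an LSB, then arithmetic shift right.
        product = (product + (1 << (shift - 1))) >> shift
    return product


def fixed_madd(x, y, z, n=16, shift=0):
    """Return z + round((x * y) >> shift), saturated to n bits."""
    return saturate(z + mul_shift_round(x, y, n, shift), n)
```

With n = 16 and a shift of 7, the value 64 stands for 0.5 in a format with 7 fractional bits, so `fixed_madd(64, 64, 10, n=16, shift=7)` yields 42 (0.25 is 32 in that format, plus 10). Changing `shift` is what selects the fixed-point format.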
VIRAM-1 prototype
Design Overview
64b MIPS scalar core
  coprocessor interface
  16KB I/D caches
Memory system
  8 2MByte eDRAM banks
  single sub-bank per bank
  256-bit synchronous interface, separate I/O signals
  20ns cycle time, 6.6ns column access
  crossbar interconnect for 12.8 GB/sec per direction
  no caches
Vector unit
  8KByte vector register file
  support for 64b, 32b, and 16b data-types
  2 arithmetic (1 FP), 2 flag processing, 1 load-store units
  4 64-bit datapaths per unit
  DRAM latency included in vector pipeline
  4 addresses/cycle for strided/indexed accesses
  2-level TLB
Network interface
  user-level message passing
  dedicated DMA engines
  4 100MByte/s links
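The "4 addresses/cycle for strided/indexed accesses" figure bounds how quickly a non-unit-stride vector access can issue. The sketch below is illustrative only: it assumes addresses are generated four at a time in element order, it ignores bank conflicts and TLB misses, and all function names are hypothetical.

```python
import math

ADDRESSES_PER_CYCLE = 4  # taken from the design overview


def strided_addresses(base, stride, vector_length):
    """Byte addresses touched by a strided vector access."""
    return [base + i * stride for i in range(vector_length)]


def indexed_addresses(base, offsets):
    """Byte addresses touched by an indexed (gather/scatter) access."""
    return [base + off for off in offsets]


def address_generation_cycles(vector_length, per_cycle=ADDRESSES_PER_CYCLE):
    """Lower bound on cycles needed just to generate the addresses."""
    return math.ceil(vector_length / per_cycle)
```

Under these assumptions a 64-element strided load needs at least `address_generation_cycles(64)`, that is 16 cycles, of address generation regardless of the stride.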
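Non-Delayed Pipeline
[Figure: scalar pipeline stages F, D, X, M, W feeding the VLOAD, VALU, and VSTORE pipelines (VR, X1, X2, ..., XN, VW) for a repeating vld, vadd, vst sequence.]
[Figure: the same VLOAD, VALU, and VSTORE pipelines with a DELAY stage added (VR, X1, ..., XN, VW).]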
[Figure: vector lane organization: Xbar I/F, Integer Datapath 0, Vector Registers, Flag Regs. & Datapath, FP Datapaths, Integer Datapath 1, Xbar I/F.]
VIRAM-1 Floorplan
[Figure: floorplan. Top row: DRAM Banks 0, 2, 4, 6. Middle row: NI, MIPS, CTL, and Vector Lanes 0 to 3. Bottom row: DRAM Banks 1, 3, 5, 7.]
Prototype Summary
Technology:
  0.18um eDRAM CMOS process (IBM)
  6 layers of copper interconnect
  1.2V and 1.8V power supply
Memory: 16 MBytes
Clock frequency: 200MHz
Power: 2 W for vector unit and memory
Transistor count: ~140 million
Peak performance:
  GOPS with multiply-add: 3.2 (64b), 6.4 (32b), 12.8 (16b)
  GOPS without multiply-add: 1.6 (64b), 3.2 (32b), 6.4 (16b)
  GFLOPS: 1.6 (32b)
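The peak figures are consistent with a simple product of unit count, datapaths per unit, sub-word parallelism, and clock rate, using the 2 arithmetic units (1 FP) and 4 64-bit datapaths per unit from the design overview. The sketch below reproduces them under that reading. Treating each 64-bit datapath as 64/width parallel operations and counting a multiply-add as two operations are assumptions, not statements from the slides, and the function name is hypothetical.

```python
CLOCK_HZ = 200e6          # clock frequency from the prototype summary
DATAPATHS_PER_UNIT = 4    # 64-bit datapaths per functional unit
DATAPATH_BITS = 64


def peak_gops(width_bits, units, ops_per_instruction=1):
    """Peak billions of operations per second for a given element width."""
    subwords = DATAPATH_BITS // width_bits  # assumed sub-word parallelism
    ops_per_cycle = units * DATAPATHS_PER_UNIT * subwords * ops_per_instruction
    return ops_per_cycle * CLOCK_HZ / 1e9


for width in (64, 32, 16):
    plain = peak_gops(width, units=2)
    madd = peak_gops(width, units=2, ops_per_instruction=2)
    print(f"{width}b: {plain:.1f} GOPS, {madd:.1f} GOPS with multiply-add")

# Only one of the two arithmetic units supports floating point.
print(f"32b FP: {peak_gops(32, units=1):.1f} GFLOPS")
```

Under these assumptions the model gives 1.6, 3.2, and 6.4 GOPS for 64b, 32b, and 16b data, double that with multiply-add, and 1.6 GFLOPS for 32b floating point, matching the listed values.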
Kernels Performance
Kernel                 Peak Perf.    Sustained Perf.   % of Peak
Image Composition      6.4 GOPS      6.40 GOPS         100.0%
iDCT                   6.4 GOPS      1.97 GOPS         30.7%
Color Conversion       3.2 GOPS      3.07 GOPS         96.0%
Image Convolution      3.2 GOPS      3.16 GOPS         98.7%
Integer MV Multiply    3.2 GOPS      2.77 GOPS         86.5%
Integer VM Multiply    3.2 GOPS      3.00 GOPS         93.7%
FP MV Multiply         1.6 GFLOPS    1.40 GFLOPS       87.5%
FP VM Multiply         1.6 GFLOPS    1.59 GFLOPS       99.6%
AVERAGE                                                86.6%
Note: simulations did not include memory optimizations (address decoupling, small-stride optimizations, address hashing) or fixed-point multiply-add integer datapaths.
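The "% of Peak" column is the ratio of sustained to peak performance, and the AVERAGE row is the arithmetic mean of that column. The sketch below recomputes both from the table values. It is a check under those assumptions, with hypothetical variable and function names; recomputed ratios can differ from the reported column in the last digit because the sustained figures are rounded.

```python
# (kernel, peak, sustained, reported % of peak), copied from the table
kernels = [
    ("Image Composition", 6.4, 6.40, 100.0),
    ("iDCT", 6.4, 1.97, 30.7),
    ("Color Conversion", 3.2, 3.07, 96.0),
    ("Image Convolution", 3.2, 3.16, 98.7),
    ("Integer MV Multiply", 3.2, 2.77, 86.5),
    ("Integer VM Multiply", 3.2, 3.00, 93.7),
    ("FP MV Multiply", 1.6, 1.40, 87.5),
    ("FP VM Multiply", 1.6, 1.59, 99.6),
]


def percent_of_peak(peak, sustained):
    """Sustained performance as a percentage of peak."""
    return 100.0 * sustained / peak


for name, peak, sustained, reported in kernels:
    recomputed = percent_of_peak(peak, sustained)
    print(f"{name:22s} reported {reported:5.1f}%  recomputed {recomputed:5.1f}%")

average_reported = sum(row[3] for row in kernels) / len(kernels)
print(f"Average of reported column: {average_reported:.1f}%")
```

The mean of the reported column rounds to 86.6%, which agrees with the AVERAGE row. iDCT is the outlier that pulls the average down.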
Comparisons
[Table: VIRAM results for Image Composition, iDCT, Color Conversion, and Image Convolution.]
All numbers in cycles/pixel
FFT Performance
[Figure: FFT execution time in microseconds (axis 0 to 200).]
TMS320C67x: 124 us
PPC604e: 87 us
TigerSHARC: 41 us
VIRAM: 37 us
CRI Pathfinder-1: 22.3 us
CRI Pulsar: 27.9 us
Wildstar: 25 us
Size
2.8x10^7 (5.0x)
1.4x10^8
Average encoding speed for H.263 on VIRAM: standard MPEG test sequences, using exhaustive search for motion estimation and LLM for DCT.
Note: simulations did not include memory optimizations (address decoupling, small-stride optimizations, address hashing) or fixed-point multiply-add integer datapaths.