VLIW Architectures For DSP: A Two-Part Lecture

VLIW Architectures for DSP
VLIW Architectures for DSP:

A Two-Part Lecture
Berkeley Design Technology, Inc.
www.BDTI.com
Copyright © 1999 Berkeley Design Technology, Inc.

1
Outline
u Part I: VLIW basics and a case study

l What's VLIW?
l Why VLIW?
l The TMS320C62xx
l Advantages, disadvantages of VLIW
u Part II: Other VLIW DSP architectures

l StarCore SC140
l ADI TigerSHARC
l Infineon Carmel
© 1999 Berkeley Design Technology, Inc. 2
© 1999 Berkeley Design Technology, Inc.
1
Why VLIW?
u Until ~1997, most DSP processors were very similar

l Specialized execution units
l Specialized instruction sets
• Difficult to program in assembly

• Unfriendly compiler targets
l One instruction per instruction cycle
u VLIW architectures execute multiple instructions/cycle

and use simple, regular instruction sets
l More parallelism, higher performance
l Better compiler targets
VLIW vs Superscalar
Memory Instruction Execution Units

scheduling,
INS 1 dispatch ALU MAC BMU • • •
Time
INS 2 INS 1 INS 2

INS 3 ?
INS 3 INS 4
•
• INS 6 INS 5
•
INS n
2
Characteristics of VLIW Processors
u Multiple independent instructions per cycle, packed

into single large "instruction word" or "packet"
l Instructions may be positional, or may include
routing information within each sub-instruction
u Large complement of independent execution units
u More regular, orthogonal, RISC-like instructions

l Usually wider than typical DSP instructions
l Usually simpler than typical DSP instructions
u Large, uniform register sets
u Wide program and data buses

Example VLIW DSP:

The TI TMS320C62xx
On-Chip Program Memory
2 independent 32x8=256 bits

data paths, Dispatch (8 instructions)
8 execution units
Unit
L1 S1 M1 D1 L2 S2 M2 D2
Register File A Register File B
L: ALU
32 32
S: Shifter, ALU
M: Multiplier
On-Chip Data Memory
D: Address gen.
3
FIR Filtering on the 'C62xx
LOOP:
ADD .L1 A0,A3,A0
Can execute ||ADD .L2 B1,B7,B1
up to eight ||MPYHL .M1X A2,B2,A3
32-bit ||MPYLH .M2X A2,B2,B7
instructions ||LDW .D2 *B4++,B2
in parallel ||LDW .D1 *A7--,A2
||[B0] ADD .S2 -1,B0,B0
||[B0] B .S1 LOOP
Compare to a conventional DSP...

dotprod:
MR=MR+MX0*MY0(SS), MX0=DM(I0,M0),MY0=PM(I4,M4);
Advantages of VLIW Architectures
u Increased performance
u Better compiler targets
u Potentially easier to program
u Potentially scalable
l Can add more execution units, allow more
instructions to be packed into the VLIW instruction
4
Disadvantages of VLIW Architectures
u New kinds of programmer/compiler complexity

• Programmer (or code-generation tool) must keep
track of instruction scheduling
• Deep pipelines and long latencies can be
confusing, may make peak performance elusive
u Increased memory use
• High program memory bandwidth requirements
u High power consumption
u Misleading MIPS ratings
Benchmark Results
Execution Time on Complex Block FIR
45 Microseconds
(lower is faster)
microseconds
30
15
0
ADSP-2189 DSP1620 DSP56311 '320C549 TMS320C6202
75 MHz 120 MHz 150 MHz 120 MHz 250 MHz
5
Benchmark Results
Memory Usage on FSM Benchmark
Bytes
200
(lower is better)
160
120
80
40
0
ADSP-218x DSP16xx DSP563xx '320C54x TMS320C62xx
For More Information...
Free resources on BDTI's web site,
www.bdti.com
l DSP Processors Hit the Mainstream
covers DSP architectural basics and new
developments. Originally printed in
IEEE Computer Magazine.
l Evaluating DSP Processor Performance,

a white paper from BDTI.
l Numerous other BDTI article reprints, slides
l comp.dsp FAQ
6
Outline
u Part I: VLIW basics and a case study

l What's VLIW?
l Why VLIW?
l The TMS320C62xx
l Advantages, disadvantages of VLIW
u Part II: Other VLIW DSP architectures

l StarCore SC140
l ADI TigerSHARC
l Infineon Carmel
StarCore SC140
u 16-bit fixed-point VLIW DSP core from Lucent/Motorola

u StarCore claims it's a scalable architecture
l First VLIW machine to target low-power apps
u More execution units (13) than 'C62xx (8), but fewer

instructions can be issued per cycle
l Six for SC140 vs eight for 'C62xx
MAC MAC MAC MAC

BMU ALU ALU ALU ALU
BFU BFU BFU BFU
7
StarCore SC140
u Uses 16-bit instructions with optional 16-bit prefixes

l Should have pretty good code density, better than
'C62xx ('C62xx uses fixed-width 32-bit instructions)
u Pipeline relatively simple and shallow (5 stages)
u Targeting 198 mW @ 300 MHz, 1.5 V
u Development chip expected late '99
u Lucent and Motorola will each create chips using the
SC140 core
l Motorola's MSC8101 sampling 1H00
ADI TigerSHARC
u 8-, 16-, 32-bit fixed-point and 32-bit floating-point

l Unusual data-type agility
u Combines VLIW with extensive SIMD (single instruction,

multiple data) to get massive parallelism
l Using SIMD, can perform eight 16x16-bit fixed-point
multiplications per cycle (4X the 'C62xx)
SIMD multiply instruction
ALU MAC Shift ALU MAC Shift
Four 16-bit multiplies Four 16-bit multiplies

8
ADI TigerSHARC
u "Hierarchical" SIMD is unusual

u Requires high on-chip data memory bandwidth
l Sixteen 16-bit data words/cycle
u Issues and executes up to four instructions per cycle

u Uses 32-bit instructions, like '62xx
l May have high program memory use
l Memory use may also be increased by data- and
algorithm-rearrangement needed for use of SIMD
u Targeting 250 MHz, expected to begin sampling
in late 1999
Infineon Carmel
u 16-bit fixed-point VLIW DSP core from Infineon (Siemens)

l In silicon at 120 MHz, 0.25 µm (development chip)
u Two data paths, six execution units
ALU MAC EXP Shift ALU MAC
u Mixed-width 24/48-bit instruction set

u Can execute in parallel:
l One 48-bit instruction, or
l One or two 24-bit instructions, or
l Up to six instructions as part of a "CLIW"
9
Infineon Carmel
CLIW (Configurable Long Instruction Word)
General format:
cliw name (operand1, ... , operand 4) {
ALU1 || MAC1 || ALU2 || MAC2 || MOV1 || MOV2
}
Example CLIW:
cliw fft4(r0+=rn0, r1, r4, r5) {
a2 = a1l * a0h
|| *ma1 = ff1 + ff2
|| *ma2 = a2 - a1h * a0h
|| a1h = *ma3 - *ma4
|| ff1 = *ma3
|| ff2 = *ma4;
}
Comparison
Data memory
Processor Issue bandwidth Instruction Clock Pipeline Notable
width (16-bit words) size (MHz) depth characteristics
1st VLIW-based
TMS320C62xx 8 4 words/cycle 32 bits 250 11
DSP processor
16 bits w/ Scalable, approach

SC140 6 8 words/cycle 300* 5
16-bit prefixes to compact code
SIMD + VLIW,
TigerSHARC 4 16 words/cycle 32 bits 250* 8
data type agility
CLIW instructions,
Carmel 2, 6 4 words/cycle 24/48 bits 120 8
4 AGUs
*Projected
10
Benchmark Results
Execution Time on Complex Block FIR
12
10
ted
8
jec
6
pro
4
TMS320C6202 Carmel SC140

250 MHz 120 MHz 300 MHz
For More Information...
Free resources on BDTI's web site,
http://www.BDTI.com
l DSP Processors Hit the Mainstream
covers DSP architectural basics and new
developments. Originally printed in
IEEE Computer Magazine.
l Evaluating DSP Processor Performance,

a white paper from BDTI.
l Numerous other BDTI article reprints, slides
l comp.dsp FAQ
11

VLIW Architectures For DSP: A Two-Part Lecture

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

VLIW Architectures For DSP: A Two-Part Lecture

Uploaded by

Copyright:

Available Formats

VLIW Architectures for DSP

VLIW Architectures for DSP:

Berkeley Design Technology, Inc.

Copyright © 1999 Berkeley Design Technology, Inc.

u Part I: VLIW basics and a case study

l Advantages, disadvantages of VLIW

u Part II: Other VLIW DSP architectures

© 1999 Berkeley Design Technology, Inc. 2

© 1999 Berkeley Design Technology, Inc.

u Until ~1997, most DSP processors were very similar

l Specialized instruction sets

• Difficult to program in assembly

u VLIW architectures execute multiple instructions/cycle

l Better compiler targets

© 1999 Berkeley Design Technology, Inc. 3

Memory Instruction Execution Units

INS 2 INS 1 INS 2

© 1999 Berkeley Design Technology, Inc. 4

© 1999 Berkeley Design Technology, Inc.

Characteristics of VLIW Processors

u Multiple independent instructions per cycle, packed

u Large complement of independent execution units

u More regular, orthogonal, RISC-like instructions

u Large, uniform register sets

u Wide program and data buses

Example VLIW DSP:

2 independent 32x8=256 bits

Register File A Register File B

© 1999 Berkeley Design Technology, Inc.

FIR Filtering on the 'C62xx

Compare to a conventional DSP...

© 1999 Berkeley Design Technology, Inc. 7

Advantages of VLIW Architectures

u Better compiler targets

u Potentially easier to program

© 1999 Berkeley Design Technology, Inc. 8

© 1999 Berkeley Design Technology, Inc.

Disadvantages of VLIW Architectures

u New kinds of programmer/compiler complexity

© 1999 Berkeley Design Technology, Inc. 9

© 1999 Berkeley Design Technology, Inc. 10

© 1999 Berkeley Design Technology, Inc.

© 1999 Berkeley Design Technology, Inc. 11

For More Information...

Free resources on BDTI's web site,

l Evaluating DSP Processor Performance,

© 1999 Berkeley Design Technology, Inc.

u Part I: VLIW basics and a case study

l Advantages, disadvantages of VLIW

u Part II: Other VLIW DSP architectures

© 1999 Berkeley Design Technology, Inc. 13

u 16-bit fixed-point VLIW DSP core from Lucent/Motorola

u More execution units (13) than 'C62xx (8), but fewer

MAC MAC MAC MAC

BFU BFU BFU BFU

© 1999 Berkeley Design Technology, Inc. 14

© 1999 Berkeley Design Technology, Inc.

u Uses 16-bit instructions with optional 16-bit prefixes

© 1999 Berkeley Design Technology, Inc. 15

u 8-, 16-, 32-bit fixed-point and 32-bit floating-point

u Combines VLIW with extensive SIMD (single instruction,

SIMD multiply instruction

ALU MAC Shift ALU MAC Shift