Download as pdf or txt
Download as pdf or txt
You are on page 1of 71

Design with Microprocessors

1
Hard are platform architecture
Hardware architect re

2
CPUs
„ CPU performance
Cycle
C l titime.
CPU pipeline.
„ Latency &Throughput
Memory system.
„ Indeterminacy in execution
„ Cache miss: compulsory, conflict, capacity
„ CPU power consumption.
„ Compare
ARM7, TI C54x
ARM7 C54x, TI 60x DSPs,
DSPs TriMedia,
TriMedia Pentium
MMX, Xilinx Vertex II, single purpose controllers

3
Selecting a Microprocessor
„ Issues
Technical:
T h i l speed,d power, size,
i costt
Other: development environment, prior expertise, licensing, etc.
„ Speed: how evaluate a processor’s speed?
Clock speed – but instructions per cycle may differ
Instructions per second – but work per instr. may differ
Dhrystone: Synthetic benchmark, developed in 1984.
Dhrystones/sec.
„ MIPS: 1 MIPS = 1757 Dhrystones per second (based on Digital’s
VAX 11/780). A.k.a. Dhrystone MIPS. Commonly used today.
So 750 MIPS = 750*1757
So, 750 1757 = 1,317,750
1 317 750 Dhrystones per second
SPEC: set of more realistic benchmarks, but oriented to
desktops
EEMBC – EDN Embedded Benchmark Consortium,
www.eembc.org
„ Suites of benchmarks: automotive, consumer electronics,
4
networking, office automation, telecommunications
General Purpose
P rpose Processors
Processor Clock speed Periph. Bus Width MIPS Power Trans. Price
General Purpose Processors
Intel Xeon 3.5GHz L1 16KB data 64 ~900 150W ~1.3B ~$2000
65nm tech. L2 8MB, L3 16MB
IBM 550 MHz 2x32 K 32/64 ~1300 5W ~7M $900
PowerPC L1, 256K
750X L2
MIPS 250 MHz 2x32 K 32/64 NA NA 3 6M
3.6M NA
R5000 2 way set assoc.
StrongARM 233 MHz None 32 268 1W 2.1M NA
SA-110
Microcontroller
Intel 12 MHz 4K ROM, 128 RAM, 8 ~1 ~0.2 W ~10K $7
8051 32 I/O
I/O, Timer,
Ti UART
Motorola 3 MHz 4K ROM, 192 RAM, 8 ~.5 ~0.1 W ~10K $5
68HC811 32 I/O, Timer, WDT,
SPI
Digital Signal Processors
TI C5416 160 MHz 128K,, SRAM,, 3 T1 16/32 ~600 0.22 W NA $34
Ports, DMA, 13
ADC, 9 DAC
TMS320 80 MHz 32KB on chip 32 80 0.01 W NA $8

S
Sources: I t l Motorola,
Intel, M t l MIPS,
MIPS ARM
ARM, TI,
TI andd IBM Website/Datasheet;
W b it /D t h t Embedded
E b dd d Systems
S t Programming
P i

5
RISC vs.
s CISC
„ C
Complex
l iinstruction
t ti sett computer
t (CISC)
(CISC):
many addressing modes;
many operations.
„ Reduced instruction set computer (RISC):
load/store;
pipelinable
p p instructions.

6
Parallelism in programs
„ Parallelism eexists
ists in
several levels of
granularity:
Task. P1 P2
Data.
Instruction. Ld r1, r2
Add r3,r4
„ Instruction dependency
p y
Sub r5
r5,r6
r6 P3
Data and resource
Check at compile &/or
run time
ti
7
Parallelism e
extraction
traction
„ Static
Static: „ D namic
Dynamic:
Use compiler to Use hardware to
analyze program. identify opportunities.
Simpler CPU control. More complex CPU.
Can make use of high- Can make use of data
level language values.
constructs.
Can’tt depend on data
Can
values.

8
S perscalar
Superscalar
n
„ RISC - 1 iinst/cycle
t/ l
„ Superscalar – n inst/cycle Execution
n
„ n2 HW for n-instr parallel unit
Execution
unit

Register file

9
n
Simple VLIW architecture
architect re
„ Compile time assignment of instr
instructions
ctions to FUs
„ Large register file feeds multiple function units.
E box
Add r1,r2,r3; Sub r4,r5,r6; Ld r7,foo; St r8,baz; NOP

Register file

ALU ALU Load/store Load/store FU

10
Cl t d VLIW architecture
Clustered hit t
„ Register file,
file function
f nction units
nits di
divided
ided into cl
clusters.
sters

Cluster bus

Execution Execution

Register file Register file

11
T pes of CPUs used
Types sed in ES
„ RISC CPUs
ARM 7
„ CISC CPUs
TI C54x
„ VLIW
TI C6x
TriMedia
„ FPGA – Programmable
P bl CPU
CPUs
Virtex II
„ Single purpose processors
12
ARM7 design

„ ARM assembly language - RISCy


„ ARM p
programming
g g model
Audio players, pagers etc.; 130 MIPS
„ ARM memory organization.
„ ARM data operations (32 bit)
„ ARM flow of control.

13
ARM programming model

r0 r8
r1 r9 0
31
r2 r10
r3 r11 CPSR
r4 r12
r5 r13
r6 r14 NZCV
r7 r15 (PC)
( )

14
ARM status
stat s bits
„ Every arithmetic,
E ith ti llogical,
i l or shifting
hifti
operation sets CPSR bits:
N (negative), Z (zero), C (carry), V (overflow).
„ Examples:
-1 + 1 = 0: NZCV = 0110.
231-1+1 = -231: NZCV = 1001.

15
ARM pipeline e
execution
ec tion

fetch decode execute add r0,r1,#5

sub r2,r3,r6 fetch decode execute

cmp r2,#3 fetch decode execute

ti
time
1 2 3

16
ARM data instructions
instr ctions
„ ADD, ADC : add ((w.
ADD „ AND, ORR,
AND ORR EOR
carry) „ BIC : bit clear
„ SUB SBC : subtract
SUB, „ LSL LSR : llogical
LSL, i l
(w. carry) shift left/right
„ MUL MLA : multiply
MUL, „ ASL ASR : arithmetic
ASL, ith ti
(and accumulate) shift left/right
„ ROR : rotate right
„ RRX : rotate right
extended with C
17
ARM flow
flo of control
„ All operations
ti can bbe performed
f d
conditionally, testing CPSR:
EQ, NE, CS, CC, MI, PL, VS, VC,
HI, LS, GE, LT, GT, LE
„ B
Branch
h operation:
i
B #100
Can be performed conditionally.

18
ARM comparison instructions
instr ctions
„ CMP : compare
„ CMN : negated compare
„ TST : bit-wise AND
„ TEQ : bit-wise
bit wise XOR
„ These instructions set only the NZCV bits
of CPSR.
CPSR

19
ARM lload/store/move
d/ t / iinstructions
t ti
„ LDR, LDRH
LDR LDRH, LDRB : load (half
(half-word,
word byte)
„ STR, STRH, STRB : store (half-word,
byte)
„ Addressing modes:
register
i t indirect
i di t : LDR r0,[r1]
0 [ 1]
with second register : LDR r0,[r1,-r2]
with constant : LDR r0,[r1,#4]
0 [ 1 #4]
„ MOV, MVN : move (negated)
MOV r0,
0 r1 1 ; sets r0 0 to r1
1
20
Addressing modes
„ Base plus offset addressing:
Base-plus-offset
LDR r0,[r1,#16]
Loads from location r1+16
„ Auto-indexing increments base register:
LDR r0,[r1,#16]!
0 [ 1 #16]!
„ Post-indexing fetches, then does offset:
LDR r0,[r1],#16
0 [ 1] #16
Loads r0 from r1, then adds 16 to r1.

21
ARM subroutine
s bro tine linkage
„ B
Branch
h and
d lilink
k iinstruction:
t ti
BL foo
Copies current PC to r14.
„ To return from subroutine:
MOV r15,r14

22
ARM Summary
S mmar
„ Load/store
L d/ t architecture
hit t
„ Most instructions are RISCy, operate in
single cycle.
Some multi-register operations take longer.
„ All instructions can be executed
y
conditionally.

23
M ltimedia CPUs
Multimedia
Manyy registers,
g , adders etc are wide ((32/64 bit),
),
Multimedia data types are narrow
•e.g. 8 bit per color, 16 bit per audio sample per channel )
2 8 values can be stored per register and added
2-8

+
4 additions per instruction;
carry disabled at word
boundaries.
24
P ti
Pentium MMX
64-bit vectors representing 8 byte encoded, 4 word encoded or 2 double word
encoded numbers.
wrap around/saturating options.
Multimedia registers mm0 - mm7,
consistent with floating-point registers (OS unchanged).

Instruction Options Comments


Padd[b/w/d] wrap around, addition/subtraction of
PSub[b/w/d] saturating bytes, words, double words
Pcmpeq[b/w/d] Result= "11
11..11
11" if true,
true "00
00..00
00" otherwise
Pcmpgt[b/w/d] Result= "11..11" if true, "00..00" otherwise
Pmullw multiplication, 4*16 bits, least significant word
Pmulhw multiplication, 4*16 bits, most significant word

25
P ti
Pentium MMX
Psra[w/d]
P [ /d] No. off
N Parallel
P ll l shift
hif off words,
d d double
bl words
d
Psll[w/d/q] positions in or 64 bit quad words
Psrl[w/d/q] register or
instruction
Punpckl[bw/wd/dq] Parallel unpack
Punpckh[bw/wd/dq] Parallel unpack
Packss[wb/dw] saturating Parallel pack
Pand, Pandn Logical operations on 64 bit words
Por Pxor
Por,
Mov[d/q] Move instruction

26
DSPs
„ TI C5x
C5 DSP:
DSP
Basic features.
C54x architecture and programming.
C55x architecture and programming.
C55x co-processor.
C60x

27
C5 family
C5x famil
„ Fixed-point
Fi d i t DSP.
DSP
„ Modified Harvard architecture:
1 program memory bus.
3 data memory busses.
„ 40-bit ALU.
„ Multiple implementations:
1, 2 instructions/cycle.

28
TI C54x
C54 architectural
hit t l ffeatures
t
„ 40-bit
40 bit ALU + bbarrell shifter.
hift
„ Multiple internal busses: 1 instruction, 3
data, 4 address.
„ 17 x 17 multiplier.
p
„ Single-cycle exponent encoder.
„ Two address generators with dedicated
registers.

29
TI C54x
C54 iinstruction
t ti sett ffeatures
t
„ Specialized
S i li d iinstructions
t ti ffor Vit
Viterbi.
bi
„ Repeat and block repeat instructions.
„ Instructions that read 2, 3 operands
simultaneously. y
„ Conditional store.
„ Fast return from interrupt
interrupt.

30
C54 CPU
C54x
„ 40-bit
40 bit ALU.
ALU
„ Two 40-bit accumulators.
„ Barrel shifter.
„ 17 x 17 multiplier/adder.
„ Compare/select/store (CSSU) unit.

31
C54 architectural
C54x hit t l elements
l t
„ ALU:
40-bit arithmetic, Boolean operations.
Two 16-bit operations when status register 1 C16 bit is set.
„ Accumulators:
Low-order (0-15), high-order (16-31), guard (32-39).
„ Barrel shifter:
Input from accumulator or data memory.
Output to ALU
ALU.
„ Multiplier:
17 x 17 multiply with 40-bit accumulate.
„ CSSU unit:
Compares high and low accumulator words.
Accelerates Viterbi operations.

32
C54x registers
„ Status reigsters ST0, ST1:
Arithmetic, bit manipulation flags.
Data page pointer,
pointer auxiliary
a iliar register pointer
pointer.
Processor modes.
„ Auxiliary registers:
Used to generate 16-bit
16 bit data space addresses
addresses.
„ Temporary register:
Used to hold one multiplicand or dynamic shift count.
„ Transition register:
g
Used for Viterbi operations.
„ Stack pointer:
Top of system stack.
„ Circular buffer size register.
„ Block-repeat registers.
„ I t
Interrupt
t registers.
i t
„ Processor mode status register. 33
C54 pipeline
C54x
„ Program prefetch.
prefetch Send PC address on program
address bus.
„ Fetch Load instruction from program bus to IR
Fetch. IR.
„ Decode.
„ A
Access. P t operand
Put d addresses
dd on b
busses.
„ Read. Get operands from busses.
„ E
Execute.
t

34
C54 power
C54x po er do
down
n modes
„ Th
Three IDLE iinstructions:
t ti
IDLE1 shuts down CPU.
IDLE2 shuts down CPU and on-chip
peripherals.
IDLE3 shuts
h down
d chip
hi completely
l l (i (including
l di
PLL).

35
C54 busses
C54x b sses
„ PB: program read bus
bus.
„ CB, DB: data read busses.
„ EB: data write bus
bus.
„ PAB, CAB, DAB, EAB: address busses.
„ Can generate two data memory addresses
per cycle.
Stored in auxiliaryy register
g address units
ARAU0, ARAU1.

36
Addressing Modes
Addressing Register-file Memory
mode Operand field contents contents

Immediate Data

Register-direct
Register address Data

Register
Register address Memory address Data
indirect

Direct Memory address Data

Indirect Memory address Memory address

Data

37
C
Common addressing
dd i modes
d
„ ARn (*):
( ): indirect through auxiliary
registers.
„ DP (@): direct addressing offset from DP
register.
„ K23 ((#):
) absolute addressing g using g label.
„ Bit addressing (BIT instruction): modify a
single bit of a memory location or MMR
register.
i t

38
C54 instructions
C54x instr ctions
„ ABDST: absolute „ DELAY: memory delay
distance „ DSUB: double subtract
„ ADD „ EXP: accumulator
„ ADDC: add w. carry exponent
„ ADDM: add immediate to „ LMS: least mean square
mem „ MAC: multiply
„ ADDS: add w/o sign accumulate
extension „ MACA: multiply by
„ DADD: double add MACA, add to MACB

39
C54 instructions,
C54x instr ctions cont’d
cont’d.
„ MACP: multiply by „ POLY: evaluate
program memory, polynomial
then accumulate „ RND: round
„ MAS multiply
MAS: lti l by
b TT, accumulator
l t
then subtract „ SAT: saturate
„ MAX, MIN accumulator
„ MPY: multiply „ SQUR: square
„ NEG: negate „ SUB: subtract
„ NORM: normalize

40
C54 instructions,
C54x instr ctions cont’d
cont’d.
„ AND „ SFTA: shift accumulator
„ BIT: test bit arithmetically
„ BITF: test bit shown by „ XOR
immediate „ MVDD: move within data
„ CMPL: complement memory
accumulator „ MVDP: move data to
„ OR program memory
„ ROL: rotate accumulator „ READA: read data
left addressed by ACCA
„ WRITA: write data
addressed by ACCA

„ A d many more….
And
41
C55 pipeline
C55x
„ Two
T
segments:
Fetch.
Execute. fetch execute

4 7-8

42
C55 fetch segment
C55x
„ Prefetch 1:
1
Send address to memory.
„ Prefetch 2:
Wait for response.
„ Fetch:
Get instruction from memory and put in IBQ.
„ Predecode:
Identify where instructions begin and end; identif
parallel instructions.

43
C55 execute
C55x e ec te segment
„ Decode:
Decode an instruction pair or single instruction.
„ Address:
Perform address calculations.
„ Access 1/2:
Send address to memory; wait.
„ Read:
R d data
Read d t from
f memory. Evaluate
E l t condition
diti registers.
i t
„ Execute:
Read/modify registers. Set conditions.
„ W/W+:
Write data to MMR-addressed registers, memory; finish.

44
C55 organization
C55x organi ation
C,
B
D bus
D busses
3 data read busses 16
3 data read address busses 24
program address bus 24
program
read bus Program
32 Instruction Address Data
flow
unit unit unit
Dual
Dual-multiply
operand
Instruction
Data read
Single
Writes operand unit
read
coefficient
fetch
from memory
2 data write busses 16
2 data write address busses 24

45
I
Image/video
/ id hardware
h d extensions
t i
„ A il bl iin 5509 and
Available d 5510
5510.
Equivalent C-callable functions for other
d i
devices.
„ Available extensions:
DCT/IDCT.
Pixel interpolation
Motion estimation.

46
DCT/IDCT
„ 2-D
2 D DCT/IDCT is
i
computed from block
t
two 1-D
1D
DCT/IDCT.
Column DCT
Put data in
different banks to
maximize Row
R
interim DCT
throughput. DCT

47
C55 motion estimation
„ Search strategy:
Full vs. non-full.
„ Accuracy:
Full-pixel vs. half-pixel.
„ Number of returned motion vectors:
1 (one 16x16) vs
vs. 4 (four 8x8)
8x8).
„ Algorithms:
3-step algorithm (distance 4,2,1).
4-step algorithm (distance 8,4,2,1).
4-step with half-pixel refinement.

48
T pes of CPUs used
Types sed in ES
„ RISC CPUs
ARM 7
„ CISC CPUs
TI C54x
„ VLIW
TI C6x
T iM di
TriMedia
„ FPGA – Programmable CPUs
Vi t II
Virtex
49
VLIW TI C62/C67
VLIW:
„ Up to 8 instructions/cycle
instructions/cycle.
„ 32 32-bit registers.
„ Function
F ti units:
it
Two multipliers.
Si ALUs.
Six ALU
„ Data operations:
8/16/32-bit arithmetic.
40-bit operations.
Bit manipulation
i l ti operations.
ti
50
Partitioned register files
• Many memory ports are required to supply enough
operands per cycle.
cycle
• Memories with many ports are expensive.
) Registers are partitioned into sets, e.g. for TI C60x:

Data path A Data path B

register file A register file B

L1 S1 M1 D1 D2 M2 S2 L2
Address bus

Data bus
51
C6 data paths
C6x
„ General-purpose
G l register
i t fil
files (A and
dBB,
16 words each).
„ Eight function units:
.L1, .L2, .S1, .S2, .M1, .M2, .D1, .D2
„ Two load units (LD1, LD2).
„ Two store units (ST1,
(ST1 ST2)
ST2).
„ Two register file cross paths (1X and 2X).
„ Two
T data
d t address
dd paths
th (DA1 and d DA2)
DA2).
52
C6 function
C6x f nction units
nits
„ .L
L
32/40-bit arithmetic.
Leftmost 1 counting.
Logical ops.
„ .S
S
32-bit arithmetic.
32/40-bit shift and 32-bit field.
Branches.
Branches
Constants.
„ .M
16 x 16 multiply.
„ .D
32-bit add, subtract, circular address.
Load, store with 5/15-bit constant offset.

53
C6 system
C6x s stem
„ On-chip
O hi RAM.
RAM
„ 32-bit external memory: SDRAM, SRAM,
etc.
„ Host port.
p
„ Multiple serial ports.
„ Multichannel DMA
DMA.
„ 32-bit timer.

54
VLIW Trimedia TM
VLIW: TM-1
1

register file

read/write
d/ it crossbar
b

FU1 ... FU27

slot 1 slot 2 slot 3 slot 4 slot 5


TM 1 characteristics
TM-1
„ Characteristics
Floating point support
Sub-word parallelism
support
If Conversion
Additional custom
operations

56
Philips
TriMedia-
TriMedia
Processor
For multimedia-
applications, up to
5 instructions/
cycle.

57
DSPs
„ Greatt for
G f multimedia
lti di
„ CISC
MMX
TI C54x, C55x
„ VLIW
TI C6x
TriMedia

58
VIRTEX II FPGAs

59
Config rable Logic Block (CLB)
Configurable

60
2 carry paths per
CLB (Vertex II
Pro)

[© and
d source: Xilinx
Xili IInc.: Vi
Virtex-II
t II P Pro™
™ Platform
Pl tf FPGA
FPGAs: Functional
F ti l
Description, Sept. 2002, //www.xilinx.com]

61
Virte II Slice
Virtex Example:
Look-up tables LUT F and G can be used to a b c d G
compute any Boolean function of ≤ 4 variables
variables. 0 0 0 0 0
0 0 0 1 1
0 0 1 0 1
0 0 1 1 0
0 1 0 0 1
0 1 0 1 0
0 1 1 0 0
0 1 1 1 1
1 0 0 0 1
1 0 0 1 0
1 0 1 0 0
1 0 1 1 1
1 1 0 0 0
1 1 0 1 1
1 1 1 0 1
1 1 1 1 0
62
Virtex II Pro
Devices include
up to 4 PowerPC
processor cores


© and source: Xilinx Inc.: Virtex-II Pro™ Platform
f
FPGAs: Functional Description, Sept. 2002,
//www.xilinx.com]
63
Number of resources in Virtex II

[© and source: Xilinx Inc.: Virtex-II Pro™ Platform FPGAs:


Functional Description, Sept. 2002, //www.xilinx.com]
64
Single p rpose processors
Single-purpose
„ Performs specific computation task
„ Custom single-purpose processors
Designed for a unique task
„ Standard single-purpose processors
“Off-the-shelf”
Off-the-shelf -- pre-designed for a common
task
a.k.a.,, peripherals
p p
serial transmission
analog/digital
g g conversions

65
Timers counters
Timers, co nters
„ Timer: measures time intervals by counting clock pulses
To generate timed output events e.g., hold light for 10 s
To measure input events e.g., measure a car’s speed
„ Watchdog timer
Reset timer every X time units, else it generates a signal
Uses: detect failure, self-reset, timeout on an ATM machine
„ Counter: counts pulses on a general input signal
E.g. count cars passing over by a sensor
Timer/counter
Basic timer
Clk
2x1 16-bit up
16-bit up 16 Cnt
Clk 16 Cnt mux counter
counter
Cnt_in Top
Top
Reset
Reset
Mode
66
P lse width
Pulse idth mod
modulator
lator
„ Generates pulses with
specific high/low times pwm_o

„ Duty cycle: % time high clk


Square wave: 50% duty
25% duty cycle – average pwm_o is 1.25V
cycle
„ Common use: control pwm_o
average voltage to an
clk
electric device
Simpler than DC-DC 50% duty cycle – average pwm_o is 2.5V.
converter or digital-analog
converter pwm o
pwm_o
DC motor speed, dimmer clk
lights
T
75% duty cycle – average pwm_o is 3.75V.
„ Another use: encode
commands, receiver uses
timer to decode
67
LCD controller
E void WriteChar(char c){
communications
R/W bus
RS = 1; /* indicate data being sent */
RS
DATA_BUS = c; /* send data to LCD */
DB7–DB0 EnableLCD(45); /* toggle the LCD with appropriate delay */
8
}
microcontroller C
LCD
controller

CODES RS R/W DB7 DB6 DB5 DB4 DB3 DB2 DB1 DB0 Description
I/D = 1 cursor moves left DL = 1 88-bit
bit
0 0 0 0 0 0 0 0 0 1 Clears all display, return cursor home
I/D = 0 cursor moves right DL = 0 4-bit
S = 1 with display shift N = 1 2 rows 0 0 0 0 0 0 0 0 1 * Returns cursor home
S/C =1 display shift N = 0 1 row
Sets cursor move direction and/or
0 0 0 0 0 0 0 1 I/D S
S/C = 0 cursor movement F = 1 5x10 dots specifies not to shift display
R/L = 1 shift to right F = 0 5x7 dots ON/OFF of all display(D),
p y( ) cursor
0 0 0 0 0 0 1 D C B
ON/OFF (C), and blink position (B)
R/L = 0 shift to left
0 0 0 0 0 1 S/C R/L * * Move cursor and shifts display

Sets interface data length, number of


0 0 0 0 1 DL N F * *
display lines, and character font

1 0 WRITE DATA Writes Data

68
Ke pad controller
Keypad
N1
N2
N3 k_pressed
N4

M1
M2
M3
M4 4
key_code key_code

keypad controller

N=4, M=4

69
S mmar
Summary
„ RISC CPUs
ARM 7
„ CISC CPUs
TI C54x
„ VLIW
TI C6x
TriMedia
„ FPGA – Programmable
P bl CPU
CPUs
Virtex II
„ Single purpose processors
70
Sources and References

„ Peter Marwedel, “Embedded Systems


D i ” 2004
Design,” 2004.
„ Frank Vahid, Tony Givargis, “Embedded
System Design,” Wiley, 2002.
„ Wayne
y Wolf,, “Computers
p as
Components,” Morgan Kaufmann, 2001.

71

You might also like