Unit II
• Custom Single-Purpose Processors
• General Purpose Processors
• Application-Specific Instruction-Set Processors (ASIPs)
Introduction
• Processor
◦ Digital circuit that performs computation tasks
◦ Controller and datapath
◦ General-purpose: variety of computation tasks
◦ Single-purpose: one particular computation task
◦ Custom single-purpose: non-standard task
• A custom single-purpose processor may be
◦ Fast, small, low power
◦ But, high NRE, longer time-to-market, less flexible
[Figure: Digital camera chip with blocks lens, CCD, A2D, CCD preprocessor, Pixel coprocessor, D2A, JPEG codec, Microcontroller, Multiplier/Accum, DMA controller, Display ctrl, Memory controller, ISA bus interface, UART, LCD ctrl]
Combinational Logic
Transistors and Logic gates:
A transistor is the basic electrical component in digital systems.
Combinations of transistors form more abstract components called logic gates, which designers use when building digital systems.
A transistor acts as a simple on/off switch. One type of transistor is CMOS.
The voltage at the “gate” controls whether current flows from source to drain.
[Figure: CMOS transistor on silicon inside an IC package, showing source, gate, drain, oxide, channel and silicon substrate; the transistor conducts if gate=1]
CMOS Transistor Implementations
• We refer to logic levels
◦ Typically 0 is 0V, 1 is 5V
• Two basic CMOS types
◦ nMOS conducts if gate=1
◦ pMOS conducts if gate=0
◦ Hence “complementary”
• Basic gates
◦ Inverter, NAND, NOR
[Figure: nMOS and pMOS transistor symbols, and CMOS implementations of the inverter (F = x'), the NAND gate (F = (xy)') and the NOR gate (F = (x+y)')]
Basic logic gates
Gate      Function       F for x = 0, 1
Driver    F = x          0, 1
Inverter  F = x'         1, 0

Gate      Function       F for xy = 00, 01, 10, 11
AND       F = xy         0, 0, 0, 1
OR        F = x + y      0, 1, 1, 1
XOR       F = x ⊕ y      0, 1, 1, 0
NAND      F = (xy)'      1, 1, 1, 0
NOR       F = (x + y)'   1, 0, 0, 0
XNOR      F = (x ⊕ y)'   1, 0, 0, 1
Combinational Logic Design
A combinational circuit is a digital circuit whose output is purely a function of its present inputs.
A) Problem description
y is 1 if a is 1, or b and c are 1. z is 1 if b or c is 1, but not both, or if all are 1.
B) Truth table
Inputs   Outputs
a b c    y z
0 0 0    0 0
0 0 1    0 1
0 1 0    0 1
0 1 1    1 0
1 0 0    1 0
1 0 1    1 1
1 1 0    1 1
1 1 1    1 1
C) Output equations
y = a'bc + ab'c' + ab'c + abc' + abc
z = a'b'c + a'bc' + ab'c + abc' + abc
D) Minimized output equations
K-map for y (rows a, columns bc):
a\bc  00 01 11 10
0      0  0  1  0
1      1  1  1  1
y = a + bc
K-map for z (rows a, columns bc):
a\bc  00 01 11 10
0      0  1  0  1
1      0  1  1  1
z = ab + b'c + bc'
E) Logic Gates
[Figure: gate-level circuit with inputs a, b, c and outputs y, z]
Combinational Components
A circuit with 16 inputs would have 2^16, or 64K, rows in its truth table, so large circuits are built from more abstract components rather than directly from a truth table.
The table below summarizes such combinational components, often called register-transfer or RT-level components.

Component: n-bit, m x 1 Multiplexor
  Inputs: I(m-1) ... I1, I0 (n bits each); select S0 ... S(log m)
  Output: O = I0 if S=0..00, I1 if S=0..01, ..., I(m-1) if S=1..11
Component: log n x n Decoder
  Inputs: I0 ... I(log n -1)
  Outputs: O0 = 1 if I=0..00, O1 = 1 if I=0..01, ..., O(n-1) = 1 if I=1..11
  Variation: with enable input e, all O's are 0 if e=0
Component: n-bit Adder
  Inputs: A, B
  Outputs: sum = A+B (first n bits), carry = (n+1)'th bit of A+B
  Variation: with carry-in input Ci, sum = A + B + Ci
Component: n-bit Comparator
  Inputs: A, B
  Outputs: less = 1 if A<B, equal = 1 if A=B, greater = 1 if A>B
Component: n-bit, m-function ALU
  Inputs: A, B; select S0 ... S(log m)
  Output: O = A op B, op determined by S
  Variation: may have status outputs carry, zero, etc.
Multiplexor (Selector):
A multiplexor allows only one of its data inputs to pass through to the output O.
If there are m data inputs, then there are log2(m) select lines S.
We call this an m-by-1 (m x 1) multiplexor.
The binary value of S determines which data input passes through.
For example, an 8x1 multiplexor has eight data inputs and thus three select lines.
Decoder
A decoder converts its binary input I into a one-hot output O.
One-hot means that exactly one of the output lines can be 1 at a given time.
If there are n outputs, then there must be log2(n) inputs.
For example, a 3x8 decoder has 3 inputs and 8 outputs.
A common feature on a decoder is an extra input called enable. When enable is 0, all outputs are 0. When enable is 1, the decoder functions as before.
Adder
An adder adds two n-bit binary inputs A and B, generating an n-bit output sum along with an output carry.
For example, a 4-bit adder would have a 4-bit A input, a 4-bit B input, a 4-bit sum output, and a 1-bit carry output.
If A were 1010 and B were 1001, then sum would be 0011 and carry would be 1.
Comparator
A comparator compares two n-bit binary inputs A and B, generating outputs that indicate whether A is less than, equal to, or greater than B.
Arithmetic-logic Unit
An ALU can perform a variety of arithmetic and logic functions on its n-bit inputs A and B.
The select lines S choose the current function: if there are m possible functions, then there must be at least log2(m) select lines.
Common functions include addition, subtraction, AND and OR.
Sequential Logic
A sequential circuit is a digital circuit whose outputs are a function of the present as well as previous input values.
Flip-Flops
One of the most basic sequential circuits is a flip-flop, which stores a single bit.
The simplest type is the D flip-flop, which has two inputs: D and clock.
When clock is 1, the value of D is stored in the flip-flop, and that value appears at the output Q. When clock is 0, the value of D is ignored.
An SR flip-flop has three inputs: S, R, and clock. When clock is 0, the previously stored bit is maintained and appears at output Q.
When clock is 1, the inputs S and R are examined.
In a JK flip-flop, when J and K are both 1, the stored bit toggles from 1 to 0 or 0 to 1.
RT-Level Sequential Components
We use more abstract sequential components for complex sequential systems.
n-bit Register (inputs I, load, clear; output Q)
  Q = 0 if clear=1,
      I if load=1 and clock=1,
      Q(previous) otherwise.
n-bit Shift register (inputs I, shift; output Q)
  Q = lsb
  - Content shifted
  - I stored in msb
n-bit Counter (output Q)
  Q = 0 if clear=1,
      Q(prev)+1 if count=1 and clock=1.
Register
A register stores n bits from its n-bit input data I, with those stored bits appearing at its output Q.
A register usually has at least two control inputs: clock and load.
For a rising-edge-triggered register, the inputs I are only stored when load is 1 and clock is rising from 0 to 1.
Another common register control input is clear, which resets all bits to 0, regardless of the value of I.
Shift Register
A shift register stores n bits, but these bits cannot be stored in parallel.
Instead, they must be shifted into the register serially, meaning one bit per clock edge.
Counter
A counter is a register that can also increment, meaning add binary 1 to its stored binary value.
A counter often also has a parallel load data input and an associated load control signal.
A common counter feature is both up and down counting (incrementing and decrementing), which requires an additional control input to indicate the count direction.
Sequential Logic Design
A) Problem Description
You want to construct a clock divider. Slow down your pre-existing clock so that you output a 1 for every four clock cycles.
B) State Diagram
Four states, 0 to 3. In each state, a=0 keeps the current state and a=1 moves to the next state (0, 1, 2, 3, then back to 0). x=0 in states 0, 1 and 2; x=1 in state 3.
C) Implementation Model
Combinational logic with inputs a, Q1, Q0 and outputs x, I1, I0; a state register that loads I1, I0 and outputs Q1, Q0.
D) State Table (Moore-type)
Inputs       Outputs
Q1 Q0 a      I1 I0 x
0  0  0      0  0  0
0  0  1      0  1  0
0  1  0      0  1  0
0  1  1      1  0  0
1  0  0      1  0  0
1  0  1      1  1  0
1  1  0      1  1  1
1  1  1      0  0  1
• Given this implementation model
◦ Sequential logic design quickly reduces to combinational logic design
E) Minimized Output Equations
K-map for I1 (rows a, columns Q1Q0):
a\Q1Q0  00 01 11 10
0        0  0  1  1
1        0  1  0  1
I1 = Q1'Q0a + Q1a' + Q1Q0'
K-map for I0 (rows a, columns Q1Q0):
a\Q1Q0  00 01 11 10
0        0  1  1  0
1        1  0  0  1
I0 = Q0a' + Q0'a
K-map for x (rows a, columns Q1Q0):
a\Q1Q0  00 01 11 10
0        0  0  1  0
1        0  0  1  0
x = Q1Q0
F) Combinational Logic
[Figure: gate-level implementation of I1, I0 and x from inputs a, Q1, Q0]
Custom Single-Purpose Processor Design
A basic processor consists of a controller and a datapath.
[Figure: Controller and datapath. The controller has external control inputs and outputs; the datapath has external data inputs and outputs; the two are linked by datapath control inputs and datapath control outputs.]
[Figure: A view inside the controller and datapath. Controller: next-state and control logic, state register. Datapath: registers, functional units.]
• The datapath stores and manipulates a system’s data.
• The datapath contains register units, functional units and connection units like wires and multiplexors.
• The datapath can be configured to read data from particular registers, feed that data through functional units configured to carry out particular operations like add or shift, and store the operation results back into particular registers.
• A controller carries out such configurations of the datapath.
• The controller sets the datapath control inputs, like register load and multiplexor select signals, of the register units, functional units, and connection units to obtain the desired configuration at a particular time.
Example: Greatest Common Divisor
• First create algorithm
• Convert algorithm to “complex” state machine
◦ Known as FSMD: finite-state machine with datapath
◦ Can use templates to perform such conversion
(a) black-box view
GCD block with inputs go_i, x_i, y_i and output d_o
(b) desired functionality
0: int x, y;
1: while (1) {
2:   while (!go_i);
3:   x = x_i;
4:   y = y_i;
5:   while (x != y) {
6:     if (x < y)
7:       y = y - x;
       else
8:       x = x - y;
     }
9:   d_o = x;
   }
(c) state diagram
1:   on 1 go to 2; on !1 leave the loop
2:   on !go_i go to 2-J; on !(!go_i) go to 3
2-J: return to 2
3:   x = x_i; go to 4
4:   y = y_i; go to 5
5:   on x!=y go to 6; on !(x!=y) go to 9
6:   on x<y go to 7; on !(x<y) go to 8
7:   y = y - x; go to 6-J
8:   x = x - y; go to 6-J
6-J: go to 5-J
5-J: return to 5
9:   d_o = x; go to 1-J
1-J: return to 1
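State Diagram Templates
Assignment statement
  Code: a = b; next statement
  Template: a state with action a = b, followed by the state of the next statement.
Loop statement
  Code: while (cond) { loop-body-statements } next statement
  Template: condition state C: goes to the loop-body statements on cond and to the next statement on !cond. The loop body ends in join state J:, which returns to C:.
Branch statement
  Code: if (c1) c1 stmts else if c2 c2 stmts else other stmts; next statement
  Template: condition state C: goes to the c1 stmts on c1, to the c2 stmts on !c1*c2, and to the other stmts on !c1*!c2. All branches meet in join state J:, which leads to the next statement.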
Creating the datapath
• Create a register for any declared variable
• Create a functional unit for each arithmetic operation
• Connect the ports, registers and functional units
◦ Based on reads and writes
◦ Use multiplexors for multiple sources
• Create unique identifier
◦ for each datapath component control input and output
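Splitting into a controller and datapath
(a) Controller
State encoding, state and datapath control action:
0000 1:
0001 2:    on !go_i go to 2-J; on !(!go_i) go to 3
0010 2-J:
0011 3:    x_sel = 0, x_ld = 1
0100 4:    y_sel = 0, y_ld = 1
0101 5:    on x_neq_y=1 go to 6; on x_neq_y=0 go to 9
0110 6:    on x_lt_y=1 go to 7; on x_lt_y=0 go to 8
0111 7:    y_sel = 1, y_ld = 1
1000 8:    x_sel = 1, x_ld = 1
1001 6-J:
1010 5-J:
1011 9:    d_ld = 1
1100 1-J:
Controller implementation model: combinational logic with inputs go_i, x_neq_y, x_lt_y, Q3 Q2 Q1 Q0 and outputs x_sel, y_sel, x_ld, y_ld, d_ld, I3 I2 I1 I0; a state register loads I3..I0 and outputs Q3..Q0.
(b) Datapath
Inputs x_i, y_i; output d_o.
Two n-bit 2x1 multiplexors (select signals x_sel, y_sel) feed registers 0: x and 0: y (load signals x_ld, y_ld).
Functional units: != comparator (5: x!=y, output x_neq_y), < comparator (6: x<y, output x_lt_y), subtractor (8: x-y), subtractor (7: y-x).
Register 9: d (load signal d_ld) drives d_o.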
RT-level Custom Single-Purpose Processor Design
In many cases, we prefer not to start with a program, but instead directly with an FSMD.
◦ The reason is that often the cycle-by-cycle timing of a system is central to the design, but programming languages don’t typically support cycle-by-cycle description.
• For example, consider the design problem
• where we want one device to send an 8-bit number to another device.
• The problem is that while the receiver can receive all 8 bits at once, the sender sends 4 bits at a time.
• So we need to design a bridge that will enable the two devices to communicate.
Problem Specification
Bridge: a single-purpose processor that converts two 4-bit inputs, arriving one at a time over data_in along with a rdy_in pulse, into one 8-bit output on data_out along with a rdy_out pulse.
[Figure: Sender connected to Bridge by rdy_in and data_in(4); Bridge connected to Receiver by rdy_out and data_out(8); clock input to the Bridge]
FSMD
Inputs
  rdy_in: bit; data_in: bit[4];
Outputs
  rdy_out: bit; data_out: bit[8]
Variables
  data_lo, data_hi: bit[4];
States
  WaitFirst4:      stay while rdy_in=0; go to RecFirst4Start when rdy_in=1
  RecFirst4Start:  data_lo=data_in; go to RecFirst4End
  RecFirst4End:    stay while rdy_in=1; go to WaitSecond4 when rdy_in=0
  WaitSecond4:     stay while rdy_in=0; go to RecSecond4Start when rdy_in=1
  RecSecond4Start: data_hi=data_in; go to RecSecond4End
  RecSecond4End:   stay while rdy_in=1; go to Send8Start when rdy_in=0
  Send8Start:      data_out=data_hi & data_lo; rdy_out=1; go to Send8End
  Send8End:        rdy_out=0; go to WaitFirst4
Bridge
(a) Controller
Same states and transitions as the FSMD, with datapath control actions:
  RecFirst4Start:  data_lo_ld=1
  RecSecond4Start: data_hi_ld=1
  Send8Start:      data_out_ld=1, rdy_out=1
  Send8End:        rdy_out=0
Controller input: rdy_in. Controller output: rdy_out.
(b) Datapath
Registers data_hi and data_lo are loaded from data_in(4) under data_hi_ld and data_lo_ld; register data_out is loaded from data_hi and data_lo under data_out_ld and drives data_out. clk goes to all registers.
RT-level custom single-purpose processor design example continued: a) controller, b) datapath.
Optimizing Custom Single-Purpose Processors
Optimization is the task of making design metric values the best possible.
Optimization opportunities
◦ original program
◦ FSMD
◦ datapath
◦ FSM
Optimizing the Original Program
At this level, we can analyze the number of computations and size of variables that are required by the algorithm.
We also have to analyze the algorithm in terms of time complexity and space complexity.
original program
0: int x, y;
1: while (1) {
2:   while (!go_i);
3:   x = x_i;
4:   y = y_i;
5:   while (x != y) {
6:     if (x < y)
7:       y = y - x;
       else
8:       x = x - y;
     }
9:   d_o = x;
   }
GCD(42, 8): 9 iterations to complete the loop. x and y values evaluated as follows: (42, 8), (34, 8), (26, 8), (18, 8), (10, 8), (2, 8), (2, 6), (2, 4), (2, 2).

Replace the subtraction operation(s) with a modulo operation in order to speed up the program.

optimized program
0: int x, y, r;
1: while (1) {
2:   while (!go_i);
     // x must be the larger number
3:   if (x_i >= y_i) {
4:     x = x_i;
5:     y = y_i;
     }
6:   else {
7:     x = y_i;
8:     y = x_i;
     }
9:   while (y != 0) {
10:    r = x % y;
11:    x = y;
12:    y = r;
     }
13:  d_o = x;
   }
GCD(42, 8): 3 iterations to complete the loop. x and y values evaluated as follows: (42, 8), (8, 2), (2, 0).
Optimizing the FSMD
Scheduling is the task of assigning operations from the
original program to states in an FSMD.
The scheduling obtained using the template-based method
can be improved.
States with independent operations can be merged.
States which require complex operations (a*b*c*d) can be
broken into smaller states to reduce hardware size
Original FSMD and Optimized FSMD
Original FSMD: states 1, 2, 2-J, 3, 4, 5, 6, 7, 8, 6-J, 5-J, 9 and 1-J, as obtained with the templates.
Optimizations applied:
• Eliminate state 1: transitions have constant values
• Merge state 2 and state 2J: no loop operation in between them
• Merge state 3 and state 4: assignment operations are independent of one another
• Merge state 5 and state 6: transitions from state 6 can be done in state 5
• Eliminate state 5J and 6J: transitions from each state can be done from state 7 and state 8, respectively
• Eliminate state 1-J: transition from state 1-J can be done directly from state 9
Optimized FSMD (int x, y;):
2: stay while !go_i; go to 3 on go_i
3: x = x_i, y = y_i; go to 5
5: on x<y go to 7; on x>y go to 8; otherwise (x equals y) go to 9
7: y = y - x; return to 5
8: x = x - y; return to 5
9: d_o = x; return to 2
Optimizing the Datapath
Sharing of functional units
◦ A one-to-one mapping, as done previously, is not necessary
◦ If the same operation occurs in different states, the states can share a single functional unit.
Multi-functional units
◦ An ALU supports a variety of operations, so it can be shared among operations occurring in different states.
Allocation is the task of choosing which RT components to use in the datapath.
Binding is the task of mapping operations from the FSMD to allocated components.
Optimizing the FSM
Designing a sequential circuit to implement an FSM also provides some opportunities for optimizing, namely, state encoding and state minimization.
State Encoding
◦ The task of assigning a unique bit pattern to each state in an FSM.
◦ The size of the state register and combinational logic varies with the encoding.
◦ Can be treated as an ordering problem.
State Minimization
◦ The task of merging equivalent states into a single state.
◦ Two states are equivalent if, for all possible input combinations, those two states generate the same outputs and transition to the same next state. Merging them will yield exactly the same output behavior.
General-Purpose Processors: Software
Introduction:
A general-purpose processor is a programmable digital system
intended to solve computation problems in a large variety of
applications.
The unit cost of the processor may be very low. The reason for
this low cost is that the processor manufacturer can spread its NRE
cost for the processor’s design over large number of units.
As the processor manufacturer can spread NRE cost over large
numbers of units, the manufacturer can afford to invest large NRE
cost into the processor's design, without significantly increasing the
unit cost.
The embedded system designer may incur low NRE cost, since
the designer need only write software, and then apply a compiler
and/or an assembler.
Time-to-prototype and time-to-market will be short, since
processor ICs can be purchased and then programmed in the
designer’s lab.
Basic Architecture
Datapath and control unit, tightly linked with a memory.
◦ Note similarity to single-purpose processor
Key differences
◦ Datapath is general
◦ Control unit doesn’t store the algorithm: the algorithm is “programmed” into the memory
[Figure: processor containing a control unit (controller, PC, IR) and a datapath (ALU, registers) linked by control/status signals, connected to I/O and memory]
Datapath
The datapath consists of the circuitry for transforming data and for storing temporary data.
• The datapath contains an ALU capable of transforming data through operations such as addition, subtraction, logical AND, logical OR, inverting and shifting.
• It also contains registers capable of storing temporary data.
• The internal data bus carries data within the datapath, while the external data bus carries data to and from the data memory.
• Common processor sizes include 4-bit, 8-bit, 16-bit, 32-bit, and 64-bit.
• However, in some cases, a particular processor may have different sizes among its registers, ALU, internal bus, or external bus, so the processor-size definition is not an exact one.
[Figure: processor with controller, PC and IR, and a datapath whose ALU performs +1 on a register value of 10 to give 11; memory locations hold 10 and 11]
Control Unit
The control unit consists of circuitry for retrieving program instructions and moving data to, from, and through the datapath according to those instructions.
It has a program counter (PC) that holds the address in memory of the next program instruction to fetch and an instruction register (IR) to hold the fetched instruction.
It also has a controller, consisting of a state register plus next-state and control logic.
If the address size is M, then the address space is 2^M.
• For each instruction, the controller typically sequences through several stages, such as
◦ Fetching the instruction from memory,
◦ Decoding it,
◦ Fetching operands,
◦ Executing the instruction in the datapath, and
◦ Storing results.
• Each stage may consist of one or more clock cycles.
• A clock cycle is usually the longest time required for data to travel from one register to another.
• The path through the datapath or controller that results in this longest time is called the critical path.
• The inverse of the clock cycle is the clock frequency, measured in cycles per second, or Hertz.
• The shorter the critical path, the higher the clock frequency.
Memory Architectures
Registers serve a processor’s short-term storage requirements, while memory serves the processor’s medium- and long-term information storage requirements.
Stored information is either program or data.
Program information consists of the sequence of instructions that cause the processor to carry out the desired system functionality.
Data information represents the values being input, output and transformed by the program.
[Figure: Harvard architecture, with the processor connected to separate program memory and data memory; Princeton architecture, with the processor connected to a single memory (program and data)]
• Princeton
◦ Data and program words share the same memory space.
◦ This may result in a simpler hardware connection to memory, since only one connection is necessary.
• Harvard
◦ The program memory space is distinct from the data memory space.
◦ Can perform instruction and data fetches simultaneously.
◦ The Intel 8051 is a well-known Harvard architecture.
• Memory may be ROM or RAM.
• An embedded system often uses ROM for program memory.
• Constant data may be stored in ROM, but other data of course requires RAM.
• Memory may be on-chip or off-chip.
• To reduce the time needed to access memory, a local copy of a portion of memory may be kept in a small but especially fast memory called cache.
• Cache memory is based on the principle that if at a particular time a processor accesses a particular memory location, then the processor will likely access that location and immediate neighbours of the location in the near future.
Control Unit Sub-Operations
Fetch instruction
◦ The task of reading the next instruction from memory into the instruction register.
◦ PC: program counter, always points to next instruction
◦ IR: holds the fetched instruction
Decode instruction
◦ The task of determining what operation the instruction in the instruction register represents.
Fetch operands
◦ The task of moving data from memory to the appropriate datapath register.
Execute operation
◦ The task of feeding the appropriate registers through the ALU and back into an appropriate register.
Store results
◦ The task of writing data from a register to memory.
[Figure, repeated for each sub-operation: processor with control unit (controller, PC = 100, IR = load R0, M[500]) and datapath (ALU, registers R0 and R1, with R0 = 10 after the operand fetch). Memory holds 100: load R0, M[500]; 101: inc R1, R0; 102: store M[501], R1; location 500 holds 10; location 501 is empty.]
Instruction Cycles
Each instruction cycle consists of the stages Fetch, Decode, Fetch ops, Exec. and Store results, driven by clk.
PC=100: load R0, M[500]. After this cycle R0 = 10.
PC=101: inc R1, R0. The ALU adds 1 and R1 = 11.
PC=102: store M[501], R1. After this cycle memory location 501 holds 11.
[Figure: processor and memory state after each of the three instruction cycles. Memory holds 100: load R0, M[500]; 101: inc R1, R0; 102: store M[501], R1; location 500 holds 10; location 501 holds 11 at the end.]
Pipelining: Increasing Instruction Throughput
[Figure: non-pipelined dish cleaning. Dishes 1 to 8 are all washed, then dishes 1 to 8 are all dried.]
[Figure: pipelined dish cleaning. Washing and drying overlap in time: while one dish is being dried, the next dish is being washed.]
[Figure: pipelined instruction execution. The stages Fetch-instr., Decode, Fetch ops., Execute and Store res. each process instructions 1 to 8, overlapped in time so that a new instruction enters the pipeline while earlier ones are still in later stages.]
Superscalar and VLIW Architectures
Multiple ALUs can be used to further speed up a processor.
A superscalar microprocessor can execute two or more scalar operations in parallel, requiring two or more ALUs.
A scalar operation transforms one or two numbers, as opposed to vector or matrix operations that transform entire sets of numbers.
A VLIW (very long instruction word) architecture is a type of static superscalar architecture that encodes several operations in a single machine instruction.
Programmer’s View
A programmer writes the program instructions that carry out the desired functionality on the general-purpose processor.
The level of abstraction depends on the level of programming.
◦ First is assembly-language programming, in which one programs in a language representing processor-specific instructions.
◦ Second is structured-language programming, in which one programs in a language using processor-independent instructions (C, C++, Java, etc.). A compiler automatically translates those instructions to processor-specific instructions.
• Most development today is done using structured languages
◦ But, some assembly-level programming may still be necessary
◦ Drivers: portion of program that communicates with and/or controls (drives) another device.
• We can define an even lower programming level, machine-language programming, in which the programmer writes machine instructions in binary.
• This level is rarely used due to the advent of assemblers.
• Machine-language-programmed computers often had rows of lights representing to the programmer the current binary instructions being executed.
Instruction Set
The assembly-language programmer must know the
processor’s instruction set.
The instruction set describes the bit configurations allowed
in the IR, indicating the processor operations that the
programmers may invoke.
An instruction typically has two parts, opcode field and
operand fields.
[Figure: each instruction (Instruction 1, ...) consists of an opcode field followed by operand fields: opcode, operand1, operand2]
An opcode specifies the operation to take place during the instruction.
An operand specifies the location of the actual data that takes part in an operation.
We classify the instruction set into 3 categories
1. Data-transfer instructions move data between memory and registers, between input/output channels and registers, and between registers themselves.
2. Arithmetic/logical instructions configure the ALU to carry out a particular function, move data from the registers through the ALU, and move data from the ALU back to a particular register.
3. Branch instructions determine the address of the next instruction, based on datapath status signals.
Addressing Modes
Addressing mode     Operand field     Register-file contents   Memory contents
Immediate           Data
Register-direct     Register address  Data
Register indirect   Register address  Memory address           Data
Direct              Memory address                             Data
Indirect            Memory address                             Memory address, then Data
• In immediate addressing, the operand field contains the data itself.
• In register addressing, the operand field contains the address of a datapath register in which the data resides.
• In register-indirect addressing, the operand field contains the address of a register, which in turn contains the address of a memory location in which the data resides.
• In direct addressing, the operand field contains the address of a memory location in which the data resides.
• In indirect addressing, the operand field contains the address of a memory location, which in turn contains the address of a memory location in which the data resides.
Sample Program
C program
int total = 0;
for (int i=10; i!=0; i--)
   total += i;
// next instructions...
Equivalent assembly program
0     MOV R0, #0;   // total = 0
1     MOV R1, #10;  // i = 10
2     MOV R2, #1;   // constant 1
3     MOV R3, #0;   // constant 0
Loop: JZ R1, Next;  // Done if i=0
5     ADD R0, R1;   // total += i
6     SUB R1, R2;   // i--
7     JZ R3, Loop;  // Jump always
Next: // next instructions...
The figure shows a program written in C that adds the numbers 1 through 10, and the equivalent program written in assembly language using the instruction set.
Program and Data Memory Space
The embedded systems programmer must be aware of the size of the available memory for program and for data.
For example, a particular processor may have a 64K program space and a 64K data space.
In addition, the programmer will probably want to be aware of on-chip program and data memory capacity.
Registers
Assembly-language programmers must know how many registers are available for general-purpose data storage.
Other special-function registers must be known by both the assembly-language and the structured-language programmer.
Such registers are used for configuring built-in timers, counters, and serial communication devices.
Input / Output
The programmer should be aware of the processor’s I/O facilities, with which the processor communicates with other devices.
One common I/O facility is parallel I/O, in which the programmer can read or write a port by reading or writing a special-function register.
Another common I/O facility is a system bus, consisting of address and data ports that are automatically activated by certain addresses or types of instructions.
Interrupts
An interrupt causes the processor to suspend execution of the main program and jump to an interrupt service routine (ISR) that fulfils a special, short-term processing need.
After the ISR completes, the processor resumes execution of the main program by restoring the PC.
The assembly-language programmer places each ISR at a specific address in program memory.
Some compilers allow a programmer to force a procedure to start at a particular memory location, while others recognize predefined names for particular ISRs.
Operating System
An operating system is a layer of software that provides low-level services to the application layer, a set of one or more programs executing on the CPU, consuming and producing I/O data.
The task of managing the application layer involves
◦ The loading and executing of programs,
◦ Sharing and allocating system resources to programs,
◦ Protecting these allocated resources from corruption by non-owner programs.
• The OS is responsible for deciding what program is to run next on the CPU and for how long. This is called process (task) scheduling and it is determined by the OS’s pre-emption policy.
• Another very important resource is memory, including disk storage, which is also shared among the applications running on the CPU.
• For high-level application programs, the OS provides the software required for servicing various hardware interrupts, and provides device drivers for driving the peripheral devices present in the system.
• A system call is a mechanism for an application to invoke the OS.
• When a program requires some service from the OS, it generates a predefined software interrupt that is serviced by the OS.
• The OS abstracts the details of the underlying hardware and provides the application layer an interface to the hardware through the system call mechanism.
Development Environment
Embedded system designers use general software design tools in the design, test and debugging of embedded software.
Development processor
◦ The processor on which we write and debug our programs
◦ Usually a PC
Target processor
◦ The processor that the program will run on in our embedded system
◦ Often different from the development processor
[Figure: development processor and target processor]
[Figure: Software Development Process. Implementation phase: C files pass through the compiler and assembly files through the assembler to produce binary files; the linker combines the binary files with library code into an executable file. Verification phase: the executable file is run under a debugger and a profiler.]
• Assemblers translate assembly instructions to binary machine instructions.
• Assemblers may also translate symbolic labels into actual addresses.
• Compilers translate structured programs into machine programs.
• A cross compiler executes on one processor but generates code for a different processor.
• A linker allows a programmer to create a program in separately assembled or compiled files: it combines the machine instructions of each into a single program.
• Programming of an embedded system’s processor is similar to writing a program that runs on a desktop.
• The general design flow for programming applications that run on a desktop computer starts with writing source code, possibly organized in a number of files for modularity, using an editor.
• We compile or assemble the code in each file, using a compiler or assembler, into corresponding binary files.
• Using a linker, we combine these binary files into a final executable.
• Next, we test the program by running the executable file under the command of a debugger.
• Sometimes we use a profiler to pinpoint the performance bottlenecks of the program.
Testing and Debugging
Testing and debugging are a major part of the overall design process.
The most common method of verifying the correctness of a
program is running it with input data that checks the
program's behaviour, especially using boundary cases.
In addition, a program running in an embedded system
most often needs to run in real time.
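The idea of checking a program with boundary cases can be sketched as follows. The function under test, `clamp_to_u8`, is a hypothetical example (it saturates a signed reading to the range 0..255) and is not taken from the slides; the test vectors sit on and just beyond each edge of the valid range.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical function under test: saturate a signed reading to 0..255. */
uint8_t clamp_to_u8(int value)
{
    if (value < 0) {
        return 0;
    }
    if (value > 255) {
        return 255;
    }
    return (uint8_t)value;
}

/* One test vector: an input and the output we expect for it. */
typedef struct {
    int input;
    uint8_t expected;
} TestCase;

/* Run the function on boundary inputs and return the number of failures. */
int run_boundary_tests(void)
{
    static const TestCase cases[] = {
        { INT_MIN, 0 }, { -1, 0 },  { 0, 0 },     { 1, 1 },
        { 254, 254 },   { 255, 255 }, { 256, 255 }, { INT_MAX, 255 },
    };
    int failures = 0;
    for (size_t i = 0; i < sizeof cases / sizeof cases[0]; i++) {
        if (clamp_to_u8(cases[i].input) != cases[i].expected) {
            failures++;
        }
    }
    return failures;
}
```

A test like this checks only functional correctness; it says nothing about whether the program meets its real-time requirements.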
• ISS (instruction-set simulator)
◦ Helps programmers evaluate and correct their programs.
◦ Gives us control over time: set breakpoints, look at register
values, set values, step-by-step execution, ...
◦ But does not interact with the real environment.
• Emulator
◦ Supports debugging of the program while it executes on
the target processor.
◦ Runs in the real environment, at speed or near.
◦ Supports some controllability from the PC.
• Device programmers
◦ These download a binary machine program from the
development processor's memory into the target processor's
memory.
Figure: Software Design Process: (a) Desktop, (b) Embedded.
In (a), the implementation phase is followed by a verification phase.
In (b), implementation takes place on the development processor, and
verification uses either a debugger/ISS on the development processor
or external tools (an emulator or a programmer).
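To make the ISS idea concrete, here is a minimal sketch of an instruction-set simulator with single-stepping and a breakpoint. The five-opcode instruction set, the `Instr` and `Cpu` structures, and the function names are all assumptions invented for this example; they do not correspond to any real processor. The sketch also assumes that register indices are in the range 0..3 and that every program ends with `OP_HALT`.

```c
#include <stddef.h>

/* Hypothetical opcodes for a toy processor. */
enum { OP_LOADI, OP_ADD, OP_SUBI, OP_JNZ, OP_HALT };

typedef struct {
    int op; /* one of the OP_* codes */
    int a;  /* destination register, or the register tested by OP_JNZ */
    int b;  /* source register, immediate value, or jump target */
} Instr;

typedef struct {
    int reg[4]; /* register file R0..R3 */
    size_t pc;  /* program counter: index of the next instruction */
    int halted; /* set to 1 once OP_HALT has executed */
} Cpu;

/* Execute exactly one instruction (single-step). */
void iss_step(Cpu *cpu, const Instr *prog)
{
    if (cpu->halted) {
        return;
    }
    Instr in = prog[cpu->pc];
    cpu->pc++;
    switch (in.op) {
    case OP_LOADI: cpu->reg[in.a] = in.b;             break;
    case OP_ADD:   cpu->reg[in.a] += cpu->reg[in.b];  break;
    case OP_SUBI:  cpu->reg[in.a] -= in.b;            break;
    case OP_JNZ:
        if (cpu->reg[in.a] != 0) {
            cpu->pc = (size_t)in.b; /* branch taken */
        }
        break;
    case OP_HALT:  cpu->halted = 1;                   break;
    default:       cpu->halted = 1;                   break; /* unknown opcode: stop */
    }
}

/* Run until the program halts or the program counter reaches `breakpoint`.
   Returns the number of instructions executed. */
size_t iss_run(Cpu *cpu, const Instr *prog, size_t breakpoint)
{
    size_t executed = 0;
    while (!cpu->halted && cpu->pc != breakpoint) {
        iss_step(cpu, prog);
        executed++;
    }
    return executed;
}
```

Because the whole processor state lives in an ordinary structure on the development computer, register values can be inspected or changed between steps. The simulator never touches real peripherals, which is why this approach cannot reproduce the target environment.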
The three methods differ as follows:
The design cycle using a debugger based on an ISS running on the
development computer is fast, but it is inaccurate, since it can
interact with the rest of the system and the environment only to a
limited degree.
The design cycle using an emulator is a little longer, since the code
must be downloaded into the emulator hardware.
◦ The emulator hardware can interact with the rest of the
system, and hence allows more accurate testing.
The design cycle using a programmer to download the program
into the target processor is the longest of all.
◦ Here the target processor must be removed from its system,
put into the programmer, programmed, and returned to the
system.
◦ This method enables the system to interact with its
environment freely, and hence provides the highest execution
accuracy.
Application-Specific Instruction-Set
Processors (ASIPs)
Today's embedded applications, such as HDTV, require
high computing power and very specific functionality.
ASIPs are instruction-set processors: they can be
programmed by writing software, resulting in short
time-to-market and good flexibility, while the performance and
other constraints may be efficiently satisfied.
They are expensive to integrate into low-cost embedded
systems.
The three major varieties are:
◦ Microcontrollers (for large amounts of control-oriented tasks)
◦ Digital signal processors (for processing large amounts of
data)
◦ Less-general ASIPs
Microcontrollers
Numerous processor IC manufacturers market devices
specifically for the control-oriented embedded systems
domain.
Microcontroller features
◦ They may include several peripheral devices, such as
timers, A-D converters, and serial communication
devices, on the same IC as the processor.
◦ They may include some program and data memory on the
same IC.
◦ They may provide the programmer with direct access to a
number of pins of the IC.
◦ They may provide specialized instructions for common
embedded system control operations.
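Direct access to pins and bit-oriented control operations can be sketched in C as bit manipulation on a port register. In the sketch below the port is an ordinary 8-bit variable so that the code runs on a development PC; on a real microcontroller the pointer would refer to a memory-mapped port register whose address depends on the device, which is not specified here. The function names are hypothetical.

```c
#include <stdint.h>

/* Drive pin `bit` (0..7) of the port high. */
void pin_set(volatile uint8_t *port, unsigned bit)
{
    *port = (uint8_t)(*port | (1u << bit));
}

/* Drive pin `bit` (0..7) of the port low. */
void pin_clear(volatile uint8_t *port, unsigned bit)
{
    *port = (uint8_t)(*port & ~(1u << bit));
}

/* Invert the current level of pin `bit` (0..7). */
void pin_toggle(volatile uint8_t *port, unsigned bit)
{
    *port = (uint8_t)(*port ^ (1u << bit));
}

/* Return 1 if pin `bit` (0..7) is high, otherwise 0. */
int pin_read(const volatile uint8_t *port, unsigned bit)
{
    return (int)((*port >> bit) & 1u);
}
```

Some microcontrollers offer single instructions for setting, clearing, or testing one bit; whether a compiler maps functions like these onto such instructions depends on the target and the toolchain.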
Digital Signal Processors
These are processors that are highly optimized for processing
large amounts of data.
The source of this large amount of data is some form of
digitized signal, like
◦ A photo image captured by a digital camera,
◦ A voice packet going through a network router, or
◦ An audio clip played by a digital keyboard.
A DSP may contain numerous register files, memory blocks,
multipliers, and other arithmetic units.
DSPs also provide instructions that are central to digital
signal processing, such as filtering and transforming vectors
or matrices of data.
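Filtering is a typical example of the kind of computation these instructions speed up. The sketch below is a plain C version of a finite impulse response (FIR) filter, y[n] = h[0]·x[n] + h[1]·x[n-1] + ... ; its inner loop is a chain of multiply-accumulate steps, which is the operation a DSP's multipliers and accumulators are built to perform quickly. The function name and the choice of 16-bit samples with a 32-bit accumulator are assumptions for this example, and samples before x[0] are treated as zero.

```c
#include <stddef.h>
#include <stdint.h>

/* Compute y[i] = sum over k of h[k] * x[i - k] for i = 0..n-1,
   where k runs over the filter taps and samples before x[0] count as 0.
   The caller provides y with room for n results. */
void fir_filter(const int16_t *x, size_t n,
                const int16_t *h, size_t taps,
                int32_t *y)
{
    for (size_t i = 0; i < n; i++) {
        int32_t acc = 0;
        for (size_t k = 0; k < taps && k <= i; k++) {
            acc += (int32_t)h[k] * x[i - k]; /* multiply-accumulate step */
        }
        y[i] = acc;
    }
}
```

Feeding the filter a single unit impulse returns the coefficients themselves, which is a quick sanity check on any FIR implementation.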
Less-General ASIP Environments
These are designed to perform some very domain-specific
processing while allowing some degree of programmability.
Example: an ASIP designed for networking hardware may be
programmable with different network routing and
packet-processing protocols.
End of UNIT-II