Lecture Notes Workings

Take it through
C / C++ / any other higher-level language compilation

0 Profiling of the software code

1 Can we calculate the processing time and the deadlines (real-time system)?
2 Can we reduce the processing time by having inline assembly for key functions?
3 Can we figure out whether some function is called many, many times, so that we can
build an accelerator out of it, or create a new instruction in the
processor to do this activity? (profiling and seeking HW functions)
HW a> Create an instruction in the processor itself.
b> Create an accelerator (push the data, wait till processing is
done, read the processed output back).
4 Take care of loops in the SW to compute… (foreach, while, loops…)
VGA screen size

X 640
Y 480
Total frame size: 640 x 480 = 307,200 pixels in one frame
At 60 frames per second: 307,200 x 60 = 18,432,000 (~18.4M) pixels need to be sent to the projector in one second
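The pixel-rate arithmetic above can be sketched in C. The function names are illustrative and not from the notes; the second helper simply turns the rate into a time budget per pixel.

```c
#include <stdint.h>

/* Pixels that must be delivered per second for a given frame size and
   frame rate. Names are hypothetical, chosen for this sketch. */
uint64_t pixels_per_second(uint32_t width, uint32_t height, uint32_t fps)
{
    return (uint64_t)width * height * fps;
}

/* Time available per pixel, in nanoseconds (1e9 ns in one second). */
double ns_per_pixel(uint32_t width, uint32_t height, uint32_t fps)
{
    return 1e9 / (double)pixels_per_second(width, height, fps);
}
```

For 640 x 480 at 60 frames per second this gives 18,432,000 pixels per second, i.e. roughly 54 ns per pixel.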
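Complex Multiplication
Y = (a + ib)(c + id) = (ac - bd) + i(bc + ad)
SW: many, many assembly instructions, which will take many clock cycles
r1 = a
r2 = b
r3 = c
r4 = d
t1 = r1 * r3 (ac)
t2 = r2 * r4 (bd)
Real part = t1 - t2
Accelerator: inputs a, b, c, d; two multipliers (* *) followed by a subtractor (-) give the real part
If the design is pipelined: initially there is a latency of, say, 6 clock cycles, but later on there is sustained throughput: every clock cycle, one complex multiplication is completed.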
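As a software reference for what the accelerator computes, here is a minimal C sketch of the four multiplications, one subtraction and one addition. The type and function names are assumptions made for this example, and 32-bit integer operands are an arbitrary choice.

```c
#include <stdint.h>

/* Hypothetical complex type for this sketch: re + i*im. */
typedef struct {
    int32_t re;
    int32_t im;
} cplx_t;

/* (a + ib)(c + id) = (ac - bd) + i(bc + ad) */
cplx_t cplx_mul(cplx_t x, cplx_t y)
{
    int32_t ac = x.re * y.re;   /* multiplier 1 */
    int32_t bd = x.im * y.im;   /* multiplier 2 */
    int32_t bc = x.im * y.re;   /* multiplier 3 */
    int32_t ad = x.re * y.im;   /* multiplier 4 */
    cplx_t result = { ac - bd, bc + ad };
    return result;
}
```

In hardware the four products can be computed in the same clock cycles, which is where the accelerator gains over the instruction-by-instruction SW version.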
cpu ram rom
interconnect bus
DMA
controller complexMulAccelerator
Combination1
T1 MIPS
T2 DSP
T3 FPGA
T4 MIPS
Area (200+120+240), Time (5+20+2)
DAG = Directed Acyclic Graph

only fork and join. Not cyclic.
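Matrix multiplication: A (3x2 matrix) x B (2x3 matrix) = C (3x3 matrix)
(b00 = B matrix, 0th row, 0th column)

A: col0 col1
row0 a00 a01
row1 a10 a11
row2 a20 a21

B: col0 col1 col2
row0 b00 b01 b02
row1 b10 b11 b12

c00 = a00 * b00 + a01 * b10
For c00 --> 2 multipliers and 1 adder
9 * 2 multipliers: total 18 multipliers
9 * 1 adders: total 9 adders

Option1
18 multipliers (5 clocks each) and 9 adders (2 clocks each), all working in parallel
Output c00 obtained in 7 clock cycles
Option2
6 multipliers and 3 adders ==> do it 3 times
Reduces area from 18 multipliers to 6 multipliers
Reduces area from 9 adders to 3 adders
Output obtained in 7 clock cycles x 3 times = 21 minimum… if our FSM is good enough…
Option3
2 multipliers and 1 adder ==> do it 9 times
Output obtained in 7 clock cycles x 9 times = 63 minimum… if our FSM is good enough (pipelining is good) to do proper reads and writes to memory locations…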
Option4 SW
int main(void) {
    int A[3][2] = { {1,2}, {3,4}, {5,6} };
    int B[2][3] = { {7,8,9}, {10,11,12} };
    int C[3][3];
    int i, j, k;
    for (i = 0; i < 3; i++) {
        for (j = 0; j < 3; j++) {   /* the condition must test j, not i */
            C[i][j] = 0;
            for (k = 0; k < 2; k++) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
    return 0;
}
Option5 What if the inner (k) loop is unrolled?
    C[i][j] = A[i][0] * B[0][j] + A[i][1] * B[1][j];
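The unrolled variant (Option5) can be written out as a complete function. This is a sketch under the same 3x2 by 2x3 sizes used in the notes; the function name is illustrative.

```c
/* C = A x B for a 3x2 matrix A and a 2x3 matrix B, with the inner
   (k) loop unrolled by hand: each C[i][j] needs 2 multiplies and 1 add. */
void matmul_unrolled(int A[3][2], int B[2][3], int C[3][3])
{
    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < 3; j++) {
            C[i][j] = A[i][0] * B[0][j] + A[i][1] * B[1][j];
        }
    }
}
```

With the A and B values from the notes, the first row of C comes out as 27, 30, 33. Unrolling removes the loop counter and the read-modify-write of C[i][j], which mirrors the "2 multipliers and 1 adder per output" view of the hardware options.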
Load to Register
Store to Memory
indirect store to Mem
R0 is a register ==> this can be 8bits or 16bits or 32bits we yet don’t know….
But there are 16 such registers… which can be named as R0, R1, R2….., R15
Can I write an ISS given the instruction set itself in a document ?? Yes !!
I'm trivializing; but these instructions can be in a large case statement in 'C' to create an ISS out of it.
Can I do a sum of 10 numbers using the above CPU…

int total = 0;
for (int i = 10; i != 0; i--) total += i;
Use the Instruction Set and implement the 'C' program of "sum of 10 numbers"
i.e, 1st step : convert the 'C' program into assembly routine for the given InstructionSet.
2nd Step : assembly to executable conversion
3rd Step : run the executable on the ISS
Y = ab – 6d + dc
a b c
3 multipliers * *
o/p at 3rd clock

ab cd
1 adder +
o/p at 5th clock
(ab + cd)
1 subtractor
o/p at 7th clock
DMA
BC DE
Source Address  Source Value  Destination Address  Destination Value
0x1000  5A  0x200  5
0x1001  5   0x201  5
0x1002  5   0x202  5
0x1003  6   0x203  6
0x1004  12  0x204  12
0x1005  1   0x205  1
0x1006  11  0x206  11
0x1007  1   0x207  1
0x1008  0   0x208  0
0x1009  12  0x209  12
0x100A  5   0x20A  5
0x100B  5   0x20B  5
0x100C  12  0x20C  12
0x100D  8   0x20D  8
0x100E  5   0x20E  5
0x100F  1   0x20F  1
CPU-driven copy, in a loop for the number of bytes (words):
cpu will read one value from the source
cpu will write one value to the destination
increment source pointer
increment destination pointer
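The four CPU steps listed with the table map directly onto a copy loop. The following C sketch uses an illustrative function name and byte-wide values; it shows the work a DMA controller would take off the CPU.

```c
#include <stddef.h>
#include <stdint.h>

/* CPU-driven copy: one read, one write and two pointer increments per
   element, repeated for the number of bytes. */
void cpu_copy(const uint8_t *src, uint8_t *dst, size_t count)
{
    while (count != 0) {
        uint8_t value = *src;   /* cpu reads one value from the source      */
        *dst = value;           /* cpu writes one value to the destination  */
        src++;                  /* increment source pointer                 */
        dst++;                  /* increment destination pointer            */
        count--;
    }
}
```

Every element costs the CPU several instructions, which is the motivation for handing the transfer to a DMA controller.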
while (1) { // forever loop
    get high_time to a register, say B
high_time_loop:
    set gpio '1'
    decrement B
    if B != 0 then
        go to high_time_loop
    get low_time to a register, say C
low_time_loop:
    set gpio '0'
    decrement C
    if C != 0 then
        go to low_time_loop
}
Example waveform: gpio is '1' during clock 0 to clock 3 (25%) and '0' during clock 4 to clock 15 (75%).
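The PWM loop can be simulated in C by recording the gpio value for each clock of one period. This is a sketch: the function name and the output buffer are assumptions, and unlike the pseudocode above it tests the counter before the first iteration, so a count of zero produces no samples.

```c
#include <stddef.h>
#include <stdint.h>

/* Writes one PWM period into wave[] (1 = gpio high, 0 = gpio low) and
   returns the number of samples written, never more than cap. */
size_t pwm_period(uint8_t high_time, uint8_t low_time, uint8_t *wave, size_t cap)
{
    size_t n = 0;
    uint8_t b = high_time;          /* register B in the notes */
    while (b != 0 && n < cap) {
        wave[n++] = 1;              /* set gpio '1' */
        b--;
    }
    uint8_t c = low_time;           /* register C in the notes */
    while (c != 0 && n < cap) {
        wave[n++] = 0;              /* set gpio '0' */
        c--;
    }
    return n;
}
```

Calling it with high_time = 4 and low_time = 12 reproduces the 25% / 75% waveform over 16 clocks.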
arrival time for a, b, c, d and fixed number 5 can be assumed to be zero.
a b c d
* (3) - (2)
1st mult
1st sub
5 arrival time = 2
* (3)
2nd mult operation, shared hardware
arrival = 3 arrival = 5
+ (2)
critical path
arrival = 7
y
y
Thoughts for Assignments 15 marks
10marks 1a We will give an MCQ quiz for say 8 marks or 7 marks: basically 15 MCQ
1b Add subjective-type questions also.
2 For another 7 to 8 marks we will give an assignment which needs to be submitted
resource HW a> You have a kit (arduino, beagle, pico, …) --> show two assignments, (i) if I press a button, an LED will glow (ii) a UART showing "hello world" on screen.
5marks resource Laptop/PC b> We have the 8085 instruction set.
(i) Create an InstructionSetSimulator (ISS) using 'C' or 'C++', no need for SystemC.
(ii) Write an assembly program to sort 16 numbers. Use your ISS to ensure correctness of the code…
5marks resource Laptop/PC c> Create a domain-specific architecture for, say, an Ethernet LAN packet… IEEE 802.3
(i) Student can decide the packet type [(header) (source) (destination) (payloadSize) (payload) (crc) (tail)]
(ii) Parse the packet using a domain-specific processor to route the packet from source to destination.
(iii) Do some level of time-stamping in the packet [(header) (source) (destination) (TIME_STAMP) (payloadSize) (payload) (crc) (tail)]
resource HW d> Choose any DSP processor of your choice (TI, ADI, MaxLinear, Freescale, …) ---> Write an FIR filter for that processor.
(i) Filter taps (<<>>), Filter coeffs (<<>>) --> <<goldenInput, and goldenOutput>>…
5marks resource Laptop/PC e> Image processing example: YUV -> RGB conversion. Create a HW peripheral for the same. [Write VHDL or Verilog] [simulate it and show it working]
Verific, Icarus Verilog (open source), Quartus / ModelSim
3
https://fpgasoftware.intel.com/
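Co-simulation & Emulation

1 Normal simulation --> say on a single-core processor… Slow… say to the tune of 100kHz wall clock time per clock cycle of simulation. (server)
2 Accelerated simulation --> where gates are parallelized by usage of either a super-scalar boolean processor, or an FPGA.
How parallelization happens
Synopsys ZeBu (1) on FPGA (Field Programmable Gate Array) --> array of FPGAs… like a board which has 9+ FPGAs running simultaneously.
Cadence Palladium (2) using a super-scalar boolean processor (potentially N x M), with the possibility of adding processors…
Mentor Veloce (3) a mix of FPGA + super-scalar boolean processors…

KNOBS                Speed    Debuggability
Normal simulation    100kHz   High (100%)
Synopsys ZeBu        10MHz    Low
Cadence Palladium    4MHz     High/Medium
Mentor Veloce        8MHz     Medium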
boolean processor
boolean processor
boolean processor
Process Path Methodology
Assembly-level code and other object files, like ELF (Executable and Linkable Format)

assembly          clocks
instr1            1 clock
instr2            2 clocks
instr3            1 clock
instr4            6 clocks
call functionN    10 clocks
…
functionN         1 clock
instrN1           2 clocks
instrN2           1 clock
…
return from function

sum all the # clocks taken: TOTAL N clocks
frequency of the processor (F)
period of each clock = 1/(F)
processing time = N clocks * period = N/(F) = some ms, s, usec, nsec
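The N/F calculation can be sketched in C as below. The function names and the deadline check are illustrative additions, following the real-time question raised at the start of the notes.

```c
/* Processing time in seconds for total_clocks cycles on a processor
   running at freq_hz: time = N * (1/F) = N / F. */
double processing_time_seconds(unsigned long long total_clocks, double freq_hz)
{
    return (double)total_clocks / freq_hz;
}

/* Returns 1 if the work fits inside the deadline, 0 otherwise. */
int meets_deadline(unsigned long long total_clocks, double freq_hz, double deadline_s)
{
    return processing_time_seconds(total_clocks, freq_hz) <= deadline_s;
}
```

For example, 1,000,000 clocks at 100 MHz take 10 ms, which meets a 20 ms deadline but not a 1 ms one.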
CXO (CEO, CFO, CTO, …) and, similarly, XPU:
DSP digital signal processor
GPU graphics processing unit
DPU data processing unit
TPU tensor processing unit
IPU image processing unit
NPU neural (?) processing unit
SPU security processor
VPU vector processing unit
CPU central processing unit
Domain-Specific VLIW Processor

Complex multiplication accelerator, imaginary part (bc + ad):
inputs a, b, c, d    2 clocks
bc, ad (* *)         2 clocks
bc + ad (+)          1 clock
Imaginary Part       1 clock
Total                6 clocks
ay 6 clock cycles, but then later on, there is sustained throughput…. Every clcok cycle, you get one ComplexMultiplication done
mplexMulAccelerator
Combination2 CombinationN
a (200+120+240), Time(5+20+2)
max(T2, T3), because T2 & T3 are executed in parallel
G = Directed Acyclic Graph

y fork and join. Not cyclic.
MIPS itself is a processor… We write SW on this processor
DSP is a special purpose processor… We write SW on this
FPGA is a HW
ASIC is a HW (HW-Accelerator)
FPGA with CPU + FABRIC
CPU == SW
FABRIC == HW
3x3 matrix
col1 col2
b01 b02 c00 = a00 * b00 + a01 * b10
b11 b12 For c00 --> 2 multiplier and 1 adder
9 * 2 multipliers Total 18 mults

9 * 1 adders Total 9 adders
5 6 7 8 9 10 11 12 13
* * * * * * * * *
duce area from 18 multipliers to 6 multipliers

duce area from 9 adders to 3 adders
our FSM is good enough….
our FSM is good enough (pipelining is good) to to proper read and writes to memory locations…
p in unrolled….
b[0][j] + a[i][1] * b[1][j];
Can we convert this instruction set to InstructionSetSimulator
Instruction Set Simulator
assembly program --> output at what instruction

ISS 'C' or 'C++' or
what happens
'Python' or 'Perl'
like which register is updated
what happens to memory locations
….
Observations
1. Looks like a 16-bit instruction, because it has 2 bytes.
2. The Rn field is 4 bits wide. Rn = internal register, so the number of internal registers can be at most 16.
3. The 4 MSBs are the opcode of the instruction.
4. Registers are R0 to R15 at most.
5. M(direct) addressing is available. What does it mean? (direct) is at most 8 bits, so memory can have at most 256 locations… :(
6. Rn = M(direct) >> read the memory location whose address is (direct) and copy it to register Rn (where Rn = R0, or R1, or R2, …, R15).
7. M(direct) = Rn >> write the contents of the Rn register to the memory location whose address is (direct).
8. M(Rn) = Rm >> write the contents of the Rm register to the memory location whose address is the value of the register Rn.
9. Rn = immediate >> Rn now has the value "immediate".
10. Rn = Rn + Rm >> add instruction: I = I + J. Similarly for subtract.
11. PC = PC + relative (only if Rn is 0) >> jump if zero, to a relative location.
ab-d(6+c)
d 6
* 3 clocks d a b 6 c
* +
3 clocks
2 clocks
* reuse multiplier
6d
- 2 clocks
ab + cd - 6d 7 clocks
clock0 clock1 clock2 clock3
multiplier * a*b
adder + 6+c blank

waiting for multiplier to complete
subtractor -
om source one value in a loop for the number of bytes (words)…

o the destination one value
ination pointer
while(1)
clock 2 clock 3 clock 4 clock 5 clock 6 clock 7 clock 8 clock 9 clock 10
1 1 0 0 0 0 0 0 0
75%
ed to be zero. sharing the multiplier
d ab *(3) -(2) c-d max (3 clocks) Level 1
5(c-d) *(3) shared mult (3 clocks) Level 2
ab + 5(c-d) +(2) adder (2 clocks) Level 3
8 clocks
a
5 MUX I1
*
b
MUX I2
(c-d)
select line to select depending on statemachine

either to use (a, b) pair for multiplication
or to use (5, (c-d) ) pair for multiplication
basically 15 MCQ
nt which need to be submitted

…) --> show two assignments, (i) if I press a button, an led will glow (ii) a UART showing "hello world" on screen..
mulator (ISS) using 'C' or 'C++', no need for SystemC.

m to sort 16 numbers. Use your ISS, to ensure correctness of code….
for say a EthernetLAN packet…. IEEE 802.3
cket type [(header) (source) (destination) (payloadSize) (payload) (crc)(tail)]
omain-specific-processor to route the packet from source to destination.
mping in the packet [(header) (source) (destination)(TIME_STAMP) (payloadSize) (payload) (crc)(tail)]
ice (TI, ADI, MaxLinear, Freescale, ….) ---> Write a FIR filter for that processor.
effs (<<>>) --> <<goldenInput, and goldenOutput>>…
B conversion. Create a HW Peripheral for the same. [Writing a VHDL, or Verilog] [simulate it and show it working]
uartus / ModelSim
KNOBS
Speed Debugability
une of 100kHz wall clock time per clock cycle of simulation. (server) 100kHz High (100%)
a super-scalar boolean processor, or an FPGA.
y of FPGAs…. Like a board which has 9+ FPGAs running simultaneously. 10MHz Low
y N x M) possibility for adding processors…. 4MHz High/Medium
FPGA0,0 FPGA0,1 FPGA0,2 8MHz Medium
FPGA1,0
olean processor
FPGA2,2
Average of 10 numbers done completely in HW
n0 n1 n2 n3 n4
1st clock cycle Plus Plus Plus
funciton N 2nd clock cycle PLUS

"S" clocks
if it were done in SW
3rd clock cycle PLUS
function N
"H" clocks if it were 4th clock cycle
done in HW
5th clock cycle
F) = some ms, s, usec, nsec
al processor
ocessing unit
ocessing unit
ocessing unit
lexMultiplication done
te SW on this processor
r… We write SW on this
14 15 16 17 18
* * * * *
at what instruction
register is updated
ens to memory locations
al registers can be max 16
s max 8bits, so memory location
(direct) and copy it to
Memory location whose address is (direct).

emory location whose address
2 clocks
clock4 clock5 clock6 clock7

d*(6+c)
multiplier to complete
ab - d(6+c)
clock 11 clock 12 clock 13 clock 14 clock 15
0 0 0 0 0
(pipelining)
(pipelining)
(pipelining)
Users
multiple
one
multiple
multiple
n5 n6 n7 n8 n9
Plus Plus Plus
PLUS
PLUS
DIV/10
`
What is Glue logic
1 The uP works with an active-low reset, but the peripheral works with an active-high reset…
so glue logic is needed to invert the reset signal before it goes to either the uP or the peripheral.
2 The uP may have one bus architecture and the peripheral may have a different one… So glue logic is needed (bus bridge, or bus gasket, …): AXI3 to AXI4, AHB to AXI, …
How the firmware starts executing…

1 Power up.
2 Clocking should start… Reset is held, keeping the uP in its reset state.
3 Reset release.
4 As soon as reset release happens, the uP starts execution, and what it does is --> fetch the first instruction…
5 The uP can fetch instructions from ROM (within the SoC chip), or from NAND flash (outside the SoC chip)…
6 Concept called boot-up pins… Based on the boot-up pins, the uP starts reading the first instruction from either the ROM or the NAND flash…
7 From then on, instruction execution proceeds…
PPAS Goals For any application

P Performance
P Power
A Area
S Schedule
(P)PPAS(S)
(P) People
(S) Security
₹
2ns 6ns 8ns
10ns is ok… then I will design with sing
Start from scratch : Design a Frequency Detector
1 How would this frequency detector be done in SW?

2 If there are components which are analog in nature, pick them first. (sensor, DAQ module, …)
3 If there is SW -- is there a time limit -- is it a real-time problem…
4 Ask yourself a question -- is SW good enough for the real time (like, say, 1ms to detect a frequency)?
5 If SW is not good enough… look into the SW program and profile it… Do you see any patterns?
6 Of the few functions that are being called over and over again, or are leading to a non-real-time nature… identify them…
7 What we identified, try to put it in HW…
8 In HW again, there will be a lot of choices to be made…
9 For what was originally done in SW, is there a way to optimize it, using better memory and data structures, better handling of interrupts, external IOs...
A good example is DFT (Discrete Fourier Transform)…

1 many butterfly units in parallel or not…
2 many memories in parallel or not (read has one memory) (write has another separate memory), …
3 make the refinements till your application is satisfied…
first SW, and then HW to design a good product?.. vice versa is not good strategy sir?
6:07 PM
1 If product is already there, and it is HW based.. Then you can reduce cost by putting it in SW…
2
FPGA's…. has the programmable sub-system (processors, memory controllers

PS PART
has the programmable logic… (LUT, DSP blocks, memory blocks, pll
PL PART
What is a maskable interrupt
interrupt --> AND GATE --> interrupt to the processor
The other input of the AND gate is a flip flop (register-mapped programmable bit) [interrupt mask register]
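The AND-gate masking can be modelled in C with one bit per interrupt source. This is a sketch: the function name is hypothetical, and it assumes a mask bit of 1 lets the interrupt through, which is what a plain AND gate implies.

```c
#include <stdint.h>

/* Each bit is one interrupt source. An interrupt reaches the processor
   only if its bit in the interrupt mask register is set (assumed polarity). */
uint32_t interrupts_to_processor(uint32_t pending, uint32_t mask_register)
{
    return pending & mask_register;
}
```

Software masks an interrupt by clearing its bit in the mask register; the pending request stays visible to the hardware but never reaches the processor.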
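Timing construct
AND GATE: inputs A and B, output Y
intrinsic delay (delay within the gate):
A->Y 1ns
B->Y 1ns (same Y)
If A changes at time 0 and B was stable, then Y will come out at 1ns… why? Because A->Y = 1ns.
If B changes at time 0.5ns, Y will change at what time? --> 1.5ns…
Between 0.5 and 1.5 the value of Y is unstable… or there might be glitches…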
Completion time of A = 1ns

Completion time of B = 1ns
Completion time of C = 1.5ns
What time will join really happen

1.5ns
hi Sir, what are the software we should get a hands-on as part of this course?
5:59 PM
1 'C', 'C++'
2 Verilog HDL / VHDL
3 SystemC
Tools
1 g++ (GNU C / C++ compiler).
2 ModelSim / QuestaSim --> download it from the Intel website… (check for license…)
3 SystemC --> many freeware options may be available… Don’t know for sure.
4 ModelSim / QuestaSim --> may support mixed Verilog, VHDL and SystemC.
5 People from the ASIC world… VCS (Synopsys), Xcelium (Cadence) all support mixed Verilog, SystemC, VHDL.
6 Xsim (Xilinx) --> download it from the Xilinx website… (may support Verilog, VHDL)… Don’t know about SystemC.
Complete Application
Hardware Software
Implementation Implementation
Verilog HDL C
SystemVerilog C++
Python
VHDL
1> Since tapeout is expensive -- we must do simulation and co-debug (debug of HW and SW) (also includes board debug)
2>
nand/nor flash mem

ddr memory
usb DUT
uart
pcie
eth
Simulation On a linux box or server, we simulate our Verilog or VHDL design with SW.
Slower (wall clock of 250kHz approx)
Observability is high
For basic bringup it's good
Emulation Usage of either FPGA, array of boolean processors, or hybrid parallel processors to speed up simulation
Faster (wall clock of say 1MHz to 4MHz approx)
Observability is low
Post basic bringup for running higher level of sw or boot-rom-firmware
KNOBS
SPEED OBSERVABILITY COST
High Good Emulation Simulation Emulation

FPGA
Medium Simulation Emulation FPGA
Low Simulation
Full RTL simulation of Processor: each and every node of the processor RTL is simulated
Bus Functional Model (BFM): the processor internals are accelerated (by just modeling, and it's not cycle accurate), but the external world is cycle accurate. (DONUT)
Cycle Accurate Model: if an instruction takes, say, 3 clock cycles, the model allows 3 clock cycles to elapse, and is accurate to BFM
Instruction Set Simulation of Processor: only instructions are executed, as if it were a pure 'C' model of the processor
Concept of LUT [Look Up Table]
Boolean Equation Y = A* B + C * D
4 Input LUT (Look Up Table)

decimal A B C D Y
0 0 0 0 0 0
1 0 0 0 1 0
2 0 0 1 0 0
3 0 0 1 1 1
4 0 1 0 0 0
5 0 1 0 1 0
6 0 1 1 0 0
7 0 1 1 1 1
8 1 0 0 0 0
9 1 0 0 1 0
10 1 0 1 0 0
11 1 0 1 1 1
12 1 1 0 0 1
13 1 1 0 1 1
14 1 1 1 0 1
15 1 1 1 1 1
static RAM: the values will remain until the FPGA is programmed again

Later, you can re-program it for another design.
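The truth table above is exactly what the LUT's static RAM stores: 16 bits, one per input combination. A C sketch of that idea follows, with A as the most significant index bit as in the table; the function names are illustrative.

```c
#include <stdint.h>

/* Builds the 16-bit LUT contents for Y = A*B + C*D.
   Bit number "index" holds Y for inputs {A,B,C,D} = index (A is the MSB). */
uint16_t lut4_build(void)
{
    uint16_t init = 0;
    for (unsigned index = 0; index < 16; index++) {
        unsigned a = (index >> 3) & 1u;
        unsigned b = (index >> 2) & 1u;
        unsigned c = (index >> 1) & 1u;
        unsigned d = index & 1u;
        if ((a & b) | (c & d)) {
            init |= (uint16_t)(1u << index);
        }
    }
    return init;
}

/* Looks up Y: the inputs act as the select lines of a 16:1 MUX
   reading the stored bits. */
unsigned lut4_read(uint16_t init, unsigned a, unsigned b, unsigned c, unsigned d)
{
    unsigned index = ((a & 1u) << 3) | ((b & 1u) << 2) | ((c & 1u) << 1) | (d & 1u);
    return (init >> index) & 1u;
}
```

For this equation the stored pattern is 0xF888 (ones at rows 3, 7, 11 and 12 to 15). Re-programming the FPGA for another design amounts to loading a different 16-bit pattern.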
load to register (from memory)
direct store to memory (from register)
indirect store to memory (from register)
load to register immediate
Addition operation
Subtraction operation
Jump if a register value is zero.

Jump to a location given by
the relative value
Registers
Immediate value
direct value
Memory range
PC range
We have a We want to write a C

instruction Set. Call program for an
this as Processor 1 application
We have a
instruction Set. Call
this as Processor 2
Practice: Create an assembly routine to do the following using the trivial instruction set processor:
1 unsigned int x;
2 unsigned int y;
3 unsigned int a, b, c, d, e;
<in assembly, say r0, r1 are x, y respectively>
<in assembly, say r2, r3, r4, r5 are b, c, d, e> And you can choose r10 as a;
4 if (x < y) { a = b + c; } else { a = d + e; }
Steps
1 Create a flow chart corresponding to the trivial instruction set processor
2 From the flow chart, start mapping blocks to assembly language
3 Using paper / pen & in the mind, work out the correctness of the solution
example1 x = 5, y = 8; b = 4, c = 3, d = 2, e = 1;
example2 x = 8, y = 5; b = 4, c = 3, d = 2, e = 1;
Hints
Gautam Amankumar .
we can keep subtracting 1 from both, and the first register that gets to zero will be the smaller one, so we can jump accordingly
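The hint can be checked in C before writing the assembly. The sketch below uses only the operations the trivial instruction set offers (subtract, test for zero, jump); the function names are illustrative, and testing y before x is a choice made here so that x == y takes the else branch.

```c
/* Returns 1 if x < y, using only "decrement" and "is zero" tests. */
int less_than_by_decrement(unsigned x, unsigned y)
{
    for (;;) {
        if (y == 0) {
            return 0;   /* y ran out first, or both together: x >= y */
        }
        if (x == 0) {
            return 1;   /* x ran out first: x < y */
        }
        x--;
        y--;
    }
}

/* a = (x < y) ? b + c : d + e */
unsigned select_sum(unsigned x, unsigned y, unsigned b, unsigned c, unsigned d, unsigned e)
{
    return less_than_by_decrement(x, y) ? b + c : d + e;
}
```

With example1 (x = 5, y = 8) the result is b + c = 7; with example2 (x = 8, y = 5) it is d + e = 3. The loop takes min(x, y) iterations, so it is only practical for small values.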
HW (in parallel)
Three stages, STEP1, STEP2 and STEP3, each with its own FSM, connected through ping/pong memory pairs (pingmem, pongmem) and ending in the final output
1 Each state machine is running concurrently
2 But data is flowing in a pipeline fashion
3 ping - pong memory concept
`-- When ping memory is in write phase, the pong memory is in read phase
`-- When ping memory is in read phase, the pong memory is in write phase
`-- There will be some point in time when the ping and pong memories switch their roles.
SW (in series)
1 function step1
2 function step2
3 function step3
4 main {
call function 1 t0
call function 2 t1
call function 3 t2
}
Asynchronous
Synchronous
e peripheral.
So glue logic is needed (bus bridge, or bus gasket, ….) AXI3 to AXI4, AHB to AXI, ….
fetch the first instruction…

utside the SoC chip).....
ing the first instruction from either the ROM or NAND Flash…
top
uP ROM
BUS
RAM Peripheral
RAM mem Peripheral
FrontEnd logic State structural

M/C O/P Logic scope of o/p logic from top : top.peripheral.op_logic
If there is no hierarchy, it is difficult to scale the system.

LEVEL 0 add & sub each individually take 2ns If sharing of adder not done
3 add / sub
1 mul
If sharing of adder done

LEVEL 1 multiply takes 4ns 2 add/sub
1 mul
If performance is not critical

1 add/sub
1 mul
LEVEL 2 add takes 2ns Sharing also involves multiplexing

the previous outputs and
maybe storing for later use..
is ok… then I will design with single adder… Requirement for Performance….
them first. (sensor, DAQ module, …).
time (like say 1ms to detect a frequency).

d profile it…. Do you see any patterns.
again, or are leading to a non-real-time nature… Identify them….
same, using better memory and data structures, better handling of interrupts, external Ios...
) (write has another separate memory), …

cost by putting it in SW…
m (processors, memory controllers, usb, serial ports, …..)
T, DSP blocks, memory blocks, pll blocks, ….)
mask register]
s and B was stable, then Y will come out in 1ns… why A->Y = 1ns
B changes at time 0.5ns Y will change at what time --> 1.5ns…
0.5 1 1.5 2
stable stable stable stable
changed
Y chnge due to A
Y chnge due to B
between 0.5 and 1.5 the value of Y is unstable… or there might be glitches…

top
M1 M1 M2
Logi logic of
top level
ds-on as part of this course?
ntel website… (check for license…)

… Don’t know for sure.
mixed Verilog, VHDL and SystemC.
elium (Cadence) all support mixed Verilog, SystemC, VHDL.
ebsite… (may support verilog, vhdl)… Don’t know about SystemC.
f HW and SW) (Also includes Board debug)
TB Simulator slower to run

nand/nor flash mem Emulator not full scale freq, but faster
ddr memory than simulation
spi
i2c
display interface
jesd
log or VHDL design with SW.

processor freq = 1GHz
ssors, Hybrid parallel processors to speed up simulation

processor freq = 1GHz
sw or boot-rom-firmware
CAPACITY (Gate count)
Emulation
Simulation
FPGA
ssor RTL is simulated
rated (by just modeling and it's not cycle accurate), but external world is cycle
cycles, the model allows 3 clock cycles to elapse, and is accurate to BFM
if like a pure 'C' model of the processor
Implementation of the Look Up Table

Assume that you have a 16 : 1 MUX
0 0
0 1
0 2
1 3
0 4
0 5 Y RST Q
0 6 output output of the LUT D
1 7 FF
0 8 CLK
0 9
0 10
1 11
1 12
1 13
1 14
1 15 MUX 16:1 1. Sometimes you want only combinatori
2. Sometimes you want combinatorial log
static
ram
of the LUT S3,S2,S1,S0
A,B,C,D input signals of the LUT
stimulus to the LUT & Design
R0 to R15
Immediate value 0 to 0xFF (255)
direct value 0 to 0xFF (255)
Memory range 0 to 0xFF (255)
0 to 0xFF (255)
For this 'C' application;

we want to map it to a
specific processor.
COMPILER
We want to write a C [Convert frm 'c' to
program for an assembly --
application COMPILER] [then
convert from assembly
to executable (binary)
[LINKER & LOADER]]
1 it could be processor1 (say trivial instruction set)
2 it could be some another processor (say arm)
3 it could be another processor (say 8085)
4 it could be a NEW processor also, and you are trying to figure out
the best instruction set for the NEW processor
the folloiwng using the trivial instruction set processor :
b, c, d, e> And you can choose r10 as a;
rresponding to the trivial instruction set processor

tart mapping blocks to assembly language
the mind, work out the correctness of the solution
, d =2, e = 1;
oth and the first register gets to zero will be smaller so we can jump accordingly
pingmem pongmem
final output
STEP3
(own fsm)
ry is in read phase
y is in write phase
ong memory switch their roles.
Hardware Side Additions

IO's Consoles
SSD Disks Disk & CD's
Graphics Card Consoles
Wireless radio Networking
Virtualization and cloud Networking & Resource Allocation
Sensors Connectivity
Converters (USB 2 UART).. Connectivity
Bluetooth… Connectivity
Security Security in OS
peripheral.op_logic
lt to scale the system.

f adder not done
f adder done
ance is not critical
o involves multiplexing
us outputs and
ring for later use..
2:1 MUX output of the CLB
goes either to IO, or the FPGA interconnect
static ram of the CLB

CLB = Configurable Logic Block OUTPUT
mes you want only combinatorial logic

mes you want combinatorial logic + sequential cell
LOAD instruction : loads data to an internal processor register. M(direct) = value (memory whose address is (direct))
Example of Rn = M(direct)
Assembly MOV R0, 0x5
R0 <- M(0x5) = d5
nibble3 nibble2 nibble1 nibble0
Binary 0000 0000 0000 0101
nibble3 OPCODE --> because it helps us to understand what kind of instruction it is.
nibble2 Rn --> it helps us to figure out which register is being addressed…
Because it is a 4-bit-wide field --> how many registers can be addressed?
We can address 16 registers, call them R0 to R15
{nibble1, nibble0} --> gives the address of the memory location which needs to be accessed.
Assembly MOV direct, Rn --> stores data from a register to memory (whose address is directly given)
Example MOV 0x8, R15
nibble3 nibble2 nibble1 nibble0
Binary 0001 1111 0000 1000
Assembly MOV @Rn, Rm --> stores data from an internal register Rm to the memory (whose address is given by internal register Rn)
Example MOV @R5, R7
Binary 0010 XXXX 0101 0111
Assembly MOV Rn, #immed --> loads the #immediate value to the internal register Rn
Example MOV R12, #0xB
Binary 0011 1100 0000 1011
Assembly ADD Rn, Rm --> adds the values in the registers Rn and Rm, and the result is stored in Rn
Example ADD R9, R10
Binary 0100 XXXX 1001 1010
Assembly SUB Rn, Rm --> subtracts the values in the registers Rn and Rm (Rn - Rm), and the result is stored in Rn
Example SUB R9, R10
Binary 0101 XXXX 1001 1010
Assembly JZ Rn, relative --> PC (Program Counter) is the pointer to the next instruction to be executed
--> The PC jumps to a new location when a condition happens.
--> What is the condition : that the register Rn is '0'
--> What is the new PC value : PC = current PC + relative value
--> If Rn is not zero, then the PC will continue its normal operation.
--> What is normal operation : the PC moves to the next instruction.
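The "large case statement" idea for an ISS can be sketched in C as follows. Several details are assumptions, not taken from the notes: registers are 8 bits wide, data and program memory are separate 256-entry arrays, JZ uses opcode 0110, "relative" is a signed 8-bit offset from the address of the JZ instruction itself, and any unknown opcode stops the simulator. The opcodes 0000 to 0101 and the nibble positions follow the binary examples above. The small program is one possible "sum of 10 numbers" routine using only JZ.

```c
#include <stdint.h>
#include <string.h>

enum { OP_LOAD = 0x0, OP_STORE = 0x1, OP_STORE_IND = 0x2, OP_LOAD_IMM = 0x3,
       OP_ADD = 0x4, OP_SUB = 0x5, OP_JZ = 0x6 /* JZ opcode is assumed */ };

typedef struct {
    uint8_t  reg[16];    /* R0..R15, assumed 8 bits wide            */
    uint8_t  mem[256];   /* data memory, addressed by (direct)      */
    uint16_t prog[256];  /* program memory, one instruction per PC  */
    uint8_t  pc;
} iss_t;

/* Executes one instruction. Returns 0 on an unknown opcode, else 1. */
int iss_step(iss_t *s)
{
    uint16_t instr = s->prog[s->pc];
    uint8_t opcode = (instr >> 12) & 0xF;   /* nibble3 */
    uint8_t n2 = (instr >> 8) & 0xF;        /* nibble2 */
    uint8_t n1 = (instr >> 4) & 0xF;        /* nibble1 */
    uint8_t n0 = instr & 0xF;               /* nibble0 */
    uint8_t low8 = instr & 0xFF;            /* {nibble1, nibble0} */
    uint8_t next_pc = (uint8_t)(s->pc + 1);

    switch (opcode) {
    case OP_LOAD:      s->reg[n2] = s->mem[low8];                       break;
    case OP_STORE:     s->mem[low8] = s->reg[n2];                       break;
    case OP_STORE_IND: s->mem[s->reg[n1]] = s->reg[n0];                 break;
    case OP_LOAD_IMM:  s->reg[n2] = low8;                               break;
    case OP_ADD:       s->reg[n1] = (uint8_t)(s->reg[n1] + s->reg[n0]); break;
    case OP_SUB:       s->reg[n1] = (uint8_t)(s->reg[n1] - s->reg[n0]); break;
    case OP_JZ:
        if (s->reg[n2] == 0) {
            int offset = (low8 < 128) ? low8 : (int)low8 - 256;
            next_pc = (uint8_t)(s->pc + offset);
        }
        break;
    default:
        return 0;
    }
    s->pc = next_pc;
    return 1;
}

/* Runs until an unknown opcode is hit or max_steps instructions are done. */
void iss_run(iss_t *s, unsigned max_steps)
{
    while (max_steps-- != 0 && iss_step(s)) {
    }
}

/* Loads: total = 0; for (i = 10; i != 0; i--) total += i; M(0x20) = total */
void load_sum_program(iss_t *s)
{
    static const uint16_t code[] = {
        0x3000, /* 0: MOV R0, #0     total = 0                  */
        0x310A, /* 1: MOV R1, #10    i = 10                     */
        0x3201, /* 2: MOV R2, #1     constant 1                 */
        0x3300, /* 3: MOV R3, #0     always zero, used to jump  */
        0x6104, /* 4: JZ  R1, +4     if i == 0 go to 8          */
        0x4001, /* 5: ADD R0, R1     total += i                 */
        0x5012, /* 6: SUB R1, R2     i--                        */
        0x63FD, /* 7: JZ  R3, -3     R3 is 0, so go back to 4   */
        0x1020, /* 8: MOV 0x20, R0   M(0x20) = total            */
        0xF000  /* 9: unused opcode, stops this simulator       */
    };
    memset(s, 0, sizeof *s);
    memcpy(s->prog, code, sizeof code);
}
```

After load_sum_program and iss_run, data memory location 0x20 holds 55. A fuller ISS would also log which register or memory location each instruction updated, which is the kind of output the notes ask for.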
Memory (M)
Address  Instruction (in binary)
0x0  i0
0x1  i1
0x2  i2
0x3  i3
0x4  i4
0x5  i5
0x6  i6
0x7  i7
0x8  i8
0x9  i9
0xA  i10
0xB  i11
0xC  i12
e Allocation
M(direct) = Value (Memory whose address is (Direct))
Memory (M)
Address  Data (Value)
0x0  d0
0x1  d1
0x2  d2
0x3  d3
0x4  d4
0x5  d5
0x6  d6
0x7  d7
0x8  d8
0x9  d9
0xA  d10
0xB  d11
0xC  d12
which needs to be accessed.
hose address is directly given)
o the Memory (whose addes is given by internal Register Rn)
l Register Rn
m and the result is stored in Rn
d Rm (Rn - Rm) and result is stored in Rn

next instruction to be executed
ndition happens.
C + relative value
ue its operation..
ext instruction.
1 Embedded Systems, Embedded Software,
2 Microelectronics
3 Automotive Electronics 1
4 Process control Engg. 2
5 V&V (Validation and Verification) 3
6 Virtual Prototyping 4
7 Industrial process automation 5
8 VLSI, GNSS 6
9 Functional Safety. 7
10 IoT, IO controller (chips) 8
11 Telecom
12 STA
13 Space and Defence
14 ESS (Energy Storage Systems) & UPS
15 Train Control
16 Numerical Programming
17 Highspeed HW design
18 WLAN S/W
19 Automotive (Rail)
20 PCB Design for space, defence, telecom, automotive,
21 DRAM verification
22 Healthcare product
23 Firmware Development
24 LabVIEW
25 Battery Management Systems
26 Data Radio & LTE Radios
27 Bluetooth SW Validation
28 SoC Validation
29 E-Loco
30 System integration
31 Physical design of SoC
32 Audio processing DSP Embedded systems
33 ATE (Automatic Test Equipment)
34 Civil Aviation product certification
35 ADAS (Automotive)
36 Radar SW Testing
37 Lighting
38 BootLoader Engineer
39 Automotive Hardware design
40 Software Test Automation
41 Platform Software AutoSar
42 Railway Signalling
43 Wire Harness Design and Installation
44 Industrial control box for HVAC application
45 Linux Kernel development, Networking, Security
45 Aero Engine Control,
46 Off Highway Vehicles (Tractors)
47 Avionics
48 Fuel and CNG Vehicle development
49 UFO and Drones
Switch statement MUX kind of implementation

if-then-else PRIORITY Encoding
C
a = 0xFFFF_FFFF
b = 0x1
e = a + b: what happens? With 32-bit values e = 0 --> overflow condition !!
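The overflow case can be reproduced in C with unsigned 32-bit arithmetic, which wraps modulo 2^32. The function name and the carry output are illustrative; the carry plays the role of the 33rd bit that a 32-bit adder drops.

```c
#include <stdint.h>

/* Adds two 32-bit values. The sum wraps around on overflow; *carry is
   set to 1 when that happens (the wrapped sum is then smaller than a). */
uint32_t add32(uint32_t a, uint32_t b, int *carry)
{
    uint32_t sum = a + b;
    *carry = (sum < a);
    return sum;
}
```

With a = 0xFFFFFFFF and b = 0x1 the sum is 0 and the carry is 1, matching the overflow condition in the notes.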
clock 1 0 1 0 1 0
rising edge of clock or any signal !! falling edge of clock or any signal !!
POSEDGE NEGEDGE
https://class.ece.uw.edu/371/peckol/doc/Always@.pdf
a
PLUS AS A
COMBO
LOGIC D
b Q
CLK
clk
set of 32 flipflops
For COLUMN 13
DSP1 MIPS
PERIPHERALS
For COLUMN 2
DSP1 MIPS
PERIPHERALS
INSTR- INSTR- INSTR- INSTR-

ALU1 ALU2 ALU3 ALU4
ALU1 ALU2 ALU3 ALU4
FPGA Architecture
PS = Programmable subsystem
1> set of Application Processors,
2> optional real-time processors, optional DSPs, optional GPUs, optional hardware accelerators for, say, H.261 or codec or similar.
3> set of peripherals -- [High speed -- pcie, GBEthernet, usb, DDR controller, DMA controllers …] [Lower speed -- uart, spi, flash interface, usb, i2c, gpio's …]
4> Interface to talk with the PL
PL = Programmable Logic subsystem
1> CLBs
2> clock blocks, memory blocks, dsp blocks, some HW accelerators (ethernet, pcie, …), serdes,
3> basic IOs
4> Interface to talk with the PS
Complexity with FPGAs:

1 Both SW and HW have to work…
2 The SW team reads code from top to bottom… That is the view…
3 The HW team reads code from left to right… (waveform, concurrency)…
4 SW & HW teams need to sit together to debug… Need for knowledge of the domain we are working in…
Instruction-Level Parallelism: can happen if the "XPU" has multiple ALU pipelines
example 1 Parallelism is possible, as the two instructions are not dependent on each other
z=x+y
a=b+c
example 2 Parallelism is not possible, because the second instruction is dependent on the first
z=x+y
a=z+c
Instructions which are good to have, to make it a somewhat more complete ISA
1 load indirect Rn = M(Rm)
2 jump unconditional JMP <relative> ; PC = PC + relative
3 jump if not-zero JNZ Rn, <relative> ; PC = PC + relative if Rn is not equal to zero

Data Memory: locations 0 to 255, with a0 … a9 stored from address 0
Program Memory: locations 0 to 255, with Instr0, Instr1, … stored from address 0

For addition of 10 numbers
Method 1 Use unrolling of the loop
Method 2 Use JZ when JNZ is not present… like you start from i = 10, and do i-- until you reach i = 0…
Method 3 Say you add a new instruction which is Rm = M(Rn); then the load part becomes simpler…
Boards
Raspberry Pi
Terasic FPGA Development Kits
MicroSemi or Xilinx or Altera Dev Kits
STM32 Discovery Kit
BeagleBone Black
Teensy ?
pico uC board
switch priority encoding
A0
A1
4:1 MUX 2:2MUX 2:1MUX
A2
A3
O0
S0 S1
How
you'll
define
Define Scope
A A A
B B A.B
C A.B.C
D A.B.C.D
D
BEHAVIORAL CODE
module alu ( input [31:0] a, input [31:0] b, input [31:0] c,
             output [31:0] d
);
wire [31:0] e;
assign e = a + b;
assign d = e - c;
endmodule
STRUCTURAL CODE
module plus ( input [31:0] i1, input [31:0] i2,
              output [31:0] o1
);
assign o1 = i1 + i2;
endmodule
module minus ( input [31:0] i1, input [31:0] i2,
               output [31:0] o1
);
assign o1 = i1 - i2;
endmodule
module alu ( input [31:0] a, input [31:0] b, input [31:0] c,
             output [31:0] d
);
wire [31:0] e;

// instantiate the module plus here
plus plus_i ( .i1(a), .i2(b), .o1(e) );
// instantiate the module minus here
minus minus_i ( .i1(e), .i2(c), .o1(d) );
endmodule
[12:17 PM] BHAGWAT RAJESH MARUTI .
I found this playlist very useful for SystemC ::: covers both design nd
Learn SystemC (1) - Introduct…
e of clock or any signal !!
set of 32 flip flops one behind the other but it will be numbered
31 down to 0
Assembly level code (and other object files like ELF, executable and linkable format)
Instruction        Clocks
instr1             1 clock
instr2             2 clocks
instr3             1 clock
instr4             6 clocks
call functionN     10 clocks
…
functionN          1 clock
instrN1            2 clocks
instrN2            1 clock
….
return from function
Sum all the # clocks taken: TOTAL NN clocks

Profiling questions
1 Can we calculate the processing time and the deadlines (real time system)?
2 Can we reduce the processing time by having inline assembly for key functions?
3 Can we figure out if some function is called many-many times, so that we can build an accelerator out of it, or create a new instruction in the processor (profiling and seeking HW functions)?
a> Create an instruction in the Processor itself.
b> Create an accelerator (push the data, wait till it is done, read the processed output back).
Take care of loops in the SW to compute… (foreach, while, loops…)
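Summing the clocks and checking them against a deadline can be sketched as follows. The function names and the clock frequency and deadline parameters are hypothetical; only the per-instruction clock counts listed above come from the notes, and the elided "…" entries are not counted.

```c
#include <assert.h>
#include <stddef.h>

/* Adds up the per-instruction clock counts. */
unsigned long total_clocks(const unsigned *clocks, size_t n)
{
    assert(clocks != NULL);
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++) {
        total += clocks[i];
    }
    return total;
}

/* Returns 1 if the given number of clocks, run at clock_hz,
   finishes within deadline_ns nanoseconds. */
int meets_deadline(unsigned long clocks, unsigned long clock_hz,
                   unsigned long deadline_ns)
{
    assert(clock_hz > 0);
    double time_ns = (double)clocks * 1e9 / (double)clock_hz;
    return time_ns <= (double)deadline_ns;
}
```

With the eight listed counts (1, 2, 1, 6, 10, 1, 2, 1) the total is 24 clocks, which at an assumed 100 MHz clock is 240 ns.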

How many ways you can do each task
Task   Ways
T1     1
T2     2
T3     3
T4     2
Total combinations = 1 x 2 x 3 x 2 = 12
Task   Component   Incremental time taken   Total time taken (ms)
T1     MIPS        5ms                      5ms
T2     DSP1        20ms                     25ms
T3     DSP1        18ms                     43ms
T4     DSP1        5ms                      48ms

Task   Component   Incremental time taken   Total time taken (ms)
T1     MIPS        5ms                      5ms
T2     DSP1        20ms
T3     DSP2        18ms                     25ms (T2 and T3 run in parallel: add worst performance of T2 or T3)
T4     MIPS        2ms                      27ms
SoC with DSP2 (blocks shown: DSP2, ROM, RAM, PERIPHERALS)
Area = 440
Time taken to complete the task graph = 27ms

SoC without DSP2 (blocks shown: ROM, RAM, PERIPHERALS)
Area = 320
Time taken to complete the task graph = 48ms
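The two completion times can be reproduced with a short sketch. It assumes the task graph is T1, then T2 and T3 forked, then T4 after the join. The function names are hypothetical. When T2 and T3 share DSP1 they run one after another; when T3 moves to DSP2 they overlap and only the slower of the two counts.

```c
#include <assert.h>

static int max_int(int a, int b)
{
    return a > b ? a : b;
}

/* T2 and T3 share one component, so all four tasks run back to back. */
int serial_time_ms(int t1, int t2, int t3, int t4)
{
    assert(t1 >= 0 && t2 >= 0 && t3 >= 0 && t4 >= 0);
    return t1 + t2 + t3 + t4;
}

/* T2 and T3 run in parallel on separate components: fork after T1,
   join before T4, so add the worst of T2 and T3. */
int fork_join_time_ms(int t1, int t2, int t3, int t4)
{
    assert(t1 >= 0 && t2 >= 0 && t3 >= 0 && t4 >= 0);
    return t1 + max_int(t2, t3) + t4;
}
```

`serial_time_ms(5, 20, 18, 5)` gives the 48ms case and `fork_join_time_ms(5, 20, 18, 2)` gives the 27ms case, at the cost of the extra area (440 against 320).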
128-bit instruction = say 32 bits per instruction * 4 ALUs.

Increasing the parallelism in the computation.
Ps, optional GPUs, optional Hardware Accelerators for say H.261 or codec or similar.
BEthernet, usb, DDR controller, DMA controllers …] [Lower Speed -- uart, spi, flash interface, usb, i2c, gpio's ….]
me HW Accelerators (ethernet, pcie, …), serdes,
at is the view….
eform, concurrency)….
… Need for knowledge of the domain we are working….
pendent on each other
dependent on the first

JNZ Rn, <relative> ; relative path: PC = PC + relative if Rn is not equal to zero.
Method 1: unrolling of the loop.
Method 2: use the JZ when the JNZ is not present… like you start from i = 10, and do i-- until you reach i = 0…
Method 3: add a new instruction which is Rm = M(Rn); then the load part becomes simpler…
101376 pixels (~101K pixels) = ~304K bytes, assuming R, G, B are 8 bits each.
A JPG file uses lossy compression… YET we do not see any visible differences… at least visually both images look the same.
HW inputs: R, G, B
VGA input = 640 x 480; this is the frame size, 30 frames per second…
640 x 480 = 307200, ~307K pixels per frame
For each pixel's R, G, B value… to get one Y output: 2 additions, 3 mul, 3 div…

Assume addition = 1 clock cycle
mul = 9 clock cycles
div = 9 clock cycles
So, to get one Y sample output (from R, G, B to Y out): minimum (2 x 1) + (3 x 9) + (3 x 9) = 56 clocks
For Y, Cb and Cr together: 3 x 56 = 168 clocks per pixel
What should be my CPU frequency, so that I can do this VGA input RGB to YCbCr computation within the 30 frames per second?
307200 x 168 x 30 = 1548288000 clocks per second
~1.55GHz only to do RGB to YCbCr computation using SW….
We did not even consider the memory requirements…
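The frequency estimate can be checked with a few lines of C. The function names are hypothetical; the operation counts (2 additions, 3 mul, 3 div per output), the clock costs (1, 9, 9) and the frame parameters are the ones assumed in the notes.

```c
#include <assert.h>
#include <stdint.h>

/* Clocks needed for one output component (for example Y). */
uint64_t clocks_per_component(unsigned adds, unsigned muls, unsigned divs,
                              unsigned add_clk, unsigned mul_clk,
                              unsigned div_clk)
{
    return (uint64_t)adds * add_clk + (uint64_t)muls * mul_clk
         + (uint64_t)divs * div_clk;
}

/* Minimum clock frequency in Hz to process every pixel of every frame. */
uint64_t required_hz(unsigned width, unsigned height, unsigned fps,
                     uint64_t clocks_per_pixel)
{
    assert(width > 0 && height > 0 && fps > 0);
    return (uint64_t)width * height * fps * clocks_per_pixel;
}
```

`clocks_per_component(2, 3, 3, 1, 9, 9)` is 56, and `required_hz(640, 480, 30, 3 * 56)` is 1548288000, matching the figure above.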
Coefficients
C00, C01, C02
C10, C11, C12
C20, C21, C22
INPUTS: R, G, B
Step 1 (multiply): R * C00, G * C01, B * C02
Step 2 (add): (R*C00) + (G*C01)
Step 3 (add): ((R*C00) + (G*C01)) + (B*C02)
OUTPUTS: Y
Option 1 All the Y, Cb, Cr multipliers and adders are in parallel
Option 2 Do it one-by-one: first Y, then Cb, then Cr… this will reduce the area
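The multiply-then-add structure can be written out in C as a sketch. The notes do not give coefficient values, so the caller supplies them. The function name `color_transform`, the integer types and the row-by-row coefficient layout are assumptions. The notes only show row 0 producing Y; rows 1 and 2 are assumed to produce Cb and Cr in the same way.

```c
#include <assert.h>

/* coeff holds C00..C22 row by row; rgb = {R, G, B}; out = {Y, Cb, Cr}.
   Each row does three multiplies followed by two additions. */
void color_transform(const int coeff[9], const int rgb[3], int out[3])
{
    assert(coeff != 0 && rgb != 0 && out != 0);
    for (int row = 0; row < 3; row++) {
        int p0 = rgb[0] * coeff[3 * row + 0];  /* R * Cx0 */
        int p1 = rgb[1] * coeff[3 * row + 1];  /* G * Cx1 */
        int p2 = rgb[2] * coeff[3 * row + 2];  /* B * Cx2 */
        out[row] = (p0 + p1) + p2;             /* first adder, second adder */
    }
}
```

In software the three rows run one after another, as in Option 2. In hardware, Option 1 would give each row its own multipliers and adders so that all three outputs appear together.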
Pixel values by line and column (columns 0 to 4 shown)
columns    0   1   2   3   4
line 0     0   1   2   3   4
line 1    11  12  13  14  15
line 2    22  23  24  25  26
line 3    33  34  35  36  37
line 4    44  45  46  47  48
line 5    55  56  57  58  59
line 6    66  67  68  69  70
line 7    77  78  79  80  81
lines 8 to 20: not filled in
DCT
HW When we need to do it in real time…
SW When we can do it in offline mode…
Full port lists and instantiations for the modules above:
alu: input [31:0] a, input [31:0] b, input [31:0] c, output [31:0] d
plus: input [31:0] i1, input [31:0] i2, output [31:0] o1
minus: input [31:0] i1, input [31:0] i2, output [31:0] o1
plus plus_i ( .i1(a), .i2(b), .o1(e) );
minus minus_i ( .i1(e), .i2(c), .o1(d) );
SystemC playlist (covers both design and testbench): https://www.youtube.com/watch?v=NCFxBGLB5xs&list=PLcvQHr8v8MQLj9tCYyOw44X1PLisEsX-J

Average of 10 numbers done completely in HW
Inputs: n0 n1 n2 n3 n4
1st clock cycle: Plus Plus Plus
2nd clock cycle: PLUS
3rd clock cycle: PLUS
4th clock cycle

function N: "S" clocks if it were done in SW
function N: "H" clocks if it were done in HW
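The adder tree can be modelled in C to count how many clock cycles (levels of adders) the HW needs. This is a sketch: the function names are hypothetical, it assumes each level of adders takes one clock cycle, and it assumes at most 16 inputs.

```c
#include <assert.h>

/* Number of adder levels needed to reduce n inputs pairwise to one sum. */
int adder_tree_levels(int n)
{
    assert(n >= 1);
    int levels = 0;
    while (n > 1) {
        n = (n + 1) / 2;  /* each level halves the count, odd one passes through */
        levels++;
    }
    return levels;
}

/* Sums the inputs level by level, the way the adder tree would. */
int adder_tree_sum(const int *in, int n)
{
    assert(in != 0 && n >= 1 && n <= 16);
    int buf[16];
    for (int i = 0; i < n; i++) {
        buf[i] = in[i];
    }
    while (n > 1) {
        for (int i = 0; i < n / 2; i++) {
            buf[i] = buf[2 * i] + buf[2 * i + 1];  /* one "Plus" block */
        }
        if (n % 2 == 1) {
            buf[n / 2] = buf[n - 1];  /* unpaired value moves to the next level */
        }
        n = (n + 1) / 2;
    }
    return buf[0];
}
```

For 10 inputs the tree has 4 levels (10, 5, 3, 2, 1), which agrees with the 4 clock cycles in the figure. The average is the tree sum divided by 10. In SW the same sum takes at least 9 sequential additions plus the loads.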
Total time when T2 and T3 run in parallel: add worst performance of T2 or T3….
R, G, B is 8 bits each; visually both images look the same.
HW outputs: Y, Cb, Cr
Per Y output (2 additions, 3 mul, 3 div…): minimum 56 clocks
For Y, Cb and Cr: 168 clocks
in HW
3 Multipliers   9 clocks
1 Adder         1 clock
1 Adder         1 clock
Total           11 clocks
Option 1: all are in parallel: 11 clocks
Option 2: first Y, then Cb, then Cr… this will reduce the area (of the HW), but it will take more time… (3x of one computation): 33 clocks

8 x 8 block for DCT purposes
IN RAM MEMORY
Pixel values by line and column (columns 5 to 10 shown)
columns    5   6   7   8   9  10
line 0     5   6   7   8   9  10
line 1    16  17  18  19  20  21
line 2    27  28  29  30  31  32
line 3    38  39  40  41  42  43
line 4    49  50  51  52  53  54
line 5    60  61  62  63  64  65
line 6    71  72  73  74  75  76
line 7    82  83  84  85  86  87

RAM layout (Address, Data): addresses 0 to 24 are shown; the first line of the image is stored first, followed by the second line.
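Storing the image line after line in RAM means a pixel's address follows from its line and column. A minimal sketch, assuming one pixel per RAM location, base address 0 and an image width of 11 columns as in the grid above. The function names are hypothetical.

```c
#include <assert.h>

/* Address of a pixel when lines are stored back to back in RAM. */
int pixel_address(int line, int column, int width)
{
    assert(line >= 0 && column >= 0 && column < width);
    return line * width + column;
}

/* Copies the 8 x 8 block whose top-left pixel is (line0, col0),
   for example as the input to a DCT. */
void read_block8x8(const int *ram, int width, int height,
                   int line0, int col0, int block[8][8])
{
    assert(ram != 0 && line0 >= 0 && col0 >= 0);
    assert(line0 + 8 <= height && col0 + 8 <= width);
    for (int r = 0; r < 8; r++) {
        for (int c = 0; c < 8; c++) {
            block[r][c] = ram[pixel_address(line0 + r, col0 + c, width)];
        }
    }
}
```

With width 11, line 1 column 0 is at address 11 and line 7 column 10 is at address 87, the same numbers as in the grid. One row of an 8 x 8 block is contiguous in RAM, but consecutive rows are 11 addresses apart.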
Bus structure to store and load data from memory

Address, data, clocking, control signals and sideband control signals
Blocks on the bus: Processor (which can do memory and control tasks), MEMORIES, Accelerator 0, Accelerator 1, Accelerator 2, Accelerator 3
Average of 10 numbers in HW (right half of the adder tree)
Inputs: n5 n6 n7 n8 n9
1st clock cycle: Plus Plus Plus
2nd clock cycle: PLUS
3rd clock cycle: PLUS
PIPELINE DRIVEN CONCURRENCY : HOW IT IMPROVES THE PERFORMANCE OF THE SYSTEM
Assume it really takes 5 clock cycles to execute one instruction….
Then it will take 15 clock cycles to execute 3 instructions ==> simplistic case….
Now if I have the pipeline concept…..

1st instruction will take 5 clock cycles (Instruction Fetch on 1st clock, Instruction Decode on 2nd clock, Execute on 3rd clock, mem access on 4th clock, register wb on 5th clock)
2nd instruction will complete immediately after, on the 6th clock cycle
3rd instruction will complete on the 7th clock cycle…
So the 3 instructions finish in 7 clock cycles instead of 15.
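The cycle counts generalise to any number of instructions. A minimal sketch, assuming an ideal pipeline with no stalls or hazards. The function names are hypothetical.

```c
#include <assert.h>

/* Without pipelining every instruction occupies all stages by itself. */
unsigned long cycles_unpipelined(unsigned long instructions,
                                 unsigned long stages)
{
    assert(stages > 0);
    return instructions * stages;
}

/* With an ideal pipeline the first instruction takes `stages` clocks
   and each following instruction completes one clock later. */
unsigned long cycles_pipelined(unsigned long instructions,
                               unsigned long stages)
{
    assert(stages > 0);
    if (instructions == 0) {
        return 0;
    }
    return stages + (instructions - 1);
}
```

For 3 instructions and 5 stages this gives 15 and 7 clock cycles. For a long instruction stream the pipelined count approaches one instruction per clock.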
Typical Performance constraints of Hardware Synthesis Tools
Synthesis Related Constraints
1 clock creation
2 generated clock creation
3 setting of false paths
4 setting of multicycle paths
5 setting of any fixed max or min paths
6 setting input and output delays
7 Design for Testability constraints
Physical Design Related Constraints
8 Placement guidelines
9 Layout guidelines
10 IO guidelines
11 High Fanout Net guidelines
12 Clock Net creation guidelines
Pipeline driven concurrency improves the performance of the system.
Pipeline stages (simplistic case): Instruction Fetch on 1st clock, Instruction Decode on 2nd clock, Execute on 3rd clock, mem access on 4th clock, register wb on 5th clock
RTL
Synthesis and DFT insertion
LEC = Logic Equivalence Checking
Physical Design
Fix cell -> LEC
Fix clock -> LEC
Fix wire -> LEC
DRC -> LEC
TapeOut (GDS)

