
Take it through

C / C++ / any other compilation


higher level language

0 profiling of the software code


1 can we calculate the processing time and the deadlines (real time system)
2 can we reduce the processing time by having inline assembly for key functions
3 can we figure out if some function is called many-many times, and we can
build an accelerator out of it…. Or we can create a new instruction in the
processor to do this activity. (profiling and seeking HW functions)
HW a> Create an instruction in the Processor itself.
b> Create an accelerator (push the data, wait till processing is
done, read the processed output back)
4 Take care of loops in the sw to compute….. (foreach, while, loops….)

vga screen size


X 640
Y 480
Total frame size 307200 pixels in one frame (640 x 480)

60 frames per second 18432000 (~18.4M) pixels need to be sent to the projector in one second
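
The arithmetic above can be sketched in 'C' as below. This is only a rough sketch: the function names are made up for illustration, and it ignores blanking intervals, so the time budget per pixel is an upper bound rather than a real VGA pixel clock.

```c
#include <assert.h>
#include <stdint.h>

/* Pixels in one frame for a given resolution. */
uint32_t pixels_per_frame(uint32_t width, uint32_t height) {
    return width * height;
}

/* Pixels that must be delivered every second at a given frame rate. */
uint32_t pixels_per_second(uint32_t width, uint32_t height, uint32_t fps) {
    assert(fps > 0);
    return pixels_per_frame(width, height) * fps;
}

/* Time budget per pixel in nanoseconds (integer part only, blanking ignored). */
uint32_t ns_per_pixel(uint32_t width, uint32_t height, uint32_t fps) {
    return 1000000000u / pixels_per_second(width, height, fps);
}
```

For 640 x 480 at 60 frames per second this gives 307200 pixels per frame, 18432000 pixels per second, and roughly 54 ns per pixel for whatever SW or HW is producing the data.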

Complex Multiplication

Y = (a + ib) (c + id)
Y = (ac - bd) + i (bc + ad)

SW many-many assembly instructions, which will take clock cycles
r1 = a
r2 = b
r3 = c
r4 = d
t1 = r1 * r3 (ac)
t2 = r2 * r4 (bd)
real = t1 - t2

Accelerator inputs a, b, c, d
* * (two multipliers: ac and bd)
- (subtractor: ac - bd)
Real Part

If design is pipelined = initially there is latency of say 6 clock cycles, but then later on, there is sustained throughput…. Every clock cycle, you get one ComplexMultiplication done
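
A small 'C' sketch of the complex multiplication and of the pipelining claim. The type `Complex` and the function names are assumptions made for this sketch; the cycle formulas assume an ideal pipeline that accepts one new input pair every clock.

```c
#include <assert.h>

typedef struct { int re; int im; } Complex;

/* Y = (a + ib)(c + id) = (ac - bd) + i(bc + ad):
   4 multiplies, 1 subtract, 1 add. */
Complex complex_mul(Complex x, Complex y) {
    Complex r;
    r.re = x.re * y.re - x.im * y.im;   /* ac - bd */
    r.im = x.im * y.re + x.re * y.im;   /* bc + ad */
    return r;
}

/* Pipelined unit: pay the latency once, then one result per clock. */
unsigned pipelined_cycles(unsigned latency, unsigned n) {
    return n == 0 ? 0 : latency + (n - 1);
}

/* Non-pipelined unit: every result pays the full latency. */
unsigned unpipelined_cycles(unsigned latency, unsigned n) {
    return latency * n;
}
```

With a latency of 6 clocks, 100 multiplications take 105 clocks when pipelined and 600 clocks when not, which is the "sustained throughput" point.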

cpu ram rom

interconnect bus

DMA
controller complexMulAccelerator

Combination1
T1 MIPS
T2 DSP
T3 FPGA
T4 MIPS
Area (200+120+240), Time (5+20+2)
The middle term of Time is max(T2, T3) because T2 & T3 are executed in parallel

DAG = Directed Acyclic Graph


only fork and join. Not cyclic.
b00 b matrix 0th row 0th column

3x2 matrix 2x3 matrix


col0 col1 col0
row0 a00 a01 row0 b00
row1 a10 a11 row1 b10
row2 a20 a21

option 1 a00 * b00 a01 * b10


1 2 3 4
5clocks mult * * * *

2clocks adder + +

output obtained c00 in 7 clock cycles

Option2
6 multipliers and 3 adders ==> do it 3 times
Reduce area from 18 multipliers to 6 multipliers
Reduce area from 9 adders to 3 adders
Output obtained in 7 clock cycles x 3 times = 21 minimum… if our FSM is good enough….

Option3
2 multipliers and 1 adder ==> do it 9 times

Output obtained in 7 clock cycles x 9 times = 63 minimum… if our FSM is good enough (pipelining is good) to do proper reads and writes to memory locations…
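
The area versus time trade-off of the three HW options can be tabulated with a few lines of 'C'. This is a sketch under the numbers used in these notes (18 multipliers and 9 adders in total, 5 clocks per multiply, 2 clocks per add); the struct and function names are made up, and FSM and memory overheads are ignored.

```c
#include <assert.h>

typedef struct {
    unsigned multipliers;
    unsigned adders;
    unsigned cycles;
} Option;

/* passes = how many times the shared hardware is reused
   (1 for Option1, 3 for Option2, 9 for Option3). */
Option matmul_option(unsigned passes) {
    const unsigned total_mults = 18;        /* 9 outputs * 2 multiplies */
    const unsigned total_adds = 9;          /* 9 outputs * 1 add        */
    const unsigned cycles_per_pass = 5 + 2; /* multiply, then add       */
    Option o;
    assert(passes > 0 && total_adds % passes == 0);
    o.multipliers = total_mults / passes;
    o.adders = total_adds / passes;
    o.cycles = cycles_per_pass * passes;
    return o;
}
```

Calling it with 1, 3 and 9 passes reproduces the three rows: (18, 9, 7), (6, 3, 21) and (2, 1, 63).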

Option4 SW
int main (void) {
    int A[3][2] = { {1,2}, {3,4}, {5,6} };
    int B[2][3] = { {7,8,9}, {10,11,12} };
    int C[3][3];
    int i, j, k;
    for (i = 0; i < 3; i++) {
        for (j = 0; j < 3; j++) {       /* test j here, not i */
            C[i][j] = 0;
            for (k = 0; k < 2; k++) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
    return 0;
}

Option5 What if the inner (k) loop is unrolled….
C[i][j] = A[i][0] * B[0][j] + A[i][1] * B[1][j];

Load to Register
Store to Memory

indirect store to Mem

R0 is a register ==> this can be 8bits or 16bits or 32bits we yet don’t know….
But there are 16 such registers… which can be named as R0, R1, R2….., R15

Can I write an ISS given the instruction set itself in a document ?? Yes !!
I'm trivializing; but these instructions can be in a large case statement in 'C' to create an ISS out of it.

Can I do a sum of 10 numbers using the above CPU…


int total = 0;
for (int i = 10; i != 0; i--) total += i;

Use the Instruction Set and implement the 'C' program of "sum of 10 numbers"
i.e, 1st step : convert the 'C' program into assembly routine for the given InstructionSet.
2nd Step : assembly to executable conversion
3rd Step : run the executable on the ISS
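
A minimal sketch of such an ISS with the "sum of 10 numbers" program already hand-assembled. Several things here are assumptions and not part of the instruction set document: registers are taken as 8 bits wide, the JZ opcode is taken as 0110, the relative value is a signed 8-bit offset from the address of the JZ itself, and the program is kept in a separate program memory. The macro and function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Opcodes 0x0 to 0x5 follow the notes; OP_JZ = 0x6 is an assumption. */
enum { OP_LOAD = 0x0, OP_STORE = 0x1, OP_STORE_IND = 0x2, OP_IMM = 0x3,
       OP_ADD = 0x4, OP_SUB = 0x5, OP_JZ = 0x6 };

typedef struct {
    uint8_t reg[16];   /* R0..R15, assumed 8 bits wide      */
    uint8_t mem[256];  /* data memory, 8-bit direct address */
    uint8_t pc;        /* index into program memory         */
} Cpu;

/* Execute one 16-bit instruction: the "large case statement". */
void iss_step(Cpu *c, uint16_t instr) {
    unsigned op = (instr >> 12) & 0xF;      /* nibble3: opcode   */
    unsigned n2 = (instr >> 8) & 0xF;       /* nibble2: Rn       */
    unsigned n1 = (instr >> 4) & 0xF;       /* nibble1           */
    unsigned n0 = instr & 0xF;              /* nibble0           */
    uint8_t low = (uint8_t)(instr & 0xFF);  /* {nibble1,nibble0} */
    uint8_t next_pc = (uint8_t)(c->pc + 1);

    switch (op) {
    case OP_LOAD:      c->reg[n2] = c->mem[low];            break;
    case OP_STORE:     c->mem[low] = c->reg[n2];            break;
    case OP_STORE_IND: c->mem[c->reg[n1]] = c->reg[n0];     break;
    case OP_IMM:       c->reg[n2] = low;                    break;
    case OP_ADD: c->reg[n1] = (uint8_t)(c->reg[n1] + c->reg[n0]); break;
    case OP_SUB: c->reg[n1] = (uint8_t)(c->reg[n1] - c->reg[n0]); break;
    case OP_JZ:
        if (c->reg[n2] == 0) next_pc = (uint8_t)(c->pc + (int8_t)low);
        break;
    default: break;    /* unknown opcode: no-op in this sketch */
    }
    c->pc = next_pc;
}

/* Hand "assembler" helpers (hypothetical names). */
#define IMM(rn, v)     (uint16_t)((OP_IMM << 12) | ((rn) << 8) | ((v) & 0xFF))
#define STORE(addr, rn) (uint16_t)((OP_STORE << 12) | ((rn) << 8) | ((addr) & 0xFF))
#define ADD(rn, rm)    (uint16_t)((OP_ADD << 12) | ((rn) << 4) | (rm))
#define SUB(rn, rm)    (uint16_t)((OP_SUB << 12) | ((rn) << 4) | (rm))
#define JZ(rn, rel)    (uint16_t)((OP_JZ << 12) | ((rn) << 8) | ((rel) & 0xFF))

/* total = 10 + 9 + ... + 1, result stored at M(0x20) and returned. */
uint8_t run_sum_of_10(void) {
    const uint16_t prog[] = {
        IMM(0, 0),      /* 0: R0 = 0   total                      */
        IMM(1, 10),     /* 1: R1 = 10  i                          */
        IMM(2, 1),      /* 2: R2 = 1   constant one               */
        IMM(3, 0),      /* 3: R3 = 0   always zero                */
        JZ(1, 4),       /* 4: if i == 0 jump to 8                 */
        ADD(0, 1),      /* 5: total += i                          */
        SUB(1, 2),      /* 6: i--                                 */
        JZ(3, -3),      /* 7: R3 is 0, so this always jumps to 4  */
        STORE(0x20, 0)  /* 8: M(0x20) = total                     */
    };
    const unsigned n = sizeof prog / sizeof prog[0];
    Cpu cpu = {0};
    while (cpu.pc < n) iss_step(&cpu, prog[cpu.pc]);
    return cpu.mem[0x20];
}
```

Since there is no unconditional jump in the instruction set, the loop uses JZ on a register that is kept at zero. `run_sum_of_10()` returns 55, the same value the 'C' loop computes.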

Y = ab – 6d + dc
a b c

3 multipliers * *

o/p at 3rd clock


ab cd
1 adder +
o/p at 5th clock
(ab + cd)
1 subtractor
o/p at 7th clock

DMA
BC DE

Source Address  Source Value  Destination Address  Destination Value
0x1000 5A 0x200 5
0x1001 5 0x201 5
0x1002 5 0x202 5 cpu will read from source one value
0x1003 6 0x203 6 cpu will write to the destination one value
0x1004 12 0x204 12 increment source pointer
0x1005 1 0x205 1 increment destination pointer
0x1006 11 0x206 11
0x1007 1 0x207 1
0x1008 0 0x208 0
0x1009 12 0x209 12
0x100A 5 0x20A 5
0x100B 5 0x20B 5
0x100C 12 0x20C 12
0x100D 8 0x20D 8
0x100E 5 0x20E 5
0x100F 1 0x20F 1
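
Without a DMA controller, the CPU itself does the copy: read one value from the source, write it to the destination, increment both pointers, and repeat for the number of bytes. A minimal 'C' sketch of that loop (the function name is made up for illustration):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* CPU-driven copy: one read, one write and two pointer
   increments per value, repeated count times. */
void cpu_copy(const uint8_t *src, uint8_t *dst, size_t count) {
    assert(src != NULL && dst != NULL);
    while (count != 0) {
        uint8_t value = *src;  /* cpu reads one value from the source     */
        *dst = value;          /* cpu writes one value to the destination */
        src++;                 /* increment source pointer                */
        dst++;                 /* increment destination pointer           */
        count--;
    }
}
```

Every value costs the CPU several instructions here. A DMA controller runs the same loop in hardware once the source address, destination address and count are programmed, so the CPU is free for other work.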
while (1) { //forever loop

get high_time to a register say B

high_time_loop :
set gpio ‘1’
decrement B
if B != 0 then
go to high_time_loop

get low_time to a register say C

low_time_loop :
set gpio ‘0’
decrement C
if C != 0 then
go to low_time_loop
}

Waveform: gpio is '1' from clock 0 to clock 3 (25%) and '0' from clock 4 to clock 15 (75%)
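
The two delay loops can be tried out on a PC with a small 'C' model. This is a sketch only: it assumes each loop iteration takes exactly one clock, and the function names and the `wave` buffer are made up for illustration.

```c
#include <assert.h>

/* Writes one PWM period into wave[] (1 = gpio high, 0 = gpio low)
   and returns the number of clocks written. */
unsigned pwm_period(unsigned high_time, unsigned low_time,
                    int *wave, unsigned max_clocks) {
    unsigned t = 0;
    unsigned b = high_time;                  /* register B */
    unsigned c = low_time;                   /* register C */
    while (b != 0 && t < max_clocks) {       /* high_time_loop */
        wave[t++] = 1;
        b--;
    }
    while (c != 0 && t < max_clocks) {       /* low_time_loop */
        wave[t++] = 0;
        c--;
    }
    return t;
}

/* Duty cycle in percent (integer part). */
unsigned duty_percent(unsigned high_time, unsigned low_time) {
    assert(high_time + low_time > 0);
    return 100u * high_time / (high_time + low_time);
}
```

With high_time = 4 and low_time = 12 the period is 16 clocks and the duty cycle is 25%.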

arrival time for a, b, c, d and fixed number 5 can be assumed to be zero.

a b c d

* (3) - (2)
1st mult
1st sub

5 arrival time = 2
* (3)
2nd mult operation, shared hardware

arrival = 3 arrival = 5
+ (2)
critical path

arrival = 7
y
y

Thoughts for Assignments 15 marks

10marks 1a We will give MCQ quiz for say 8 marks or 7marks : basically 15 MCQ
1b Add subjective type Questions also.
2 For another 7 to 8 marks we will give an assignment which needs to be submitted
resource HW a> You have a kit (arduino, beagle, pico, …) --> show two assignments, (i) if I press a button, an led will glow (ii) a UART showing "hello world" on screen..
5marks resource Laptop/PC b> We have the 8085 instruction set.
(i) Create an InstructionSetSimulator (ISS) using 'C' or 'C++', no need for SystemC.
(ii) Write an assembly program to sort 16 numbers. Use your ISS, to ensure correctness of code….
5marks resource Laptop/PC c> Create a domain-specific-architecture for say a EthernetLAN packet…. IEEE 802.3
(i) Student can decide the packet type [(header) (source) (destination) (payloadSize) (payload) (crc)(tail)]
(ii) Parse the packet using a domain-specific-processor to route the packet from source to destination.
(iii) Do some level of time-stamping in the packet [(header) (source) (destination)(TIME_STAMP) (payloadSize) (payload) (crc)(tail)]
resource HW d> Choose any DSP processor of your choice (TI, ADI, MaxLinear, Freescale, ….) ---> Write a FIR filter for that processor.
(i) Filter taps (<<>>), Filter coeffs (<<>>) --> <<goldenInput, and goldenOutput>>…
5marks resource Laptop/PC e> Image Processing example YUV -> RGB conversion. Create a HW Peripheral for the same. [Writing a VHDL, or Verilog] [simulate it and show it working]
Verific, Icarus Verilog (open source), Quartus / ModelSim

3
https://fpgasoftware.intel.com/

Co-simulation & Emulation


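1 Normal simulation --> say on a single core processor… Slow… say to the tune of 100kHz wall clock time per clock cycle of simulation. (server) Speed: 100kHz, Debugability: High (100%)
2 Accelerated simulation --> where gates are parallelized by usage of either a super-scalar boolean processor, or an FPGA.
How parallelization happens
Synopsys ZeBu (1) on FPGA (Field Programmable Gate Array) --> Array of FPGAs…. Like a board which has 9+ FPGAs running simultaneously. Speed: 10MHz, Debugability: Low
Cadence Palladium (2) using a super-scalar boolean processor (Potentially N x M) possibility for adding processors…. Speed: 4MHz, Debugability: High/Medium
Mentor Veloce (3) a mix of FPGA + super-scalar boolean processors… Speed: 8MHz, Debugability: Medium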

boolean processor
boolean processor
boolean processor
Process Path Methodology

Assembly level code and other object files like elf (executable and linkable format)

assembly
instr1 1clock
instr2 2clock
instr3 1clock
instr4 6clocks
call functionN 10clocks

functionN 1clock
instrN1 2clocks
instrN2 1clocks
….
….
return from function

sum all the # clocks taken TOTAL NN clocks

N clocks
frequency of the processor (F)
period of each clock = 1/(F)

need to be sent to the projector in one second….. Nclocks * period = N /(F) = some ms, s, usec, nsec
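
The "sum all the clocks, then multiply by the period" step can be written down as a short 'C' sketch. The function names are made up, and the 100 MHz figure in the note after the code is only an example frequency, not something from the course material.

```c
#include <assert.h>

/* Sum the per-instruction clock counts of a profiled code path. */
unsigned long total_clocks(const unsigned *clocks, unsigned count) {
    unsigned long sum = 0;
    unsigned i;
    for (i = 0; i < count; i++) {
        sum += clocks[i];
    }
    return sum;
}

/* Execution time in ns = N clocks * period = N / F. */
double exec_time_ns(unsigned long n_clocks, double freq_hz) {
    assert(freq_hz > 0.0);
    return (double)n_clocks * 1e9 / freq_hz;
}
```

For the first five rows of the table (1, 2, 1, 6 and 10 clocks) N is 20 clocks; on a hypothetical 100 MHz processor that is 200 ns, which can then be compared against the real time deadline.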

CEO, CFO, CTO --> CXO (same idea as XPU)

XPU
DSP digital signal processor
GPU graphics processing unit
DPU data processing unit
TPU tensor processing unit
IPU Image processing unit
NPU Neural ? Processing unit
SPU Security processor
VPU Vector Processing unit
CPU Central Processing unit
Domain Specific VLIW Processor

Complex multiplication, Imaginary Part (accelerator)
inputs a, b, c, d 2clocks
bc, ad (two multipliers * *) 2clocks
bc+ad (adder +) 1clocks
Imaginary Part 1clocks
Total 6clocks

ay 6 clock cycles, but then later on, there is sustained throughput…. Every clcok cycle, you get one ComplexMultiplication done

mplexMulAccelerator

Combination2 CombinationN

a (200+120+240), Time(5+20+2)
max(T2, T3) because T2 & T3 are executed in parallel

G = Directed Acyclic Graph


y fork and join. Not cyclic.

MIPS itself is a processor… We write SW on this processor

DSP is a special purpose processor… We write SW on this

FPGA is a HW
ASIC is a HW (HW-Accelerator)
FPGA with CPU + FABRIC

CPU == SW
FABRIC == HW

3x3 matrix (result)

B matrix, remaining columns
col1 col2
b01 b02
b11 b12

c00 = a00 * b00 + a01 * b10
For c00 --> 2 multipliers and 1 adder

9 * 2 multipliers Total 18 mults
9 * 1 adders Total 9 adders

5 6 7 8 9 10 11 12 13
* * * * * * * * *

duce area from 18 multipliers to 6 multipliers


duce area from 9 adders to 3 adders
our FSM is good enough….

our FSM is good enough (pipelining is good) to to proper read and writes to memory locations…

p in unrolled….
b[0][j] + a[i][1] * b[1][j];

Can we convert this instruction set to InstructionSetSimulator

Instruction Set Simulator

assembly program --> output at what instruction


ISS 'C' or 'C++' or
what happens
'Python' or 'Perl'
like which register is updated
what happens to memory locations
….

Observations
1. Looks like 16bit instruction because it has 2 bytes.
2. Rn field is 4bits wide. Rn = internal Register… So number of internal registers can be max 16
3. MSB 4bits are opcodes of the Instruction….
4. Registers are R0 to R15 max…
5. M(direct) addressing is available….. What does it mean… (direct) is max 8bits, so memory location
at max can be 256 locations only… :(
6. Rn = M(direct) ??? >> read the memory location whose address is (direct) and copy it to
register Rn (where Rn = R0, or R1, or R2. ,,,,, R15)
7. M(direct) = Rn ??? >> write the contents of the Rn register to the Memory location whose address is (direct).
8. M(Rn) = Rm ??? >> write the contents of the Rm register to the Memory location whose address
is the value of the register Rn
9. Rn = immediate >> Rn now has value "immediate"
10. Rn = Rn + Rm >> add instruction… I = I + J.. Similarly for subtract…
11. PC = PC + Relative (only if Rn is 0) >> Jump if Zero…. to a relative location…

nstructionSet.
ab-d(6+c)

d 6

* 3 clocks d a b 6 c

* +
3 clocks
2 clocks
* reuse multiplier
6d
- 2 clocks

ab + cd - 6d 7 clocks
clock0 clock1 clock2 clock3
multiplier * a*b

adder + 6+c blank


waiting for multiplier to complete
subtractor -

om source one value in a loop for the number of bytes (words)…


o the destination one value

ination pointer

while(1)
clock 2 clock 3 clock 4 clock 5 clock 6 clock 7 clock 8 clock 9 clock 10

1 1 0 0 0 0 0 0 0

75%

ed to be zero. sharing the multiplier

d ab *(3) -(2) c-d max (3 clocks) Level 1

5(c-d) *(3) shared mult (3 clocks) Level 2

ab + 5(c-d) +(2) adder (2 clocks) Level 3

8 clocks

a
5 MUX I1

*
b
MUX I2
(c-d)

select line to select depending on statemachine


either to use (a, b) pair for multiplication
or to use (5, (c-d) ) pair for multiplication

basically 15 MCQ

nt which need to be submitted


…) --> show two assignments, (i) if I press a button, an led will glow (ii) a UART showing "hello world" on screen..

mulator (ISS) using 'C' or 'C++', no need for SystemC.


m to sort 16 numbers. Use your ISS, to ensure correctness of code….
for say a EthernetLAN packet…. IEEE 802.3
cket type [(header) (source) (destination) (payloadSize) (payload) (crc)(tail)]
omain-specific-processor to route the packet from source to destination.
mping in the packet [(header) (source) (destination)(TIME_STAMP) (payloadSize) (payload) (crc)(tail)]
ice (TI, ADI, MaxLinear, Freescale, ….) ---> Write a FIR filter for that processor.
effs (<<>>) --> <<goldenInput, and goldenOutput>>…
B conversion. Create a HW Peripheral for the same. [Writing a VHDL, or Verilog] [simulate it and show it working]
uartus / ModelSim

KNOBS
Speed Debugability
une of 100kHz wall clock time per clock cycle of simulation. (server) 100kHz High (100%)
a super-scalar boolean processor, or an FPGA.
y of FPGAs…. Like a board which has 9+ FPGAs running simultaneously. 10MHz Low
y N x M) possibility for adding processors…. 4MHz High/Medium
FPGA0,0 FPGA0,1 FPGA0,2 8MHz Medium
FPGA1,0
olean processor

FPGA2,2
Average of 10 numbers done completely in HW
n0 n1 n2 n3 n4

1st clock cycle Plus Plus Plus

funciton N 2nd clock cycle PLUS


"S" clocks
if it were done in SW
3rd clock cycle PLUS

function N
"H" clocks if it were 4th clock cycle

done in HW
5th clock cycle

F) = some ms, s, usec, nsec

al processor
ocessing unit

ocessing unit

ocessing unit
lexMultiplication done

te SW on this processor

r… We write SW on this
14 15 16 17 18
* * * * *
at what instruction

register is updated
ens to memory locations

al registers can be max 16

s max 8bits, so memory location

(direct) and copy it to

Memory location whose address is (direct).


emory location whose address
2 clocks

clock4 clock5 clock6 clock7


d*(6+c)

multiplier to complete
ab - d(6+c)
clock 11 clock 12 clock 13 clock 14 clock 15

0 0 0 0 0

(pipelining)

(pipelining)

(pipelining)
Users
multiple
one
multiple
multiple
n5 n6 n7 n8 n9

Plus Plus Plus

PLUS

PLUS

DIV/10
`
What is Glue logic

1 uP working at reset active low, but peripheral working at reset active high….
so glue logic is needed to invert the reset signal before going to either the uP or the peripheral.
2 the uP may have a different bus architecture, peripheral may have a different… So glue logic is needed (bus bridge, or bus gasket, ….) AXI3 to AXI4, AHB to AXI, ….

How the firmware starts executing…

1 Power UP.
2 Clocking should start… Reset is held to the reset state of the uP.
3 Reset release
4 As soon as reset release happens, the uP starts execution, and what it does, --> fetch the first instruction…
5 uP can fetch instruction from ROM (within the SoC chip), or from NAND flash (outside the SoC chip).....
6 Concept called BootUP-Pins…. So the uP based on the BootUP-Pins, … start reading the first instruction from either the ROM or NAND Flash…
7 Then on, the instruction execution starts….

PPAS Goals For any application


P Performance
P Power
A Area
S Schedule

(P)PPAS(S)
(P) People
(S) Security


2ns 6ns 8ns

10ns is ok… then I will design with sing

Start from scratch : Design a Frequency Detector

1 How would this frequency detector be done in sw.
2 If there are components which are analog in nature, pick them first. (sensor, DAQ module, …).
3 If there is a SW -- is there a time limit -- is it a real time problem…
4 Ask yourself a Question -- Is SW good enough for the real time (like say 1ms to detect a frequency).
5 If SW is not good enough…. Look into the SW program and profile it…. Do you see any patterns.
6 Of the few functions that are being called over and over again, or are leading to a non-real-time nature… Identify them….
7 What we identified, try to put it in HW…
8 In HW again, there will be a lot of choices to be made…
9 If it was originally done in SW, is there a way to optimize the same, using better memory and data structures, better handling of interrupts, external IOs...

A good example is DFT (Discrete Fourier Transform)….

1 many butterfly units in parallel or not…
2 many memories in parallel or not (read has one memory) (write has another separate memory), …
3 make the refinements till your application is satisfied…

first SW, and then HW to design a good product?.. vice versa is not good strategy sir?
6:07 PM

1 If product is already there, and it is HW based.. Then you can reduce cost by putting it in SW…
2

FPGA's….

PS PART has the programmable sub-system (processors, memory controllers, usb, serial ports, …..)

PL PART has the programmable logic… (LUT, DSP blocks, memory blocks, pll blocks, ….)

What is a maskable interrupt

interrupt to the processor


interrupt AND GATE

flip flop (register mapped programmable bit) [interrupt mask register]

Timing construct

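A, B --> AND GATE --> Y

intrinsic delay (delay within the gate)
A->Y 1ns
B->Y 1ns (same Y)

if A changes and B was stable, then Y will come out in 1ns… why? A->Y = 1ns
A changes at time 0, and is stable afterwards
B changes at time 0.5ns
Y changes due to A at 1ns, Y changes due to B at 1.5ns
Y will change at what time --> 1.5ns…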

Completion time of A = 1ns


Completion time of B = 1ns
Completion time of C = 1.5ns

What time will join really happen


1.5ns
hi Sir, what are the software we should get a hands-on as part of this course?
5:59 PM

1 'C', 'C++'
2 Verilog HDL / VHDL
3 SystemC

Tools
1 g++ (gnu C / C++ compiler).
2 ModelSim / QuestaSim --> download it from Intel website… (check for license…)
3 SystemC --> many freeware may be available… Don’t know for sure.
4 ModelSim / QuestaSim --> may be supporting mixed Verilog, VHDL and SystemC.
5 People from ASIC world…. VCS (Synopsys), Xcelium (Cadence) all support mixed Verilog, SystemC, VHDL.
6 Xsim (Xilinx) --> download it from the Xilinx website… (may support verilog, vhdl)… Don’t know about SystemC.

Complete Application
Hardware Software
Implementation Implementation

Verilog HDL C
SystemVerilog C++
Python
VHDL
1> Since tapeout is expensive -- we must do simulation and Co-debug.. (Debug of HW and SW) (Also includes Board debug)
2>

nand/nor flash mem


ddr memory

usb DUT
uart

pcie
eth

Simulation On a linux box or server, we simulate our Verilog or VHDL design with SW.
Slower (wall clock of 250kHz approx)
Observability is high
For basic bringup it's good

Emulation Usage of either FPGA, Array of Boolean Processors, Hybrid parallel processors to speed up simulation
Faster (wall clock of say 1MHz to 4MHz approx)
Observability is low
Post basic bringup for running higher level of sw or boot-rom-firmware

KNOBS

SPEED OBSERVABILITY COST

High Good Emulation Simulation Emulation


FPGA

Medium Simulation Emulation FPGA

Low Simulation

Full RTL simulation of Processor each and every node of the processor RTL is simulated

Bus Functional Model (BFM) The processor internals are accelerated (by just modeling and it's not cycle accurate), but external world is cycle accurate. (DONUT)

Cycle Accurate Model If an instruction takes say 3 clock cycles, the model allows 3 clock cycles to elapse, and is accurate to BFM

Instruction Set Simulation of Processor Only Instructions are executed as if like a pure 'C' model of the processor

Concept of LUT [Look Up Table]

Boolean Equation Y = A* B + C * D

4 Input LUT (Look Up Table)


decimal A B C D Y
0 0 0 0 0 0
1 0 0 0 1 0
2 0 0 1 0 0
3 0 0 1 1 1
4 0 1 0 0 0
5 0 1 0 1 0
6 0 1 1 0 0
7 0 1 1 1 1
8 1 0 0 0 0
9 1 0 0 1 0
10 1 0 1 0 0
11 1 0 1 1 1
12 1 1 0 0 1
13 1 1 0 1 1
14 1 1 1 0 1
15 1 1 1 1 1
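
The truth table above is exactly what gets stored in the 16 static ram bits of a 4 input LUT. A small 'C' sketch that builds those 16 bits for Y = A*B + C*D and then reads them back like a 16:1 MUX. The function names are made up, and the bit ordering (A as the most significant select bit, row "decimal" stored at that bit position) follows the table rather than any particular FPGA vendor.

```c
#include <assert.h>
#include <stdint.h>

/* Build the 16 LUT bits for Y = A*B + C*D.
   Bit position = row number = {A,B,C,D} with A as MSB. */
uint16_t build_lut(void) {
    uint16_t init = 0;
    unsigned idx;
    for (idx = 0; idx < 16; idx++) {
        unsigned a = (idx >> 3) & 1u;
        unsigned b = (idx >> 2) & 1u;
        unsigned c = (idx >> 1) & 1u;
        unsigned d = idx & 1u;
        unsigned y = (a & b) | (c & d);
        init = (uint16_t)(init | (y << idx));
    }
    return init;
}

/* 16:1 MUX: the inputs A,B,C,D are the select lines S3..S0. */
unsigned lut_lookup(uint16_t init, unsigned a, unsigned b,
                    unsigned c, unsigned d) {
    unsigned sel = ((a & 1u) << 3) | ((b & 1u) << 2) |
                   ((c & 1u) << 1) | (d & 1u);
    return (init >> sel) & 1u;
}
```

For this equation the stored pattern comes out as 0xF888 (rows 3, 7, 11 and 12 to 15 are '1'). Re-programming the FPGA for another design only means loading a different 16-bit pattern.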

static the values will remain until the FPGA is programmed


Later, you can re-program it for another design.
load to register (from memory)

direct store to memory (from register)

indirect store to memory (from register)

load to register immediate

Addition operation

Subtraction operation

Jump if a register value is zero.


Jump to a location given by
the relative value

Registers R0 to R15
Immediate value 0 to 0xFF (255)
direct value 0 to 0xFF (255)
Memory range 0 to 0xFF (255)
PC range 0 to 0xFF (255)

We have a We want to write a C


instruction Set. Call program for an
this as Processor 1 application
We have a
instruction Set. Call
this as Processor 2

Practice Create an assembly routine to do the following using the trivial instruction set processor :
1 unsigned int x;
2 unsigned int y;
3 unsigned int a, b, c, d, e;
<in assembly, say r0, r1 are x, y respectively>
<in assembly, say r2, r3, r4, r5 are b, c, d, e> And you can choose r10 as a;
4 if (x < y) { a = b + c } else { a = d + e }

Steps 1 Create a flow chart corresponding to the trivial instruction set processor
2 From the flow chart, start mapping blocks to assembly language
3 Using paper / pen & in the mind, work out the correctness of the solution

example1 x = 5, y = 8; b = 4, c = 3, d = 2, e = 1;

example2 x = 8, y = 5; b = 4, c = 3, d = 2, e = 1;

Hints
Gautam Amankumar .
we can keep subtracting 1 from both and the first register that gets to zero will be the smaller one, so we can jump accordingly
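
Before writing the assembly, the hint can be checked with a 'C' model that only uses operations the trivial instruction set has: "subtract 1" and "jump if zero". This is a sketch with made-up function names. Testing y before x is a choice made here so that x == y lands in the else branch.

```c
#include <assert.h>

/* Returns 1 if x < y, using only "is it zero?" tests and decrements. */
int less_than_by_decrement(unsigned x, unsigned y) {
    for (;;) {
        if (y == 0) return 0;  /* y ran out first, or both together: x >= y */
        if (x == 0) return 1;  /* x ran out first: x < y                    */
        x--;
        y--;
    }
}

/* if (x < y) { a = b + c } else { a = d + e } */
unsigned select_sum(unsigned x, unsigned y, unsigned b, unsigned c,
                    unsigned d, unsigned e) {
    return less_than_by_decrement(x, y) ? b + c : d + e;
}
```

For example1 (x = 5, y = 8) this gives a = 7, and for example2 (x = 8, y = 5) it gives a = 3. In assembly, x and y would first be copied to scratch registers, since the loop destroys them.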
pingmem pongmem pingmem pongmem
STEP1 STEP2
(own fsm) (own fsm)

HW 1 Each statemachine is running concurrently


in parallel 2 But data is flowing in a pipeline fashion

3 ping - pong memory concept


`-- When ping memory is in write phase, the pong memory is in read phase
`-- When ping memory is in read phase, the pong memory is in write phase
`-- There will be some point in time, when the ping and pong memory switch their roles.

SW 1 function step1
in series 2 function step2
3 function step3

4 main {
call function 1 t0
call function 2 t1
call function 3 t2
}
Asynchronous

Synchronous
e peripheral.
So glue logic is needed (bus bridge, or bus gasket, ….) AXI3 to AXI4, AHB to AXI, ….

fetch the first instruction…


utside the SoC chip).....
ing the first instruction from either the ROM or NAND Flash…

top
uP ROM

BUS

RAM Peripheral

RAM mem Peripheral

FrontEnd logic State structural


M/C O/P Logic scope of o/p logic from top : top.peripheral.op_logic

If there is no hierarchy, it is difficult to scale the system.


LEVEL 0 add & sub each individually take 2ns If sharing of adder not done
3 add / sub
1 mul

If sharing of adder done


LEVEL 1 multiply takes 4ns 2 add/sub
1 mul

If performance is not critical


1 add/sub
1 mul

LEVEL 2 add takes 2ns Sharing also involves multiplexing


the previous outputs and
maybe storing for later use..

is ok… then I will design with single adder… Requirement for Performance….

them first. (sensor, DAQ module, …).

time (like say 1ms to detect a frequency).


d profile it…. Do you see any patterns.
again, or are leading to a non-real-time nature… Identify them….

same, using better memory and data structures, better handling of interrupts, external Ios...

) (write has another separate memory), …


cost by putting it in SW…

m (processors, memory controllers, usb, serial ports, …..)

T, DSP blocks, memory blocks, pll blocks, ….)

mask register]

s and B was stable, then Y will come out in 1ns… why A->Y = 1ns
B changes at time 0.5ns Y will change at what time --> 1.5ns…

0.5 1 1.5 2
stable stable stable stable
changed
Y chnge due to A
Y chnge due to B

between 0.5 and 1.5 the value of Y is unstable… or there might be glitches…

Completion time of A = 1ns


Completion time of B = 1ns
Completion time of C = 1.5ns

What time will join really happen…..


top

M1 M1 M2
Logi logic of
top level

ds-on as part of this course?

ntel website… (check for license…)


… Don’t know for sure.
mixed Verilog, VHDL and SystemC.
elium (Cadence) all support mixed Verilog, SystemC, VHDL.
ebsite… (may support verilog, vhdl)… Don’t know about SystemC.
f HW and SW) (Also includes Board debug)

TB Simulator slower to run


nand/nor flash mem Emulator not full scale freq, but faster
ddr memory than simulation

spi
i2c

display interface
jesd

log or VHDL design with SW.


processor freq = 1GHz

ssors, Hybrid parallel processors to speed up simulation


processor freq = 1GHz

sw or boot-rom-firmware

CAPACITY (Gate count)

Emulation
Simulation

FPGA

ssor RTL is simulated

rated (by just modeling and it's not cycle accurate), but external world is cycle
cycles, the model allows 3 clock cycles to elapse, and is accurate to BFM

if like a pure 'C' model of the processor

Implementation of the Look Up Table


Assume that you have a 16 : 1 MUX

0 0
0 1
0 2
1 3
0 4
0 5 Y RST Q
0 6 output output of the LUT D
1 7 FF
0 8 CLK
0 9
0 10
1 11
1 12
1 13
1 14
1 15 MUX 16:1 1. Sometimes you want only combinatori
2. Sometimes you want combinatorial log
static
ram
of the LUT S3,S2,S1,S0
A,B,C,D input signals of the LUT
stimulus to the LUT & Design
R0 to R15
Immediate value 0 to 0xFF (255)
direct value 0 to 0xFF (255)
Memory range 0 to 0xFF (255)
0 to 0xFF (255)

For this 'C' application;


we want to map it to a
specific processor.
COMPILER
We want to write a C [Convert frm 'c' to
program for an assembly --
application COMPILER] [then
convert from assembly
to executable (binary)
[LINKER & LOADER]]
1 it could be processor1 (say trivial instruction set)
2 it could be some another processor (say arm)
3 it could be another processor (say 8085)
4 it could be a NEW processor also, and you are trying to figure out
the best instruction set for the NEW processor

the folloiwng using the trivial instruction set processor :

b, c, d, e> And you can choose r10 as a;

rresponding to the trivial instruction set processor


tart mapping blocks to assembly language
the mind, work out the correctness of the solution

, d =2, e = 1;

oth and the first register gets to zero will be smaller so we can jump accordingly
pingmem pongmem

final output
STEP3
(own fsm)

ry is in read phase
y is in write phase
ong memory switch their roles.

Hardware Side Additions


IO's Consoles
SSD Disks Disk & CD's
Graphics Card Consoles
Wireless radio Networking
Virtualization and cloud Networking & Resource Allocation
Sensors Connectivity
Converters (USB 2 UART).. Connectivity
Bluetooth… Connectivity

Security Security in OS
peripheral.op_logic

lt to scale the system.


f adder not done

f adder done

ance is not critical

o involves multiplexing
us outputs and
ring for later use..
2:1 MUX output of the CLB
goes either to IO, or the FPGA interconnect

static ram of the CLB


CLB = Configurable Logic Block OUTPUT

mes you want only combinatorial logic


mes you want combinatorial logic + sequential cell
LOAD instruction : Loads data to an internal Processor Register. M(direct) = Value (Memory whose address is (Direct))

Example of Rn = M(direct)

MOV R0, 0x5
R0 <- M(0x5) = d5

Assembly MOV R0, 0x5
nibble3 nibble2 nibble1 nibble0
Binary 0000 0000 0000 0101

nibble3 OPCODE --> because it helps us to understand
what kind of instruction it is.

nibble2 Rn --> It helps us to figure out which register
is being addressed…
Because it is 4bits wide field --> how many
registers can be addressed….
We can address 16 registers, call it R0 to R15

{nibble1, nibble0} --> Gives the address to the memory location which needs to be accessed.

Memory (M)
Address Data (Value)
0x0 d0
0x1 d1
0x2 d2
0x3 d3
0x4 d4
0x5 d5
0x6 d6
0x7 d7
0x8 d8
0x9 d9
0xA d10
0xB d11
0xC d12

Assembly MOV direct, Rn --> stores data from a Register to Memory (whose address is directly given)
Example MOV 0x8, R15
nibble3 nibble2 nibble1 nibble0
Binary 0001 1111 0000 1000

Assembly MOV @Rn, Rm --> stores data from an internal Register Rm to the Memory (whose address is given by internal Register Rn)
Example MOV @R5, R7
nibble3 nibble2 nibble1 nibble0
Binary 0010 XXXX 0101 0111

Assembly MOV Rn, #immed --> loads the #immediate value to the internal Register Rn
Example MOV R12, 0xB
nibble3 nibble2 nibble1 nibble0
Binary 0011 1100 0000 1011

Assembly ADD Rn, Rm --> Adds the values in the registers Rn and Rm and the result is stored in Rn
Example ADD R9, R10
nibble3 nibble2 nibble1 nibble0
Binary 0100 XXXX 1001 1010

Assembly SUB Rn, Rm --> Subtracts the values in the registers Rn and Rm (Rn - Rm) and result is stored in Rn
Example SUB R9, R10
nibble3 nibble2 nibble1 nibble0
Binary 0101 XXXX 1001 1010

Assembly JZ Rn, relative --> PC (Program Counter) is the pointer to the next instruction to be executed
--> The PC jumps to a new location when a condition happens.
--> What is the condition : that the register Rn is '0'
--> What is the new PC value : PC = current PC + relative value
--> if the condition is not met (Rn is not zero), then PC will continue its normal operation..
--> what is normal operation : moves PC to next instruction.

Memory (M)
Address Instruction (in binary)
0x0 i0
0x1 i1
0x2 i2
0x3 i3
0x4 i4
0x5 i5
0x6 i6
0x7 i7
0x8 i8
0x9 i9
0xA i10
0xB i11
0xC i12
e Allocation
M(direct) = Value (Memory whose address is (Direct))

Memory (M)
Data (Value)
d0
d1
d2
d3
d4
d5
d6
d7
d8
d9
d10
d11
d12

which needs to be accessed.

hose address is directly given)

o the Memory (whose addes is given by internal Register Rn)

l Register Rn

m and the result is stored in Rn

d Rm (Rn - Rm) and result is stored in Rn


next instruction to be executed
ndition happens.

C + relative value
ue its operation..
ext instruction.
1 Embedded Systems, Embedded Software,
2 Microelectronics
3 Automotive Electronics 1
4 Process control Engg. 2
5 V&V (Validation and Verification) 3
6 Virtual Prototyping 4
7 Industrial process automation 5
8 VLSI, GNSS 6
9 Functional Safety. 7
10 IoT, IO controller (chips) 8
11 Telecom
12 STA
13 Space and Defence
14 ESS (Energy Storage Systems) & UPS
15 Train Control
16 Numerical Programming
17 Highspeed HW design
18 WLAN S/W
19 Automotive (Rail)
20 PCB Design for space, defence, telecom, automotive,
21 DRAM verification
22 Healthcare product
23 Firmware Development
24 LabVIEW
25 Battery Management Systems
26 Data Radio & LTE Radios
27 Bluetooth SW Validation
28 SoC Validation
29 E-Loco
30 System integration
31 Physical design of SoC
32 Audio processing DSP Embedded systems
33 ATE (Automatic Test Equipment)
34 Civil Aviation product certification
35 ADAS (Automotive)
36 Radar SW Testing
37 Lighting
38 BootLoader Engineer
39 Automotive Hardware design
40 Software Test Automation
41 Platform Software AutoSar
42 Railway Signalling
43 Wire Harness Design and Installation
44 Industrial control box for HVAC application
45 Linux Kernel development, Networking, Security
45 Aero Engine Control,
46 Off Highway Vehicles (Tractors)
47 Avionics
48 Fuel and CNG Vehicle development
49 UFO and Drones

Switch statement MUX kind of implementation


if-then-else PRIORITY Encoding

C
a = 0xFFFF_FFFF
b = 0x1

e = a + b what happens? e = 0 (the 32-bit result wraps around).. overflow condition !!
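
The same overflow can be seen in 'C' with 32-bit unsigned numbers. This is a small sketch; the function name `add32` and the `carry_out` flag are made up for illustration. In HW the carry would simply be bit 32 of a 33-bit adder output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit add that also reports the carry out of bit 31. */
uint32_t add32(uint32_t a, uint32_t b, int *carry_out) {
    uint32_t sum = a + b;        /* unsigned arithmetic wraps modulo 2^32 */
    if (carry_out != NULL) {
        *carry_out = (sum < a);  /* wrapped around, so a carry was lost   */
    }
    return sum;
}
```

With a = 0xFFFFFFFF and b = 0x1 the sum is 0 and the carry flag is 1. If the carry matters, e has to be 33 bits wide in the HW, or 64 bits wide in the SW.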

clock 1 0 1 0 1 0
rising edge of clock or any signal !! falling edge of clock or any signal !!
POSEDGE NEGEDGE

https://class.ece.uw.edu/371/peckol/doc/Always@.pdf

a
PLUS AS A
COMBO
LOGIC D
b Q
CLK
clk

set of 32 flipflops

For COLUMN 13

DSP1 MIPS

PERIPHERALS
For COLUMN 2

DSP1 MIPS

PERIPHERALS

XPU
DSP digital signal processor
GPU graphics processing unit
DPU data processing unit
TPU tensor processing unit
IPU Image processing unit
NPU Neural ? Processing unit
SPU Security processor
VPU Vector Processing unit
CPU Central Processing unit
Domain Specific VLIW Processor

INSTR- INSTR- INSTR- INSTR-


ALU1 ALU2 ALU3 ALU4
ALU1 ALU2 ALU3 ALU4

FPGA Architecture

PS = Programmable subsystem
1> set of Application Processors,
2> optional Real time processors, optional DSPs, optional GPUs, optional Hardware Accelerators for say H.261 or codec or similar.
3> set of peripherals -- [High speed -- pcie, GBEthernet, usb, DDR controller, DMA controllers …] [Lower Speed -- uart, spi, flash interface, usb, i2c, gpio's ….]
4> Interface to talk with the PL
PL = Programmable Logic subsystem
1> CLBs
2> clock block, memory blocks, dsp blocks, some HW Accelerators (ethernet, pcie, …), serdes,
3> basic IOs
4> Interface to talk with the PS

complexity with FPGAs.


1 both SW and HW to work…..
2 SW team reads code from top to bottom… That is the view….
3 HW team reads code from left to right… (waveform, concurrency)….
4 SW & HW team need to sit together to debug… Need for knowledge of the domain we are working….

Instruction Level Parallelism Can happen if the "XPU" has multiple ALU pipelines
example 1 Parallelism is possible as the two instructions are not dependent on each other
z=x+y
a=b+c

example 2 Parallelism is not possible because second instruction is dependent on the first
z=x+y
a=z+c
Instructions which are good to have to make it a bit more complete ISA
1 load indirect Rn = M(Rm)
2 jump unconditional JMP <relative> ; where label is a relative path PC = PC + relative
3 jump if not-zero JNZ Rn, <relative> ; jump to label if Rn is not zero. PC = PC + relative if Rn not-equal-to zero.

For Addition of 10 numbers
Method 1 Use unrolling of the loop
Method 2 Use the JZ when the JNZ is not present…like you start from i = 10, and do i-- until you reach i = 0…
Method 3 Say you will add a new Instruction which is Rm = M(Rn) then the load part becomes simpler…

Data Memory (addresses 0 to 255) a0, a1, a2, a3, a4, a5, a6, a7, a8, a9 stored from address 0 onwards
Program Memory (addresses 0 to 255) Instr0, Instr1, … stored from address 0 onwards
Boards
Raspberry Pi
Terasic FPGA Development Kits
MicroSemi or Xilinx or Altera Dev Kits
STM 32 Discovry Kit
Beagle Bone black
Teensy ?
pico uC board
switch priority encoding

A0
A1
4:1 MUX 2:2MUX 2:1MUX
A2
A3
O0
S0 S1

How
you'll
define
Define Scope
A A A
B B A.B
C A.B.C
D A.B.C.D
D

BEHAVIORAL CODE
// port lists below are reconstructed from the signals used in the module bodies
module alu ( input [31:0] a, input [31:0] b, input [31:0] c, output [31:0] d
);
wire [31:0] e;

assign e = a + b;

assign d = e - c;

endmodule

STRUCTURAL CODE

module plus ( input [31:0] i1, input [31:0] i2, output [31:0] o1
);

assign o1 = i1 + i2;

endmodule

module minus ( input [31:0] i1, input [31:0] i2, output [31:0] o1
);

assign o1 = i1 - i2;

endmodule

module alu ( input [31:0] a, input [31:0] b, input [31:0] c, output [31:0] d
);

wire [31:0] e;

// instantiate the module plus here
plus plus_i ( .i1 (a), .i2 (b), .o1 (e) );
// instantiate the module minus here
minus minus_i ( .i1 (e), .i2 (c), .o1 (d) );

endmodule

[12:17 PM] BHAGWAT RAJESH MARUTI .

I found this playlist very useful for SystemC ::: covers both design nd

Learn SystemC (1) - Introduct…
e of clock or any signal !!

set of 32 flip flops one behind the other but it will be numbered
31 down to 0

Assembly level code and other object files, like ELF (Executable and Linkable Format)

Profiling of the software code
1 Can we calculate the processing time and the deadlines (real time system)?
2 Can we reduce the processing time by having inline assembly for key functions?
3 Can we figure out if some function is called many-many times, so that we can build an accelerator out of it, or create a new instruction in the processor to do this activity? (profiling and seeking HW functions)
  HW a> Create an instruction in the Processor itself.
     b> Create an accelerator (push the data, wait till it is done, read the processed output back)
  Take care of loops in the SW to compute (foreach, while, loops….)

Clock count per assembly instruction
instr1            1 clock
instr2            2 clocks
instr3            1 clock
instr4            6 clocks
call functionN    10 clocks
functionN         1 clock
  instrN1         2 clocks
  instrN2         1 clock
  ….
  ….
  return from function

Sum all the clocks taken: TOTAL NN clocks
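
A minimal sketch of the "sum all the clocks" step, using only the instructions that are listed with a clock count above. The elided instructions ("….") are not counted, so the result is a lower bound, not the real NN.

```python
# (instruction, clocks) pairs copied from the profile above
profile = [
    ("instr1", 1),
    ("instr2", 2),
    ("instr3", 1),
    ("instr4", 6),
    ("call functionN", 10),
    ("functionN", 1),
    ("instrN1", 2),
    ("instrN2", 1),
]

total_clocks = sum(clocks for _, clocks in profile)
print(total_clocks)

# Share of the total spent on the call itself, a hint for what to optimise first
call_share = 10 / total_clocks
print(round(call_share, 2))
```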


How many ways can you do each task
T1 1
T2 2
T3 3
T4 2
Total combinations = 1 x 2 x 3 x 2 = 12
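
The count of 12 is the product of the per-task options, since each task's choice is independent of the others. A one-line check:

```python
import math

# number of ways each task can be implemented, from the table above
ways = {"T1": 1, "T2": 2, "T3": 3, "T4": 2}

combinations = math.prod(ways.values())
print(combinations)
```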

Task  Component  Incremental time taken  Total time taken (ms)
T1    MIPS       5ms                     5ms
T2    DSP1       20ms                    25ms
T3    DSP1       18ms                    43ms
T4    DSP1       5ms                     48ms

Task  Component  Incremental time taken  Total time taken (ms)
T1    MIPS       5ms                     5ms
T2    DSP1       20ms
T3    DSP2       18ms                    25ms (T2 and T3 run in parallel: add the worst performance of the two)
T4    MIPS       2ms                     27ms

SoC with DSP2 (blocks shown: DSP2, ROM, RAM, PERIPHERALS)
Area = 440
Time taken to complete the task graph = 27ms

SoC without DSP2 (blocks shown: ROM, RAM, PERIPHERALS)
Area = 320
Time taken to complete the task graph = 48ms
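
The two totals (48ms and 27ms) can be reproduced with a small scheduling sketch. It assumes the task graph is T1, then T2 and T3 (fork), then T4 (join), which matches the totals in both tables; the function name makespan and the data layout are made up for this illustration.

```python
def makespan(tasks, deps):
    """tasks: (name, component, duration_ms) in dependency order.
    A task starts when its predecessors are done AND its component is free."""
    finish = {}
    free_at = {}
    for name, component, duration in tasks:
        ready = max((finish[d] for d in deps.get(name, [])), default=0)
        start = max(ready, free_at.get(component, 0))
        finish[name] = start + duration
        free_at[component] = finish[name]
    return max(finish.values())


deps = {"T2": ["T1"], "T3": ["T1"], "T4": ["T2", "T3"]}  # assumed fork/join graph

one_dsp = [("T1", "MIPS", 5), ("T2", "DSP1", 20), ("T3", "DSP1", 18), ("T4", "DSP1", 5)]
two_dsp = [("T1", "MIPS", 5), ("T2", "DSP1", 20), ("T3", "DSP2", 18), ("T4", "MIPS", 2)]

print(makespan(one_dsp, deps))  # T2 and T3 share DSP1, so they serialise
print(makespan(two_dsp, deps))  # T2 and T3 overlap on DSP1 and DSP2
```

The second mapping is faster because T2 and T3 overlap, at the cost of the extra area for DSP2 (440 versus 320).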

128-bit instruction = say 32 bits per instruction * 4 ALUs,
increasing the parallelism in the computation

Ps, optional GPUs, optional Hardware Accelerators for say H.261 or codec or similar.
BEthernet, usb, DDR controller, DMA controllers …] [Lower Speed -- uart, spi, flash interface, usb, i2c, gpio's ….]

me HW Accelerators (ethernet, pcie, …), serdes,

at is the view….
eform, concurrency)….
… Need for knowledge of the domain we are working….

101376 pixels (~101K pixels), ~304K bytes, assuming R, G, B are 8 bits each
A JPG file uses lossy compression, YET we do not see any visible differences; at least visually both images look the same.

HW block inputs: R, G, B
VGA input = 640 x 480, this is the frame size, 30 frames per second
307200 ~307K pixels per frame

For each pixel's R, G, B value, one Y output needs 2 additions, 3 mul, 3 div

assume addition = 1 clock cycle
mul = 9 clock cycles
div = 9 clock cycles
So, to get one Y sample output (from R, G, B to Y out) we need minimum 2 x 1 + 3 x 9 + 3 x 9 = 56 clocks, and 3 x 56 = 168 clocks per pixel for Y, Cb and Cr.

What should be my CPU frequency, so that I can do this VGA input RGB to YCbCr computation within the 30 frames per second?
307200 x 168 x 30 = 1548288000 clocks per second
~1.55GHz only to do the RGB to YCbCr computation using SW….

We did not even consider the Memory requirements…
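
The frequency estimate above, written out as a short calculation. The clock costs are the ones assumed in the notes (add = 1, mul = 9, div = 9); the variable names are illustrative.

```python
ADD_CLOCKS, MUL_CLOCKS, DIV_CLOCKS = 1, 9, 9   # assumed cost per operation

# one output sample (e.g. Y): 2 additions, 3 multiplications, 3 divisions
clocks_per_sample = 2 * ADD_CLOCKS + 3 * MUL_CLOCKS + 3 * DIV_CLOCKS
clocks_per_pixel = 3 * clocks_per_sample        # Y, Cb and Cr

pixels_per_frame = 640 * 480
frames_per_second = 30

required_hz = pixels_per_frame * clocks_per_pixel * frames_per_second
print(clocks_per_sample, clocks_per_pixel, required_hz)
print(round(required_hz / 1e9, 2), "GHz")
```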

Coefficients            INPUTS R, G, B
C00, C01, C02
C10, C11, C12
C20, C21, C22

Datapath for Y:
  R * C00      G * C01      B * C02
  (R*C00) + (G*C01)
  ((R*C00) + (G*C01)) + (B*C02)
OUTPUTS Y

Option 1 All the Y, Cb, Cr multipliers and adders are in parallel

Option 2 Do it one-by-one: first Y, then Cb, then Cr… this will reduce the area
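
A sketch of the Y datapath and of the latency of the two options. The coefficient values are an assumption (the common BT.601 luma weights), since the notes only name them C00, C01, C02; the clock costs reuse mul = 9 and add = 1 from the notes.

```python
# Assumed coefficient row for Y (BT.601 luma weights); the notes only say C00, C01, C02
C00, C01, C02 = 0.299, 0.587, 0.114


def luma(r, g, b):
    """Y = ((R*C00) + (G*C01)) + (B*C02), same order as the datapath."""
    p0, p1, p2 = r * C00, g * C01, b * C02   # three multipliers side by side
    return (p0 + p1) + p2                    # two adders in series


MUL_CLOCKS, ADD_CLOCKS = 9, 1
one_output = MUL_CLOCKS + 2 * ADD_CLOCKS     # parallel multiply, then two adds

option1_clocks = one_output                  # Y, Cb, Cr hardware all in parallel
option2_clocks = 3 * one_output              # one shared datapath used three times

print(luma(255, 255, 255))
print(option1_clocks, option2_clocks)
```

Option 1 needs three copies of the datapath (9 multipliers, 6 adders); Option 2 needs one copy but takes three passes.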

          columns
          0    1    2    3    4
line 0    0    1    2    3    4
line 1    11   12   13   14   15
line 2    22   23   24   25   26
line 3    33   34   35   36   37
line 4    44   45   46   47   48
line 5    55   56   57   58   59
line 6    66   67   68   69   70
line 7    77   78   79   80   81
lines 8 to 20 (not filled in)

DCT
HW When we need to do it in real time…

SW When we can do it in offline mode…



Average of 10 numbers done completely in HW
inputs             n0 n1 n2 n3 n4
1st clock cycle    Plus Plus Plus
2nd clock cycle    PLUS
3rd clock cycle    PLUS
4th clock cycle

function N: "S" clocks if it were done in SW
function N: "H" clocks if it were done in HW
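
A sketch of why the sum of 10 numbers fits in 4 clock cycles of adders. It assumes one level of pairwise additions per clock and that an odd leftover value is simply passed to the next level; the final divide by 10 is not counted. The function name adder_tree is illustrative.

```python
def adder_tree(values):
    """Pairwise-add the values level by level; return (sum, number of levels)."""
    level = list(values)
    cycles = 0
    while len(level) > 1:
        # add neighbours in pairs: (0,1), (2,3), ...
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:           # odd one out passes through unchanged
            nxt.append(level[-1])
        level = nxt
        cycles += 1
    return level[0], cycles


numbers = list(range(1, 11))         # stand-ins for n0..n9
total, cycles = adder_tree(numbers)
print(total, cycles, total / len(numbers))
```

The level sizes go 10, 5, 3, 2, 1, so the HW needs 4 adder levels, while a SW loop needs at least 9 sequential additions plus loads and loop overhead.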

HW block outputs: Y, Cb, Cr


in HW
3 Multipliers (side by side)   9 clocks
1 Adder                        1 clock
1 Adder                        1 clock
Total                          11 clocks

Option 1 (all in parallel): 11 clocks
Option 2 (first Y, then Cb, then Cr): this will reduce the area (of the HW), but it will take more time (3x of one computation): 33 clocks

8 x 8 block for DCT purposes
          columns
          5    6    7    8    9    10
line 0    5    6    7    8    9    10
line 1    16   17   18   19   20   21
line 2    27   28   29   30   31   32
line 3    38   39   40   41   42   43
line 4    49   50   51   52   53   54
line 5    60   61   62   63   64   65
line 6    71   72   73   74   75   76
line 7    82   83   84   85   86   87

IN RAM MEMORY
Address     Data
0 to 10     First line
11 to 21    Second line
22, 23, 24, …
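
A sketch of how a pixel's (line, column) maps to its RAM address when the image is stored line after line, and which addresses an 8 x 8 DCT block touches. The width of 11 pixels per line is read off the numbering above (line 1 starts at 11); the function names are illustrative.

```python
WIDTH = 11  # pixels per line in the example grid (columns 0..10)


def address(line, column, width=WIDTH):
    """RAM address of a pixel when lines are stored back to back."""
    return line * width + column


def block_addresses(top, left, size=8, width=WIDTH):
    """Addresses of a size x size block whose top-left pixel is (top, left)."""
    return [[address(top + r, left + c, width) for c in range(size)]
            for r in range(size)]


print(address(1, 0), address(7, 4), address(7, 10))
for row in block_addresses(0, 0)[:2]:   # first two rows of the 8 x 8 block
    print(row)
```

Each row of the block is contiguous in RAM, but moving to the next row jumps by the full line width, so the block read is 8 short bursts rather than one long one.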

Bus structure to store and load data from memory
Address, data, clocking, control signals and sideband control signals

Blocks:
Processor which can do memory, control tasks
MEMORIES
Accelerator 0, Accelerator 1, Accelerator 2, Accelerator 3
inputs (continued)   n5 n6 n7 n8 n9
1st clock cycle      Plus Plus Plus
2nd clock cycle      PLUS
3rd clock cycle      PLUS
PIPELINE DRIVEN CONCURRENCY : HOW IT IMPROVES THE PERFORMANCE OF THE SYSTEM
Assume it really takes 5 clock cycles to execute one instruction….
then it will take 15 clock cycles to execute 3 instructions ==> simplistic case….

Now if I have the pipeline concept…..

1st instruction will take 5 clock cycles (Instruction Fetch on 1st clock, Instruction Decode on 2nd clock, Execute on 3rd clock, mem access on 4th clock, register wb on 5th clock)
2nd instruction will complete right after it, on the 6th clock cycle
3rd instruction will complete on the 7th clock cycle… so 3 instructions take 7 clock cycles instead of 15
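
The pipeline arithmetic above as a small sketch. It assumes an ideal 5-stage pipeline with no stalls or hazards; the function names are illustrative.

```python
def clocks_without_pipeline(n_instructions, stages=5):
    """Each instruction runs start to finish before the next one begins."""
    return n_instructions * stages


def clocks_with_pipeline(n_instructions, stages=5):
    """First instruction fills the pipeline, then one completes per clock."""
    if n_instructions == 0:
        return 0
    return stages + (n_instructions - 1)


for n in (1, 3, 100):
    print(n, clocks_without_pipeline(n), clocks_with_pipeline(n))
```

For long instruction streams the speed-up approaches the number of stages (5x here), because the 4-clock fill latency is paid only once.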
Typical Performance constraints of Hardware Synthesis Tools

Synthesis Related Constraints
1 clock creation
2 generated clock creation
3 setting of false paths
4 setting of multicycle paths
5 setting of any fixed max or min paths
6 setting input and output delays
7 Design for Testability constraints

Physical Design Related Constraints
8 Placement guidelines
9 Layout guidelines
10 IO guidelines
11 High Fanout Net guidelines
12 Clock Net creation guidelines
RTL

Synthesis and DFT insertion

LEC = Logic Equivalence Checking

Physical Design
Fix cell   -> LEC
Fix clock  -> LEC
Fix wire   -> LEC
DRC        -> LEC

TapeOut (GDS)
