Lecture Notes Workings
60 frames per second: 640 x 480 x 60 = 18432000, so ~18.4M pixels need to be sent to the projector in one second
Complex Multiplication
Y = (a + ib) (c + id)
Y = (ac - bd) + i (bc + ad)
SW: many-many assembly instructions, which will take clock cycles
r1=a
r2=b Accelerator a b c
r3=c
r4=d
* *
t1 = r1 * r3 (ac)
t2 = r2 * r4 (bd)
Real Part = t1 - t2
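The software view of the complex multiplication above can be sketched in C as follows. This is a minimal illustration, assuming 32-bit integer operands; the type name complex_i32 and the function name complex_mul are made up for this sketch and do not come from the notes.

```c
#include <stdint.h>

/* A complex number with integer real and imaginary parts (assumed 32-bit). */
typedef struct {
    int32_t re;
    int32_t im;
} complex_i32;

/* Y = (a + ib)(c + id) = (ac - bd) + i(bc + ad):
 * four multiplications, one subtraction and one addition. */
complex_i32 complex_mul(complex_i32 x, complex_i32 y)
{
    int32_t a = x.re, b = x.im, c = y.re, d = y.im;
    int32_t t1 = a * c;              /* r1 * r3 */
    int32_t t2 = b * d;              /* r2 * r4 */
    int32_t t3 = b * c;
    int32_t t4 = a * d;
    complex_i32 out = { t1 - t2, t3 + t4 };   /* real part, imaginary part */
    return out;
}
```

On a processor each of these C statements expands into several load, multiply and store instructions, which is why the SW version takes many clock cycles compared with a dedicated accelerator.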
If the design is pipelined: initially there is a latency of say 6 clock cycles, but later on there is sustained throughput…. Every clock cycle, you get one ComplexMultiplication done
interconnect bus
DMA
controller complexMulAccelerator
Combination1
T1 MIPS
T2 DSP
T3 FPGA
T4 MIPS
Area (200+120+240), Time (5+20+2)
2clocks adder + +
Option2
6 multipliers and 3 adders ==> do it 3 times
Reduces area from 18 multipliers to 6, and from 9 adders to 3
Output obtained in 7 clock cycles x 3 times = 21 minimum… if our FSM is good enough….
Option3
2 multipliers and 1 adder ==> do it 9 times
Output obtained in 7 clock cycles x 9 times = 63 minimum… if our FSM is good enough (pipelining is good) to do proper reads and writes to memory locations…
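The area/time trade-off across these options can be tabulated with a small C helper. This is only a sketch: it assumes that every pass takes 7 clock cycles, that the fully parallel design needs 18 multipliers and 9 adders (as the notes state), and that the number of passes divides 9. The names hw_option and matmul_option are hypothetical.

```c
/* Resources and minimum time when the 3x3 result is computed in `passes` passes. */
typedef struct {
    unsigned multipliers;
    unsigned adders;
    unsigned clocks;
} hw_option;

hw_option matmul_option(unsigned passes)
{
    const unsigned total_mults = 18;      /* 9 outputs x 2 multiplications each */
    const unsigned total_adds = 9;        /* 9 outputs x 1 addition each */
    const unsigned clocks_per_pass = 7;   /* assumed constant per pass */
    hw_option o = {
        total_mults / passes,
        total_adds / passes,
        clocks_per_pass * passes
    };
    return o;
}
```

Calling matmul_option(3) gives 6 multipliers, 3 adders and 21 clocks; matmul_option(9) gives 2 multipliers, 1 adder and 63 clocks.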
Option4 SW
int main(void) {
    int A[3][2] = { {1,2}, {3,4}, {5,6} };
    int B[2][3] = { {7,8,9}, {10,11,12} };
    int C[3][3];
    int i, j, k;
    for (i = 0; i < 3; i++) {            /* rows of A */
        for (j = 0; j < 3; j++) {        /* columns of B */
            C[i][j] = 0;
            for (k = 0; k < 2; k++) {    /* inner product over k */
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
    return 0;
}
Option5 What if the inner (k) loop is unrolled….
    C[i][j] = A[i][0] * B[0][j] + A[i][1] * B[1][j];
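The unrolled variant mentioned above could look like the following sketch. The function name matmul_3x2_2x3_unrolled is made up for illustration; the matrix sizes are the ones used in the notes.

```c
/* C = A (3x2) * B (2x3), with the inner k loop unrolled by hand.
 * Each output element needs 2 multiplications and 1 addition. */
void matmul_3x2_2x3_unrolled(int A[3][2], int B[2][3], int C[3][3])
{
    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < 3; j++) {
            C[i][j] = A[i][0] * B[0][j] + A[i][1] * B[1][j];
        }
    }
}
```

With the inner loop gone there is no loop counter to update or test per product, and the two multiplications of one output element are visibly independent, which is what a HW implementation exploits.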
Load to Register
Store to Memory
R0 is a register ==> this can be 8 bits or 16 bits or 32 bits, we don't know yet….
But there are 16 such registers… which can be named R0, R1, R2, ….., R15
Can I write an ISS given the instruction set itself in a document ?? Yes !!
I'm trivializing; but these instructions can be in a large case statement in 'C' to create an ISS out of it.
Use the Instruction Set and implement the 'C' program of "sum of 10 numbers"
i.e, 1st step : convert the 'C' program into assembly routine for the given InstructionSet.
2nd Step : assembly to executable conversion
3rd Step : run the executable on the ISS
Y = ab – 6d + dc
a b c
3 multipliers * *
DMA
BC DE
a b c d
* (3) - (2)
1st mult
1st sub
5 arrival time = 2
* (3)
2nd mult operation, shared hardware
arrival = 3 arrival = 5
+ (2)
critical path
arrival = 7
y
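The arrival times in the scheduling example above can be reproduced with a few lines of C. This sketch assumes the expression is evaluated in the factored form Y = ab + d(c - 6), with a multiplier latency of 3 clocks and an adder/subtractor latency of 2 clocks, as annotated in the notes. The function name schedule_finish_time and the shared_multiplier flag are hypothetical.

```c
enum { MUL_LATENCY = 3, ADDSUB_LATENCY = 2 };

static unsigned max_u(unsigned a, unsigned b) { return a > b ? a : b; }

/* Finish time (in clocks) of Y = a*b + d*(c - 6).
 * shared_multiplier = 0: two multipliers work in parallel.
 * shared_multiplier = 1: one multiplier is reused for both products. */
unsigned schedule_finish_time(int shared_multiplier)
{
    unsigned sub_done  = 0 + ADDSUB_LATENCY;   /* (c - 6) arrives at 2 */
    unsigned mul1_done = 0 + MUL_LATENCY;      /* a*b arrives at 3 */
    unsigned mul2_start = sub_done;            /* needs (c - 6) first */
    if (shared_multiplier) {
        /* the single multiplier is busy with a*b until clock 3 */
        mul2_start = max_u(mul2_start, mul1_done);
    }
    unsigned mul2_done = mul2_start + MUL_LATENCY;
    /* the final adder waits for the later of the two products */
    return max_u(mul1_done, mul2_done) + ADDSUB_LATENCY;
}
```

Under these assumptions the critical path runs through the subtractor, the second multiplication and the final adder: 7 clocks with two multipliers, 8 clocks when the multiplier is shared.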
y
10marks 1a We will give MCQ quiz for say 8 marks or 7marks : basically 15 MCQ
1b Add subjective type Questions also.
2 For another 7 to 8 marks we will give an assignment which need to be submitte
resource HW a> You have a kit (arduino, beagle, pico, …) --> show two assignments
5marks resource Laptop/PC b> We have the 8085 instruction set.
(i) Create an InstructionSetSimulator (ISS) using 'C' or 'C++
(ii) Write an assembly program to sort 16 numbers. Use yo
5marks resource Laptop/PC c> Create a domain-specific-architecture for say a EthernetLAN packe
(i) Student can decide the packet type [(header) (source) (
(ii) Parse the packet using a domain-specific-processor to r
(iii) Do some level of time-stamping in the packet [(header
resource HW d> Choose any DSP processor of your choice (TI, ADI, MaxLinear, Free
(i) Filter taps (<<>>), Filter coeffs (<<>>) --> <<goldenInput
5marks resource Laptop/PC e> Image Processing example YUV -> RGB conversion. Create a HW Pe
Verific, Icarus Verilog (open source), Quartus / ModelSim
3
https://fpgasoftware.intel.com/
boolean processor
boolean processor
boolean processor
Process Path Methodology
N clocks
frequency of the processor (F)
period of each clock = 1/(F)
Time taken = Nclocks * period = N / F = some s, ms, usec, or nsec
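The two calculations used here (time taken for N clocks, and pixels per second) can be written as small C helpers. The function names exec_time_seconds and pixel_rate are illustrative only.

```c
#include <stdint.h>

/* Time for n_clocks cycles at frequency freq_hz: N * (1 / F) = N / F seconds. */
double exec_time_seconds(uint64_t n_clocks, double freq_hz)
{
    return (double)n_clocks / freq_hz;
}

/* Pixels that must be delivered per second for a given frame size and rate. */
uint64_t pixel_rate(uint32_t width, uint32_t height, uint32_t fps)
{
    return (uint64_t)width * height * fps;
}
```

For example, pixel_rate(640, 480, 60) gives 18432000 pixels per second, and 1000 clocks at 1 MHz take 1 ms.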
XPU (the X is a placeholder, as in CXO: CEO, CFO, CTO)
DSP digital signal processor
GPU graphics processing unit
DPU data processing unit
TPU tensor processing unit
IPU image processing unit
NPU neural processing unit
SPU security processor
VPU vector processing unit
CPU central processing unit
Domain-specific VLIW processor
Accelerator (imaginary part): inputs a, b, c, d 2 clocks
bc, ad (* *) 2 clocks
bc + ad (+) 1 clock
Total 6 clocks
Combination2 CombinationN
max(T2, T3) because T2 & T3 are executed in parallel
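The cost of one HW/SW mapping of the task graph can be evaluated as in the sketch below. It assumes the graph shape implied by the notes: T1 runs first, T2 and T3 run in parallel on different resources, and T4 runs last. The function names are hypothetical, and the individual T2/T3 times used in the example call are assumptions (the notes give only the larger of the two, 20).

```c
static unsigned max_u(unsigned a, unsigned b) { return a > b ? a : b; }

/* Completion time: serial T1, then T2 || T3, then serial T4. */
unsigned task_graph_time(unsigned t1, unsigned t2, unsigned t3, unsigned t4)
{
    return t1 + max_u(t2, t3) + t4;
}

/* Area is the sum of the areas of all resources instantiated. */
unsigned total_area(const unsigned *areas, unsigned count)
{
    unsigned sum = 0;
    for (unsigned i = 0; i < count; i++) {
        sum += areas[i];
    }
    return sum;
}
```

With t1 = 5, max(t2, t3) = 20 and t4 = 2, task_graph_time returns 27; the areas 200, 120 and 240 sum to 560. Repeating this for every combination gives the table from which the best area/time trade-off is picked.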
FPGA is a HW
ASIC is a HW (HW-Accelerator)
FPGA with CPU + FABRIC
CPU == SW
FABRIC == HW
3x3 matrix
col1 col2
b01 b02 c00 = a00 * b00 + a01 * b10
b11 b12 For c00 --> 2 multipliers and 1 adder
5 6 7 8 9 10 11 12 13
* * * * * * * * *
Observations
1. Looks like a 16-bit instruction because it has 2 bytes.
2. The Rn field is 4 bits wide. Rn = internal register, so there can be at most 16 internal registers.
3. The 4 MSBs are the opcode of the instruction.
4. Registers are R0 to R15 at most.
5. M(direct) addressing is available. What does it mean? (direct) is at most 8 bits, so memory can have at most 256 locations… :(
6. Rn = M(direct) >> read the memory location whose address is (direct) and copy it to register Rn (where Rn = R0, R1, R2, …, R15)
7. M(direct) = Rn >> write the contents of the Rn register to the memory location whose address is (direct)
8. M(Rn) = Rm >> write the contents of the Rm register to the memory location whose address is the value of the register Rn
9. Rn = immediate >> Rn now has the value "immediate"
10. Rn = Rn + Rm >> add instruction… I = I + J. Similarly for subtract…
11. PC = PC + relative (only if Rn is 0) >> jump if zero, to a relative location…
ab-d(6+c)
d 6
* 3 clocks d a b 6 c
* +
3 clocks
2 clocks
* reuse multiplier
6d
- 2 clocks
ab + cd - 6d 7 clocks
clock0 clock1 clock2 clock3
multiplier * a*b
ination pointer
while(1)
clock 2 clock 3 clock 4 clock 5 clock 6 clock 7 clock 8 clock 9 clock 10
1 1 0 0 0 0 0 0 0
75%
8 clocks
a
5 MUX I1
*
b
MUX I2
(c-d)
KNOBS
Speed Debugability
une of 100kHz wall clock time per clock cycle of simulation. (server) 100kHz High (100%)
a super-scalar boolean processor, or an FPGA.
y of FPGAs…. Like a board which has 9+ FPGAs running simultaneously. 10MHz Low
y N x M) possibility for adding processors…. 4MHz High/Medium
FPGA0,0 FPGA0,1 FPGA0,2 8MHz Medium
FPGA1,0
FPGA2,2
Average of 10 numbers done completely in HW
n0 n1 n2 n3 n4
function N
"H" clocks if it were 4th clock cycle
done in HW
5th clock cycle
te SW on this processor
r… We write SW on this
14 15 16 17 18
* * * * *
at what instruction
register is updated
ens to memory locations
multiplier to complete
ab - d(6+c)
clock 11 clock 12 clock 13 clock 14 clock 15
0 0 0 0 0
(pipelining)
(pipelining)
(pipelining)
Users
multiple
one
multiple
multiple
n5 n6 n7 n8 n9
PLUS
PLUS
DIV/10
`
What is glue logic?
1 The uP works with an active-low reset, but the peripheral works with an active-high reset…. so glue logic is needed to invert the reset signal before it goes to either the uP or the peripheral.
2 The uP may have one bus architecture and the peripheral may have a different one… so glue logic is needed (bus bridge, or bus gasket, ….): AXI3 to AXI4, AHB to AXI, ….
(P)PPAS(S)
(P) People
(S) Security
₹
2ns 6ns 8ns
first SW, and then HW to design a good product?.. vice versa is not good strategy sir?
1 If product is already there, and it is HW based.. Then you can reduce cost by putting it in SW…
2
has the programmable logic… (LUT, DSP blocks, memory blocks, pll
PL PART
Timing construct
intrinsic delay 0
delay within the gate A changed
B
A->Y 1ns Y
B->Y 1ns Y (same Y)
1 'C', 'C++'
2 Verilog HDL / VHDL
3 SystemC
Tools
1 g++ (gnu C / C++ compiler).
2 ModelSim / QuestaSim --> download it from Intel website… (check for license…
3 SystemC --> many freeware may be available… Don’t know for sure.
4 ModelSim / QuestaSim --> may be supporting mixed Verilog, VHDL and SystemC
5 People from ASIC world…. VCS (Synopsys), Xcelium (Cadence) all support mixed
6 Xsim (Xilinx) --> download it from the Xilinx website… (may support verilog, vhd
Complete Application
Hardware Software
Implementation Implementation
Verilog HDL C
SystemVerilog C++
Python
VHDL
1> Since tapeout is expensive -- we must do simulation and Co-debug.. (Debug of HW and SW) (Also includes Boar
2>
usb DUT
uart
pcie
eth
Simulation On a linux box or server, we simulate our Verilog or VHDL design with SW.
Slower (wall clock of 250kHz approx)
Observability is high
For basic bringup it's good
Emulation Usage of either FPGA, Array of Boolean Processors, Hybrid parallel processors t
Faster (wall clock of say 1MHz to 4MHz approx)
Observability is low
Post basic bringup for running higher level of sw or boot-rom-firmware
KNOBS
Low Simulation
Full RTL simulation of Processor: each and every node of the processor RTL is simulated
Bus Functional Model (BFM): the processor internals are accelerated (by just modeling; it's not cycle accurate), but the external world is cycle accurate. (DONUT)
Cycle Accurate Model: if an instruction takes say 3 clock cycles, the model allows 3 clock cycles to elapse, and is accurate to BFM
Instruction Set Simulation of Processor: only instructions are executed, as if like a pure 'C' model of the proc
Boolean Equation Y = A* B + C * D
Addition operation
Subtraction operation
Practice Create an assembly routine to do the following using the trivial instr
1 unsigned int x;
2 unsigned int y;
3 unsigned int a, b, c, d, e;
<in assembly, say r0, r1 are x, y respectively>
<in assembly, say r2, r3, r4, r5 are b, c, d, e> And you can choose r10
4 if (x < y) { a = b + c } else { a = d + e }
Hints
Gautam Amankumar .
We can keep subtracting 1 from both registers; the one that reaches zero first holds the smaller value, so we can jump accordingly.
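The hint can be prototyped in C before writing the assembly. The sketch below assumes 8-bit unsigned registers and uses only operations the trivial instruction set offers (subtract and test for zero); the function names is_less_by_decrement and select_sum are made up for illustration.

```c
#include <stdint.h>

/* Returns 1 if x < y, using only "subtract 1" and "is zero" tests,
 * which map onto SUB and JZ. */
int is_less_by_decrement(uint8_t x, uint8_t y)
{
    while (x != 0 && y != 0) {
        x--;                      /* SUB Rx, Rone */
        y--;                      /* SUB Ry, Rone */
    }
    /* x reached zero first (and y did not): x was the smaller value */
    return x == 0 && y != 0;
}

/* if (x < y) { a = b + c } else { a = d + e } */
uint8_t select_sum(uint8_t x, uint8_t y,
                   uint8_t b, uint8_t c, uint8_t d, uint8_t e)
{
    return is_less_by_decrement(x, y) ? (uint8_t)(b + c) : (uint8_t)(d + e);
}
```

Note that when both values reach zero together they were equal, so the else branch is taken, which matches the C condition x < y. The loop costs min(x, y) iterations, so this is slow for large values.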
pingmem pongmem pingmem pongmem
STEP1 STEP2
(own fsm) (own fsm)
SW 1 function step1
in series 2 function step2
3 function step3
4 main {
call function 1 t0
call function 2 t1
call function 3 t2
}
Asynchronous
Synchronous
top
uP ROM
BUS
RAM Peripheral
is ok… then I will design with single adder… Requirement for Performance….
same, using better memory and data structures, better handling of interrupts, external Ios...
mask register]
s and B was stable, then Y will come out in 1ns… why A->Y = 1ns
B changes at time 0.5ns Y will change at what time --> 1.5ns…
0.5 1 1.5 2
stable stable stable stable
changed
Y changes due to A
Y changes due to B
Between 0.5 and 1.5 the value of Y is unstable… or there might be glitches…
M1 M1 M2
Logi logic of
top level
spi
i2c
display interface
jesd
Emulation
Simulation
FPGA
LUT as a 16:1 MUX (the 16 stored bits are held in static RAM)
Select lines of the LUT: S3,S2,S1,S0 = A,B,C,D (input signals of the LUT)
S3..S0 (index)  Y (output of the LUT)
0   0
1   0
2   0
3   1
4   0
5   0
6   0
7   1
8   0
9   0
10  0
11  1
12  1
13  1
14  1
15  1
The LUT output Y goes to the D input of a flip-flop (FF with CLK, RST and output Q)
1. Sometimes you want only combinatori
2. Sometimes you want combinatorial log
stimulus to the LUT & Design
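The LUT contents listed above correspond to the Boolean equation Y = A*B + C*D with the index formed as {A,B,C,D}. A C sketch that builds and reads such a LUT is shown below; the function names build_lut and lut_read, and storing the 16 bits in one uint16_t, are choices made for this illustration.

```c
#include <stdint.h>

/* Fill a 16-entry LUT for Y = A*B + C*D; bit i of the result is the
 * stored value for index i, where index = {A,B,C,D} = S3,S2,S1,S0. */
uint16_t build_lut(void)
{
    uint16_t lut = 0;
    for (unsigned idx = 0; idx < 16; idx++) {
        unsigned a = (idx >> 3) & 1u;
        unsigned b = (idx >> 2) & 1u;
        unsigned c = (idx >> 1) & 1u;
        unsigned d = idx & 1u;
        if ((a & b) | (c & d)) {
            lut |= (uint16_t)(1u << idx);
        }
    }
    return lut;
}

/* The 16:1 MUX: the four inputs select one of the 16 stored bits. */
unsigned lut_read(uint16_t lut, unsigned a, unsigned b, unsigned c, unsigned d)
{
    unsigned idx = (a << 3) | (b << 2) | (c << 1) | d;
    return (lut >> idx) & 1u;
}
```

The stored pattern has ones at indices 3, 7, 11 and 12 to 15, i.e. the value 0xF888. Reprogramming the FPGA amounts to loading a different 16-bit pattern into the same structure.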
Registers R0 to R15
Immediate value 0 to 0xFF (255)
direct value 0 to 0xFF (255)
Memory range 0 to 0xFF (255)
PC range 0 to 0xFF (255)
, d =2, e = 1;
pingmem pongmem
final output
STEP3
(own fsm)
ry is in read phase
y is in write phase
ong memory switch their roles.
Security Security in OS
peripheral.op_logic
f adder done
o involves multiplexing
us outputs and
ring for later use..
2:1 MUX output of the CLB
goes either to IO, or the FPGA interconnect
Example of Rn = M(direct)
{nibble1, nibble0} --> Gives the address to the memory location which needs to be accessed.
Assembly MOV direct, Rn --> stores data from a Register to Memory (whose address is directly given)
Example MOV 0x8, R15
nibble3 nibble2 nibble1 nibble0
Binary 0001 1111 0000 1000
Assembly MOV @Rn, Rm --> stores data from an internal Register Rm to the Memory (whose address is given by the value of Rn)
Example MOV @R5, R7
nibble3 nibble2 nibble1 nibble0
Binary 0010 XXXX 0101 0111
Assembly MOV Rn, #immed --> loads the #immediate value to the internal Register Rn
Example MOV R12, 0xB
nibble3 nibble2 nibble1 nibble0
Binary 0011 1100 0000 1011
Assembly ADD Rn, Rm --> Adds the values in the registers Rn and Rm and the result is stored in Rn
Example ADD R9, R10
nibble3 nibble2 nibble1 nibble0
Binary 0100 XXXX 1001 1010
Assembly SUB Rn, Rm --> Subtracts the values in the registers Rn and Rm (Rn - Rm) and the result is stored in Rn
Example SUB R9, R10
nibble3 nibble2 nibble1 nibble0
Binary 0101 XXXX 1001 1010
Assembly JZ Rn, relative --> PC (Program Counter) is the pointer to the next instruction to be executed
--> The PC jumps to a new location when a condition happens.
--> What is condition : that the register Rn is '0'
--> What is the new PC value : PC = current PC + relative value
--> if the condition is not met (Rn is not zero), then the PC continues its normal operation.
--> what is normal operation : moves PC to next instruction.
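The "large case statement" idea for an ISS can be sketched in C as below. Several details are assumptions that the notes do not fix: registers are taken to be 8 bits wide, opcode 0000 is used for Rn = M(direct), opcode 0110 is used for JZ with Rn in nibble2 and a signed 8-bit relative offset in the low byte, and the program is held in a separate array. Opcodes 0001 to 0101 follow the encodings shown above. The names iss_state, iss_step and iss_run are hypothetical.

```c
#include <stdint.h>

#define NUM_REGS 16
#define MEM_SIZE 256

typedef struct {
    uint8_t reg[NUM_REGS];   /* R0..R15 (8-bit width is an assumption) */
    uint8_t mem[MEM_SIZE];   /* data memory, addressed by the 8-bit (direct) */
    uint8_t pc;              /* index of the next instruction to execute */
} iss_state;

/* Decode and execute one 16-bit instruction, then update the PC. */
void iss_step(iss_state *s, uint16_t instr)
{
    unsigned opcode = (instr >> 12) & 0xF;   /* nibble3 */
    unsigned n2 = (instr >> 8) & 0xF;        /* nibble2 */
    unsigned n1 = (instr >> 4) & 0xF;        /* nibble1 */
    unsigned n0 = instr & 0xF;               /* nibble0 */
    unsigned low8 = instr & 0xFF;            /* {nibble1, nibble0} */
    uint8_t next_pc = (uint8_t)(s->pc + 1);

    switch (opcode) {
    case 0x0: s->reg[n2] = s->mem[low8]; break;          /* Rn = M(direct), assumed opcode */
    case 0x1: s->mem[low8] = s->reg[n2]; break;          /* MOV direct, Rn */
    case 0x2: s->mem[s->reg[n1]] = s->reg[n0]; break;    /* MOV @Rn, Rm */
    case 0x3: s->reg[n2] = (uint8_t)low8; break;         /* MOV Rn, #immed */
    case 0x4: s->reg[n1] = (uint8_t)(s->reg[n1] + s->reg[n0]); break;  /* ADD Rn, Rm */
    case 0x5: s->reg[n1] = (uint8_t)(s->reg[n1] - s->reg[n0]); break;  /* SUB Rn, Rm */
    case 0x6:                                            /* JZ Rn, relative (assumed encoding) */
        if (s->reg[n2] == 0) {
            next_pc = (uint8_t)(s->pc + (int8_t)low8);   /* PC = current PC + relative */
        }
        break;
    default:
        break;                                           /* unknown opcode: treated as a no-op */
    }
    s->pc = next_pc;
}

/* Run until the PC leaves the program or max_steps instructions have executed. */
void iss_run(iss_state *s, const uint16_t *program, unsigned length, unsigned max_steps)
{
    while (s->pc < length && max_steps-- > 0) {
        iss_step(s, program[s->pc]);
    }
}
```

As a usage example, the program { 0x3105, 0x3203, 0x4012, 0x1110 } loads 5 into R1 and 3 into R2, adds them, and stores the result 8 at memory address 0x10. The max_steps guard is there so that a program with a backward jump cannot hang the simulator.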
Memory (M): instructions
Address  Instruction (in binary)
0x0  i0
0x1  i1
0x2  i2
0x3  i3
0x4  i4
0x5  i5
0x6  i6
0x7  i7
0x8  i8
0x9  i9
0xA  i10
0xB  i11
0xC  i12
Memory (M): data
M(direct) = Value (memory whose address is (direct))
Address  Data (value)
0x0  d0
0x1  d1
0x2  d2
0x3  d3
0x4  d4
0x5  d5
0x6  d6
0x7  d7
0x8  d8
0x9  d9
0xA  d10
0xB  d11
0xC  d12
1 Embedded Systems, Embedded Software,
2 Microelectronics
3 Automotive Electronics
4 Process control Engg.
5 V&V (Validation and Verification)
6 Virtual Prototyping
7 Industrial process automation
8 VLSI, GNSS
9 Functional Safety.
10 IoT, IO controller (chips)
11 Telecom
12 STA
13 Space and Defence
14 ESS (Energy Storage Systems) & UPS
15 Train Control
16 Numerical Programming
17 Highspeed HW design
18 WLAN S/W
19 Automotive (Rail)
20 PCB Design for space, defence, telecom, automotive,
21 DRAM verification
22 Healthcare product
23 Firmware Development
24 LabVIEW
25 Battery Management Systems
26 Data Radio & LTE Radios
27 Bluetooth SW Validation
28 SoC Validation
29 E-Loco
30 System integration
31 Physical design of SoC
32 Audio processing DSP Embedded systems
33 ATE (Automatic Test Equipment)
34 Civil Aviation product certification
35 ADAS (Automotive)
36 Radar SW Testing
37 Lighting
38 BootLoader Engineer
39 Automotive Hardware design
40 Software Test Automation
41 Platform Software AutoSar
42 Railway Signalling
43 Wire Harness Design and Installation
44 Industrial control box for HVAC application
45 Linux Kernel development, Networking, Security
45 Aero Engine Control,
46 Off Highway Vehicles (Tractors)
47 Avionics
48 Fuel and CNG Vehicle development
49 UFO and Drones
C
a = 0xFFFF_FFFF
b = 0x1
clock 1 0 1 0 1 0
rising edge of clock or any signal !! falling edge of clock or any signal !!
POSEDGE NEGEDGE
https://class.ece.uw.edu/371/peckol/doc/Always@.pdf
a
PLUS AS A
COMBO
LOGIC D
b Q
CLK
clk
set of 32 flipflops
Take it through
C / C++ / any other compilation
higher level language
DSP1 MIPS
PERIPHERALS
For COLUMN 2
DSP1 MIPS
PERIPHERALS
XPU
DSP digital signal processor
GPU graphics processing unit
DPU data processing unit
TPU tensor processing unit
IPU Image processing unit
NPU Neural processing unit
SPU Security processor
VPU Vector Processing unit
CPU Central Processing unit
Domain-specific VLIW Processor
FPGA Architecture
PS = Processing System
1> set of Application Processors,
2> optional real-time processors, optional DSPs, optional GPUs, optional Hardware Accelerators for say H.261 or codec or similar.
3> set of peripherals -- [High speed -- pcie, GBEthernet, usb, DDR controller, DMA controllers …] [Lower speed -- uart, spi, flash interface, usb, i2c, gpio's ….]
4> Interface to talk with the PL
PL = Programmable Logic subsystem
1> CLBs
2> clock block, memory blocks, dsp blocks, some HW Accelerators (ethernet, pc
3> basic IOs
4> Interface to talk with the PS
Instruction Level Parallelism: can happen if the "XPU" has multiple ALU pipelines
example 1 Parallelism is possible as the two instructions are not dependent on each other
z=x+y
a=b+c
example 2 Parallelism is not possible because the second instruction is dependent on the first
z=x+y
a=z+c
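The dependency test behind these two examples can be written as a small C check. This is a sketch with hypothetical names (instr3, can_issue_in_parallel): each instruction is modelled as dst = src1 + src2 with single-character variable names. The notes only discuss the read-after-write case; the write-after-write and write-after-read checks are added here as an assumption about what a simple dual-issue machine would also have to rule out.

```c
/* A three-operand instruction: dst = src1 + src2. */
typedef struct {
    char dst;
    char src1;
    char src2;
} instr3;

/* Returns 1 if `second` can execute in the same cycle as `first`. */
int can_issue_in_parallel(instr3 first, instr3 second)
{
    /* read-after-write: second reads what first produces */
    int raw = (second.src1 == first.dst) || (second.src2 == first.dst);
    /* write-after-write: both write the same destination */
    int waw = (second.dst == first.dst);
    /* write-after-read: second overwrites an input of first */
    int war = (second.dst == first.src1) || (second.dst == first.src2);
    return !(raw || waw || war);
}
```

For example 1 (z=x+y, a=b+c) the check returns 1; for example 2 (z=x+y, a=z+c) it returns 0 because the second instruction reads z.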
Instructions which are good to have to make it a bit more complete ISA
1 load indirect Rn = M(Rm)
2 jump unconditional JMP <relative> ; where label is a relative path PC = PC + relative
3 jump if not-zero JNZ Rn, <relative> ; jump to label if Rn is not zero. PC = PC + relative
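For Addition of 10 numbers
Method 1 Use unrolling of the loop
Method 2 Use the JZ when the JNZ is not present… like you start from i = 10, and do i-- until you reach i = 0…
Method 3 Say you will add a new instruction which is Rm = M(Rn); then the load part becomes simpler…
Data Memory        Program Memory
Address  Data      Address  Instruction
0        a0        0        Instr0
1        a1        1        Instr1
2        a2        …        …
3        a3
4        a4
5        a5
6        a6
7        a7
8        a8
9        a9
…        …         …        …
255                255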
Boards
Raspberry Pi
Terasic FPGA Development Kits
MicroSemi or Xilinx or Altera Dev Kits
STM32 Discovery Kit
Beagle Bone black
Teensy ?
pico uC board
switch priority encoding
A0
A1
4:1 MUX 2:2MUX 2:1MUX
A2
A3
O0
S0 S1
How
you'll
define
Define Scope
A A A
B B A.B
C A.B.C
D A.B.C.D
D
BEHAVIORAL CODE
module alu (
    input  [31:0] a,
    input  [31:0] b,
    input  [31:0] c,
    output [31:0] d
);
    wire [31:0] e;
    assign e = a + b;   // intermediate sum
    assign d = e - c;   // d = (a + b) - c
endmodule
STRUCTURAL CODE
assign o1 = i1 + i2;
endmodule
assign o1 = i1 - i2;
endmodule
endmodule
I found this playlist very useful for SystemC ::: covers both design nd
Learn SystemC (1) - Introduct…
set of 32 flip flops one behind the other but it will be numbered
31 down to 0
ssing time and the deadlines (real time system) functionN 1clock
e by having inline assembly for key functions to reduce processing time
unction is called many-many times, and we can instrN1 2clocks
r we can create a new instruction in the instrN2 1clocks
iling and seeking HW functions) ….
an instruction in the Processor itself. ….
a accelerator (push the data, wait till
is done, read the processed output back) return from function
mpute….. (foreach, while, loops….)
SoC
Area = 440
DSP2 ROM RAM
Time taken to complete the task graph = 27ms
PERIPHERALS
SoC
at is the view….
eform, concurrency)….
… Need for knowledge of the domain we are working….
101376 pixels (~101K pixels) ==> ~304K bytes, assuming R, G, B are 8 bits each
A JPG file uses lossy compression…. YET we do not see any visible differences… at least visually, both images look the same.
HW
R
G
B
VGA input = 640 x 480 this is frame size, 30 frames per second…
307200 ~307K pixels
Coefficients INPUTS R, G, B
C00, C01, C02 R G B
C10, C11, C12
C20, C21, C22 R * C00 G * C01 B * C02
OUTPUTS Y
Option 2 Do it one-by-one: first Y, then Cb, then Cr… this will reduce the area (of the HW), but it will take more time… (3x of one computation)
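The 3x3 coefficient computation can be sketched in C as a software reference model. The notes do not give the values of C00..C22; the coefficients below are the commonly used BT.601 full-range (JFIF) values and are an assumption, as are the names ycbcr_t, clamp_u8 and rgb_to_ycbcr.

```c
#include <stdint.h>

typedef struct {
    uint8_t y;
    uint8_t cb;
    uint8_t cr;
} ycbcr_t;

/* Round to nearest and saturate to the 0..255 range. */
static uint8_t clamp_u8(double v)
{
    if (v < 0.0) return 0;
    if (v > 255.0) return 255;
    return (uint8_t)(v + 0.5);
}

/* Each output row needs 3 multiplications and 2 additions
 * (plus a constant offset for Cb and Cr). */
ycbcr_t rgb_to_ycbcr(uint8_t r, uint8_t g, uint8_t b)
{
    /* C[row][col]: assumed BT.601 full-range coefficients */
    static const double C[3][3] = {
        {  0.299,     0.587,     0.114    },   /* Y  */
        { -0.168736, -0.331264,  0.5      },   /* Cb */
        {  0.5,      -0.418688, -0.081312 },   /* Cr */
    };
    static const double offset[3] = { 0.0, 128.0, 128.0 };
    double out[3];
    for (int row = 0; row < 3; row++) {
        out[row] = r * C[row][0] + g * C[row][1] + b * C[row][2] + offset[row];
    }
    ycbcr_t result = { clamp_u8(out[0]), clamp_u8(out[1]), clamp_u8(out[2]) };
    return result;
}
```

The loop over rows mirrors Option 2: the same three multipliers and two adders are reused for Y, then Cb, then Cr. A fully parallel design would instantiate the row computation three times. A HW implementation would normally use fixed-point coefficients instead of doubles.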
Frame layout (pixel index by line and column; 11 pixels per line)
line  col0 col1 col2 col3 col4
0     0    1    2    3    4
1     11   12   13   14   15
2     22   23   24   25   26
3     33   34   35   36   37
4     44   45   46   47   48
5     55   56   57   58   59
6     66   67   68   69   70
7     77   78   79   80   81
8 to 20 (not filled in)
DCT
HW When we need to do it in real time…
add worst performance of T2 or T3….
Y
Cb
Cr
R, G, B to Y out) minimum, 56
168
that I can do this VGA input RGB to YCbCr computation
mputation using SW….
y requirements…
in HW
3 multipliers 9 clocks
1 adder 1 clock
1 adder 1 clock
Total 11 clocks
33 clocks
8 x 8 block for DCT purposes
line  col5 col6 col7 col8 col9 col10
0     5    6    7    8    9    10
1     16   17   18   19   20   21
2     27   28   29   30   31   32
3     38   39   40   41   42   43
4     49   50   51   52   53   54
5     60   61   62   63   64   65
6     71   72   73   74   75   76
7     82   83   84   85   86   87
IN RAM MEMORY (Address, Data): addresses 0 to 24 are shown; the first line of the frame is stored first, followed by the second line
Accelerator 0 Accelerator 1
Accelerator 2 Accelerator 3
PIPELINE DRIVEN CONCURRENCY : HOW IT IMPROVES TH
Assume it really takes 5 clock cycles to execute one instruction (instruction fetch on 1st clock, instruction decode on 2nd clock, execute on 3rd clock, memory access on 4th clock, register write-back on 5th clock)….
Then it will take 15 clock cycles to execute 3 instructions ==> simplistic (non-pipelined) case….
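The cycle counts with and without pipelining can be compared with the sketch below. It assumes an ideal pipeline with no stalls, hazards or branch penalties; the function names are illustrative.

```c
/* Without pipelining, each instruction occupies all stages before the next starts. */
unsigned cycles_unpipelined(unsigned n_instr, unsigned stages)
{
    return n_instr * stages;
}

/* With an ideal pipeline, the first instruction takes `stages` clocks,
 * and every following instruction completes one clock later. */
unsigned cycles_pipelined(unsigned n_instr, unsigned stages)
{
    if (n_instr == 0) {
        return 0;
    }
    return stages + (n_instr - 1);
}
```

For 3 instructions on a 5-stage machine this gives 15 clocks without pipelining and 7 clocks with it; for long instruction streams the pipelined throughput approaches one instruction per clock.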
RTL
Physical Design
Fix cell LEC
Fix clock LEC
Fix wire LEC
DRC LEC
TapeOut (GDS)