Professional Documents
Culture Documents
03 PULP Concepts
03 PULP Concepts
Summary
▪ Part 1 – Introduction to RISC-V ISA
▪ Part 2 – Advanced RISC-V Architectures
▪ Part 3 – PULP concepts
▪ Cores
▪ Clusters
▪ Heterogeneous systems
▪ Accelerators
|
ACACES 2020 - July 2020
Open Parallel Ultra-Low Power Platforms for Extreme Edge AI
|
3
Open Parallel Ultra-Low Power Platforms for Extreme Edge AI
Low-Power MCUs
1pJ/OP=1TOPS/W
InceptionV4 @1fps in 10mW
|
4
Open Parallel Ultra-Low Power Platforms for Extreme Edge AI
40 kGE M/D
70% RF+DP
RISC-V ALU
core
RF
Nice – But what about the GOPS? M7: 5.01 CoreMark/MHz-58.5 µW/MHz
Faster+Superscalar is not efficient! M4: 3.42 CoreMark/MHz-12.26 µW/MHz
|
5
Open Parallel Ultra-Low Power Platforms for Extreme Edge AI
ML is massively parallel
and scales well
(P/S with NN size)
|
6
Open Parallel Ultra-Low Power Platforms for Extreme Edge AI
CLUSTER
|
7
Open Parallel Ultra-Low Power Platforms for Extreme Edge AI
Logarithmic Interconnect
CLUSTER
|
8
High speed single clock logarithmic interconnect
Processors
P1 P2 P3 P4
Routing
Tree
Ultra-low latency → short wires + 1 clock cycle latency
Arbitration
Tree
N N+1 N+2 N+3 N+4 N+5 N+6 N+7
N+8 Memory
B7
B1
B2
B3
B4
B5
B6
B8
Banks
World-level bank interleaving «emulates» multiported mem
|
TCDM Interconnect
▪ Synchronous low latency crossbar based on binary trees
▪ Single or dual channel (Each with 2n master ports, n=[0,1,2 ...]
M
S
▪ Distributed Arbitration (Round robin)
Channel 0
M
S
▪ Combinational handshake (single phase)
TCDM INTERCO
M
▪ Test&Set Supported M
S
Channel 1
▪ Word-level Interleaving M S
M S
▪ Slaves are pure memories.
▪ Bank conflicts can be alleviated → higher Banking Factor (BF=1,2,4)
|
Peripheral Interconnect
Channel 0
M
Peripheral INTERCO
S
▪ Distributed Arbitration (Round robin)
Channel 1
M S
▪ Slaves are pure generic peripherals like bridges , timers etc (Req/grant)
▪ Mostly used to move data in and out from the cluster processors
|
FPU Interconnect (1/2)
From processors
FPU INTERCO
M
▪ 2n master ports ( n=[0,1,2 ...]) MUL
DIV
M S
▪ 2m slave ports ( m=[0,1,2 ...]) → General purpose ADD/SUB
▪ Allocator M
S
MUL
DIV
|
FPU Interconnect A
A
B ADD/SUB PIPE
▪ Features:
B OP
PIPE
OP
Core ID
Core ID
PIPE
▪ Two Operators (A,B), one command (OP) and the ID are Valid Resp
MUL
Core ID RESP
Valid Resp
PIPE
Valid Resp Valid Resp
-200%
-300%
-140%
-200%
Private FPU |
Private FPU
Open Parallel Ultra-Low Power Platforms for Extreme Edge AI
interconnect
L2
DMA Mem Mem Mem Mem Mem
Mem
Logarithmic Interconnect
CLUSTER
|
14
PULP MCHAN DMA Engine
▪ A DMA engine optimized for integration in tightly
coupled processor clusters
▪ Dedicated, per-core non blocking programming channels
▪ Ultra low latency programming (~10 cycles)
▪ Small footprint (30Kgates) : avoid usage of large local FIFOs by
forwarding data directly to TCDM (no store and forward)
▪ Support for multiple outstanding transactions
▪ Parallel RX/TX channels allow achieving full bandwidth for concurrent
load and store operations
▪ Configurable parameters:
▪ # of core channels
▪ Size of command queues
▪ Size of RX/TX buffer
▪ # of outstanding transactions Integration in a PULP cluster |
Open Parallel Ultra-Low Power Platforms for Extreme Edge AI
interconnect
L2
DMA Mem Mem Mem Mem Mem
Mem
Logarithmic Interconnect
CLUSTER
|
16
Synchronization & Events
18 |
Energy-efficient Event Handling
evt_data = evt_read(addr);
▪ Single instruction rules it all:
• event buffer content, •
•
trigger sw event,
▪ address determines action
• program entry point trigger barrier,
• mutex message • try mutex lock, ▪ read datum provides corresponding
• triggering core id • read entry point,
• … • auto buffer clear? information
• …
19 |
Results: Barrier
|
21
How do we work: Initiate a DMA transfer
Ext.
Mem
Tightly Coupled Data Memory
Mem
Mem Mem Mem Mem Mem
Cont
L2
DMA Mem Mem Mem Mem Mem
interconnect
Mem
interconnect
RISC-V
core
Event RISC-V RISC-V RISC-V RISC-V
Unit core core core core
I/O
I$ I$ I$ I$
PULPissimo CLUSTER
|
25.7.2018 22
Data copied from L2 into TCDM
Ext.
Mem
Tightly Coupled Data Memory
Mem
Mem Mem Mem Mem Mem
Cont
L2
DMA Mem Mem Mem Mem Mem
interconnect
Mem
interconnect
RISC-V
core
Event RISC-V RISC-V RISC-V RISC-V
Unit core core core core
I/O
I$ I$ I$ I$
PULPissimo CLUSTER
|
25.7.2018 23
Once data is transferred, event unit notifies cores
Ext.
Mem
Tightly Coupled Data Memory
Mem
Mem Mem Mem Mem Mem
Cont
L2
DMA Mem Mem Mem Mem Mem
interconnect
Mem
interconnect
RISC-V
core
Event RISC-V RISC-V RISC-V RISC-V
Unit core core core core
I/O
I$ I$ I$ I$
PULPissimo CLUSTER
|
25.7.2018 24
Cores can work on the data transferred
Ext.
Mem
Tightly Coupled Data Memory
Mem
Mem Mem Mem Mem Mem
Cont
L2
DMA Mem Mem Mem Mem Mem
interconnect
Mem
interconnect
RISC-V
core
Event RISC-V RISC-V RISC-V RISC-V
Unit core core core core
I/O
I$ I$ I$ I$
PULPissimo CLUSTER
|
25.7.2018 25
Once our work is done, DMA copies data back
Ext.
Mem
Tightly Coupled Data Memory
Mem
Mem Mem Mem Mem Mem
Cont
L2
DMA Mem Mem Mem Mem Mem
interconnect
Mem
interconnect
RISC-V
core
Event RISC-V RISC-V RISC-V RISC-V
Unit core core core core
I/O
I$ I$ I$ I$
PULPissimo CLUSTER
|
25.7.2018 26
During normal operation all of these occur concurrently
Ext.
Mem
Tightly Coupled Data Memory
Mem
Mem Mem Mem Mem Mem
Cont
L2
DMA Mem Mem Mem Mem Mem
interconnect
Mem
interconnect
RISC-V
core
Event RISC-V RISC-V RISC-V RISC-V
Unit core core core core
I/O
I$ I$ I$ I$
PULPissimo CLUSTER
|
25.7.2018 27
Shared instruction cache with private “loop buffer”
Tightly Coupled Data Memory
interconnect
L2
DMA Mem Mem Mem Mem Mem
Mem
Logarithmic Interconnect
I$ I$ I$ I$
CLUSTER
|
28
ULP (NT) Bottleneck: Memory
▪ “Standard” 6T SRAMs: 256x32 6T SRAMS vs. SCM
▪ High VDDMIN 2x-4x
▪ Bottleneck for energy efficiency
▪ >50% of energy can go here!!!
▪ Results
▪ Up to 40% better performance than private I$
▪ Up to 30% better energy efficiency
▪ Up to 20% better energy*area efficiency
|
31
Open Parallel Ultra-Low Power Platforms for Extreme Edge AI
0
1 CORE 1 CORE 2 CORES 4 CORES 8 CORES
|
32
Open Parallel Ultra-Low Power Platforms for Extreme Edge AI
interconnect
L2
DMA Mem Mem Mem Mem Mem
Mem
Logarithmic Interconnect
RISC-V
core
RISC-V RISC-V RISC-V RISC-V
core core core core
I/O
I$ I$ I$ I$
PULPissimo CLUSTER
|
33
uDMA Subsystem
APB UDMA_TX UDMA_RX
µDMA core
cfg_data cfg_data
CONFIG Registers CONFIG Registers
tx_data 2xtx_data
PERIPH TX PROTOCOL DSP
stream
rx_data
PERIPH RX PROTOCOL rx_data
|
Offload pipeline
DOUBLE BUFFERING
TIME
|
27.02.2019 111
PULP interrupts controller (INTC)
▪ It generates interrupt requests from 0 to 31
▪ Mapped to the APB bus
▪ Receives events in a FIFO from the SoC Event Generator (i.e. from
peripherals)
▪ Unique interrupt ID (26) but different event ID
▪ Mask, pending interrupts, acknowledged interrupts, event id registers
▪ Set, Clear, Read and Write operations by means of load and store
instructions (memory mapped operations)
▪ Interrupts come from:
▪ Timers
▪ GPIO (rise, fall events)
▪ HWCE
▪ Events i.e. uDMA |
Open Parallel Ultra-Low Power Platforms for Extreme Edge AI
interconnect
L2
DMA Mem Mem Mem Mem Mem
Mem
Logarithmic Interconnect
RISC-V
core
RISC-V RISC-V RISC-V RISC-V
HWCE
core core core core
I/O
I$ I$ I$ I$
PULPissimo CLUSTER
Acceleration with flexibility: zero-copy HW-SW cooperation |
42
Open Parallel Ultra-Low Power Platforms for Extreme Edge AI
Data
Address plane
Generation
TCDM Shared
Interconnect Memory
|
43
A toy HWPE
externally, uses memory accesses (master ports)
Ctrl
MUX/DEMUX
port 1
port 2
interface
HWPE Slave Module port 0
config
port 3
HWPE Register File
HWCE WRAPPER event
F. Conti and L. Benini, "A ultra-low-energy convolution engine for fast brain-inspired vision in multicore
clusters," Design, Automation & Test in Europe Conference & Exhibition (DATE), 2015, pp. 683-688. |
45
HWCE Sum-of-Products
1.11.2 1.k x_in
Linebuffer: min-bandwidth
1x 16-bit sliding window
Each input channel pixel is pixel
read NoutFeatures times x : sliding window of 5x5 16-bit pixels
W W[0] y_in
5x5 16-bit x[0] W[24] x[24]
1x 16-bit
filters pixel
* * ... *
* *
0.1 0.2
SHIFT
*
+
SHIFT
+
2.12.2 2.k
+
sum pixel is read
(NinFeatures -1)
+ times
foreach (out_feature y): NORM
foreach (in_feature x): SAT SoP
y += conv(W,x)
y_out 1x 16-bit pixel
|
46
Open Parallel Ultra-Low Power Platforms for Extreme Edge AI
interconnect
L2
DMA Mem Mem Mem Mem Mem
Mem
Logarithmic Interconnect
RISC-V
core
RISC-V RISC-V RISC-V RISC-V
HWCE
core core core core
I/O
I$ I$ I$ I$
PULPissimo CLUSTER
Smart Wakeup Module |
47
HD-Based smart Wake-Up Module
Tightly Coupled Data Memory
Ext. Mem
Mem Mem Mem Mem Mem
Mem Cont
interconnect
L2
DMA Mem Mem Mem Mem Mem
Mem
Logarithmic Interconnect
RISC-V
core
RISC-V RISC-V RISC-V RISC-V
core core core core
I/O
I$ I$ I$ I$
PULPissimo CLUSTER
Autonomous
Ext. I/O
ADC Preprocessor HD-Computing
(SPI)
Unit
Always-on Domain
|
48
HD-Based smart Wake-Up Module
Tightly Coupled Data Memory
Ext. Mem
Mem Mem Mem Mem Mem
Mem Cont
interconnect
L2
DMA Mem Mem Mem Mem Mem
Mem
Logarithmic Interconnect
RISC-V
core
RISC-V RISC-V RISC-V RISC-V
core core core core
I/O
I$ I$ I$ I$
PULPissimo CLUSTER
Wake Up
Autonomous
Ext. I/O
ADC Preprocessor HD-Computing
(SPI)
Unit
Always-on Domain
|
49
HD-Based smart Wake-Up Module
|
50
Open Parallel Ultra-Low Power Platforms for Extreme Edge AI
interconnect
RV I interconnect R5 R5 R5 RV
interconnect
O I interconnect
A RV RV RV R5 R5 R5 RV
cluster
A O A RV RV RV
cluster
cluster O cluster
Single Core RV Multi-core R5 Multi-cluster
• PULPino • Fulmine • Hero
• PULPissimo • Mr. Wolf • Open Piton
IOT HPC
Accelerators
HWCE Neurostream HWCrypt PULPO
|
(convolution) (ML) (crypto) (1st ord. opt)
51
Open Parallel Ultra-Low Power Platforms for Extreme Edge AI
|
52
Open Parallel Ultra-Low Power Platforms for Extreme Edge AI
HDL
Simu- Logic
Netlist P&R GDSII FAB Chip
lator Synth.
Sim. Std.
Models Cells
Other
IP
Phy IP Providers
Permissive, Copyright (e.g APACHE) License is Key for industrial adoption
|
53
http://asic.ethz.ch PULP is silicon proven