Using RISC-V in High Computing, Ultra-Low Power, Programmable Circuits For Inference On Battery Operated Edge Devices
Battery operated sensors
• 24-bit @ 50kHz = 1.2 Mbit/s: Vibration analysis, Fault detection
• 8-bit, 160x120 @ 10 fps = 4.6 Mbit/s: Face detection, Presence detection, Counting, Emotion detection
B/day to kB/day
30 June 2017 RISC-V Foundation 2
What this talk is about?
Market Demand
Rich sensor data (battery operated sensors)
• Linear PCM = 1.4 Mbit/s
• 24-bit @ 50kHz = 1.2 Mbit/s
• 8-bit, 160x120 @ 10 fps = 4.6 Mbit/s
CNN, SVM, Bayesian, Boosting, Cepstral analysis
The IoT pipe: NB-IoT, LTE-M, Sigfox, LoRa, etc.
• B/day to kB/day
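The data rates quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and two inputs are assumptions that the slides do not state. It assumes the 1.4 Mbit/s linear PCM figure corresponds to 16-bit stereo audio at 44.1 kHz, and that the 4.6 Mbit/s image figure corresponds to three 8-bit channels per pixel (a single 8-bit channel would give about 1.5 Mbit/s).

```python
SECONDS_PER_DAY = 86_400


def raw_bit_rate(bits_per_sample, samples_per_second, channels=1):
    """Raw (uncompressed) sensor data rate in bit/s."""
    return bits_per_sample * samples_per_second * channels


def bytes_per_day(bit_rate):
    """Bytes produced in one day by a continuous stream at bit_rate bit/s."""
    return bit_rate * SECONDS_PER_DAY // 8


# 24-bit samples at 50 kHz, as quoted on the slide.
vibration = raw_bit_rate(24, 50_000)

# 160x120 pixels at 10 fps; three 8-bit channels is an ASSUMPTION.
pixels_per_second = 160 * 120 * 10
image = raw_bit_rate(8, pixels_per_second, channels=3)

# 16-bit stereo at 44.1 kHz is an ASSUMPTION for the linear PCM figure.
audio = raw_bit_rate(16, 44_100, channels=2)

for name, rate in [("vibration", vibration), ("image", image), ("audio", audio)]:
    print(f"{name}: {rate / 1e6:.2f} Mbit/s, {bytes_per_day(rate) / 1e9:.2f} GB/day")
```

Set against an uplink budget of B/day to kB/day, each of these streams produces gigabytes per day, which is why the data has to be reduced to a few descriptors or decisions on the device itself.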
Issue: way more MIPS than an MCU can deliver, but needs to be within an MCU power envelope?
General Patterns for content understanding
• Extract descriptors from raw data
• 2D: Corners, blobs, HOG, DOG, …
• 1D: LPC coefficients, Cepstral coeffs, …
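As an illustration of what "extracting a descriptor" means for 2D data, here is a minimal sketch of a gradient-orientation histogram, the building block behind HOG. It is a simplified stand-in, not the full HOG pipeline (no cells, blocks, or normalization), and the function name and parameters are hypothetical.

```python
import math


def orientation_histogram(patch, bins=8):
    """Histogram of unsigned gradient orientations (0 to 180 degrees),
    weighted by gradient magnitude, over the interior pixels of a 2D patch."""
    hist = [0.0] * bins
    height, width = len(patch), len(patch[0])
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Central differences approximate the image gradient.
            gx = patch[y][x + 1] - patch[y][x - 1]
            gy = patch[y + 1][x] - patch[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle / 180.0 * bins) % bins] += magnitude
    return hist


# A 5x5 patch whose intensity grows from left to right.
ramp = [[10 * x for x in range(5)] for _ in range(5)]
print(orientation_histogram(ramp))
```

A whole patch of raw pixels is reduced to `bins` numbers, which is the kind of compact input that classifiers such as SVM or boosting then consume.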
GAP8
• Best in class Instruction Set Architecture (ISA), UC Berkeley originated
• Open Source Computing Platform created by ETHZ and UniBo
• Engineered as Ultra-low power IoT Application Processor
[Block diagram of GAP8. Labels: LVDS, Serial I/Q, UART, SPI, I2C, I2S, CPI, HyperBus, Micro DMA, L2 Memory, Fabric Controller. Cluster: Core 0 to Core 7, I$, HWCE, H/W SYNC, DMA, Logarithmic Interconnect, Shared L1 Memory]
Arm and Mbed are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere.
Automated Memory Management
Basic Kernels: how to handle a parametric tile
Usually seen as libraries
• Vectorization + Parallelization
• No assumption on where actual data are located
User Kernels: passing actual data to basic kernels and having data circulating between them
Can be grouped and organized as generators
• A multi dimensional iteration space (2D; 3D; 4D) and a traversal order
• Each argument is a sub space of the iteration space and has actual dimensions, location (L2, external) and properties
• Given a memory budget the auto tiler "tiles" each argument and generates a fully pipelined implementation interleaving processing and data movements
• Basic Kernels are inserted at defined locations in the iteration space (prologue, body, epilog, …)
• Generated tiles are passed to Basic Kernels
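The split between basic kernels and user kernels can be sketched as follows. This is not the Autotiler API: every name here is hypothetical, the tile size is chosen by simple division rather than by a constraints solver, and the loop runs sequentially, whereas the generated code described above overlaps data movement with processing. The factor of four buffers assumes double-buffered input and output tiles.

```python
def choose_tile_rows(height, width, bytes_per_item, l1_budget, buffers=4):
    """Largest number of rows per tile such that `buffers` tile-sized
    buffers fit in l1_budget bytes (4 = double-buffered input + output)."""
    row_bytes = width * bytes_per_item
    rows = l1_budget // (buffers * row_bytes)
    if rows < 1:
        raise ValueError("memory budget too small for a single row")
    return min(rows, height)


def scale_tile(tile, factor):
    """Hypothetical basic kernel: processes whatever tile it is handed,
    with no assumption about where the full data set lives."""
    return [[factor * value for value in row] for row in tile]


def run_tiled(data, bytes_per_item, l1_budget, kernel, *kernel_args):
    """Hypothetical user kernel: traverses a 2D iteration space row-wise
    and passes each tile to the basic kernel."""
    height, width = len(data), len(data[0])
    rows = choose_tile_rows(height, width, bytes_per_item, l1_budget)
    result = []
    for start in range(0, height, rows):
        tile = data[start:start + rows]             # stands in for a DMA copy into L1
        result.extend(kernel(tile, *kernel_args))   # stands in for the copy back out
    return result


image = [[row * 8 + col for col in range(8)] for row in range(10)]
tiled = run_tiled(image, bytes_per_item=2, l1_budget=128,
                  kernel=scale_tile, kernel_args_unused=None) if False else \
        run_tiled(image, 2, 128, scale_tile, 3)
print(tiled == scale_tile(image, 3))
```

The basic kernel never sees the whole image, only tiles that fit the budget, so the same kernel works unchanged when the budget or the image size changes.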
Automated Memory Management
• Basic Kernels: C Libraries
• User Kernels: C Programs, calls to Autotiler's Model API
• Group of User Kernels: Generators
• Autotiler Library: Constraints Solver, C Code Generator
CMSIS-CNN library versus GAP8 implementation of identical CNN graph trained on CIFAR-10 images
Source: ARM processors blog

Platform    Frequency   Time     Cycles       Power
STM32 F7    216MHz      99.1ms   21 400 000   60mW
GAP8 *      15.4MHz     99.1ms   1 500 000    3.7mW
GAP8 *      175MHz      8.7ms    1 500 000    70mW
GAP8 **     4.7MHz      99.1ms   460 000      0.8mW

Running on GAP8 cluster
* No Hardware Convolution Engine
** With Hardware Convolution Engine
11 X reduction (99.1ms to 8.7ms), 16 X reduction (60mW to 3.7mW)
Other labels on the slide: 7.1, energy efficiency, ARM, 40nm
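The figures in the comparison can be cross-checked with simple arithmetic: cycles are frequency times run time, and energy per inference is power times run time. The sketch below uses only the numbers quoted above; the function names are hypothetical, and treating each power figure as the average over the whole run is an assumption.

```python
def cycles(freq_mhz, time_ms):
    """Clock cycles elapsed: MHz * ms = thousands of cycles."""
    return freq_mhz * time_ms * 1_000


def energy_mj(power_mw, time_ms):
    """Energy in millijoules: mW * ms = microjoules, so divide by 1000."""
    return power_mw * time_ms / 1_000


# (frequency in MHz, run time in ms, power in mW), as quoted in the comparison.
runs = {
    "STM32 F7": (216.0, 99.1, 60.0),
    "GAP8 * (15.4MHz)": (15.4, 99.1, 3.7),
    "GAP8 * (175MHz)": (175.0, 8.7, 70.0),
    "GAP8 ** (4.7MHz)": (4.7, 99.1, 0.8),
}

baseline = energy_mj(runs["STM32 F7"][2], runs["STM32 F7"][1])
for name, (freq, time_ms, power) in runs.items():
    energy = energy_mj(power, time_ms)
    print(f"{name}: {cycles(freq, time_ms):,.0f} cycles, "
          f"{energy:.3f} mJ, {baseline / energy:.1f}x less energy than STM32 F7")
```

Running the same workload in the same 99.1 ms at a much lower clock is what turns the cycle-count reduction into a power reduction.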
uW range
1 mW range
✓ Voltage can dynamically change
✓ One clock gen active, frequency can dynamically change
✓ Systematic clock gating
10-40 mW range
✓ Embedded DC/DC, high current
✓ Voltage can dynamically change
✓ Two clock gen active, frequencies can dynamically change
✓ Systematic clock gating
Ultra fast switching time from one mode to another
Ultra fast voltage and frequency change time
Highly optimized system level power consumption
Source of Energy Efficiency?
[Block diagram of GAP8. Labels: nine eRISC-V cores, Cluster, Shared L1 Memory, Shared Instruction Cache, DMA, HW Sync, Dbg Unit, L2 Memory, L1, Rom, UART, SPI, I2S, I2C, // 10b, GPIOs, HyperBus, Dbg, Clk]
Annotations: 3-5x; System-On-a-Chip, High integration: 2-3X