
Using RISC-V in high computing, ultra-low power, programmable circuits for inference on battery operated edge devices

Martin Croome, VP Business Development, GreenWaves Technologies

RISC-V Day in Shanghai, 30 June 2018


What this talk is about?
Market Demand
Rich sensor data
• Linear PCM = 1.4 Mbit/s: keyword spotting, beam forming, speech pre-processing
• 24-bit @ 50kHz = 1.2 Mbit/s: vibration analysis, fault detection
• 8-bit, 160x120 @ 10 fps = 4.6 Mbit/s: face detection, presence detection, counting, emotion detection
The IoT pipe
• NB-IoT, LTE-M, Sigfox, LoRa, etc.
• B/day to kB/day
Battery operated sensors
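The three raw data rates on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only; the function names are hypothetical, and two parameters are assumptions that happen to reproduce the slide's figures: CD-style audio (44.1 kHz, 16-bit, stereo) for "Linear PCM", and three 8-bit channels per pixel for the 160x120 image.

```c
#include <stdio.h>

/* Raw bit rate of a sampled stream:
 * samples per second x bits per sample x channels. */
long bits_per_second(long samples_per_second, int bits_per_sample, int channels)
{
    return samples_per_second * bits_per_sample * channels;
}

/* Assumption: "Linear PCM" is 44.1 kHz, 16-bit, stereo audio. */
long pcm_audio_rate(void)
{
    return bits_per_second(44100L, 16, 2);
}

/* From the slide: 24-bit samples at 50 kHz, one channel. */
long vibration_rate(void)
{
    return bits_per_second(50000L, 24, 1);
}

/* From the slide: 8-bit, 160x120 pixels at 10 frames per second.
 * Assumption: three colour channels per pixel. */
long image_rate(void)
{
    return bits_per_second(160L * 120L * 10L, 8, 3);
}

/* Print the three rates in Mbit/s. */
void print_sensor_rates(void)
{
    printf("PCM audio: %.2f Mbit/s\n", pcm_audio_rate() / 1e6);
    printf("Vibration: %.2f Mbit/s\n", vibration_rate() / 1e6);
    printf("Image:     %.2f Mbit/s\n", image_rate() / 1e6);
}
```

Against these Mbit/s sources, an IoT pipe carrying bytes to kilobytes per day leaves no option but to reduce the data on the sensor itself.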
30 June 2017 RISC-V Foundation 2
What this talk is about?
Market Demand
Rich sensor data
• Linear PCM = 1.4 Mbit/s
• 24-bit @ 50kHz = 1.2 Mbit/s
• 8-bit, 160x120 @ 10 fps = 4.6 Mbit/s
CNN, SVM, Bayesian, Boosting, Cepstral analysis
The IoT pipe
• NB-IoT, LTE-M, Sigfox, LoRa, etc.
• B/day to kB/day
Battery operated sensors
What this talk is about?
Market Demand
Rich sensor data
• Linear PCM = 1.4 Mbit/s
• 24-bit @ 50kHz = 1.2 Mbit/s
• 8-bit, 160x120 @ 10 fps = 4.6 Mbit/s
CNN, SVM, Bayesian, Boosting, Cepstral analysis
Issue: way more MIPS than an MCU can deliver but needs to be within an MCU power envelope?
The IoT pipe
• NB-IoT, LTE-M, Sigfox, LoRa, etc.
• B/day to kB/day
Battery operated sensors
General Patterns for content understanding
• Extract descriptors from raw data
  • 2D: Corners, blobs, HOG, DOG, …
  • 1D: LPC coefficients, Cepstral coeffs, …
  Usually highly parallel
• Use descriptors to classify data among representative families
  • Machine learning (CNN, SVM, Boost), Bayesian, …
  Also highly parallel
GAP8: Ultra Low Power IoT Processor
Performance
• up to 12 GOPS
• up to 0.4 GOPS @ 1 mW
• up to 40 MOPS @ 300 uW
• 3 uWatt stand-by power consumption
Architecture efficiency
• Extended RISC-V ISA
• Low contention shared memory 8+1 core clustered architecture
• Tight synchronization
• CNN based pattern matching engine (HWCE)
HW features
• Smart IOs
• Voltage regulator/DVFS
• RTC
• Secured execution



GAP8 hierarchical power architecture
• Quasi stand-by (uWs): monitoring. Smart I/Os, voltage regulator & RTC, SRAM in retentive mode
• Low computing power (mWs): event qualification, protocol stack, system control. Extended RISC-V
• High computing power (10 to 50 mWs): data analysis & classification. Extended RISC-V, efficient 8 core parallelization, HW synchronization, shared instruction cache, CNN HW engine
primary energy consumption primary energy consumption



GAP8: Open Source Origin

• RISC-V: best in class Instruction Set Architecture (ISA), UC Berkeley originated
• PULP: Open Source Computing Platform created by ETHZ and UniBo
• GAP8: engineered as Ultra-low power IoT Application Processor



SW development flow
[Block diagram of GAP8. FC clock & voltage domain: Fabric Controller with L1, L2 Memory, Micro DMA, peripherals (LVDS, Serial I/Q, UART, SPI, I2C, I2S, CPI, HyperBus, GPIO / PWM), PMU, RTC, ROM, Debug. Cluster clock & voltage domain: Core 0 to Core 7, HWCE, Cluster Shared L1 Memory, DMA, H/W SYNC, Logarithmic Interconnect, Shared Instruction Cache (I$), Debug.]
GAPUINO development board.

Identical cores: single GCC/GDB toolchain (including support for extended ISA)
Classic MCU development
• PULP OS, ARM™ Mbed, FreeRTOS, other OS's in development
• Drivers
• Cluster APIs
CNN graph translators (TF2GAP8, ONNX2GAP8 in development)
• Code generators for common algorithms (CNN layers, Matrix, FIR, FFT, HoG, MFCC, …)
GAP8 AutoTiler
• Separates kernel parallelization / vectorization and data flow
• Automatic code generation for data flow
• OpenMP or Native API
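Since OpenMP is listed as one of the programming models, a generic OpenMP loop gives an idea of what spreading a kernel over the cluster cores can look like. This is a minimal sketch in standard OpenMP, not GAP8 SDK code; the function name is hypothetical, and it says nothing about the Native API or about how the GAP8 toolchain maps threads to cores.

```c
#include <stdint.h>

/* Dot product of two 16-bit vectors with a 32-bit accumulator.
 * With OpenMP enabled the iterations are shared between threads and
 * the per-thread partial sums are combined by the reduction clause.
 * Without OpenMP the pragma is skipped and the loop runs on one core. */
int32_t dot_product_i16(const int16_t *a, const int16_t *b, int n)
{
    int32_t acc = 0;
    int i;
#ifdef _OPENMP
#pragma omp parallel for reduction(+:acc)
#endif
    for (i = 0; i < n; i++) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}
```

The same source builds with or without OpenMP support, which fits the "single toolchain" point made on this slide.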

Arm and Mbed are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere.
Automated Memory Management
Basic Kernels (usually seen as libraries): how to handle a parametric tile
• Vectorization + Parallelization
• No assumption on where actual data are located
User Kernels (can be grouped and organized as generators): passing actual data to basic kernels and having data circulating between them
• A multi dimensional iteration space (2D, 3D, 4D) and a traversal order
• Each argument is a sub space of the iteration space and has actual dimensions, location (L2, external) and properties
• Given a memory budget the auto tiler "tiles" each argument and generates a fully pipelined implementation interleaving processing and data movements
• Basic Kernels are inserted at defined locations in the iteration space (prologue, body, epilog, …)
• Generated tiles are passed to Basic Kernels
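The split between basic kernels and tiling can be sketched by hand. The code below is not AutoTiler output and uses no AutoTiler API: every name and the 1024-byte budget are hypothetical, `memcpy` stands in for DMA transfers, and the loop is sequential where generated code would overlap transfers with processing.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define L1_BUDGET_BYTES 1024  /* hypothetical L1 budget for this example */

/* Rows per tile such that one input tile and one output tile
 * fit in the budget together. Returns 0 if a single row is too big. */
size_t rows_per_tile(size_t width, size_t budget_bytes)
{
    size_t row_bytes = width * sizeof(int16_t);
    return budget_bytes / (2 * row_bytes);
}

/* Hypothetical basic kernel: it only sees the tile it is handed and
 * makes no assumption about where the full image lives. */
void basic_kernel_double(const int16_t *in, int16_t *out, size_t count)
{
    size_t i;
    for (i = 0; i < count; i++) {
        out[i] = (int16_t)(in[i] * 2);
    }
}

/* Walk a width x height image in row tiles. Returns the number of
 * tiles processed, or 0 if the budget cannot hold a single row. */
size_t run_tiled(const int16_t *l2_in, int16_t *l2_out,
                 size_t width, size_t height)
{
    static int16_t l1_in[L1_BUDGET_BYTES / 4];   /* half the budget */
    static int16_t l1_out[L1_BUDGET_BYTES / 4];  /* other half */
    size_t rows = rows_per_tile(width, L1_BUDGET_BYTES);
    size_t tiles = 0;
    size_t y;

    if (rows == 0) {
        return 0;
    }
    for (y = 0; y < height; y += rows) {
        /* The last tile may be shorter than the others. */
        size_t h = (height - y < rows) ? (height - y) : rows;
        size_t count = h * width;
        memcpy(l1_in, l2_in + y * width, count * sizeof(int16_t));   /* "DMA in" */
        basic_kernel_double(l1_in, l1_out, count);
        memcpy(l2_out + y * width, l1_out, count * sizeof(int16_t)); /* "DMA out" */
        tiles++;
    }
    return tiles;
}
```

With a 32-element wide image, each tile holds 8 rows, so a 20-row image is processed as three tiles (8, 8 and 4 rows).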
Automated Memory Management
BasicKernels: C Libraries
User Kernels, Group of User Kernels: Generators, C Programs, calls to Autotiler's Model API
Autotiler Library: Constraints Solver, C Code Generator
Compile & Run on PC
C code for the target handling data movements and Basic Kernels dispatch on cluster's cores

#include "AutoTilerLib.h"
#include "CNN_Generator.h"
void Mnist()
{
    CNN_TiledConvNxNReLUPool2x2_SW_fp("Conv5x5RLMP_0", 5, 1, 32, 28, 28, 1);
    CNN_TiledConvNxNReLUPool2x2_SW_fp("Conv5x5RLMP_1", 5, 32, 64, 12, 12, 1);
    CNN_TiledLinearLayer("LinearLayerRL_2", 64, 4, 4, 10, 1, 0, 0);
}
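The sizes passed to the three generator calls are consistent with a simple shape calculation. The sketch below assumes, without confirmation from the slide, that the arguments after the name are filter size, input features, output features, width and height, and that each convolution is unpadded with stride 1 before the 2x2 pooling. The helper names are hypothetical.

```c
/* Output width (or height) of an unpadded, stride-1 convolution
 * followed by 2x2 max pooling. */
int conv_pool_out(int in_size, int filter_size)
{
    int conv = in_size - filter_size + 1;  /* "valid" convolution */
    return conv / 2;                       /* 2x2 pooling halves it */
}

/* Number of values entering the final linear layer. */
int mnist_linear_inputs(void)
{
    int s0 = 28;                     /* MNIST input is 28x28 */
    int s1 = conv_pool_out(s0, 5);   /* 28 -> 24 -> 12 */
    int s2 = conv_pool_out(s1, 5);   /* 12 -> 8 -> 4 */
    return 64 * s2 * s2;             /* 64 feature maps of 4x4 */
}
```

Under these assumptions the second layer receives 12x12 maps and the linear layer receives 64 maps of 4x4, matching the 12, 12 and 64, 4, 4 arguments in the model.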



Algorithm Benchmarks
Application                          1 core   2 cores  4 cores  8 cores  Notes
1D FFT1024 Radix4                    28.2     14.3     7.8      4.7
2D FFT 256 x 256 Radix4              78.9     41.9     22.6     13.3     0.88 MHz/Frame
Byte 5x5 Conv                        18.5     9.3      4.7      2.2
Short 5x5 Conv                       37.8     18.9     9.5      4.6
Binary 5x5 Conv                      20.8     10.5     5.3      2.8
Short MaxPool2x2                     8.2      4.2      2.1      1.1
Short MatMult 32x32                  41.9     20.9     14.0     5.2
Short 2048 to 1 Fully Connected      3112.0   1616.0   847.0    495.0
CannyEdge                            99.5     50.9     26.2     12.7     VGA: 3.9 MHz/Frame
AES-CTR 128b                         15.3     7.7      4.0      2.1      0.47 MHz/Mbs-1
64 Mel Coefficients                  542.7    299.4    176.7    101.3    10ms slots 0.64MHz
HoG, 8x8 Cells, 2x2 Blocks, 9 Bins   65.0     35.0     18.0     9.0      VGA: 2.76 MHz/Frame

Cycles per produced output
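Two things can be read off the table: how well each kernel scales with the core count, and how the MHz/Frame notes follow from the 8-core column. The helpers below are a sketch with hypothetical names, assuming one produced output per pixel for the image kernels.

```c
/* Speed-up of an N-core run over the single-core run. */
double speedup(double cycles_1core, double cycles_ncores)
{
    return cycles_1core / cycles_ncores;
}

/* Fraction of ideal linear scaling achieved on the given core count. */
double parallel_efficiency(double cycles_1core, double cycles_ncores, int cores)
{
    return speedup(cycles_1core, cycles_ncores) / cores;
}

/* Millions of cycles needed per frame, assuming one output per pixel. */
double mcycles_per_frame(int width, int height, double cycles_per_output)
{
    return (double)width * (double)height * cycles_per_output / 1e6;
}
```

For example, the 1D FFT goes from 28.2 to 4.7 cycles per output, a speed-up of about 6 on 8 cores, and CannyEdge at 12.7 cycles per output on a 640x480 frame needs about 3.9 million cycles, in line with the "VGA: 3.9 MHz/Frame" note.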


Algorithm Benchmarks

7.1



CNN based text recognition

33ms per image

Trainable parameters: 421 263


Neurons: 1 511 904



Dronet – Autonomous Drone

Power envelope breakdown @ 165MHz 12 images/sec



Unique energy efficiency vs performance
Comparison of latest optimized ARM CMSIS-CNN library versus GAP8 implementation of identical CNN graph trained on CIFAR-10 images
Source: ARM processors blog

Target      Clock      Time      Cycles       Active power
STM32 F7    216 MHz    99.1 ms   21 400 000   60 mW
GAP8 *      15.4 MHz   99.1 ms   1 500 000    3.7 mW
GAP8 *      175 MHz    8.7 ms    1 500 000    70 mW
GAP8 **     4.7 MHz    99.1 ms   460 000      0.8 mW

11 X, 16 X reduction
STM 32 H7 216Mhz, 40nm
Running on GAP8 cluster
* No Hardware Convolution Engine
** With Hardware Convolution Engine
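Multiplying active power by run time gives the energy spent per inference, which makes the rows directly comparable. This is a small sketch with hypothetical function names, using only the table's figures.

```c
/* Energy per inference in microjoules: mW x ms = uJ. */
double energy_uj(double power_mw, double time_ms)
{
    return power_mw * time_ms;
}

/* How many times less energy the candidate uses than the reference. */
double energy_ratio(double ref_mw, double ref_ms, double cand_mw, double cand_ms)
{
    return energy_uj(ref_mw, ref_ms) / energy_uj(cand_mw, cand_ms);
}
```

With the table's values, the STM32 F7 spends about 5.9 mJ per inference, GAP8 at 15.4 MHz about 0.37 mJ (roughly 16 times less), and GAP8 with the Hardware Convolution Engine about 0.08 mJ (75 times less). The 175 MHz row trades energy for latency: about 0.61 mJ, but 8.7 ms instead of 99.1 ms.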

[Chart: energy efficiency versus computing power (100s of MOPS, several GOPS, TFLOPS). Plotted: best in class ULP MCUs (uAs asleep, mWs awake); GAP8 (10s of mWs); high end low power MCUs and mid-range application processors; embedded vision processors; dedicated CNN processors. Annotation on GAP8: 20X, with Extended Instruction Set (ISA), efficient parallelization, shared instruction cache, HW Convolution Engine, ultra fast HW state changes.]
Unique energy efficiency vs performance
HWCE: Boosted convolution

@1.0V, 50 MHz. Input: W=32, H=100      Conv 3x3    Conv 5x5
SW time                                129.7 us    332.1 us
SW Power                               12.58 mW    12.80 mW
HWCE time                              69.2 us     60.8 us
HWCE Power                             4.95 mW     5.1 mW

@1.0V, 50 MHz. Input: W=32, H=100      Conv 3x3    Conv 5x5
Speed gain                             1.87        5.46
Power gain intrinsic                   2.54        2.51
Power gain combined with speed gain    4.76        13.71
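The second table follows from the first: the speed gain is a ratio of times, the intrinsic power gain a ratio of powers, and the combined figure is their product, i.e. the ratio of energies. A minimal sketch, with hypothetical type and function names:

```c
/* One measured run: execution time and average power. */
typedef struct {
    double time_us;
    double power_mw;
} run_t;

/* Ratio of software time to HWCE time. */
double speed_gain(run_t sw, run_t hw)
{
    return sw.time_us / hw.time_us;
}

/* Ratio of software power to HWCE power. */
double power_gain(run_t sw, run_t hw)
{
    return sw.power_mw / hw.power_mw;
}

/* Ratio of energies (time x power), equal to speed gain x power gain. */
double combined_gain(run_t sw, run_t hw)
{
    return (sw.time_us * sw.power_mw) / (hw.time_us * hw.power_mw);
}
```

For the 3x3 case this gives 129.7 / 69.2 ≈ 1.87 and 12.58 / 4.95 ≈ 2.54, hence about 4.76 combined; the 5x5 case gives about 13.71.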



Conclusion
GAP8’s Extended RISC-V ISA and flexible, programmable architecture enable massive deployment of edge intelligence
• by dramatically reducing rich sensing device installation costs through true autonomy
• and by reducing solution cost with system on a chip integration
Architectural Innovation enabled by PULP, RISC-V and Open Source
Built on top of 2 major HW open source initiatives
Thank You!



Backup Slides



People Counting



Advanced Power Management
MCU sleep mode (uW range)
✓ Embedded DC/DC, low current
✓ Real Time Clock 32KHz only
✓ L2 Memory partially retentive
MCU active mode (1 mW range)
✓ Embedded DC/DC, high current
✓ Voltage can dynamically change
✓ One clock gen active, frequency can dynamically change
✓ Systematic clock gating
MCU + Parallel processor active mode (10-40 mW range)
✓ Embedded DC/DC, high current
✓ Voltage can dynamically change
✓ Two clock gen active, frequencies can dynamically change
✓ Systematic clock gating
Ultra fast switching time from one mode to another
Ultra fast voltage and frequency change time
Highly optimized system level power consumption
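Fast mode switching matters because average power depends on how little time is spent in the expensive mode. The sketch below is a plain duty-cycle calculation with a hypothetical function name; it ignores transition costs, and the 1% duty cycle used in the note afterwards is an illustrative assumption, not a figure from the talk.

```c
/* Average power when alternating between a sleep mode and an active
 * mode. active_fraction is the share of time spent active (0.0 to 1.0).
 * Mode transition costs are ignored. */
double average_power_mw(double sleep_mw, double active_mw, double active_fraction)
{
    return sleep_mw * (1.0 - active_fraction) + active_mw * active_fraction;
}
```

For instance, sleeping at 0.003 mW (the 3 uW stand-by figure) and running the cluster at 30 mW for a hypothetical 1% of the time averages out to roughly 0.3 mW.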
Source of Energy Efficiency?
3-5x

data analysis & classification,


1.4x overall, in practice on
targeted algorithms,
extended RISC-V
1.5x typically 20x
efficient 8 core parallelization
HW synchronization
shared instruction cache
CNN HW engine 4x

[Block diagram: L2 Memory, Micro DMA, peripherals (LVDS, UART, SPI, I2S, I2C, // 10b, GPIOs, HyperBus), ROM, Dbg Unit, Clk and one eRISC-V core with L1; Cluster with eight eRISC-V cores, Shared L1 Memory, DMA, Logarithmic Interconnect, HW Sync, CNN-HWE, Shared Instruction Cache (I$), Dbg.]



System Cost
[Chart: system cost versus computing power (100s of MOPS, several GOPS, TFLOPS). Plotted: best in class ULP MCUs; GAP8; high end low power MCUs and mid-range application processors; embedded vision processors; dedicated CNN processors. Annotation on GAP8: System-On-a-Chip, high integration, 2-3X.]

