
SNN Accelerator

[Figure: block diagram of one tile (one layer) of the SNN accelerator. Blocks shown: Central DRAM (spike, membrane potential, SNN parameters), DMA controller, spike input RAM, two One-Seen detectors (each generating an address), two local DRAMs (synaptic weights), LANE1 and LANE2 with their buffers, Wallace tree, ADD, stochastic controller, SNN parameter RAM, CMP, and the O/P spike buffer.]

Components:
Local DRAM:
Stores the synaptic weights of each layer.
Multiple banks to increase parallelism.
Unused banks are put in sleep mode to save power.
Spike Input S-RAM:
Loads all Input Spikes of a layer from Central DRAM on every tick.
Two read ports for parallel access.
Sleep-mode control logic puts unused banks into low-power mode or powers them down.
SNN parameter S-RAM:
Loads the SNN membrane potential, leak and other stochastic parameters from Central DRAM on every tick.
Stores the updated membrane potential back to Central DRAM.
O/P Spike Buffer:
Stores the final spike value of each output.
Updates Central DRAM in a burst along with the membrane potential.
One-Seen Detector:
Detects valid spikes among the input neurons.
Generates the Local DRAM address used to fetch the synaptic weights.
Detects one valid spike per cycle.
Uses a 64-bit spike-search buffer.
Lanes:
Each lane stores 4 x 16 valid synaptic weights for partial summation of 4 neuron outputs.
The 4 buffers in each lane allow each input spike to be reused for four output neurons.
Two lanes work in ping-pong fashion to maximize Wallace tree throughput.
Can run on a 2x clock to fill 16 weights in 8 cycles.
Wallace Tree Adder:
Works as a pipeline.
Needs ~8 cycles to complete the addition for one lane.
Swaps lanes every 8 cycles.
The partial sum is fed back to the respective lane buffer for further addition.
Stochastic Controller:
Generates the threshold potential based on stochastic parameters.
Generates the leak voltage based on stochastic parameters.
Generates the final membrane potential based on stochastic parameters.
Comparator:
Generates spike based on threshold reference for each output neuron.
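
The One-Seen Detector described in the component list above can be illustrated with a small software model. This is only a sketch under stated assumptions: the function name `one_seen_addresses`, the `base` offset, and the lowest-bit-first search order are hypothetical and are not taken from the design. The only properties modelled are a 64-bit search buffer and one valid spike found per step.

```python
def one_seen_addresses(spike_word, base=0, width=64):
    """Yield the position of each set bit in a spike word, lowest bit first.

    Each yielded value models one detector cycle: the index of one valid
    input spike, which the hardware would use to address the weight memory.
    The search order (lowest bit first) is an assumption of this sketch.
    """
    word = spike_word & ((1 << width) - 1)   # keep only the 64-bit search window
    while word:
        lowest = word & -word                # isolate the lowest set bit
        yield base + lowest.bit_length() - 1
        word ^= lowest                       # clear it and keep searching


spikes = 0b10010010                          # inputs 1, 4 and 7 spiked
print(list(one_seen_addresses(spikes)))      # [1, 4, 7]
```

Inputs without a spike never produce an address, so no weight fetch is issued for them; the number of steps equals the number of valid spikes rather than the buffer width.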

Working Principle:
1. Each tile is dedicated to one layer.
2. On every tick, the input spikes are loaded into the local SRAM.
3. Membrane potential and other spikes are loaded on demand.
4. The One-Seen detector starts detecting a valid spike on every 2x-clk cycle.
5. After 8 sys-clk cycles (with the lanes running on the 2x clock), the buffers of both
lanes are filled with valid weights for addition.
6. Once Lane1 is full, it triggers the Wallace tree adder.
7. The adder works as a pipeline and takes ~8 cycles to complete the partial summation for 4
output neurons.
8. It then switches to Lane2 and does that lane's partial summation in the next 8 cycles.
9. In parallel, Lane1 is refilled with valid weights.
10. The adder switches its input between the lanes in ping-pong fashion.
11. Once the summation for 8 output neurons is complete, the results are compared serially
against the reference potential and a spike is generated if required.
12. Output spikes are stored in a local buffer and written to Central DRAM in burst mode along
with the membrane potential.
13. One iteration completes the entire summation for 8 output neurons; the tile then switches to
the next set of output neurons.
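
The steps above can be sketched as a purely functional model of one tick for one layer. It is an illustration under assumptions, not the design itself: the name `run_tick`, the data layout `weights[i][j]`, and the single fixed `threshold` are hypothetical, and cycle timing, leak, stochastic parameters and the DRAM traffic are all left out. What it keeps is the flow: only valid spikes contribute, weights are consumed in batches of up to 16, two lanes of 4 output neurons are served alternately, and the 8 results are then compared one by one.

```python
def run_tick(spikes, weights, potentials, threshold,
             lanes=2, neurons_per_lane=4, chunk=16):
    """Functional model of one tick (no cycle timing).

    spikes     : list of 0/1 input spikes
    weights    : weights[i][j] is the weight from input i to output neuron j
    potentials : membrane potential per output neuron, updated in place
    Returns (output_spikes, adder_passes).
    """
    valid = [i for i, s in enumerate(spikes) if s]    # role of the One-Seen detector
    group = lanes * neurons_per_lane                  # 8 output neurons per iteration
    out = [0] * len(potentials)
    adder_passes = 0
    for start in range(0, len(potentials), group):
        stop = min(start + group, len(potentials))
        for c in range(0, len(valid), chunk):         # up to 16 weights per buffer fill
            batch = valid[c:c + chunk]
            for lane in range(lanes):                 # adder alternates between lanes
                lo = start + lane * neurons_per_lane
                hi = min(lo + neurons_per_lane, stop)
                if lo >= hi:
                    continue
                for j in range(lo, hi):               # partial sums of this lane
                    potentials[j] += sum(weights[i][j] for i in batch)
                adder_passes += 1
        for j in range(start, stop):                  # serial compare
            out[j] = 1 if potentials[j] >= threshold else 0
    return out, adder_passes


spikes = [1, 0, 1, 1]
weights = [[i + j for j in range(8)] for i in range(4)]
potentials = [0] * 8
print(run_tick(spikes, weights, potentials, threshold=15))
# ([0, 0, 0, 0, 1, 1, 1, 1], 2)
```

With three valid spikes and eight output neurons the adder is used twice, once per lane; the inactive input never touches the weight array.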

Novelty:

Spike inputs have been observed to be very sparse. The One-Seen detector
exploits this input sparsity by fetching weights only for valid spikes and feeding the lane
buffers with valid inputs for addition. This reduces the number of memory accesses and
operations.
The Wallace tree adder improves the overall throughput by speeding up the addition.
Multiple lanes help to improve the throughput.
The local memories can be implemented with multiple banks to improve parallelism.
Unused banks of the input buffer and Local DRAM can be placed in low-power mode.
The device is scalable, and performance can be improved by adding more
adders/comparators if required.
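
To show why a Wallace tree speeds up multi-operand addition, here is a minimal sketch of carry-save reduction with 3:2 compressors. It assumes non-negative integer operands and an unbounded word width, and the names `carry_save_3to2` and `wallace_sum` are hypothetical; the actual adder's operand format and pipeline staging are not described in this document.

```python
def carry_save_3to2(a, b, c):
    """3:2 compressor on whole words: returns (s, carry) with a + b + c == s + carry."""
    s = a ^ b ^ c                                  # bitwise sum without carries
    carry = ((a & b) | (b & c) | (a & c)) << 1     # majority bits, shifted one place up
    return s, carry


def wallace_sum(operands):
    """Reduce operands to at most two values with layers of 3:2 compressors,
    then finish with a single ordinary addition.

    Returns (total, layers), where layers counts the compressor layers used.
    """
    values = list(operands)
    layers = 0
    while len(values) > 2:
        full_groups = len(values) // 3
        next_values = []
        for g in range(full_groups):
            s, carry = carry_save_3to2(*values[3 * g:3 * g + 3])
            next_values += [s, carry]
        next_values += values[3 * full_groups:]    # leftovers pass to the next layer
        values = next_values
        layers += 1
    return sum(values), layers


print(wallace_sum(range(1, 17)))   # (136, 6)
```

Sixteen operands shrink as 16, 11, 8, 6, 4, 3, 2, so only six compressor layers and one carry-propagating add are needed, instead of a chain of fifteen full additions.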

Speed-up w.r.t True-North


If we assume 50% valid spikes on a tick, we need (128 x 256)/16 = 2048 cycles to compute the
output of a layer. When the pipeline is full, it computes 16 operations in each cycle. With a
10 ns clock, a tick takes ~20 us. This can be reduced further by adding more
parallel adders or by increasing the width of the Wallace tree adder.
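
The estimate above can be reproduced with a few lines of arithmetic. The sketch reads 128 as the number of valid input spikes and 256 as the number of output neurons, which is an interpretation of the figures rather than something the text states; the function name `tick_latency` and its parameters are hypothetical. Pipeline fill time and memory stalls are ignored.

```python
def tick_latency(valid_spikes, output_neurons, ops_per_cycle=16, clock_ns=10):
    """Return (cycles, microseconds) for one tick with a full pipeline."""
    operations = valid_spikes * output_neurons     # one addition per (spike, neuron) pair
    cycles = -(-operations // ops_per_cycle)       # ceiling division
    return cycles, cycles * clock_ns / 1000        # ns -> us


print(tick_latency(128, 256))                      # (2048, 20.48)
print(tick_latency(128, 256, ops_per_cycle=32))    # (1024, 10.24)
```

The second call illustrates the closing remark: doubling the number of operations per cycle, for example with a wider adder, halves the tick time under these assumptions.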
