
Designing Bespoke CNNs for Target Hardware

Woonhyun Nam
Director, Algorithms
StradVision
September 2020

© 2020 StradVision
StradVision

• We develop deep learning-based perception software for Advanced Driver Assistance Systems (ADAS) and Autonomous Vehicles
• For more detailed information, please refer to https://youtu.be/tK7F8yvpiGA and https://stradvision.com/
Agenda

• Development Process of Deep Learning-based Software
• Scalability Challenge
• Layer Transformations
• Case Studies
• Conclusions
Development Process of Deep Learning-based Software
Development Process

Repeated for each task (e.g., object detection) and target platform:

Design (Network) → Train (Weights) → Deploy (Binary) → Validate (Report)

Pros:
• Reuse existing CNN designs
• Tweak some layers
Cons:
• Retrain from scratch even for small changes
• Revalidate from scratch if output changes
Scalability Challenge
Development Process

Scalability challenge
• The number of networks to maintain grows rapidly: one network per (task, target platform) pair, so the number of edges in the task-platform bipartite graph equals the number of networks
• More than 100 networks become difficult to manage
Scalable Development Process

Main idea
• Group target platforms that have similar computational characteristics (or capabilities) with respect to a given CNN transformation
• Platforms in the same group share one network, so the number of edges (networks) shrinks
Scalable Development Process

• Design/Train CNNs once per group
• Transform CNNs without retraining within the same group

Design (Network) → Train (Weights) → Transform (Network/Weights) → Deploy (Binary) → Validate (Report)

Within a group, each additional platform reuses the trained weights: only Transform, Deploy, and Validate are repeated.
Scalable Development Process

Cost-effective method
• Increasing computational efficiency
  • E.g., lower latency, higher throughput, cheaper hardware
• Increasing engineering efficiency
  • E.g., shorter development time, fewer engineers, more projects
• Increasing economic efficiency
  • E.g., lower usage of cloud services for training, lower production cost
Layer Transformations
Layer Transformations

• Basic operations
• Case studies
  • YUV Input
  • Fixed-size Kernel Acceleration in Convolution
  • No FC Support
  • No Custom Layer Support in Quantization
  • Block-level Accumulation in Convolution
Layer Transformations

Basic operations
• (Linear) layer fusion, e.g., Conv & Conv → Conv:
  y = ax + b and z = cy + d → z = (ac)x + (bc + d), i.e., new weight ac and new bias bc + d (see the sketch below)
• Memory layout change for Input, Weight, and Output tensors
  • Reshape, e.g., the 2x3 matrix [[1, 2, 3], [4, 5, 6]] becomes the 1x6 row [1, 2, 3, 4, 5, 6]
  • Transpose, e.g., the same matrix becomes the 3x2 matrix [[1, 4], [2, 5], [3, 6]]
  • Zero-filling
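A minimal NumPy sketch of linear layer fusion (scalar weights for brevity; the same algebra applies channel-wise when fusing consecutive Conv layers):

```python
import numpy as np

# Two linear layers y = a*x + b and z = c*y + d fuse into one:
# z = (ac)x + (bc + d).
a, b = 2.0, 1.0   # first layer: weight, bias
c, d = 3.0, -0.5  # second layer: weight, bias

fused_weight = a * c      # new weight
fused_bias = b * c + d    # new bias

x = np.linspace(-1.0, 1.0, 5)
z_two_layers = c * (a * x + b) + d
z_fused = fused_weight * x + fused_bias
assert np.allclose(z_two_layers, z_fused)
```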
Case Studies
Case Study – YUV Input

Hypothetical target hardware
• YUV input, instead of RGB
• The YUV2RGB equation is given:

$$\begin{bmatrix} R \\ G \\ B \end{bmatrix} = \begin{bmatrix} w_{RY} & w_{RU} & w_{RV} \\ w_{GY} & w_{GU} & w_{GV} \\ w_{BY} & w_{BU} & w_{BV} \end{bmatrix} \begin{bmatrix} Y + b_Y \\ U + b_U \\ V + b_V \end{bmatrix}$$

→ Fusion of YUV2RGB and the first Conv: the pipeline YUV Input → YUV2RGB → Conv1 becomes YUV Input → New Conv1, matching the original RGB Input → Conv1.
Case Study – YUV Input

Equations for the new weight and new bias:

Conv1 equation: $Z = \begin{bmatrix} w_R & w_G & w_B \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix} + b$, with the YUV2RGB equation above substituted for $[R\;\, G\;\, B]^\top$.

New Conv1 equation, after fusion:

$$Z = \underbrace{\begin{bmatrix} w_R & w_G & w_B \end{bmatrix} \begin{bmatrix} w_{RY} & w_{RU} & w_{RV} \\ w_{GY} & w_{GU} & w_{GV} \\ w_{BY} & w_{BU} & w_{BV} \end{bmatrix}}_{\text{new weight}} \begin{bmatrix} Y \\ U \\ V \end{bmatrix} + \underbrace{b + \begin{bmatrix} w_R & w_G & w_B \end{bmatrix} \begin{bmatrix} w_{RY} & w_{RU} & w_{RV} \\ w_{GY} & w_{GU} & w_{GV} \\ w_{BY} & w_{BU} & w_{BV} \end{bmatrix} \begin{bmatrix} b_Y \\ b_U \\ b_V \end{bmatrix}}_{\text{new bias}}$$
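A NumPy sketch of the fusion, extended to a spatial Conv1 kernel. The conversion matrix and offsets are illustrative BT.601-style values, and all variable names are assumptions, not from the slides:

```python
import numpy as np

# YUV2RGB per the slide's convention: RGB = M @ (YUV + b_yuv), per pixel.
M = np.array([[1.164, 0.000, 1.596],
              [1.164, -0.392, -0.813],
              [1.164, 2.017, 0.000]])      # illustrative coefficients
b_yuv = np.array([-16.0, -128.0, -128.0])  # illustrative offsets

oc, kh, kw = 8, 3, 3
w = np.random.randn(oc, 3, kh, kw)  # Conv1 weight over RGB
b = np.random.randn(oc)             # Conv1 bias

# New weight: mix the RGB channel axis through M at every spatial tap.
w_new = np.einsum('ocij,cd->odij', w, M)
# New bias: b plus the constant term contributed by the YUV offsets.
b_new = b + np.einsum('ocij,c->o', w_new, b_yuv)

# Check on one input patch: Conv1 over RGB == New Conv1 over YUV.
yuv = np.random.randn(3, kh, kw)
rgb = np.einsum('cd,dij->cij', M, yuv + b_yuv[:, None, None])
z_rgb = np.einsum('ocij,cij->o', w, rgb) + b
z_yuv = np.einsum('ocij,cij->o', w_new, yuv) + b_new
assert np.allclose(z_rgb, z_yuv)
```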
Case Study – Fixed-size Kernel Acceleration in Convolution

Hypothetical target hardware
• Only 5x5 kernel acceleration supported
• Smaller kernels are zero-filled up to 5x5, e.g., a 3x3 kernel [[w11, w12, w13], [w21, w22, w23], [w31, w32, w33]] is embedded in a 5x5 kernel surrounded by zeros
• Computational utilization: 1x1 Conv ~ 4% (1/25), 3x3 Conv ~ 36% (9/25)
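A small sketch of the zero-filling, assuming the centered layout shown in the figure:

```python
import numpy as np

# Embed a 3x3 kernel in the only natively accelerated 5x5 shape.
k3 = np.arange(1.0, 10.0).reshape(3, 3)
k5 = np.zeros((5, 5))
k5[1:4, 1:4] = k3  # centered, surrounded by a border of zeros

# Only 9 of the 25 multiply-accumulates do useful work: ~36% utilization.
print(np.count_nonzero(k5) / k5.size)  # 0.36
```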
Case Study – Fixed-size Kernel Acceleration in Convolution
Notation: iw/ih/ic = input width/height/channels; kw/kh = kernel width/height; ow/oh/oc = output width/height/channels

→ Convert a 1x1 Conv into a 4(kh)x1(kw) Conv:

• Input (ic)x(ih)x(iw) → Input.reshape(ic/4, 4, ih*iw)
• Weight (oc)x(ic)x(1)x(1) → Weight.reshape(oc, ic/4, 4, 1)
• Output (oc)x(1)x(oh*ow), restored by Output.reshape(oc, oh, ow)

The 1x1 Convolution over ic channels becomes an equivalent 4x1 Convolution over ic/4 channels.
Case Study – Fixed-size Kernel Acceleration in Convolution
Worked example (ic=4, ih=2, iw=3, oc=1; spatial values per channel numbered 1..24; weight = [1, 2, 3, 4]):
• 1x1 Convolution: each output position is the dot product of the four channel values at that position with the weight
• Reshape: Input (4)x(2)x(3) → (ic'=1)x(ih'=4)x(iw'=6), where row k holds the 6 spatial values of channel k; Weight → (1)x(1)x(4)x(1), a 4x1 column; Output (1)x(1)x(6) → (1)x(2)x(3)
• 4x1 Convolution: each of the 6 output columns computes exactly the same dot product, so both convolutions produce identical results (see the sketch below)
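A runnable check of the equivalence, sketched in PyTorch with the slide's dimensions:

```python
import torch
import torch.nn.functional as F

# 1x1 Conv over ic channels rewritten as a 4x1 Conv over ic/4 channels.
ic, ih, iw, oc = 4, 2, 3, 1
x = torch.arange(1.0, ic * ih * iw + 1).reshape(1, ic, ih, iw)
w = torch.tensor([1.0, 2.0, 3.0, 4.0]).reshape(oc, ic, 1, 1)

y_1x1 = F.conv2d(x, w)                           # (1, oc, ih, iw)

x4 = x.reshape(1, ic // 4, 4, ih * iw)           # channels folded into height
w4 = w.reshape(oc, ic // 4, 4, 1)                # 4x1 kernel
y_4x1 = F.conv2d(x4, w4).reshape(1, oc, ih, iw)  # (1, oc, 1, ih*iw) restored

assert torch.allclose(y_1x1, y_4x1)
```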
Case Study – No FC Support

Hypothetical target hardware
• No support for FC layers, or
• Only single-batch FC layers supported

Matrix multiplication in an FC layer (batch=2, ic=3, oc=2): Input (batch)x(ic) = [[1, 2, 3], [4, 5, 6]] times Weight (ic)x(oc) = [[1, 4], [2, 5], [3, 6]] gives Output (batch)x(oc).
Case Study – No FC Support

→ Convert FC into 1x1 Conv:

Input * Weight = Output, with shapes (batch)x(ic) * (ic)x(oc) = (batch)x(oc), becomes

transpose(Input).reshape(ic, 1, batch) * transpose(Weight).reshape(oc, ic, 1, 1) = transpose(Output).reshape(oc, 1, batch)

i.e., after Transpose & Reshape, the batch dimension becomes the spatial width of a 1x1 Convolution with ih'=1, iw'=batch, ic'=ic, oc'=oc.
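A PyTorch sketch of the conversion, using the shapes from the slide:

```python
import torch
import torch.nn.functional as F

# Run an FC layer as a 1x1 Conv by folding the batch into the spatial width.
batch, ic, oc = 2, 3, 2
inp = torch.tensor([[1., 2., 3.], [4., 5., 6.]])  # (batch, ic)
w = torch.tensor([[1., 4.], [2., 5.], [3., 6.]])  # (ic, oc)

out_fc = inp @ w                                  # (batch, oc)

x = inp.t().reshape(1, ic, 1, batch)              # ih'=1, iw'=batch
k = w.t().reshape(oc, ic, 1, 1)                   # 1x1 kernel
out_conv = F.conv2d(x, k).reshape(oc, batch).t()  # back to (batch, oc)

assert torch.allclose(out_fc, out_conv)
```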
Case Study – No Custom Layer Support in Quantization

Hypothetical target hardware
• Quantization must be done with a tool provided by the hardware vendor
• No support for custom layers
• The tool accepts only image files as quantization inputs

→ Convert the input tensors of custom layers into image files
• Useful for two-stage detection networks (e.g., Faster R-CNN)
Case Study – No Custom Layer Support in Quantization

Convert 4D tensors into image files:
• Reshape & Add 128: a (batch)x(c)x(h)x(w) tensor in Int8 format becomes Input.reshape(batch*c, h*w) + 128, a (batch*c)x(h*w) gray image in Uint8 format; adding 128 maps the Int8 range [-128, 127] onto the Uint8 range [0, 255]
• Subtract 128 & Reshape recovers the original tensor
• Example: a 2(batch)x4(c)x2(h)x3(w) tensor with values 1..48 becomes an 8(h')x6(w') gray image whose rows are the flattened (batch, channel) slices
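A NumPy sketch of the round trip (saving the array as an actual image file, e.g., via PIL, is left out):

```python
import numpy as np

# Round-trip an Int8 tensor through a Uint8 gray image for the vendor tool.
batch, c, h, w = 2, 4, 2, 3
tensor = np.arange(1, batch * c * h * w + 1, dtype=np.int8).reshape(batch, c, h, w)

# Reshape & Add 128: Int8 [-128, 127] → Uint8 [0, 255], one row per (batch, channel).
image = (tensor.reshape(batch * c, h * w).astype(np.int16) + 128).astype(np.uint8)

# Subtract 128 & Reshape: recover the original tensor exactly.
restored = (image.astype(np.int16) - 128).astype(np.int8).reshape(batch, c, h, w)
assert np.array_equal(tensor, restored)
```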
Case Study – Block-level Accumulation in Convolution

Hypothetical target hardware
• Accumulating intermediate results of Conv in 8(oc)x8(ic) channel blocks
• For example:

Conv( Input 24(ic)x32(ih)x32(iw), Weight 8(oc)x24(ic)x1(kh)x1(kw) )
= Conv_8(oc)x8(ic)( Input[0:8,:,:], Weight[:,0:8,:,:] )
+ Conv_8(oc)x8(ic)( Input[8:16,:,:], Weight[:,8:16,:,:] )
+ Conv_8(oc)x8(ic)( Input[16:24,:,:], Weight[:,16:24,:,:] )
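A PyTorch sketch reproducing the block-level accumulation by splitting the input channels into blocks of 8 and summing the partial convolutions:

```python
import torch
import torch.nn.functional as F

ic, ih, iw, oc = 24, 32, 32, 8
x = torch.randn(1, ic, ih, iw)
w = torch.randn(oc, ic, 1, 1)

y_full = F.conv2d(x, w)  # Conv over all 24 input channels at once

# Accumulate three 8(oc)x8(ic) partial convolutions, as the hardware does.
y_blocks = sum(
    F.conv2d(x[:, i:i + 8], w[:, i:i + 8]) for i in range(0, ic, 8)
)
assert torch.allclose(y_full, y_blocks, atol=1e-5)
```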
Case Study – Block-level Accumulation in Convolution

Hypothetical target hardware
• Accumulating intermediate results of Conv in 8(oc)x8(ic) channel blocks
• For example, a 1x1 Convolution with ic=24 and oc=32: partition the Weight into 8x8 blocks [[A, B, C], [D, E, F], [G, H, I], [J, K, L]] and the Input channels into blocks 1, 2, 3; the Output channel blocks are then the accumulated products 1A + 2B + 3C, 1D + 2E + 3F, 1G + 2H + 3I, and 1J + 2K + 3L
Case Study – Block-level Accumulation in Convolution

→ 8(oc)x8(ic) block-level sparsification:

$$\text{Weight}_{\text{sparse}}(i\text{-th block}) = \begin{cases} 0 & \text{if } \sum \lvert \text{Weight}_{\text{dense}}(i\text{-th block}) \rvert < \theta \\ \text{Weight}_{\text{dense}}(i\text{-th block}) & \text{otherwise} \end{cases}$$

[Figure: a dense 1x1 Conv Weight (oc=32, ic=64) vs. the same Weight after block-level sparsification (53% sparse)]
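A NumPy sketch of the block-level sparsification. Here θ is set to the median block norm, zeroing about half the blocks (comparable to the slide's 53% example); in practice θ trades accuracy against sparsity:

```python
import numpy as np

oc, ic, block = 32, 64, 8
w = np.random.randn(oc, ic)  # 1x1 Conv weight, spatial dims dropped

# L1 norm of each 8(oc)x8(ic) block.
blocks = w.reshape(oc // block, block, ic // block, block)
norms = np.abs(blocks).sum(axis=(1, 3))
theta = np.median(norms)  # assumed threshold choice, ~50% sparsity

# Zero every block whose norm falls below theta; keep the rest dense.
mask = (norms >= theta)[:, None, :, None]
w_sparse = (blocks * mask).reshape(oc, ic)
print(f"zeroed blocks: {(~mask).sum() / norms.size:.0%}")
```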
Conclusions
Conclusions

• Layer transformation techniques can reduce production cost, especially the cost of retraining networks, while supporting a diverse array of target platforms
• They may also help deep learning hardware manufacturers support a wide variety of networks not yet natively supported by their hardware
Resources

• Sparsification
  • "Exploring the Granularity of Sparsity in Convolutional Neural Networks", CVPR 2017 Workshops: https://openaccess.thecvf.com/content_cvpr_2017_workshops/w29/html/Mao_Exploring_the_Granularity_CVPR_2017_paper.html
