
Designing Bespoke CNNs for Target Hardware

Woonhyun Nam
Director, Algorithms
StradVision
September 2020

© 2020 StradVision
StradVision

• We develop deep learning-based perception software for Advanced Driver Assistance Systems (ADAS) and Autonomous Vehicles
• For more detailed information, please refer to https://youtu.be/tK7F8yvpiGA and https://stradvision.com/
Agenda

• Development Process of Deep Learning-based Software
• Scalability Challenge
• Layer Transformations
• Case Studies
• Conclusions
Development Process of Deep Learning-based Software
Development Process

Repeated for each task (e.g., object detection) and target platform:

Design (Network) → Train (Weights) → Deploy (Binary) → Validate (Report)

Pros:
• Reuse existing CNN designs
• Tweak some layers
Cons:
• Retrain from scratch even for small changes
• Revalidate from scratch if output changes
Scalability Challenge
Development Process

Scalability challenge
• The number of networks to maintain grows rapidly: one network per (task, target platform) pair, so the number of edges in the task-platform bipartite graph equals the number of networks
• More than 100 networks become difficult to manage
Scalable Development Process

Main idea
• Group target platforms that have similar computational characteristics (or capabilities) with respect to a given CNN transformation
• Platforms in the same group share one network, so the number of edges (networks) shrinks
Scalable Development Process

• Design/Train CNNs once per group
• Transform CNNs without retraining within the same group

Design (Network) → Train (Weights) → Transform (Network/Weights) → Deploy (Binary) → Validate (Report)

Within a group, each additional platform reuses the trained weights: only Transform, Deploy, and Validate are repeated.
Scalable Development Process

Cost-effective method
• Increasing computational efficiency
  • E.g., lower latency, higher throughput, cheaper hardware
• Increasing engineering efficiency
  • E.g., shorter development time, fewer engineers, more projects
• Increasing economic efficiency
  • E.g., lower usage of cloud services for training, lower production cost
Layer Transformations
Layer Transformations

• Basic operations
• Case studies
  • YUV Input
  • Fixed-size Kernel Acceleration in Convolution
  • No FC Support
  • No Custom Layer Support in Quantization
  • Block-level Accumulation in Convolution
Layer Transformations

Basic operations
• (Linear) layer fusion, e.g., Conv & Conv → Conv:
  y = ax + b and z = cy + d → z = (ac)x + (bc + d), i.e., new weight ac and new bias bc + d (see the sketch below)
• Memory layout change for Input, Weight, and Output tensors
  • Reshape, e.g., the 2x3 matrix [[1, 2, 3], [4, 5, 6]] becomes the 1x6 row [1, 2, 3, 4, 5, 6]
  • Transpose, e.g., the same matrix becomes the 3x2 matrix [[1, 4], [2, 5], [3, 6]]
  • Zero-filling
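A minimal NumPy sketch of linear layer fusion (scalar weights for brevity; the same algebra applies channel-wise when fusing consecutive Conv layers):

```python
import numpy as np

# Two linear layers y = a*x + b and z = c*y + d fuse into one:
# z = (ac)x + (bc + d).
a, b = 2.0, 1.0   # first layer: weight, bias
c, d = 3.0, -0.5  # second layer: weight, bias

fused_weight = a * c      # new weight
fused_bias = b * c + d    # new bias

x = np.linspace(-1.0, 1.0, 5)
z_two_layers = c * (a * x + b) + d
z_fused = fused_weight * x + fused_bias
assert np.allclose(z_two_layers, z_fused)
```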
Case Studies
Case Study – YUV Input

Hypothetical target hardware
• YUV input, instead of RGB
• The YUV2RGB equation is given:

$$\begin{bmatrix} R \\ G \\ B \end{bmatrix} = \begin{bmatrix} w_{RY} & w_{RU} & w_{RV} \\ w_{GY} & w_{GU} & w_{GV} \\ w_{BY} & w_{BU} & w_{BV} \end{bmatrix} \begin{bmatrix} Y + b_Y \\ U + b_U \\ V + b_V \end{bmatrix}$$

→ Fusion of YUV2RGB and the first Conv: the pipeline YUV Input → YUV2RGB → Conv1 becomes YUV Input → New Conv1, matching the original RGB Input → Conv1.
Case Study – YUV Input

Equations for the new weight and new bias:

Conv1 equation: $Z = \begin{bmatrix} w_R & w_G & w_B \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix} + b$, with the YUV2RGB equation above substituted for $[R\;\, G\;\, B]^\top$.

New Conv1 equation, after fusion:

$$Z = \underbrace{\begin{bmatrix} w_R & w_G & w_B \end{bmatrix} \begin{bmatrix} w_{RY} & w_{RU} & w_{RV} \\ w_{GY} & w_{GU} & w_{GV} \\ w_{BY} & w_{BU} & w_{BV} \end{bmatrix}}_{\text{new weight}} \begin{bmatrix} Y \\ U \\ V \end{bmatrix} + \underbrace{b + \begin{bmatrix} w_R & w_G & w_B \end{bmatrix} \begin{bmatrix} w_{RY} & w_{RU} & w_{RV} \\ w_{GY} & w_{GU} & w_{GV} \\ w_{BY} & w_{BU} & w_{BV} \end{bmatrix} \begin{bmatrix} b_Y \\ b_U \\ b_V \end{bmatrix}}_{\text{new bias}}$$
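A NumPy sketch of the fusion, extended to a spatial Conv1 kernel. The conversion matrix and offsets are illustrative BT.601-style values, and all variable names are assumptions, not from the slides:

```python
import numpy as np

# YUV2RGB per the slide's convention: RGB = M @ (YUV + b_yuv), per pixel.
M = np.array([[1.164, 0.000, 1.596],
              [1.164, -0.392, -0.813],
              [1.164, 2.017, 0.000]])      # illustrative coefficients
b_yuv = np.array([-16.0, -128.0, -128.0])  # illustrative offsets

oc, kh, kw = 8, 3, 3
w = np.random.randn(oc, 3, kh, kw)  # Conv1 weight over RGB
b = np.random.randn(oc)             # Conv1 bias

# New weight: mix the RGB channel axis through M at every spatial tap.
w_new = np.einsum('ocij,cd->odij', w, M)
# New bias: b plus the constant term contributed by the YUV offsets.
b_new = b + np.einsum('ocij,c->o', w_new, b_yuv)

# Check on one input patch: Conv1 over RGB == New Conv1 over YUV.
yuv = np.random.randn(3, kh, kw)
rgb = np.einsum('cd,dij->cij', M, yuv + b_yuv[:, None, None])
z_rgb = np.einsum('ocij,cij->o', w, rgb) + b
z_yuv = np.einsum('ocij,cij->o', w_new, yuv) + b_new
assert np.allclose(z_rgb, z_yuv)
```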
Case Study – Fixed-size Kernel Acceleration in Convolution

Hypothetical target hardware
• Only 5x5 kernel acceleration supported
• Smaller kernels are zero-filled up to 5x5, e.g., a 3x3 kernel [[w11, w12, w13], [w21, w22, w23], [w31, w32, w33]] is embedded in a 5x5 kernel surrounded by zeros
• Computational utilization: 1x1 Conv ~ 4% (1/25), 3x3 Conv ~ 36% (9/25)
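A small sketch of the zero-filling, assuming the centered layout shown in the figure:

```python
import numpy as np

# Embed a 3x3 kernel in the only natively accelerated 5x5 shape.
k3 = np.arange(1.0, 10.0).reshape(3, 3)
k5 = np.zeros((5, 5))
k5[1:4, 1:4] = k3  # centered, surrounded by a border of zeros

# Only 9 of the 25 multiply-accumulates do useful work: ~36% utilization.
print(np.count_nonzero(k5) / k5.size)  # 0.36
```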
Case Study – Fixed-size Kernel Acceleration in Convolution
Notation: iw/ih/ic = input width/height/channels; kw/kh = kernel width/height; ow/oh/oc = output width/height/channels

→ Convert a 1x1 Conv into a 4(kh)x1(kw) Conv:

• Input (ic)x(ih)x(iw) → Input.reshape(ic/4, 4, ih*iw)
• Weight (oc)x(ic)x(1)x(1) → Weight.reshape(oc, ic/4, 4, 1)
• Output (oc)x(1)x(oh*ow), restored by Output.reshape(oc, oh, ow)

The 1x1 Convolution over ic channels becomes an equivalent 4x1 Convolution over ic/4 channels.
Case Study – Fixed-size Kernel Acceleration in Convolution
Worked example (ic=4, ih=2, iw=3, oc=1; spatial values per channel numbered 1..24; weight = [1, 2, 3, 4]):
• 1x1 Convolution: each output position is the dot product of the four channel values at that position with the weight
• Reshape: Input (4)x(2)x(3) → (ic'=1)x(ih'=4)x(iw'=6), where row k holds the 6 spatial values of channel k; Weight → (1)x(1)x(4)x(1), a 4x1 column; Output (1)x(1)x(6) → (1)x(2)x(3)
• 4x1 Convolution: each of the 6 output columns computes exactly the same dot product, so both convolutions produce identical results (see the sketch below)
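A runnable check of the equivalence, sketched in PyTorch with the slide's dimensions:

```python
import torch
import torch.nn.functional as F

# 1x1 Conv over ic channels rewritten as a 4x1 Conv over ic/4 channels.
ic, ih, iw, oc = 4, 2, 3, 1
x = torch.arange(1.0, ic * ih * iw + 1).reshape(1, ic, ih, iw)
w = torch.tensor([1.0, 2.0, 3.0, 4.0]).reshape(oc, ic, 1, 1)

y_1x1 = F.conv2d(x, w)                           # (1, oc, ih, iw)

x4 = x.reshape(1, ic // 4, 4, ih * iw)           # channels folded into height
w4 = w.reshape(oc, ic // 4, 4, 1)                # 4x1 kernel
y_4x1 = F.conv2d(x4, w4).reshape(1, oc, ih, iw)  # (1, oc, 1, ih*iw) restored

assert torch.allclose(y_1x1, y_4x1)
```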
Case Study – No FC Support

Hypothetical target hardware
• No support for FC layers, or
• Only single-batch FC layers supported

Matrix multiplication in an FC layer (batch=2, ic=3, oc=2): Input (batch)x(ic) = [[1, 2, 3], [4, 5, 6]] times Weight (ic)x(oc) = [[1, 4], [2, 5], [3, 6]] gives Output (batch)x(oc).
Case Study – No FC Support

→ Convert FC into 1x1 Conv:

Input * Weight = Output, with shapes (batch)x(ic) * (ic)x(oc) = (batch)x(oc), becomes

transpose(Input).reshape(ic, 1, batch) * transpose(Weight).reshape(oc, ic, 1, 1) = transpose(Output).reshape(oc, 1, batch)

i.e., after Transpose & Reshape, the batch dimension becomes the spatial width of a 1x1 Convolution with ih'=1, iw'=batch, ic'=ic, oc'=oc.
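A PyTorch sketch of the conversion, using the shapes from the slide:

```python
import torch
import torch.nn.functional as F

# Run an FC layer as a 1x1 Conv by folding the batch into the spatial width.
batch, ic, oc = 2, 3, 2
inp = torch.tensor([[1., 2., 3.], [4., 5., 6.]])  # (batch, ic)
w = torch.tensor([[1., 4.], [2., 5.], [3., 6.]])  # (ic, oc)

out_fc = inp @ w                                  # (batch, oc)

x = inp.t().reshape(1, ic, 1, batch)              # ih'=1, iw'=batch
k = w.t().reshape(oc, ic, 1, 1)                   # 1x1 kernel
out_conv = F.conv2d(x, k).reshape(oc, batch).t()  # back to (batch, oc)

assert torch.allclose(out_fc, out_conv)
```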
Case Study – No Custom Layer Support in Quantization

Hypothetical target hardware
• Quantization must be done with a tool provided by the hardware vendor
• No support for custom layers
• The tool accepts only image files as quantization inputs

→ Convert the input tensors of custom layers into image files
• Useful for two-stage detection networks (e.g., Faster R-CNN)
Case Study – No Custom Layer Support in Quantization

Convert 4D tensors into image files:
• Reshape & Add 128: a (batch)x(c)x(h)x(w) tensor in Int8 format becomes Input.reshape(batch*c, h*w) + 128, a (batch*c)x(h*w) gray image in Uint8 format; adding 128 maps the Int8 range [-128, 127] onto the Uint8 range [0, 255]
• Subtract 128 & Reshape recovers the original tensor
• Example: a 2(batch)x4(c)x2(h)x3(w) tensor with values 1..48 becomes an 8(h')x6(w') gray image whose rows are the flattened (batch, channel) slices
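A NumPy sketch of the round trip (saving the array as an actual image file, e.g., via PIL, is left out):

```python
import numpy as np

# Round-trip an Int8 tensor through a Uint8 gray image for the vendor tool.
batch, c, h, w = 2, 4, 2, 3
tensor = np.arange(1, batch * c * h * w + 1, dtype=np.int8).reshape(batch, c, h, w)

# Reshape & Add 128: Int8 [-128, 127] → Uint8 [0, 255], one row per (batch, channel).
image = (tensor.reshape(batch * c, h * w).astype(np.int16) + 128).astype(np.uint8)

# Subtract 128 & Reshape: recover the original tensor exactly.
restored = (image.astype(np.int16) - 128).astype(np.int8).reshape(batch, c, h, w)
assert np.array_equal(tensor, restored)
```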
Case Study – Block-level Accumulation in Convolution

Hypothetical target hardware
• Accumulating intermediate results of Conv in 8(oc)x8(ic) channel blocks
• For example:

Conv( Input 24(ic)x32(ih)x32(iw), Weight 8(oc)x24(ic)x1(kh)x1(kw) )
= Conv_8(oc)x8(ic)( Input[0:8,:,:], Weight[:,0:8,:,:] )
+ Conv_8(oc)x8(ic)( Input[8:16,:,:], Weight[:,8:16,:,:] )
+ Conv_8(oc)x8(ic)( Input[16:24,:,:], Weight[:,16:24,:,:] )
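A PyTorch sketch reproducing the block-level accumulation by splitting the input channels into blocks of 8 and summing the partial convolutions:

```python
import torch
import torch.nn.functional as F

ic, ih, iw, oc = 24, 32, 32, 8
x = torch.randn(1, ic, ih, iw)
w = torch.randn(oc, ic, 1, 1)

y_full = F.conv2d(x, w)  # Conv over all 24 input channels at once

# Accumulate three 8(oc)x8(ic) partial convolutions, as the hardware does.
y_blocks = sum(
    F.conv2d(x[:, i:i + 8], w[:, i:i + 8]) for i in range(0, ic, 8)
)
assert torch.allclose(y_full, y_blocks, atol=1e-5)
```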
Case Study – Block-level Accumulation in Convolution

Hypothetical target hardware
• Accumulating intermediate results of Conv in 8(oc)x8(ic) channel blocks
• For example, a 1x1 Convolution with ic=24 and oc=32: partition the Weight into 8x8 blocks [[A, B, C], [D, E, F], [G, H, I], [J, K, L]] and the Input channels into blocks 1, 2, 3; the Output channel blocks are then the accumulated products 1A + 2B + 3C, 1D + 2E + 3F, 1G + 2H + 3I, and 1J + 2K + 3L
Case Study – Block-level Accumulation in Convolution

→ 8(oc)x8(ic) block-level sparsification:

$$\text{Weight}_{\text{sparse}}(i\text{-th block}) = \begin{cases} 0 & \text{if } \sum \lvert \text{Weight}_{\text{dense}}(i\text{-th block}) \rvert < \theta \\ \text{Weight}_{\text{dense}}(i\text{-th block}) & \text{otherwise} \end{cases}$$

[Figure: a dense 1x1 Conv Weight (oc=32, ic=64) vs. the same Weight after block-level sparsification (53% sparse)]
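A NumPy sketch of the block-level sparsification. Here θ is set to the median block norm, zeroing about half the blocks (comparable to the slide's 53% example); in practice θ trades accuracy against sparsity:

```python
import numpy as np

oc, ic, block = 32, 64, 8
w = np.random.randn(oc, ic)  # 1x1 Conv weight, spatial dims dropped

# L1 norm of each 8(oc)x8(ic) block.
blocks = w.reshape(oc // block, block, ic // block, block)
norms = np.abs(blocks).sum(axis=(1, 3))
theta = np.median(norms)  # assumed threshold choice, ~50% sparsity

# Zero every block whose norm falls below theta; keep the rest dense.
mask = (norms >= theta)[:, None, :, None]
w_sparse = (blocks * mask).reshape(oc, ic)
print(f"zeroed blocks: {(~mask).sum() / norms.size:.0%}")
```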
Conclusions
Conclusions

• Layer transformation techniques can reduce production cost, especially the cost of retraining networks, while supporting a diverse array of target platforms
• They may also help deep learning hardware manufacturers support a wide variety of networks not yet natively supported by their hardware
Resources

• Sparsification
  • "Exploring the Granularity of Sparsity in Convolutional Neural Networks", CVPR 2017 Workshops: https://openaccess.thecvf.com/content_cvpr_2017_workshops/w29/html/Mao_Exploring_the_Granularity_CVPR_2017_paper.html
