
Adaptive Global Power-of-Two Ternary Quantization Algorithm
Based on Unfixed Boundary Thresholds
Xuefu Sui 1,† , Qunbo Lv 1,2,† , Changjun Ke 1 , Mingshan Li 1 , Mingjin Zhuang 1 , Haiyang Yu 1 and Zheng Tan 1,2, *

1 Aerospace Information Research Institute, Chinese Academy of Sciences, No. 9 Dengzhuang South Road,
Haidian District, Beijing 100094, China; suixuefu19@mails.ucas.ac.cn (X.S.); lvqb@aircas.ac.cn (Q.L.);
kecj@aircas.ac.cn (C.K.); lims@aircas.ac.cn (M.L.); zhuangmj@aircas.ac.cn (M.Z.); yuhy@aircas.ac.cn (H.Y.)
2 Department of Key Laboratory of Computational Optical Imaging Technology, Chinese Academy of Sciences,
No. 9 Dengzhuang South Road, Haidian District, Beijing 100094, China
* Correspondence: tanzheng@aircas.ac.cn; Tel.: +86-1591-063-9123
† These authors contributed equally to this work.

Abstract: In the field of edge computing, quantizing convolutional neural networks (CNNs) using
extremely low bit widths can significantly alleviate the associated storage and computational burdens
in embedded hardware, thereby improving computational efficiency. However, such quantization
also presents a challenge related to substantial decreases in detection accuracy. This paper pro-
poses an innovative method, called Adaptive Global Power-of-Two Ternary Quantization Based
on Unfixed Boundary Thresholds (APTQ). APTQ achieves adaptive quantization by quantizing
each filter into two binary subfilters represented as power-of-two values, thereby addressing the
accuracy degradation caused by a lack of expression ability of low-bit-width weight values and the
contradiction between fixed quantization boundaries and the uneven actual weight distribution. It
effectively reduces the accuracy loss while at the same time presenting strong hardware-friendly
characteristics because of the power-of-two quantization. This paper extends the APTQ algorithm
to propose the APQ quantization algorithm, which can adapt to arbitrary quantization bit widths.
Furthermore, this paper designs dedicated edge deployment convolutional computation modules
for the obtained quantized models. Through quantization comparison experiments with multiple
commonly used CNN models utilized on the CIFAR10, CIFAR100, and Mini-ImageNet data sets,
it is verified that the APTQ and APQ algorithms possess better accuracy performance than most
state-of-the-art quantization algorithms and can achieve results with very low accuracy loss in certain
CNNs (e.g., the accuracy loss of the APTQ ternary ResNet-56 model on CIFAR10 is 0.13%). The
dedicated convolutional computation modules enable the corresponding quantized models to
occupy fewer on-chip hardware resources in edge chips, thereby effectively improving computational
efficiency. This adaptive CNN quantization method, combined with the power-of-two quantization
results, strikes a balance between quantization accuracy performance and deployment efficiency
in embedded hardware. As such, valuable insights for the industrial edge computing domain can
be gained.

Keywords: convolutional neural network; power of two; unfixed boundary thresholds; ternary
quantization; low accuracy loss

1. Introduction

1.1. Background
Since the 21st century, with the increasing number of real-time application scenarios [1–4] such as drones, the Internet of Things (IoT), intelligent cars, and satellite data
in-orbit processing, which require the data generated by different sensors to be processed
quickly, the field of edge computing has developed rapidly, as it can reduce the transmission delay of data and improve real-time performance by completing the relevant


processing of the data directly in embedded chips [5,6]. Furthermore, deep convolutional
neural networks (DCNNs) are the most crucial intelligent data processing algorithms in the
field of artificial intelligence [7–9]. Leveraging their powerful capabilities for representation
learning and information perception, relevant industries have been exploring the extensive
applications of convolutional neural networks in edge computing scenarios with the aim of
achieving more efficient data processing [10–13]. However, the superior performance of
CNNs relies on their large number of parameters, thereby leading to high computation vol-
umes. For instance, VGG-16 [7] consists of 135 million parameters and requires 30.8 billion
calculations to complete a single image detection task on a 224 × 224 pixel image. Such
demands necessitate hardware with abundant computational and storage resources, such
as GPUs (e.g., Nvidia 4090Ti and Nvidia A100) [14], to provide the necessary computing
support. However, conventional embedded chips (e.g., FPGAs and ASICs) provide fewer
on-chip computation and storage resources due to their limited size, energy consumption,
and external heat dissipation conditions [15,16]. The edge deployment of CNN models
often requires external memory like Double Data Rate SDRAM (DDR) due to the large
number of parameters. This necessitates frequent reads from external memory during
the forward inference process, thus increasing deployment complexity. Moreover, due
to the transmission delay of the external memory, the data supply to the computation
unit may not be timely, thereby leading to a mismatch between the data reading rate and
the computing rate, which thus affects the computing efficiency and system performance
of the convolution computation module in the chip [17,18]. At the same time, the large
number of high-bit-width floating-point multiplication and addition calculations will lead
to insufficient computing resources, thereby increasing the computing time and energy
consumption. Therefore, it is necessary to reduce the calculation parallelism in the chip or
perform low-precision calculations by cutting parameters during the computation process
in order to reduce the pressure on the computation and bandwidth in the chip and ensure
the normal embedded operation of CNN algorithms [10,19]. These issues are currently the
main factors that make it challenging for CNN algorithms to achieve both real-time and
high-accuracy performance in industrial edge computing applications [4,20,21].

1.2. Existing Methods and Problems


In all edge computing chips (e.g., CPU, GPU, FPGA, and ASIC), reducing memory
and increasing parallelism is the fundamental solution to the problem of the poor real-time
performance of CNNs [22–24]. Therefore, the compression of CNN models is the most
meaningful research field for the application of CNNs in the edge computing field. At
present, mainstream research directions mainly include new lightweight CNN architecture
design [25,26], model pruning [27,28], low-rank matrix factorization [29], and parameter
quantization [30–34]. Among them, the parameter quantization method maps the network
weights from 32-bit floating points to low-bit-width fixed points, thus converting the
network computing mode to low-bit-width fixed-point computing, which can effectively
save hardware bandwidth and energy consumption, reduce storage space, and increase
computing parallelism in theory. It is suitable for improving computational efficiency in
edge computing chips with few resources [23]. Whether adopting a new architecture design
or conducting model pruning optimization, it is necessary to quantize the weights in new
models. This is the main reason why many edge applications use parameter quantization
as the preferred method of algorithm optimization [35,36]. Generally, the reduction in bit
width weight can lead to a higher compression ratio and faster processing speed in the
hardware; however, at the same time, the expressive ability of the weight is also reduced,
thereby decreasing the accuracy of the CNN accordingly. To date, many studies have proved
that there is high redundancy in 32-bit floating-point CNN models, and 8-bit quantization
can achieve quantization without accuracy loss in most CNNs [36–38]. However, lower-bit
quantization often leads to unacceptable accuracy loss, especially in quantization studies
with extremely low bit widths, such as below 3-bit [31,32,39–41]. Courbariaux et al. [31]
and Rastegari et al. [32] successively proposed the ultimate 1-bit quantization method: a
binary weight approximate neural network and XNOR-Net that quantize weights to −1
and +1 and can use addition or subtraction operations and bitwise operations to replace
the original multiplication calculations in convolutional neural networks. Both studies
compressed the model storage by 32× and achieved a 58× forward inference speedup on
the CPU platform. However, the accuracy losses of the two algorithms both exceeded
10%. In order to balance the quantization accuracy and hardware performance of binary
CNNs, Li et al. [40] proposed a ternary (2-bit) CNN quantization method called TWN. By
adding a value of 0 on the basis of binary quantization, the TWN quantized the weights
into three values {−1,0,1}, which improved the expressive ability of the model. In small
classification data sets such as MNIST [42], it achieved an accuracy similar to that of
the baseline network, while, in medium and large data sets, the accuracy loss was still
significant and maintained at around 4%. Later, Zhu et al. [41] proposed a more general
ternary quantization method, called TTQ, to obtain a higher accuracy. This approach is not
limited to the quantization of values to ±1 but quantizes the weights of each layer to zero
and two other positive and negative 32-bit values {−Wl^n, 0, Wl^p}. The latest research
on ternary quantization, such as those introducing the LCQ [34] and RTN [43] algorithms,
has controlled the ternary quantization accuracy loss to within 2%, thereby achieving
excellent performance at the software level. However, the abovementioned studies, as well
as most other low-bit quantization studies [44,45], all involved nonglobal quantization,
which leads the obtained quantization models to retain some 32-bit floating-point weights
without quantization. If all weights were fully quantized, the accuracy loss would be
unacceptable. This poses significant challenges for deploying these quantized CNN models
onto embedded chips. Similar to the original 32-bit models, they still require substantial
storage and computational resources, thereby resulting in a lack of computational efficiency
and lower practicality.
In our previous study [46], in order to solve the problem of nonglobal quantization,
we proposed a fine-grained global low-bit quantization architecture that can achieve quan-
tization with no accuracy loss compared to the 32-bit baseline models, at 3-bit bit widths
and above while quantizing all weights in the network. However, in the case of binary
quantization (2-bit), although our previous study achieved better performance than most
other quantization algorithms, the accuracy loss was still around 2% compared to the 32-bit
baseline models. It can be seen that there are currently few methods that can achieve
minimal accuracy loss in global ternary quantization. Therefore, studying how to further
approach the accuracy of the baseline model while carrying out global ternary quantization
is the focus of our research. Some previous research has treated parameter quantization as
an optimization problem, such as studies involving TWN and TTQ approaches [40,41,47].
These approaches determine the quantization scaling factor by calculating the minimum
Euclidean distance (i.e., L2 norm) between the original floating-point weights and the cor-
responding quantization weights, while Nvidia [48] determined the quantization scaling
factor by calculating the minimum KL divergence between them. On the other hand, more
studies have set quantization ranges and values based on demand directly [34,46] or have
linearly quantized the network according to the following formula [35,36,49–51]:

S = (wmax − wmin)/(qmax − qmin), (1)

Z = qmax − wmax/S, (2)

q = clip(round(w/S + Z), 0, 2^n − 1), (3)

where w represents the unquantized floating-point weights, q represents the quantized
fixed-point weights, S represents the quantization normalization relation (the quantization
scaling factor), Z represents the quantized fixed-point value corresponding to the 0 value
of the unquantized floating-point numbers, clip (·) denotes the boundary clipping function,
round (·) denotes rounding to the nearest calculation, and n indicates the quantization
bit width.
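For concreteness, Formulas (1)–(3) can be sketched in a few lines of NumPy. This is an illustrative implementation of the standard linear scheme as reconstructed above, not code from the paper; the names (w, S, Z, q, n) follow the paper's notation.

```python
import numpy as np

# Minimal sketch of linear quantization per Formulas (1)-(3), assuming an
# unsigned n-bit target range; not the authors' implementation.
def linear_quantize(w: np.ndarray, n: int):
    q_min, q_max = 0, 2 ** n - 1
    S = (w.max() - w.min()) / (q_max - q_min)       # Formula (1): scale factor
    Z = q_max - w.max() / S                          # Formula (2): zero point
    q = np.clip(np.round(w / S + Z), q_min, q_max)   # Formula (3)
    return q.astype(np.int32), S, Z
```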
However, regardless of the kind of quantization method adopted, the fundamental
essence of current quantization studies is to partition the floating-point weights within
each layer based on their actual distribution and to set several fixed boundary thresholds θi
for each layer. This process creates multiple fixed intervals according to these thresholds,
with each interval corresponding to a fixed-point quantization value. This means that
weights near the boundary threshold and weights near the corresponding interval center
are quantized to the same value. Due to the absence of predetermined patterns in
the weight distribution of CNN models, especially concerning fine details [52], if there are
massive weights near the fixed boundary threshold, this situation may result in some
irrationality, which will lead to certain quantization errors [50,53]. For ternary quantization,
Figure 1 shows the weight distribution of a convolutional layer. The red lines represent
the two fixed quantization boundary thresholds. Normal quantization methods quantize the
weights to {−0.125, 0, +0.125} based on these two red lines, such that all weights between −0.125
and 0.125 are quantized to 0; for example, a weight of 0.124 is theoretically closer to 0.125, and
the local error would be smaller if it were quantized as 0.125, yet in reality it is quantized as 0.
Furthermore, the two boundary thresholds for ternary quantization are often set in the
area with the highest weight density in that layer, which can amplify the error caused by fixed
boundary thresholds.

Figure 1. Ternary quantization based on fixed thresholds.

In addition, for the deployment of quantized models, Abhinav et al. [54] applied
parameter quantization to pathological image analysis and validated it on a Raspberry Pi
chip, thereby demonstrating that an 8-bit quantized model, under the same deployment
architecture, could improve computational efficiency by 30% compared to the baseline
model. Meng et al. [55] verified on an FPGA that, even with comparable resource utilization,
high-bit-width models, even at higher operating frequencies, exhibited significantly lower
computational efficiency than models with 4 bits or fewer. Li et al. [56] also validated on
an Intel CPU that a 2-bit quantized model could achieve more than a 4× improvement in
computational efficiency compared to the baseline model. These studies provide compelling
evidence for the advantages of parameter quantization in the edge applications of CNNs.
However, some more extreme quantization deployment efforts have shown that,
due to the enormous computational workload in CNNs, the scarce on-chip DSP computing units
used for multiplication calculations are often fully utilized, thereby significantly increasing
the computational and power pressure on the chip [17,57,58]. Moreover, the actual
computational efficiency of on-chip DSP units is approximately one-fourth that of other on-chip
computing resources [57]. It is evident that the on-chip DSP resources greatly limit the
computational efficiency of CNNs in edge deployment. Power-of-two quantization methods
such as INQ [33] and APoT [59] involve quantizing all the CNN weights to powers of two
or zero. In the binary computation process on edge chips, the multiplication calculations of
power-of-two values can be replaced by shift calculations [46]. Since edge chips often adopt
a register architecture, this allows for the rapid and cost-effective implementation of shift
calculations. This approach can achieve the deployment of quantized networks with fewer
on-chip hardware resources, thereby alleviating the limitation imposed by the number of
on-chip DSP multipliers on computational efficiency. However, all current power-of-two
quantization algorithms are still based on fixed quantization boundary thresholds, thereby
preventing the attainment of minimal accuracy loss in 2-bit quantization.

1.3. Contributions of This Paper


In order to avoid the constraint of fixed quantization boundary thresholds and enable
quantized CNN models with both high accuracy and strong hardware adaptability, this pa-
per focuses on a novel quantization strategy using unfixed boundary thresholds. Research
was conducted on the global power-of-two ternary CNN quantization algorithm, and we
propose an effective method: the Adaptive Global Power-of-Two Ternary Quantization
Algorithm Based on Unfixed Boundary Thresholds (APTQ). The APTQ algorithm intro-
duces a novel adaptive quantization strategy, which, based on the principle of minimizing
Euclidean distance, divides the process of directly quantizing weights into power-of-two
ternary values, which are divided into two groups of binary quantization processes. This
approach aims to achieve nonfixed quantization boundary thresholds, minimize spatial
differences between the CNN model before and after quantization, and ensure that the
quantized model exhibits hardware-friendly characteristics. At the same time, the APTQ
algorithm improves the overall quantization framework on the basis of the fine-grained
layered-grouping iteration global quantization framework proposed in the GSNQ algo-
rithm, and it adopts a fine-grained method of layer-by-layer grouping quantization and
retraining iterations adapted to the new quantization strategy to complete global CNN
quantization and to realize global power-of-two ternary quantization with unfixed bound-
ary thresholds.
Furthermore, the quantization method proposed in this paper based on unfixed
boundary thresholds is not limited to ternary quantization, but it can be extended to
arbitrary bit width quantization. Based on APTQ, we propose the Adaptive Power-of-Two
Quantization algorithm (APQ) with wider applicability, which adaptively determines all
quantization thresholds and can improve the performance of power-of-two quantization to
a certain extent. In summary, this paper builds upon the excellent hardware performance of
existing global power-of-two quantization algorithms and further optimizes their software
performance. It provides valuable insights and references for the embedded application of
large-scale CNNs in the field of edge computing. The main contributions of this paper are
as follows:
1. This paper analyzes the nonglobal quantization and fixed quantization threshold
problems in existing ternary quantization methods and formulates a new power-of-
two ternary quantization strategy with unfixed boundary thresholds based on the
global CNN quantization architecture proposed in our previous study [46]. This new
quantization strategy decomposes each filter in a CNN model into two subfilters. By
minimizing the Euclidean distance, the two subfilters are binarized into a power-of-
two form. According to the matrix additivity, the two binary filters are combined
into one ternary filter to complete the power-of-two CNN ternary quantization, and
the restrictions on CNN performance due to fixed boundary thresholds and intervals
are removed.
2. This paper formulates a general power-of-two quantization strategy based on unfixed
thresholds. By decomposing each filter into multiple filters and performing binariza-
tion and accumulation, the power-of-two ternary quantization strategy with unfixed
thresholds can be extended to any bit width quantization.
Sensors 2024, 24, 181 6 of 26

3. Ternary and other bit width quantization experiments were conducted on main-
stream CNN models, such as VGG-16 [7], ResNet-18, ResNet-20, ResNet-56 [60],
and GoogLeNet [61], in two image classification data sets: CIFAR10 [62] and Mini-
ImageNet [63]. The results were compared and evaluated quantitatively and qualita-
tively with some state-of-the-art algorithms in order to verify the effectiveness and
versatility of the proposed APTQ and APQ algorithms.
The remaining parts of the paper are structured as follows: Section 2 provides a
detailed description of the implementation of the APTQ and APQ algorithms. Section 3
lists and evaluates the comparative experimental results. Finally, Section 4 summarizes the
research presented in this paper.

2. Proposed Method
2.1. Global Power-of-Two Ternary Quantization Based on Unfixed Boundary Thresholds (APTQ)
2.1.1. APTQ Quantization Strategy
According to Formulas (1)–(3), the conventional linear CNN quantization process
based on fixed boundary thresholds and intervals can be approximated as

w ≈ Sq, (4)

where w represents the original floating-point weights to be quantized, q represents the


corresponding quantized fixed-point weights, and S represents the quantization scale factor,
which is calculated using Formula (1). In order to calculate S, it is necessary to determine
the actual range of the floating-point weight value corresponding to the quantization
value. Therefore, it can be said that the function of S is to divide a given quantization area
into several intervals. After selecting the intervals and the corresponding floating-point
boundary thresholds θ i , the floating-point weights in each range can be approximated
to the corresponding fixed-point number. Previous studies [40,64,65] have all quantized
CNN models based on this foundation. However, only three quantization intervals are
used in ternary quantization, which means that there are only two boundary values, θ 1
and θ2, in a layer. These two values are equal in magnitude in symmetric quantization, and, thus, the
quantization process of conventional symmetric ternary quantization algorithms such as
TWN and TTQ [40,41] can be defined as

qi = +1, if Wl > θ; 0, if |Wl| ≤ θ; −1, if Wl < −θ, (5)

where Wl represents the set of weights to be quantized in each CNN layer, and θ and
−θ represent the fixed boundary thresholds, which split the floating-point weights into
{−1, 0, +1}. Both algorithms use the Euclidean distance as the standard to measure the
quantization performance, and the whole ternary quantization process can be summarized
as an optimization problem as follows:

S∗, q∗ = argmin_{S,q} J(S, q) = ∥w − Sq∥²₂, s.t. S ≥ 0, q ∈ {−1, 0, +1}, (6)

where arg min (·) indicates the value of the variable when the objective function takes
the minimum value. ∥·∥²₂ represents the square of the Euclidean distance, and w, S, and q
are defined as in Formula (4). As the algorithm cannot directly determine the boundary
threshold θ, TWN defines θ as

θ = 0.7·E(|w|) ≈ (0.7/N)·∑_{i=1}^{N} |wi|, (7)
Sensors 2024, 24, 181 7 of 26

where N represents the number of weights. It can be seen that θ here is a fixed boundary
threshold. In our previous research on power-of-two quantization, we also adopted the
method of customizing the fixed floating-point boundary thresholds according to the
range of floating-point values in each CNN layer to complete quantization [46]. The
essence of most quantization studies is to seek several of the best floating-point boundary
thresholds θ i in the quantization process and to quantize the weights in the fixed region to
their corresponding θ i values. These fixed boundary thresholds will become the primary
constraint factor for the CNN ternary quantization performance due to the uneven weight
distribution, especially in the case of extremely low bit quantization. Based on the TWN
and GSNQ algorithms, this paper proposes a global power-of-two ternary quantization
algorithm based on an unfixed boundary threshold, APTQ, to remove these restrictions.
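As a concrete reference point, the fixed-threshold scheme of Formulas (5)–(7) can be sketched in a few lines of NumPy. This is an illustrative reimplementation of the TWN-style rule, not the authors' code; the closed-form scale follows from minimizing Formula (6) for a fixed q.

```python
import numpy as np

# Illustrative sketch of fixed-threshold ternarization (Formulas (5)-(7));
# not the authors' implementation.
def ternarize_fixed(w: np.ndarray):
    theta = 0.7 * np.abs(w).mean()        # Formula (7): fixed boundary threshold
    q = np.zeros_like(w)
    q[w > theta] = 1.0                    # Formula (5): partition by fixed thresholds
    q[w < -theta] = -1.0
    nonzero = q != 0
    # Scale minimizing ||w - S q||_2^2 for this fixed q (optimum of Formula (6))
    S = np.abs(w[nonzero]).mean() if nonzero.any() else 0.0
    return S, q
```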
We introduce the quantization strategy and the complete quantization process of the
APTQ algorithm through the ternary quantization of convolutional layers. As the fully
connected layers in CNNs can also be expressed in the form of convolution calculations, the
quantization method for fully connected layers is the same as that for convolutional layers.
Assuming a floating-point CNN model with a total of L layers {Wl : 1 ≤ l ≤ L}, the weight
set in the lth convolutional layer is expressed as Wl, which contains i floating-point filters Fi.
quantized into two structurally identical subnetworks. Specifically, for individual filters,
all the filters Fi to be quantized are correspondingly split into two subfilters, Fi1 and Fi2 .
These filters and their weights simultaneously satisfy the following formulas:

Fi = Fi1 + Fi2 , (8)



a = a1 + a2, s.t. |a1|, |a2| < min{max(Wl), |min(Wl)|}, (9)
where a, a1 , and a2 refer to the corresponding floating-point weight values in the same
position in Fi, Fi1, and Fi2, respectively; max (·) means to take the maximum value; and
min (·) means to take the minimum value. The restriction on a1 and a2 in Formula (9) is
very important. Without this restriction, large errors can easily occur, thereby resulting in a
large accuracy loss. Subsequently, the two subnetworks undergo binary quantization, and
Fi1 and Fi2 are optimized for binary quantization according to Formula (6) to obtain two
binary fixed-point filters, Bi1 and Bi2 , respectively, the weights of which belong to {−1, 1}.
If one wishes to recombine the two binary subnetworks into a new ternary quantized
network, it is necessary to ensure that the weights in the two binary matrices derived from
the same original matrix have the same absolute values, i.e., the quantization scale factor S
is the same, to ensure that their sum can be represented in ternary form. Thus, the binary
quantization process for the two subnetworks can be expressed as

J(Si1, Si2, Bi1, Bi2) = ∥Fi1 − Si1·Bi1∥²₂ + ∥Fi2 − Si2·Bi2∥²₂, s.t. Si1 = Si2 = Si; Si ≥ 0, (10)

where Si1 and Si2 represent the quantization scale factors of Bi1 and Bi2 , respectively, which
can be uniformly denoted as Si , and J (·) represents the objective to be obtained in this
optimization problem. Although each binary quantization process is also based on a fixed
boundary threshold quantization method, the combined quantized values can allow certain
weights to exceed the constraints of the boundary thresholds. Formula (10) can be expanded
to obtain the values as follows:

J(Si1, Si2, Bi1, Bi2) = S²(Bᵀi1·Bi1 + Bᵀi2·Bi2) − 2S(Bᵀi1·Fi1 + Bᵀi2·Fi2) + (Fᵀi1·Fi1 + Fᵀi2·Fi2). (11)

As Fi1 and Fi2 are known, Fᵀi1·Fi1 + Fᵀi2·Fi2 and Bᵀi1·Bi1 + Bᵀi2·Bi2 are known constants. In
order to bring the spatial distribution of the ternary convolutional layer closest to that
before quantization, it is necessary to minimize the sum of squares of the Euclidean distance;
then, Bᵀi1·Fi1 + Bᵀi2·Fi2 needs to keep the maximum value. When Bi1 and the corresponding Fi1
have the same sign at the same position, the largest Bᵀi1·Fi1 + Bᵀi2·Fi2 can be obtained. Thus,
the solution formulas are as follows:

Bi1 = sgn (Fi1 ), (12)

Bi2 = sgn (Fi2 ), (13)
where sgn (·) represents the sign function. According to the obtained Bi1∗ and Bi2∗, the
optimal quantization scale factor Si∗ can also be calculated using the following formula:

Si∗ = (∑_{k=1}^{N} |Fi1k| + ∑_{k=1}^{N} |Fi2k|) / (2N), (14)
where Fi1k and Fi2k represent the elements in filters Fi1 and Fi2 , respectively, and N repre-
sents the total number of weights in this convolutional layer, which means that all filters
in the entire convolutional layer will have the same optimal scaling factor. At this time,
we can define the part Ti that completes the ternary quantization as in Formula (15). This
completes ternary quantization with unfixed boundary thresholds, thereby allowing for
higher quantization accuracy than fixed boundary threshold ternary quantization.

Ti = Si∗ Bi1 + Si∗ Bi2 . (15)

However, the ternary quantization result obtained using Formula (15) is the same as
that of the TTQ algorithm [41], which quantizes the weights in each layer to 0 and two
other positive and negative 32-bit floating-point values. Although this can result in better
software performance, it will complicate the implementation of the original CNN model
when deploying to embedded chips in edge computing application devices. It still requires
considerable on-chip computation and storage resources, and the computing efficiency
is low, thereby resulting in poor practicability and hardware-unfriendly characteristics.
To achieve hardware-friendly power-of-two quantization, after calculating the optimal Si∗
using Formula (14), this paper further approximates Si∗ to the nearest power of two to
obtain the final quantization scale factor Si∗∗ :

sub = min(|list(p) − Si∗|), s.t. list(p) = {1, 0.5, 0.25, 0.125, . . .}, (16)

Si∗∗ = the value in list(p) corresponding to sub, (17)


where list(p) represents an array of positive power-of-two values arranged from largest
to smallest starting from one. This is because the weights in a pretrained CNN model
are generally less than one [52]. Furthermore, sub represents the absolute value of the
minimum difference between the values in the list(p) arrays and Si∗ , and Si∗∗ is the power-
of-two value corresponding to sub. Finally, the ternary quantization is completed according
to Formula (18):
Ti = Si∗∗ Bi1 + Si∗∗ Bi2 = Si∗∗ (Bi1 + Bi2 ) (18)
Here, since both Bi1 and Bi2 belong to {−1, +1}, Ti must be a power-of-two or 0 value,
thereby completing the entire layer’s power-of-two ternary quantization.
In summary, the basis of the APTQ algorithm is to use Formulas (8), (9), (12)–(14),
and (16)–(18) to decompose each filter into two subfilters, complete the binary quantization
of the subfilters according to the obtained optimal power-of-two quantization scale factor,
and finally add them together to resynthesize a power-of-two ternary filter with unfixed
quantization boundary thresholds. Figure 2 takes a 3 × 3 × 3 filter as an example to
intuitively illustrate the quantization process of the proposed APTQ algorithm.
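To make the per-filter procedure concrete, the NumPy sketch below walks through Formulas (8), (9), (12)–(14), and (16)–(18) for one layer. The random split is only one possible way to satisfy Formulas (8) and (9), since the paper leaves the split rule open; in APTQ, the split and retraining jointly determine which weights merge to zero.

```python
import numpy as np

# Hedged sketch of APTQ for all filters of one layer; the signed random split
# is an illustrative choice, not the authors' prescribed decomposition.
def aptq_ternarize(filters: np.ndarray):
    bound = min(filters.max(), abs(filters.min()))    # limit from Formula (9)
    # Split each weight a into a1 + a2 (Formula (8)); fractions outside [0, 1]
    # give the two halves opposite signs, so their merged value becomes 0.
    frac = np.random.uniform(-0.5, 1.5, size=filters.shape)
    f1 = np.clip(filters * frac, -bound, bound)       # keeps subfilter 1 bounded
    f2 = filters - f1                                 # subfilter 2
    b1, b2 = np.sign(f1), np.sign(f2)                 # Formulas (12)-(13)
    s_opt = (np.abs(f1).sum() + np.abs(f2).sum()) / (2 * filters.size)  # (14)
    # Formulas (16)-(17): snap the shared scale to the nearest power of two
    plist = 2.0 ** -np.arange(16)                     # {1, 0.5, 0.25, ...}
    s_pow2 = plist[np.argmin(np.abs(plist - s_opt))]
    return s_pow2 * (b1 + b2)                         # Formula (18)
```

Because b1 + b2 is elementwise in {−2, 0, 2} and the shared scale is a power of two, every output value is a power of two or zero, as Formula (18) requires, and no fixed magnitude threshold decides which weights map to zero.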

Figure 2. Strategy of power-of-two ternary quantization based on unfixed boundary thresholds.

2.1.2. Weight Distribution Characteristics of the APTQ Quantization Strategy
In the entire quantization process, the inherent randomness in decomposing
floating-point filters into two subfilters is the fundamental reason why certain
weights are released from the constraints of fixed quantization boundary thresholds. The
final distribution of the quantized weights therefore differs to some extent from that of conventional
quantization algorithms. As illustrated in Figure 3, we compare the quantization results of a
single convolutional kernel after power-of-two ternary quantization using the conventional
quantization algorithm based on fixed quantization boundary thresholds and using the
APTQ algorithm proposed in this paper.

Figure 3. Quantization comparison of a single convolution kernel based on fixed boundary thresholds
and unfixed boundary thresholds: (a) schematic diagram of the GSNQ quantization result;
(b) schematic diagram of the APTQ quantization result.
It can be seen from this example that, from the perspective of a single convolution
kernel or filter, the APTQ algorithm can quantize a floating-point weight to a power-of-two
or 0 value with a farther L1 distance, such as quantizing 0.3 to a more distant 0,
which at first seems unreasonable. This is precisely the characteristic of the APTQ algorithm:
breaking the inherent constraints of fixed quantization boundaries. While ensuring that
the spatial Euclidean distance is minimized for the network layers and the entire network
before and after quantization, it ensures that the CNN can be quantized more reasonably.
This involves quantizing a portion of the weights within the same range to one quantized
value and another portion to a different quantized value, thereby more effectively reducing
the accuracy loss caused by low-bit-width quantization. Figure 4 shows a ternary
quantization weight distribution comparison of the GSNQ and APTQ algorithms in the
eighth convolutional layer of VGG-16.

Figure 4. Comparison of quantization weight distribution between GSNQ and APTQ algorithms
in a single convolutional layer: (a) ternary quantization based on fixed boundary thresholds and
(b) ternary quantization based on unfixed boundary thresholds.
The horizontal coordinates in the figure indicate the original floating-point weight
values, and the vertical coordinates indicate the distribution number of the corresponding
weights. The blue part indicates the overall weight distribution of the original floating-point
weights in this convolutional layer. Both algorithms quantized all the weights in this
layer to {−2^−6, 0, 2^−6}. In order to illustrate the difference between the quantization of
the algorithm based on fixed thresholds and the algorithm based on unfixed thresholds
more intuitively, we compared each of the three distributions of quantized values after
employing quantization separately. The green, gray, and red parts indicate the distribution
of the weights that were quantized to −2^−6, 0, and 2^−6, respectively. As indicated by the
gray area in Figure 4a, whereas other quantization methods quantized all these weights to 0,
the APTQ algorithm quantized a portion to −2^−6, another portion to 2^−6, and another
portion to 0. The same logic applies to the green and red sections. This minimized
the Euclidean distance for the entire network before and after quantization. This means
that the quantization result of APTQ does not have fixed boundary thresholds, thereby
breaking through the limitation of the fixed boundary threshold, which leads to a higher
accuracy of the ternary CNN model.

2.1.3. APTQ Global Retraining Process
Regardless of the kind of quantization strategy, the expression ability of the values
must decrease significantly when the weights change from 32-bit floating-point numbers
to 2-bit fixed-point numbers, thus making some accuracy loss unavoidable. Therefore, it
is necessary to retrain the quantized CNN model to compensate for this loss. The
accuracy loss after one-time global quantization is often unrecoverable, and this paper
proposes a fine-grained global quantization retraining architecture to maximally
compensate for the low-bit quantization accuracy loss, which includes weights grouped by
layer, layer-by-layer grouped quantization, and network retraining to gradually complete
the quantization of the entire CNN model. The specific process is shown in Figure 5.

Figure 5. APTQ global fine-grained quantization architecture.

This figure shows the whole quantization process of a CNN model with L layers.
Wl represents the weight set of the lth layer. Blue indicates unquantized portions, while
green represents quantized portions. The whole architecture divides the CNN quantization
process into two steps: horizontal (intralayer) and vertical (different layers) quantization. In
the process of grouping weights by layer, the architecture takes filters as units, sorts the
filters in each layer from large to small according to their L1 norm, and divides them into
different numbers of groups (generally four groups per layer) according to the actual
situation of different networks. A larger L1 norm means that a filter has a greater
influence on the CNN model [27,52]. Therefore, in the subsequent quantization process,
the filter group with a larger L1 norm is given priority in quantization. As each CNN
layer needs to maintain the same three quantization values, for each layer, all of the filters
in the same layer are split into two corresponding subfilters in the process of grouped
quantization, as in Formulas (8) and (9), after which power-of-two ternary quantization
is performed based on the unfixed boundary thresholds in the order of grouping. After
quantizing a group, a CNN retraining process is performed to restore accuracy. In the
retraining process, it is necessary to remerge the subfilters of the unquantized part in the
layer where the current quantization group is located, keep all data in this layer and the
previously quantized layers fixed during the SGD weight update process in backpropagation,
and update all the unquantized weights to compensate for the accuracy loss caused by the
quantized part. Then, in the quantization process of the next group, the remaining filters in
this layer must be resplit into two parts (as in the first split) and then quantized and retrained.
Through continuous iteration, the whole CNN model can be finally quantized in a step-by-step
manner, and the accuracy loss can be kept within an acceptable range as much as possible.

The process of the proposed APTQ algorithm is summarized in Algorithm 1:
Algorithm 1: Adaptive Global Power-of-Two Ternary Quantization Based on Unfixed Boundary Thresholds (APTQ)
Input: 32-bit floating-point CNN model {Wl : 1 ≤ l ≤ L}
Output: Power-of-two ternary quantization CNN model {Ql : 1 ≤ l ≤ L}
1: Grouping weights by layer: Sort filters by their L1 norm and divide them into M groups
{Dlm : 1 ≤ m ≤ M}
2: for l ∈ [1, . . . , L] do
3: Split all filters in the same layer into two subfilters using Formulas (8) and (9)
4: for m ∈ [1, . . . , M] do
5: Determine the optimal quantization scale factor using Formulas (12)–(14), (16)
and (17), and complete power-of-two binary quantization of the two subfilters
6: Remerge the subfilters using Formula (18) to complete the power-of-two ternary
quantization based on unfixed boundary thresholds
7: Retrain the network, keep the quantized layers fixed, and update unquantized
weights in other layers
8: end for
9: end for
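As a rough illustration of the retraining step (step 7), a PyTorch-style epoch can freeze already-quantized positions by zeroing their gradients before the optimizer update. Here, model, loader, optimizer, criterion, and the boolean mask dictionary quant_masks are assumed to exist; this is not the authors' training code.

```python
import torch

# Hedged sketch of Algorithm 1, step 7: gradients at already-quantized
# positions are zeroed so SGD only updates the still-unquantized weights.
def retrain_epoch(model, loader, optimizer, criterion, quant_masks):
    model.train()
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        for name, param in model.named_parameters():
            mask = quant_masks.get(name)     # True where the weight is quantized
            if mask is not None and param.grad is not None:
                param.grad[mask] = 0.0       # keep quantized weights fixed
        optimizer.step()
```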

2.2. Universal Global Power-of-Two Quantization Based on Unfixed Boundary Thresholds (APQ)
The quantization strategy based on unfixed boundary thresholds proposed in this
paper can relieve the limitation of quantization performance caused by the fixed quanti-
zation boundaries in ternary quantization, and it can effectively reduce the accuracy loss
due to ternary quantization. At the same time, although weights with a higher bit width
have superior expressive abilities and the quantization intervals are more tightly divided,
many works have achieved higher accuracy results than the original CNN model with 4-bit
and 5-bit quantization [43,46,59]. However, the limitation of fixed boundary thresholds
with respect to the CNN quantization performance may still exist. In order to further
improve the accuracy, this paper extends the APTQ algorithm mentioned in Section 2.1 to
any bit width and proposes a new universal power-of-two quantization algorithm based
on unfixed boundary thresholds, which is called APQ.
The main difference between h-bit and ternary quantization in the proposed quanti-
zation strategy lies in the number of subfilters divided by each filter and in the way the
quantization scale factor is approximated to a power-of-two value. First, the APQ algorithm
divides the filter Fi in the lth layer into H = 2^h − 2 subfilters {Fi1, Fi2, . . . , FiH}, where
h is the bit width to be quantized. At the same time, these subfilters need to satisfy the
following conditions:
Fi = Fi1 + Fi2 + . . . + FiH , (19)

a = a1 + a2 + . . . + aH, s.t. |a1|, |a2|, . . . , |aH| < (1/(h − 1))·min{max(Wl), |min(Wl)|}. (20)
Similar to the APTQ algorithm, a1 to aH here refer to the corresponding floating-point
weights in the same position in terms of subfilters Fi1 to FiH, respectively. After the split
is completed, we quantize all the divided subfilters based on the principle of minimizing
Euclidean distance according to Formula (21):

J(Si1, . . . , SiH; Bi1, . . . , BiH) = ∑_{k=1}^{H} ∥Fik − Sik·Bik∥²₂, s.t. Si1 = . . . = SiH = Si; Si ≥ 0, (21)


In the APQ algorithm, BiH also maintains the same sign at each corresponding position
with the respective subfilter, and all quantized weights belong to {−1, +1}, as shown in
Formula (22):

BiH = sgn(FiH ), (22)
The optimal quantization scale factor Si∗ of the APQ algorithm changes based on
the number of divided subfilters, which differs from the APTQ algorithm, as shown in
Sensors 2024, 24, 181 13 of 26

Formula (23). When the quantization bit width h is 2-bit, it is equivalent to the Si∗ of the
APTQ algorithm:
Si∗ = (∑_{j=1}^{H} ∑_{k=1}^{N} |Fijk|) / (HN). (23)
Here, N represents the total number of weights in the convolutional layer, and H is the
number of subfilters into which the original filter is divided.
As the APQ algorithm divides the filters into a variable number of subfilters, which
may not necessarily be a power of two—such as 14 subfilters when quantized to 4-bit
and 6 subfilters when quantized to 3-bit—if Si∗ is directly approximated to a power-of-two
value and then added together, the quantization result in the power-of-two form may not
be obtained. Therefore, after obtaining Si∗ , in contrast to the APTQ algorithm, it is necessary
to first remerge the binary subfilters and then approximate their weights to the nearest
power-of-two value to achieve generalized power-of-two quantization based on unfixed
boundary thresholds in the APQ. The specific process is shown in the Formulas (24)–(26):

Qi = Si∗ ∑_{k=1}^{H} Bik, (24)

sub = min(|list(p) − Qi|), s.t. list(p) = {1, 0.5, 0.25, 0.125, . . .}, (25)

Qi∗ = the value in list(p) corresponding to sub, (26)
where Qi is the fixed-point filter obtained by remerging the binary subfilters, and Qi∗ is
the final quantized power-of-two result of Qi approximated to the nearest power-of-two
value. The remaining quantization steps and retraining architecture remain the same as in
the APTQ algorithm. The quantization process of the APQ algorithm can be summarized
in Algorithm 2:

Algorithm 2: Universal Global Power-of-Two Quantization Based on Unfixed Boundary Thresholds (APQ)
Input: 32-bit floating-point CNN model {Wl : 1 ≤ l ≤ L}
Output: Power-of-two h-bit quantization CNN model {Ql : 1 ≤ l ≤ L}
1: Grouping weights by layer: Sort filters by their L1 norm and divide them into M groups
{Dlm : 1 ≤ m ≤ M}
2: for l ∈ [1, . . . , L] do
3: Split all filters in the same layer into H = 2h − 2 subfilters using Formulas (19) and (20)
4: for m ∈ [1, . . . , M] do
5: Determine the optimal quantization scale factor using Formulas (21)–(23), and
complete power-of-two binary quantization of the H subfilters
6: Merge the binary subfilters and approximate all values to the nearest power-of-two by
using Formulas (24)–(26), completing the h-bit power-of-two quantization of the original filter.
7: Retrain the network, keep the quantized layers fixed, and update unquantized
weights in other layers
8: end for
9: end for
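For intuition, the per-filter APQ pipeline (Formulas (19)–(26)) can be sketched as follows. The signed random split is an illustrative way to satisfy Formula (19); a practical implementation would additionally enforce the magnitude bound of Formula (20), which this sketch omits.

```python
import numpy as np

# Hedged sketch of APQ h-bit quantization for one filter f; the signed random
# split is illustrative, and Formula (20)'s bound is not enforced here.
def apq_quantize(f: np.ndarray, h: int):
    H = 2 ** h - 2                                 # number of subfilters
    r = np.random.uniform(-1.0, 1.0, size=(H,) + f.shape)
    r += (1.0 - r.sum(axis=0)) / H                 # fractions sum to 1 per weight
    subs = r * f                                   # Formula (19): f = sum of subfilters
    B = np.sign(subs)                              # Formula (22): binarized subfilters
    s_opt = np.abs(subs).sum() / (H * f.size)      # Formula (23): shared scale
    q = s_opt * B.sum(axis=0)                      # Formula (24): merged filter
    # Formulas (25)-(26): snap each merged value to the nearest power of two or 0
    plist = np.concatenate(([0.0], 2.0 ** -np.arange(16)))
    idx = np.abs(np.abs(q)[..., None] - plist).argmin(axis=-1)
    return np.sign(q) * plist[idx]
```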

2.3. APTQ and APQ Dedicated Convolutional Computation Module in Edge Chips
The two power-of-two quantization algorithms proposed in this paper quantize all the
parameters in a CNN into power-of-two values. This characteristic enables the quantized
model to be deployed in embedded chips for edge computing applications using simple
shift operations instead of complex multiplication operations. Before deployment, the
power-of-two weights need to be recoded into compact binary codes (2-bit codes for the APTQ
ternary weights and 3-bit codes for the 3-bit APQ weights) for storage in the edge chips as
new weights for hardware calculations, as shown in Tables 1 and 2.

Table 1. Hardware deployment recoding of APTQ ternary quantized weights.

Quantized Weight    Recoding Weight
2^n                 01
0                   00
−2^n                11

Table 2. Hardware deployment recoding of APQ 3-bit quantized weights.

Quantized Weight    Recoding Weight
2^(n+2)             001
2^(n+1)             010
2^n                 011
0                   000
−2^n                111
−2^(n+1)            110
−2^(n+2)            101

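In software, these two mappings can be captured by simple lookup tables such as the following (an illustrative sketch of ours; in the actual deployment the codes are fixed in the hardware logic):

```python
# Table 1: APTQ ternary weights {+2^n, 0, -2^n} -> 2-bit storage codes.
APTQ_CODES = {
    +1: 0b01,   # +2^n
     0: 0b00,   # 0
    -1: 0b11,   # -2^n
}

# Table 2: APQ 3-bit weights -> 3-bit storage codes; the key is
# (sign, exponent offset relative to the shared exponent n).
APQ3_CODES = {
    (+1, 2): 0b001,   # +2^(n+2)
    (+1, 1): 0b010,   # +2^(n+1)
    (+1, 0): 0b011,   # +2^n
    ( 0, 0): 0b000,   # 0
    (-1, 0): 0b111,   # -2^n
    (-1, 1): 0b110,   # -2^(n+1)
    (-1, 2): 0b101,   # -2^(n+2)
}
```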
For the ternary CNN model quantized with APTQ, the chip determines whether the
input binary-encoded weights and pixel values contain a zero value and identifies the sign
of the quantized weights. It then performs a pre-set n-bit shift operation or directly outputs
a zero value. Figure 6 shows the dedicated multiplication processing unit designed in an
FPGA for the APTQ-quantized CNN model, which performs the multiplication operation
for one input pixel with one weight, thereby fully leveraging the advantages of ternary
power-of-two quantization.

Figure 6. Dedicated multiplication processing unit for ternary quantization CNN models. (The unit applies 0-value judgment to the 8-bit input pixel and the 2-bit weight and outputs either 0 or the weight's sign bit together with pixel >> n.)
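Functionally, the unit in Figure 6 behaves like the following Python sketch (an emulation under our own naming, not the actual Verilog; n is the shared exponent of the layer and the weight codes follow Table 1):

```python
def aptq_multiply(pixel: int, weight_code: int, n: int) -> int:
    """Emulates the ternary PE of Figure 6: 0-value judgment, sign
    detection, then a pre-set n-bit shift instead of a multiplication."""
    if weight_code == 0b00 or pixel == 0:
        return 0                                  # zero-judgment path
    shifted = pixel << n if n >= 0 else pixel >> (-n)
    # Code 0b01 encodes +2^n; code 0b11 encodes -2^n (sign in the top bit).
    return shifted if weight_code == 0b01 else -shifted
```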

For CNN models with other bit widths quantized using APQ, a more versatile shift-based
multiplication processing unit was designed in this paper. Figure 7 shows a schematic
diagram of a multiplication processing unit for a 4-bit quantized model. Each unit, through
judgment modules, detects whether there are zero data in the input. An enable signal is
generated by an AND gate, and when the enable signal is high, the unit performs shift-based
calculations, thereby outputting results, including the sign bit and reserved bits. Otherwise,
it directly outputs 0. This multiplication processing unit can be similarly adopted for models
with other bit widths.
The specialized multiplication processing units designed for the above two approaches
aim to leverage the advantages of the APTQ and APQ quantization processes, thereby
achieving zero occupation of the on-chip DSP computing units in edge chips. With this
simple judgment-and-shift scheme, a multiplication in an edge chip can be completed
within three clock cycles, thereby minimizing on-chip computational resource utilization,
enhancing computational efficiency, and facilitating the construction of a pipelined
processing architecture.
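Analogously, the judgment/enable/shift path of the APQ unit can be emulated as below (again a functional sketch of ours, assuming the 3-bit weight codes of Table 2):

```python
def apq_multiply(pixel: int, weight_code: int, n: int) -> int:
    """Emulates the APQ PE of Figure 7: the two 0-value judgments feed an
    AND-gate enable; when enabled, the shifter applies the decoded
    exponent and the sign bit is attached to the result."""
    enable = (pixel != 0) and (weight_code != 0b000)
    if not enable:
        return 0
    sign = -1 if weight_code & 0b100 else 1        # weight[2] is the sign bit
    offsets = {0b01: 2, 0b10: 1, 0b11: 0}          # weight[1:0] -> exponent offset
    exp = n + offsets[weight_code & 0b011]
    shifted = pixel << exp if exp >= 0 else pixel >> (-exp)
    return sign * shifted
```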
Figure 7. Dedicated multiplication processing unit for other bit width quantization CNN models.

This structure can simplify the processing of multiplication operations in embedded
chips such as FPGAs and ASICs. Furthermore, as the main computational complexity in
CNNs derives from the convolutional computations in the convolutional layers, this means
that, in order to verify the effectiveness of APTQ for CNN hardware deployment, a complete
convolutional computation architecture was also built for this study. The overall architecture
adopted the same pipelined architecture as in Reference [46], as shown in Figure 8.

Figure 8. Convolutional computation module. (The module comprises a weights BROM, a 3 × 3 convolution window fed by the input data flow, bias addition, ReLU, truncation, and two output buffers.)

Due to the fact that only the forward inference process of CNN models needs to be
completed in the embedded chip, the function of this computation module is to complete
the convolutional computation of an input feature map with a 3 × 3 convolution kernel
and pass the result through an activation function, thereby finally obtaining an output
feature map that can be used as an input for the next layer. PE here represents the
multiplication processing unit shown in Figure 6 or Figure 7, and ∑ represents a multiway
accumulator. Through the design of multibuffer and convolution windows, the multiplexing
of the input feature pixels can be realized, thereby forming an efficient pipeline computing
mode. Furthermore, this convolution computation module can serve as a universal module
for various CNNs. Changing the number of PEs in the convolution window and the
corresponding number of buffers can realize the convolution computation for different
convolution kernel sizes. Simultaneously, when on-chip hardware resources allow, multiple
convolution computation modules can be parallelized to achieve convolution calculations
for multichannel feature maps.
3. Experiments
In this section, we compare the proposed ternary quantization algorithm APTQ and
the general quantization algorithm APQ with traditional ternary quantization and other
bit width quantization algorithms in multisample and multiangle scenarios in order to
verify the effectiveness and universality of the quantization strategy based on the unfixed
quantization boundary thresholds proposed in this paper.

3.1. APTQ Quantization Performance Testing


3.1.1. Implementation Details
Several commonly used convolutional neural network models, namely, AlexNet [66],
VGG-16 [7], ResNet-18, ResNet-20, ResNet-56 [60], and GoogLeNet [61], were quantized
with APTQ and compared on small-sample classification data sets, including CIFAR10 [62],
CIFAR100 [62], and Mini-ImageNet [51].
For the baseline CNN models, pretraining of the initial CNN models for 300 epochs
in the data sets was completed, and the baseline models to be quantized for each CNN
were obtained. The baseline models of the two CNNs ResNet-20 and ResNet-56 with
respect to the CIFAR10 data set are given in Reference [60], which we tested directly. For
data augmentation, we padded each side of the CIFAR10 and CIFAR100 images with a row
or column of 0-value pixels and then performed random cropping and flipping. We completed
the images in the Mini-ImageNet data set to a size of 256 × 256 and then performed random
cropping and flipping.
quantization process followed the procedure shown in Algorithm 1. In the weight-grouping
process, since the ResNet-20 and ResNet-56 models have a small number of parameters, a
slight change will also have a large impact on their precision performance, so we divided
the filters in each layer of these two networks into five groups and quantized them in
a group-by-group manner. For the other four networks, due to their larger model size
and larger number of parameters, thus resulting in stronger resistance to interference, we
divided the filters in each layer into four groups and then quantized them group by group.
During the process of CNN retraining, 20 epoch updates were set in each retraining process
to fully recover the accuracy loss. The other important training hyperparameters were
basically consistent with those mentioned in the GSNQ algorithm [46], as shown in Table 3.

Table 3. Quantization retraining hyperparameters for various CNN models.

CNN Weight Decay Momentum Learning Rate Batch Size


AlexNet 0.0005 0.9 0.01 256
VGG-16 0.0005 0.9 0.01 128
ResNet-18 0.0005 0.9 0.01 128
ResNet-20 0.0001 0.9 0.1 256
ResNet-56 0.0001 0.9 0.1 128
GoogLeNet 0.0002 0.9 0.01 128
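As an illustration, the ResNet-20 row of Table 3 corresponds to a standard SGD setup of the following form (a sketch with a stand-in model; the real networks and data loaders are omitted):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)     # stand-in for the partially quantized CNN
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,                     # learning rate (Table 3, ResNet-20)
    momentum=0.9,               # momentum
    weight_decay=1e-4,          # weight decay
)
```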

All quantization performance testing experiments in this paper were conducted using
Pycharm Community Edition 2022.1.1 as the development environment with software
coding based on the PyTorch library [67], and they were performed on an Nvidia GeForce
RTX 3070Ti GPU (Santa Clara, CA, USA).

3.1.2. APTQ Quantization Performance Comparison


Quantization experiments were first conducted on multiple samples using the proposed
APTQ algorithm and the GSNQ algorithm from our previous work, across different CNN
models and data sets.
algorithm already has better quantization performance than most existing algorithms at
3-bit and above, but it performs poorly in the case of ternary quantization (2-bit), with
accuracy losses exceeding 1.5% in many cases. The APTQ algorithm’s ternary quantization
strategy based on unfixed boundary thresholds is designed to address this issue, and it
adopts a global quantization architecture similar to the GSNQ algorithm while changing the
quantization strategy. Therefore, comparing the APTQ algorithm with the GSNQ algorithm
is the most intuitive way to verify the effectiveness of the quantization strategy based
on unfixed quantization boundary thresholds. Tables 4–6 provide the comparison results
regarding the quantization accuracy between the two algorithms for various networks with
respect to the CIFAR10, CIFAR100, and Mini-ImageNet data sets, respectively, in order to
verify whether the APTQ algorithm can generally and effectively improve the CNN ternary
quantization performance. Among them, bold represents the result of the algorithm in
this paper.

Table 4. Ternary quantization results for the CIFAR10 data set.

Method    Top-1 Accuracy    Top-5 Accuracy    Decrease in Top-1/Top-5 Error
AlexNet: Baseline 82.96% 99.09%
GSNQ 80.95% 98.71% −2.01%/−0.38%
APTQ 82.25% 99.01% −0.71%/−0.08%
VGG-16: Baseline 88.74% 99.59%
GSNQ 87.14% 99.28% −1.60%/−0.31%
APTQ 88.18% 99.46% −0.56%/−0.13%
ResNet-18: Baseline 89.72% 99.69%
GSNQ 88.91% 99.40% −0.81%/−0.29%
APTQ 89.20% 99.60% −0.52%/−0.09%
ResNet-20: Baseline 91.60% 99.76%
GSNQ 90.91% 99.61% −0.69%/−0.15%
APTQ 91.21% 99.66% −0.39%/−0.10%
ResNet-56: Baseline 93.20% 99.80%
GSNQ 92.92% 99.69% −0.28%/−0.11%
APTQ 93.07% 99.74% −0.13%/−0.06%
GoogLeNet: Baseline 90.04% 99.91%
GSNQ 89.02% 99.66% −1.02%/−0.25%
APTQ 89.63% 99.75% −0.41%/−0.16%

Table 5. Ternary quantization results for the CIFAR100 data set.

Method    Top-1 Accuracy    Top-5 Accuracy    Decrease in Top-1/Top-5 Error
AlexNet: Baseline 70.11% 88.18%
GSNQ 67.12% 87.77% −2.99%/−0.41%
APTQ 68.99% 88.18% −1.12%/0.00%
VGG-16: Baseline 72.03% 91.25%
GSNQ 70.00% 90.85% −2.03%/−0.40%
APTQ 71.11% 91.10% −0.92%/−0.15%
ResNet-18: Baseline 74.16% 91.96%
GSNQ 73.17% 91.40% −0.99%/−0.56%
APTQ 73.79% 91.71% −0.37%/−0.25%
GoogLeNet: Baseline 76.68% 92.01%
GSNQ 75.52% 91.58% −1.16%/−0.43%
APTQ 75.98% 91.95% −0.70%/−0.06%

Based on the comparison results, it can be seen that, in the different data sets and CNN
models, the ternary quantization results of the GSNQ and APTQ algorithms both presented
a certain accuracy loss when compared to the baseline models. However, the accuracy loss
of the APTQ algorithm was smaller than that of the GSNQ algorithm in all cases, and the
APTQ algorithm maintained the loss of the top-1 accuracy within 1%. For example, ResNet-
56 in the CIFAR10 data set showed an astonishing accuracy loss of only 0.13% after APTQ
ternary quantization, thereby achieving almost zero accuracy loss quantization. These two
sets of experiments fully verify that the quantization strategy based on unfixed boundary
thresholds in APTQ plays a significant role in ternary quantization, thus presenting strong
universality.

Table 6. Ternary quantization results for the Mini-ImageNet data set.

Method    Top-1 Accuracy    Top-5 Accuracy    Decrease in Top-1/Top-5 Error
AlexNet: Baseline 61.21% 86.99%
GSNQ 59.03% 84.65% −2.18%/−0.51%
APTQ 60.49% 84.99% −0.72%/−0.20%
VGG-16: Baseline 75.09% 91.56%
GSNQ 73.51% 90.01% −1.58%/−1.55%
APTQ 74.66% 90.52% −0.43%/−1.04%
ResNet-18: Baseline 76.76% 92.16%
GSNQ 75.19% 91.06% −1.57%/−1.10%
APTQ 75.98% 92.00% −0.78%/−0.16%
GoogLeNet: Baseline 78.91% 93.10%
GSNQ 77.89% 92.61% −1.02%/−0.49%
APTQ 78.29% 92.96% −0.62%/−0.14%

In order to further verify the performance of APTQ ternary quantization, it was
compared with some other state-of-the-art ternary quantization algorithms in the literature,
including TWN [40], DSQ [44], LQ-Net [68], DoReFa-Net [69], PACT [39], ProxQuant [70],
APoT [59], and CSQ [71]. Among them, the APoT algorithm is also a power-of-two
quantization algorithm, while the other algorithms are conventional ternary algorithms.
The results are shown in Table 7.
The results demonstrate that the existing state-of-the-art quantization algorithms were
all basically unable to achieve ternary quantization without resulting in accuracy loss.
However, the performance of the proposed APTQ algorithm was better than that of most
other methods and was only slightly lower than some methods in rare cases, such as
the CSQ algorithm for ResNet-20. On the other hand, among the power-of-two ternary
quantization algorithms (APTQ, GSNQ, and APoT), APTQ presented the best quantization
performance. APTQ implements global quantization, while most other methods still retain
some weights that are not quantized or 8-bit quantized.
The above experiments were all direct quantitative tests of quantization accuracy,
which were conducted in order to intuitively analyze the software performance of APTQ.
We also calculated the sum of the L2 distances for all filters in the whole CNN model
between the ternary models quantized by some different algorithms and their original
baseline models in order to qualitatively analyze the performance of APTQ. The results
are shown in Table 8. It can be seen that, among the five methods of TWN, LQ-Net, APoT,
GSNQ, and APTQ, the L2 distance of the APTQ method was the shortest, which indicates
that the ternary CNN model developed with APTQ had the smallest gap with respect to
the baseline model and was the most similar in terms of space. This result proves that the
APTQ algorithm theoretically has the best performance.
In summary, a substantial amount of quantitative and qualitative comparative analysis
has demonstrated the superior quantization performance of the APTQ algorithm at the
software level compared to the majority of existing ternary quantization algorithms. The
quantization strategy based on unfixed boundary thresholds does indeed universally and
effectively reduce the precision loss caused by fixed boundary thresholds during ternary
quantization.

Table 7. Comparison with other ternary quantization algorithms on the CIFAR10 data set.

Method    Top-1 Accuracy    Decrease in Top-1 Error
VGG-16: Baseline 88.74%
TWN 86.19% −2.55%
DSQ 88.09% −0.65%
LQ-Net 88.00% −0.74%
APTQ 88.18% −0.56%
ResNet-18: Baseline 89.72%
TWN 87.11% −2.61%
LQ-Net 87.16% −2.56%
DSQ 89.25% −0.47%
APTQ 89.20% −0.52%
ResNet-20: Baseline 91.60%
DoReFa-Net 88.20% −3.40%
PACT 89.70% −1.90%
LQ-Net 90.20% −1.40%
ProxQuant 90.06% −1.54%
APoT 91.00% −0.60%
CSQ 91.22% −0.38%
APTQ 91.21% −0.39%
ResNet-56: Baseline 93.20%
PACT 92.50% −0.70%
APoT 92.90% −0.30%
APTQ 93.07% −0.13%

Table 8. The sum of ResNet-20 L2 distance before and after ternary quantization.

ResNet-20 TWN LQ-Net APoT GSNQ APTQ


L2 Distance 12.15 11.30 11.31 10.65 9.38
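For reference, the metric in Table 8 can be computed as in the following sketch (our own formulation of the described sum of filter-wise L2 distances):

```python
import torch

def total_l2_distance(baseline_model, quantized_model):
    """Sum of L2 distances between corresponding convolutional filters of
    a baseline CNN and its ternary-quantized counterpart (Table 8)."""
    total = 0.0
    for (_, w), (_, q) in zip(baseline_model.named_parameters(),
                              quantized_model.named_parameters()):
        if w.dim() == 4:                 # convolutional weights only
            total += torch.norm(w - q, p=2).item()
    return total
```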

3.2. APQ Quantization Performance Testing


The implementation details of the APQ algorithm are the same as those of the APTQ
algorithm. As 5-bit quantization and above basically achieved an accuracy equal to or
even higher than that of the CNN baseline model in previous studies [33,59,71,72], this section
mainly focuses on quantization comparison experiments of the APQ algorithm with other
state-of-the-art algorithms with respect to the CIFAR10 data set for 4-bit (15 quantization
values per layer) and 3-bit (seven quantization values per layer) scenarios. The results
are shown in Table 9, where the quantization accuracy values of the DoReFa-Net, PACT,
LQ-Net, APoT, and GSNQ algorithms were taken from their corresponding studies.

Table 9. Comparison of APQ with other quantization algorithms with respect to the CIFAR10 data set.

Method    Top-1 Accuracy (5-Bit)    (4-Bit)    (3-Bit)
ResNet-20 (baseline: 91.60%)
DoReFa-Net    —    90.5%    89.9%
PACT    —    91.7%    91.1%
LQ-Net    —    —    91.6%
APoT    —    92.3%    92.2%
GSNQ    —    92.42%    91.96%
APQ    92.42%    92.36%    92.16%
ResNet-56 (baseline: 93.20%)
APoT    —    94.0%    93.9%
GSNQ    —    94.0%    93.62%
APQ    93.99%    93.88%    93.67%

The results in the table indicate that the performance of APQ at 4-bit and 3-bit quanti-
zation was almost the same as that of the APoT and GSNQ algorithms, with a small increase
in accuracy, but it was higher than several other algorithms. Significantly, the 4-bit quan-
tization performance of APQ showed a small increase compared to its 3-bit quantization
performance, while the 5-bit quantization performance showed an even smaller increase
compared to 4-bit. Therefore, upon synthesizing the experimental results considering
APTQ ternary quantization and APQ universal quantization, it can be recognized that the
approach for quantization based on unfixed quantization boundary thresholds proposed in
this paper is more suitable for the quantization case of very low bit widths (e.g., ternary
quantization), and it can greatly optimize the quantization performance. However, when
considering higher quantization bit widths with more quantization values and stronger
weight expression ability, it has little impact on the quantization results, and the enhance-
ment effect is not obvious. However, although the 2-bit quantization performance of APTQ
is better than that of most existing algorithms, when compared with quantization with
higher bit widths, the performance degradation caused by less-reserved information at low
bit widths is still inevitable. Therefore, in practical applications, it is necessary to weigh the
quantitative performance according to the requirements of the corresponding scenario. If
the edge application scenario is sensitive to latency, it is necessary to sacrifice some CNN
accuracy performance and choose quantization with low bit width to reduce inference time.
Conversely, quantization ranging from 4-bit to 8-bit can be chosen to ensure CNN model
performance. For scenarios that require low latency and high precision, such as assisted
driving, 3-bit or 4-bit quantization can be considered.

3.3. Hardware Performance Evaluation


3.3.1. Implementation Details
After completing the quantization of APTQ and APQ on the GPU, this paper con-
ducted preliminary deployment of the quantized power-of-two models on an FPGA plat-
form. A comparison of the hardware resource utilization of the dedicated modules designed
in Figures 6 and 7 was performed to validate the benefit of the APTQ and APQ algorithms
for the edge deployment of CNNs. All performance experiments related to hardware
deployment used Vivado Design Suite—HLx Editions 2019.2 as the development environ-
ment, which were written in the Verilog hardware description language and compiled in
the programmable logic side (FPGA side) of a ZYNQ XC7Z035FFG676-2I chip (Xilinx, San
Jose, CA, USA). This chip contains 171,900 on-chip lookup tables (LUTs), 343,800 on-chip
flip-flops (FFs), and 900 on-chip DSPs.

3.3.2. Hardware Design Comparative Testing


The hardware advantage of the APTQ and APQ quantization algorithms is most
intuitively reflected in terms of model storage space. Table 10 shows the comparison of
storage space occupied by the CNN models before and after quantization.

Table 10. APTQ and APQ quantization model storage space comparison.

CNNs    Baseline    After APQ (3-Bit)    After APTQ (2-Bit)
VGG-16 114.4 Mb 10.7 Mb 7.2 Mb
ResNet-20 4.5 Mb 0.4 Mb 0.3 Mb
ResNet-56 14.2 Mb 1.3 Mb 0.9 Mb
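These reductions track the bit width almost proportionally, since the stored weight codes shrink from 32-bit floating-point values to h-bit codes; a one-line estimate (our own approximation, ignoring packing overhead and the per-layer exponents) reproduces the table:

```python
def quantized_size_mb(baseline_mb: float, bits: int) -> float:
    """Approximate post-quantization storage: 32-bit weights -> h-bit codes."""
    return baseline_mb * bits / 32

print(quantized_size_mb(114.4, 3))   # ~10.7 Mb for VGG-16 after 3-bit APQ
print(quantized_size_mb(114.4, 2))   # ~7.2 Mb for VGG-16 after 2-bit APTQ
```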

It can be seen that, after encoding the APTQ quantization model using the method
shown in Table 1, the size of the model deployed to the chip was significantly reduced.
Models that do not exceed 10 Mb can be easily stored in the on-chip memory of the
chip, thereby fundamentally avoiding the loss of computing efficiency and the extra energy
consumption caused by frequent access to external memory.

Based on the architecture described in Section 2.3, four versions of the module were built
by changing the PEs: shift-based PEs for the ternary power-of-two quantization designed
in this paper (Figure 6, based on APTQ); shift-based PEs for the 3-bit power-of-two
quantization designed in this paper (Figure 7, based on APQ); PEs using the on-chip DSPs
to implement the multiplication calculations [15]; and PEs implementing the multiplication
computation with lookup tables (LUTs), following the method in the Xilinx official
documentation [73]. The on-chip hardware resource occupancy of the four modules on the
FPGA platform was then compared, as shown in Table 11.

Table 11. The hardware resource occupancy of a single 3 × 3 convolution calculation module.

Modules Bit Width LUT FF DSP


Module 1: based on multiplication (implemented using on-chip DSP)    8-bit    428    392    9
Module 1: based on multiplication (implemented using on-chip DSP)    3-bit    250    268    9
Module 2: based on multiplication (implemented using on-chip LUT)    3-bit    402    226    0
Module 3: based on APQ    3-bit    263    237    0
Module 1: based on multiplication (implemented using on-chip DSP)    2-bit    158    191    9
Module 2: based on multiplication (implemented using on-chip LUT)    2-bit    262    158    0
Module 4: based on APTQ    2-bit    168    167    0

Module 1 in the table represents the most common design for convolutional calcula-
tion [15,74], where the on-chip DSP resource usage depends on the number of processing
elements (PEs) within the module. The scarcity of on-chip DSP resources hindered the
parallel processing capability of the CNNs in the edge chips. Module 2 utilizes LUTs
for multiplication through the Vivado 2019.2 compilation software, thereby relieving the
restrictions imposed by the on-chip DSP. However, as a trade-off, there was a significant
increase in the consumption of other on-chip resources. In contrast, Modules 3 and 4, which
were designed based on the power-of-two quantization characteristics of APTQ and APQ,
achieved zero DSP resource utilization. Moreover, compared to Modules 1 and 2, these
modules effectively reduced the total occupancy of all other on-chip hardware resources.
Module 4, in comparison to Module 3, demonstrates that the 2-bit (ternary) model can further
reduce resource usage by about one third relative to the 3-bit model, thereby offering an optimal
saving of computational resources in embedded chips.
In most cases, the number of input and output channels for the convolutional layers in
the convolutional neural networks ranged between 32 and 512 and was always maintained
as a multiple of 32. Therefore, in FPGA deployment, regardless of the method used to
accelerate convolutional neural networks, it is necessary to construct an architecture for
a convolution computation accelerator with at least 32 parallel convolution acceleration
modules, which serve as the fundamental processing units for convolutional layer com-
putations [75,76]. In this paper, the designed convolution acceleration module was also
subjected to 32-way parallel processing. Four acceleration modules were compared to vali-
date the performance of the proposed quantization algorithm and dedicated computation
modules in practical applications. Table 12 presents the hardware resource consumption of
the different convolution acceleration modules for a 3 × 3 convolution kernel in the case of
32-way parallel processing.

Table 12. The hardware resource occupancy of a 32-way 3 × 3 convolution calculation module.

Modules Bit Width LUT FF DSP


Module 1: based on multiplication (implemented using on-chip DSP)    8-bit    14,548/8.46%    11,226/3.27%    288/32.00%
Module 1: based on multiplication (implemented using on-chip DSP)    3-bit    8675/5.05%    9332/2.71%    288/32.00%
Module 2: based on multiplication (implemented using on-chip LUT)    3-bit    14,028/8.16%    7901/2.30%    0/0.00%
Module 3: based on APQ    3-bit    8510/4.95%    8270/2.41%    0/0.00%
Module 1: based on multiplication (implemented using on-chip DSP)    2-bit    5243/3.05%    6479/1.88%    288/32.00%
Module 2: based on multiplication (implemented using on-chip LUT)    2-bit    8475/4.93%    5370/1.56%    0/0.00%
Module 4: based on APTQ    2-bit    5862/3.41%    5799/1.69%    0/0.00%

After processing in a 32-way parallel manner, the utilization situation of the different
hardware resources became more pronounced. It is evident that the on-chip DSP resources
were heavily consumed, while other hardware resources experienced less consumption.
If, on top of these 32 parallel processing units, further parallel stacking is applied, the
on-chip DSP resources on the XC7Z035 chip may become insufficient, thereby becoming
a limiting factor for network deployment. This necessitates the use of FPGA chips with
more on-chip resources, thereby leading to an inevitable increase in costs. Similar to the
comparison results with a single convolution computation module, the two proposed
power-of-two quantization algorithms in this paper effectively address the limitations imposed by on-chip
DSP resources on CNN edge deployment performance. Additionally, the APTQ algorithm
maximally assists CNNs in increasing their parallel processing capability (theoretically
having the potential to achieve nearly 2.5 times the computational parallelism compared
to the current most commonly used 8-bit model deployment), thereby enhancing the
computational efficiency of edge deployment. The computational advantage, coupled
with the significant reduction in storage volume from low-bit-width quantization and
the high-precision advantages of the quantized network, enables the APTQ algorithm to
balance the performance of quantized convolutional neural networks in terms of both
software and hardware. As a result, it can be applied to a wide range of edge computing
scenarios, thereby demonstrating strong industrial application value.

4. Limitation Discussion
The APTQ and APQ algorithms proposed in this paper effectively balance the software
quantization performance of CNNs and hardware deployment efficiency. Experiments on
various datasets and CNN models have validated their robust performance. However, to
ensure the precision of low-bit quantization, this paper employed a fine-grained global
retraining architecture during the quantization process. While this architecture minimizes
precision loss through fine-grained group quantization and retraining, it introduces signifi-
cant time costs and imposes higher demands on GPU computational power. When dealing
with extremely large datasets such as ImageNet, a GPU with substantial computational
power is required to complete the entire quantization process, thereby potentially also
incurring significant time costs. The iterative network retraining process constitutes a major
contributor to the extensive time costs. Without increasing the GPU computational power,
it may be possible to reduce this limitation in training time costs through methods like
learning rate rewinding [77], thereby making the proposed algorithm applicable to a wider
range of scenarios.

5. Conclusions
This paper proposed an innovative power-of-two quantization method based on
unfixed quantization boundary thresholds, called APTQ, in order to address the storage
and computational pressure of CNNs on hardware that exists in edge applications through
quantization at extremely low bit widths. APTQ employs a fine-grained quantization
architecture with three-step iterations of filter grouping by layer, group quantization, and
network retraining. In the quantization process, the optimal power-of-two quantization
scale factor is calculated adaptively, according to the weight distribution of each layer, and
each filter is quantized into the form of combining two power-of-two binary subfilters such
that there is no need to set fixed boundary thresholds. In this way, a power-of-two ternary
network model is adaptively obtained. It is worth noting that the APTQ algorithm not only
performs well in terms of its hardware-friendly characteristics, but it also achieved excellent
quantization performance for CNN models commonly used in industrial applications. The
ternary quantization experiments on a variety of CNN models fully validate that the
accuracy loss of APTQ can be maintained within 0.5% in most cases, thereby possessing
almost state-of-the-art ternary quantization performance. By extending the quantization
strategy of APTQ to any bit width, the proposed universal APQ quantization algorithm
also has good quantization performance, thereby providing a choice for a wider range
of application scenarios. In the next phase of our work, we intend to first optimize the
performance of APQ in the context of high-bit-width quantization, and then we intend to
extend the APTQ and APQ algorithms to other object detection models (e.g., YOLO and R-
CNN), explore their adaptability to most CNN models, and further optimize and promote
the standardization of the algorithm. Finally, we hope to fully deploy the quantified low-bit
CNN model on an FPGA to complete the entire process of CNN edge deployment.
In conclusion, the APTQ and APQ algorithms can provide excellent quantization
accuracy performance, less bandwidth pressure, lower storage and computational pressure,
and higher computing efficiency of convolutional neural networks during deployment in
embedded hardware. The two methods have good reference value for the field of industrial
edge computing, and they can provide feasible and effective solutions for the efficient
deployment of convolutional neural networks in edge devices.

Author Contributions: Conceptualization, Q.L., Z.T. and X.S.; methodology, X.S.; software, X.S., M.L.
and M.Z.; validation, X.S., Q.L. and C.K.; investigation, X.S., Q.L. and M.Z.; data curation, X.S. and
C.K.; writing—original draft preparation, X.S. and Q.L.; writing—review and editing, X.S., M.L. and
H.Y.; project administration, Z.T.; funding acquisition, Z.T. All authors have read and agreed to the
published version of the manuscript.
Funding: This research was funded by the Key Program Project of Science and Technology Innovation
of the Chinese Academy of Sciences (No. KGFZD-135-20-03-02) and the Innovation Fund Program of
the Chinese Academy of Sciences (No. CXJJ-23S016).
Data Availability Statement: No new data were created or analyzed in this study. Data sharing is
not applicable to this article.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Yan, S.-R.; Pirooznia, S.; Heidari, A.; Navimipour, N.J.; Unal, M. Implementation of a Product-Recommender System in an
IoT-Based Smart Shopping Using Fuzzy Logic and Apriori Algorithm. In IEEE Transactions on Engineering Management; IEEE:
Toulouse, France, 2022.
2. Garcia, A.J.; Aouto, A.; Lee, J.-M.; Kim, D.-S. CNN-32DC: An Improved Radar-Based Drone Recognition System Based on
Convolutional Neural Network. ICT Express 2022, 8, 606–610. [CrossRef]
3. Saha, D.; De, S. Practical Self-Driving Cars: Survey of the State-of-the-Art. Preprints 2022. [CrossRef]
4. Lyu, Y.; Bai, L.; Huang, X. ChipNet: Real-Time LiDAR Processing for Drivable Region Segmentation on an FPGA. IEEE Trans.
Circuits Syst. I Regul. Pap. 2019, 66, 1769–1779. [CrossRef]
5. Shi, W.; Cao, J.; Zhang, Q.; Li, Y.; Xu, L. Edge Computing: Vision and Challenges. Internet Things J. IEEE 2016, 3, 637–646.
[CrossRef]
6. McEnroe, P.; Wang, S.; Liyanage, M. A Survey on the Convergence of Edge Computing and AI for UAVs: Opportunities and
Challenges. IEEE Internet Things J. 2022, 9, 15435–15459. [CrossRef]
7. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556.

8. Thakur, P.S.; Sheorey, T.; Ojha, A. VGG-ICNN: A Lightweight CNN Model for Crop Disease Identification. Multimed. Tools Appl.
2023, 82, 497–520. [CrossRef]
9. Wang, H.; Chen, X.; Zhang, T.; Xu, Z.; Li, J. CCTNet: Coupled CNN and Transformer Network for Crop Segmentation of Remote
Sensing Images. Remote Sens. 2022, 14, 1956. [CrossRef]
10. Liu, X.; Yang, J.; Zou, C.; Chen, Q.; Cai, C. Collaborative Edge Computing With FPGA-Based CNN Accelerators for Energy-
Efficient and Time-Aware Face Tracking System. IEEE Trans. Comput. Soc. Syst. 2021, 9, 252–266. [CrossRef]
11. Saranya, M.; Archana, N.; Reshma, J.; Sangeetha, S.; Varalakshmi, M. Object Detection and Lane Changing for Self Driving Car
Using Cnn. In Proceedings of the 2022 International Conference on Communication, Computing and Internet of Things (IC3IoT),
Chennai, India, 10–11 March 2022; pp. 1–7.
12. Rashid, N.; Demirel, B.U.; Al Faruque, M.A. AHAR: Adaptive CNN for Energy-Efficient Human Activity Recognition in
Low-Power Edge Devices. IEEE Internet Things J. 2022, 9, 13041–13051. [CrossRef]
13. Yu, Z.; Lu, Y.; An, Q.; Chen, C.; Li, Y.; Wang, Y. Real-Time Multiple Gesture Recognition: Application of a Lightweight
Individualized 1D CNN Model to an Edge Computing System. IEEE Trans. Neural Syst. Rehabil. Eng. 2022, 30, 990–998. [CrossRef]
[PubMed]
14. Choquette, J.; Gandhi, W.; Giroux, O.; Stam, N.; Krashinsky, R. NVIDIA A100 Tensor Core GPU: Performance and Innovation.
IEEE Micro 2021, 41, 29–35. [CrossRef]
15. Zhang, C.; Li, P.; Sun, G.; Guan, Y.; Cong, J. Optimizing FPGA-Based Accelerator Design for Deep Convolutional Neural Networks.
In Proceedings of the 2015 ACM/SIGDA International Symposium, Monterey, CA, USA, 22–24 February 2015.
16. Xilinx. 7 Series FPGAs Configuration User Guide (UG470); Xilinx: San Jose, CA, USA, 2018.
17. Huang, W.; Wu, H.; Chen, Q.; Luo, C.; Zeng, S.; Li, T.; Huang, Y. FPGA-Based High-Throughput CNN Hardware Accelerator
With High Computing Resource Utilization Ratio. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 4069–4083. [CrossRef] [PubMed]
18. Wong, D.L.T.; Li, Y.; John, D.; Ho, W.K.; Heng, C.-H. An Energy Efficient ECG Ventricular Ectopic Beat Classifier Using Binarized
CNN for Edge AI Devices. IEEE Trans. Biomed. Circuits Syst. 2022, 16, 222–232. [CrossRef] [PubMed]
19. Yan, P.; Xiang, Z. Acceleration and Optimization of Artificial Intelligence CNN Image Recognition Based on FPGA. In Proceedings
of the 2022 IEEE 6th Information Technology and Mechatronics Engineering Conference (ITOEC), Chongqing, China, 4–6 March
2022; Volume 6, pp. 1946–1950.
20. Pan, H.; Sun, W. Nonlinear Output Feedback Finite-Time Control for Vehicle Active Suspension Systems. IEEE Trans. Ind. Inform.
2019, 15, 2073–2082. [CrossRef]
21. Kim, H.; Choi, K.K. A Reconfigurable CNN-Based Accelerator Design for Fast and Energy-Efficient Object Detection System on Mobile
FPGA; IEEE Access: Toulouse, France, 2023.
22. Sze, V.; Chen, Y.-H.; Yang, T.-J.; Emer, J.S. Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proc. IEEE 2017,
105, 2295–2329. [CrossRef]
23. Rizqyawan, M.I.; Munandar, A.; Amri, M.F.; Utoro, R.K.; Pratondo, A. Quantized Convolutional Neural Network toward Real-
Time Arrhythmia Detection in Edge Device. In Proceedings of the 2020 International conference on radar, antenna, microwave,
electronics, and telecommunications (ICRAMET), Tangerang, Indonesia, 18–20 November 2020; pp. 234–239.
24. Capotondi, A.; Rusci, M.; Fariselli, M.; Benini, L. CMix-NN: Mixed Low-Precision CNN Library for Memory-Constrained Edge
Devices. IEEE Trans. Circuits Syst. II Express Briefs 2020, 67, 871–875. [CrossRef]
25. Zhang, X.; Zhou, X.; Lin, M.; Sun, R. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices; IEEE: Salt
Lake City, UT, USA, 2018; pp. 6848–6856.
26. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient
Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861.
27. Han, S.; Pool, J.; Tran, J.; Dally, W.J. Learning Both Weights and Connections for Efficient Neural Networks; MIT Press: Cambridge, MA,
USA, 2015.
28. Gao, S.; Huang, F.; Cai, W.; Huang, H. Network Pruning via Performance Maximization. In Proceedings of the 2021 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), Kuala Lumpur, Malaysia, 20–25 June 2021.
29. Jaderberg, M.; Vedaldi, A.; Zisserman, A. Speeding up Convolutional Neural Networks with Low Rank Expansions. arXiv 2014,
arXiv:1405.3866.
30. Dettmers, T. 8-Bit Approximations for Parallelism in Deep Learning. arXiv 2015, arXiv:1511.04561.
31. Courbariaux, M.; Bengio, Y. BinaryNet: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1.
arXiv 2016, arXiv:1602.02830.
32. Rastegari, M.; Ordonez, V.; Redmon, J.; Farhadi, A. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural
Networks. In European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2016.
33. Zhou, A.; Yao, A.; Guo, Y.; Xu, L.; Chen, Y. Incremental Network Quantization: Towards Lossless CNNs with Low-Precision
Weights. arXiv 2017, arXiv:1702.03044.
34. Yamamoto, K. IEEE COMP SOC Learnable Companding Quantization for Accurate Low-Bit Neural Networks; IEEE: Piscataway, NJ,
USA, 2021; pp. 5027–5036.
35. Krishnamoorthi, R. Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper. arXiv 2018, arXiv:1806.08342.

36. Jacob, B.; Kligys, S.; Chen, B.; Zhu, M.; Tang, M.; Howard, A.; Adam, H.; Kalenichenko, D. Quantization and Training of Neural
Networks for Efficient Integer-Arithmetic-Only Inference. In Proceedings of the 2018 IEEE conference on computer vision and
pattern recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 2704–2713.
37. Kuzmin, A.; Baalen, M.V.; Ren, Y.; Nagel, M.; Peters, J.; Blankevoort, T. FP8 Quantization: The Power of the Exponent. Adv. Neural
Inf. Process. Syst. 2022, 35, 14651–14662.
38. Zhu, F.; Gong, R.; Yu, F.; Liu, X.; Wang, Y.; Li, Z.; Yang, X.; Yan, J. Towards Unified Int8 Training for Convolutional Neural
Network. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA,
USA, 13–19 June 2020; pp. 1969–1979.
39. Choi, J.; Wang, Z.; Venkataramani, S.; Chuang, P.I.-J.; Srinivasan, V.; Gopalakrishnan, K. Pact: Parameterized Clipping Activation
for Quantized Neural Networks. arXiv 2018, arXiv:1805.06085.
40. Li, F.; Zhang, B.; Liu, B. Ternary Weight Networks. arXiv 2016, arXiv:1605.04711.
41. Zhu, C.; Han, S.; Mao, H.; Dally, W.J. Trained Ternary Quantization. arXiv 2016, arXiv:1612.01064.
42. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE 1998, 86,
2278–2324. [CrossRef]
43. Li, Y.; Dong, X.; Zhang, S.Q.; Bai, H.; Chen, Y.; Wang, W. Rtn: Reparameterized Ternary Network. Proc. AAAI Conf. Artif. Intell.
2020, 34, 4780–4787. [CrossRef]
44. Gong, R.; Liu, X.; Jiang, S.; Li, T.; Yan, J. Differentiable Soft Quantization: Bridging Full-Precision and Low-Bit Neural Networks. In
Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2
November 2019.
45. Chin, H.-H.; Tsay, R.-S.; Wu, H.-I. An Adaptive High-Performance Quantization Approach for Resource-Constrained CNN
Inference. In Proceedings of the 2022 IEEE 4th International Conference on Artificial Intelligence Circuits and Systems (AICAS),
Incheon, Republic of Korea, 13–15 June 2022; pp. 336–339.
46. Sui, X.; Lv, Q.; Bai, Y.; Zhu, B.; Zhi, L.; Yang, Y.; Tan, Z. A Hardware-Friendly Low-Bit Power-of-Two Quantization Method for
CNNs and Its FPGA Implementation. Sensors 2022, 22, 6618. [CrossRef]
47. Choukroun, Y.; Kravchik, E.; Kisilev, P. Low-Bit Quantization of Neural Networks for Efficient Inference. In Proceedings of the
2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019.
48. NVIDIA. 8-Bit Inference with TensorRT; NVIDIA: Santa Clara, CA, USA, 2017.
49. Liu, C.; Ding, W.; Chen, P.; Zhuang, B.; Wang, Y.; Zhao, Y.; Zhang, B.; Han, Y. RB-Net: Training Highly Accurate and Efficient
Binary Neural Networks with Reshaped Point-Wise Convolution and Balanced Activation. IEEE Trans. Circuits Syst. Video Technol.
2022, 32, 6414–6424. [CrossRef]
50. Hong, W.; Chen, T.; Lu, M.; Pu, S.; Ma, Z. Efficient Neural Image Decoding via Fixed-Point Inference. IEEE Trans. Circuits Syst.
Video Technol. 2020, 31, 3618–3630. [CrossRef]
51. Baskin, C.; Liss, N.; Schwartz, E.; Zheltonozhskii, E.; Giryes, R.; Bronstein, A.M.; Mendelson, A. Uniq: Uniform Noise Injection
for Non-Uniform Quantization of Neural Networks. ACM Trans. Comput. Syst. (TOCS) 2021, 37, 1–15. [CrossRef]
52. Han, S.; Mao, H.; Dally, W.J. Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and
Huffman Coding. In Proceedings of the 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto
Rico, 2–4 May 2016.
53. Nagel, M.; Amjad, R.A.; Van Baalen, M.; Louizos, C.; Blankevoort, T. Up or down? Adaptive Rounding for Post-Training
Quantization. Proc. Int. Conf. Mach. Learn. PMLR 2020, 119, 7197–7206.
54. Kumar, A.; Sharma, A.; Bharti, V.; Singh, A.K.; Singh, S.K.; Saxena, S. MobiHisNet: A Lightweight CNN in Mobile Edge
Computing for Histopathological Image Classification. IEEE Internet Things J. 2021, 8, 17778–17789. [CrossRef]
55. Meng, J.; Venkataramanaiah, S.K.; Zhou, C.; Hansen, P.; Whatmough, P.; Seo, J. Fixyfpga: Efficient Fpga Accelerator for Deep
Neural Networks with High Element-Wise Sparsity and without External Memory Access. In Proceedings of the 2021 31st
International Conference on Field-Programmable Logic and Applications (FPL), Dresden, Germany, 30 August–3 September 2021;
pp. 9–16.
56. Li, Z.; Ni, B.; Li, T.; Yang, X.; Zhang, W.; Gao, W. Residual Quantization for Low Bit-Width Neural Networks. IEEE Trans.
Multimed. 2021, 25, 214–227. [CrossRef]
57. Venieris, S.I.; Christos-Savvas, B. fpgaConvNet: Mapping Regular and Irregular Convolutional Neural Networks on FPGAs.
IEEE Trans. Neural Netw. Learn. Syst. 2018, 30, 326–342. [CrossRef]
58. Zhu, C.; Huang, K.; Yang, S.; Zhu, Z.; Zhang, H.; Shen, H. An Efficient Hardware Accelerator for Structured Sparse Convolutional
Neural Networks on FPGAs. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2020, 28, 1953–1965. [CrossRef]
59. Li, Y.; Dong, X.; Wang, W. Additive Powers-of-Two Quantization: An Efficient Non-Uniform Discretization for Neural Networks.
In Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26–30 April 2020.
60. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
61. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with
Convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA,
USA, 7–12 June 2015; pp. 1–9.

62. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images. In Handbook of Systemic Autoimmune Diseases;
University of Toronto: Toronto, ON, Canada, 2009.
63. Vinyals, O.; Blundell, C.; Lillicrap, T.; Wierstra, D. Matching Networks for One Shot Learning. In Proceedings of the 30th
Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, 5–10 December 2016; Volume 29.
64. Vanholder, H. Efficient Inference with Tensorrt. In Proceedings of the GPU Technology Conference, San Jose, CA, USA, 4–7 April
2016; Volume 1.
65. Nagel, M.; van Baalen, M.; Blankevoort, T.; Welling, M. Data-Free Quantization through Weight Equalization and Bias Correction.
In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November
2019; pp. 1325–1334.
66. Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet Classification with Deep Convolutional Neural Networks. Adv. Neural Inf.
Process. Syst. 2012, 60, 84–90. [CrossRef]
67. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch:
An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the Conference and Workshop on Neural
Information Processing Systems 2019, Vancouver, BC, Canada, 8–14 December 2019; Volume 32.
68. Zhang, D.; Yang, J.; Ye, D.; Hua, G. Lq-Nets: Learned Quantization for Highly Accurate and Compact Deep Neural Networks. In
Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 365–382.
69. Zhou, S.; Ni, Z.; Zhou, X.; Wen, H.; Wu, Y.; Zou, Y. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with
Low Bitwidth Gradients. arXiv 2016, arXiv:1606.06160.
70. Bai, Y.; Wang, Y.-X.; Liberty, E. Proxquant: Quantized Neural Networks via Proximal Operators. arXiv 2018, arXiv:1810.00861.
71. Asim, F.; Park, J.; Azamat, A.; Lee, J. CSQ: Centered Symmetric Quantization for Extremely Low Bit Neural Networks. In
Proceedings of the International Conference on Learning Representations 2022, New Orleans, LA, USA, 19–20 June 2022.
72. Kulkarni, U.; Hosamani, A.S.; Masur, A.S.; Hegde, S.; Vernekar, G.R.; Chandana, K.S. A Survey on Quantization Methods for
Optimization of Deep Neural Networks. In Proceedings of the 2022 International Conference on Automation, Computing and
Renewable Systems (ICACRS), Pudukkottai, India, 13–15 December 2022; pp. 827–834.
73. Xilinx. Vivado Design Suite User Guide: Synthesis. White Pap. 2021, 5, 30.
74. Li, J.; Un, K.-F.; Yu, W.-H.; Mak, P.-I.; Martins, R.P. An FPGA-Based Energy-Efficient Reconfigurable Convolutional Neural
Network Accelerator for Object Recognition Applications. IEEE Trans. Circuits Syst. II Express Briefs 2021, 68, 3143–3147. [CrossRef]
75. Yuan, T.; Liu, W.; Han, J.; Lombardi, F. High Performance CNN Accelerators Based on Hardware and Algorithm Co-Optimization.
IEEE Trans. Circuits Syst. I Regul. Pap. 2020, 68, 250–263. [CrossRef]
76. Bouguezzi, S.; Fredj, H.B.; Belabed, T.; Valderrama, C.; Faiedh, H.; Souani, C. An Efficient FPGA-Based Convolutional Neural
Network for Classification: Ad-MobileNet. Electronics 2021, 10, 2272. [CrossRef]
77. Renda, A.; Frankle, J.; Carbin, M. Comparing Fine-Tuning and Rewinding in Neural Network Pruning. In Proceedings of the
International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
