Design of Power Efficient Posit Multiplier

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 67, NO. 5, MAY 2020 861

Hao Zhang, Member, IEEE, and Seok-Bum Ko, Senior Member, IEEE

Abstract—The posit number system has been used as an alternative to the IEEE floating-point
number system in many applications, especially the recently popular deep learning. Its
non-uniform number distribution fits well with the data distribution of deep learning and thus
can speed up the training process. Among all the related arithmetic operations, multiplication
is one of the most frequently used. However, because the field widths of a posit number are
flexible, the hardware multiplier is usually designed for the maximum possible mantissa
bit-width. As the mantissa bit-width is not always the maximum value, such a multiplier design
leads to high power consumption, especially when the mantissa bit-width is small. In this
brief, a power efficient posit multiplier architecture is proposed. The mantissa multiplier is
still designed for the maximum possible bit-width; however, the whole multiplier is divided
into multiple smaller multipliers, and only the required small multipliers are enabled at
run-time. Those smaller multipliers are controlled by the regime bit-width, which can be used
to determine the mantissa bit-width. This design technique is applied to 8-bit, 16-bit, and
32-bit posit formats in this brief, and an average of 16% power reduction can be achieved with
negligible area and timing overhead.

Index Terms—Posit number system, posit multiplier, computer arithmetic, low-power arithmetic
circuit.

I. INTRODUCTION
The posit number system was first proposed in [1]. It is designed to be used as an alternative
to the conventional IEEE floating-point formats [2] in many fields of application [3], [4],
[5]. It has a larger dynamic range than the IEEE floating-point format. As a result, a small
bit-width posit format can meet the numeric requirements of applications while bringing memory
and computation benefits. In addition, its non-uniform data distribution fits well with the
data distribution of some applications, such as deep learning. The 8-bit or 16-bit posit
formats are widely used in deep learning systems. The 32-bit posit format is used in some
scientific computation applications to take the place of the standard 64-bit floating-point
format.

Unlike the conventional floating-point format, the bit-width of each component in posit format,
as shown in Fig. 1, is dynamic (except the 1-bit sign). The sign and the regime always appear
in the format. The exponent and the mantissa are only included when the sign and the regime do
not occupy all the bit positions. Therefore, the mantissa (including the implicit bit)
bit-width can be from 1-bit to (nb − es)-bit, where nb is the total bit-width of the posit
format and es is the exponent bit-width. In previous posit arithmetic unit designs, such as the
posit adder and multiplier generator [6], the posit adder generator [7], and the posit
multiply-accumulate (MAC) unit generator [8], the mantissa multiplier is always designed with
(nb − es)-bit. As the actual mantissa bit-width is not always the maximum value, the mantissa
does not always require a (nb − es)-bit multiplier. When a (nb − es)-bit multiplier is used for
a small bit-width mantissa, power or energy is wasted.

Fig. 1. Each component of a posit number.

In this brief, a power efficient posit multiplier architecture is proposed to improve the power
efficiency. Each component of a posit multiplier is divided into smaller elements and, at
run-time, only the required elements are enabled. Whether to enable or disable an element is
controlled by the regime shifting value generated in the posit pre-processing module. The power
distribution of a posit multiplier is presented in Fig. 2. As shown in Fig. 2, the mantissa
multiplier consumes the most power. Therefore, special focus is put on the mantissa multiplier
in this brief. The proposed design technique is applied to the commonly used 8-bit, 16-bit, and
32-bit posit multipliers, and an average of 16% power reduction is achieved.

Fig. 2. Power consumption distribution of a posit multiplier.

The rest of this brief is organized as follows: Section II presents the background information
of the posit format. The proposed power efficient posit multiplier architecture is presented in
Section III. Section IV presents the synthesis results of the proposed multipliers and their
comparison with standard posit multiplier designs. Finally, Section V concludes this brief.

Manuscript received February 2, 2020; accepted March 7, 2020. Date of publication March 13,
2020; date of current version May 6, 2020. This work was supported in part by the Natural
Sciences and Engineering Research Council of Canada and in part by the Research and Development
Program of MOTIE/KEIT (Developing Processor-Memory-Storage Integrated Architecture for Low
Power, High Performance Big Data Servers) under Grant 10077609. This brief was recommended by
Associate Editor W. Zhao. (Corresponding author: Seok-Bum Ko.)
The authors are with the Department of Electrical and Computer Engineering, University of
Saskatchewan, Saskatoon, SK S7N 5A9, Canada (e-mail: hao.zhang@usask.ca; seokbum.ko@usask.ca).
Digital Object Identifier 10.1109/TCSII.2020.2980531
1549-7747 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires
IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY SILCHAR. Downloaded on October 07,2021 at 05:42:30 UTC from IEEE Xplore. Restrictions apply.

II. POSIT NUMBER SYSTEM
The general format of a posit number is shown in Fig. 1. A posit number Posit(nb, es) is
defined with the total bit-width nb and the exponent bit-width es. It has four components: sign
(s), regime (rg), exponent (exp), and mantissa (frac). The component bit-width is not constant.
The regime bit-width varies for different values. The exponent and the mantissa occupy the
remaining bit positions, and they are not included in the format when the regime occupies all
bit positions. The value of a number represented in posit format is:

$$\mathit{value} = (-1)^{s} \times \mathit{useed}^{rg} \times 2^{\mathit{exp}} \times (1 + \mathit{frac}) \tag{1}$$

where $\mathit{useed} = 2^{2^{es}}$.

In hardware arithmetic unit design, the extraction of components is not as straightforward as
for the floating-point format. The circuit shown in Fig. 3 (except the grey module) is commonly
used to extract each component of a posit number [8], [9]. The number is complemented first if
it is negative. Then the regime part is extracted. The regime part is a series of ones (zeros)
followed by a single zero (one) bit. Therefore, a leading zero detector (LZD) and a leading one
detector (LOD) are used to count the number of leading bits. If leading ones are detected, rg
equals count − 1. Otherwise, rg is −count, and a complementer, COMP, is needed to convert the
positive count to a negative rg value. In addition, the regime bit-width shift_rg, which is
count + 1, is also generated so that the regime can be removed by a shifting operation and the
exponent and mantissa can be obtained. The regime bit-width ranges from 2-bit to (nb − 1)-bit.
In order to accommodate all cases, the final extracted mantissa is (nb − es)-bit. In a posit
multiplier or multiply-accumulate unit design, a (nb − es)-bit mantissa multiplier will be
used.

Fig. 3. Posit component extraction in hardware arithmetic unit.

III. THE PROPOSED POSIT MULTIPLIER
The parameterized datapath of the proposed posit multiplier is shown in Fig. 4. The critical
path contains posit component extraction (which is detailed in Fig. 3), mantissa multiplier,
final adder and normalization, posit component packing, and rounding. In addition, the sign and
exponent are also processed separately. The mantissa multiplier is a (nb − es)-bit radix-4
modified Booth multiplier [10]. The bit-widths of other components are also shown in Fig. 4.

Fig. 4. Datapath of the proposed posit multiplier.

As discussed in Section I, in posit format, the bit-width of the mantissa varies for different
values. Therefore, the mantissa does not always require a (nb − es)-bit multiplier. Although
the unused bits of the (nb − es)-bit mantissa are filled with zeros, those zero bits will be
inverted to ones when the partial product of the Booth multiplier [10] is negative. Therefore,
the circuits for these bit positions, including the partial product accumulation and the final
adder, are still toggling. This leads to a waste of power and energy.

In the proposed design, two changes are made to avoid the unnecessary signal toggling in order
to reduce the power consumption. The first change is the generation of the control signal for
the mantissa multiplier so that only the necessary part of the multiplier is enabled. The
second change is the decomposition of the mantissa multiplier, where each small portion is
controlled by the control signal. These two changes are discussed in detail below. A multiplier
for Posit(16, 1) is used as the example.

A. The Decomposition of the Multiplier
For a Posit(16, 1) multiplier, a 15-bit mantissa multiplier is used. When using the radix-4
Booth multiplication algorithm [10], the partial product array is as shown in Fig. 5(a). There
is a total of 8 partial products and each partial product is 16-bit.

In the proposed design, the 15-bit multiplier is divided into 4 groups: the most significant
3 bits form one group, and the remaining 12 bits are divided into three 4-bit groups.
Correspondingly, the 8 partial products are divided into 4 groups, RH_1, RH_2, RH_3, and RH_4,
as shown in Fig. 5. If the multiplier is no more than 3-bit, then only the two partial products
in RH_4 are generated while all others are set to zeros. If the multiplier is more than 3-bit
but no more than 7-bit, then partial products in RH_3 and RH_4 are generated. If the multiplier
is more than 7-bit but no more than 11-bit, then partial products in RH_2, RH_3, and RH_4 are
generated. Finally, if the multiplier is more than 11-bit, then all the partial products are
generated.

Similarly, the 15-bit multiplicand is also divided into 4 groups. Each partial product is
correspondingly divided into 4 groups, RV_1, RV_2, RV_3, and RV_4, as shown in Fig. 5. If the
multiplicand is no more than 3-bit, then only RV_4 is generated while all others are set to
zeros. If the multiplicand is more than 3-bit but no more than 7-bit, then RV_3 and RV_4 are
generated. If the multiplicand is more than 7-bit but no more than 11-bit, then RV_2, RV_3, and
RV_4 are generated. Finally, if the multiplicand is more than 11-bit, then all bits in the
partial product are generated.

Fig. 5. The proposed mantissa multiplier decomposed into multiple regions.

Fig. 5 shows three examples of the partial product array in the proposed design. Fig. 5(a)
shows the partial product array when both multiplicand and multiplier are 15-bit (or more than
11-bit). In this case, all partial products are generated and all bits in a partial product are
generated. Fig. 5(b) shows the partial product array when the multiplicand is 15-bit and the
multiplier is 7-bit. In this case, only the last four partial products are generated. As the
multiplicand is 15-bit, all bits in those partial products are generated. Finally, in
Fig. 5(c), the multiplicand is further reduced to 7-bit. In this case, only the most
significant 8 bits of each partial product are generated.

B. The Control Signal
To realize the proposed multiplier described in Section III-A, a control signal is required to
select among the four horizontal regions and four vertical regions.

As discussed in Section II, when extracting the regime component, the bit-width of the regime
(shift_rg in Fig. 3) is generated. This signal can be used to calculate the actual mantissa
bit-width (including the implicit bit):

$$\mathit{mant\_bit} = nb - es - \mathit{shift\_rg} \tag{2}$$

By using this relationship, the control signal for multiplicand and multiplier can be
generated. Since four regions are used for multiplicand and multiplier, respectively, the
control signal ctl for each can be 2-bit. When ctl is 00, only RH_4 (or RV_4) is used. When ctl
is 01, RH_3 and RH_4 (or RV_3 and RV_4) are used. When ctl is 10, three regions are used, and
all regions are used when ctl is 11. The truth table to generate the control signal and the
corresponding regions to be enabled are shown in Table I. The regime is at least 2-bit, so the
cases shift_rg = 0000 and shift_rg = 0001 are don't-cares. When shift_rg = 1111, there is no
space for exponent and mantissa in the format; however, there is still the 1-bit implicit bit.

TABLE I. TRUTH TABLE TO GENERATE CONTROL SIGNAL

Based on Table I, the control signal can be generated as:

$$ctl[1] = \overline{\mathit{shift\_rg}[3]} \quad \text{and} \quad ctl[0] = \overline{\mathit{shift\_rg}[2]} \tag{3}$$

Here the overbar denotes the logical complement: for Posit(16, 1), equation (2) gives
mant_bit = 15 − shift_rg, so the two most significant bits of mant_bit are the complements of
shift_rg[3] and shift_rg[2]. These are the operations performed in the grey module in Fig. 3.
In the proposed design, separate control signals for multiplicand and multiplier are generated
and used in the mantissa multiplier.

Corresponding to the 4 horizontal regions and 4 vertical regions, a total of 16 Booth-2 partial
product generation (PPG) modules are implemented, as shown in Fig. 6. The use of the control
signal to enable PPGs so that they generate the partial product format shown in Fig. 5 is also
presented in Fig. 6. A PPG is enabled only when both the corresponding horizontal control and
vertical control are enabled. In addition to these control signals, extra control to manage the
partial product extension bits (S bits in Fig. 5) is also required.

Fig. 6. The use of ctl signal as the enable signal for partial product generation (ppg) in
mantissa multiplier (Posit(16,1) example). ppg(x,y) refers to the ppg used to generate partial
products in region RV_x and RH_y.

TABLE II. COMPARISON OF THE PROPOSED POSIT MULTIPLIER WITH STANDARD POSIT MULTIPLIER

C. For Other Posit Bit-Width
The discussion above used Posit(16, 1) as an example. For other bit-widths, the proposed method
can be adjusted.

For 8-bit posit, we can still use 4-bit as the unit. However, in this case, both the
multiplicand and the multiplier will have only two regions. Therefore, a 1-bit control signal
for multiplicand or multiplier is enough.

For 32-bit posit, more options are available. On one hand, we can use 4-bit as the unit and
generate 3-bit control signals for multiplicand and multiplier. On the other hand, we can use
8-bit as the unit, and then both multiplicand and multiplier can still be divided into 4
regions. From our preliminary experiments, both options can achieve similar power reduction.
However, the first option will bring a larger area than the second option. Therefore, for
32-bit posit, we choose the second option to perform the evaluation.
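
The chain from regime width to enabled partial product generators described in Sections II and III can be sketched in a few lines of Python. This is an illustrative software model under stated assumptions, not the authors' Verilog. The function names (`regime_fields`, `mant_bits`, `ctl_bits`, `enabled_ppgs`, `region_plan`) are hypothetical, only non-negative bit patterns are handled, and a regime run that fills the whole word is clamped to the (nb − 1)-bit maximum quoted in Section II. The control bits are taken as the complement of shift_rg[3:2], which is the reading equation (2) implies for Posit(16, 1), since mant_bit = 15 − shift_rg is the bitwise complement of a 4-bit shift_rg. The last helper reproduces the region counts quoted for other bit-widths; Posit(32, 2) is used in the checks only as an example exponent size.

```python
from math import ceil, log2

NB, ES = 16, 1            # Posit(16, 1), the running example in the text


def regime_fields(bits: int) -> tuple[int, int]:
    """Return (rg, shift_rg) for a non-negative NB-bit posit pattern.

    Counts the run of identical bits after the sign (the job of the
    LZD/LOD), then applies rg = count - 1 for a run of ones and
    rg = -count for a run of zeros; shift_rg = count + 1.
    """
    lead = (bits >> (NB - 2)) & 1          # first regime bit
    count = 0
    for i in range(NB - 2, -1, -1):        # scan the bits below the sign
        if (bits >> i) & 1 != lead:
            break
        count += 1
    rg = count - 1 if lead else -count
    # Assumption: a run with no terminating bit is clamped to NB - 1,
    # the largest regime width quoted in the text.
    return rg, min(count + 1, NB - 1)


def mant_bits(shift_rg: int) -> int:
    """Equation (2): mantissa width, implicit bit included."""
    return NB - ES - shift_rg


def ctl_bits(shift_rg: int) -> int:
    """2-bit region select, assumed to be NOT shift_rg[3:2]."""
    return (~shift_rg >> 2) & 0b11


def enabled_ppgs(ctl_v: int, ctl_h: int) -> list[tuple[int, int]]:
    """List the ppg(x, y) that are on. Region 4 is always used, and each
    increment of ctl switches on the next lower-numbered region."""
    return [(x, y)
            for x in range(1, 5) if x >= 4 - ctl_v
            for y in range(1, 5) if y >= 4 - ctl_h]


def region_plan(nb: int, es: int, unit: int) -> tuple[int, int]:
    """Regions per operand and control bits for a given unit size."""
    regions = ceil((nb - es) / unit)
    return regions, max(1, ceil(log2(regions)))


if __name__ == "__main__":
    a, b = 0x6800, 0x0040          # regime "110" and regime "000000001"
    (_, sa), (_, sb) = regime_fields(a), regime_fields(b)
    grid = enabled_ppgs(ctl_bits(sa), ctl_bits(sb))
    print(f"shift_rg = {sa}, {sb}; {len(grid)} of 16 PPGs enabled")
```

For the two sample patterns the regime widths are 3 and 9, so the multiplicand uses all four vertical regions while the multiplier uses only two horizontal regions, and 8 of the 16 PPG modules are active for that product.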

During the decomposition, a smaller granularity, such as 2-bit, is possible. In the proposed
architecture, the choice of 4-bit granularity is mainly due to the Booth multiplier [10]. In a
Booth-2 multiplier, 3 bits of the multiplier are used to generate one partial product.
Therefore, if a smaller granularity is used, more resource overhead will be incurred.

IV. RESULTS AND ANALYSIS
The 8-bit, 16-bit, and 32-bit models of the proposed architecture are implemented with Verilog
HDL. Simulations with extensive testing vectors are performed to verify the functionality of
the proposed design. The testing vectors are generated with the help of SoftPosit [11]. These
verified models are then synthesized with the STM-90nm library with normal case parameters
(1.00 V and 25 °C) using Synopsys Design Compiler. The generated netlist is simulated with the
testing vectors again to obtain the signal activity file. The signal activity file and the
synthesized netlist are then used by Synopsys PrimeTime PX for an accurate power estimation.

The delay, area, and power consumption of the 8-bit, 16-bit, and 32-bit posit multiplier
designs are shown in Table II. The comparison of the proposed designs with normal posit
multiplier designs [8], [9] is also shown in Table II. Compared to the normal designs, the
proposed designs have slightly larger delay due to the extra control signal applied to the
mantissa multiplier. This also leads to slightly larger area.

In terms of power consumption, the proposed designs for Posit(8,0) and Posit(8,1) can achieve
8% and 3% power reduction, respectively. The power improvement is not significant because the
8-bit design is relatively small and the overhead of the extra control unit may hide the
benefit of the proposed power reduction method. For 16-bit and 32-bit designs, the power
advantage of the proposed design is more obvious. For these formats, more than 20% power
reduction can be achieved. The proposed method divides the mantissa multiplier into small
portions and only the required portions are enabled at run-time. Other portions are disabled to
avoid signal toggling. In addition, because the unused portions are always set to zeros, the
toggling in later stage addition and rounding is also avoided. All of these lead to the power
reduction of the proposed designs.

The comparison of the proposed posit multipliers with the IEEE 754-2008 [2] standard multiplier
is also shown in Table II. With the same bit-width, the IEEE standard multiplier can achieve
better timing, area, and power. This is because the IEEE 754-2008 floating-point format has
constant bit-width for each component and thus the component extraction and packing modules
used in the posit design can be eliminated. However, as mentioned in Section I, posit has a
much larger dynamic range than the IEEE floating-point format. Therefore, in practical
applications, a small posit format can already meet the numeric requirement of the application.
Moreover, the non-uniform distribution is closer to the data distribution in some applications.
As a result, the posit format is used more efficiently in some applications. For deep learning,
8-bit posit can achieve the performance of 32-bit IEEE floating-point with much better hardware
efficiency [1], [12].

The area and power of each module in both normal and proposed posit multipliers are also
compared to give more insight into the proposed architecture. As shown in Fig. 4, there are a
total of 6 major modules in the proposed posit multiplier. Here, the component extraction
module and the sign and exponent process module are combined into the input process module, as
shown in Table III. Also, the output process in Table III is composed of the component packing
and rounding modules. As shown in Table III, the proposed posit multiplier has 4% larger area
compared to the normal posit multiplier. This area overhead comes from the extra control used
in the mantissa multiplier. Although an extra control signal must be generated in the input
process, the logic that generates it is simple, as shown in equation (3), and its area overhead
is negligible.

TABLE III. AREA AND POWER COMPARISON OF EACH MODULE FOR NORMAL AND PROPOSED POSIT(16,1)
MULTIPLIER

The power reduction mainly comes from the mantissa multiplier module and the final addition
module, as shown in Table III. Due to the control signal, only the required portion of the
mantissa multiplier is enabled. This leads to a 28% power reduction in the mantissa multiplier
of the proposed Posit(16, 1) design. Because the unused portion of the mantissa multiplier is
disabled, the least significant part of the product will become zeros. The signal toggling of
the least significant part of the final adder can also be avoided. This leads to, on average, a
30% power reduction in the final adder. However, as shown in Fig. 2, the adder part does not
contribute much to the total power consumption. Therefore, the power reduction of the whole
design is still 22%.

The power consumption of designs with larger exponent bit-width is presented in Fig. 7. When
the exponent bit-width gets larger, the proposed architecture can still achieve power
reduction, but the gap between the normal design and the proposed design is reduced. This is
mainly due to the reduction in mantissa bit-width. However, due to the existence of the regime
bits, posit formats with small exponent bit-width can provide enough dynamic range for many
applications. The evaluated formats shown in Table II are widely used in many applications [3],
[4], [5]. The proposed power reduction method can be effectively applied in those applications
to achieve power efficient posit computation.

Fig. 7. Comparison of power consumption under various exponent width.

V. CONCLUSION
In this brief, a power efficient posit multiplier architecture is proposed. Motivated by the
fact that the whole mantissa multiplier in a posit multiplier is not always fully required, the
proposed design divides the mantissa multiplier into small portions. At run-time, only the
required portions are enabled, which avoids unnecessary signal toggling and reduces the power
consumption. Whether to enable a multiplier portion is controlled by the regime bit-width
generated during component extraction. The proposed method is evaluated with 8-bit, 16-bit, and
32-bit posit multipliers, and an average of 16% power reduction can be achieved. The proposed
method is suitable for use in any low power posit arithmetic unit design.

In the future, more power reduction opportunities in the posit multiplier architecture will be
explored. In addition, the investigation of power efficient posit arithmetic unit design will
be extended to the posit adder and the posit multiply-accumulate unit.

REFERENCES
[1] J. L. Gustafson and I. Yonemoto, “Beating floating point at its own game: Posit
arithmetic,” Supercomput. Front. Innovat. Int. J., vol. 4, no. 2, pp. 71–86, Jun. 2017.
[2] IEEE Standard for Floating-Point Arithmetic, IEEE Standard 754-2008, Aug. 23, 2008,
pp. 1–70.
[3] Z. Carmichael, S. H. F. Langroudi, C. Khazanov, J. Lillie, J. L. Gustafson, and
D. Kudithipudi, “Deep positron: A deep neural network using the posit number system,” CoRR,
vol. abs/1812.01762, pp. 1–6, Dec. 2018.
[4] J. Johnson, “Rethinking floating point for deep learning,” CoRR, vol. abs/1811.01721,
pp. 1–8, Nov. 2018.
[5] M. Klöwer, P. D. Düben, and T. N. Palmer, “Posits as an alternative to floats for weather
and climate models,” in Proc. Conf. Next Gener. Arithmetic, Mar. 2019, pp. 1–8.
[6] R. Chaurasiya et al., “Parameterized posit arithmetic hardware generator,” in Proc. IEEE
36th Int. Conf. Comput. Design (ICCD), Orlando, FL, USA, Oct. 2018, pp. 334–341.
[7] M. K. Jaiswal and H.-K. So, “Architecture generator for type-3 unum posit
adder/subtractor,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Florence, Italy, May 2018,
pp. 1–5.
[8] H. Zhang, J. He, and S.-B. Ko, “Efficient posit multiply-accumulate unit generator for deep
learning applications,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Sapporo, Japan,
May 2019, pp. 1–5.
[9] A. Podobas and S. Matsuoka, “Hardware implementation of POSITs and their application in
FPGAs,” in Proc. IEEE Int. Parallel Distrib. Process. Symp. Workshops (IPDPSW), Vancouver, BC,
Canada, May 2018, pp. 138–145.
[10] A. D. Booth, “A signed binary multiplication technique,” Quart. J. Mech. Appl. Math.,
vol. 4, no. 2, pp. 236–240, 1951.
[11] SoftPosit-Python. Accessed: Oct. 2018. [Online]. Available:
https://posithub.org/docs/PositTutorial_Part1.html
[12] Z. Carmichael, H. F. Langroudi, C. Khazanov, J. Lillie, J. L. Gustafson, and
D. Kudithipudi, “Performance-efficiency trade-off of low-precision numerical formats in deep
neural networks,” in Proc. Conf. Next Gener. Arithmetic, Mar. 2019, pp. 1–9.
