Low Latency and Low Error Floating-Point Sine/Cosine Function Based TCORDIC Algorithm
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL. 64, NO. 4, APRIL 2017
Abstract: CORDIC algorithm is suitable to implement sine/cosine function, but the large number of iterations leads to great delay and overhead. Moreover, due to the finite bit-width of operands and the finite number of iterations, the relative error of floating-point sine or cosine becomes very large when the input angle is close to 0 or $\pi/2$, respectively. To overcome these shortcomings, TCORDIC algorithm, which combines low latency CORDIC and Taylor algorithm, is presented. After analyzing the latency of traditional CORDIC, low latency CORDIC is proposed, which adopts the techniques of sign prediction, compressive iterations, and parallel iterations. Besides, the calculating boundary (N), which is used for determining whether Taylor algorithm is selected or not in TCORDIC algorithm, is evaluated to achieve a trade-off between area and delay. Truncated multipliers are used to reduce the area further. Finally, using TCORDIC algorithm, pipelined and iterative structures are implemented for IEEE-754 double precision floating-point sine/cosine with the input $Z \in [0, \pi/2]$. Under typical condition (1 V, 25 °C), our designs are synthesized with a 40 nm standard cell library. For the pipelined structure, the frequency is up to 1.70 GHz and the area is 194049.64 μm². Frequency decreases to 1.45 GHz for the iterative structure, but the area requires only 110590.81 μm². TCORDIC is efficient in controlling relative error, and achieves the accuracy within one ulp (unit in the last place) for floating-point sine/cosine function.

Index Terms: CORDIC, floating-point sine/cosine, low latency, Taylor.

Manuscript received August 8, 2016; revised October 22, 2016; accepted November 13, 2016. Date of publication December 15, 2016; date of current version March 27, 2017. This work is supported by National Natural Science Foundation of China, NO: 61402499. This paper was recommended by Associate Editor Y. Pu.
The authors are with the School of Computer, National University of Defense Technology, Changsha 410073, China (e-mail: 2278125123@qq.com; yuanwulei@nudt.edu.cn; pyx@nudt.edu.cn; heting7410@qq.com).
Digital Object Identifier 10.1109/TCSI.2016.2631588
1549-8328 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

In the fields of system control, simulation [1], high performance computation [2], and scientific computation [3], a floating-point sine/cosine function with low error is essential. In the Intel 8087, Motorola and Cyrix coprocessors, and CPUs such as the 486DX and Pentium [4], sine/cosine is computed in hardware. FPGA and ASIC provide suitable platforms for implementation of sine/cosine.

Algorithms for sine/cosine function on hardware are divided into three categories: look-up table, polynomial approximation and digital iteration [4]. Look-up table is applicable to low precision [5], because hardware cost grows exponentially with precision. The convergence speed of polynomial approximation is fast for a small variable, but decreases rapidly as the variable increases. Thus, the required number of multiplications and additions turns out to be larger. Typically, polynomial is combined with look-up table [3]. The variable is compressed to a smaller range by look-up table, and polynomials calculate final results quickly. However, this combination is still expensive, because multipliers, adders and tables must all be implemented for high precision sine/cosine computation.

CORDIC algorithm is a kind of digital iteration method to calculate a variety of transcendental functions, including sine/cosine function in circular coordinate and rotation mode. Calculating sine/cosine function based on CORDIC algorithm has been employed [6] for a variety of high-speed and real-time applications, such as adaptive filters [7] in digital signal processing (DSP), discrete sinusoidal transforms [8] in signal processing, generating sinusoidal waveforms [9] in communication, robot control [10], and geometric computations [11] in graphics.

However, a large number of iterations must be executed in order, which is a bottleneck for optimization of CORDIC. The first data dependence is the determination of the rotation direction by the iteration results of the Z path. The second is that the $(i+1)$-th iteration can be started only after the completion of the $i$-th iteration, because the micro-rotations for any iteration are performed on the intermediate vectors computed by the previous iterations. These two data dependences restrict optimization of the critical path. In addition, each iteration is completed by a carry look-ahead adder, and the delay due to carry propagation is proportional to the bit-width of operands. Therefore, when the precision requirements are higher, the bit-width of operands and the carry propagation delay increase.

The precision of CORDIC converges linearly with the number of iterations. Error is inevitable because of the finite bit-width of operands and the finite number of iterations. Moreover, the relative error is too large to meet the precision requirements of the IEEE-754 floating-point standard.

TCORDIC algorithm, combining low latency CORDIC with Taylor algorithm, is presented to efficiently control relative error for floating-point sine/cosine function. Moreover, low latency CORDIC is proposed by combining sign prediction, compressive iterations, and parallel iteration techniques. Pipelined and iterative structures are implemented, and N for the calculating boundary, which is used for determining whether Taylor algorithm is selected or not in TCORDIC algorithm, is evaluated to achieve the balance between area
ZHU et al.: LOW LATENCY AND LOW ERROR FLOATING-POINT SINE/COSINE FUNCTION BASED TCORDIC ALGORITHM 893
and delay. In addition, truncated multipliers further reduce the area. The following are the main advantages of our designs:
1) The accuracy of floating-point sine/cosine function is guaranteed within one ulp (unit in the last place). By combining Taylor and CORDIC, absolute and relative error can be controlled effectively with the input $Z \in [0, \pi/2]$, to achieve the precision requirements of the IEEE-754 floating-point standard.
2) The proposed low latency CORDIC combines the schemes of sign prediction, compressive iterations, and parallel iterations, which significantly reduces area and delay.
3) By selecting Taylor appropriately, the pipelined structure obtains low latency with maximum possible reduction in area, and the iterative structure makes full use of low area cost with maximum possible reduction in delay.
4) By optimizing the bit-width of the truncated multiplier for meeting the precision requirements, the area can be further reduced.

II. BACKGROUND AND RELATED WORK

A. Traditional CORDIC Algorithm

Sine/cosine function is calculated in rotation mode and circular coordinate. The following are the basic iterative formulas for such calculation:

$$\begin{aligned} X_{i+1} &= k_i \left(X_i - \sigma_i 2^{-i} Y_i\right) \\ Y_{i+1} &= k_i \left(Y_i + \sigma_i 2^{-i} X_i\right) \\ Z_{i+1} &= Z_i - \sigma_i \alpha_i \end{aligned} \tag{1}$$

where $\sigma_i \in \{-1, 1\}$ refers to $\mathrm{sign}(Z_i)$, which determines the rotation direction of the vector, and $\alpha_i = \tan^{-1}(2^{-i})$ is the micro-rotation angle. $k_i = 1/\sqrt{1 + 2^{-2i}}$, and the scale factor after $n$ iterations is $K_n = \prod_{i=1}^{n} k_i$. The initial angle $Z_0$ is the total rotation angle, and the iterations make $Z_i$ approach 0. The vector corresponding to $Z_0$ can be obtained after $n$ iterations:

$$\begin{aligned} X_n &= \frac{1}{K_n}\left(X_0 \cos(Z_0) - Y_0 \sin(Z_0)\right) \\ Y_n &= \frac{1}{K_n}\left(Y_0 \cos(Z_0) + X_0 \sin(Z_0)\right) \end{aligned} \tag{2}$$

With $X_0 = K_n$ and $Y_0 = 0$, the iteration results $X_n$ and $Y_n$ are equivalent to $\cos(Z_0)$ and $\sin(Z_0)$ for $Z_0$ in the range $Z_0 \in [0, \pi/2]$.

B. Related Work

Since Volder proposed CORDIC algorithm to calculate trigonometric functions in 1959 [12], [13], many variants [14], [15] have been presented to reduce delay. Reducing the number of iterations [16], [17] and reducing the delay of each iteration are the two main prerequisites for achieving low latency. A carry-free adder is a straightforward choice to reduce the delay of each iteration. Parallel CORDIC [18], [19] speeds up the iteration period by CSA (Carry Save Adder), and multi-operand CSA tree architectures are adopted to replace the last half of the iterations. However, both the delay and the area for CSA trees are large, and sign prediction for all iterations requires much area. Redundant CORDIC [16], [20], [21] uses a redundant representation and the corresponding adder to achieve a fast carry-free computation. There are two types of redundant numbers: carry-save and signed-digit [20]. Redundant CORDIC can significantly reduce the iteration period, but both the scale factor and the number of iterations are variable. Besides, sign prediction for redundant numbers increases implementation complexity. High-radix CORDIC [16], [21], [22], [23] needs fewer iterations than radix-2 CORDIC does. But, with a variable scale factor, both the look-up table for scale factor calculation and the compensation lead to a great delay and area. The drawback inherent to the scaling-free CORDIC [24], [25] lies in its small convergence range; extending it by either the domain folding technique or repetitions of the first elementary micro-rotation leads to great cost and delay.

Sang [26] points out that angle approximation and round-off are the two main sources of error in CORDIC algorithm. By choosing an appropriate bit-width of operands and number of iterations, the absolute error of fixed-point CORDIC can be controlled and measured by MSE (mean-square-error), but relative error remains ignored. Hu [27] finds that when the input vector is close to 0 in vector mode, the relative error caused by round-off is at its maximum and is large enough for rotation directions to be misjudged. The partial normalization scheme proposed by Kota [28] and the prescaling technique adopted by E. Antelo [29] can decrease the relative error. However, researchers [30] analyzed only absolute and relative errors of fixed-point CORDIC, but made no mention of error analysis and control for floating-point CORDIC.

Anis [31] compares CORDIC with Taylor algorithm for calculating sine/cosine function in terms of structure and precision in detail, but his focus is on the contrast rather than the combination. Daniel [32] combines CORDIC with Taylor algorithm for calculation of the floating-point exponential function to improve precision, without elaborating the principle or generalizing it to sine/cosine function. Maher [33] uses one CORDIC to replace two Read Only Memories to compute sine and cosine of the angle given by the upper address bits, with Taylor computing around that angle, which aims to improve only frequency resolution for a Direct Digital Frequency Synthesizer.

III. TCORDIC ALGORITHM

TCORDIC algorithm, combining low latency CORDIC and Taylor algorithm, is proposed in this section to improve the accuracy of floating-point sine/cosine computation of CORDIC when the input angle is close to 0 or $\pi/2$, and to guarantee the accuracy within one ulp.

A. Low Latency CORDIC Algorithm

1) Delay Analysis for Traditional CORDIC: The delay for traditional CORDIC is determined by the number of iterations and the delay of each iteration. In the pipelined structure, the X/Y path is critical and $Delay = n \times (T_{nCLA} + T_{reg})$, where $T_{nCLA}$ and $T_{reg}$ refer to the delay for an n-bit Carry Look-ahead Adder and a register, respectively. In the iterative structure, the worst path is determined by the maximum delay in the X, Y and Z paths. As shown in Fig. 1, $Delay = \max\{n \times (T_{nCLA} + T_{bshift(n, \log_2 n)} + T_{reg}),\; n \times (T_{nCLA} + T_{LUT} + T_{reg})\}$.
894 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMSI: REGULAR PAPERS, VOL. 64, NO. 4, APRIL 2017
n
j
j =0 b j 2 , b j {0, 1}. If Z j = b0 , b1 , . . . , b j 1 b j , . . . , bn
2 i(1)..i(2t +1)
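As a point of reference for the rotation-mode iteration (1) and the choice $X_0 = K_n$, $Y_0 = 0$ in Section II-A, the following is a minimal floating-point software sketch in Python. It is not the hardware design of this paper: the function name `cordic_sin_cos` is hypothetical, the iteration index is assumed to start at i = 0 so that the whole range [0, π/2] converges, and the scale factor is folded into $X_0$ instead of being applied as $k_i$ in every step.

```python
import math


def cordic_sin_cos(z, n=64):
    """Rotation-mode CORDIC in circular coordinates (float software model).

    Returns (sin(z), cos(z)) for z in [0, pi/2].
    Assumption: iterations are indexed from i = 0 (not from the text).
    """
    # Micro-rotation angles alpha_i = atan(2^-i).
    alphas = [math.atan(2.0 ** -i) for i in range(n)]

    # Scale factor K_n = prod_i 1 / sqrt(1 + 2^(-2i)).
    k = 1.0
    for i in range(n):
        k /= math.sqrt(1.0 + 2.0 ** (-2 * i))

    # X0 = K_n, Y0 = 0 so that the final vector is (cos z, sin z).
    x, y = k, 0.0
    for i in range(n):
        # Rotation direction sigma_i = sign(Z_i).
        sigma = 1.0 if z >= 0.0 else -1.0
        shift = 2.0 ** -i
        # Both updates use the values from the previous iteration.
        x, y = x - sigma * shift * y, y + sigma * shift * x
        z -= sigma * alphas[i]
    return y, x


if __name__ == "__main__":
    s, c = cordic_sin_cos(0.5)
    print(s, math.sin(0.5))
    print(c, math.cos(0.5))
```

Each loop pass needs the sign of the updated `z` and the previous `x`, `y`, which corresponds to the two data dependences described in the Introduction.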
Fig. 4. TCORDIC algorithm.

where $E_{offset} - E_Z = N$,

$$L + 1 + N + \log_2 n_{min} \le n_{min}. \tag{13}$$

Similarly, when $E_{offset} - E_Z > N$, to guarantee accurate L bits of mantissa, $c_{min}$ can be known according to (12) where $E_{offset} - E_Z = N$:

$$\frac{M_Z^{2c_{min}+1}\, 2^{(2c_{min}+1)(-N)}}{(2c_{min}+1)!} < 2^{-N-L}. \tag{14}$$

IV. IMPLEMENTATIONS OF TCORDIC

In this section, IEEE-754 double precision floating-point is taken as an example to illustrate the implementations of TCORDIC algorithm.

To guarantee an accurate mantissa, where L = 52 bits for double precision floating-point computation, the minimum number of iterations in CORDIC (which is also the bit-width of operands), $n_{min}$, and the minimum number of expansion items in Taylor, $c_{min}$, both depend on the calculating boundary (N), as given by (13) and (14). Therefore, the calculating boundary (N) can be optimized to achieve a trade-off between area and delay. The optimal values of the calculating boundary (N) are 13 and 5 in the pipelined structure and the iterative structure, respectively.

A. Iterative Structure

1) Optimization of the Calculating Boundary (N): The iterative structure needs only one truncated fixed-point multiplier. During the CORDIC iterations, the multiplier is multiplexed for Taylor. Calculation of the Taylor expansion adopts the Horner scheme, which involves the fewest multiplications:

$$\begin{aligned} \sin(Z) &\approx Z - \frac{Z^3}{3!} + \sum_{i=3}^{c} (-1)^{i+1} \frac{Z^{2i-1}}{(2i-1)!} \\ &= Z\left(1 - Z^2\left(\frac{1}{3!} - \cdots Z^2\left(\frac{1}{(2c-3)!} - \frac{Z^2}{(2c-1)!}\right)\cdots\right)\right). \end{aligned} \tag{15}$$

In CORDIC, a 2-stage compressive iteration occupies one clock cycle. In a multiplication, the first 2-stage compression accounts for one clock cycle, and the last 2-stage compression and the addition with CLA account for one clock cycle. In Taylor, shift and addition account for one clock cycle in total. The required number of cycles with the truncated multiplier multiplexing is $h$ for the Taylor expansion, and $a$ for completing the compressive iterations in CORDIC. Therefore, the time constraint is $h \le a$, and the truncated multiplier should be multiplexed as much as possible. When the number of iterations in CORDIC is $n_{min}$, the number of cycles to compress is

$$a = n_{min}/4. \tag{16}$$

With $c_{min}$ Taylor expansion items, the truncated multiplier is multiplexed $c_{min} + 1$ times, and there are $c_{min} - 1$ additions and shifts. Each multiplication takes two cycles and each shift-and-add takes one, so the total number of cycles for the Taylor expansion is

$$h = 3(c_{min} - 1) + 4. \tag{17}$$

With L = 52 bits to be guaranteed, (13), (14), (16), and (17) are used in computing. As shown in Fig. 5, the time constraint $h \le a$ can be satisfied with $N \ge 5$. When N continues to increase, the number of iterations required to guarantee precision increases. Therefore, N = 5 for the calculating boundary is the optimal value in the iterative structure.

2) Parameters and Process Illustrations: The implementation of the overall iterative structure is shown in Fig. 6(A). According to the above analysis, the key parameters for the
implementation of the optimized iterative structure are as follows:
• Guaranteed precision L = 52 bits.
• Calculating boundary N = 5. Sine is computed by Taylor when $E_{offset} - E_Z > 5$ (i.e., $Z \in [0, 2^{-5}]$), otherwise it is computed by CORDIC. Cosine is calculated by Taylor when $E_{offset} - E_{\pi/2 - Z} > 5$ (i.e., $Z \in [\pi/2 - 2^{-5}, \pi/2]$), otherwise it is calculated by CORDIC.
• The number of iterations, which is also the bit-width of operands in CORDIC, is $n_{min} = 64$.
• The number of Taylor expansion items is $c_{min} = 5$.

In the CORDIC path, the first 32 iterations are completed by compressive iterations, and the last 32 iterations are completed by parallel iterations, which multiplex a 53×53-bit truncated multiplier twice. In addition, the sign prediction scheme consists of three levels, predicting directions for the 1st to 4th, 5th to 12th, and 13th to 32nd iterations.

In the Taylor path, the 53×53-bit truncated multiplier is multiplexed 6 times to calculate the first 5 expansion items.

3) Preprocessing: The preprocessing module determines whether Taylor is selected according to the input $Z$, whose original floating-point form is transformed into the fixed-point form $Z_0$. As shown in Fig. 6(B), $Z$, which is an IEEE-754 double precision floating-point number, includes exponent $E_Z$ and mantissa $M_Z$. If $E_{offset} - E_Z$ is greater than 5, Taylor is selected to calculate $\sin(Z)$. If $E_{offset} - E_{\pi/2 - Z_0}$ is greater than 5, Taylor is selected to calculate $\sin(\pi/2 - Z_0)$, which is equal to $\cos(Z)$. To calculate $(\pi/2 - Z_0)$, the adder needs 52 + 64 bits so that precision is guaranteed even in the worst case where the 52 high bits of $(\pi/2 - Z_0)$ are all 0. For the Taylor calculation, $(\pi/2 - Z_0)$ is normalized. Whether or not Taylor is selected, the CORDIC calculation always runs to calculate sine and cosine.

4) Compression and Prediction: The compression and prediction module completes rotation direction prediction and compressive calculation in the Z path. To guarantee accurate prediction, correction iterations must be carried out; so, CSA and CLA are mixed in the Z path. CLAs are used at the positions of correction iterations for summations, and CSAs are used between correction iterations for compression. The series of micro-rotation angles for the iterations is predefined; so, segmented look-up tables are adopted. Therefore, a number of iterations can be replaced with an addition and a table look-up. Besides, sign prediction makes compressive calculation with CSA available in the Z path. In the first 32 iterations, correction iterations have to be added at positions i = 1, 4, 12, and hence the prediction module is composed of 3 submodules.
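The parameter choices and the Taylor path of the iterative structure can be checked with a short Python sketch. It is an illustration under stated assumptions, not the authors' design flow: all function names are hypothetical, (14) is evaluated in the worst case with the mantissa $M_Z$ bounded by 2, $a = n_{min}/4$ is used without rounding, and $E_{offset} - E_Z$ is taken as the negated unbiased exponent of the input.

```python
import math

L_BITS = 52  # guaranteed mantissa bits for double precision


def n_min(N, L=L_BITS):
    """Smallest n with L + 1 + N + log2(n) <= n, following (13)."""
    n = 1
    while L + 1 + N + math.log2(n) > n:
        n += 1
    return n


def c_min(N, L=L_BITS):
    """Smallest c satisfying (14); assumes the worst case M_Z -> 2."""
    c = 1
    while True:
        k = 2 * c + 1
        term = 2.0 ** k * 2.0 ** (-k * N) / math.factorial(k)
        if term < 2.0 ** (-N - L):
            return c
        c += 1


def cordic_cycles(n):
    """a = n_min / 4, following (16) (no rounding: an assumption)."""
    return n / 4


def taylor_cycles(c):
    """h = 3 (c_min - 1) + 4, following (17)."""
    return 3 * (c - 1) + 4


def meets_time_constraint(N):
    """Time constraint h <= a for calculating boundary N."""
    return taylor_cycles(c_min(N)) <= cordic_cycles(n_min(N))


def sin_path(z, N=5):
    """Select 'taylor' when E_offset - E_Z > N, otherwise 'cordic'."""
    if z == 0.0:
        return "taylor"
    _, e = math.frexp(z)   # z = m * 2^e with 0.5 <= m < 1
    unbiased = e - 1       # z = M * 2^unbiased with 1 <= M < 2
    return "taylor" if -unbiased > N else "cordic"


def taylor_sin_horner(z, c=5):
    """c-term Taylor series of sin(z) in the nested form of (15)."""
    z2 = z * z
    acc = 1.0 / math.factorial(2 * c - 1)
    for i in range(c - 1, 0, -1):
        acc = 1.0 / math.factorial(2 * i - 1) - z2 * acc
    return z * acc


if __name__ == "__main__":
    print(n_min(5), c_min(5), taylor_cycles(c_min(5)), cordic_cycles(n_min(5)))
    print(sin_path(0.02), taylor_sin_horner(0.02), math.sin(0.02))
```

Under these assumptions the sketch gives $n_{min} = 64$, $c_{min} = 5$ and $h = a = 16$ for N = 5, in line with the parameters listed above, and N = 5 is the smallest boundary for which $h \le a$ holds.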
TABLE I
SYNTHESIZED RESULTS WITH 40 nm STANDARD CELL LIBRARY
V. EXPERIMENTAL RESULTS

A. Synthesis Results

Our designs are synthesized with a 40 nm standard cell library under typical condition (1 V, 25 °C). There is a clock cycle constraint of 700 ps for the iterative structure and 600 ps for the pipelined structure, with 150 ps for input delay and 150 ps for output delay. The synthesized results of the pipelined and iterative structures are shown in Table I. For the pipelined structure, the CORDIC path and the Taylor path occupy areas of 176952.46 and 17097.18 μm², respectively.

B. Accuracy Analysis

Because of the finite bit-width of operands and the finite number of iterations in CORDIC, the relative error of floating-point sine or cosine becomes very large when the input is close to 0 or $\pi/2$, respectively. Jie [6] provided a relative error of $2^{-32}$ with a small input for double precision floating-point sine computation, and the maximum relative error should be larger with a smaller input.

The following comparison reflects the accuracy improvement of our TCORDIC clearly. The number of inaccurate bits of mantissa measures the relative error for floating-point computation quantitatively, and the input is limited to the range close to 0 or $\pi/2$. Because of the symmetry of sine and cosine, the number and trend of error bits for sine with $Z$ close to 0 are similar to those for cosine with $Z$ close to $\pi/2$.

TCORDIC can guarantee faithful rounding (i.e., the error is smaller than one ulp) for any angle in $[0, \pi/2]$. Fig. 11 compares the relative error of sine or cosine between traditional CORDIC and TCORDIC when the input $Z$ is close to 0 or $\pi/2$. The number of iterations, which is also the bit-width of operands, in traditional CORDIC is 64, the same as in TCORDIC. Five hundred points are uniformly taken over $\log_2 Z \in [-25, -5]$ to compute sine, and the input $Z$ correspondingly varies from 0x3e60000000000000 to 0x3fa0000000000000. Similarly, five hundred points are uniformly taken with $\log_2(\pi/2 - Z) \in [-25, -5]$ to calculate cosine, and $Z$ correspondingly varies from 0x3ff911fb54442d18 to 0x3ff921fb53442d18.

As shown in Fig. 11, the error of traditional CORDIC is large for a fixed bit-width of operands and number of iterations, and the error bits of sine or cosine increase quickly when the input angle is close to 0 or $\pi/2$, respectively. The error of sine or cosine of TCORDIC is no more than one error bit with the supplement of Taylor.

The other related works implemented fixed-point sine and cosine functions based on CORDIC and focused on maximum absolute error or mean square error (MSE). In [16], the maximum absolute error was $8.04 \times 10^{-4}$ and $5.50 \times 10^{-4}$ for cosine and sine functions, respectively, with the angles ranging from $-\pi/2$ to $\pi/2$ in steps of $\pi/500$. Daniel [32] used 1000 inputs ranging from 0 to $\pi/2$ to compute the MSE of sine and cosine and provided MSE $= 1.4 \times 10^{-13}$.

C. Delay and Area Comparisons

Area and delay differ across designs with different technology generations and design parameters; therefore, normalized metrics are adopted for the comparisons. To avoid the effect of technology generation, the comparisons are based on function components. The function components
TABLE III
DELAY AND AREA COMPARISONS
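VI. CONCLUSION

In this paper, TCORDIC algorithm, which combines low latency CORDIC and Taylor algorithm, is presented to improve the accuracy of floating-point sine/cosine computation, and to guarantee the accuracy within one ulp even when the input is close to 0 or $\pi/2$. Several schemes, including sign prediction, parallel iteration, and truncated multiplier, are employed to reduce the delay for low latency CORDIC. Finally, taking IEEE-754 double precision floating-point sine and cosine functions as an example, iterative and pipelined structures are implemented, and the advantages of low latency and low error are achieved.

APPENDIX A
PROOF OF THE SIMPLIFICATION OF EQUATION (8)

When $m \ge n/2 + 1$, equation (8) can be simplified from equation (7) with the error of $X_n$ or $Y_n$ less than $2^{-n}$, and the proof of the simplification follows. The proof can be completed in three steps.

Step one: the computation of the omitted value in $A_{m,n}$.
$A_{m,n} = 1 + \Delta A$, where $\Delta A$ is the omitted value.

$$\Delta A = -\sum_{i}\sum_{j} \sigma_i \sigma_j 2^{-i-j} + \sum_{i}\sum_{j}\sum_{k}\sum_{l} \sigma_i \sigma_j \sigma_k \sigma_l 2^{-i-j-k-l} + \cdots + (-1)^t \sum_{i(1)} \cdots \sum_{i(2t)} \sigma_{i(1)} \cdots \sigma_{i(2t)} 2^{-i(1) - \cdots - i(2t)}$$

$$\begin{aligned} |\Delta A| &\le \Big|\sum_{i}\sum_{j} \sigma_i \sigma_j 2^{-i-j}\Big| + \Big|\sum_{i}\sum_{j}\sum_{k}\sum_{l} \sigma_i \sigma_j \sigma_k \sigma_l 2^{-i-j-k-l}\Big| + \cdots + \Big|(-1)^t \sum_{i(1)} \cdots \sum_{i(2t)} \sigma_{i(1)} \cdots \sigma_{i(2t)} 2^{-i(1) - \cdots - i(2t)}\Big| \\ &\le \sum_{i}\sum_{j} 2^{-i-j} + \sum_{i}\sum_{j}\sum_{k}\sum_{l} 2^{-i-j-k-l} + \cdots + \sum_{i(1)} \cdots \sum_{i(2t)} 2^{-i(1) - \cdots - i(2t)}. \end{aligned}$$

The following sub-items 1), 2), and 3) aim to calculate $\sum_{i(1)} \cdots \sum_{i(2t)} 2^{-i(1) - \cdots - i(2t)}$ for $t = 1$, $t = 2$, and $t \ge 3$, respectively.

1) The computation of $\sum_{i}\sum_{j} 2^{-i-j}$:
With $m - 1 < i < j < n$, $\sum_{i}\sum_{j} 2^{-i-j}$ consists of $C_{n-m}^{2}$ terms to be accumulated when the iteration formulas between the $m$-th and the $(n-1)$-th are expanded.

a) With $i = m$ and $j = m+1, m+2, \ldots, n-1$,
$$\sum_{j} 2^{-m-j} = 2^{-2m} - 2^{-(m+n-1)}.$$

b) With $i = m+1$ and $j = m+2, m+3, \ldots, n-1$,
$$\sum_{j} 2^{-(m+1)-j} = 2^{-2m-2} - 2^{-(m+n)}.$$

c) With $i = m, i = m+1, \ldots, i = n-2$,
$$\begin{aligned} \sum_{i}\sum_{j} 2^{-i-j} &= \sum_{j} 2^{-m-j} + \sum_{j} 2^{-(m+1)-j} + \cdots + \sum_{j} 2^{-(n-2)-j} \\ &= \left(2^{-2m} - 2^{-(m+n-1)}\right) + \left(2^{-2m-2} - 2^{-(m+n)}\right) + \cdots + \left(2^{-(2n-4)} - 2^{-(2n-3)}\right). \end{aligned}$$

When $m \ge n/2 + 1$,
$$\sum_{i}\sum_{j} 2^{-i-j} \le \left(2^{-(n+2)} - 2^{-(3n/2)}\right) + \left(2^{-(n+4)} - 2^{-(3n/2+1)}\right) + \cdots + \left(2^{-(2n-4)} - 2^{-(2n-3)}\right).$$

2) The computation of $\sum_{i}\sum_{j}\sum_{k}\sum_{l} 2^{-i-j-k-l}$:
$\sum_{i}\sum_{j}\sum_{k}\sum_{l} 2^{-i-j-k-l}$ involves $C_{n-m}^{4}$ terms to be accumulated.
With $m \ge n/2 + 1$, the maximum value of $C_{n-m}^{4}$ is $\frac{(n/2-1)(n/2-2)(n/2-3)(n/2-4)}{4!}$, and the maximum value of $2^{-i-j-k-l}$ is $2^{-(n/2+1+n/2+2+n/2+3+n/2+4)}$.
Considering the function $f(n) = 2^{n/8+4} - n$, $f'(n) = 2^{n/8+4} \ln 2 \cdot \frac{1}{8} - 1 > 0$ when $n \ge 0$.
With $n = 0$, $f(0) = 2^{0/8+4} - 0 > 0$.
When $n \ge 0$, $\frac{n}{2^{n/8+4}} < 1$, and $\left(\frac{n}{2^{n/8+4}}\right)^4 < 1$.

$$\begin{aligned} \sum_{i}\sum_{j}\sum_{k}\sum_{l} 2^{-i-j-k-l} &\le \frac{(n/2-1)(n/2-2)(n/2-3)(n/2-4)}{4!}\, 2^{-2n-10} \\ &= \frac{(n/2-1)(n/2-2)(n/2-3)(n/2-4)}{4!\; 2^{n/2+10}}\, 2^{-3n/2} \\ &< \frac{n^4}{2^{n/2+16}}\, 2^{-3n/2} = 2^{-3n/2} \left(\frac{n}{2^{n/8+4}}\right)^4 < 2^{-3n/2}. \end{aligned}$$

3) The computation of $\sum_{i(1)} \cdots \sum_{i(2t)} 2^{-i(1) - \cdots - i(2t)}$ for $t \ge 3$:
Considering the function $g(n) = 2^{n/6+2} - n$, $g'(n) = 2^{n/6+2} \ln 2 \cdot \frac{1}{6} - 1 > 0$ when $n \ge 7$.
With $n = 7$, $g(7) = 2^{7/6+2} - 7 > 0$.
Therefore, $\frac{n}{2^{n/6+2}} < 1$ when $n \ge 7$.
With $(n/2 - 1)/2 \ge (n - m)/2 \ge t \ge 3$,

$$\begin{aligned} \sum_{i(1)} \cdots \sum_{i(2t)} 2^{-i(1) - \cdots - i(2t)} &\le \frac{(n/2-1)(n/2-2) \cdots (n/2-2t)}{(2t)!}\, 2^{-tn - \sum_{s=1}^{2t} s} \\ &= \frac{(n/2-1)(n/2-2) \cdots (n/2-2t)}{(2t)!\; 2^{(t-2)n + \sum_{s=1}^{2t} s}}\, 2^{-2n} \\ &< \frac{n^{2t}}{(2t)!\; 2^{(t-2)n + \sum_{s=1}^{2t} s + 2t}}\, 2^{-2n} < \frac{n^{2t}}{2^{(t-2)n + 4t}}\, 2^{-2n} \\ &= 2^{-2n} \left(\frac{n}{2^{(1/2 - 1/t)n + 2}}\right)^{2t} < 2^{-2n} \left(\frac{n}{2^{n/6+2}}\right)^{2t} < 2^{-2n}. \end{aligned}$$

Combining 1), 2), and 3):

$$\begin{aligned} |\Delta A| &\le \sum_{i}\sum_{j} 2^{-i-j} + \sum_{i}\sum_{j}\sum_{k}\sum_{l} 2^{-i-j-k-l} + \cdots + \sum_{i(1)} \cdots \sum_{i(2t)} 2^{-i(1) - \cdots - i(2t)} \\ &< \left\{\left(2^{-(n+2)} - 2^{-(3n/2)}\right) + \left(2^{-(n+4)} - 2^{-(3n/2+1)}\right) + \cdots + \left(2^{-(2n-4)} - 2^{-(2n-3)}\right)\right\} + 2^{-3n/2} + \left((n/2-1)/2 - 2\right) 2^{-2n} \\ &= \left(2^{-(n+2)} + 2^{-(n+4)} + \cdots + 2^{-(2n-4)}\right) + \left\{2^{-3n/2} + \left((n/2-1)/2 - 2\right) 2^{-2n} - 2^{-(3n/2)} - 2^{-(3n/2+1)} - \cdots - 2^{-(2n-3)}\right\} \\ &< 2^{-(n+2)} + 2^{-(n+4)} + \cdots + 2^{-(2n-4)} < 2^{-n-1}. \end{aligned}$$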
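The closed form in item 1c) and the bounds used in step one can be checked numerically with exact rational arithmetic. The Python sketch below is only a sanity check for a few values of n, not a substitute for the proof; the function names are hypothetical, and the index range is taken as $m \le i < j < \cdots \le n-1$ with $m = n/2 + 1$.

```python
from fractions import Fraction
from itertools import combinations


def multi_sum(m, n, order):
    """Exact sum of 2^-(i(1)+...+i(order)) over m <= i(1) < ... <= n-1."""
    return sum(
        Fraction(1, 2 ** sum(idx))
        for idx in combinations(range(m, n), order)
    )


def closed_form_pair_sum(m, n):
    """Item 1c): sum over i = m..n-2 of 2^-2i - 2^-(i+n-1)."""
    return sum(
        Fraction(1, 2 ** (2 * i)) - Fraction(1, 2 ** (i + n - 1))
        for i in range(m, n - 1)
    )


def even_order_total(m, n):
    """Sum of the 2-fold, 4-fold, 6-fold, ... sums (small n only)."""
    return sum(multi_sum(m, n, k) for k in range(2, n - m + 1, 2))


if __name__ == "__main__":
    for n in (16, 32, 64):
        m = n // 2 + 1
        pair = multi_sum(m, n, 2)
        quad = multi_sum(m, n, 4)
        print(n, pair == closed_form_pair_sum(m, n),
              pair < Fraction(1, 2 ** (n + 1)),
              quad < Fraction(1, 2 ** (3 * n // 2)))
```

The full even-order total is only evaluated for small n, because the number of index combinations grows quickly with n.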
[25] K. Maharatna, S. Banerjee, E. Grass, M. Krstic, and A. Troya, "Modified virtually scaling-free adaptive CORDIC rotator algorithm and architecture," IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 11, pp. 1463–1474, Nov. 2005.
[26] S. Y. Park and N. I. Cho, "Fixed-point error analysis of CORDIC processor based on the variance propagation formula," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 51, no. 3, pp. 573–584, Mar. 2004.
[27] X. Hu and S. C. Bass, "A neglected error source in the CORDIC algorithm," in Proc. IEEE ISCAS, May 1993, pp. 766–769.
[28] K. Kota and J. R. Cavallaro, "Numerical accuracy and hardware tradeoffs for CORDIC arithmetic for special-purpose processors," IEEE Trans. Comput., vol. 42, no. 7, pp. 769–779, Jul. 1993.
[29] E. Antelo, J. D. Bruguera, T. Lang, and E. L. Zapata, "Error analysis and reduction for angle calculation using the CORDIC algorithm," IEEE Trans. Comput., vol. 46, no. 11, pp. 1264–1271, Nov. 1997.
[30] C.-H. Lin and A.-Y. Wu, "Mixed-scaling-rotation CORDIC (MSR-CORDIC) algorithm and architecture for high-performance vector rotational DSP applications," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 52, no. 11, pp. 2385–2396, Nov. 2005.
[31] A. Boudabous, F. Ghozzi, M. Kharrat, and N. Masmoudi, "Implementation of hyperbolic functions using CORDIC algorithm," in Proc. IEEE 16th Int. Conf. Microelectron. (ICM), Dec. 2004, pp. 738–741.
[32] D. M. Muñoz, D. F. Sanchez, C. H. Llanos, and M. Ayala-Rincón, "FPGA based floating-point library for CORDIC algorithms," in Proc. 6th Southern Program. Logic Conf., 2010, pp. 55–60.
[33] M. Jridi and A. Alfalou, "Direct digital frequency synthesizer with CORDIC algorithm and Taylor series approximation for digital receivers," Eur. J. Sci. Res., vol. 30, no. 4, pp. 542–553, 2009.
[34] F. De Dinechin, M. Joldes, and B. Pasca, "Automatic generation of polynomial-based hardware architectures for function evaluation," in Proc. 21st IEEE Int. Conf. Appl.-Specific Syst. Archit. Process. (ASAP), Jul. 2010, pp. 216–222.

Baozhou Zhu was born in 1992. He received the B.S. degree in mechanical engineering and automation from South China University of Technology in 2015. He is currently working toward the M.S. degree in the field of microelectronics and solid-state electronics at the National University of Defense and Technology (NUDT), China. His research interests include high performance computer architecture and computing engineering.

Yuanwu Lei was born in 1982. He received the M.S. degree and the Ph.D. degree in computer science from National University of Defense and Technology (NUDT), China, in 2007 and 2012, respectively. His research interests include high performance computer architecture and computing engineering.

Yuanxi Peng was born in 1966. He received the B.S. degree in computer science from Sichuan University, China, in 1988 and the M.S. and Ph.D. degrees in computer science from National University of Defense and Technology (NUDT), China, in 1998 and 2001, respectively. He was a Visiting Professor in the Department of Electronic and Computer Engineering, University of Toronto, Canada, during 2010–2011. He has been a Professor of the Computer School in NUDT since 2011. His research interests are in the areas of high performance computing, multi- and many-core architectures, on-chip networks, cache coherence protocols, and architectural support for parallel programming.

Tingting He was born in 1991. She received the B.S. degree in information systems and information management from Jiangsu University of Science and Technology, China, in 2013. She is currently working toward the M.S. degree in microelectronics and solid-state electronics at the National University of Defense and Technology (NUDT), China. Her research interests include high performance computer architecture and computing engineering.