
Low Latency and Low Error Floating-Point Sine/Cosine Function Based TCORDIC Algorithm

Baozhou Zhu, Yuanwu Lei, Yuanxi Peng, and Tingting He

School of Computer, National University of Defense Technology, Changsha 410073, China

Abstract: The CORDIC algorithm is suitable for implementing the sine/cosine function, but the large number of iterations leads to great delay and overhead. Moreover, due to the finite bit-width of operands and the finite number of iterations, the relative error of floating-point sine or cosine becomes very large when the input angle is close to 0 or pi/2, respectively. To overcome these shortcomings, the TCORDIC algorithm, which combines low latency CORDIC and the Taylor algorithm, is presented. After analyzing the latency of traditional CORDIC, low latency CORDIC is proposed, which adopts the techniques of sign prediction, compressive iterations, and parallel iterations. Besides, the calculating boundary (N), which determines whether the Taylor algorithm is selected in the TCORDIC algorithm, is evaluated to achieve a trade-off between area and delay. Truncated multipliers are used to reduce the area further. Finally, using the TCORDIC algorithm, pipelined and iterative structures are implemented for IEEE-754 double precision floating-point sine/cosine with the input Z in [0, pi/2]. Under typical conditions (1 V, 25 °C), our designs are synthesized with a 40 nm standard cell library. For the pipelined structure, the frequency is up to 1.70 GHz and the area is 194049.64 μm². The frequency decreases to 1.45 GHz for the iterative structure, but the area requires only 110590.81 μm². TCORDIC is efficient in controlling relative error, and achieves accuracy within one ulp (unit in the last place) for the floating-point sine/cosine function.

Index Terms: CORDIC, floating-point sine/cosine, low latency, Taylor.

I. INTRODUCTION

In the fields of system control, simulation [1], high performance computation [2], and scientific computation [3], a floating-point sine/cosine function with low error is essential. In Intel 8087, Motorola, and Cyrix coprocessors, and in CPUs such as the 486DX and Pentium [4], sine/cosine is computed in hardware. FPGAs and ASICs provide suitable platforms for implementing sine/cosine.

Algorithms for the sine/cosine function in hardware fall into three categories: look-up table, polynomial approximation, and digital iteration [4]. Look-up tables are applicable only to low precision [5], because the hardware cost grows exponentially with precision. The convergence speed of polynomial approximation is fast for a small variable, but decreases rapidly as the variable increases, so the required number of multiplications and additions becomes larger. Typically, polynomial approximation is combined with a look-up table [3]: the variable is compressed to a smaller range by the look-up table, and polynomials compute the final results quickly. However, this combination is still expensive in terms of the multipliers, adders, and tables needed for high precision sine/cosine computation.

The CORDIC algorithm is a digital iteration method for calculating a variety of transcendental functions, including the sine/cosine function in the circular coordinate system and rotation mode. Calculating the sine/cosine function with CORDIC has been employed [6] in a variety of high-speed and real-time applications, such as adaptive filters [7] in digital signal processing (DSP), discrete sinusoidal transforms [8] in signal processing, sinusoidal waveform generation [9] in communication, robot control [10], and geometric computations [11] in graphics.

However, a large number of iterations must be executed in order, which is the bottleneck for optimizing CORDIC. The first data dependence is the determination of the rotation direction from the iteration results of the Z path. The second is that the (i+1)-th iteration can start only after the completion of the i-th iteration, because the micro-rotations of any iteration are performed on the intermediate vectors computed by the previous iterations. These two data dependences restrict optimization of the critical path. In addition, each iteration is completed by a carry look-ahead adder, and the delay due to carry propagation is proportional to the bit-width of the operands. Therefore, when the precision requirement is higher, the bit-width of operands and the carry propagation delay increase.

The precision of CORDIC converges linearly with the number of iterations. Error is inevitable because of the finite bit-width of operands and the finite number of iterations, and the relative error is too large to meet the precision requirements of the IEEE-754 floating-point standard.

The TCORDIC algorithm, combining low latency CORDIC with the Taylor algorithm, is presented to efficiently control the relative error of the floating-point sine/cosine function. Moreover, low latency CORDIC is proposed by combining sign prediction, compressive iterations, and parallel iteration techniques. Pipelined and iterative structures are implemented, and N, the calculating boundary that determines whether the Taylor algorithm is selected in the TCORDIC algorithm, is evaluated to achieve the balance between area and delay. In addition, truncated multipliers further reduce the area.

The following are the main advantages of our designs:
1) The accuracy of the floating-point sine/cosine function is guaranteed within one ulp (unit in the last place). By combining Taylor and CORDIC, absolute and relative error can be controlled effectively with the input Z in [0, pi/2], achieving the precision requirements of the IEEE-754 floating-point standard.
2) The proposed low latency CORDIC combines the schemes of sign prediction, compressive iterations, and parallel iterations, which significantly reduces area and delay.
3) By selecting Taylor appropriately, the pipelined structure obtains low latency with the maximum possible reduction in area, and the iterative structure makes full use of low area cost with the maximum possible reduction in delay.
4) By optimizing the bit-width of the truncated multipliers to meet the precision requirements, the area can be further reduced.

II. BACKGROUND AND RELATED WORK

A. Traditional CORDIC Algorithm

The sine/cosine function is calculated in rotation mode in the circular coordinate system. The basic iterative formulas for this calculation are:

  X_{i+1} = k_i (X_i - σ_i 2^{-i} Y_i)
  Y_{i+1} = k_i (Y_i + σ_i 2^{-i} X_i)                         (1)
  Z_{i+1} = Z_i - σ_i θ_i

where σ_i in {-1, 1} equals sign(Z_i) and determines the rotation direction of the vector, and θ_i = arctan(2^{-i}) is the micro-rotation angle. k_i = 1/sqrt(1 + 2^{-2i}), and the scale factor after n iterations is K_n = Π_{i=1}^{n} k_i. The initial angle Z_0 is the total rotation angle, and the iterations drive Z_i toward 0. The vector corresponding to Z_0 is obtained after n iterations:

  X_n = (1/K_n) (X_0 cos(Z_0) - Y_0 sin(Z_0))
  Y_n = (1/K_n) (Y_0 cos(Z_0) + X_0 sin(Z_0)).                 (2)

With X_0 = K_n and Y_0 = 0, the iteration results are equivalent to the sine and cosine of Z_0 in the range Z_0 in [0, pi/2].
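The recurrence above can be exercised directly in software. The following Python sketch is our own illustration of the rotation-mode algorithm, not the paper's datapath: it uses the common shift-and-add form in which the k_i factors are folded into the initial value X_0 = K_n, and it models the fixed iteration count n = 64 with double-precision arithmetic.

```python
import math

def cordic_sin_cos(z, n=64):
    """Rotation-mode CORDIC model in the circular coordinate system.

    z must lie in [0, pi/2].  The scale factors k_i are absorbed into the
    initial value X_0 = K_n, so each micro-rotation is shift-and-add only,
    as in a hardware datapath.
    """
    thetas = [math.atan(2.0 ** -i) for i in range(n)]   # theta_i = arctan(2^-i)
    k_n = 1.0
    for i in range(n):
        k_n *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))   # K_n = prod k_i
    x, y = k_n, 0.0                                     # X_0 = K_n, Y_0 = 0
    for i in range(n):
        sigma = 1 if z >= 0 else -1                     # sigma_i = sign(Z_i)
        x, y = x - sigma * (2.0 ** -i) * y, y + sigma * (2.0 ** -i) * x
        z -= sigma * thetas[i]                          # Z_{i+1} = Z_i - sigma_i * theta_i
    return y, x                                         # (sin Z_0, cos Z_0)

if __name__ == "__main__":
    for a in (0.1, 0.75, 1.5):
        s, c = cordic_sin_cos(a)
        print(a, s - math.sin(a), c - math.cos(a))
```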
B. Related Work

Since Volder proposed the CORDIC algorithm for calculating trigonometric functions in 1959 [12], [13], many variants [14], [15] have been presented to reduce delay. Reducing the number of iterations [16], [17] and reducing the delay of each iteration are the two main routes to low latency. A carry-free adder is a straightforward choice for reducing the delay of each iteration. Parallel CORDIC [18], [19] speeds up the iteration period with CSA (carry save adder), and multi-operand CSA tree architectures are adopted to replace the last half of the iterations. However, both the delay and the area of the CSA trees are large, and sign prediction for all iterations requires much area. Redundant CORDIC [16], [20], [21] uses a redundant representation and the corresponding adder to achieve fast carry-free computation. There are two types of redundant numbers: carry-save and signed-digit [20]. Redundant CORDIC can significantly reduce the iteration period, but both the scale factor and the number of iterations are variable. Besides, sign prediction for redundant numbers increases the implementation complexity. High-radix CORDIC [16], [21], [22], [23] needs fewer iterations than radix-2 CORDIC does, but, with a variable scale factor, both the look-up table for scale factor calculation and the compensation lead to great delay and area. The drawback inherent to scaling-free CORDIC [24], [25] lies in its small convergence range; either the domain folding technique or the repetitions of the first elementary micro-rotation lead to great cost and delay.

Sang [26] points out that angle approximation and round-off are the two main sources of error in the CORDIC algorithm. By choosing an appropriate bit-width of operands and number of iterations, the absolute error of fixed-point CORDIC can be controlled and measured by MSE (mean-square error), but the relative error remains ignored. Hu [27] finds that when the input vector is close to 0 in vector mode, the relative error caused by round-off is maximal and sufficient to misinterpret rotation directions. The partial normalization scheme proposed by Kota [28] and the prescaling technique adopted by Antelo [29] can decrease the relative error. However, these works [30] analyzed only absolute and relative errors of fixed-point CORDIC, and made no mention of error analysis and control for floating-point CORDIC.

Anis [31] compares CORDIC with the Taylor algorithm for calculating the sine/cosine function in detail, in terms of structure and precision, but the focus is on the contrast rather than the combination. Daniel [32] combines CORDIC with the Taylor algorithm for the calculation of the floating-point exponential function to improve precision, without elaborating the principle or generalizing it to the sine/cosine function. Maher [33] uses one CORDIC to replace two read-only memories in computing the sine and cosine of the upper address θ_0, with Taylor computing around θ_0, which aims only to improve frequency resolution for a direct digital frequency synthesizer.

III. TCORDIC ALGORITHM

The TCORDIC algorithm, combining low latency CORDIC and the Taylor algorithm, is proposed in this section to improve the accuracy of floating-point sine/cosine computation with CORDIC when the input angle is close to 0 or pi/2, and to guarantee accuracy within one ulp.

A. Low Latency CORDIC Algorithm

1) Delay Analysis for Traditional CORDIC: The delay of traditional CORDIC is determined by the number of iterations and the delay of each iteration. In the pipelined structure, the X/Y path is critical and Delay = n (T_nCLA + T_reg), where T_nCLA and T_reg denote the delays of an n-bit carry look-ahead adder and a register, respectively. In the iterative structure, the worst path is determined by the maximum delay among the X, Y, and Z paths. As shown in Fig. 1, Delay = max{ n (T_nCLA + T_bshift(n, log2 n) + T_reg), n (T_nCLA + T_LUT + T_reg) }, where T_LUT and T_bshift(u, v) denote the delays of the look-up table and of a u-bit barrel shifter with v control signals, respectively.

Fig. 1. Iterative structure based on traditional CORDIC.

The delay can be decreased significantly by reducing the number of iterations, and the main delay of each iteration derives from the CLA. Due to the linear convergence of CORDIC, the bit-width of operands and the number of iterations increase linearly with precision, so a high precision requirement causes a great carry propagation delay.

2) Sign Prediction Scheme: The low latency CORDIC algorithm eliminates the first data dependence by using a sign prediction technique in the Z path. The binary expression of Z_j is Σ_{j=0}^{n} b_j 2^{-j}, b_j in {0, 1}. If Z_j = b_0 . b_1 ... b_{j-1} b_j ... b_n and b_0 = b_1 = ... = b_{j-1}, the transformation rule of sign prediction between the j-th and k-th bits is defined as follows: if Z_j is positive, in other words b_{j-1} = 0, σ_j equals 1; otherwise, σ_j equals -1. Since i > j - 1, σ_{i+1} equals -1 if b_i = 0, and σ_{i+1} equals 1 if b_i = 1.

The angle approximation error of each iteration under this rule is 2^{-i} - θ_i, and the cumulative error of k - i + 1 iterations must be less than 2^{-n} to ensure convergence; so, k <= 3i + 1 must be satisfied. For index i >= (n - log2 3)/3, it can be shown that 2^{-i} - θ_i < 2^{-n}, so θ_i can be replaced with 2^{-i}; hence the last 2/3 of the iterations adopt the transformation rule of sign prediction directly. For index i < (n - log2 3)/3, correction iterations are added to the iteration sequence according to k <= 3i + 1 to ensure prediction accuracy.
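The claim that the tail of the iteration sequence may use the bits of Z directly rests on the substitution 2^{-i} for θ_i becoming harmless from roughly i ≈ n/3 onward. The short numeric check below (an illustration of that bound, not a model of the predictor hardware) finds the first index at which the accumulated substitution error of all remaining iterations drops below 2^{-n}, and compares it with the quoted bound (n - log2 3)/3.

```python
import math

def tail_substitution_index(n=64):
    """First micro-rotation index from which theta_i = arctan(2^-i) can be
    replaced by 2^-i while the accumulated substitution error of all the
    remaining iterations stays below the resolution target 2^-n."""
    target = 2.0 ** -n
    for i in range(1, n + 1):
        acc = sum((2.0 ** -m) - math.atan(2.0 ** -m) for m in range(i, n + 1))
        if acc < target:
            return i
    return n

if __name__ == "__main__":
    n = 64
    print("first index with safe substitution:", tail_substitution_index(n))
    print("analytic bound (n - log2 3)/3     :", (n - math.log2(3)) / 3)
```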
3) Compressive Iterations Based on CSA: Based on sign prediction, the first half of the iterations are compressed by CSA in the X and Y paths. CSA eliminates the carry propagation delay of each compressive iteration and makes it independent of the bit-width of operands. X_i and Y_i are each divided into sum and carry words:

  X_i = X_i^C + X_i^S
  Y_i = Y_i^C + Y_i^S.                                        (3)

The iteration formulas are converted into (4), where the CLAs in the X and Y paths are replaced with 4:2 CSAs:

  X_{i+1}^C + X_{i+1}^S = X_i^C + X_i^S - σ_i 2^{-i} (Y_i^C + Y_i^S)
  Y_{i+1}^C + Y_{i+1}^S = Y_i^C + Y_i^S + σ_i 2^{-i} (X_i^C + X_i^S).    (4)
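A carry-save update can be modelled with ordinary integers. The sketch below (our own model, with an assumed word width of 72 bits rather than the paper's exact operand width) builds a 4:2 compressor out of two full-adder layers and checks that keeping the value as a (sum, carry) pair is exactly equivalent, modulo the word width, to ordinary addition; only the final CLA after the compressive iterations performs a carry-propagate add.

```python
import random

W = 72                                    # model word width (assumption, not the paper's exact width)
MASK = (1 << W) - 1

def csa_3to2(a, b, c):
    """Carry-save (full-adder) layer: three operands -> (sum, carry), no carry ripple."""
    return (a ^ b ^ c) & MASK, (((a & b) | (b & c) | (a & c)) << 1) & MASK

def csa_4to2(a, b, c, d):
    """4:2 compressor modelled as two stacked 3:2 layers, as used for eq. (4)."""
    s, cr = csa_3to2(a, b, c)
    return csa_3to2(s, cr, d)

if __name__ == "__main__":
    random.seed(1)
    for _ in range(5):
        ops = [random.getrandbits(W) for _ in range(4)]
        s, c = csa_4to2(*ops)
        # One carry-propagate addition at the very end recovers the true sum
        # (mod 2^W), which is what the final CLA after the compressive
        # iterations does in the X and Y paths.
        assert (s + c) & MASK == sum(ops) & MASK
    print("4:2 compression matches ordinary addition modulo 2^W")
```

In the datapath of (4), the four operands of each compressor are the current sum/carry words of one path and the arithmetically shifted sum/carry words of the other path.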
4) Parallel Iterations Based on Multiplication: The last half of the iterations are calculated by parallel iterations, which eliminate the second kind of data dependence and reduce the number of iterations. The formulas of the i-th iteration are:

  X_{i+1} = X_i - σ_i 2^{-i} Y_i
  Y_{i+1} = Y_i + σ_i 2^{-i} X_i.                              (5)

Plugging the i-th iteration formulas into the (i+1)-th, the following formulas are obtained:

  X_{i+2} = X_i (1 - σ_i σ_{i+1} 2^{-2i-1}) - Y_i (σ_i 2^{-i} + σ_{i+1} 2^{-i-1})
  Y_{i+2} = Y_i (1 - σ_i σ_{i+1} 2^{-2i-1}) + X_i (σ_i 2^{-i} + σ_{i+1} 2^{-i-1}).    (6)

In this way, the iteration formulas between the m-th and the (n-1)-th iterations are expanded as:

  X_n = X_m A_{m,n} - Y_m B_{m,n}
  Y_n = Y_m A_{m,n} + X_m B_{m,n}
  A_{m,n} = 1 - Σ_i Σ_j σ_i σ_j 2^{-i-j} + Σ_i Σ_j Σ_k Σ_l σ_i σ_j σ_k σ_l 2^{-i-j-k-l} - ... + (-1)^t Σ_{i(1)} ... Σ_{i(2t)} σ_{i(1)} ... σ_{i(2t)} 2^{-i(1)-...-i(2t)}        (7)
  B_{m,n} = Σ_i σ_i 2^{-i} - Σ_i Σ_j Σ_k σ_i σ_j σ_k 2^{-i-j-k} + ... + (-1)^t Σ_{i(1)} ... Σ_{i(2t+1)} σ_{i(1)} ... σ_{i(2t+1)} 2^{-i(1)-...-i(2t+1)}

where i, j, k, i(1), ..., i(2t), i(2t+1) are all integers from m to n-1 and satisfy m-1 < i < j < k < n and m-1 < i(1) < i(2) < ... < i(2t) < i(2t+1) < n. When m >= n/2 + 1, it follows that i + j >= 2m + 1 >= n + 3. Except for the first item 1 in A_{m,n}, the maximum sum of the other items is less than 2^{-n-1}. Except for the first item Σ_i σ_i 2^{-i} in B_{m,n}, the maximum sum of the other items is less than 2^{-n-2}. Because |Y_m| <= 1 and |X_m| <= 1, the error of X_n or Y_n is less than 2^{-n}, as analyzed in the Appendix.

Thus, the iterations from the (n/2+1)-th to the (n-1)-th can be simplified as:

  X_n = X_{n/2+1} - Y_{n/2+1} Σ_{i=n/2+1}^{n} σ_i 2^{-i}
  Y_n = Y_{n/2+1} + X_{n/2+1} Σ_{i=n/2+1}^{n} σ_i 2^{-i}.      (8)

The last half of the iterations can be regarded as a single rotation by the angle Σ_{i=n/2+1}^{n} σ_i 2^{-i}, which equals Z_{n/2+1}. Therefore, the formulas become:

  X_n = X_{n/2+1} - Y_{n/2+1} Z_{n/2+1}
  Y_n = Y_{n/2+1} + X_{n/2+1} Z_{n/2+1}.                       (9)

Thus, the last half of the iterations can be completed with two multipliers and two adders.
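The collapse of the tail iterations into one multiply-accumulate per path can be checked numerically. The sketch below (an illustration of (8)/(9) under the stated condition m = n/2 + 1, with an assumed n = 40 so that double precision comfortably resolves the 2^{-n} bound) compares the explicit micro-rotations against the merged form for random rotation directions and starting vectors.

```python
import random

def tail_rotations_explicit(x, y, sigmas, m):
    """Run the remaining micro-rotations one by one: eq. (5) with theta_i ~ 2^-i."""
    for k, s in enumerate(sigmas):
        i = m + k
        x, y = x - s * 2.0 ** -i * y, y + s * 2.0 ** -i * x
    return x, y

def tail_rotations_merged(x, y, sigmas, m):
    """Replace the same rotations by one multiply-accumulate per path, eq. (9)."""
    z = sum(s * 2.0 ** -(m + k) for k, s in enumerate(sigmas))   # equals Z_{n/2+1}
    return x - y * z, y + x * z

if __name__ == "__main__":
    random.seed(2)
    n, worst = 40, 0.0
    m = n // 2 + 1
    for _ in range(1000):
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        sig = [random.choice((-1, 1)) for _ in range(m, n + 1)]
        xe, ye = tail_rotations_explicit(x, y, sig, m)
        xm, ym = tail_rotations_merged(x, y, sig, m)
        worst = max(worst, abs(xe - xm), abs(ye - ym))
    print("worst deviation:", worst, "  bound 2^-n:", 2.0 ** -n)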
B. Error Analysis for Floating-Point Sine/Cosine

When the input is close to 0 or pi/2, the relative error of the floating-point sine/cosine function is large, due to the following errors.

1) Angle Approximation Error: Angle approximation error comes from the finite number of iterations. The resolution of the result is 2^{-n} after n iterations, so the angle approximation error approaches 2^{-n}. The absolute error becomes smaller as the number of iterations grows. However, the meaningful magnitude of the error is relative to the exponent under the IEEE-754 floating-point standard. To round off correctly, the absolute error should be less than 2^{E_Z - E_offset - L}, assuming that Z = (-1)^{S_Z} M_Z 2^{E_Z}. L denotes the bit-width of the mantissa in normalized form, which is 23 and 52 bits for single and double precision floating-point in the IEEE-754 standard, respectively. It can be proved that Z/2 < sin(Z) < Z (with E_Z - E_offset < 0) for 0 < Z < pi/3, so the number of iterations needs to be more than L + 1 - (E_Z - E_offset) to guarantee precision. Therefore, the required number of iterations increases as the input decreases.

2) Round-Off Error: Round-off error is inevitable and is caused by the finite bit-width of operands. The operands adopt a fixed-point format with 1 sign bit, N1 integer bits, and N2 fraction bits. N2 increases as the required precision increases. To keep the relative error below 2^{E_Z - E_offset - L}, the fraction needs at least L + 1 - (E_Z - E_offset) bits. In addition, taking into account the round-off error accumulated by L + 1 - (E_Z - E_offset) iterations, at least log2(L + 1 - (E_Z - E_offset)) guard bits are needed. Therefore, the bit-width of operands requires 1 + N1 + L - (E_Z - E_offset) + log2(L + 1 - (E_Z - E_offset)) bits to ensure precision, and this grows as the input decreases.

The bit-width of the operands excluding the sign and integer bits is denoted n and equals the number of iterations. To guarantee the accuracy of L mantissa bits under the IEEE-754 floating-point standard, the following condition must be satisfied:

  L + 1 - (E_Z - E_offset) + log2(n) <= n.                     (10)

To guarantee precision, the required number of iterations and bit-width of operands must therefore increase whenever the input gets smaller, which leads to a sharp increase in hardware cost. According to sin(Z) = cos(pi/2 - Z), when Z is close to pi/2, the relative error of cos(Z) behaves like that of sin(Z) with Z close to 0.

3) A Specific Example of Relative Error of Sine: The relative error of sine can be clarified with a specific example. With input Z in [0, pi/2], the bit-width of operands and number of iterations is 64, and there are log2(64) = 6 guard bits. N' equals E_offset - E_Z and is the number of leading zeros of Z_0. Z_0 is the fixed-point form of Z and consists of 1 sign bit, 1 integer bit, and 62 fraction bits.

Fig. 2. Error analysis with N' < 10.

With N' < 10, as shown in Fig. 2, the sine of Z in fixed-point form takes only the two values of sin(Z_0) rounded up or rounded down, because 2^{-N'-1} < sin(2^{-N'}) < 2^{-N'}. After normalizing, all mantissa bits of sin(Z) in floating-point form come from the iterations of Z_0, but they may include guard bits, which may be inaccurate. At the end there are six guard bits, which may or may not be discarded by normalization.

Fig. 3. Error analysis with N' > 10.

Similarly, with N' > 10, the sine of Z again takes only the two values of sin(Z_0) rounded up or rounded down, as shown in Fig. 3. After normalizing, the mantissa bits of sin(Z) in floating-point form come from the iterations of the partial bits of Z plus complementary 0 bits. The mantissa bits iterated from the 0 bits must be inaccurate, and those iterated from the bits of Z include 6 guard bits. As N' increases, the guard bits and the inaccurate bits iterated from the 0 bits move toward the high bits of the mantissa, and the relative error becomes larger.

4) Error Analysis for Taylor Algorithm: Given the finite bit-width of operands and number of iterations in CORDIC, the key issue in guaranteeing precision is handling a small input. Fortunately, the Taylor series converges quickly for a small variable. With floating-point input Z, the Taylor expansion is:

  sin(Z) = Z - Z^3/3! + Σ_{i=3}^{c} (-1)^{i+1} Z^{2i-1}/(2i-1)!
         = Σ_{i=1}^{c} (-1)^{i+1} M_Z^{2i-1} 2^{(2i-1)(E_Z - E_offset)}/(2i-1)!.      (11)

If the number of expansion items is c, the omitted tail of the Taylor expansion is less than M_Z^{2c+1} 2^{(2c+1)(E_Z - E_offset)}/(2c+1)!. To guarantee the accuracy of L mantissa bits, the relative error must be less than 2^{E_Z - E_offset - L}, so the following condition must be satisfied:

  M_Z^{2c+1} 2^{(2c+1)(E_Z - E_offset)}/(2c+1)! < 2^{E_Z - E_offset - L}.             (12)
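Condition (12) can be solved for the smallest admissible number of expansion items. The sketch below is a small helper we use for illustration (with the pessimistic mantissa bound M_Z = 2 as an assumption); for N' = 5 and N' = 13 it returns 5 and 2 terms, the values used later for the iterative and pipelined structures.

```python
from math import factorial

def taylor_terms_needed(nprime, L=52, m_z=2.0):
    """Smallest number of Taylor terms c such that the omitted tail of sin(Z),
    bounded by M_Z^(2c+1) * 2^(-(2c+1)*N') / (2c+1)!, stays below the target
    2^(-N'-L); cf. eq. (12) with E_Z - E_offset = -N'."""
    c = 1
    while (m_z ** (2 * c + 1) * 2.0 ** (-(2 * c + 1) * nprime)
           / factorial(2 * c + 1)) >= 2.0 ** (-nprime - L):
        c += 1
    return c

if __name__ == "__main__":
    for nprime in (5, 13, 20):
        print("N' =", nprime, "->", taylor_terms_needed(nprime), "terms")
```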
C. TCORDIC Algorithm

The TCORDIC algorithm, shown in Fig. 4, is composed of low latency CORDIC and the Taylor algorithm, whose respective advantages are mutually complementary in computing the floating-point sine/cosine function, achieving accuracy within one ulp. As the Taylor series converges quickly for a small variable, Taylor is used as a supplement to CORDIC. In TCORDIC, N denotes the calculating boundary that determines whether the Taylor algorithm is selected. For sine, the result is computed by Taylor when E_offset - E_Z > N, otherwise by CORDIC. For cosine, it is computed by Taylor when E_offset - E_Z' > N, where E_Z' is the exponent of pi/2 - Z_0 after normalization, otherwise by CORDIC.

Fig. 4. TCORDIC algorithm.
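The selection rule only needs the exponent field of the input (and of pi/2 - Z for cosine). A minimal behavioural sketch of that rule, assuming normal non-zero doubles and leaving the fixed-point details of the hardware aside, is:

```python
import math
import struct

def unbiased_exponent(x):
    """E_Z - E_offset for an IEEE-754 double (assumes a normal, non-zero x)."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return ((bits >> 52) & 0x7FF) - 1023

def tcordic_path(z, N):
    """Choose Taylor or CORDIC for sin(z) and cos(z), z in [0, pi/2].

    Mirrors the selection rule above: Taylor for sin when E_offset - E_Z > N,
    and Taylor for cos when the exponent of (pi/2 - z) is more than N below
    E_offset.  A sketch; the hardware operates on the fixed-point form Z_0.
    """
    sin_path = "Taylor" if -unbiased_exponent(z) > N else "CORDIC"
    cos_path = "Taylor" if -unbiased_exponent(math.pi / 2 - z) > N else "CORDIC"
    return sin_path, cos_path

if __name__ == "__main__":
    for z in (2.0 ** -20, 0.5, math.pi / 2 - 2.0 ** -20):
        print(z, tcordic_path(z, N=5))
```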
When E_offset - E_Z <= N, the number of iterations n_min needed to guarantee L accurate mantissa bits follows from (10) with E_offset - E_Z = N:

  L + 1 + N + log2(n_min) <= n_min.                            (13)

Similarly, when E_offset - E_Z > N, the number of Taylor expansion items c_min needed to guarantee L accurate mantissa bits follows from (12) with E_offset - E_Z = N:

  M_Z^{2 c_min + 1} 2^{-(2 c_min + 1) N}/(2 c_min + 1)! < 2^{-N-L}.          (14)

IV. IMPLEMENTATIONS OF TCORDIC

In this section, IEEE-754 double precision floating-point is taken as an example to illustrate the implementations of the TCORDIC algorithm.

To guarantee an accurate mantissa of L = 52 bits for double precision floating-point computation, the minimum number of iterations (equal to the bit-width of operands) n_min in CORDIC and the minimum number of expansion items c_min in Taylor both depend on the calculating boundary (N) through (13) and (14). Therefore, the calculating boundary (N) can be optimized to achieve a trade-off between area and delay. The optimal values of the calculating boundary (N) are 13 and 5 in the pipelined and iterative structures, respectively.

A. Iterative Structure

1) Optimization of the Calculating Boundary (N): The iterative structure needs only one truncated, fixed-point multiplier. Between the iterations for CORDIC, the multiplier is multiplexed for Taylor. The Taylor expansion is evaluated with the Horner scheme, which involves the fewest multiplications:

  sin(Z) = Z - Z^3/3! + Σ_{i=3}^{c} (-1)^{i+1} Z^{2i-1}/(2i-1)!
         = Z (1 - Z^2 (1/3! - ... - Z^2 (1/(2c-3)! - Z^2/(2c-1)!) ... )).    (15)

In CORDIC, a 2-stage compressive iteration occupies one clock cycle. In a multiplication, the first 2-stage compression accounts for one clock cycle, and the last 2-stage compression plus the CLA addition account for another clock cycle. In Taylor, a shift and an addition together account for one clock cycle. The number of cycles required with the multiplexed truncated multiplier is h for the Taylor expansion, and a for completing the compressive iterations in CORDIC. Therefore, the timing constraint is h <= a, and the truncated multiplier should be multiplexed as much as possible. With n_min iterations in CORDIC, the number of cycles for compression is

  a = ceil(n_min / 4).                                         (16)

With c_min Taylor expansion items, the truncated multiplier is multiplexed c_min + 1 times, and there are c_min - 1 additions and shifts. Therefore, the total number of cycles for the Taylor expansion is

  h = 3 (c_min - 1) + 4.                                       (17)

Fig. 5. Delay influence of N in TCORDIC, where N is the calculating boundary, h is the number of cycles to calculate the Taylor expansion, and a is the number of cycles to complete the compressive iterations in CORDIC.

With L = 52 bits to be guaranteed, (13), (14), (16), and (17) are evaluated. As shown in Fig. 5, the timing constraint h <= a is satisfied for N >= 5. When N continues to increase, the number of iterations required to guarantee precision increases. Therefore, N = 5 is the optimal value of the calculating boundary in the iterative structure.
ary (N) are 13 and 5 in pipelined structure and iterative tion of the overall iterative structure is shown in Fig. 6(A).
structure, respectively. According to the above analysis, the key parameters for the
ZHU et al.: LOW LATENCY AND LOW ERROR FLOATING-POINT SINE/COSINE FUNCTION BASED TCORDIC ALGORITHM 897

Fig. 6. Implement of iterative structure.

implementation of the optimized iterative structure are as double precision floating-point number, includes exponent E Z
follows: and mantissa M Z . If E o f f set E Z is greater than 5, Taylor is
Guaranteed precision L = 52 bits. selected to calculate si n(Z ). If E o f f set E Z is greater than 5,
Calculating boundary N = 5. For sine, it is computed Taylor is selected to calculate si n(/2 Z 0 ), which is equal
by Taylor when E o f f set E Z > 5 (i.e., Z [0, 25 ]), to cos(Z ). To calculate (/2 Z 0 ), the adder needs 52 + 64
otherwise it is computed by CORDIC. For cosine, it is bits guaranteed even in the worst case that the 52 high bits
calculated by Taylor when E o f f set E Z > 5 (i.e., Z  of (/2 Z 0 ) are all 0. To calculate by Taylor, (/2 Z 0 )
[/225 , /2]), otherwise it is calculated by CORDIC. is normalized. Whether or not Taylor is selected, CORDIC
The number of iterations, as also the bit-width of calculation always works to calculate sine and cosine.
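The need for the wide adder can be visualized with a fixed-point model. The sketch below (our own illustration; the 52 + 64 fraction bits and the value of the pi/2 constant are modelling assumptions) converts a double close to pi/2 to fixed point, performs the subtraction at full width, and reports how many leading bits cancel versus how many significant bits remain for the Taylor path.

```python
import math
import struct

FRAC = 52 + 64                 # fraction bits carried through the subtraction ("52 + 64 bits")

def to_fixed(z, frac=FRAC):
    """Exact conversion of a positive normal double to fixed point with `frac` fraction bits."""
    bits = struct.unpack("<Q", struct.pack("<d", z))[0]
    exp = ((bits >> 52) & 0x7FF) - 1023
    mant = (bits & ((1 << 52) - 1)) | (1 << 52)      # implicit leading 1
    shift = frac - 52 + exp
    return mant << shift if shift >= 0 else mant >> -shift

HALF_PI = to_fixed(math.pi / 2)                      # pi/2 as a wide fixed-point constant (model value)

if __name__ == "__main__":
    for k in (6, 20, 45):
        z = math.pi / 2 - 2.0 ** -k                  # cosine inputs close to pi/2
        comp = HALF_PI - to_fixed(z)                 # pi/2 - Z_0 computed at full width
        cancelled = FRAC - comp.bit_length()         # leading bits lost to cancellation
        print(f"k={k:2d}: {cancelled} leading bits cancel, "
              f"{comp.bit_length()} significant bits remain for the Taylor path")
```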
4) Compression and Prediction: The compression and prediction module performs rotation direction prediction and compressive calculation in the Z path. To guarantee accurate prediction, correction iterations must be carried out; so, CSA and CLA are mixed in the Z path. CLAs are used at the positions of the correction iterations for summation, and CSAs are used between correction iterations for compression. The micro-rotation angles of the iterations are predefined, so segmented look-up tables are adopted; a group of iterations can therefore be replaced with one addition and one table look-up. Besides, sign prediction makes compressive calculation with CSA possible in the Z path. In the first 32 iterations, correction iterations have to be added at positions i = 1, 4, 12, and hence the prediction module is composed of 3 submodules.

In these submodules, Z_4, Z_33, and Z_12 are calculated and the corresponding rotation directions predicted. The structure of the second prediction submodule, including a look-up table of rotation angles (32 x 64 bits), is shown in Fig. 6(C). The bits Z_4[59:52] are the key elements of this submodule: on the one hand they determine the add/subtract operations in the X and Y paths, and on the other they index the look-up table for arctan_4 and arctan_8. In the third prediction submodule, the bits Z_12[51:30] predict the rotation directions from σ_13 to σ_32; besides, they index a series of rotation angles to be accumulated. The compression structure with CSAs in the third submodule is also shown in Fig. 6(C). Finally, a CLA computes the summation of Z_32, which becomes a multiplier operand for the parallel iterations in CORDIC.

5) Compressive Iteration Multiplexing: The compressive iteration multiplexing module performs the compressive iterations in the X and Y paths for CORDIC. Based on sign prediction, each multiplexing takes 2 clock cycles, and the first 32 compressive iterations multiplex this module 8 times in segmented cycles. Both the inputs and the outputs of the multiplexing module in the X and Y paths are divided into carry and sum, and the outputs become the inputs of the next multiplexing, as shown in Fig. 6(D). The counter provides the multiplexing module with the selection signal S1, which records how many times the module has been multiplexed and accordingly determines the number of bits to shift.

Fig. 7. The structure of a 4-stage compressive iteration.

The multiplexing module is a 4-stage compressive iteration, as shown in Fig. 7. Since CSA is better than CLA in terms of area and delay, the CLAs are replaced with CSAs to complete the compressive iterations.

6) Truncated Multiplier Multiplexing: The truncated multiplier multiplexing module calculates the 5 expansion items for Taylor and the parallel iterations for CORDIC. To ensure that the 53 high bits of the final results are accurate, a 53x53-bit fixed-point truncated multiplier with a 59-bit output, consisting of 6 guard bits and 53 accurate bits, is adopted. In CORDIC, the 32 high bits of Z_33 have become the same as sign(Z_33) after 32 compressive iterations, and only its 32 low bits participate in the subsequent multiplication. To guarantee that the 32 high bits of the truncated multiplier output are accurate, a 32x32-bit truncated multiplier outputting 38 bits, consisting of 6 guard bits and 32 accurate bits, would suffice; considering that one truncated multiplier is multiplexed for both CORDIC and Taylor, the 53x53-bit truncated multiplier outputting 59 bits is adopted (see Fig. 6(E)), composed of 6 guard bits and 53 accurate bits.

After preprocessing, if Taylor is not selected (sel|sel' = 0), the truncated multiplier remains free during the first 16 clock cycles. If sel|sel' is 1, the truncated multiplier is multiplexed 6 times to calculate the first 5 expansion items for Taylor. A multiplication requires 2 clock cycles, and an addition plus a shift together need one clock cycle. The first step calculates M_Z^2 in the first 2 clock cycles, and the second step circulates 4 times; each circulation needs 4 clock cycles in the order multiplication, shift, addition. The last step multiplies the intermediate result by M_Z to obtain the final result (see Fig. 6(E)).

7) Post-Processing: The post-processing module, shown in Fig. 6(F), normalizes the final results and selects sin(Z) and cos(Z) from the outputs of the Taylor and CORDIC paths. The normalization module transforms the final results from fixed-point form to floating-point form. In addition, the final results are selected from the outputs of the Taylor and CORDIC paths according to the valid signals sel and sel'.

B. Pipelined Structure

1) Optimization of the Calculating Boundary (N): In the pipelined structure, the direct method of calculating the Taylor expansion is adopted. For calculating c_min items of the Taylor expansion, the required number of multipliers is

  m = 2 c_min - 1  (c_min > 1).                                (18)

Fig. 9. Area influence of N in TCORDIC, where N is the calculating boundary, m is the number of multipliers to calculate the Taylor expansion, and n_min is the number of iterations and bit-width of operands in CORDIC.

With L = 52 bits to be guaranteed, (13), (14), and (18) are evaluated. As shown in Fig. 9, m and n_min change with N, and the trends are diametrically opposite, so there is an optimal value of N that achieves the least area. At N = 1, 2, 3, 4, 6, 8, and 13 the number of iterations is the minimum for a given number of multipliers, so the choice is made among these points. When N < 13, the number of iterations increases with increasing N, but the number of multipliers decreases rapidly. On the other hand, when N > 13, decreasing the number of multipliers results in a large increase in the number of iterations. Therefore, N = 13 is the most suitable value of the calculating boundary in TCORDIC.
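The opposite trends of m and n_min can be reproduced with the same sizing relations used earlier. The short table below (an illustration under the assumption M_Z = 2) shows how the multiplier count drops while the iteration count grows as N increases, reaching m = 3 and n_min = 73 at N = 13, which matches the parameters used in the pipelined structure.

```python
from math import factorial, log2

L = 52

def n_min(N):
    n = L + 2
    while L + 1 + N + log2(n) > n:           # eq. (13)
        n += 1
    return n

def c_min(N, m_z=2.0):
    c = 1
    while (m_z ** (2 * c + 1) * 2.0 ** (-(2 * c + 1) * N)
           / factorial(2 * c + 1)) >= 2.0 ** (-N - L):   # eq. (14)
        c += 1
    return c

if __name__ == "__main__":
    print(" N   n_min  c_min  multipliers m = 2*c_min - 1")
    for N in (4, 6, 8, 13, 20):
        n, c = n_min(N), c_min(N)
        print(f"{N:2d}   {n:5d}  {c:5d}  {2 * c - 1:5d}")
```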
2) Parameters and Process Illustrations: The pipelined structure is shown in Fig. 8. According to the above analysis, the key parameters of the optimized pipelined structure are as follows:
• Guaranteed precision L = 52 bits.
• Calculating boundary N = 13. For sine, the result is computed by Taylor when E_offset - E_Z > 13 (i.e., Z in [0, 2^{-13}]), otherwise by CORDIC. For cosine, it is computed by Taylor when E_offset - E_Z' > 13 (i.e., Z in [pi/2 - 2^{-13}, pi/2]), otherwise by CORDIC.
• The number of iterations, equal to the bit-width of operands in CORDIC, n_min = 73.
• The number of Taylor expansion items c_min = 2.

Fig. 8. The overall pipelined structure.

In the CORDIC path, the first 36 iterations are completed by compressive iterations. The last 36 iterations are completed by parallel iterations, which require two 39x39-bit truncated multipliers. In addition, the sign prediction scheme consists of three levels, predicting the directions for the 1st-4th, 5th-12th, and 13th-36th iterations. In the Taylor path, two 34x34-bit truncated multipliers and one 32x32-bit truncated multiplier calculate the first 2 expansion items.

3) Bit-Width of Truncated Multipliers: In the pipelined structure, two expansion items of Taylor need to be calculated with N = 13. The 30 bits of X^3/6 that later participate in the summation must be accurate. The mantissa M_Z is 53 bits, so two 53x53-bit multipliers would be needed to obtain two intermediate results of 106 bits, and X^3/6 would finally be computed with a 106x106-bit multiplier.

Fig. 10. Bit-width simplification.

As shown in Fig. 10, the two 53x53-bit multipliers are replaced with two MxM-bit truncated multipliers, and the intermediate results have M1 bits instead of 106 bits, with (M1 - M2) guard bits. The small circles filled with gray in Fig. 10 represent the error generated by the truncation, and formula (19) ensures the accuracy of M2 bits in the MxM-bit truncated multipliers:

  M 2^{-(M1-2)} < 2^{-(M2-2)}.                                 (19)

Similarly, the 106x106-bit multiplier is replaced with an M2xM2-bit truncated multiplier whose output has L1 bits, including (L1 - 30) guard bits. Formula (20) must be satisfied to guarantee 30 accurate bits:

  M2 2^{-(L1-4)} < 2^{-(30-4)}.                                (20)

The 53 bits of M_Z are replaced with M bits, and the round-off error of M_Z is less than 2^{-(M-1)}. As shown in Fig. 10, the unfilled circles represent round-off error in the multiplication, and the error is less than 2^{-(M-1)} 2^{-(M-1)} + 2 * 2^{-1} * 2^{-(M-1)}. If 2^{-(M-1)} is not negligible relative to 2^{-(M2-2)}, formula (21) is required to guarantee precision:

  2 * 2^{-1} * 2^{-(M-1)} < 2^{-(M2-2)}.                       (21)

Similarly, the 106 accurate bits are replaced with M2 accurate bits, which generates round-off error in the intermediate results. If 2^{-(M2-2)} is not negligible relative to 2^{-26}, formula (22) needs to be satisfied to meet the precision:

  2 * 2^{-2} * 2^{-(M2-2)} < 2^{-26}.                          (22)

By solving (19), (20), (21), and (22), M2 >= 32, L1 >= 36, M >= 34, and M1 >= 38 are obtained.
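The error mechanism behind these constraints is that a truncated multiplier never forms the partial-product bits below its kept output columns, so its error is bounded by (number of partial products) times the weight of the first discarded column. The sketch below is a behavioural model of such a multiplier (the widths 34 input bits and 38 output bits are taken from the M and M1 values above as assumptions) and checks that the observed error stays below both that bound and the 2^{-(M2-2)} target.

```python
import random

def truncated_multiply(a, b, width, out_bits):
    """Unsigned fixed-point truncated multiplier model.

    a, b are `width`-bit fractions (value x / 2**width).  Partial-product bits
    below the `out_bits` most-significant output columns are never formed,
    which is where the area saving comes from; the truncation error is below
    width * 2**-out_bits.
    """
    cut = 2 * width - out_bits                       # weight of the lowest kept column
    acc = 0
    for i in range(width):
        if (b >> i) & 1:
            acc += ((a << i) >> cut) << cut          # partial product with its tail discarded
    return acc / 4.0 ** width                        # back to a real value in [0, 1)

if __name__ == "__main__":
    random.seed(3)
    width, out_bits = 34, 38                         # assumed M = 34 inputs, M1 = 38 output bits
    worst = 0.0
    for _ in range(20000):
        a, b = random.getrandbits(width), random.getrandbits(width)
        exact = (a * b) / 4.0 ** width
        worst = max(worst, exact - truncated_multiply(a, b, width, out_bits))
    print("worst observed truncation error:", worst)
    print("bound width * 2^-out_bits      :", width * 2.0 ** -out_bits)
    print("target 2^-(M2-2) with M2 = 32  :", 2.0 ** -30)
```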
In the CORDIC path, the number of iterations, equal to the bit-width of operands, is 73. The 36 high bits of Z_37 have all become the same as sign(Z_37) after the first 36 iterations, and only the 36 low bits participate in the subsequent multiplication. An M3xM3-bit truncated multiplier outputting 36 accurate bits is adopted. As in the foregoing analysis, the round-off error of the multiplication needs to satisfy 2^{-M3} * 2 * 2^{-1} < 2^{-36}. Therefore, a 39x39-bit truncated multiplier outputting 42 bits, including 6 guard bits, is adopted.

4) CORDIC Path: As shown in Fig. 8, the third prediction submodule calculates Z_37 instead of Z_33, and the rest of the Z path is almost the same as in the iterative structure. To reduce area and delay, the first 8 iterations [18] in the X and Y paths are completed by look-up table instead of compression. The remaining compressive iterations are divided into 14 ROTs, each of which is a 2-stage compressive iteration module. Two 39x39-bit fixed-point truncated multipliers are adopted, outputting 42 bits, including 6 guard bits and 36 accurate bits.

5) Taylor Path: The Taylor path consists of 3 truncated multipliers and a CLA. The outputs of the two 34x34-bit truncated multipliers are intermediate results, each of 38 bits. With the 6 guard bits abandoned, the remaining 32 bits enter the multiplication in the 32x32-bit truncated multiplier. The output of the 32x32-bit multiplier is 36 bits, and its 30 high bits participate in the final addition with M_Z.

V. EXPERIMENTAL RESULTS

A. Synthesis Results

TABLE I. SYNTHESIZED RESULTS WITH 40 nm STANDARD CELL LIBRARY.

Our designs are synthesized with a 40 nm standard cell library under typical conditions (1 V, 25 °C). The clock cycle constraint is 700 ps for the iterative structure and 600 ps for the pipelined structure, with 150 ps of input delay and 150 ps of output delay. The synthesized results of the pipelined and iterative structures are shown in Table I. For the pipelined structure, the CORDIC path and Taylor path occupy areas of 176952.46 and 17097.18 μm², respectively.

B. Accuracy Analysis

Because of the finite bit-width of operands and number of iterations in CORDIC, the relative error of floating-point sine or cosine is very large when the input is close to 0 or pi/2, respectively. Jie [6] reported a relative error of 2^{-32} with a small input for double precision floating-point sine computation, and the maximum relative error would be larger for an even smaller input.

The following comparison reflects the accuracy improvement of our TCORDIC clearly. The number of inaccurate mantissa bits measures the relative error of the floating-point computation quantitatively, and the input is limited to the range close to 0 or pi/2. Because of the symmetry of sine and cosine, the number and trend of error bits for sine with Z close to 0 are similar to those for cosine with Z close to pi/2.

TCORDIC guarantees faithful rounding (i.e., an error smaller than one ulp) for any angle in [0, pi/2]. Fig. 11 compares the relative error of sine or cosine between traditional CORDIC and TCORDIC when the input Z is close to 0 or pi/2. The number of iterations and the bit-width of operands in traditional CORDIC are 64, the same as in TCORDIC. Five hundred points are taken uniformly with log2(Z) in [-25, -5] to compute sine, with the input Z correspondingly varying from 0x3e60000000000000 to 0x3fa0000000000000. Similarly, five hundred points are taken uniformly with log2(pi/2 - Z) in [-25, -5] to compute cosine, with Z correspondingly varying from 0x3ff911fb54442d18 to 0x3ff921fb53442d18.

Fig. 11. Comparison of error bits of sine/cosine.

As shown in Fig. 11, the error of traditional CORDIC is large for a fixed bit-width of operands and number of iterations, and the error bits of sine or cosine increase quickly as the input angle approaches 0 or pi/2, respectively. The error of sine or cosine of TCORDIC is no more than one error bit with the supplement of Taylor.
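The error-bit measure used in Fig. 11 can be expressed as the error in ulps of the reference value. The sketch below shows one way such a harness could be written (requires Python 3.9+ for math.ulp); the perturbed values are synthetic, since in the paper the reference is a high-precision model and the measured value is the hardware output.

```python
import math

def ulp_error(computed, reference):
    """Error of `computed` in units of the ulp of `reference` (both doubles)."""
    return abs(computed - reference) / math.ulp(reference)

def error_bits(computed, reference):
    """Number of wrong low-order mantissa bits: ceil(log2(error in ulps)),
    with 0 meaning the result is faithfully rounded (error <= 1 ulp)."""
    e = ulp_error(computed, reference)
    return 0 if e <= 1.0 else math.ceil(math.log2(e))

if __name__ == "__main__":
    z = 2.0 ** -20
    ref = math.sin(z)
    for k in (0, 2, 8, 4096):
        computed = ref + k * math.ulp(ref)       # synthetic k-ulp perturbation
        print(f"perturbation {k:5d} ulp -> error bits {error_bits(computed, ref)}")
```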
Other related works implemented fixed-point sine and cosine functions based on CORDIC and focused on maximum absolute error or mean square error (MSE). In [16], the maximum absolute error was 8.04x10^{-4} and 5.50x10^{-4} for the cosine and sine functions with angles ranging from -pi/2 to pi/2 in steps of pi/500, respectively. Daniel [32] used 1000 inputs ranging from 0 to pi/2 to compute the MSE of sine and cosine and reported MSE = 1.4x10^{-13}.

C. Delay and Area Comparisons

The area and delay of designs differ with technology generation and design parameters; therefore, normalized metrics are adopted for the comparisons. To avoid the effect of technology generation, the comparisons are based on function components, and the function components of the compared references are synthesized with the same constraints. Table II is reasonably assumed according to the synthesis results. The unit delay and area are set to the delay and area of a full adder, denoted T_FA and A_FA, respectively. To avoid the effect of parameters, the bit-width of operands and the number of iterations in the compared CORDIC methods are unified to 64.

TABLE II. UNIFIED PARAMETERS OF COMPONENTS.

The area and delay of a look-up table (2^f x n) are taken as 2^f (2n/8) 0.5 A_FA and log2(f) T_FA/2, respectively [18]. The rotation angle look-up table0 (2^4 x n) is in the Z path of our structures. The scale factor look-up table1 (9 x 8^{n/16-1} x n) provides initial values in the K path of Rohit [16]. The initial iteration value look-up table2 (2^{n/8-1} x 2 x n) is in both Tso's [18] and our structures, replacing the first n/8 - 1 iterations in the X and Y paths. The scale factor look-up table3 (3^{n/12+1} x n) provides initial values in the K path of Antelo [22].

TABLE III. DELAY AND AREA COMPARISONS.

According to the above estimations and the calculations in Table II and the preceding paragraph, different CMOS processes are taken into consideration for a fair comparison. The delay and area of our TCORDIC, the polynomial approximation method, and various CORDIC methods are compared in Table III.

Polynomial approximation converges quickly for a small variable but slowly for a larger one. So, in high precision arithmetic, even though the factorials can be tabulated, this approach still requires many multiplications and additions [4]. In Florent's polynomial-based architectures [34], the coefficient sizes of the polynomials have been optimized to achieve the minimum overhead. The computation of cosine or sine alone requires 38 block RAMs (18K bits), 15 DSP blocks (18x18-bit), and 672 slices on a Virtex IV, and the area has been roughly transformed into Table III. Apparently, the equivalent area is larger than that of the CORDIC methods, because the polynomial approximation method requires a great mass of DSP blocks and memory blocks to implement the multiplications and look-up tables.

Actually, the main overhead of our TCORDIC is the CORDIC path. In the iterative structure, the multiplexed truncated multiplier calculates the Taylor expansion items during the compressive iterations of CORDIC, without increasing the area or affecting the delay. In the pipelined structure, the Taylor path requires three truncated multipliers based on Booth encoding and CSA. This adds an area of 3 (M2/2) (M2/2) A_FA + 3 A_booth, which is only 1776 A_FA with M2 = 32. Besides, the truncated multipliers are not in the critical path. Therefore, the following paragraphs focus on the performance analysis of our implementations compared with various CORDIC methods.

Furthermore, taking T_reg = 0.5 T_FA and A_reg = 0.62 A_FA [24], the delay and area of the various CORDIC methods are summarized in units of T_FA and A_FA, respectively, as shown in Figs. 12 and 13.

Fig. 12. Delay of various CORDIC methods.

Fig. 13. Area of various CORDIC methods.

1) Comparisons of Pipelined Structure: In traditional CORDIC [12], the area and delay are much greater than those of the other designs because of the large number of iterations and the use of CLA.

Rohit [16] proposes a hybrid algorithm of double step branching CORDIC [17] and high performance radix-4 CORDIC [22], with the scale factor compensation performed by the algorithm presented in [23], reducing the number of iterations to 3n/8 + 1. However, the scale factor look-up table1 occupies 18432 A_FA, making the area of the structure much larger than the others. In addition, each iteration in the X and Y paths requires 6 redundant adders, which increases the cost and delay.

In Tso's Parallel CORDIC [18], the input Z is separated into L and H, and rotation direction prediction is completed through the MAR and BBR techniques. In the X and Y paths, the first n/8 - 1 iterations are computed by look-up table2, and the second phase adopts multi-operand carry-save addition tree architectures. However, 64 operands of 64 bits must be shifted and compressed in each of the X and Y paths, which results in a great area and delay.
In Kuhlmann's work [19], the directions of all micro-rotations are precomputed. Despite eliminating the sign examination, the scale factor calculation, and the compensation completely, the area is still vast because of the large number of operands to be compressed in the X and Y paths. The delay of the critical path is large because of T_selection, the delay of selecting the multiple of the shifted components. In addition, an initial delay of 15.6 T_FA has to be added because of the on-the-fly converter.

Combining the precomputation of all rotation directions in [19], Lakshmi [21] proposed an architecture using signed-digit arithmetic for the rotational radix-4 CORDIC algorithm. The delay is larger than ours owing to the scale factor compensation and the initial delay of 11.3 T_FA for the σ-prediction block. In addition, the area is vast owing to the scale factor calculation and compensation.

In Antelo's mixed radix-4 and redundant CORDIC [22], both delay and area are much greater than ours. The delay for W_{i+1}, which lies in the critical path, is up to 3 T_FA. In addition, each of the first n/6 iterations requires a clock cycle to wait for the rotation directions in the W path, resulting in a large latency. In regard to area, it is large owing to the scale factor look-up table3 and the compensation.

In Jaime's enhanced scaling-free CORDIC [24], a specific Booth recoding is used and the corresponding hardware for the Z path can be omitted. Despite the reduced number of micro-rotations, the introduction of traditional CORDIC iterations and the use of CLA lead to great cost in area and delay.

In our TCORDIC, the delay and area are the least among the compared designs. Rotation direction prediction for n/2 iterations requires only A_CSA + 3 A_table0 + 2 A_64_CLA. The parallel iterations of the second phase are implemented with two truncated multipliers based on Booth encoding and CSA. Calculating the multiplier operand Z_33 in the Z path and the Booth encoding require 744 A_FA and 336 A_FA, respectively. But the number of operands to be compressed by CSA is only a quarter of that of integrated multipliers, and the bit-width of operands is further optimized; so the area for compression is only one sixteenth of that of integrated multipliers, and the delay is reduced by 4 T_FA.

2) Comparisons of Iterative Structure: In traditional CORDIC [12], the area is the least among the compared designs, but the delay is unacceptable.

In the hybrid algorithm [16], the area and delay are large owing to the scale factor look-up table1 and the large number of redundant adders.

In radix-4 CORDIC [22], the area is 152 A_FA less than ours, but the delay is unacceptable. Despite the zero-skipping technique, the average number of iterations is still 4n/5. The critical path is 6.7 T_FA in the W path, including 1.3 T_FA for the scale factor look-up table3 and 3.5 T_FA for the skipping judgement.

In our TCORDIC, a 4-stage compressive iteration module is multiplexed. Only the delay increases, by 8 T_MUX_8 in the critical path, but the area for the compressive iterations shrinks to one eighth of that of the pipelined structure. A 53x53-bit truncated multiplier is multiplexed, and the parallel iterations are completed with an increase of only 3 clock cycles compared with our pipelined structure.

D. Power Analysis

TABLE IV. POWER AND ACTIVITY WITH DIFFERENT TESTING SETS.

The power consumption of a digital circuit is a function of the silicon area at a fixed frequency and operating voltage [25]. The power consumption of our architecture and of the traditional architecture is measured with a 1 GHz clock, a 0.9 V supply voltage, and a large set of random inputs (SG_r), yielding the power and activity in Table IV. Compared with traditional CORDIC, TCORDIC achieves lower power consumption for the pipelined structure; the reverse holds for the iterative structure. The trend of power consumption between traditional CORDIC and TCORDIC follows the trend of area illustrated in Fig. 13.

The area comparisons of our TCORDIC, the polynomial approximation method, and various CORDIC methods have already been provided in Table III. From the relationship between area and power consumption, we can expect that our pipelined architecture will consume less power than [12], [16], [18], [19], [21], [22], [24], and [34]. For the iterative structure, the power consumption of our TCORDIC is less than [16] and almost the same as [22].

Table IV also shows that there is no significant change in power consumption between TCORDIC, which involves the Taylor algorithm, and our pure CORDIC.

Six testing sets (SG_1, SG_2, SG_3, SG_4, SG_5, and SG_6) of many inputs are applied to evaluate the power consumption of our designs with SpyGlass. All the inputs of the testing sets SG_1 and SG_4 satisfy E_offset - E_Z <= N, so Taylor is not selected. All the inputs of the testing sets SG_2 and SG_5 satisfy E_offset - E_Z > N, so Taylor is selected. Half of the inputs of the testing sets SG_3 and SG_6 satisfy E_offset - E_Z > N, so the probability of selecting Taylor is 50%.

In the iterative structure, the power consumption of the TCORDIC computation is 7.5% greater than that of pure CORDIC. When Taylor is not selected, the inputs of the multiplexed multiplier remain unchanged during the first 16 clock cycles.

In the pipelined structure, the power consumption of the TCORDIC computation is 2.0% greater than that of pure CORDIC. When Taylor is not selected, the inputs of the Taylor path remain unchanged. When Taylor is selected, only half of the parallel iteration in the CORDIC path is used for the calculation, and the other half remains unchanged.
VI. CONCLUSION

In this paper, the TCORDIC algorithm, which combines low latency CORDIC and the Taylor algorithm, is presented to improve the accuracy of floating-point sine/cosine computation and to guarantee accuracy within one ulp even when the input is close to 0 or pi/2. Several schemes, namely sign prediction, parallel iteration, and truncated multipliers, are employed to reduce the delay of low latency CORDIC. Finally, taking IEEE-754 double precision floating-point sine and cosine functions as an example, iterative and pipelined structures are implemented, and the advantages of low latency and low error are achieved.

APPENDIX A
PROOF OF THE SIMPLIFICATION OF EQUATION (8)

When m >= n/2 + 1, equation (8) can be simplified from equation (7) with the error of X_n or Y_n less than 2^{-n}; the following proof of this simplification proceeds in three steps.

Step one: computation of the omitted value in A_{m,n}. Write A_{m,n} = 1 + A', where A' is the omitted value:

  A' = -Σ_i Σ_j σ_i σ_j 2^{-i-j} + Σ_i Σ_j Σ_k Σ_l σ_i σ_j σ_k σ_l 2^{-i-j-k-l} - ... + (-1)^t Σ_{i(1)} ... Σ_{i(2t)} σ_{i(1)} ... σ_{i(2t)} 2^{-i(1)-...-i(2t)}

  |A'| <= |Σ_i Σ_j σ_i σ_j 2^{-i-j}| + |Σ_i Σ_j Σ_k Σ_l σ_i σ_j σ_k σ_l 2^{-i-j-k-l}| + ... + |Σ_{i(1)} ... Σ_{i(2t)} σ_{i(1)} ... σ_{i(2t)} 2^{-i(1)-...-i(2t)}|
       <= Σ_i Σ_j 2^{-i-j} + Σ_i Σ_j Σ_k Σ_l 2^{-i-j-k-l} + ... + Σ_{i(1)} ... Σ_{i(2t)} 2^{-i(1)-...-i(2t)}.

The following sub-items 1), 2), and 3) bound Σ_{i(1)} ... Σ_{i(2t)} 2^{-i(1)-...-i(2t)} for t = 1, t = 2, and t >= 3, respectively.

1) The computation of Σ_i Σ_j 2^{-i-j}: with m - 1 < i < j < n, Σ_i Σ_j 2^{-i-j} consists of C(n-m, 2) terms when the iteration formulas between the m-th and the (n-1)-th iterations are expanded.
a) With i = m and j = m+1, m+2, ..., n-1: Σ_j 2^{-m-j} = 2^{-2m} - 2^{-(m+n-1)}.
b) With i = m+1 and j = m+2, m+3, ..., n-1: Σ_j 2^{-m-1-j} = 2^{-2m-2} - 2^{-(m+n)}.
c) With i = m, m+1, ..., n-2:

  Σ_i Σ_j 2^{-i-j} = Σ_j 2^{-m-j} + Σ_j 2^{-m-1-j} + ... + Σ_j 2^{-(n-2)-j}
    = (2^{-2m} - 2^{-(m+n-1)}) + (2^{-2m-2} - 2^{-(m+n)}) + ... + (2^{-(2n-4)} - 2^{-(2n-3)}).

While m >= n/2 + 1,

  Σ_i Σ_j 2^{-i-j} <= (2^{-(n+2)} - 2^{-3n/2}) + (2^{-(n+4)} - 2^{-(3n/2+1)}) + ... + (2^{-(2n-4)} - 2^{-(2n-3)}).

2) The computation of Σ_i Σ_j Σ_k Σ_l 2^{-i-j-k-l}: this sum involves C(n-m, 4) terms. With m >= n/2 + 1, the maximum value of C(n-m, 4) is (n/2-1)(n/2-2)(n/2-3)(n/2-4)/4!, and the maximum value of 2^{-i-j-k-l} is 2^{-(n/2+1 + n/2+2 + n/2+3 + n/2+4)}. Considering the function f(n) = 2^{n/8+4} - n, f'(n) = 2^{n/8+4} (ln 2)/8 - 1 > 0 for n >= 0 and f(0) = 2^{4} > 0, so n/2^{n/8+4} < 1 and (n/2^{n/8+4})^4 < 1 for n >= 0. Hence

  Σ_i Σ_j Σ_k Σ_l 2^{-i-j-k-l} <= (n/2-1)(n/2-2)(n/2-3)(n/2-4)/4! * 2^{-2n-10}
    = (n/2-1)(n/2-2)(n/2-3)(n/2-4)/(4! 2^{n/2+10}) * 2^{-3n/2}
    < n^4/2^{n/2+16} * 2^{-3n/2} = (n/2^{n/8+4})^4 * 2^{-3n/2} < 2^{-3n/2}.

3) The computation of Σ_{i(1)} ... Σ_{i(2t)} 2^{-i(1)-...-i(2t)} for t >= 3: considering the function g(n) = 2^{n/6+2} - n, g'(n) = 2^{n/6+2} (ln 2)/6 - 1 > 0 for n >= 7 and g(7) = 2^{7/6+2} - 7 > 0, so n/2^{n/6+2} < 1 when n >= 7. With (n/2 - 1)/2 >= (n - m)/2 >= t >= 3,

  Σ_{i(1)} ... Σ_{i(2t)} 2^{-i(1)-...-i(2t)} <= (n/2-1)(n/2-2)...(n/2-2t)/(2t)! * 2^{-tn - Σ_{k=1}^{2t} k}
    = 2^{-2n} (n/2-1)(n/2-2)...(n/2-2t)/((2t)! 2^{(t-2)n + Σ_{k=1}^{2t} k})
    < 2^{-2n} n^{2t}/((2t)! 2^{(t-2)n + Σ_{k=1}^{2t} k + 2t}) < 2^{-2n} n^{2t}/2^{(t-2)n + 4t}
    = 2^{-2n} (n/2^{(1/2 - 1/t)n + 2})^{2t} < 2^{-2n} (n/2^{n/6+2})^{2t} < 2^{-2n}.

Summing the three cases, and noting that at most (n/2-1)/2 - 2 values of t fall under case 3),

  |A'| <= Σ_i Σ_j 2^{-i-j} + Σ_i Σ_j Σ_k Σ_l 2^{-i-j-k-l} + ... + Σ_{i(1)} ... Σ_{i(2t)} 2^{-i(1)-...-i(2t)}
     < {(2^{-(n+2)} - 2^{-3n/2}) + (2^{-(n+4)} - 2^{-(3n/2+1)}) + ... + (2^{-(2n-4)} - 2^{-(2n-3)})} + 2^{-3n/2} + ((n/2-1)/2 - 2) 2^{-2n}
     < (2^{-(n+2)} + 2^{-(n+4)} + ... + 2^{-(2n-4)}) + {2^{-3n/2} + ((n/2-1)/2 - 2) 2^{-2n} - 2^{-3n/2} - 2^{-(3n/2+1)} - ... - 2^{-(2n-3)}}
     < 2^{-(n+2)} + 2^{-(n+4)} + ... + 2^{-(2n-4)} < 2^{-n-1}.
Therefore, except for the first item 1, the maximum sum of the other items in A_{m,n} is less than 2^{-n-1}.

Step two: computation of the omitted value in B_{m,n}. Write B_{m,n} = Σ_i σ_i 2^{-i} + B', where B' is the omitted value:

  B' = -Σ_i Σ_j Σ_k σ_i σ_j σ_k 2^{-i-j-k} + ... + (-1)^t Σ_{i(1)} ... Σ_{i(2t+1)} σ_{i(1)} ... σ_{i(2t+1)} 2^{-i(1)-...-i(2t+1)}

  |B'| <= |Σ_i Σ_j Σ_k σ_i σ_j σ_k 2^{-i-j-k}| + ... + |Σ_{i(1)} ... Σ_{i(2t+1)} σ_{i(1)} ... σ_{i(2t+1)} 2^{-i(1)-...-i(2t+1)}|
       <= Σ_i Σ_j Σ_k 2^{-i-j-k} + ... + Σ_{i(1)} ... Σ_{i(2t+1)} 2^{-i(1)-...-i(2t+1)}.

With (n/2 - 1)/2 >= (n - m)/2 >= t >= 1,

  Σ_{i(1)} ... Σ_{i(2t+1)} 2^{-i(1)-...-i(2t+1)} <= (n/2-1)(n/2-2)...(n/2-2t-1)/(2t+1)! * 2^{-(2t+1)n/2 - Σ_{k=1}^{2t+1} k}
    = 2^{-n-1} (n/2-1)(n/2-2)...(n/2-2t-1)/((2t+1)! 2^{(2t-1)n/2 + Σ_{k=1}^{2t+1} k - 1})
    < (2^{-n-1}/(2t+1)!) n^{2t+1}/2^{(2t-1)n/2 + 4t + 2}
    = (2^{-n-1}/(2t+1)!) (n/2^{(1 - 2/(2t+1))n/2 + 2})^{2t+1}
    < (2^{-n-1}/(2t+1)!) (n/2^{n/6+2})^{2t+1} < 2^{-n-1}/(2t+1)!  when n >= 7, by the function g(n) above.

Hence

  |B'| < Σ_{t>=1} 2^{-n-1}/(2t+1)! < 2^{-n-2}.

Thus, except for the first item Σ_i σ_i 2^{-i}, the maximum sum of the other items in B_{m,n} is less than 2^{-n-2}.

Step three: computation of the error of X_n or Y_n. With |X_m| <= 1 and |Y_m| <= 1, the error of X_n is ΔX:

  ΔX = {X_m A_{m,n} - Y_m B_{m,n}} - {X_m * 1 - Y_m Σ_i σ_i 2^{-i}}
     = {X_m (1 + A') - Y_m (Σ_i σ_i 2^{-i} + B')} - {X_m - Y_m Σ_i σ_i 2^{-i}}
     = X_m A' - Y_m B' <= |X_m A'| + |Y_m B'| < 2^{-n-1} + 2^{-n-2} < 2^{-n}.

Accordingly, the error of X_n is less than 2^{-n}. Similarly, the error of Y_n is less than 2^{-n}.

ACKNOWLEDGMENT

The authors would like to thank the associate editor and the anonymous reviewers for their suggestions. This work is supported by the National Natural Science Foundation of China under Grant 61402499.

REFERENCES

[1] S. Aggarwal, P. K. Meher, and K. Khare, "Scale-free hyperbolic CORDIC processor and its application to waveform generation," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 60, no. 2, pp. 314-326, Feb. 2013.
[2] J. Chen, Y. Lei, Y. Peng, T. He, and Z. Deng, "Configurable floating-point FFT accelerator on FPGA based multiple-rotation CORDIC," Chin. J. Electron., vol. 25, no. 6, pp. 1063-1070, 2016.
[3] P. T. P. Tang, "Table-lookup algorithms for elementary functions and their error analysis," in Proc. 10th Symp. Comput. Arithmetic, 1991, pp. 232-236.
[4] V. Kantabutra, "On hardware for computing exponential and trigonometric functions," IEEE Trans. Comput., vol. 45, no. 3, pp. 328-339, Mar. 1996.
[5] D. Wang, J.-M. Muller, N. Brisebarre, and M. D. Ercegovac, "(M, p, k)-friendly points: A table-based method to evaluate trigonometric function," IEEE Trans. Circuits Syst. II, Express Briefs, vol. 61, no. 9, pp. 711-715, Sep. 2014.
[6] J. Zhou, Y. Dou, Y. Lei, J. Xu, and Y. Dong, "Double precision hybrid-mode floating-point FPGA CORDIC co-processor," in Proc. 10th IEEE Int. Conf. High Perform. Comput. Commun. (HPCC), Aug. 2008, pp. 182-189.
[7] M. Chakraborty, A. S. Dhar, and M. H. Lee, "A trigonometric formulation of the LMS algorithm for realization on pipelined CORDIC," IEEE Trans. Circuits Syst. II, Express Briefs, vol. 52, no. 9, pp. 530-534, Sep. 2005.
[8] T.-Y. Sung, "Memory-efficient and high-speed split-radix FFT/IFFT processor based on pipelined CORDIC rotations," Proc. IEE Vis. Image Signal Process., vol. 153, no. 4, pp. 405-410, Aug. 2006.
[9] L. Cordesses, "Direct digital synthesis: A tool for periodic wave generation (part 1)," IEEE Signal Process. Mag., vol. 21, no. 4, pp. 50-54, Jul. 2004.
[10] Y. Wang and S. Butner, "A new architecture for robot control," in Proc. IEEE Int. Conf. Robot. Autom., vol. 4, Mar. 1987, pp. 664-670.
[11] T. Lang and E. Antelo, "High-throughput CORDIC-based geometry operations for 3D computer graphics," IEEE Trans. Comput., vol. 54, no. 3, pp. 347-361, Mar. 2005.
[12] J. E. Volder, "The CORDIC trigonometric computing technique," IRE Trans. Electron. Comput., vol. EC-8, no. 3, pp. 330-334, 1959.
[13] P. K. Meher, J. Valls, T.-B. Juang, K. Sridharan, and K. Maharatna, "50 years of CORDIC: Algorithms, architectures, and applications," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 56, no. 9, pp. 1893-1907, Sep. 2009.
[14] J. E. Volder, "The birth of CORDIC," J. VLSI Signal Process. Syst. Signal, Image Video Technol., vol. 25, no. 2, pp. 101-105, 2000.
[15] M. Garrido, P. Källström, M. Kumm, and O. Gustafsson, "CORDIC II: A new improved CORDIC algorithm," IEEE Trans. Circuits Syst. II, Express Briefs, vol. 63, no. 2, pp. 186-190, Feb. 2016.
[16] R. Shukla and K. C. Ray, "Low latency hybrid CORDIC algorithm," IEEE Trans. Comput., vol. 63, no. 12, pp. 3066-3078, Dec. 2014.
[17] D. S. Phatak, "Double step branching CORDIC: A new algorithm for fast sine and cosine generation," IEEE Trans. Comput., vol. 47, no. 5, pp. 587-602, May 1998.
[18] T.-B. Juang, S.-F. Hsiao, and M.-Y. Tsai, "Para-CORDIC: Parallel CORDIC rotation algorithm," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 51, no. 8, pp. 1515-1524, Aug. 2004.
[19] M. Kuhlmann and K. K. Parhi, "P-CORDIC: A precomputation based rotation CORDIC algorithm," J. Appl. Signal Process., vol. 2002, no. 9, pp. 1-8, 2002.
[20] M. D. Ercegovac and T. Lang, "Redundant and on-line CORDIC: Application to matrix triangularization and SVD," IEEE Trans. Comput., vol. 39, no. 6, pp. 725-740, Jun. 1990.
[21] B. Lakshmi and A. S. Dhar, "VLSI architecture for low latency radix-4 CORDIC," Comput. Electr. Eng., vol. 37, no. 6, pp. 1032-1042, 2011.
[22] E. Antelo, J. Villalba, J. D. Bruguera, and E. L. Zapata, "High performance rotation architectures based on the radix-4 CORDIC algorithm," IEEE Trans. Comput., vol. 46, no. 8, pp. 855-870, Aug. 1997.
[23] P. R. Rao and I. Chakrabarti, "High-performance compensation technique for the radix-4 CORDIC algorithm," IEE Proc. Comput. Digit. Techn., vol. 149, no. 5, pp. 219-228, Sep. 2002.
[24] F. J. Jaime, M. A. Sánchez, J. Hormigo, J. Villalba, and E. L. Zapata, "Enhanced scaling-free CORDIC," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 57, no. 7, pp. 1654-1662, Jul. 2010.
[25] K. Maharatna, S. Banerjee, E. Grass, M. Krstic, and A. Troya, "Modified virtually scaling-free adaptive CORDIC rotator algorithm and architecture," IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 11, pp. 1463-1474, Nov. 2005.
[26] S. Y. Park and N. I. Cho, "Fixed-point error analysis of CORDIC processor based on the variance propagation formula," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 51, no. 3, pp. 573-584, Mar. 2004.
[27] X. Hu and S. C. Bass, "A neglected error source in the CORDIC algorithm," in Proc. IEEE ISCAS, May 1993, pp. 766-769.
[28] K. Kota and J. R. Cavallaro, "Numerical accuracy and hardware tradeoffs for CORDIC arithmetic for special-purpose processors," IEEE Trans. Comput., vol. 42, no. 7, pp. 769-779, Jul. 1993.
[29] E. Antelo, J. D. Bruguera, T. Lang, and E. L. Zapata, "Error analysis and reduction for angle calculation using the CORDIC algorithm," IEEE Trans. Comput., vol. 46, no. 11, pp. 1264-1271, Nov. 1997.
[30] C.-H. Lin and A.-Y. Wu, "Mixed-scaling-rotation CORDIC (MSR-CORDIC) algorithm and architecture for high-performance vector rotational DSP applications," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 52, no. 11, pp. 2385-2396, Nov. 2005.
[31] A. Boudabous, F. Ghozzi, M. Kharrat, and N. Masmoudi, "Implementation of hyperbolic functions using CORDIC algorithm," in Proc. IEEE 16th Int. Conf. Microelectron. (ICM), Dec. 2004, pp. 738-741.
[32] D. M. Muñoz, D. F. Sanchez, C. H. Llanos, and M. Ayala-Rincón, "FPGA based floating-point library for CORDIC algorithms," in Proc. 6th Southern Program. Logic Conf., 2010, pp. 55-60.
[33] M. Jridi and A. Alfalou, "Direct digital frequency synthesizer with CORDIC algorithm and Taylor series approximation for digital receivers," Eur. J. Sci. Res., vol. 30, no. 4, pp. 542-553, 2009.
[34] F. de Dinechin, M. Joldes, and B. Pasca, "Automatic generation of polynomial-based hardware architectures for function evaluation," in Proc. 21st IEEE Int. Conf. Appl.-Specific Syst. Archit. Process. (ASAP), Jul. 2010, pp. 216-222.

Baozhou Zhu was born in 1992. He received the B.S. degree in mechanical engineering and automation from South China University of Technology in 2015. He is currently working toward the M.S. degree in microelectronics and solid-state electronics at the National University of Defense Technology (NUDT), China. His research interests include high performance computer architecture and computing engineering.

Yuanwu Lei was born in 1982. He received the M.S. and Ph.D. degrees in computer science from the National University of Defense Technology (NUDT), China, in 2007 and 2012, respectively. His research interests include high performance computer architecture and computing engineering.

Yuanxi Peng was born in 1966. He received the B.S. degree in computer science from Sichuan University, China, in 1988, and the M.S. and Ph.D. degrees in computer science from the National University of Defense Technology (NUDT), China, in 1998 and 2001, respectively. He was a Visiting Professor in the Department of Electronic and Computer Engineering, University of Toronto, Canada, during 2010-2011. He has been a Professor of the Computer School at NUDT since 2011. His research interests are in the areas of high performance computing, multi- and many-core architectures, on-chip networks, cache coherence protocols, and architectural support for parallel programming.

Tingting He was born in 1991. She received the B.S. degree in information systems and information management from Jiangsu University of Science and Technology, China, in 2013. She is currently working toward the M.S. degree in microelectronics and solid-state electronics at the National University of Defense Technology (NUDT), China. Her research interests include high performance computer architecture and computing engineering.