CORDIC Based Fast Radix-2 DCT Algorithm

Hai Huang and Liyi Xiao, Member, IEEE

IEEE SIGNAL PROCESSING LETTERS, VOL. 20, NO. 5, MAY 2013
Abstract—This letter proposes a novel coordinate rotation digital computer (CORDIC)-based fast radix-2 algorithm for computation of the discrete cosine transformation (DCT). The proposed algorithm has some distinct advantages, such as Cooley-Tukey fast Fourier transformation (FFT)-like regular data flow, a uniform post-scaling factor, in-place computation, and arithmetic-sequence rotation angles. Compared to existing DCT algorithms, the proposed algorithm has lower computational complexity. Furthermore, it is highly scalable, modular, regular, and suitable for pipelined VLSI implementation. In addition, this letter provides an easy way to implement a reconfigurable or unified architecture for DCTs and inverse DCTs.

Index Terms—Coordinate rotation digital computer (CORDIC), discrete cosine transformation (DCT), fast radix-2 algorithm.

Manuscript received January 22, 2013; revised March 07, 2013; accepted March 10, 2013. Date of publication March 14, 2013; date of current version March 22, 2013. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Lei Wang.
H. Huang is with the Microelectronics Center, Harbin Institute of Technology, Harbin, China and also with the School of Software, Harbin University of Science and Technology, Harbin, China (e-mail: ic@hrbust.edu.cn; husthh@yahoo.com.cn).
L. Xiao is with the Microelectronics Center, Harbin Institute of Technology, Harbin, China (e-mail: xiaoly@hit.edu.cn).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/LSP.2013.2252616
1070-9908/$31.00 © 2013 IEEE

I. INTRODUCTION

Since the discrete cosine transformation (DCT) was proposed by Ahmed et al. [1], various fast algorithms have been reported in the literature. Existing fast algorithms can be classified into nonradix and radix categories. Algorithms in the nonradix category attempt to reduce computational complexity and make computations more efficient; examples include matrix factorization [2]–[5], algorithms deduced directly from signal flow graphs [6]–[9], and coordinate rotation digital computer (CORDIC)-based fast algorithms [10]–[18]. Due to extensive design optimization for cost reduction and performance enhancement, these DCT algorithms are often complicated and hardly scalable beyond 8-point DCTs. Compared to nonradix algorithms, radix algorithms allow us to generate higher-order DCTs from lower-order DCTs [19]–[26]. Furthermore, radix algorithms generally have a regular computational structure, which reduces implementation complexity. However, due to their recursive nature, radix algorithms are difficult to pipeline and are not suitable for high-speed applications.

Among radix algorithms, the radix-2 algorithm is the most popular because of its computational efficiency and structural simplicity. In this letter, we propose a CORDIC-based radix-2 fast DCT algorithm. Based on the proposed algorithm, the signal flows of DCTs and inverse DCTs (IDCTs) are developed and deduced using their orthogonal properties, respectively. Similar to the Cooley-Tukey fast Fourier transformation (FFT) algorithm, the proposed algorithm can generate the next higher-order DCT from two identical lower-order DCTs. Furthermore, it has some distinct advantages, such as FFT-like regular data flow, a uniform post-scaling factor, in-place computation, and arithmetic-sequence rotation angles. By using the unfolding CORDIC technique, this algorithm overcomes the difficulty of pipelining conventional CORDIC algorithms. This results in a pipelined, high-speed VLSI implementation. Compared to existing DCT algorithms, the proposed algorithm has low computational complexity, and is highly scalable, modular, regular, and able to admit efficient pipelined implementation. In addition, this letter provides an easy way to implement a reconfigurable or unified architecture for DCTs and IDCTs using the orthogonal property.

II. PROPOSED CORDIC BASED FAST DCT ALGORITHM

For an $N$-point signal $x(n)$, the DCT is defined as:

$$C(k) = \sqrt{\frac{2}{N}}\,\alpha(k) \sum_{n=0}^{N-1} x(n) \cos\frac{(2n+1)k\pi}{2N}, \qquad k = 0, 1, \ldots, N-1 \tag{1}$$

where $\alpha(k) = 1/\sqrt{2}$ if $k = 0$, and $\alpha(k) = 1$ otherwise.

According to (1), neglecting the post-scaling factor without loss of generality, the main operation of an $N$-point DCT, denoted as $C(k)$, can be written as:

$$C(k) = \sum_{n=0}^{N-1} x(n) \cos\frac{(2n+1)k\pi}{2N} \tag{2}$$

A length-$N$ input sequence $x(n)$, where $N$ is a power of two, can be decomposed into $x_L(n)$ and $x_H(n)$, which are defined as:

$$x_L(n) = \tfrac{1}{2}\left[x(2n) + x(2n+1)\right] \tag{3}$$

$$x_H(n) = \tfrac{1}{2}\left[x(2n) - x(2n+1)\right] \tag{4}$$

where $n = 0, 1, \ldots, N/2-1$.

So the original signal $x(n)$ can be obtained from $x_L(n)$ and $x_H(n)$ as follows:

$$x(2n) = x_L(n) + x_H(n) \tag{5}$$

$$x(2n+1) = x_L(n) - x_H(n) \tag{6}$$

Substituting (5) and (6) into (2), and applying the sum-to-product identities, (2) can be rewritten as:

$$\begin{aligned}
C(k) &= \sum_{n=0}^{N/2-1} \left[x_L(n) + x_H(n)\right] \cos\frac{(4n+1)k\pi}{2N} + \sum_{n=0}^{N/2-1} \left[x_L(n) - x_H(n)\right] \cos\frac{(4n+3)k\pi}{2N} \\
&= 2\cos\frac{k\pi}{2N} \sum_{n=0}^{N/2-1} x_L(n) \cos\frac{(2n+1)k\pi}{N} + 2\sin\frac{k\pi}{2N} \sum_{n=0}^{N/2-1} x_H(n) \sin\frac{(2n+1)k\pi}{N}, \qquad k = 0, 1, \ldots, N-1
\end{aligned} \tag{7}$$
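The split of the DCT sum into a cosine-weighted part and a sine-weighted part can be checked numerically. The following is a minimal sketch, assuming the two half-length sequences are the halved sums and halved differences of adjacent input samples; the names `dct_kernel`, `split_form`, `xl`, and `xh` are illustrative choices for this example and are not taken from the letter.

```python
import math

def dct_kernel(x, k):
    """Unscaled DCT sum: sum over n of x[n] * cos((2n+1) k pi / (2N))."""
    N = len(x)
    return sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
               for n in range(N))

def split_form(x, k):
    """Same quantity, computed from the half-length sum/difference sequences."""
    N = len(x)
    half = N // 2
    # Assumed decomposition: halved sums and differences of adjacent samples.
    xl = [(x[2 * n] + x[2 * n + 1]) / 2 for n in range(half)]
    xh = [(x[2 * n] - x[2 * n + 1]) / 2 for n in range(half)]
    cos_part = sum(xl[n] * math.cos((2 * n + 1) * k * math.pi / N)
                   for n in range(half))
    sin_part = sum(xh[n] * math.sin((2 * n + 1) * k * math.pi / N)
                   for n in range(half))
    angle = k * math.pi / (2 * N)
    return 2 * math.cos(angle) * cos_part + 2 * math.sin(angle) * sin_part

# Compare both forms for every output index of an arbitrary 8-point input.
x = [3.0, -1.0, 4.0, 1.5, -5.0, 9.0, 2.0, -6.0]
max_err = max(abs(dct_kernel(x, k) - split_form(x, k)) for k in range(len(x)))
print("largest difference between the two forms:", max_err)
```

The two forms should agree to within floating-point rounding for any even-length input, since the second follows from the first by the sum-to-product identities alone.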

Since

$$\sin\frac{(2n+1)k\pi}{N} = (-1)^n \cos\frac{(2n+1)(N/2-k)\pi}{N} \tag{8}$$

we get (9) and (10)

$$C(k) = 2\cos\frac{k\pi}{2N} \sum_{n=0}^{N/2-1} x_L(n) \cos\frac{(2n+1)k\pi}{N} + 2\sin\frac{k\pi}{2N} \sum_{n=0}^{N/2-1} (-1)^n x_H(n) \cos\frac{(2n+1)(N/2-k)\pi}{N} \tag{9}$$

$$C(N-k) = -2\sin\frac{k\pi}{2N} \sum_{n=0}^{N/2-1} x_L(n) \cos\frac{(2n+1)k\pi}{N} + 2\cos\frac{k\pi}{2N} \sum_{n=0}^{N/2-1} (-1)^n x_H(n) \cos\frac{(2n+1)(N/2-k)\pi}{N} \tag{10}$$

where $k = 1, 2, \ldots, N/2-1$.

From (9) and (10), we find that each equation contains two $N/2$-point DCTs with two different coefficients, and the four coefficients together form exactly one CORDIC rotation. Hence, we combine the two equations to realize a CORDIC based fast DCT algorithm.

Let

(11)

Combining the constant values, such as the factor 2, that arise in the recursive decomposition stages with the post-scaling factor, the DCT can be written as:

(12)

According to (12), we can decompose the $N$-point DCT into two $N/2$-point DCTs based on the CORDIC algorithm. For a power-of-two point DCT, the proposed algorithm computes the DCT by recursively decomposing it into 2-point DCTs. Since the basic operation of the algorithm is a 2-point DCT, similar to the radix-2 FFT, this algorithm is called the fast radix-2 DCT. In addition, the rotation angles of the CORDICs form arithmetic sequences with a common difference of $\pi/(2N)$. Another important aspect is that all outputs, $C(k)$, $k = 0, 1, \ldots, N-1$, have a uniform post-scaling factor. Furthermore, the post-scaling factor can be merged into negative powers of two in the 2-D DCT, which can be implemented with shifting operations.

Fig. 1. Signal flow of an N-point fast discrete cosine transformation (DCT).
Fig. 2. Signal flow of a 2-point fast discrete cosine transformation (DCT).
Fig. 3. Signal flow of a 4-point fast discrete cosine transformation (DCT).
Fig. 4. Signal flow of an 8-point fast discrete cosine transformation (DCT).

TABLE I
TRANSFER FUNCTIONS OF THE DCT AND IDCT

Fig. 5. Comparison of the computational complexity among the proposed and other algorithms: (a) numbers of additions and (b) numbers of multiplications/CORDICs.

III. DCT AND IDCT SIGNAL FLOW BASED ON PROPOSED ALGORITHM

The general signal-flow graph for the proposed fast DCT algorithm given in (12) is shown in Fig. 1, while the signal-flow graphs of the 2-point, 4-point, and 8-point DCTs are shown in Figs. 2–4, respectively, where the angle in each circle denotes a CORDIC with that rotation angle. In Fig. 1, there are two separate $N/2$-point DCTs and one CORDIC array. As mentioned above, the CORDIC array consists of CORDICs with arithmetic-sequence rotation angles. The inputs are addressed in bit-reverse order and the outputs are addressed in natural order. The algorithm also supports in-place computation like the FFT. The regular and pure feed-forward data paths of the signal flow make it suitable for pipelined VLSI implementation. For special applications, a double-angle formula can be used to reduce the number of CORDIC types. Hence, the architecture based on the signal flow is highly modular. Furthermore, the modified unfolded CORDIC, which was presented in our previous work [10], can be used to speed up computations and overcome recursive problems in conventional CORDICs.
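The recursion behind the signal flow can be sketched in software as follows. This is a minimal floating-point illustration under stated assumptions, not the authors' VLSI design: each stage forms halved sums and differences of adjacent samples, flips the sign of every other difference, takes two half-length DCTs, and then produces each output pair (k, N-k) as twice a plane rotation by kπ/(2N). The function names, the iteration count, and the explicit CORDIC gain compensation are choices made for this example; a hardware design would more likely absorb the gain into the post-scaling factor.

```python
import math

def cordic_rotate(x, y, angle, iterations=40):
    """Rotate (x, y) counterclockwise by `angle` with CORDIC micro-rotations.

    Each step needs only a power-of-two scaling (a shift in hardware) and an
    add or subtract. Converges for |angle| below roughly 1.74 rad.
    """
    gain = 1.0
    z = angle
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0       # direction of this micro-rotation
        p = 2.0 ** -i                      # 2^-i
        x, y = x - d * y * p, y + d * x * p
        z -= d * math.atan(p)
        gain *= math.sqrt(1.0 + p * p)     # accumulated CORDIC gain
    return x / gain, y / gain              # explicit compensation (assumption)

def dct_direct(x):
    """Reference: unscaled DCT sum evaluated directly for every k."""
    N = len(x)
    return [sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                for n in range(N)) for k in range(N)]

def dct_radix2(x):
    """Recursive radix-2 sketch; len(x) must be a power of two."""
    N = len(x)
    if N == 1:
        return [float(x[0])]
    half = N // 2
    lo = [(x[2 * n] + x[2 * n + 1]) / 2 for n in range(half)]
    # Alternating sign turns the sine-weighted sum into a half-length DCT.
    hi = [(-1) ** n * (x[2 * n] - x[2 * n + 1]) / 2 for n in range(half)]
    c_lo, c_hi = dct_radix2(lo), dct_radix2(hi)
    out = [0.0] * N
    out[0] = 2 * c_lo[0]                   # k = 0: no rotation needed
    out[half] = math.sqrt(2) * c_hi[0]     # k = N/2: rotation angle is pi/4
    for k in range(1, half):
        theta = k * math.pi / (2 * N)      # angles form an arithmetic sequence
        a, b = cordic_rotate(c_lo[k], c_hi[half - k], -theta)
        out[k], out[N - k] = 2 * a, 2 * b
    return out

sample = [1.0, 2.0, -3.0, 4.5, 0.5, -1.25, 2.0, 7.0]
error = max(abs(r - g) for r, g in zip(dct_direct(sample), dct_radix2(sample)))
print("largest deviation from the direct sum:", error)
```

Passing the negated angle to `cordic_rotate` selects the clockwise rotation; inside the loop this only changes which of the two updates adds and which subtracts. The recursion takes its input in natural order for clarity and does not model the bit-reversed addressing or in-place storage of the signal-flow graphs.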
Similarly, the fast algorithm for the $N$-point IDCT can be deduced in the same way as the fast DCT algorithm. Alternatively, it can be obtained more easily using the orthogonal property. As is known, the DCT and IDCT are orthogonal transformations, and the signal flow of the $N$-point IDCT can be easily obtained by inverting the transfer function of each building block shown in Table I and reversing the signal flow direction.

In Table I, the CORDICs in the DCT and IDCT have the same rotation angle but opposite rotation directions. When changing the CORDIC from a clockwise to an anticlockwise rotation with the same angle, the only change required is to replace all adders with subtractors, and all subtractors with adders, in the rotation iteration stage. This results in an easy way to implement a reconfigurable or unified architecture for DCTs and IDCTs. Notice that the proposed fast IDCT algorithm has the same arithmetic complexity as the DCT one.

IV. IMPLEMENTATION CONSIDERATION AND COMPARISONS

The $N$-point proposed DCT algorithm needs two $N/2$-point DCTs, CORDICs, and additions. Therefore, the number of CORDICs required by the proposed algorithm is

(13)

and the number of additions is

(14)

Fig. 5 compares the computational complexity of the proposed algorithm and other algorithms [3], [7], [16], [20], [22] and [23] in terms of the numbers of additions and the numbers of multipliers or CORDICs. In [3], Chen et al. propose a fast recursive algorithm to factor any power-of-two point DCT, which requires additions and multipliers. In [7], a fast multiplierless approximation of the DCT with the lifting scheme is derived from Chen's algorithm. Hence, it has the same computational complexity as Chen's algorithm. Furthermore, it presents an efficient low-complexity fast VLSI implementation for the DCT using only binary shift and addition operations. In [11], the DCT is obtained through computing the DHT, which requires additions and CORDICs. In [20], Hou proposes a fast recursive DCT algorithm that decomposes the transform into even- and odd-numbered frequency samples and implements them with low hardware complexity by reusing the lower-order DCT architecture. In [22], a fast parallel recursive radix-2 DCT algorithm is proposed by decomposing the even-length DCT into two balanced lower-order DCTs. In [23], a fast algorithm for composite-sequence-length DCT is proposed. For the radix-2 DCT, the algorithms in [20], [22] and [23] have the same computational complexity, with additions and multipliers. The fast DCT algorithm based on a shifted discrete Fourier transformation (SDFT) proposed by Hsiao [16] only needs to compute partial results of the SDFT to generate the $N$-point DCT, and is implemented with a linear array, which requires additions and CORDICs. It reduces the computational complexity by reusing the two basic processors. We propose a CORDIC-based fast radix-2 DCT algorithm that requires additions and CORDICs.

The following compares some of the major features of our proposed CORDIC-based fast DCT algorithm.

Hardware complexity: The CORDIC-based algorithm is highly suitable for VLSI implementation, since it is built using shifters and adders only. Liang et al. [7] propose a lifting scheme-based fast multiplierless approximation of the DCT using only binary shift and addition operations. This results in a VLSI implementation with very low hardware complexity. We use the modified unfolded CORDIC to realize a low hardware complexity VLSI implementation that also uses only binary shift and addition operations. Moreover, the computational accuracy can be selected based on the trade-off between hardware complexity and approximation error. In addition, since our proposed algorithm has a uniform post-scaling factor, it is also suitable for scaled DCT implementation.

Fig. 6 shows the block diagram and the corresponding layout view of the proposed 8-point DCT after placement and routing based on the SMIC 0.18-µm standard cell library. This design (excluding the I/O pads) has an area of 627738 µm² with a power consumption of 12.5 mW at an operating frequency of 66.7 MHz.

Scalability: Many CORDIC-based algorithms are limited to short-length DCTs, such as the algorithms based on Loeffler's DCT [6], [17]. Our proposed algorithm provides a much easier
and more regular way to realize scalability, and it can be easily extended to compute long-length DCTs as long as the transform length is a power of two.

Fig. 6. Block diagram and the corresponding layout view of the proposed 8-point DCT after P&R based on the SMIC 0.18-µm standard cell library.

Modularity: For an $N$-point DCT, our proposed algorithm requires CORDICs of only a limited set of different CORDIC types. Furthermore, because the rotation angles of the CORDICs in our proposed algorithm form an arithmetic sequence, the double-angle formula can be used to reduce the number of CORDIC types and make this algorithm highly modular. Compared to the CORDIC-based architectures in [3], [16], our architecture has better modularity.

Pipelinability: The architectures in [20], [22], [23] are recursive in nature, which makes them difficult to pipeline. We use the modified unfolding CORDIC to overcome this problem. Moreover, the regular and purely feed-forward data path makes the architecture based on our proposed algorithm suitable for pipelined VLSI implementation.

Reconfigurability: In [22], a method for mapping the type-IV DCT to the type-II DCT is presented. However, the method requires additional arithmetic computation. We present an easy way to implement a reconfigurable architecture for the DCT and IDCT with the same arithmetic complexity by exploiting their orthogonal property.

V. CONCLUSIONS

In this letter, we propose a novel CORDIC-based radix-2 fast DCT algorithm. This algorithm can generate the next higher-order DCT from two identical lower-order DCTs. Compared to existing DCT algorithms, our proposed algorithm has several distinct advantages, such as low computational complexity, and being highly scalable, modular, regular, and able to admit efficient pipelined implementation. Furthermore, the proposed algorithm also provides an easy way to implement a reconfigurable or unified architecture for DCTs and IDCTs, which will be researched in our future work. In addition, other types (types I and IV) of CORDIC-based fast DCT algorithms will also be examined in our future work.

REFERENCES

[1] N. Ahmed, T. Natarajan, and K. R. Rao, “Discrete cosine transform,” IEEE Trans. Comput., vol. C-23, pp. 90–94, 1974.
[2] T. D. Tran, “The binDCT: Fast multiplierless approximation of the DCT,” IEEE Signal Process. Lett., vol. 7, no. 6, pp. 141–144, 2000.
[3] W. H. Chen, C. H. Smith, and S. C. Fralick, “A fast computational algorithm for the discrete cosine transform,” IEEE Trans. Commun., vol. COM-25, no. 9, pp. 1004–1009, Sep. 1977.
[4] E. Feig and S. Winograd, “Fast algorithms for the discrete cosine transform,” IEEE Trans. Signal Process., vol. 40, no. 9, pp. 2174–2193, Sep. 1992.
[5] V. Britanak and K. R. Rao, “Two-dimensional DCT/DST universal computational structure for $2^m \times 2^n$ block sizes,” IEEE Trans. Signal Process., vol. 48, no. 11, pp. 3250–3255, Nov. 2000.
[6] C. Loeffler, A. Ligtenberg, and G. Moschytz, “Practical fast 1-D DCT algorithms with 11 multiplications,” in Proc. Int. Conf. Acoust., Speech, Signal Process., 1989, pp. 988–991.
[7] J. Liang and T. D. Tran, “Fast multiplierless approximations of the DCT with the lifting scheme,” IEEE Trans. Signal Process., vol. 49, no. 12, pp. 3032–3044, 2001.
[8] Y. P. Lee, T. H. Chen, L. G. Chen, M. J. Chen, and C. W. Ku, “A cost effective architecture for 8 × 8 two-dimensional DCT/IDCT using direct method,” IEEE Trans. Circuits Syst. Video Technol., vol. 7, pp. 459–467, 1997.
[9] S. Yu and E. E. Swartzlander, Jr., “DCT implementation with distributed arithmetic,” IEEE Trans. Comput., vol. 50, no. 9, pp. 985–991, Sep. 2001.
[10] L. Xiao and H. Huang, “A novel CORDIC based unified architecture for DCT and IDCT,” in 2012 Int. Conf. Optoelectronics and Microelectronics (ICOM), Aug. 2012, pp. 496–500.
[11] J. H. Hsiao, L. G. Chen, T. D. Chiueh, and C. T. Chen, “High throughput CORDIC-based systolic array design for the discrete cosine transform,” IEEE Trans. Circuits Syst. Video Technol., vol. 5, no. 3, pp. 218–225, Jun. 1995.
[12] J. E. Volder, “The CORDIC trigonometric computing technique,” IRE Trans. Electron. Comput., vol. 8, pp. 330–334, 1959.
[13] Y. H. Hu and Z. Wu, “An efficient CORDIC array structure for the implementation of discrete cosine transform,” IEEE Trans. Signal Process., vol. 43, pp. 331–336, Jan. 1995.
[14] J. Chen and K. J. R. Liu, “A complete pipelined parallel CORDIC architecture for motion estimation,” IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 45, no. 6, pp. 653–660, Jun. 1998.
[15] S. Yu and E. E. Swartzlander, Jr., “A scaled DCT architecture with the CORDIC algorithm,” IEEE Trans. Signal Process., vol. 50, no. 1, pp. 160–167, Jan. 2002.
[16] S. F. Hsiao, Y. H. Hu, T. B. Juang, and C. H. Lee, “Efficient VLSI implementations of fast multiplierless approximated DCT using parameterized hardware modules for silicon intellectual property design,” IEEE Trans. Circuits Syst. I: Reg. Papers, vol. 52, no. 8, Aug. 2005.
[17] C.-C. Sun, S.-J. Ruan, B. Heyne, and J. Goetze, “Low-power and high quality CORDIC-based Loeffler DCT for signal processing,” IET Proc.-Circuits, Devices Syst., vol. 1, no. 6, pp. 453–461, 2007.
[18] Z. Wu, J. Sha, Z. Wang, L. Li, and M. Gao, “An improved scaled DCT architecture,” IEEE Trans. Consumer Electron., vol. 55, no. 2, pp. 685–689, May 2009.
[19] B. G. Lee, “A new algorithm to compute the discrete cosine transform,” IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-32, no. 6, pp. 1243–1245, Dec. 1984.
[20] H. S. Hou, “A fast recursive algorithm for computing the discrete cosine transform,” IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-35, pp. 1445–1461, Oct. 1987.
[21] G. Bi, “Fast algorithms for type-III DCT of composite sequence lengths,” IEEE Trans. Signal Process., vol. 47, no. 7, pp. 2053–2059, Jul. 1999.
[22] C. W. Kok, “Fast algorithm for computing discrete cosine transform,” IEEE Trans. Signal Process., vol. 45, no. 3, pp. 757–760, Mar. 1997.
[23] G. Bi and L. W. Yu, “DCT algorithms for composite sequence lengths,” IEEE Trans. Signal Process., vol. 46, pp. 554–562, Mar. 1998.
[24] C. H. Chen, B. D. Liu, and J. F. Yang, “Direct recursive structures for computing radix-r two-dimensional DCT/IDCT/DST/IDST,” IEEE Trans. Circuits Syst. I: Reg. Papers, vol. 51, no. 10, Oct. 2004.
[25] H. Hsu and C. Liu, “Fast radix-q and mixed-radix algorithms for type-IV DCT,” IEEE Signal Process. Lett., vol. 15, pp. 910–913, Dec. 2008.
[26] J. Wu, H. Shu, L. Senhadji, and L. Luo, “Mixed-radix algorithm for the computation of forward and inverse MDCTs,” IEEE Trans. Circuits Syst. I: Reg. Papers, vol. 56, no. 4, pp. 784–794, 2009.
