1987 New 2^n DCT Algorithms Suitable for VLSI Implementation
ABSTRACT
Small length Discrete Cosine Transforms (DCT's) are used for image data compression. In that case, length 8 or 16 DCT's need to be performed at video rate. We propose two new implementations of DCT's which have several interesting features, as far as VLSI implementation is concerned. A first one, using modulo-arithmetic, needs only one multiplication per input point, so that a single multiplier is needed on-chip. A second one, based on a decomposition of the DCT into polynomial products, and evaluation of these polynomial products by distributed arithmetic, results in a very small chip, with a great regularity and testability. Furthermore, the same structure can be used for FFT computation by changing only the ROM-part of the chip. Both new architectures are mainly based on a new formulation of a length-2^n DCT as a cyclic convolution, which is explained in the first section of the paper.
I. INTRODUCTION
In recent years, many fast DCT algorithms were proposed, among which three are of major interest: the CHEN-FRALICK [1] algorithm, the B.G. LEE [3] algorithm, and the VETTERLI-NUSSBAUMER [4] algorithm. The first one, having been proposed a long time ago, has been considered for VLSI implementation several times, although it does not meet the minimum arithmetic complexity [2]. The other ones meet the minimum known number of both multiplications and additions to implement a length-2^n DCT algorithm. Furthermore, it has been shown that, if these algorithms could be improved, the same approach would also improve a whole class of algorithms (i.e. 1-D and 2-D FFT's, DST, ...) [6]. From a practical point of view, the algorithm by LEE has a greater regularity than the VETTERLI-NUSSBAUMER algorithm, but has poor roundoff noise performance, due to the 1/cos coefficients. Both of them have been implemented in hardware (or silicon) [5].
With those considerations in mind, one can see that there is still some need for DCT algorithms meeting the following three characteristics altogether:
- minimum arithmetic complexity,
- regularity of the graph (the availability of [...] of a length N DCT is often required),
- good noise performance.
While it is possible to obtain "classical" algorithms meeting these three points (the paper describing them is in the process of being written), we propose in this paper two completely new approaches that have several interesting features, as far as VLSI implementation is concerned.
We will give in this paper only sketches of proofs for the derivations of the algorithms, since our aim is to show that the understanding of the underlying mathematical structure of the DCT can lead to new efficient algorithms.
II. THE LENGTH 2^n DCT AS POLYNOMIAL PRODUCTS
The DCT is defined as follows:
(1)  $X_k = \sum_{i=0}^{N-1} x_i \cos\left(\frac{2\pi}{4N}(2i+1)k\right)$
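As a small worked instance of the definition (1), given here only as an illustration and not taken from the paper, the case N = 2 can be written out by hand:

```latex
% Worked example of eq. (1) for N = 2 (so 4N = 8).
\begin{align*}
X_0 &= x_0 \cos(0) + x_1 \cos(0) = x_0 + x_1, \\
X_1 &= x_0 \cos\!\left(\tfrac{2\pi}{8}\cdot 1\right)
     + x_1 \cos\!\left(\tfrac{2\pi}{8}\cdot 3\right)
     = (x_0 - x_1)\cos\!\left(\tfrac{\pi}{4}\right).
\end{align*}
```

The second line uses $\cos(3\pi/4) = -\cos(\pi/4)$. Note that (1) carries no normalization factor, so $X_0$ is simply the sum of the inputs.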
The equivalence between the above DCT for N = 2^n and a cyclic convolution is obtained through two permutations of the input variables.
The first one, already given in [4], is used to change the term (2i+1) in (1) into (4i+1). Let us define $x'_i$ as:
$x'_i = x_{2i}, \quad x'_{N-1-i} = x_{2i+1}, \quad i = 0, \ldots, N/2 - 1$
Then:
(2)  $X_k = \sum_{i=0}^{N-1} x'_i \cos\left(\frac{2\pi}{4N}(4i+1)k\right)$
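The effect of the first permutation can be checked numerically. The following is a minimal Python sketch, assuming the usual even/odd reordering $x'_i = x_{2i}$, $x'_{N-1-i} = x_{2i+1}$ for the permuted sequence; the function names are hypothetical and chosen for this illustration only.

```python
import math


def dct_direct(x):
    """Eq. (1): X_k = sum_i x_i cos(2*pi*(2i+1)*k / (4N)), no normalization."""
    n = len(x)
    return [sum(x[i] * math.cos(2 * math.pi * (2 * i + 1) * k / (4 * n))
                for i in range(n)) for k in range(n)]


def first_permutation(x):
    """Assumed reordering: x'_i = x_{2i}, x'_{N-1-i} = x_{2i+1}, i < N/2."""
    n = len(x)
    xp = [0.0] * n
    for i in range(n // 2):
        xp[i] = x[2 * i]
        xp[n - 1 - i] = x[2 * i + 1]
    return xp


def dct_permuted(x):
    """Eq. (2): the same outputs, written with the index (4i+1)."""
    n = len(x)
    xp = first_permutation(x)
    return [sum(xp[i] * math.cos(2 * math.pi * (4 * i + 1) * k / (4 * n))
                for i in range(n)) for k in range(n)]
```

Both functions should return the same values for any even-length input, because $4(N-1-i)+1 = 4N - (4i+3)$ and the cosine is unchanged when its index is replaced by $4N$ minus that index.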
The second one will allow us to change a product of indices (4i+1)(4k+1) into a sum of indices: $u_i + v_k$. This result is obtained through the use of a one-to-one correspondence between the set of integers of the form (4i+1), i = 0, ..., 2^n - 1, and the successive powers of 5 modulo 2^(n+2): it is always possible to write:
(3)  $4i+1 = \langle 5^{u_i} \rangle_{2^{n+2}}$
where $\langle a \rangle_M$ denotes a modulo M.
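The one-to-one correspondence between the integers 4i+1 and the powers of 5 modulo 2^(n+2) is easy to verify by enumeration. The sketch below is only an illustration; `powers_of_five` and `exponent_table` are hypothetical helper names.

```python
def powers_of_five(n):
    """Return [5^u mod 2^(n+2) for u = 0 .. 2^n - 1]."""
    modulus = 1 << (n + 2)
    return [pow(5, u, modulus) for u in range(1 << n)]


def exponent_table(n):
    """Map each i (0 <= i < 2^n) to the exponent u_i with 4i+1 = 5^u_i mod 2^(n+2)."""
    table = {}
    for u, value in enumerate(powers_of_five(n)):
        table[(value - 1) // 4] = u
    return table


def product_matches_sum(n):
    """Check that (4i+1)(4k+1) equals 5^(u_i + u_k) modulo 2^(n+2) for all i, k."""
    modulus = 1 << (n + 2)
    table = exponent_table(n)
    return all(((4 * i + 1) * (4 * k + 1)) % modulus
               == pow(5, table[i] + table[k], modulus)
               for i in range(1 << n) for k in range(1 << n))
```

For n = 3 (N = 8, modulus 32) the successive powers are 1, 5, 25, 29, 17, 21, 9, 13, which is a reordering of 1, 5, 9, ..., 29. This is what turns a product of indices into a sum of exponents.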
Let us define accordingly the permuted sequence:
(4)  $x''_i = x'_{(\langle 5^{i} \rangle_{4N} - 1)/4}$

Let us now consider separately the even and odd terms $\{X_{2k}\}$ and $\{X_{2k+1}\}$ of the DCT. It is well known, and fairly obvious from eq. (1), that $X_{2k}$ is the output of a DCT of length N/2. Hence, the following decomposition, on the odd terms $\{X_{2k+1}\}$, will apply recursively on the DCT's of reduced lengths.

When considering only the odd terms $\{X_{2k+1}\}$, eq. (1) is now symmetrical in i and k, and the two permutations described above are now feasible in k (with the only difference that there are N/2 terms $X_{2k+1}$ and N terms $x_{2i+1}$, thus resulting in the [...] term of eq. (7); see [8] for more details). Hence, we have as a result:
[...]
i.e.: the odd terms of the DCT can be stated, as in eq. (10), as a polynomial product modulo $z^{N/2} + 1$, that is, a polynomial product of length N/2.

This can be applied recursively to the length N/2 DCT arising from the computation of the even terms, and so on, thus resulting in a complete formulation of the DCT as polynomial products.

When considering these polynomial products, it is easily recognized that the polynomials involving the input of the DCT are all reductions of X(z) modulo the cyclotomic factors of $z^N - 1$ (N = 2^n). Knowing this, one can see that the whole set of polynomial products is equivalent to a cyclic convolution (i.e. a polynomial product modulo $z^N - 1$):
(11)  $Y(z) = X(z) \cdot V(z) \bmod (z^N - 1)$
followed by a reduction modulo the cyclotomic factors of $z^N - 1$. The sequence to be cyclically convolved with the input data remains to be found. But, since we know, by successive applications of eq. (10) to the DCT's of decreasing lengths N, N/2, N/4, ..., the expression of the unknown polynomial modulo the cyclotomic factors, it is easy to reconstruct the initial one, given in eq. (12):
(12)  $V(z) = [\ldots]$
We have now established that the DCT of length N = 2^n can be obtained as shown in fig. (1).

It has been shown by WINOGRAD [9] that the multiplicative complexity of a cyclic convolution of length 2^n is given by:
(13)  $\mu = 2^{n+1} - n - 1$
Furthermore, one of the multiplications involved in the DCT computed as a convolution, as shown in fig. (1), is trivial ($V(z) \bmod (z - 1) = 1$). We then obtain, as an upper bound on the multiplicative complexity of the length-2^n DCT:
(14)  $\mu \le 2^{n+1} - n - 2$
It is possible (but more intricate) to show, by using some other results of WINOGRAD, that this upper bound is also the lower bound. This result was already obtained by M.T. HEIDEMAN [11].
Consequences of practical importance can be obtained by observation of table 1, containing the comparison between this lower bound and the practical algorithms for short lengths:
Table 1: number of multiplications
N     lower bound   VETTERLI   LEE   CHEN
4     4             4          4     6
8     11            12         12    16
16    26            32         32    44
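The figures compared above can be reproduced from closed-form counts. The sketch below is an illustration under stated assumptions: the bound 2^(n+1) - n - 2 is the one discussed in the text, while the counts (N/2) log2 N for the LEE and VETTERLI-NUSSBAUMER algorithms and N log2 N - 3N/2 + 4 for the CHEN algorithm are the commonly quoted multiplication counts, assumed here rather than taken from this paper. All function names are hypothetical.

```python
def convolution_multiplications(n):
    """Multiplicative complexity of a length-2^n cyclic convolution."""
    return 2 ** (n + 1) - n - 1


def dct_bound(n):
    """Bound for the length-2^n DCT: one multiplication of the convolution is trivial."""
    return convolution_multiplications(n) - 1


def lee_multiplications(n):
    """Assumed count (N/2) * log2(N), N = 2^n, for LEE and VETTERLI-NUSSBAUMER."""
    return (2 ** n // 2) * n


def chen_multiplications(n):
    """Assumed count N*log2(N) - 3N/2 + 4, N = 2^n, for the CHEN algorithm."""
    size = 2 ** n
    return size * n - 3 * size // 2 + 4


def comparison_rows(exponents=(2, 3, 4)):
    """Rows (N, bound, LEE / VETTERLI count, CHEN count) for the given exponents."""
    return [(2 ** n, dct_bound(n), lee_multiplications(n), chen_multiplications(n))
            for n in exponents]
```

The gap between the bound and the practical algorithms widens with N: it is zero at N = 4 and six multiplications at N = 16 for the LEE count.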
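The formulation of the DCT as a cyclic convolution can be checked numerically. The following Python sketch is an illustration and not the authors' implementation. It assumes the even/odd input reordering for the first permutation, reads the inputs at indices $5^{-u}$ and the outputs at indices $5^{v}$ modulo 4N (one possible convention for obtaining a convolution rather than a correlation), and obtains the even terms by recursion on the folded input $x_i + x_{N-1-i}$. All names are hypothetical.

```python
import math


def dct_direct(x):
    """Reference: X_k = sum_i x_i cos(2*pi*(2i+1)*k / (4N))."""
    n = len(x)
    return [sum(x[i] * math.cos(2 * math.pi * (2 * i + 1) * k / (4 * n))
                for i in range(n)) for k in range(n)]


def odd_terms(x):
    """Odd-indexed DCT outputs from one length-N cyclic convolution."""
    n = len(x)
    m4 = 4 * n
    # First permutation (assumed): x'_i = x_{2i}, x'_{N-1-i} = x_{2i+1}.
    xp = [0.0] * n
    for i in range(n // 2):
        xp[i] = x[2 * i]
        xp[n - 1 - i] = x[2 * i + 1]
    # Second permutation: read x' at the index j with 4j+1 = 5^(-u) mod 4N.
    # 5 has order N modulo 4N, so 5^(-u) = 5^((N-u) mod N).
    a = [xp[(pow(5, (n - u) % n, m4) - 1) // 4] for u in range(n)]
    # Kernel: cosines taken at the successive powers of 5.
    h = [math.cos(2 * math.pi * pow(5, w, m4) / m4) for w in range(n)]
    # Cyclic convolution; y[v] is the DCT sum evaluated at index 5^v mod 4N.
    y = [sum(a[u] * h[(v - u) % n] for u in range(n)) for v in range(n)]
    out = {}
    for v in range(n):
        k = pow(5, v, m4)
        sign = 1.0
        if k > 2 * n:      # X_{4N-k} = X_k
            k = m4 - k
        if k > n:          # X_{2N-k} = -X_k
            k, sign = 2 * n - k, -1.0
        out[k] = sign * y[v]
    return out


def dct_by_convolution(x):
    """Even terms by recursion on the folded input, odd terms by convolution."""
    n = len(x)
    if n == 1:
        return [x[0]]
    folded = [x[i] + x[n - 1 - i] for i in range(n // 2)]
    even = dct_by_convolution(folded)
    odd = odd_terms(x)
    result = [0.0] * n
    for k in range(n // 2):
        result[2 * k] = even[k]
        result[2 * k + 1] = odd[2 * k + 1]
    return result
```

In this sketch the kernel satisfies h[w + N/2] = -h[w], so the second half of the convolution output is the negative of the first half. Only N/2 outputs are distinct, which corresponds to a polynomial product modulo $z^{N/2} + 1$; the full length-N convolution is kept here only for readability.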
Since we have now established the DCT as a cyclic convolution, we can use Number Theoretic Transforms (NTT) [12] to compute the convolution, and the scheme of fig. (1) now becomes as shown in fig. (2).
Furthermore, since the computation of the result modulo the cyclotomic factors of $z^N - 1$ is obtained as intermediate variables inside the inverse NTT, the butterflies shown in fig. (2) can be simplified with the last operations involved in the computation of the NTT^(-1) box. This results in the diagram shown in fig. 3 for N=8.
It should be noted that this corresponds to a favorable case for NTT's to be used:
- Since NTT's are generally performed on short-length sequences (N=16 seems to be a maximum), we avoid the usual problem arising in NTT's: due to the relationship between α, the N-th root of unity, N, the length of the transform, and M, the modulus ($\alpha^N \equiv 1 \bmod M$), it is often impossible to use 2 as a root of unity (thus avoiding multiplications in the NTT) for even moderate lengths.
- What is needed to compute the DCT is really a cyclic convolution, and there is no need for the overlap-add or overlap-save algorithms to obtain a linear convolution, as is needed in FIR filtering.
- The modulo arithmetic is not such a problem, since, with the given constraints, we can work modulo a Fermat number, or a pseudo Fermat number [13], which gives one of the simplest known modulo-arithmetics [14]. In this case, α in fig. (3) represents only a shift, and can be implemented by a rotation of the input word at the bit level.
- Furthermore, since a great precision on the $X_k$ is often needed (use of DCT in adaptive feedback loops), the need for greater wordlengths when using NTT's, compared to the usual case, is not such a waste.
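The remarks above can be illustrated with a small cyclic convolution computed through an NTT. The sketch below is a minimal illustration, not the circuit of the paper: the modulus M = 257 = 2^8 + 1 (a Fermat number) and the root α = 2, which has order 16 modulo 257, are choices made for this example, and the transform is written as a plain O(N^2) sum for clarity. Function names are hypothetical.

```python
def ntt(seq, root, modulus):
    """Plain O(N^2) number theoretic transform of seq with the given root."""
    n = len(seq)
    return [sum(seq[i] * pow(root, i * k, modulus) for i in range(n)) % modulus
            for k in range(n)]


def cyclic_convolution_ntt(a, b, modulus=257, root=2):
    """Cyclic convolution of a and b modulo a prime modulus, via the NTT.

    Assumes len(a) == len(b) == N, a power of two greater than 1, and that
    root has order exactly N modulo modulus (root=2 gives N=16 for 257).
    """
    n = len(a)
    if len(b) != n:
        raise ValueError("sequences must have the same length")
    if pow(root, n, modulus) != 1 or pow(root, n // 2, modulus) == 1:
        raise ValueError("root does not have order N modulo the modulus")
    fa = ntt(a, root, modulus)
    fb = ntt(b, root, modulus)
    product = [(u * v) % modulus for u, v in zip(fa, fb)]
    # Modular inverses by Fermat's little theorem (valid because 257 is prime).
    inv_root = pow(root, modulus - 2, modulus)
    inv_n = pow(n, modulus - 2, modulus)
    return [(inv_n * c) % modulus for c in ntt(product, inv_root, modulus)]
```

With α = 2, every twiddle factor is a power of two, so in hardware the products inside the transform reduce to shifts modulo 2^8 + 1. The only general multiplications are the N pointwise products, one per input point. Results are exact only modulo M, so the word length must be large enough for the true convolution values.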
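Distributed arithmetic is a means of computing inner products. Let
$Y = \sum_{i=0}^{N-1} a_i x_i$
be the inner product to be computed, and let $x_{ij}$ denote the bit of weight $2^{-j}$ of $x_i$, so that $x_i = \sum_{j=1}^{B-1} x_{ij} 2^{-j}$. Then:
(17)  $Y = \sum_{j=1}^{B-1} \left( \sum_{i=0}^{N-1} a_i x_{ij} \right) 2^{-j}$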
In this equation, the double sum is a successive shift and add of elementary terms (between brackets), each term being an inner product between the $a_i$ and a vector of bits ($x_{ij}$, i = 0, ..., N-1). Let f be this function. f depends on N binary variables, hence can take 2^N different values. If these values are stored in a ROM at the address corresponding to the binary configuration of the input bits, an implementation of the inner product by distributed arithmetic is as shown in fig. 4.
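The ROM-and-accumulator scheme described above can be sketched in software. This is a minimal illustration under simplifying assumptions: the inputs are unsigned fixed-point fractions (no sign bit), each given as an integer `sample` standing for `sample / 2**bits`, and the names `build_rom` and `inner_product_da` are hypothetical.

```python
def build_rom(coeffs):
    """ROM content: for every N-bit address, the sum of the a_i whose bit is set."""
    n = len(coeffs)
    return [sum(c for i, c in enumerate(coeffs) if (address >> i) & 1)
            for address in range(1 << n)]


def inner_product_da(rom, samples, bits):
    """Inner product sum_i a_i * x_i with x_i = samples[i] / 2**bits.

    One ROM access per bit position, least significant bit first, followed
    by an add and a right shift (halving) of the accumulator.
    """
    acc = 0.0
    for j in range(bits):
        address = 0
        for i, sample in enumerate(samples):
            # Bit j of every input word forms the ROM address.
            address |= ((sample >> j) & 1) << i
        acc = (acc + rom[address]) / 2.0
    return acc
```

For example, with coefficients [0.5, -0.25, 0.75, 1.0], samples [3, 5, 0, 7] and 3 bits, the three ROM words read are 1.25, 1.5 and 0.75, and the accumulator ends at 0.90625, which equals (0.5*3 - 0.25*5 + 0.75*0 + 1.0*7) / 8. No multiplier is used: the cost is one ROM of 2^N words and one accumulation per input bit.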
When used in a DCT algorithm, the distributed arithmetic implementation of the polynomial product will require one inner product computation per coefficient of the resulting polynomial, and some butterflies to decompose the initial DCT into polynomial products (see fig. 5).
A number of remarks are of interest:
- Since the ROM is addressed by the bits of same weight of the outputs of the butterflies, these butterflies can be implemented in serial arithmetic.
- The speed of a circuit implementing this architecture will be limited only by the output accumulator. If the required speed is lower, it is possible to reduce the size of the circuit by using the relationships between the different inner products involved [8], in a manner very similar to that explained in [15] for the computation of convolution.
- All the combinations of the input data are performed in serial arithmetic. Hence, the resulting architecture is very regular and easily implemented.
- Since the structure of the decomposition of the DCT into polynomial products is the same as for other transforms, the same structure can also be used for the computation of Fourier transforms by changing only the ROM part of the chip.
VI. CONCLUSION
We have first explained the equivalence between DCT and cyclic convolution. We then used this relationship to obtain new DCT algorithms with some characteristics suitable for VLSI implementation. Other algorithms can also be obtained with such an approach. Further work will be reported.
REFERENCES
[1] to [6]  [...]
[7] P. DUHAMEL: "Dispositif de transformée en cosinus d'un signal numérique échantillonné". French patent n° 9601629, February 1986.
[8] P. DUHAMEL: "Dispositif de détermination de la transformée numérique d'un signal". French patent n° 8612431, September 1986.
[9] S. WINOGRAD: "Some bilinear forms whose multiplicative complexity depends on the field of constants". Math. Syst. Theory, Vol. 10, pp. 169-180, 1977.
[10] L. AUSLANDER, S. WINOGRAD: "The multiplicative complexity of certain semi-linear systems defined by polynomials". Adv. in Applied Mathematics, Vol. 1, n° 3, pp. 257-299, 1980.
[11] M.T. HEIDEMAN: Private communication.
[12] H.J. NUSSBAUMER: "Fast Fourier Transform and Convolution Algorithms". Springer-Verlag, 1981.
[13] R.C. AGARWAL, C.S. BURRUS: "Fast convolutions using Fermat number transforms with application to digital filtering". IEEE Trans. on ASSP, Vol. 22, pp. 87-97, 1974.
[14] L.M. LEIBOWITZ: "A simplified binary arithmetic for the Fermat Number Transform". IEEE Trans. on ASSP, Vol. 24, pp. 356-359, 1976.
[15] S. CHU, C.S. BURRUS: "A prime factor FFT algorithm using distributed arithmetic". IEEE Trans. on ASSP, Vol. 30, n° 2, pp. 217-226, April 1982.
Fig. 2
Fig. 4: Implementation of an inner product by distributed arithmetic
Fig. 5: The DCT of length 8 by distributed arithmetic