Discrete Hartley Transforms On FPGA: Sriseshan S EE09B060 Murali Naik EE09B055 April 26, 2013

Introduction

In this assignment, we focus on DHTs that have power-of-2 lengths and carry
out the following:
1. Show mathematically how an N-point DHT can be computed from two
N/2-point DHTs.
2. Implement a 4-point DHT on an FPGA using systolic-array-based and
pipelined designs, and display all outputs on an LCD.
3. Implement an 8-point DHT on an FPGA using two 4-point DHT modules
and the algorithm above, and display all outputs on an LCD.
4. Report and analyze the synthesized area and I/O delay of the above.
The DHT finds extensive application in signal processing, mainly because of its
real-valued nature, its symmetry with respect to its inverse, and its efficient
computability when compared to the DFT and DCT.
• By definition, the N-point DHT X of input vector x is given by:

$$X(k) = \sum_{n=0}^{N-1} x(n)\left(\cos\frac{2k\pi n}{N} + \sin\frac{2k\pi n}{N}\right) \tag{1}$$

for k, n = 0, 1, 2, ..., N - 1.
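As a reference point for the hardware results, definition [1] can be evaluated directly in software. The following is a minimal Python sketch, not the implementation used in this report (the report's outputs were checked with MATLAB); the function names `dht` and `idht` are illustrative. It also illustrates the symmetry with respect to the inverse mentioned above: applying the same transform again and scaling by 1/N recovers the input.

```python
import math


def dht(x):
    """Direct O(N^2) evaluation of equation [1]."""
    n_pts = len(x)
    out = []
    for k in range(n_pts):
        acc = 0.0
        for n, sample in enumerate(x):
            angle = 2.0 * math.pi * k * n / n_pts
            acc += sample * (math.cos(angle) + math.sin(angle))
        out.append(acc)
    return out


def idht(X):
    """Inverse DHT: the same transform, scaled by 1/N."""
    n_pts = len(X)
    return [value / n_pts for value in dht(X)]


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    X = dht(x)
    print("DHT :", [round(v, 6) for v in X])
    print("back:", [round(v, 6) for v in idht(X)])
```

Because the forward and inverse transforms differ only by a scale factor, the same datapath could in principle serve both directions.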
• [1] can also be written as:

$$X(k) = \sum_{n=0}^{N/2-1} x(2n)\left(\cos\frac{4k\pi n}{N} + \sin\frac{4k\pi n}{N}\right) + \sum_{n=0}^{N/2-1} x(2n+1)\left(\cos\frac{2k\pi(2n+1)}{N} + \sin\frac{2k\pi(2n+1)}{N}\right) \tag{2}$$

• Let x1 and x2 be the even and odd subsequences of x, with length N/2 each.
Their corresponding DHTs can be written as:

$$X_1(k) = \sum_{n=0}^{N/2-1} x(2n)\left(\cos\frac{4k\pi n}{N} + \sin\frac{4k\pi n}{N}\right) \tag{3}$$

$$X_2(k) = \sum_{n=0}^{N/2-1} x(2n+1)\left(\cos\frac{4k\pi n}{N} + \sin\frac{4k\pi n}{N}\right) \tag{4}$$

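• Let $\alpha = \frac{2k\pi}{N}$. Consider

$$X_1(k) + X_2(k)\cos\alpha + X_2\!\left(\tfrac{N}{2} - k\right)\sin\alpha$$

Substituting $\tfrac{N}{2} - k$ for $k$ in [4] turns each angle $2\alpha n$ into $2\pi n - 2\alpha n$, so the cosine is unchanged and the sine changes sign. Hence

$$[4] \implies\; = X_1(k) + X_2(k)\cos\alpha + \left(\sum_{n=0}^{N/2-1} x(2n+1)\left(\cos(2\alpha n) - \sin(2\alpha n)\right)\right)\sin\alpha \tag{5}$$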

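• Now, [2] and [3] $\implies$

$$\begin{aligned}
X(k) &= X_1(k) + \sum_{n=0}^{N/2-1} x(2n+1)\big(\cos(2\alpha n)\cos\alpha - \sin(2\alpha n)\sin\alpha + \sin(2\alpha n)\cos\alpha + \sin\alpha\cos(2\alpha n)\big) \\
[4] \implies\; &= X_1(k) + X_2(k)\cos\alpha + \sum_{n=0}^{N/2-1} x(2n+1)\big(\cos(2\alpha n) - \sin(2\alpha n)\big)\sin\alpha
\end{aligned} \tag{6}$$
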
• Since [5] and [6] are identical:

$$X(k) = X_1(k) + X_2(k)\cos\alpha + X_2\!\left(\tfrac{N}{2} - k\right)\sin\alpha \tag{7}$$

Thus an N-point DHT can be obtained from two N/2-point DHTs. This
approach avoids computing the various sines and cosines involved in [1]
and supports reuse of hardware.
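The decomposition [7] can be checked numerically. The sketch below is illustrative Python, not part of the FPGA design; the names `dht_direct` and `dht_split` are hypothetical. It applies [7] recursively and compares the result with the direct definition [1]. It assumes the input length is a power of 2 and uses the fact that the N/2-point DHTs are periodic in k with period N/2, so indices are taken modulo N/2.

```python
import math


def dht_direct(x):
    """Direct evaluation of equation [1], used as the reference."""
    n_pts = len(x)
    return [
        sum(
            x[n] * (math.cos(2.0 * math.pi * k * n / n_pts)
                    + math.sin(2.0 * math.pi * k * n / n_pts))
            for n in range(n_pts)
        )
        for k in range(n_pts)
    ]


def dht_split(x):
    """DHT via equation [7]; len(x) is assumed to be a power of 2."""
    n_pts = len(x)
    if n_pts == 1:
        return [float(x[0])]  # a 1-point DHT is the sample itself
    half = n_pts // 2
    X1 = dht_split(x[0::2])  # DHT of the even-indexed samples
    X2 = dht_split(x[1::2])  # DHT of the odd-indexed samples
    out = []
    for k in range(n_pts):
        alpha = 2.0 * math.pi * k / n_pts
        out.append(
            X1[k % half]
            + X2[k % half] * math.cos(alpha)
            + X2[(half - k) % half] * math.sin(alpha)
        )
    return out


if __name__ == "__main__":
    sample = [3, -1, 4, 1, -5, 9, 2, -6]
    reference = dht_direct(sample)
    recursive = dht_split(sample)
    print(max(abs(a - b) for a, b in zip(reference, recursive)))
```

The printed value is the largest absolute difference between the two results and should be at the level of floating-point rounding error.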
The following sections outline implementations on FPGA.
Note: The outputs displayed on the LCD and through simulation were verified
using MATLAB. The LCD module was not considered for the synthesis reports.

Figure 1: Systolic array for 4-DHT

4-point DHT

For N = 4, each term cos(·) + sin(·) in [1] evaluates to ±1, and hence [1] is
equivalent to a matrix-vector multiplication of the form:

$$\begin{bmatrix} x_0 & x_1 & x_2 & x_3 \end{bmatrix}
\begin{bmatrix}
1 & 1 & 1 & 1 \\
1 & 1 & -1 & -1 \\
1 & -1 & 1 & -1 \\
1 & -1 & -1 & 1
\end{bmatrix}$$

This can be implemented using a systolic array as shown in Figure [1].
Upon implementing this on an FPGA, the synthesis tool, as expected, reports
a total of 12 adders and subtracters. An estimated input-to-output delay
of 12 ns was reported. After an initial latency of 3 additions, a
throughput of one output per addition is expected. Near-identical results
are achieved if identical processing elements (controlled by an appropriate
tag bit that selects between add and subtract) are used.
Given the nature of the transform matrix, a two-stage pipelined design may
be implemented based on the following:

Stage 1 :
b1 = x0 + x2
b2 = x1 + x3
b3 = x0 − x2
b4 = x1 − x3

Stage 2 :
X0 = b1 + b2
X1 = b3 + b4
X2 = b1 − b2
X3 = b3 − b4
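The two stages above can be checked against the 4 × 4 transform matrix in software. This is a small Python sketch of the arithmetic only (function names are illustrative); it does not model the pipeline registers or timing of the FPGA design.

```python
def dht4_matrix(x):
    """4-point DHT as a product with the +/-1 transform matrix."""
    h = [
        [1, 1, 1, 1],
        [1, 1, -1, -1],
        [1, -1, 1, -1],
        [1, -1, -1, 1],
    ]
    return [sum(coef * value for coef, value in zip(row, x)) for row in h]


def dht4_two_stage(x):
    """4-point DHT using the two-stage add/subtract structure."""
    x0, x1, x2, x3 = x
    # Stage 1: four adders/subtracters
    b1 = x0 + x2
    b2 = x1 + x3
    b3 = x0 - x2
    b4 = x1 - x3
    # Stage 2: four more adders/subtracters
    return [b1 + b2, b3 + b4, b1 - b2, b3 - b4]


if __name__ == "__main__":
    sample = [1, 2, 3, 4]
    print(dht4_matrix(sample))
    print(dht4_two_stage(sample))
```

Both functions use only additions and subtractions; the two-stage version performs 8 of them, matching the adder count discussed in the text.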

Thus, the latency is reduced to 2 additions in this case. As expected, a total of
only 8 adders and subtracters are synthesized. An estimated input-to-output
delay of 10.5 ns was reported. As this proves more efficient than the former
design with respect to both area and delay, it is used in computing the 8-DHT.

8-point DHT

For higher values of N, a conventional design based on Figure [1] will have
high latency and low throughput, as the processing elements will no longer be
simple adders/subtracters: multiplications with various sines and cosines
(which may have to be computed or fetched) will be required.
Instead, [7] makes use of a single cosine and sine value for each output index k,
and the subproblems (N/2-DHTs) can be computed in parallel. The cosine and sine
values can be computed using CORDIC and the multiplications done by dedicated
hardware, or the products can be retrieved from a ROM (this is feasible especially
when the profile of the input is known), thus eliminating the need for computing
and multiplying. However, given N = 8 and an (8.4)-bit fixed-point two's
complement notation, we implement division by √2 using shifters
and adders. [7] can be rewritten as follows:

Figure 2: 8-DHT

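N = 8:

$$\begin{aligned}
X(0) &= X_1(0) + X_2(0) \\
X(1) &= X_1(1) + \tfrac{1}{\sqrt{2}}\,\big(X_2(1) + X_2(3)\big) \\
X(2) &= X_1(2) + X_2(2) \\
X(3) &= X_1(3) + \tfrac{1}{\sqrt{2}}\,\big(X_2(1) - X_2(3)\big) \\
X(4) &= X_1(0) - X_2(0) \\
X(5) &= X_1(1) - \tfrac{1}{\sqrt{2}}\,\big(X_2(1) + X_2(3)\big) \\
X(6) &= X_1(2) - X_2(2) \\
X(7) &= X_1(3) - \tfrac{1}{\sqrt{2}}\,\big(X_2(1) - X_2(3)\big)
\end{aligned}$$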
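The eight output equations for N = 8 can be cross-checked against the direct definition [1]. The Python sketch below is an illustrative floating-point model with hypothetical function names; it multiplies by an exact 1/√2 rather than using the shift-and-add approximation of the hardware, so it checks the algebra only.

```python
import math


def dht4(x):
    """4-point DHT using two stages of additions and subtractions."""
    x0, x1, x2, x3 = x
    b1, b2, b3, b4 = x0 + x2, x1 + x3, x0 - x2, x1 - x3
    return [b1 + b2, b3 + b4, b1 - b2, b3 - b4]


def dht8(x):
    """8-point DHT from two 4-point DHTs, following [7] for N = 8."""
    X1 = dht4(x[0::2])  # even-indexed samples
    X2 = dht4(x[1::2])  # odd-indexed samples
    inv_root2 = 1.0 / math.sqrt(2.0)
    s = (X2[1] + X2[3]) * inv_root2  # shared by outputs 1 and 5
    d = (X2[1] - X2[3]) * inv_root2  # shared by outputs 3 and 7
    return [
        X1[0] + X2[0], X1[1] + s, X1[2] + X2[2], X1[3] + d,
        X1[0] - X2[0], X1[1] - s, X1[2] - X2[2], X1[3] - d,
    ]


def dht_direct(x):
    """Direct evaluation of equation [1], used as the reference."""
    n_pts = len(x)
    return [
        sum(
            x[n] * (math.cos(2.0 * math.pi * k * n / n_pts)
                    + math.sin(2.0 * math.pi * k * n / n_pts))
            for n in range(n_pts)
        )
        for k in range(n_pts)
    ]


if __name__ == "__main__":
    sample = [1, 2, 3, 4, 5, 6, 7, 8]
    error = max(abs(a - b) for a, b in zip(dht8(sample), dht_direct(sample)))
    print("max abs difference:", error)
```

Only two scaled terms are needed, since each is reused with opposite sign in a second output.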
The implementation is shown as a block diagram in Figure [2].
Each of the two 4-DHT modules uses 8 adders/subtracters, each of the two byRoot2
modules uses 2 adders, the addsub block uses 2 adders and the final stage uses
8 adders, for a total of 16 + 4 + 2 + 8 = 30 adders and subtracters being
synthesized. An estimated input-to-output delay of 22 ns was reported, which is
about twice the estimated I/O delay of the 4-DHT (10.5 ns), showing that this
implementation masks the expensive computation of, and multiplication with,
sines and cosines. This approach is scalable to higher powers of two in a similar way.
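The report does not list the shift-and-add sequence used inside the byRoot2 module. As one possible illustration, the Python sketch below approximates 1/√2 by 1/2 + 1/4 - 1/32 = 0.71875, which needs one addition and one subtraction. The choice of constants, the function names, and the reading of the (8.4) notation as 4 fractional bits are all assumptions made for this example, not details taken from the design.

```python
import math

FRAC_BITS = 4  # assumed: 4 fractional bits in the (8.4) fixed-point format


def to_fixed(value):
    """Convert a real number to a fixed-point integer (assumed scaling)."""
    return int(round(value * (1 << FRAC_BITS)))


def from_fixed(raw):
    """Convert a fixed-point integer back to a real number."""
    return raw / (1 << FRAC_BITS)


def by_root2(raw):
    """Approximate raw / sqrt(2) as raw * (1/2 + 1/4 - 1/32).

    Python's >> on negative integers is an arithmetic shift, which
    matches two's complement hardware shifters. Each shift truncates,
    so small rounding errors add to the approximation error.
    """
    return (raw >> 1) + (raw >> 2) - (raw >> 5)


if __name__ == "__main__":
    for value in (8.0, -8.0, 3.5):
        approx = from_fixed(by_root2(to_fixed(value)))
        exact = value / math.sqrt(2.0)
        print(value, approx, round(exact, 4))
```

With these constants the scaling factor is about 1.6% larger than 1/√2; a different set of shifts would trade accuracy against adder count.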
