Discrete Hartley Transforms On FPGA
Sriseshan S (EE09B060), Murali Naik (EE09B055)
April 26, 2013
Introduction
In this assignment, we focus on DHTs that have power-of-2 lengths and carry
out the following:
1. Show mathematically how an N-point DHT can be computed from two
N/2-point DHTs.
The N-point DHT of a sequence x(n) is defined as:

$$X(k) = \sum_{n=0}^{N-1} x(n)\left(\cos\frac{2k\pi n}{N} + \sin\frac{2k\pi n}{N}\right) \tag{1}$$

for k, n = 0, 1, 2, ..., N - 1.
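As a reference point for the derivation, the definition in [1] can be evaluated directly. The following is a minimal Python sketch, not the FPGA implementation; the function name `dht_direct` is hypothetical and chosen only for illustration.

```python
import math

def dht_direct(x):
    """Evaluate the DHT definition term by term (O(N^2) operations)."""
    n_points = len(x)
    out = []
    for k in range(n_points):
        acc = 0.0
        for n, sample in enumerate(x):
            angle = 2.0 * math.pi * k * n / n_points
            # Each term is x(n) * (cos + sin) of the same angle.
            acc += sample * (math.cos(angle) + math.sin(angle))
        out.append(acc)
    return out

# For the input 1, 2, 3, 4 the result is approximately [10, -4, -2, 0].
print(dht_direct([1.0, 2.0, 3.0, 4.0]))
```

Every output needs N cosine and sine evaluations in this form, which is the cost the decomposition below is meant to avoid.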
• [1] can also be written as:
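$$\begin{aligned}
X(k) ={}& \sum_{n=0}^{\frac{N}{2}-1} x(2n)\left(\cos\frac{4k\pi n}{N} + \sin\frac{4k\pi n}{N}\right) \\
&+ \sum_{n=0}^{\frac{N}{2}-1} x(2n+1)\left(\cos\frac{2k\pi(2n+1)}{N} + \sin\frac{2k\pi(2n+1)}{N}\right)
\end{aligned} \tag{2}$$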
• Let x1 and x2 be the even and odd subsequences of x, with length N/2 each.
Their corresponding DHTs can be written as:
$$X_1(k) = \sum_{n=0}^{\frac{N}{2}-1} x(2n)\left(\cos\frac{4k\pi n}{N} + \sin\frac{4k\pi n}{N}\right) \tag{3}$$

$$X_2(k) = \sum_{n=0}^{\frac{N}{2}-1} x(2n+1)\left(\cos\frac{4k\pi n}{N} + \sin\frac{4k\pi n}{N}\right) \tag{4}$$
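• Let α = 2kπ/N. Expanding the odd-indexed terms of [2] with the angle-sum identities for cosine and sine gives:

$$X(k) = X_1(k) + \sum_{n=0}^{\frac{N}{2}-1} x(2n+1)\left[\cos\alpha\left(\cos\frac{4k\pi n}{N} + \sin\frac{4k\pi n}{N}\right) + \sin\alpha\left(\cos\frac{4k\pi n}{N} - \sin\frac{4k\pi n}{N}\right)\right] \tag{5}$$

Consider

$$X_1(k) + X_2(k)\cos\alpha + X_2\!\left(\frac{N}{2} - k\right)\sin\alpha$$

From [4], replacing k by N/2 - k leaves the cosine unchanged and flips the sign of the sine, so X2(N/2 - k) is the sum of x(2n + 1)(cos(4kπn/N) - sin(4kπn/N)). The expression above therefore equals:

$$X_1(k) + \sum_{n=0}^{\frac{N}{2}-1} x(2n+1)\left[\cos\alpha\left(\cos\frac{4k\pi n}{N} + \sin\frac{4k\pi n}{N}\right) + \sin\alpha\left(\cos\frac{4k\pi n}{N} - \sin\frac{4k\pi n}{N}\right)\right] \tag{6}$$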
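• Since the right-hand sides of [5] and [6] are identical:

$$X(k) = X_1(k) + X_2(k)\cos\alpha + X_2\!\left(\frac{N}{2} - k\right)\sin\alpha \tag{7}$$

Here the indices of X1 and X2 are taken modulo N/2, since both are periodic with period N/2.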
Thus an N-point DHT can be obtained from two N/2-point DHTs. For each k, this
approach needs only one cosine and one sine value (cos α and sin α) instead of
a cosine and sine for every term of the sum in [1], and it allows the N/2-point
hardware to be reused.
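The recursion in [7] can be checked numerically with a short Python sketch. This is an illustrative software model only, assuming a power-of-2 input length; the function name `dht_recursive` is hypothetical, and the indices of X1 and X2 are wrapped modulo N/2.

```python
import math

def dht_recursive(x):
    """Compute the DHT by splitting into even and odd subsequences."""
    n_points = len(x)
    if n_points == 1:
        # A 1-point DHT is the sample itself.
        return [float(x[0])]
    half = n_points // 2
    x1 = dht_recursive(x[0::2])  # DHT of the even subsequence
    x2 = dht_recursive(x[1::2])  # DHT of the odd subsequence
    out = []
    for k in range(n_points):
        alpha = 2.0 * math.pi * k / n_points
        out.append(
            x1[k % half]
            + x2[k % half] * math.cos(alpha)
            + x2[(half - k) % half] * math.sin(alpha)
        )
    return out

# For the input 1, 2, 3, 4 the result is approximately [10, -4, -2, 0].
print(dht_recursive([1.0, 2.0, 3.0, 4.0]))
```

Only one cosine and one sine are evaluated per output at each level, and the two half-length transforms are independent of each other.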
The following sections outline implementations on FPGA.
Note: The outputs displayed on LCD and through simulation were verified
using MATLAB. The LCD module was not considered for the synthesis reports.
Figure 1: Systolic array for 4-DHT
4-point DHT
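For N = 4, every cos + sin term in [1] evaluates to ±1, and hence [1] is
equivalent to a matrix-vector multiplication of the form:

$$\begin{bmatrix} x_1 & x_2 & x_3 & x_4 \end{bmatrix}
\begin{bmatrix}
1 & 1 & 1 & 1 \\
1 & 1 & -1 & -1 \\
1 & -1 & 1 & -1 \\
1 & -1 & -1 & 1
\end{bmatrix}$$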
This can be implemented using a systolic array as shown in Figure [1].
Upon implementing this on an FPGA, the synthesis tool, as expected, reports
a total of 12 adders and subtracters. An estimated input-to-output delay
of 12 ns was reported. After an initial latency of 3 additions, a throughput
of 1 output per addition is expected. Nearly identical results are achieved
if identical processing elements (controlled by an appropriate tag bit that
selects between add and subtract) are used.
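The ±1 structure of the 4-point transform matrix can be confirmed by evaluating the cos + sin kernel of [1] for N = 4. The sketch below is a numerical check in Python; the function name `hartley_matrix_4` is hypothetical, and rounding to integers is valid only because every entry is exactly ±1 for this length.

```python
import math

def hartley_matrix_4():
    """Build the 4x4 DHT matrix; entry (k, n) is cos + sin of 2*pi*k*n/4."""
    n_points = 4
    matrix = []
    for k in range(n_points):
        row = []
        for n in range(n_points):
            angle = 2.0 * math.pi * k * n / n_points
            # Rounding removes floating-point noise around the exact +/-1 values.
            row.append(round(math.cos(angle) + math.sin(angle)))
        matrix.append(row)
    return matrix

for row in hartley_matrix_4():
    print(row)
```

The matrix is symmetric, so multiplying by a row vector on the left or a column vector on the right gives the same four outputs.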
Given the nature of the transform matrix, a two stage pipelined design may
be implemented based on the following:
Stage 1 :
b1 = x0 + x2
b2 = x1 + x3
b3 = x0 − x2
b4 = x1 − x3
Stage 2 :
X0 = b1 + b2
X1 = b3 + b4
X2 = b1 − b2
X3 = b3 − b4
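The two stages above map directly onto eight additions and subtractions. The following Python sketch models the data flow only (it says nothing about timing or registers); the function name `dht4_two_stage` is hypothetical.

```python
def dht4_two_stage(x0, x1, x2, x3):
    """4-point DHT as two stages of four add/subtract operations each."""
    # Stage 1: combine samples that are two positions apart.
    b1 = x0 + x2
    b2 = x1 + x3
    b3 = x0 - x2
    b4 = x1 - x3
    # Stage 2: combine the stage-1 results into the four outputs.
    return (b1 + b2, b3 + b4, b1 - b2, b3 - b4)

print(dht4_two_stage(1, 2, 3, 4))  # prints (10, -4, -2, 0)
```

Compared with the 12 adders of the direct matrix form, sharing the intermediate values b1 to b4 brings the count down to 8.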
8-point DHT
For higher values of N, a conventional design based on Figure [1] will have
high latency and low throughput, because the processing elements will no
longer be simple adders/subtracters: multiplications with various sines and
cosines, which may have to be computed or fetched, will be required.
Instead, [7] makes use of a single cosine and sine value per output, and the
subproblems (the two N/2-point DHTs) can be computed in parallel. The cosine
and sine values can be computed using CORDIC and the multiplications done by
dedicated hardware, or the products can be retrieved from a ROM (feasible
especially when the profile of the input is known), thus eliminating the need
for computing and multiplying. However, for N = 8 and an (8.4)-bit fixed-point
two's complement notation, the only nontrivial factor is 1/√2, so we implement
division by √2 using shifters and adders. [7] can be rewritten as follows:
Figure 2: 8-DHT
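N = 8:

$$\begin{aligned}
X(0) &= X_1(0) + X_2(0) \\
X(1) &= X_1(1) + \frac{1}{\sqrt{2}}\left(X_2(1) + X_2(3)\right) \\
X(2) &= X_1(2) + X_2(2) \\
X(3) &= X_1(3) + \frac{1}{\sqrt{2}}\left(X_2(1) - X_2(3)\right) \\
X(4) &= X_1(0) - X_2(0) \\
X(5) &= X_1(1) - \frac{1}{\sqrt{2}}\left(X_2(1) + X_2(3)\right) \\
X(6) &= X_1(2) - X_2(2) \\
X(7) &= X_1(3) - \frac{1}{\sqrt{2}}\left(X_2(1) - X_2(3)\right)
\end{aligned}$$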
The corresponding block diagram is shown in Figure [2].
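The eight equations for N = 8 can be modelled in software to check them against the definition. The Python sketch below uses floating-point multiplication by 1/√2 purely for verification, which is an assumption of this example and not the shift-and-add hardware; the names `dht4` and `dht8` are hypothetical.

```python
import math

def dht4(x):
    """4-point DHT using two stages of additions and subtractions."""
    b1, b2 = x[0] + x[2], x[1] + x[3]
    b3, b4 = x[0] - x[2], x[1] - x[3]
    return [b1 + b2, b3 + b4, b1 - b2, b3 - b4]

def dht8(x):
    """8-point DHT built from two independent 4-point DHTs."""
    even = dht4(x[0::2])  # X1(k), DHT of x(0), x(2), x(4), x(6)
    odd = dht4(x[1::2])   # X2(k), DHT of x(1), x(3), x(5), x(7)
    inv_root2 = 1.0 / math.sqrt(2.0)
    # The only two products needed: (X2(1) + X2(3)) and (X2(1) - X2(3)) over sqrt(2).
    s = (odd[1] + odd[3]) * inv_root2
    d = (odd[1] - odd[3]) * inv_root2
    upper = [even[0] + odd[0], even[1] + s, even[2] + odd[2], even[3] + d]
    lower = [even[0] - odd[0], even[1] - s, even[2] - odd[2], even[3] - d]
    return upper + lower

# A unit impulse at n = 0 transforms to eight ones.
print(dht8([1, 0, 0, 0, 0, 0, 0, 0]))
```

The upper and lower halves of the output reuse the same two scaled terms, differing only in sign.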
Each 4-DHT module uses 8 adders/subtracters, each byRoot2 module uses
2 adders, the addsub block uses 2 adders and the final stage uses 8 adders,
for a total of 30 adders and subtracters synthesized
(2 × 8 + 2 × 2 + 2 + 8 = 30). An estimated input-to-output delay of 22 ns was
reported, which is about twice the estimated I/O delay of the 4-DHT, showing
that this implementation masks the expensive computation of, and
multiplication with, sines and cosines. The same decomposition scales to
higher powers of two.
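Division by √2 with shifters and two adders can be sketched as follows. The report does not state which shift amounts the byRoot2 module uses, so the approximation 1/2 + 1/4 - 1/32 = 0.71875 below is an assumption, chosen only because it needs exactly two add/subtract operations. The sketch also assumes that (8.4) means 4 fractional bits; all names are hypothetical.

```python
FRAC_BITS = 4  # assumed number of fractional bits in the (8.4) format

def to_fixed(value):
    """Convert a real number to a raw fixed-point integer."""
    return int(round(value * (1 << FRAC_BITS)))

def from_fixed(raw):
    """Convert a raw fixed-point integer back to a real number."""
    return raw / (1 << FRAC_BITS)

def by_root2(raw):
    """Approximate raw / sqrt(2) as raw/2 + raw/4 - raw/32.

    Python's >> on integers is an arithmetic shift, matching the behaviour
    of a two's complement shifter on negative values.
    """
    return (raw >> 1) + (raw >> 2) - (raw >> 5)

raw = to_fixed(2.0)                  # 32 in raw units
print(from_fixed(by_root2(raw)))     # 1.4375, versus 2 / sqrt(2) = 1.4142...
```

The coefficient 0.71875 is roughly 1.6% above 1/√2, and each shift truncates low-order bits, so small inputs lose further precision; a different set of shifts would trade accuracy against adder count.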