Professional Documents
Culture Documents
Lifeng Miao, Jun Jason Zhang, Chaitali Chakrabarti, Antonia Papandreou-Suppappola
Lifeng Miao, Jun Jason Zhang, Chaitali Chakrabarti, Antonia Papandreou-Suppappola
Lifeng Miao, Jun Jason Zhang, Chaitali Chakrabarti, Antonia Papandreou-Suppappola
School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ
{lifeng.miao, jun.zhang.ee, chaitali, papandreou}@asu.edu
20
,QSXWSDUWLFOHV 3(
,QLWLDO [ÖW + ZW + UL 5HSOLFDWHG
3DUWLFOHV 3DUWLFOH
,0+
6DPSOLQJ :HLJKWLQJ ,QGH[
VDPSOHU
0HPRU\
3( 3( 3( 3( 50(0
3DUWLFOH 0HDQ
[O[K ;/;+ [O[K ;/;+ [O[K ;/;+ [O[K ;/;+ *URXSDQG 3DUWLFOH [W
μ Z M ρM μ Z M ρM μ Z M ρM μ Z M ρM 0HPRU\
30(0
0HDQ 0HPRU\
030(0
'DWD
7LPHFRQWURO &8 [O [K μZ
&RPSXWH;/;+ ;/ ;+ ρM
ρM
&RPSXWHUHSOLFDWHIDFWRU
Fig. 2: Block diagram of the PE.
Fig. 1: PPF-IMH architecture with four PEs.
:SURG
executed in the CU. Each PE communicates with the CU, but
5HPDLQ
5HSOLFDWHG
there is no inter-PE communication. Fig. 1 also shows the UL 3DUWLFOH
3DUWLFOHLQGH[ L 08; ,QGH[
data that is transferred between the PE and CU. 0HPRU\
50(0
21
5 portantly, we will need to use the PPF-IMH algorithm to esti-
;/
σ
FRXQWHU mate the posterior PDF that would result in the waveform pa-
;+
− × + WKUHVKROG [ M
¦ ÷
μ [
rameter θ̂k that optimizes a performance cost function when
FRPSDULVRQ
[L estimating xk . For example, the optimization may be the
FRXQWHU
+ WKUHVKROG
μ [
minimization of the the mean-squared estimation error or the
[ M
[L
FRPSDULVRQ ¦ ÷ maximization of the signal-to-noise ratio (SNR) with respect
[L
to all possible waveform parameters θk [3].
The current PF hardware architecture can be extended to in-
clude adaptive waveform design. Instead of reading from the
+ WKUHVKROG5
FRXQWHU1 memory, the CU now has to update the measurement adap-
μ [5
tively, based on selecting the waveform parameter θk to op-
÷
FRPSDULVRQ
[L [5 M ¦
timize a cost function. The implementation of this step does
Fig. 4: Block diagram of the group-and-mean unit. not increase the PF computational intensity, and it does not
present a severe bottleneck for hardware architecture.
[OL
&8 where vk and nk are zero-mean Gaussian random variables
with variances σv2 = 10 and σn2 = 1, respectively.
[O!;/
VWD\
FKDQJH
;/
zk+1 = (x2k+1 /20) + (1/[α (xk+1 − θk+1 )2 + β]) + nk+1
[KL [K!;+
FKDQJH where α > 0 and β > 0 are constant numbers. At time k,
&RPSDULVRQ we can choose appropriate θk+1 for time (k + 1) to improve
;+ [K;+ estimation performance. Specifically, at each time k, we use
08;
22
Model 1 and Model 2 systems. The corresponding estima- 6.5
tion results are shown in Fig. 6 and Fig. 7. Table 2 shows the RMSE of nonadaptive PF
6
estimation root mean-squared error (RMSE) for the two mod- RMSE of adaptive PF
lel algorithm in [7] for both models. In addition, the RMSE 4.5
RMSE
performance of the PPF-IMH is close to the systematic resam-
4
pling algorithm, which means that the performance degrada-
tion due to parallelization in [7] is overcome by using IMH 3.5
2.5
2
0 10 20 30 40 50
8 Monte Carlo simulation times
25
6.3. Hardware Performance Evaluation
True
20 Algorithm in [6] Resource utilization: Table 3 summarizes the 1-PE and 4-PE
15
Proposed algorithm
architecture resource utilization. The sinusoidal and exponen-
10
tial functions are implemented using CORDIC units, and the
5
rest of the units are implemented using DSP cores. For the
4-PE architecture, the PE and CU occupied slices utilizations
0
are 408 (1%) and 420 (1%), respectively. Our resource uti-
−5
lization is fairly low; only about 5% of the total hardware re-
−10
sources are used. Thus, 20 such architectures should be able
−15 to fit onto a single Xilinx Virtex-5 platform.
−20
23
replication factor, respectively. Thus, one PPF-IMH iteration ing IMH sampling with parallel PF implementation. The pro-
takes Ttotal = (Ls + Lw + N + Lr + N + Lm + Lρ ) × Tclk = posed method was also implemented on the Xilinx Virtex-5
585 cycles. For a system clock rate of 100 MHz, the total FPGA platform. While it is difficult to give an apple-to-apple
processing period for one iteration is Ttotal = 5.85 μs. comparison with other FPGA based implementations because
of the differences in the model and the number of particles,
6DPSOLQJ we can still claim that due to algorithm modification, our
:HLJKWFRPSXWLQJ *URXS computation time is smaller though our resource utilization
,0+VDPSOHU 0HDQ is slightly larger. Furthermore, we discussed the feasibility of
*OREDO
ρ
&RPSXWH
parameter adaptation in the parallel PF algorithm for future
UDQJLQJ
agile sensing estimation problems. Simulations demonstrated
/V /Z 1 /U 1 /P /S
that the estimation performance is significantly improved and
Fig. 9: Execution time of proposed method. the processing speed is faster due to parallelization.
Communication overhead: The communication overhead of
the proposed algorithm is 96 bytes when using N = 1, 000 8. REFERENCES
particles, P = 4 PEs, and R = 10 groups. This is a signif-
icant reduction compared to the traditional algorithm whose [1] R. E. Kalman, “A new approach to linear filtering and pre-
communication overhead is 2,500 bytes. diction problems,” Transaction of the ASME-Journal of Basic
Engineering, pp. 35–45, March 1960.
Scalability: Fig. 10 shows the execution time and communi-
cation overhead for one processing iteration versus the num- [2] N. J. Gordon, D. J. Salmon, and A. F. M. Smith, “Novel ap-
ber P of PEs for the proposed parallel architecture. The proach to nonlinear/non-Gaussian Bayesian state estimation,”
processing period curve saturates when the number of PEs in IEE Proceedings F in Radar and Signal Processing, 1992,
vol. 140, pp. 107–113.
is large because there is no significant speedup when M/P
approaches the constant latency L, which in this case, is [3] S. Sira, A. Papandreou-Suppappola, and D. Morrell, “Dynamic
L = Ls + Lw + Lr + Lm + Lρ = 85 cycles. The communica- configuration of time-varying waveforms for agile sensing and
tion overhead curve increases linearly with respect to P , and tracking in clutter,” IEEE Trans. on Signal Processing, vol. 55,
pp. 3207–3217, July 2007.
the slope is equal to 2R + 4 where R is the number of groups
in each PE. Thus, increasing the number of PEs beyond 4 has [4] M. Bolić, Architectures for Efficient Implementation of Particle
little benefit. However, this number is likely to change, based Filters, Ph.D. thesis, State University of New York at Stony
on the number of particles and hardware implementation of Brook, 2004.
the model equations. [5] A. Athalye, M. Bolić, S. Hong, and P. M. Djurić, “Generic
hardware architectures for sampling and resampling in particle
filters,” EURASIP Journal of Applied Signal Processing, vol.
2500 17, pp. 2888–2902, 2005.
Processing period (cycles)
2000
[6] A. C. Sankaranarayanan, R. Chellappa, and A. Srivastava, “Al-
1500
gorithmic and architectural design methodology for particle fil-
1000 ters in hardware,” in IEEE Int. Conf. Computer Design, Oct.
500 2005, pp. 275–280.
0
0 2 4 6 8 10 12 14 16 18 [7] B. B. Manjunath, A. S. Williams, C. Chakrabarti, and
Number of PEs (P) A. Papandreou-Suppappola, “Efficient mapping of advanced
Communication overhead (bytes)
400
signal processing algorithms on multi-processor architectures,”
300 in IEEE Workshop on Design and Impl. Signal Proc. Systems,
200
October 2008, pp. 269–274.
100
[8] C. P. Robert and G. Casella, Monte Carlo Statistical Methods,
Springer-Verlag, New York, 1999.
0
0 2 4 6 8 10
Number of PEs (P)
12 14 16 18 [9] S. Hong, Z. Shi, and K. Chen, “Easy-hardware-implementation
MMPF for maneuvering target tracking: Algorithm and archi-
tecture,” Journal of Signal Processing Systems, Feb. 2010.
Fig. 10: Scalability of proposed parallel architecture. [10] R. van der Merwe, J. F. G. de Freitas, A. Doucet, and E. A.
Wan, “The unscented particle filter,” Tech. Rep., Engineering
Dept., Cambridge University, Aug. 2000.
7. CONCLUSION [11] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, “A
tutorial on particle filters for online nonlinear non-Gaussian
In this paper, we proposed an efficient parallel architecture
Bayesian tracking,” IEEE Trans. on Signal Processing, vol.
for implementing PFs. This architecture can achieve both 50, pp. 173–449, Feb. 2002.
high speed and accurate estimation performance by employ-
24