Li 2016

A HIGH SPEED PARALLEL TIMING SYNCHRONIZATION ALGORITHM
FOR 16QAM
HAO LI, ZHI-GANG WANG, HOU-JUN WANG
School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
E-MAIL: haoli_uestc@163.com
978-1-5090-6126-6/16/$31.00 ©2016 IEEE

Abstract:
This paper presents an efficient parallel timing synchronization structure that is suitable for
high speed communication demodulation systems and easy to implement on an FPGA platform. First,
a new parallel timing synchronization structure is presented. The proposed parallel structure is
a feedback loop made up of a Farrow interpolation filter, a Gardner error detector and a
numerically controlled oscillator (NCO) controller. The parallel implementation of each of these
modules is then studied; in particular, the parallel NCO algorithm uses a decision based on the
sign bit. Through this feedback loop, the best symbol points are obtained after timing
synchronization. Simulation results show that the entire parallel timing synchronization
algorithm works correctly and that the system performance loss does not exceed 1 dB.

Keywords:
Parallel timing synchronization algorithm structure; Farrow interpolation filter; Gardner algorithm

1. Introduction

With the development of wireless communication technology, the required data transmission rate
in high speed demodulation systems has reached several gigabits per second (Gbps). To transmit
gigabit data rates in a limited bandwidth, the most effective option is to use broadband
modulation techniques such as 16QAM.

Timing synchronization is a key technology of the high speed demodulation system [1]. Commonly
used timing synchronization algorithms are based on the O&M algorithm [2] and the Gardner
algorithm [3]. The O&M algorithm estimates the timing error in the frequency domain. Its
advantage is that it yields accurate timing error information, but it requires a squaring
operation and a DFT of the signal, which consume a lot of resources. The Gardner algorithm is a
non-data-aided error detection algorithm. It calculates a timing error value from only two
sample points in each symbol period, and it is independent of carrier synchronization, as
demonstrated in [3]. In [4], the authors combine the Farrow interpolation filter [5] with the
Gardner algorithm to obtain the classic serial timing synchronization algorithm. Other similar
serial structures are presented in [6] and [7]. However, these traditional serial structures
cannot handle high-rate data streams because of the limits of current digital hardware devices.
Hence, a parallel timing synchronization architecture based on the Farrow interpolation filter
and the Gardner algorithm is needed. A parallel timing synchronization architecture based on the
Gardner algorithm is presented in [8]; it contains a sample dropping unit, a parallel
interpolation filter and a loop filter control unit. However, only the interpolation filter is
processed in parallel, and the other important modules remain serial. A fully parallel timing
synchronization structure is given in [9], but without a specific theoretical derivation.

This paper proposes a new fully parallel high speed timing synchronization structure in the
digital domain. The parallel structure is based on the Farrow interpolator and the Gardner
algorithm. Within the parallel loop architecture, the improved parallel Gardner error detector
and the parallel NCO control algorithm are analyzed in detail. This fully digital parallel timing
synchronization algorithm is well suited to 16QAM and easy to implement on an FPGA platform.

The rest of the paper is organized as follows. Section 2 describes the parallel timing
synchronization algorithm and its implementation structure. Section 3 shows the results of
simulation and FPGA implementation. Finally, conclusions are drawn in Section 4.

2. Parallel timing synchronization

Because the sample clock of the receiver is not synchronized with the clock of the sender, the
received samples are not taken at the best sample points. However, given the input samples and
the interpolation position, the present
structure can obtain the best sample value by interpolation. The interpolation position is
supplied by the NCO controller. The output of the serial Farrow interpolation filter based on
polynomial fitting can be written as [10]

$$y(k) = \sum_{l=0}^{N} \mu_k^{l}\, v(l) \tag{1}$$

where $v(l) = \sum_{i} b_l(i)\, x(m_k - i)$, $\mu_k$ denotes the interpolation position and
$b_l(i)$ are the polynomial coefficients.

The serial Farrow structure corresponding to equation (1) is described in [10] and is not
repeated here. The designed four-parallel timing synchronization structure, shown in Figure 1,
is a feedback loop. In Figure 1, the parallel Farrow interpolation filter can be viewed as four
groups of parallel FIR filters H(i), i = 0, 1, 2, 3. Each group has different filter
coefficients, which are obtained by polynomial fitting. The parallel FIR filter structure can be
implemented simply with the register cache method. The timing adjust block selects the correct
signal from the data FIFO. The following sections focus on the parallel loop structure of the
Gardner error detector and the NCO controller.

Fig.1 Four-parallel timing synchronization structure (inputs x(4n) to x(4n+3); blocks: parallel
FIR filters H(0) to H(3), timing adjust, Gardner error detector, loop filter, parallel NCO)

2.1. Gardner error detector

The Gardner error detector detects the loop timing error. The Gardner algorithm was derived for
QPSK modulated signals, but it is also suitable for 16QAM modulated signals, as verified in
[11]. The timing error detection algorithm for 16QAM can be expressed as

$$e(r) = [\,y(r - 1/2) - a\,]\,[\,y(r) - y(r-1)\,] \tag{2}$$

where $a = [\,y(r) + y(r-1)\,]/2$, $y(r)$ denotes the r-th symbol sample point and $y(r-1/2)$
denotes the sample point midway between the (r-1)-th and the r-th symbols.

Equation (2) uses two symbol sample points and the intermediate sample point between them, so
the input data rate of the Gardner timing error detector must be twice the symbol rate. The
algorithm estimates one timing error value per symbol period and is easy to implement. However,
the traditional serial Gardner algorithm cannot keep up with high data rate transmission, so a
parallel Gardner algorithm is derived from it.

Assume that the Gardner error detector receives four parallel input samples per clock cycle.
The parallel Gardner algorithm is then

$$\begin{cases}
e(2n) = [\,y(4n) - a(2n)\,]\,[\,y(4n+1) - y(4n-1)\,] \\
e(2n+1) = [\,y(4n+2) - a(2n+1)\,]\,[\,y(4n+3) - y(4n+1)\,]
\end{cases} \tag{3}$$

where $a(2n) = [\,y(4n+1) + y(4n-1)\,]/2$, and $a(2n+1)$ is formed in the same way from
$y(4n+3)$ and $y(4n+1)$. Here $y(4n+i)$, i = 0, 1, 2, 3, are the four parallel inputs of the
current clock cycle and $y(4n-1)$ is the fourth input of the previous clock cycle. In every
clock cycle, the fourth input therefore has to be delayed by one clock cycle through a register.

Fig.2 Four-parallel Gardner error detector structure

In addition, $y(4n-1)$ and $y(4n+i)$, i = 1, 3, are symbol sample points, while $y(4n)$ and
$y(4n+2)$ are intermediate sample points. The parallel Gardner algorithm thus calculates two
timing error values in parallel from four parallel input samples in one clock cycle. Figure 2
shows the implementation of the four-parallel Gardner error detector. In Figure 2, the average
timing error $\varepsilon(n)$ is calculated to simplify the subsequent processing in the loop
filter.
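As a minimal sketch of the detector in Section 2.1, the Python fragment below evaluates the timing error once per symbol in a serial loop and then in blocks of four samples per clock, so that the two forms can be compared. It assumes real-valued samples from a single rail, the sign convention e = [mid - mean] [current - previous], and a delay register initialised to zero. The function names are illustrative and are not taken from the paper.

```python
def gardner_error(prev_symbol, mid, cur_symbol):
    """Equation (2): (mid - mean of the two symbols) * (symbol difference)."""
    a = (cur_symbol + prev_symbol) / 2
    return (mid - a) * (cur_symbol - prev_symbol)


def gardner_serial(y, y_init=0.0):
    """One error per symbol from a stream at two samples per symbol.

    Assumed layout: even indices hold intermediate samples and odd indices
    hold symbol samples, so y[k - 1], y[k], y[k + 1] play the roles of
    y(r-1), y(r-1/2), y(r) for even k.
    """
    errors = []
    prev_symbol = y_init
    for k in range(0, len(y) - 1, 2):
        errors.append(gardner_error(prev_symbol, y[k], y[k + 1]))
        prev_symbol = y[k + 1]
    return errors


def gardner_four_parallel(y, y_init=0.0):
    """Equation (3): two errors and their average per clock of four samples."""
    register = y_init  # holds y(4n-1), the fourth input of the previous clock
    per_clock = []
    for n in range(len(y) // 4):
        y0, y1, y2, y3 = y[4 * n: 4 * n + 4]
        e_even = gardner_error(register, y0, y1)  # e(2n)
        e_odd = gardner_error(y1, y2, y3)         # e(2n+1)
        per_clock.append((e_even, e_odd, (e_even + e_odd) / 2))
        register = y3  # delay the fourth input by one clock
    return per_clock
```

An intermediate sample that sits exactly at the mean of its two neighbouring symbols gives zero error: `gardner_error(-1.0, 0.0, 1.0)` returns `0.0`. Flattening the first two entries of each tuple returned by `gardner_four_parallel` reproduces the list returned by `gardner_serial` for the same input.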
2.2. Loop filter

The loop filter filters the timing error. It reduces the influence of high frequency noise and
therefore the jitter of the timing error signal, which smooths the timing error value and makes
the timing synchronization loop more stable. Only a common loop filter structure is given here,
as shown in Figure 3.

Fig.3 Loop filter structure (coefficients k1 and k2, delay element D)

In Figure 3, $k_1$ and $k_2$ are the loop filter coefficients, which determine the convergence
rate of the timing loop. Their calculation is given in detail in [12].

2.3. Parallel NCO

The parallel NCO control unit provides the parallel fractional intervals for the Farrow
interpolation filter, which determine the interpolation positions. It also provides the parallel
overflow enable signals for the timing adjust unit.

As pointed out in [4], the NCO control unit is in essence a difference equation:

$$\eta(m) = [\,\eta(m-1) - w(m)\,] \bmod 1 \tag{4}$$

where mod denotes the modulo function, which keeps only the remainder. Specifically,

$$\eta(m) = \begin{cases}
\eta(m-1) - w(m) + 1, & \eta(m-1) - w(m) < 0 \\
\eta(m-1) - w(m), & \text{otherwise}
\end{cases} \tag{5}$$

Here $\eta(m)$ is the value of the NCO register at the m-th working clock and $w(m)$ is the
input control word of the NCO control unit.

The working clock of the NCO is $1/T_s$, and $1/T_i$ is the output data rate after the
interpolator. The input control word $w(m)$ is adjusted by the loop filter. When the loop
reaches equilibrium, $w(m)$ is approximately constant, $w(m) \approx T_s/T_i$. From the
preceding analysis of the Gardner algorithm, $1/T_i = 2R_s$, where $R_s$ is the symbol rate.
Therefore, at the initial moment of the loop, we set $w(1) = 2R_sT_s$. Because the sample clock
$1/T_s$ is very high in a high speed demodulation system, the traditional serial NCO structure
cannot be implemented in an FPGA. We therefore propose an improved parallel NCO algorithm that
uses a decision based on the sign bit.

Suppose the parallel NCO control unit has M branches and generates M sets of data
$\eta_1(n), \eta_2(n), \ldots, \eta_M(n)$, where

$$\eta_i(n) = \eta(nM + i), \quad i = 1, 2, 3, \ldots, M \tag{6}$$

Combining this with equation (4) gives

$$\eta(nM + i) = [\,\eta(nM + i - 1) - w(n)\,] \bmod 1, \quad i = 1, 2, \ldots, M \tag{7}$$

which can be written out as

$$\begin{cases}
\eta(nM+1) = [\,\eta(nM) - w(n)\,] \bmod 1 \\
\eta(nM+2) = [\,\eta(nM+1) - w(n)\,] \bmod 1 \\
\eta(nM+3) = [\,\eta(nM+2) - w(n)\,] \bmod 1 \\
\quad\vdots \\
\eta(nM+M) = [\,\eta(nM+M-1) - w(n)\,] \bmod 1
\end{cases} \tag{8}$$

where the register variable $\eta(nM+i)$, i = 1, 2, 3, ..., M, is initialized to 0.

The process is as follows. The control word $w(n)$ is input at the initial moment; then,
according to equation (8), all M values of $\eta$ are calculated in one clock cycle, and at the
same time the sign of each of the M values is examined. Because all M values must be obtained in
one clock cycle, a pipelined design cannot be used. Equation (8) is therefore simplified to

$$\begin{cases}
\eta(nM+1) = \eta(nM) - w(n) \\
\eta(nM+2) = \eta(nM) - 2w(n) \\
\eta(nM+3) = \eta(nM) - 3w(n) \\
\quad\vdots \\
\eta(nM+M) = \eta(nM) - Mw(n)
\end{cases} \tag{9}$$

The M values of $\eta$ needed at the current time thus depend only on $\eta(nM)$ and $w(n)$, so
they can be calculated independently of each other, which achieves parallel processing. Equation
(9) needs neither a sign test on the M values nor the modulo function, since binary subtraction
is converted into an addition by using the complement. The M values can therefore be processed
in parallel in every clock cycle.
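The step from the chained recursion of equation (8) to the closed form of equation (9) can be illustrated with integer fixed-point arithmetic. The sketch below assumes an unsigned 25-bit fractional register (25 bits is the timing synchronization word length quoted in Section 3; using that width for the NCO register is an assumption) and relies on the fact that masking a result to the register width is the same as reducing it modulo 1. The names `nco_serial`, `nco_parallel` and `run_parallel` are hypothetical.

```python
FRAC_BITS = 25          # assumed register width; the value 1.0 equals 2**25
ONE = 1 << FRAC_BITS
MASK = ONE - 1


def nco_serial(eta, w, steps):
    """Equations (4) and (5): subtract w each clock, add 1 on underflow."""
    values = []
    for _ in range(steps):
        eta -= w
        if eta < 0:
            eta += ONE
        values.append(eta)
    return values


def nco_parallel(eta, w, m):
    """Equation (9): each branch depends only on eta(nM) and w(n).

    The mask models the wrap-around of a fixed-width binary register,
    so no explicit sign test or modulo step is needed per branch.
    """
    return [(eta - i * w) & MASK for i in range(1, m + 1)]


def run_parallel(eta, w, m, clocks):
    """Chain several M-wide clocks; the last branch feeds the next clock."""
    values = []
    for _ in range(clocks):
        block = nco_parallel(eta, w, m)
        values.extend(block)
        eta = block[-1]
    return values
```

With a control word of one quarter (`w = 1 << 23`) and a register starting at zero, `nco_serial(0, 1 << 23, 4)` steps through 3/4, 1/2, 1/4 and 0 in register units, and `run_parallel` returns the same sequence as `nco_serial` for the same number of samples.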
The M-parallel NCO overflow enable signals are given by

$$\begin{cases}
en[1] = \mathrm{sgn}[\eta(nM)] \oplus \mathrm{sgn}[\eta(nM+1)] \\
en[2] = \mathrm{sgn}[\eta(nM+1)] \oplus \mathrm{sgn}[\eta(nM+2)] \\
en[3] = \mathrm{sgn}[\eta(nM+2)] \oplus \mathrm{sgn}[\eta(nM+3)] \\
\quad\vdots \\
en[M] = \mathrm{sgn}[\eta(nM+M-1)] \oplus \mathrm{sgn}[\eta(nM+M)]
\end{cases} \tag{10}$$

where $\mathrm{sgn}[\eta]$ takes the sign bit of $\eta$ and $\oplus$ represents exclusive OR
(XOR).

The M-parallel fractional intervals are calculated from the M values of $\eta$ and $en$ as

$$\begin{cases}
\mu_1 = en[1]\ ?\ [\,\eta(nM)/w(n)\,] : \mu_0 \\
\mu_2 = en[2]\ ?\ [\,\eta(nM+1)/w(n)\,] : \mu_1 \\
\mu_3 = en[3]\ ?\ [\,\eta(nM+2)/w(n)\,] : \mu_2 \\
\quad\vdots \\
\mu_M = en[M]\ ?\ [\,\eta(nM+M-1)/w(n)\,] : \mu_{M-1}
\end{cases} \tag{11}$$

where $\mu_0$ is the fourth value from the previous clock cycle and "?" represents a logical
condition test. In detail, for the first branch,

$$\mu_1 = \begin{cases}
\eta(nM)/w(n), & en[1] = 1 \\
\mu_0, & en[1] = 0
\end{cases} \tag{12}$$

The timing adjust unit then selects the valid parallel output data according to the overflow
enable signals, and the interpolation filter calculates the correct interpolation points from
the fractional intervals.

Fig.4 Simulation results: (a) scatter plot before timing synchronization and (b) after timing
synchronization (quadrature versus in-phase); (c) fractional interval and (d) loop filter output
versus number of symbols
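The enable and fractional-interval logic of equations (10) to (12) can be sketched for one clock cycle as follows. This is only an illustration under stated assumptions: the fractional interval is taken as the register value divided by the control word, as in the interpolation control of [4]; the control word satisfies M w <= 1, so that at most one underflow occurs per clock; and the register is folded back into [0, 1) at the end of the clock, a step the text does not spell out. Exact fractions are used to keep the arithmetic easy to follow, and the function name is hypothetical.

```python
from fractions import Fraction


def parallel_nco_outputs(eta0, w, mu_prev, m=4):
    """One clock of the M-parallel NCO: enables, fractional intervals, next register.

    eta0    -- register value eta(nM), assumed to lie in [0, 1)
    w       -- control word w(n), assumed to satisfy 0 < m * w <= 1
    mu_prev -- mu_0, the last fractional interval of the previous clock
    """
    assert 0 <= eta0 < 1 and 0 < m * w <= 1
    # Equation (9) without the modulo step; d[0] is eta(nM) itself.
    d = [eta0 - i * w for i in range(m + 1)]
    sign = [1 if value < 0 else 0 for value in d]  # sign bit of each value

    enables, intervals = [], []
    mu = mu_prev
    for i in range(1, m + 1):
        en = sign[i - 1] ^ sign[i]   # equation (10): XOR of adjacent sign bits
        if en:
            mu = d[i - 1] / w        # equations (11) and (12)
        enables.append(en)
        intervals.append(mu)

    # Assumed carry-over: fold the last value back into [0, 1).
    eta_next = d[m] + 1 if d[m] < 0 else d[m]
    return enables, intervals, eta_next
```

For example, with `eta0 = Fraction(3, 8)`, `w = Fraction(1, 4)` and `mu_prev = Fraction(0)`, the register crosses zero between the first and second branch, so only the second enable is set, the fractional interval becomes 1/2 from that branch on, and the register carried into the next clock is again 3/8.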
3. Simulation and implementation results

The performance of the proposed high speed parallel timing synchronization algorithm is
simulated in MATLAB. The simulation parameters are set as follows: 16QAM modulated signal,
$M = 4$, $f_s = 1.44$ GHz, symbol rate $R_s = 180$ MHz, $w(1) = 2R_s/f_s = 1/4$, timing phase
error $\varepsilon = 0.4/f_s$, loop filter coefficients $k_1 = 2^{-10}$ and $k_2 = 2^{-21}$,
and signal to noise ratio (SNR) 20 dB. The quantization parameters are 12 bits after sampling
and 25 bits for timing synchronization.

Figure 4 shows the simulation results of the parallel system. Figure 4 (a) shows the
constellation before timing synchronization. Figure 4 (b) shows the constellation after timing
synchronization, in which the timing phase error has been compensated. Figure 4 (c) and (d)
show the fractional interval curve and the loop filter output during timing synchronization,
respectively. The fractional interval quickly converges to 1/4, and the loop filter output
converges to near zero. Figure 4 thus shows that the proposed parallel demodulator works
correctly.

Fig.5 BER of the parallel timing synchronization (BER versus EbNo in dB; curves: float
simulation BER, fixed simulation BER, theoretical BER)

Figure 5 compares the simulated BER of the parallel demodulator with the theoretical value. The
performance loss of the proposed parallel demodulation architecture is small, less than 1 dB.

A hardware platform has also been built with a Xilinx XC7VX485T FPGA and a Texas Instruments
(TI) ADC12D1600 A/D converter. A trigger capture mechanism [13] is used on this platform to
capture the signal accurately. The parameters of the hardware implementation are the same as in
the MATLAB simulation.

Table 1 Hardware resource utilization
Logic Utilization         Used     Available   Utilization
Num. of Slice Registers   36844    607200      6%
Num. of Slice LUTs        17525    303600      6%
Num. of Block/FIFO        13       1030        1%
Num. of DSP48E1s          320      2800        11%

Table 1 shows the hardware resource utilization of the
parallel timing synchronization structure after FPGA implementation. In this experiment, the
running clock of the entire system in the FPGA is 1440/8 = 180 MHz. Thus, for high speed data
streams, existing FPGAs can meet the implementation requirements through a parallel processing
architecture, and the consumption of hardware resources is acceptable.

4. Conclusions

This paper presents a new high speed parallel timing synchronization structure for 16QAM. The
parallel structure is shown to be effective and well suited to FPGA implementation. The
simulation indicates that the system performance loss relative to the theoretical value is
small, less than 1 dB. Meanwhile, the hardware implementation results show that the proposed
parallel architecture does not consume a lot of resources. This simple, efficient and
low-complexity parallel timing synchronization algorithm can be widely used in high speed
communication systems.

Acknowledgements

This paper is supported by National Natural Science Foundation of China (Grant No. 60934002).

References

[1] Jablon N K. “Joint blind equalization, carrier recovery and timing recovery for high-order
QAM signal constellations” [J]. IEEE Transactions on Signal Processing, 1992, 40(6):1383-1398.
[2] Oerder M, Meyr H. “Digital filter and square timing recovery” [J]. IEEE Transactions on
Communications, 1988, COM-36(5):605-612.
[3] Gardner F M. “A BPSK/QPSK timing-error detector for sampled receivers” [J]. IEEE
Transactions on Communications, 1986, 34(5):423-429.
[4] Gardner F M. “Interpolation in digital modems. I: Fundamentals” [J]. IEEE Transactions on
Communications, 1993.
[5] Farrow C W. “A continuously variable digital delay element” [J]. Proceedings - IEEE
International Symposium on Circuits and Systems, 1988, 3:2641-2645.
[6] Zhou X, Chen X, Zhou W, et al. “Digital timing recovery combined with adaptive equalization
for optical coherent receivers” [J]. Proceedings of SPIE - The International Society for Optical
Engineering, 2009, 7632:1-6.
[7] Zhou X, Chen X, Zhou W, et al. “All-Digital Timing Recovery and Adaptive Equalization for
112 Gbit/s POLMUX-NRZ-DQPSK Optical Coherent Receivers” [J]. Journal of Optical Communications &
Networking, 2010, 2(11):984-990.
[8] Schmidt D, Lankl B. “Parallel architecture of an all digital timing recovery scheme for high
speed receivers” [C]. International Symposium on Communication Systems Networks and Digital
Signal Processing. IEEE, 2010:31-34.
[9] Higashino S, Kobayashi S, Yamagami T. “A parallel architecture of interpolated timing
recovery for high-speed data transfer rate and wide capture-range” [C]. Optical Data Storage.
International Society for Optics and Photonics, 2007:66200Y-66200Y-6.
[10] Harris F. “Performance and design of Farrow filter used for arbitrary resampling” [C]. Int
Conf on Digital Signal Processing. 1997:595-599.
[11] D'Andrea A N, Luise M. “Optimization of symbol timing recovery for QAM data demodulators”
[J]. IEEE Transactions on Communications, 1996, 44(3):399-406.
[12] Landgrebe D A. Phaselock Techniques, 3rd Edition [J]. 2003.
[13] Guo J, Shi Y, Wang Z. “A Novel Design of DDR-based Data Acquisition Storage Module in a
Digitizer” [C]. International Conference on Communications, Circuits and Systems. IEEE,
2007:995-998.
