Yuan 2016

This article has been accepted for inclusion in a future issue of this journal.
Content is final as presented, with the exception of pagination.
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS 1
A 70 mW 25 Gb/s Quarter-Rate SerDes Transmitter and Receiver Chipset With 40 dB of Equalization in 65 nm CMOS Technology
Shuai Yuan, Student Member, IEEE, Liji Wu, Member, IEEE, Ziqiang Wang,
Xuqiang Zheng, Chun Zhang, and Zhihua Wang, Senior Member, IEEE
Abstract—A 25 Gb/s transmitter (TX) and receiver (RX) chipset

designed in a 65 nm CMOS technology is presented. The proposed quarter-rate TX architecture with divider-less clock generation not only guarantees the timing constraint for the highest-speed serialization, but also saves power compared with conventional designs. A source-series terminated (SST) driver with a 2-tap feed-forward equalizer (FFE) and a far-end crosstalk canceller (XTC) is implemented in the TX chip. The RX chip employs an adaptive quarter-rate 2-tap decision-feedback equalizer (DFE) and a baud-rate clock and data recovery (CDR). The power-efficient DFE combines the soft-decision technique with a new dynamic structure. The DFE adaption logic and baud-rate CDR logic share a set of error samplers to save power and area. A hybrid alternate clock scheme is proposed to satisfy the timing requirement and further reduce the power consumption. The measurement results show that the TX and RX chipset together compensates for a Nyquist channel loss of more than 40 dB, and consumes only 70 mW from a 1.2 V supply when operating at 25 Gb/s.
Index Terms—CMOS, crosstalk, equalization, low power,
quarter-rate, receiver, SerDes, transmitter.
Manuscript received December 28, 2015; revised March 4, 2016; accepted March 21, 2016. This work is supported by National Natural Science Foundation of China (NSFC), No. 61371011. This paper was recommended by Associate Editor S. Levantino.
The authors are with the Institute of Microelectronics, Tsinghua University, Beijing, 100084, China (e-mail: yuans07@mails.tsinghua.edu.cn).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSI.2016.2555250
1549-8328 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

Global internet traffic will grow 3.2-fold from 2014 to 2019 with a compound annual growth rate of 26% based on the estimation in [1]. In this regard the mainstream I/O system, termed the high-speed SerDes, has been widely used in many applications, such as computer systems, communication networks, and the bulk of consumer electronics. To meet the growing bandwidth demand, serial link data rates have been increased to 25–28 Gb/s, which is more than 75% higher than the previous generation of SerDes [2]. They are defined in several new standards such as OIF CEI-25G/28G, 100 Gb Ethernet, and InfiniBand EDR [3]. Power consumption and signal integrity are two primary challenges for the design of a 25 Gb/s high-speed SerDes. Some significant design advances must be accomplished at both the transmitter (TX) side and the receiver (RX) side.

Fig. 1. Traditional architecture: half-rate TX and quarter-rate RX with DFE adaption.

Fig. 1 shows a traditional half-rate TX architecture including a data path and a clock path. In the data path, the source-series terminated (SST) driver topology is considered an attractive solution for low power consumption [4]–[6]. However, few measures are taken to save the power of the clock path even though it is usually as high as that of the data path. As presented in Fig. 1, the half-rate input clock is used for the last-stage serialization, whereas the quarter-rate clock used for the preceding-stage serialization is generated by the power-hungry divider circuit. To satisfy the timing constraints of the two clock path delays, 2 sets of clock buffers (BUF#1 and BUF#2) are usually required to match the timing and to drive the large load simultaneously. However, even with this modification, it is still difficult to match the delay across PVT variations when the data rate reaches 20 Gb/s and higher. In addition, the power consumption of the buffers is also significant. Taking, for example, a 20 Gb/s transmitter with only a 1-stage serialization as reported by [7], the total power consumption is 57 mW, of which the clock path consumes 22.5 mW. The power consumption of the divider circuit and clock buffers is 8.5 mW and 7.2 mW, respectively. To solve these power consumption problems, the quarter-rate TX architecture has recently been proposed [8],
[9], in which the highest-speed serialization is achieved by using the quarter-rate clocks exclusively. However, an extra delay-locked loop (DLL) [8] or a quadrature voltage-controlled oscillator (VCO) [9] is also required to generate at least 4 phases at the quarter-rate frequency. Moreover, the 4:1 MUX utilized in the aforementioned quarter-rate TX is far more difficult to design than the common 2:1 MUX. Therefore, a quarter-rate SST TX with a novel structure [10] is proposed in this paper to save the power of both the data and clock paths. It still uses the 2:1 CMOS MUX, but the divider circuit and the clock buffers used for matching the delay are eliminated in the proposed divider-less clock generation. On the other hand, the channel loss that produces inter-symbol interference (ISI) affects the signal integrity more severely at higher frequencies. The feed-forward equalizer (FFE) is normally employed as the TX equalization. However, for multichannel applications, the crosstalk between the parallel channels also degrades the signal integrity [4], which cannot be resolved by the FFE. To mitigate the far-end crosstalk (FEXT) and the crosstalk-induced jitter (CIJ), several approaches using a crosstalk canceller (XTC) have been adopted in TX designs [11], [12]. In this paper, an SST driver merged with the FFE and XTC is proposed, which can compensate for both the channel loss and FEXT with relatively low power.

For the RX design, a quarter-rate solution (Fig. 1) is considered to be a good compromise between performance, complexity, and power [4]. Due to its strong equalization ability without enhancement of the noise and crosstalk, the decision-feedback equalizer (DFE) has become the most attractive approach at the RX side. To achieve a low-power implementation of the DFE and at the same time meet the tight requirements of the critical timing path, soft-decision [13], current-integrating [14], and charge-steering [15] techniques have been proposed in recent years. Additionally, the adaption of the DFE is also necessary for diverse transmission channels. However, as shown in Fig. 1, when it coexists with the traditional clock and data recovery (CDR) that oversamples the input signal, the DFE adaption results in more samplers with a large penalty in power and area. Thus the baud-rate CDR has been introduced by [16], [17]. Because the phase detection is established from the same data and error samples used for the DFE adaptation, both the CDR and adaption problems are simultaneously solved without any additional circuits [16]. Although all of these referenced techniques have been proposed to minimize power consumption, they have rarely been combined effectively in recently reported works. For lower power, this paper presents a quarter-rate RX with a baud-rate CDR and an adaptive quarter-rate 2-tap DFE. Both the soft-decision technique and a new dynamic circuit structure similar to the charge-steering technique are implemented in the DFE to achieve a power efficiency of 0.24 mW/Gb/s. The power efficiency of the entire RX is less than 2 mW/Gb/s. In addition, to obtain a more efficient combination of these techniques, a novel hybrid alternate clock scheme (HACS) [18] is also proposed in this RX.

This paper is organized as follows. Section II describes the TX architecture and design considerations. Section III presents the RX implementation including the overall architecture and the circuit techniques. The measurement results as well as the performance comparisons are shown in Section IV. Section V states the conclusion.

Fig. 2. Proposed quarter-rate TX architecture.

II. TRANSMITTER

A. Overall Architecture

Fig. 2 presents the proposed quarter-rate TX architecture with the divider-less clock generation. The input clock has a quarter-rate frequency, i.e., 6.25 GHz when the output data rate is 25 Gb/s. Unlike the recently reported quarter-rate TX, this TX uses the 2:1 MUX rather than the direct 4:1 MUX to achieve the data serialization, which means that it does not need a generator for the multi-phase quarter-rate clocks. Moreover, the 2:1 MUX is implemented using a CMOS transmission-gate structure [10] with no static power consumption, which is also more stable than the 4:1 MUX. Compared with the traditional half-rate TX architecture, the half-rate clock used for the last-stage serialization is generated by the proposed divider-less clock generation circuit, so that a large number of clock buffers used for matching the delay as well as the divider circuit are completely removed.

The implementation of the divider-less clock generation is similar to the design of a DLL [19]. The input quarter-rate clock is delayed by a voltage controlled delay line (VCDL), which operates at quarter-rate to minimize power consumption. The phase difference between the input clock and the delayed clock is detected by an XOR-based phase detector (PD), and the detection results are used by a voltage-to-current (V/I) converter to adaptively adjust the delay of the VCDL. When the phase delay is adjusted to 90°, the loop is locked. At the same time, the output of the PD is just a half-rate clock meeting the requirement of the last-stage serialization. This divider-less clock generation circuit can save significant power compared to the conventional clock path. Its potential drawbacks are the risk of false locking and duty cycle distortion, so measures are taken in the circuit design to address these problems. Two identical duty-cycle correctors (DCC) [10] based on the phase blending technique [19] are also adopted in the clock path to correct the duty cycle of both the quarter-rate clock and the half-rate clock.
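As an illustration of the locking condition described above, the following Python sketch XORs an ideal 50% duty-cycle square clock with a delayed copy of itself. It is a behavioral toy model under stated assumptions, not the circuit of Fig. 2: the waveforms are ideal, the XOR gate has no delay of its own, and the names `xor_pd_output`, `duty_cycle`, and `rising_edges` are hypothetical helpers introduced here. Time is normalized so that one input-clock period is 360 samples, i.e., one sample per degree of phase.

```python
import numpy as np

SAMPLES_PER_PERIOD = 360  # assumption: 1 sample = 1 degree of the input clock


def xor_pd_output(delay_deg, periods=4):
    """XOR of an ideal square clock and a copy delayed by delay_deg degrees."""
    n = np.arange(SAMPLES_PER_PERIOD * periods)
    phase = n % SAMPLES_PER_PERIOD
    clk = phase < SAMPLES_PER_PERIOD // 2      # 50% duty-cycle input clock
    shift = int(round(delay_deg))              # delay expressed in samples
    delayed = np.roll(clk, shift)              # periodic signal: circular shift = delay
    return np.logical_xor(clk, delayed)


def duty_cycle(wave):
    """Fraction of time the waveform is high."""
    return float(wave.mean())


def rising_edges(wave):
    """Number of 0 -> 1 transitions, treating the record as periodic."""
    return int(np.sum(~np.roll(wave, 1) & wave))


if __name__ == "__main__":
    for delay in (60, 90, 120):
        out = xor_pd_output(delay)
        print(delay, duty_cycle(out), rising_edges(out))
```

In this model a 90° delay gives an output that rises twice per input period with a 50% duty cycle, i.e., a clock at twice the input frequency (12.5 GHz for a 6.25 GHz input). Other delays leave the frequency doubled but move the duty cycle away from 50%, which is consistent with the duty-cycle concerns raised in the text.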
Another significant benefit of introducing this divider-less clock generation circuit is that the data path can also be greatly simplified. As shown in Fig. 2, the 4:2 MUX and 2:1 MUX have only 3 latches in total, whereas the conventional architecture presented in Fig. 1 requires 15 latches, which indicates that more than 50% of the data-path power is saved. Because the outputs of the pseudo-random-bit-sequence (PRBS) generator are already retimed by the quarter-rate clock, the first-stage retiming latches can be eliminated. The second-stage retiming latches can also be removed as a result of the adoption of the divider-less clock generation. The timing diagram when the loop is locked is shown in Fig. 3, where $t_{C-Q}$ is the clock-to-output delay of the first-stage selector. The phase difference between the quarter-rate clock and the half-rate clock is $t_D$, which is just the delay time of an XOR PD. The hold time of the second-stage latch is $t_H$, which is expressed as

$$t_H = t_{C-Q} - t_D. \tag{1}$$

Table I shows the post-layout simulation results of these 3 timing parameters across different process and temperature corners. As shown, $t_H$ varies from 5.7 ps to 9.5 ps, which is sufficient for the timing constraint of the 25 Gb/s data transmission, where 1 unit interval (UI) is equal to 40 ps. Moreover, because of the decreased load, the power consumption of the clock path is further reduced.

Fig. 3. Timing diagram when loop locked.

TABLE I
SIMULATION RESULTS OF TIMING PARAMETERS

B. Divider-Less Clock Generation

The proposed divider-less clock generation circuit consists of a VCDL, an XOR-based PD, and a V/I converter. To eliminate harmonic locking or false locking in this loop, the delay time of the VCDL is designed within a limited range. The VCDL, whose schematic is shown in Fig. 4, is realized by cascading 3 identical delay cells based on the differential buffer delay stage with a cross-coupled PMOS pair and symmetric loads. Fig. 5 presents the simulation results of the VCDL delay across process and temperature variations. When the control voltage is swept from 0 V to 1 V, the simulated time delay and phase delay of the VCDL at 6.25 GHz range from 25.6 ps to 68.7 ps and from 57.6° to 154.6°, respectively. This delay range ensures that 90° is the unique phase-locked point.

Fig. 4. Schematic of VCDL.

Fig. 5. Simulated VCDL delay versus control voltage across corners.

Fig. 6. Schematic of V/I converter.

Fig. 6 shows the schematic of the V/I converter with a start-up circuit. The start-up circuit sets the initial control voltage to 0 V, which also helps to avoid the false locking condition. Note that the current asymmetry of the up and down paths in the V/I converter may lead to duty cycle distortion of the half-rate clock, which will influence the jitter performance of the TX output. Fig. 7 depicts the simulated duty cycle of the PD's output across FF, TT, and SS process corners versus temperature (swept from −40 °C to 85 °C), when the gate voltage of the tail-current transistor VB is fixed to 580 mV. It can be seen that the duty cycle is not sensitive to temperature. However, when the process corner is changed, the duty cycle
range will deviate from around 50%. Thus, a simple calibration approach is adopted to alleviate this issue. In this design, VB is not fixed and can be calibrated through an off-chip digital-to-analog converter (DAC). Fig. 8 shows the simulated duty cycle of the PD's output across FF, TT, and SS process corners versus temperature, when VB is set to 700 mV, 580 mV, and 500 mV, respectively. The duty cycle across extreme variations varies from 48.2% to 50.9%. This deviation range is within 3%, so that the following DCC circuit does not need to deal with large duty cycle distortions.

Fig. 7. Simulated duty cycle of PD's output versus temperature across corners when VB = 580 mV.

Fig. 8. Simulated duty cycle of PD's output versus temperature across corners with different VB.

C. SST Driver With FFE and XTC

The implementation of the SST driver with a 2-tap FFE and a far-end XTC is shown in Fig. 9. It consists of 25 parallel-connected SST slice units needed to achieve the impedance matching of 50 Ω. Each slice unit includes a pre-driver employing the CMOS transmission-gate structure and a conventional output stage. The MOS to poly resistance ratio in the output stage is set to 1:4. These 25 units are divided into three parts based on their function. The first part has 4 units, which are normally connected to the main tap of the FFE. Moreover, each of these 4 units can be disconnected from the SST driver's output to cover the process variations and deviations in the matched impedance. Table II presents the simulation results of the termination impedance across the process and temperature variations. By choosing the total number of the parallel units, the termination impedance can be calibrated to approximately 50 Ω across different corners. The second part has 15 SST slice units that belong to 4 groups with weights of 8×, 4×, 2×, and 1×, respectively, which are controlled by a 4-bit signal T<3:0> to select the connection to the main tap or the post tap. The remaining 6 units make up the XTC, the principle of which is similar to that described by [11]. The XTC is calibrated by a 2-bit signal X<1:0>. When X = 1, the units are connected to the main tap of the victim lane, meaning that the XTC is disabled. When X = 0, the units are connected to both the main tap and the post tap of the adjacent aggressor lane to compensate for the FEXT.

Fig. 9. Implementation of SST driver with 2-tap FFE and XTC.

TABLE II
SIMULATION RESULTS OF TERMINATION IMPEDANCE

The equalization value of the FFE can be represented as

$$EQ_{FFE} = -20 \log_{10}\left(\frac{M - P}{K}\right) \tag{2}$$

where M and P are the numbers of slice units connected to the main tap and the post-cursor tap, respectively, and K = M + P. Assuming that the XTC is disabled and the four units of the first part are all connected, K = 25. When T<3:0> = 0111, M = 17 and P = 8, so the equalization value calculated by (2) is 8.9 dB. The simulation environment setup is shown in Fig. 10. Ideal capacitors and an inductor are used to mimic the output pads and the bonding wire. A 30 cm RLGC transmission line model is utilized as a lossy channel. The termination at the RX side is connected to the high power rail. Fig. 11 shows the simulated eye diagrams before and after the channel when the FFE is disabled (T<3:0> = 1111). The eye after the channel is completely closed without the effect of the FFE. Fig. 12 presents the eye diagrams at the same points when T<3:0> = 0111. The eye after the channel is largely open with a jitter of 5.7 ps.

Fig. 10. Simulation environment setup.
Fig. 11. Simulated eye diagrams when T<3:0> = 1111. (a) TX's output at point A when the channel is disconnected. (b) TX's output after the channel at the receiver side B.

Fig. 12. Simulated eye diagrams when T<3:0> = 0111. (a) TX's output at point A when the channel is disconnected. (b) TX's output after the channel at the receiver side B.

III. RECEIVER

A. Overall Architecture

Fig. 13 shows the overall block diagram of the proposed quarter-rate RX, which also includes a data path and a clock path. The data path consists primarily of a continuous-time linear equalizer (CTLE), a quarter-rate DFE and error sampler, a DFE adaption logic with 3 current digital-to-analog converters (IDACs), and a baud-rate CDR logic. Thanks to the baud-rate CDR, the edge samplers used in the common oversampling CDR are saved, which also simplifies the phase interpolator (PI) based clock path. As shown in Fig. 1, the traditional quarter-rate RX with the DFE adaption requires the PI to generate 8-phase clocks. However, the PI in Fig. 13 only needs to generate 4-phase clocks, so its power is reduced considerably. An additional 25%-duty-cycle clock generator is necessary in the clock path to realize the HACS with little penalty in power and area. The details of the proposed HACS will be described in the following paragraphs.

Fig. 13. Proposed quarter-rate RX architecture.

B. CTLE

The implementation of the CTLE realized with RC degeneration and inductive peaking is shown in Fig. 14. The variable resistor consists of an NMOS transistor and a poly resistor. By controlling the gate voltage VG of the transistor, the CTLE's boost gain at high frequency can be continuously calibrated. Inductive peaking is adopted to enlarge the boost gain without increasing the power. Fig. 15 shows the simulated frequency response of the proposed CTLE and the traditional CTLE with no inductive peaking. When VG changes from 1 V to 400 mV, the traditional CTLE's boost gain at 12.5 GHz varies from 0.6 dB to 11 dB. By introducing the inductive peaking, the boost gain ranges from 5.5 dB to 16 dB, which means that an extra 5 dB of compensation for the channel loss can be provided.

Fig. 14. Schematic of proposed CTLE with inductive peaking.

Fig. 15. Frequency response of CTLE. (a) Without inductive peaking. (b) With inductive peaking.

C. Low Power DFE

Fig. 16 presents the implementation of the low-power quarter-rate soft-decision DFE with 4 quarter-lanes (Quarter_Lane_0–3). In each quarter-lane, data after the equalization of the CTLE is first sampled by the sample-and-hold (S/H)
circuit. The quarter-rate clock used for the S/H has a 25% duty cycle, which allows the S/H to hold the data for 3 UIs. The following summer block, which employs a new dynamic structure, combines the incoming data with 2 feedback taps. The data flow after the summer is divided into 2 branches, the data sampler lane (DSL) and the error sampler lane (ESL). The DSL branch includes 2-stage dynamic latches and a CMOS SR latch. The first-stage dynamic latch behaves as a slicer to produce the feedback return-to-zero (RZ) data rather than the common non-return-to-zero (NRZ) data. Then the second-stage dynamic latch and the SR latch make up an RZ-to-NRZ converter. The ESL is similar to the DSL; the only difference is that an error generator takes the place of the first-stage latch. The error generator [18] is also designed based on the same dynamic structure as the summer of the DFE. The 2 reference levels (VH and VL) are input and calibrated by the off-chip DACs. On account of the soft-decision technique, the taps for each quarter-lane are obtained from the other 2 quarter-lanes, which means that only 1 slicer is required in each DSL branch. Accordingly, the proposed DFE has a smaller hardware cost than [15] and [20], which saves power.

Fig. 16. Implementation of quarter-rate DFE and error sampler.

The schematic of the proposed dynamic summer block is shown in Fig. 17, which is implemented to eliminate static power consumption thoroughly. The principle of the dynamic summer is similar to the current-integrating summer [14], but the clock-controlled NMOS switches (M1–M3) are added above the tail bias transistors (M4–M7). During the reset phase, the NMOS switches are off, so that the dynamic summer has no static power consumption. The dynamic summer also exhibits 3 differences compared with the charge-steering structure in [15]. First, the tail transistors are responsible for adjusting the tap coefficients instead of the tail capacitors [15], which saves a large chip area and is also convenient for the DFE adaption. Second, a switch transistor (M8) is added to equalize the output nodes (A_N+ and A_N−) in the reset phase. Third, the main tap is controlled by both Vmain and Vconst; the former is created by the adaption logic whereas the latter is fixed. To guarantee the amplitude range and allow slight adjustment, the transistor size ratio between M4 and M5 is set to 1:3.

Fig. 17. Schematic of dynamic summer block.

As presented in Fig. 18(a), the first-stage dynamic latch, which is used as the first-stage slicer, is similar to the summer block. The second-stage latch (Fig. 18(b)) has a complete clock-controlled dynamic structure, whose output can be driven to the full swing. This structure is therefore employed in the RZ-to-NRZ converter with a CMOS SR latch.

Fig. 18. Schematic of latch. (a) First-stage latch. (b) Second-stage latch.

Fig. 19. Timing diagram of HACS and simulated eye diagrams.

D. Proposed HACS

Note that, as the feedback tap, the output of the first-stage dynamic latch is in the RZ form. How to use the RZ form to achieve the soft-decision is a new design challenge. To realize the combination of soft-decision and dynamic sampling, we propose a hybrid alternate clock scheme. As shown in Fig. 16, there are two sets of clocks in the DFE and error samplers, CLK_N with a 25% duty cycle and CK_N with a 50% duty cycle. The timing diagram of the HACS is reported in Fig. 19. Taking Quarter_Lane_0 for example, CLK_0 controls the S/H to sample the input data. When CLK_0 is low, the summer block is active. The first-stage latch outputs in Quarter_Lane_3 and Quarter_Lane_2 are also generated during this phase; they behave as the first tap and the second tap with three quarters of the effective value. The output of the summer A_0n is sampled just in time by the first-stage latch using the 90° shifted
clock CLK_1'. Finally, the second-stage latch uses the 50%-duty-cycle CK_2 to amplify the signal. These two sets of clocks alternate from Quarter_Lane_0 to Quarter_Lane_3. The proposed HACS provides 2 merits. First, it ensures an accurate clock phase and adequate timing margin for the soft-decision architecture. Second, it balances the load of the different clocks and saves power in the clock distribution. The simulated output eye diagrams of each part are also presented in Fig. 19, in which both the eye height and the width of effective value are indicated.

E. Baud-Rate CDR and DFE Adaption

Fig. 20 shows the block diagram of the CDR digital logic, which includes a PD, a majority voter, a bandwidth controller, and a digital filter. The output of the digital filter controls the PI to generate 4-phase clocks. The 2-bit external signal BW_CTRL controls the bandwidth controller to calibrate the bandwidth of the CDR loop. The DFE adaption logic (Fig. 21) is designed based on the sign-sign least-mean-square (SS-LMS) algorithm, and its structure is similar to the CDR logic. The differences are that the SS-LMS logic takes the place of the PD, and that an amplitude limiter is added at the end to keep the tap coefficient of the DFE within a reasonable range. A separate 2-bit external signal controls the bandwidth of the DFE adaption loop.

Fig. 20. Block diagram of CDR logic.

Fig. 21. Block diagram of DFE adaption logic.

Note that the error signals actually required in the baud-rate CDR logic ($E$) and the SS-LMS logic ($E'$) differ slightly; their definitions are shown in Fig. 22(a) and Fig. 22(b), respectively. The relationship between $E$ and $E'$ is given by

$$E' = -(D \cdot E). \tag{3}$$

Fig. 22. Definition of error. (a) Error in CDR. (b) Error in DFE adaption.

The principles of the PD in the baud-rate CDR [16] and the SS-LMS logic in the DFE adaption are also similar, and can be described as

$$\Delta P_n = D_{n-1} \cdot D_n \cdot (E_n - E_{n-1}) \tag{4}$$

$$\Delta H(k)_n = -(D_{n-k} \cdot E'_n) = D_{n-k} \cdot D_n \cdot E_n \tag{5}$$

where $\Delta P_n$ is the phase error factor for the baud-rate CDR, and $\Delta H(k)_n$ is the coefficient error factor of tap $k$ for the DFE adaption. According to (4) and (5), the implementations of the PD and the SS-LMS logic are shown in Fig. 23(a) and (b), respectively. In the design of the PD, 2 additional XORs are employed to accelerate the process of the phase detection.

Fig. 23. (a) Implementation of PD. (b) Implementation of SS-LMS logic.

IV. MEASUREMENTS

Both the TX chip and the RX chip are fabricated in a 65 nm standard CMOS technology with 9 metal layers. The chips are directly wire-bonded to the printed circuit board (PCB) for the measurements.

A. Transmitter Chip

The TX die micrograph is shown in Fig. 24. To support a multi-channel application and test the XTC function, this TX prototype is designed with 4 data-channels, 2 PRBS generators, and a clock path. The output of each data-channel is single-ended. The PRBS generator is implemented on the basis of the parallel structure, which generates 8-lane PRBS-7 data with different phases to make sure that the output of each data-channel still has the PRBS-7 pattern. The whole chip measures 1.01 mm × 0.69 mm including the decoupling capacitors and I/O pads. The active area is 0.38 mm × 0.25 mm.

Fig. 24. TX die micrograph.

Fig. 25 depicts the simulated power breakdown for the TX. The 4-channel TX consumes a total power of 87 mW from a supply of 1.2 V when working at 25 Gb/s, which means
that the average power consumption for each channel is less than 22 mW. The proposed divider-less clock generation circuit (including VCDL, PD, and V/I) consumes only 4.7 mW.

Fig. 25. TX power breakdown.

As shown in Fig. 26(a), 2-inch coupled micro-strip lines with a spacing of 30 mil are used as the test channels. Fig. 26(b) shows that the measured channel loss at 12.5 GHz is 8.9 dB. A Keysight DSOX93204A Infiniium oscilloscope with 32 GHz bandwidth is used to measure the eye diagrams and the jitter decomposition. Fig. 26(b) also shows the measured 20 Gb/s eye diagram of the TX output without the FFE and XTC. The eye is completely closed. Fig. 27(a) presents the measured eye diagram of the 20 Gb/s TX output when the FFE is enabled but the XTC is disabled. The total jitter (TJ) is 36 ps and the eye height is 85 mV. The data dependent jitter (DDJ) is 19.4 ps. The measured eye diagram of the 20 Gb/s TX output when the FFE and XTC are both active is shown in Fig. 27(b). The TJ and eye height are improved by 8.2 ps and 95 mV, respectively. The DDJ is reduced by 7.2 ps because of the proposed XTC. In this case, T<3:0> = 1000 and X<1:0> = 01.

Fig. 26. (a) Testing PCB. (b) Measured channel loss and eye diagram of 20 Gb/s single-ended TX output without FFE and XTC.

Fig. 27. (a) Measured eye diagram of 20 Gb/s single-ended TX output with FFE but without XTC. (b) Measured eye diagram of 20 Gb/s single-ended TX output with both FFE and XTC.

The measured 22 Gb/s eye diagrams are presented in Fig. 28(a) and (b). When the VB of the V/I converter is 500 mV (Fig. 28(a)), the TJ is 31.3 ps with 1.57 ps of duty cycle distortion (DCD). When VB is calibrated to 600 mV (Fig. 28(b)), the DCD is decreased to only 40 fs and the TJ is also improved by 0.8 ps. This TX is able to operate up to 25 Gb/s. Fig. 29(a) and (b) show the eye diagrams at 24 Gb/s and 25 Gb/s, respectively. Due to the single-ended output structure and the more serious channel loss and crosstalk, neither the timing margin nor the voltage margin is ideal, even though the measured bit-error-rate (BER) is still smaller than 1e-12. With a differential output structure and a less severe channel, the eye diagram at 25 Gb/s would be clearly improved.

Fig. 28. Measured eye diagrams of 22 Gb/s single-ended TX output with both FFE and XTC. (a) VB = 500 mV. (b) VB = 600 mV.

Fig. 29. (a) Measured eye diagram of 24 Gb/s single-ended TX output with both FFE and XTC. (b) Measured eye diagram of 25 Gb/s single-ended TX output with both FFE and XTC.

B. Receiver Chip

Fig. 30 shows the RX die micrograph, which measures 1.512 mm × 1.276 mm including all the test circuits, decoupling MOS capacitors, and I/O pads. The active area is 0.52 mm × 0.35 mm. Fig. 31 is the simulated power breakdown when the RX operates at 25 Gb/s. The total power consumption is 48 mW, of which the low-power DFE draws only 6 mW; thus the power efficiency of the DFE is 0.24 mW/Gb/s. A Keysight E5071C ENA network analyzer is used to measure the channel loss, and a Tektronix BSA286C 28.6 Gb/s BERTScope is used to test the BER performance.

As presented in Fig. 32, the test channel including the bonding wire, PCB trace, SMA connector, and cable has a loss of 33.8 dB at the Nyquist frequency (12.5 GHz). The eye of the received 25 Gb/s PRBS-7 data after this channel is completely
closed. Fig. 33 shows the eye diagrams of the 3.125 GHz recovered clock and the 6.25 Gb/s demuxed data. The measured TJ of the recovered clock is 23 ps, with a random RMS jitter of 1 ps. The TJ of the 6.25 Gb/s demuxed data is 52 ps.

Fig. 30. RX die micrograph.

Fig. 31. RX power breakdown.

Fig. 32. Measured channel loss and eye diagram of received 25 Gb/s data after the channel.

Fig. 33. (a) Measured eye diagram of the recovered clock at 3.125 GHz. (b) Measured eye diagram of the demuxed data at 6.25 Gb/s.

The RX performance is also demonstrated by the jitter tolerance test at 25 Gb/s as shown in Fig. 34, along with the CEI-25G/28G JTOL mask [21]. Measured at the BER threshold of 1e-9, the out-of-band jitter tolerance at 100 MHz is 0.22 UI. The BER bathtub curves are measured after turning off the CDR logic. As indicated in Fig. 35, at first the DFE is disabled and the CTLE boost is set to its maximum (VG = 400 mV). The measured BER is above 1e-8. Then the adaptive DFE is enabled, so that the eye opening reaches 0.42 UI for BER = 1e-12.

C. Performance Comparison

Table III summarizes the measured performance of the TX described in this paper, in comparison to some recently published similar designs [5]–[8]. The one-channel power consumption of the TX working at 25 Gb/s is 21.8 mW, and the power efficiency is reduced by at least 17% compared with the other state-of-the-art designs. Additionally, the FFE and XTC are merged together with the SST driver in this work to compensate for both the ISI and FEXT. Table IV is the RX performance summary and comparison with other similar designs [14]–[16], [20], [22], which indicates that the RX implemented in this paper can compensate for the most serious channel loss. The eye opening and the DFE power efficiency

Fig. 34. Measured jitter tolerance result.

Fig. 35. Measured BER bathtub curve.
are also comparable to the others. The power efficiency of the
entire RX is 1.92 mW/Gb/s, an improvement of at least 47%
over the other state-of-the-art designs.

TABLE III
TX PERFORMANCE SUMMARY AND COMPARISON

TABLE IV
RX PERFORMANCE SUMMARY AND COMPARISON

V. CONCLUSION

In this paper, we have demonstrated a 25 Gb/s quarter-rate
SerDes transmitter and receiver chipset fabricated in a 65 nm
CMOS technology. Several novel structures and clock schemes
are developed to achieve low power and strong equalization
capability. The TX and RX chipset consumes approximately
70 mW in total (TX: 21.8 mW, RX: 48 mW) at 25 Gb/s while
compensating for more than 40 dB of channel loss (TX: 8.9 dB,
RX: 33.8 dB). The power efficiency and equalization performance
are better than those of the reported state-of-the-art works.

REFERENCES

[1] CISCO VNI forecast highlights. [Online]. Available: http://www.cisco.com/c/en/us/solutions/service-provider/visual-networking-index-vni/vni-forecast.html
[2] H. Kimura et al., “A 28 Gb/s 560 mW multi-standard SerDes with single-stage analog front-end and 14-tap decision feedback equalizer in 28 nm CMOS,” IEEE J. Solid-State Circuits, vol. 49, no. 12, pp. 3091–3103, Dec. 2014.
[3] J. Bulzacchelli et al., “A 28-Gb/s 4-tap FFE/15-tap DFE serial link transceiver in 32-nm SOI CMOS technology,” IEEE J. Solid-State Circuits, vol. 47, no. 12, pp. 3232–3248, Dec. 2012.
[4] B. Zhang et al., “A 28 Gb/s multistandard serial link transceiver for backplane applications in 28 nm CMOS,” IEEE J. Solid-State Circuits, vol. 50, no. 12, pp. 3089–3100, Dec. 2015.
[5] C. Menolfi et al., “A 28 Gb/s source-series terminated TX in 32 nm CMOS SOI,” in IEEE ISSCC Dig. Tech. Papers, 2012, pp. 334–336.
[6] Y. H. Song et al., “An 8-to-16 Gb/s 0.65-to-1.05 pJ/b 2-tap impedance-modulated voltage-mode transmitter with fast power-state transitioning in 65 nm CMOS,” in IEEE ISSCC Dig. Tech. Papers, 2014, pp. 446–447.
[7] H. Wang et al., “A 21-Gb/s 87-mW transceiver with FFE/DFE/analog equalizer in 65-nm CMOS technology,” IEEE J. Solid-State Circuits, vol. 45, no. 4, pp. 909–920, Apr. 2010.
[8] J. Kim et al., “A 16-to-40 Gb/s quarter-rate NRZ/PAM4 dual-mode transmitter in 14 nm CMOS,” in IEEE ISSCC Dig. Tech. Papers, 2015, pp. 60–61.
[9] A. A. Hafez et al., “A 32–48 Gb/s serializing transmitter using multiphase serialization in 65 nm CMOS technology,” IEEE J. Solid-State Circuits, vol. 50, no. 3, pp. 763–775, Mar. 2015.
[10] S. Yuan et al., “A 4 × 20-Gb/s 0.86 pJ/b/lane 2-tap-FFE source-series-terminated transmitter with far-end crosstalk cancellation and divider-less clock generation in 65 nm CMOS,” in Proc. IEEE Custom Integr. Circuits Conf. (CICC), 2015, pp. 1–4.
[11] S. Y. Kao et al., “A 7.5-Gb/s one-tap-FFE transmitter with adaptive far-end crosstalk cancellation using duty cycle detection,” IEEE J. Solid-State Circuits, vol. 48, no. 2, pp. 391–404, Feb. 2013.
[12] H. K. Jung et al., “A slew-rate controlled transmitter to compensate for the crosstalk-induced jitter of coupled microstrip lines,” in Proc. IEEE Custom Integr. Circuits Conf. (CICC), 2010, pp. 1–4.
[13] K. L. J. Wong et al., “A 5-mW 6-Gb/s quarter-rate sampling receiver with a 2-tap DFE using soft decisions,” IEEE J. Solid-State Circuits, vol. 42, no. 4, pp. 881–888, Apr. 2007.
[14] A. Agrawal et al., “A 19 Gb/s serial link receiver with both 4-tap FFE and 5-tap DFE functions in 45 nm SOI CMOS,” in IEEE ISSCC Dig. Tech. Papers, 2012, pp. 134–135.
[15] J. W. Jung and B. Razavi, “A 25 Gb/s 5.8 mW CMOS equalizer,” in IEEE ISSCC Dig. Tech. Papers, 2014, pp. 44–45.
[16] F. Spagna et al., “A 78 mW 11.8 Gb/s serial link transceiver with adaptive RX equalization and baud-rate CDR in 32 nm CMOS,” in IEEE ISSCC Dig. Tech. Papers, 2010, pp. 366–367.
[17] P. A. Francese et al., “A 16 Gb/s 3.7 mW/Gb/s 8-tap DFE receiver and baud-rate CDR with 31 kppm tracking bandwidth,” IEEE J. Solid-State Circuits, vol. 49, no. 11, pp. 2490–2502, Nov. 2014.
[18] S. Yuan et al., “A 48 mW 15-to-28 Gb/s source-synchronous receiver with adaptive DFE using hybrid alternate clock scheme and baud-rate CDR in 65 nm CMOS,” in Proc. Eur. Solid-State Circuits Conf. (ESSCIRC), 2015, pp. 144–147.
[19] J. Y. Chang et al., “A 15–20 GHz delay-locked loop in 90 nm CMOS technology,” in Proc. IEEE Asian Solid-State Circuits Conf. (ASSCC), 2008, pp. 213–216.
[20] R. Bai et al., “A 0.25 pJ/b 0.7 V 16 Gb/s 3-tap decision-feedback equalizer in 65 nm CMOS,” in IEEE ISSCC Dig. Tech. Papers, 2014, pp. 46–47.
[21] Optical Internetworking Forum, “IA title: Common electrical I/O (CEI) - Electrical and jitter interoperability agreements for 6G+ bps, 11G+ bps and 25G+ bps I/O, section 13.3.11.2.1.” [Online]. Available: http://www.oiforum.com/public/documents/OIF_CEI_03.1.pdf
[22] Y. Doi et al., “32 Gb/s data-interpolator receiver with 2-tap DFE in 28 nm CMOS,” in IEEE ISSCC Dig. Tech. Papers, 2013, pp. 36–37.

Shuai Yuan (S’15) was born in Jilin, China. He
received the B.S. degree from the Department
of Electronic Engineering, Tsinghua University,
Beijing, China, in 2011. He is now working toward
the Ph.D. degree at the Institute of Microelectronics,
Tsinghua University, China. His research interests
include high-speed wireline transceivers and
low-power equalizers.
Liji Wu (M’95) received the B.S., M.S., and Ph.D.
degrees in electronic engineering from Tsinghua
University, Beijing, China, in 1988, 1991, and 1997,
respectively.
From April 1997 to May 2000, he worked in the
Center for Advanced Technology in Telecommunications,
Polytechnic Institute of New York University
(NYU-Poly), Brooklyn, NY, USA, as a Postdoctoral
Fellow, working on the design and implementation of
high-speed control circuits and systems utilized in
WDM ATM multicast optical switching systems
sponsored by DARPA. He then worked in high-tech industry in the U.S. for
more than 4 years, including TyCom Laboratories (former AT&T Bell Labs on
Undersea Optical Fiber Communications), Eatontown, NJ, USA, as a Senior
Member of Technical Staff. He received the Tsinghua University Outstanding
Graduate Award and Medal in 1988. He joined Tsinghua University as a
full-time faculty member in 2005. He has been a Board Member of the Shanghai
Pudong Science & Technology Association, Shanghai, China, since 2006.

Ziqiang Wang was born in 1975 in Beijing, China.
He received the B.S. and Ph.D. degrees from the
Department of Electronic Engineering, Tsinghua
University, Beijing, China, in 1999 and 2006, respectively.
He worked as a Research Assistant in the Institute
of Microelectronics, Tsinghua University, after
receiving his doctorate. Since 2015, he has been an Associate
Professor of the Institute of Microelectronics. His
research interest is analog circuit design.

Xuqiang Zheng received the B.S. and M.S. degrees
from Central South University, Hunan, China. He is
now working as an engineer at Tsinghua University,
China. His research interests are high-speed SerDes
design and analog-to-digital converters.

Chun Zhang was born in 1972 in Jilin province,
China. He received the B.S. and Ph.D. degrees from
the Department of Electronic Engineering, Tsinghua
University, Beijing, China, in 1995 and 2000, respectively.
He has worked at Tsinghua University since
2000, in the Department of Electronic Engineering
from 2000 to 2004. Since 2005,
he has been an Associate Professor of the Institute of
Microelectronics. His research interests include mixed-signal
integrated circuits and systems, embedded
microprocessor design, digital signal processing, and
radio frequency identification.

Zhihua Wang (M’99–SM’04) received the B.S.,
M.S., and Ph.D. degrees in electronic engineering
from Tsinghua University, Beijing, China, in 1983,
1985, and 1990, respectively.
In 1983, he joined the faculty at Tsinghua University,
where he has been a full Professor since 1997 and
Deputy Director of the Institute of Microelectronics
since 2000. From 1992 to 1993, he was a visiting
scholar at Carnegie Mellon University. From 1993 to
1994, he was a Visiting Researcher at KU Leuven,
Belgium. His current research mainly focuses on
CMOS RF IC and biomedical applications. His ongoing work includes RFID,
PLL, low-power wireless transceivers, and smart clinic equipment with a
combination of leading-edge CMOS RFIC and digital imaging processing techniques.
Prof. Wang has served as Deputy Chairman of the Beijing Semiconductor
Industries Association and the ASIC Society of the Chinese Institute of
Communication, as well as Deputy Secretary General of the Integrated Circuit
Society in the China Semiconductor Industries Association. He was one of the chief
scientists of the China Ministry of Science and Technology, serving on the
expert committee of the National High Technology Research and Development
Program of China (863 Program) in the area of information science and
technologies from 2007 to 2011. He was an official member of the China
Committee for the Union Radio-Scientifique Internationale (URSI) from 2000
to 2010. He was the chairman of the IEEE Solid-State Circuits Society Beijing
Chapter from 1999 to 2009. He was a Technical Program Committee
member of the ISSCC (International Solid-State Circuits Conference) from 2005
to 2011. He is an Associate Editor for IEEE TRANSACTIONS ON BIOMEDICAL
CIRCUITS AND SYSTEMS and IEEE TRANSACTIONS ON CIRCUITS AND
SYSTEMS II: EXPRESS BRIEFS.
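
As a quick cross-check of the headline figures quoted in the performance comparison and the conclusion, the arithmetic can be sketched in a few lines of Python. This is only an illustrative sketch: every constant is a value reported in the text above, while the helper name `efficiency_mw_per_gbps` and the variable names are assumptions introduced here and do not come from the paper.

```python
# Cross-check of figures quoted in the text. All constants are values
# reported in the paper; the names are illustrative assumptions.

RATE_GBPS = 25.0        # line rate
TX_POWER_MW = 21.8      # one-channel TX power at 25 Gb/s
RX_POWER_MW = 48.0      # RX power at 25 Gb/s
TX_EQ_DB = 8.9          # loss compensated by the TX (conclusion)
RX_EQ_DB = 33.8         # Nyquist channel loss compensated by the RX
EYE_OPENING_UI = 0.42   # horizontal eye opening at BER = 1e-12


def efficiency_mw_per_gbps(power_mw: float, rate_gbps: float) -> float:
    """Power per unit data rate; 1 mW/Gb/s is the same as 1 pJ/bit."""
    return power_mw / rate_gbps


ui_ps = 1e3 / RATE_GBPS               # unit interval in picoseconds
nyquist_ghz = RATE_GBPS / 2.0         # Nyquist frequency of NRZ data
total_power_mw = TX_POWER_MW + RX_POWER_MW
total_eq_db = TX_EQ_DB + RX_EQ_DB
tx_eff = efficiency_mw_per_gbps(TX_POWER_MW, RATE_GBPS)
rx_eff = efficiency_mw_per_gbps(RX_POWER_MW, RATE_GBPS)
eye_opening_ps = EYE_OPENING_UI * ui_ps

print(f"UI = {ui_ps:.1f} ps, Nyquist = {nyquist_ghz:.2f} GHz")
print(f"Total power = {total_power_mw:.1f} mW")
print(f"Total equalization = {total_eq_db:.1f} dB")
print(f"TX efficiency = {tx_eff:.3f} mW/Gb/s")
print(f"RX efficiency = {rx_eff:.2f} mW/Gb/s")
print(f"Eye opening = {eye_opening_ps:.1f} ps")
```

Under these inputs the RX figure reproduces the 1.92 mW/Gb/s stated in the comparison, the summed power (69.8 mW) agrees with the "approximately 70 mW" in the conclusion, and the summed equalization (42.7 dB) agrees with "more than 40 dB".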
