
A 32Gb/s NRZ 37dB SerDes in 10nm CMOS to Support PCI Express Gen 5 Protocol


Mike Bichan, Clifford Ting, Bahram Zand, Jing Wang, Ruslana Shulyzki, James Guthrie, Katya Tyshchenko,
Junhong Zhao, Alireza Parsafar, Eric Liu, Aynaz Vatankhahghadim, Shaham Sharifian, Aleksey Tyshchenko,
Michael De Vita¹, Syed Rubab, Sitaraman Iyer¹, Fulvio Spagna¹, Noam Dolev²
Mixed-Signal IP Group
Intel Corp.
Toronto, Canada
¹Santa Clara, USA
²Chandler, USA

Abstract— This paper presents the first SerDes design to
demonstrate a PCI Express Gen 5 link with area of 0.33mm² per lane,
die edge usage per lane of 285µm, dynamic junction temperature
range from -40C to 125C, energy efficiency of 11.4pJ/bit including
PLL and clocking, power management including power gating for
all analog blocks, continuous data rate support between 1-32 Gb/s,
and supporting channel topologies with insertion loss up to 37dB
at 16GHz with BER < 1e-12 in 10nm process technology.

Keywords—PCIe; SerDes; NRZ; 10nm;
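
The headline figures quoted in the abstract imply a few derived quantities that are convenient when budgeting a multi-lane port. The Python sketch below is illustrative arithmetic only and is not part of the measured results. It assumes that the 11.4pJ/bit figure applies at the full 32Gb/s rate and that per-lane area and die-edge usage scale linearly with lane count. The function name `lane_budget`, its parameters, and the dictionary keys are hypothetical names introduced for this example.

```python
def lane_budget(rate_gbps=32.0, energy_pj_per_bit=11.4,
                area_mm2=0.33, edge_um=285.0, lanes=16):
    """Derive per-lane and per-port figures from the abstract's headline numbers.

    Assumption (not from the paper): power, area, and die edge scale
    linearly with the number of lanes.
    """
    ui_ps = 1e3 / rate_gbps                       # unit interval in picoseconds
    nyquist_ghz = rate_gbps / 2.0                 # NRZ Nyquist frequency in GHz
    lane_power_mw = energy_pj_per_bit * rate_gbps # pJ/bit * Gb/s = mW
    return {
        "ui_ps": ui_ps,
        "nyquist_ghz": nyquist_ghz,
        "lane_power_mw": lane_power_mw,
        "port_power_w": lane_power_mw * lanes / 1e3,
        "port_area_mm2": area_mm2 * lanes,
        "port_edge_mm": edge_um * lanes / 1e3,
    }


if __name__ == "__main__":
    for name, value in lane_budget().items():
        print(f"{name}: {value:.4g}")
```

With the default arguments the Nyquist frequency evaluates to 16GHz, which is the frequency at which the 37dB insertion loss is specified, and the per-lane power evaluates to roughly 365mW.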

978-1-7281-6031-3/20/$31.00 ©2020 IEEE

I. INTRODUCTION
Growing data center bandwidth demand increases the need for fast wireline transmission. One of the key protocols used to transfer data between SoCs is PCI Express. The recently published PCI Express Gen 5 standard allows data transmission at up to 32Gb/s [1]. The practical design requirements for a PCI Express Gen 5 PHY IP go beyond protocol compliance. Other considerations include: small IP area, die-edge dimension, and power, to allow large numbers of data links in a single SoC; a large dynamic temperature range for extreme environmental conditions; advanced power management to reduce power when the link is not in use; low data path latency to improve overall system performance; and robust performance with internal adaptive equalization. This paper presents the first SerDes design to demonstrate a PCI Express Gen 5 link with area of 0.33mm² per lane, die edge usage per lane of 285µm, dynamic junction temperature range from -40C to 125C, energy efficiency of 11.4pJ/bit including PLL and clocking, power management including power gating for all analog blocks, continuous data rate support between 1-32Gb/s, and support for channel topologies with insertion loss up to 37dB at 16GHz with BER < 1e-12 in 10nm process technology [2].

Figure 1: Receiver block diagram.

Figure 2: Transmitter block diagram.

II. RECEIVER
The receiver analog front end is shown in Fig. 1. It consists of an input stage with pi-coil and 100Ω differential termination supporting a common-mode range of 0-0.7V, followed by an AC-coupled passive attenuator providing programmable 2-7dB attenuation as well as low-frequency equalization. The middle node of the pi-coil connects to ESD diodes sufficient to meet the 250V CDM and 1kV HBM requirements. The attenuator is followed by a two-stage continuous time linear equalizer (CTLE). The CTLE stages are CML amplifiers with resistive loads and passive inductor peaking. They have adjustable DC gain and zero frequency with fixed Nyquist gain. These are followed by a variable gain amplifier (VGA) that provides broadband gain control using a gm-subtraction structure. This
structure gives flat gain or attenuation in the range +6dB to -10dB. Finally, a set of four parallel broadband amplifiers with fixed 6dB gain up to 30GHz drive the decision feedback equalizer (DFE) linear summing nodes. The front end has total Nyquist gain of 16dB and 5mV-rms output-referred noise.

The CTLE is followed by an analog DFE with clock and data recovery (CDR) using a half-rate Alexander phase detector. The equalization on the data samples is provided by a 12-tap DFE while the equalization on the edge samples is provided by a 3-tap DFE. Independent DFE tap weights are allowed for each of the even and odd summing nodes. The data path feeding the feedback taps uses a quarter-rate implementation for power efficiency. This implementation is functionally equivalent to a half-rate implementation but uses 8GHz quadrature clocks to save power. The slicers use a double-tail topology [3] with 3mV-rms input-referred noise. The data summing nodes include an auxiliary slicer sampling in-phase with the data slicer at the +1 or -1 levels to enable LMS adaptation. The auxiliary slicer also enables vertical eye margining and background slicer offset correction. This correction is required to maintain continuous error-free operation over the full 165C temperature range as the slicer thresholds can shift after the initial offset calibration. Two speculative edge slicers are used in each edge summing node to shift the sampling point and better equalize the first precursor ISI term [4]. The remaining edge taps are used to reduce bang-bang jitter by reducing ISI at the edge crossings.

The adaptation logic uses a sign-sign LMS algorithm for data taps, edge taps, sampling point optimization, CTLE boost control, and VGA gain control. An on-die microcontroller is included for additional adaptation capabilities during link training. Data and edge equalization are separated so that ISI at the sampling instant and jitter at the zero crossings can be independently optimized. The sampling point optimization algorithm adjusts the skew between I and Q clocks to maximize the difference between cursor and first precursor. The CTLE boost algorithm attempts to adjust the CTLE boost until the DFE tap 2 weight is zero to avoid under- and over-equalization. The VGA gain control algorithm targets an eye height at the slicer input of ±100mV.

The CDR shown in Fig. 1 uses an Alexander phase detector where the quadrature clocks are generated by a differential four-stage CMOS ring oscillator. The oscillator has a dedicated regulator with 40dB rejection from a 1.8V supply to compensate for the inherent supply sensitivity of the CMOS ring. Additional care is taken in the oscillator biasing to minimize phase noise. This topology allows low latency and provides a CDR bandwidth of 40MHz. The RX oscillator drives a phase interpolator (PI) with ±0.25UI range for initial calibration of IQ alignment. Following the PI is a duty cycle correction circuit with ±6ps range.

III. TRANSMITTER
The TX shown in Fig. 2 is a voltage-mode design where the data path operates from a 1V supply and the clock path uses 1V regulated down from 1.8V. The final 2:1 mux is set back six stages from the final output in order to reduce power at the expense of additional supply noise sensitivity. Impedance control is achieved by disabling some of the 36 units and achieves granularity better than 5%. Equalization is supported by a 5-tap FIR filter where any four consecutive taps can be selected at once. The TX supports receiver detection, duty cycle correction, and a serializer with parallel word widths of 8/10/16/20/32/40 bits. The transmitter is able to send a 32Gb/s loopback signal to the receiver to enable debugging and high volume test with coverage over most of the analog circuitry.

Figure 3: Synthesizer block diagram.

Figure 4: LCDCO schematic, 16GS/s modulation scheme and measured phase noise in both available modes.

The clocking scheme for the transmitter is shown in Fig. 3 and consists of three PLLs which are shared across four data lanes with a half-rate clock distribution network. A complementary LC-DCO generates the 16GHz clock needed to support 32Gb/s operation. A single-ended three-stage CMOS ring oscillator generates frequencies from 4-16GHz to support all data rates
from 1-32Gb/s. When used, the ring PLL is cascaded after the LC-PLL for reference clock clean-up. The reference clock clean-up LC-PLL allows the ring PLL to be configured with high bandwidth to minimize ring oscillator phase noise without increasing the contribution of reference clock jitter to the output jitter. A differential four-stage CMOS ring oscillator is used to generate a 10GHz clock divided down to 2.5GHz to enable some lanes to operate in Gen1/2 while other lanes are in Gen3/4/5. This second ring oscillator is sized down to save power and area since phase noise requirements at Gen1/2 are much lower. Each data lane can independently choose the 2.5GHz clock or the clock generated by the other two PLLs.

To meet the 165C dynamic temperature range requirement, the LC-PLL implements a frequency tuning scheme (mode 2) that provides dynamic access to the entire coarse frequency range. This scheme employs one differential 3.5fF tuning capacitor modulated at 16GHz using the scheme shown in Fig. 4. The fast capacitor receives a 4-bit code at 1GS/s which is then decoded and serialized up to 16GS/s. The 4-bit code is synchronous to the 8-bit coarse cap code which is also updated at 1GS/s. The mapping from 4-bit code to 1-bit code is chosen to minimize worst-case quantization noise jitter over all 16 input words. Sufficient flipflops are inserted to ensure that the 80 coarse caps and the single fast cap receive code changes within 10ps of one another to minimize jitter.

A separate fine cap control mode (mode 1) is also supported in which the fast cap is disabled and the coarse code is simply calibrated upon start-up and fixed. This mode uses the same 3.5fF tuning capacitor and achieves sufficient frequency granularity by connecting a capacitor array close to the center tap of the inductor. Capacitors connected in this manner exert less influence on the oscillator frequency despite their relatively large size. One advantage of large capacitor size is lower bottom-plate capacitance for a given on-capacitance, leading to wider overall tuning range.

IV. MEASUREMENT RESULTS

Figure 5: Photograph of one of the quad macros on the die, from a wafer which had fabrication halted after M3. The full test chip including a 16-lane subsystem (top) and an additional quad (bottom) is shown at right.

Figure 6: Measurement results for 32Gb/s: TX eye diagram with PRBS31 data, channel response excluding package, stressed RX jitter tolerance with 37dB channel, and vertical bathtub plot.

Figure 7: Stressed receiver jitter tolerance measurement at 32Gb/s with 35dB channel over process, voltage, and temperature.

The SerDes was fabricated in a 10nm process as a x16 PCI Express port with four quads. A photo of one of the quads is shown in Fig. 5. Fig. 6 shows the transmitter eye diagram for 32Gb/s operation and the measured insertion loss of the channel not including 5.5dB additional package loss. Fig. 6 also shows the stressed jitter tolerance measurement for the same channel with all lanes operating at 32Gb/s, and a vertical bathtub plot showing 86mV margin at BER = 1e-12. The stressed jitter tolerance measurement applies random jitter, differential-mode noise, and common-mode noise required by the PCIe specification in addition to sinusoidal jitter. All 16 lanes are sending and receiving PRBS31 data to provide the crosstalk and noise that would be expected in normal operation. Additional jitter tolerance measurement results are shown in Fig. 7 using slow, typical, and fast parts and measuring over a ±5% voltage range at both 0C and 100C. Fig. 8 shows a summary of
transmitter compliance measurements for the common reference clock configuration.

Figure 8: Transmitter compliance measurement results.

Figure 9: Measured LCDCO coarse frequency range. The entire coarse frequency range is available dynamically in mode 2.

Fig. 9 shows the measured frequency range of the LCDCO with both extreme values of the fine control. The entire coarse range shown can be accessed dynamically during normal PLL operation, providing much more range than needed to meet the 165C dynamic temperature range requirement. Jitter increases slightly in mode 2 due to additional quantization noise of the fast switching capacitor.

Fig. 10 shows the performance summary and comparison to previously published papers and demonstrates that this work has the smallest area for a 32Gbaud-capable long reach PHY. This paper presents the first SerDes design to demonstrate a PCI Express Gen 5 link.

Figure 10: Performance summary and comparison.

ACKNOWLEDGMENT
The authors thank the Intel Mixed-Signal IP Group in Toronto, Hillsboro, and Santa Clara as well as Dror Lazar for his valuable guidance.

REFERENCES
[1] "PCI Express Base Specification," Available: http://www.pcisig.com
[2] C. Auth et al., "A 10nm High Performance and Low-Power CMOS Technology Featuring 3rd Generation FinFET Transistors, Self-Aligned Quad Patterning, Contact over Active Gate and Cobalt Local Interconnects," International Electron Devices Meeting (IEDM), 2017.
[3] D. Schinkel, E. Mensink, E. Klumperink, E. van Tuijl, B. Nauta, "A Double-Tail Latch-Type Voltage Sense Amplifier with 18ps Setup-Hold Time," ISSCC Dig. of Tech. Papers, pp.314-315, Feb. 2007.
[4] F. Spagna, "Clock and Data Recovery Systems," IEEE Custom Integrated Circuits Conference (CICC), 2018.
[5] T. Shibasaki et al., "A 56Gb/s NRZ-Electrical 247mW/lane Serial-Link Transceiver in 28nm," ISSCC Dig. of Tech. Papers, pp.64-65, Feb. 2016.
[6] T. Norimatsu et al., "A 25Gb/s Multistandard Serial Link Transceiver for 50dB-loss Copper Cable in 28nm CMOS," ISSCC Dig. of Tech. Papers, pp.60-61, Feb. 2016.
[7] P. Upadhyaya et al., "A Fully-Adaptive Wideband 0.5-32.75Gb/s FPGA Transceiver in 16nm FinFET CMOS Technology," Symposium on VLSI Circuits Dig. of Tech. Papers, 2016.
[8] M.S. Jalali et al., "A 4-lane 1.25-to-28.05Gb/s Multi-Standard 6pJ/b 40dB Transceiver in 14nm FinFET with Independent TX/RX Rate Support," ISSCC Dig. of Tech. Papers, pp.106-107, Feb. 2018.
[9] M.-A. LaCroix et al., "A 60Gb/s PAM-4 ADC-DSP Transceiver in 7nm CMOS with SNR-Based Adaptive Power Scaling Achieving 6.9pJ/b at 32dB Loss," ISSCC Dig. of Tech. Papers, pp.114-115, Feb. 2019.
