Bichan - Intel - CICC 2020 - A 32Gb/s NRZ 37dB SerDes in 10nm CMOS to Support PCI Express Gen 5 Protocol
I. INTRODUCTION
Figure 1: Receiver block diagram.
Growing data center bandwidth demand increases the need for
fast wireline transmission. One of the key protocols used to
transfer data between SoCs is PCI Express. The recently
published PCI Express Gen 5 standard allows data transmission
at up to 32 Gb/s [1]. The practical design requirements for a PCI
Express Gen 5 PHY IP go beyond protocol compliance. Other
considerations include: small IP area, die-edge dimension, and
power to allow large numbers of data links in a single SoC, large
dynamic temperature range for extreme environmental
conditions, advanced power management to reduce power when
the link is not in use, low data path latency to improve overall
system performance, and robust performance with internal
adaptive equalization. This paper presents the first SerDes
design to demonstrate a PCI Express Gen 5 link with area of 0.33mm²
per lane, die edge usage per lane of 285µm, dynamic junction
temperature range from -40°C to 125°C, energy efficiency of
11.4pJ/bit including PLL and clocking, power management
including power gating for all analog blocks, continuous data
rate support between 1-32Gb/s, and supporting channel
topologies with insertion loss up to 37dB at 16GHz with BER <
1e-12 in 10nm process technology [2].

II. RECEIVER
The receiver analog front end is shown in Fig. 1. It consists of
an input stage with pi-coil and 100Ω differential termination
supporting a common-mode range of 0-0.7V, followed by an
AC-coupled passive attenuator providing programmable 2-7dB
attenuation as well as low-frequency equalization. The middle
node of the pi-coil connects to ESD diodes sufficient to meet the
250V CDM and 1kV HBM requirements. The attenuator is
followed by a two-stage continuous time linear equalizer
(CTLE). The CTLE stages are CML amplifiers with resistive
loads and passive inductor peaking. They have adjustable DC
gain and zero frequency with fixed Nyquist gain. These are
followed by a variable gain amplifier (VGA) that provides
broadband gain control using a gm-subtraction structure. This
structure gives flat gain or attenuation in the range +6dB to
-10dB. Finally, a set of four parallel broadband amplifiers with
fixed 6dB gain up to 30GHz drive the decision feedback
equalizer (DFE) linear summing nodes. The front end has total
Nyquist gain of 16dB and 5mV-rms output-referred noise.

The CTLE is followed by an analog DFE with clock and data
recovery (CDR) using a half-rate Alexander phase detector. The
equalization on the data samples is provided by a 12-tap DFE
while the equalization on the edge samples is provided by a
3-tap DFE. Independent DFE tap weights are allowed for each of
the even and odd summing nodes. The data path feeding the
feedback taps uses a quarter-rate implementation for power
efficiency. This implementation is functionally equivalent to a
half-rate implementation but uses 8GHz quadrature clocks to
save power. The slicers use a double-tail topology [3] with
3mV-rms input-referred noise. The data summing nodes include an
auxiliary slicer sampling in-phase with the data slicer at the +1
or -1 levels to enable LMS adaptation. The auxiliary slicer also
enables vertical eye margining and background slicer offset
correction. This correction is required to maintain continuous
error-free operation over the full 165°C temperature range as the
slicer thresholds can shift after the initial offset calibration. Two
speculative edge slicers are used in each edge summing node to
shift the sampling point and better equalize the first precursor
ISI term [4]. The remaining edge taps are used to reduce
bang-bang jitter by reducing ISI at the edge crossings.

III. TRANSMITTER
The TX shown in Fig. 2 is a voltage-mode design where the data
path operates from a 1V supply and the clock path uses 1V
regulated down from 1.8V. The final 2:1 mux is set back six
stages from the final output in order to reduce power at the
expense of additional supply noise sensitivity. Impedance
control is achieved by disabling some of the 36 units and
achieves granularity better than 5%. Equalization is supported
by a 5-tap FIR filter where any four consecutive taps can be
selected at once. The TX supports receiver detection, duty cycle
correction, and a serializer with parallel word widths of
8/10/16/20/32/40 bits. The transmitter is able to send a 32Gb/s
loopback signal to the receiver to enable debugging and high
volume test with coverage over most of the analog circuitry.

Figure 2: Transmitter block diagram.

The clocking scheme for the transmitter is shown in Fig. 3 and
consists of three PLLs which are shared across four data lanes
with a half-rate clock distribution network. A complementary
LC-DCO generates the 16GHz clock needed to support 32Gb/s
operation. A single-ended three-stage CMOS ring oscillator
generates frequencies from 4-16GHz to support all data rates

Figure 3: Synthesizer block diagram.
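
To make the receiver's phase detection in Section II more concrete, the following is a minimal behavioral sketch of Alexander (bang-bang) phase detector logic. It is not the circuit described in the paper: the function names, the +1/-1 sign convention, and the simple vote are illustrative assumptions. In a half-rate design the even and odd data samples come from opposite clock phases, but the per-transition decision is assumed to follow the same rule.

```python
def alexander_pd(d_prev: int, edge: int, d_next: int) -> int:
    """Bang-bang decision for one bit boundary (hypothetical convention).

    Returns +1 if the clock is early, -1 if it is late, and 0 when there
    is no data transition and therefore no timing information.
    """
    if d_prev == d_next:
        return 0
    # The edge sample still matches the previous bit, so the edge clock
    # fired before the data crossing: the clock is early.
    return 1 if edge == d_prev else -1


def vote(data: list[int], edges: list[int]) -> int:
    """Sum the decisions; edges[i] is sampled between data[i] and data[i+1]."""
    return sum(
        alexander_pd(data[i], edges[i], data[i + 1])
        for i in range(len(edges))
    )
```

A positive vote would tell the CDR loop to delay the sampling phase and a negative vote to advance it. Any residual ISI at the crossings corrupts the edge samples, which is consistent with the text's point that edge equalization reduces bang-bang jitter.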
Figure 9: Measured LCDCO coarse frequency range. The entire coarse
frequency range is available dynamically in mode 2.

transmitter compliance measurements for the common reference
clock configuration.

Fig. 9 shows the measured frequency range of the LCDCO with
both extreme values of the fine control. The entire coarse range
shown can be accessed dynamically during normal PLL
operation providing much more range than needed to meet the
165°C dynamic temperature range requirement. Jitter increases
slightly in mode 2 due to additional quantization noise of the fast
switching capacitor.

Fig. 10 shows the performance summary and comparison to
previously published papers and demonstrates that this work has
the smallest area for a 32Gbaud-capable long reach PHY. This
paper presents the first SerDes design to demonstrate a PCI
Express Gen 5 link.

ACKNOWLEDGMENT
The authors thank the Intel Mixed-Signal IP Group in Toronto,
Hillsboro, and Santa Clara as well as Dror Lazar for his
valuable guidance.

REFERENCES
[1] "PCIe Express Base Specification," Available: http://www.pcisig.com
[2] C. Auth et al., "A 10nm High Performance and Low-Power CMOS
    Technology Featuring 3rd Generation FinFET Transistors, Self-Aligned
    Quad Patterning, Contact over Active Gate and Cobalt Local
    Interconnects," International Electron Devices Meeting (IEDM), 2017.
[3] D. Schinkel, E. Mensink, E. Klumperink, E. van Tuijl, B. Nauta, "A
    Double-Tail Latch-Type Voltage Sense Amplifier with 18ps Setup-Hold
    Time," ISSCC Dig. of Tech. Papers, pp. 314-315, Feb. 2007.
[4] F. Spagna, "Clock and Data Recovery systems," IEEE Custom Integrated
    Circuits Conference (CICC), 2018.
[5] T. Shibasaki et al., "A 56Gb/s NRZ-Electrical 247mW/lane Serial-Link
    Transceiver in 28nm," ISSCC Dig. of Tech. Papers, pp. 64-65, Feb. 2016.
[6] T. Norimatsu et al., "A 25Gb/s Multistandard Serial Link Transceiver for
    50dB-loss Copper Cable in 28nm CMOS," ISSCC Dig. of Tech. Papers,
    pp. 60-61, Feb. 2016.
[7] P. Upadhyaya et al., "A Fully-Adaptive Wideband 0.5-32.75Gb/s FPGA
    Transceiver in 16nm FinFET CMOS Technology," Symposium on VLSI
    Circuits Dig. of Tech. Papers, 2016.
[8] M.S. Jalali et al., "A 4-lane 1.25-to-28.05Gb/s Multi-Standard 6pJ/b 40dB
    Transceiver in 14nm FinFET with Independent TX/RX Rate Support,"
    ISSCC Dig. of Tech. Papers, pp. 106-107, Feb. 2018.
[9] M.-A. LaCroix et al., "A 60Gb/s PAM-4 ADC-DSP Transceiver in 7nm
    CMOS with SNR-Based Adaptive Power Scaling Achieving 6.9pJ/b at
    32dB Loss," ISSCC Dig. of Tech. Papers, pp. 114-115, Feb. 2019.