Kurd Et Al 2015 Haswell
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 50, NO. 1, JANUARY 2015
Abstract—We describe the 4th Generation Intel® Core™ processor family (codenamed "Haswell") implemented on Intel® 22 nm technology and intended to support form factors from desktops to fan-less Ultrabooks™. Performance enhancements include a 102 GB/sec L4 eDRAM cache, hardware support for transactional synchronization, and new FMA instructions that double FP operations per clock. Power improvements include Fully-Integrated Voltage Regulators (50% or more battery life extension), new low-power states (95% standby power savings), optimized MCP I/O system (1.0–1.22 pJ/b), and improved DDR I/O circuits (40% active and 100x idle power savings). Other improvements include full-platform optimization via integrated display I/O interfaces.

Index Terms—IA, on-package I/O, DDR, eDRAM, voltage regulator, FIVR, power management, FMA, Intel® TSX.

Manuscript received May 18, 2014; revised September 09, 2014; accepted October 31, 2014. Date of publication November 25, 2014; date of current version December 24, 2014. This paper was approved by Guest Editor Stephen Kosonocky.
The authors are with Intel Corporation, Hillsboro, OR 97124 USA.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/JSSC.2014.2368126

I. INTRODUCTION

User experience drives not only platform velocity but also radical market-segment diversification. The Haswell family (Fig. 1) answers this imperative via a set of common building blocks (Platform Controller Hubs (PCHs), memory, CPU cores, media engines, graphics, etc.) that span fanless Ultrabooks™ and 2-in-1 devices to traditional and All-In-One desktop systems [1]. Fig. 2 shows example die photos for the "4+2" and "2+3" (cores + graphics level) versions, at 1.4B xtors/177 mm², and 1.3B xtors/181 mm² respectively. The 4-core halo version with Iris Pro graphics weighs in at 260 mm². All variants use Intel's 22 nm Tri-gate process [2], with segment-specific speed/leakage targeting, two additional metal layers versus prior generation Ivy Bridge [3], and added high-density metal-insulator-metal (MIM) capacitors. Further per-die process optimization is made possible by Haswell's Multi-Chip Packaging (MCP) strategy.

Depending on configuration, core frequencies range from 1.0–3.8 GHz. Architectural upgrades detailed below yield a baseline 13% specint* and 40% specviewperf* generational performance boost. The graphics halo variant (rivaling entry/mid-level discrete cards) gains further performance via an on-package 128 MB eDRAM cache. Meanwhile, the Ultrabook variants save 95% standby power via new low-latency deep sleep states (250 µs "C9" and 500 µs–5 ms "C10") and PCH modifications (33% active and 94% standby power savings). With the scalable Fully-Integrated Voltage Regulator (FIVR) system, each variant requires a single platform VR (instead of as many as five) in addition to the DDR VR.

The rest of the paper is organized as follows. Sections II and III cover key core architectural additions, Sections IV–VI address I/O, and Sections VII and VIII cover low-power states and FIVR. A summary is given in Section IX.

II. INTEL® TRANSACTIONAL SYNCHRONIZATION EXTENSIONS

Intel® Transactional Synchronization Extensions (Intel® TSX) applies the paradigm of speculative execution to multi-threaded mutual exclusion. Intel TSX allows multiple threads to execute, in parallel, critical sections protected by the same lock using a technique known as "lock elision". Here, the critical section is executed transactionally but the lock is only read and not acquired. This allows other threads to not serialize and exposes parallelism difficult to extract at compile time. The processor can thus determine dynamically if it needs to serialize critical section execution.

Intel TSX provides programmers with two software interfaces to specify transactional regions. The Hardware Lock Elision (HLE) interface is a pair of legacy-compatible prefixes called XACQUIRE and XRELEASE. The Restricted Transactional Memory (RTM) interface is a pair of new instructions called XBEGIN and XEND. The Intel® 64 Architecture Software Developer Manual provides a detailed specification for these new instructions and the Intel 64 Architecture Optimization Reference Manual presents guidelines for program optimization.

The Haswell implementation uses the first level 32 KB data cache (L1) to track the memory addresses accessed (both read and written) during a transactional execution and to buffer any transactional updates performed to memory. The implementation makes these updates visible to other threads only on a successful commit. Further, the hardware implementation ensures the updates appear to occur instantaneously when viewed from other threads, and does so without requiring any explicit cross-thread coordination. The implementation also uses the existing cache coherence protocol to detect conflicting accesses from other threads. Since hardware is finite, transactional regions that access excessive state can exceed hardware buffering. Evicting a transactionally written line from the data cache will cause a transactional abort. However, evicting a transactionally read line does not immediately cause
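The lock-elision behavior described in Section II can be illustrated with a small software model. The Python sketch below is not Intel TSX and uses none of the XACQUIRE/XRELEASE or XBEGIN/XEND instructions; it is a toy, single-threaded model under simplifying assumptions. The names `Memory`, `Transaction`, `run_critical_section`, and `LOCK` are hypothetical. Conflicts are checked once at commit using per-address version counters, whereas the hardware tracks lines in the L1 data cache and detects conflicts through the cache coherence protocol.

```python
LOCK = "lock"  # hypothetical address of the lock word; 0 means free


class Memory:
    """Shared memory model: a value and a version counter per address."""

    def __init__(self):
        self.values = {}
        self.versions = {}

    def read(self, addr):
        return self.values.get(addr, 0), self.versions.get(addr, 0)

    def write(self, addr, value):
        self.values[addr] = value
        self.versions[addr] = self.versions.get(addr, 0) + 1


class Transaction:
    """Speculative critical section: tracks accesses and buffers writes."""

    def __init__(self, mem):
        self.mem = mem
        self.tracked = {}    # addr -> version seen at first access (read or write)
        self.write_buf = {}  # addr -> buffered value, invisible until commit

    def _track(self, addr):
        self.tracked.setdefault(addr, self.mem.read(addr)[1])

    def load(self, addr):
        self._track(addr)
        if addr in self.write_buf:
            return self.write_buf[addr]
        return self.mem.read(addr)[0]

    def store(self, addr, value):
        self._track(addr)
        self.write_buf[addr] = value

    def commit(self):
        # Abort if any tracked address was modified since it was first accessed.
        for addr, seen in self.tracked.items():
            if self.mem.read(addr)[1] != seen:
                return False
        # Publish all buffered writes only on a successful commit.
        for addr, value in self.write_buf.items():
            self.mem.write(addr, value)
        return True


def run_critical_section(mem, body, interfere=None):
    """Try the section with the lock elided; serialize under the lock on abort."""
    tx = Transaction(mem)
    if tx.load(LOCK) == 0:       # the lock is only read, not acquired
        body(tx.load, tx.store)
        if interfere is not None:
            interfere()          # models another thread running concurrently
        if tx.commit():
            return "elided"
    # Fallback path: acquire the lock for real and rerun non-speculatively.
    mem.write(LOCK, 1)
    body(lambda addr: mem.read(addr)[0], mem.write)
    mem.write(LOCK, 0)
    return "serialized"


def increment_a(load, store):
    store("a", load("a") + 1)


mem = Memory()
r1 = run_critical_section(mem, increment_a)
r2 = run_critical_section(mem, increment_a, interfere=lambda: mem.write("b", 7))
r3 = run_critical_section(mem, increment_a, interfere=lambda: mem.write("a", 100))
print(r1, r2, r3, mem.read("a")[0])  # prints: elided elided serialized 101
```

In this model the second run commits because the interfering write touches a different address, while the third run aborts and falls back to the lock because the interfering write hits an address the transaction had accessed. A write to the lock word itself would abort the transaction in the same way, since the lock is part of the tracked set.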
0018-9200 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
50 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 50, NO. 1, JANUARY 2015
and control logic is shared for each 64 bit block (two single precision floating-point operations or one double precision operation). Precision-specific logic is clock gated to save power. With these modifications to the multiplier micro-architecture, the unrounded multiply is completed in two cycles and is able to feed into the 3-cycle FADD/FRND portion of the FMA as shown in Fig. 3.

IV. EMBEDDED-DRAM AND ON-PACKAGE I/O

Haswell's graphics halo version bolsters performance with 128 MB of on-package L4 cache, fabricated with 0.029 µm² bitcells on Intel's 22 nm Tri-Gate CMOS eDRAM process. Four of the eight 16 MB macros operate in parallel to deliver a cache line read or write each 1.6 GHz clock cycle. Retention time is 100 µs at 93 °C, 1.0 V, and the 77 mm² die (including charge pumps and associated supply regulators) achieves a density of 17.5 Mb/mm² [5].

A new scalable, high-performance On-Package I/O (OPIO) interconnect provides communications between CPU and eDRAM (102.4 GB/s@1 W, or 1.22 pJ/b), or CPU and PCH (4 GB/s@32 mW, or 1.0 pJ/b). The well-matched 1.5 mm on-package interconnect allows OPIO to forgo receiver termination, equalization and per-pin de-skew; and the lack of package pin exposure reduces ESD protection (and hence pad capacitance) by 3X. For the required bandwidth, single-ended OPIO uses 1/7th the silicon area compared to DDR3. OPIO links consist of multiple 8 or 16 bit data-clusters with forwarded clock, valid, ECC and low speed sideband signals. Each lane's programmable-strength push-pull CMOS driver transmits data on both clock edges, which is received by an adjustable P:N CMOS inverter driving a 1:4 demultiplexer with differentially-clocked latches (Fig. 4). A digital DLL provides the sampling clocks, and all logic after the demux is standard digital. Fig. 4 also illustrates the eDRAM OPIO configuration, consisting of four 16-bit data-clusters each way at 3.2 GHz along with a request cluster. The PCH OPIO configuration uses a single 8-bit data cluster each way at 1 GHz without a request cluster.

V. DISPLAY I/O INTEGRATION

The Haswell platform supports display via HDMI® 1.4 (3 GT/s; on-platform level shifters needed), DisplayPort® 1.2 (HBR2@5.4 GT/s), Digital Visual Interface (DVI), embedded Display Port (eDP, up to 5.4 GT/s), and Flexible Display Interface (FDI), all of which are driven by the CPU. Legacy VGA is also supported by bridging FDI through the PCH; the use of FDI restricts eDP to x2 mode as per Table I and Fig. 5. Due to high-voltage incompatibilities with the CPU's process tuning, sideband signals for DP and HDMI specifically are also routed through the PCH using the "DMI" interface. All other sideband and data originate directly on the CPU's five reconfigurable "DDI" physical-layer ports. Haswell's Display
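The OPIO bandwidth and energy figures quoted in Section IV can be cross-checked with a few lines of arithmetic. The Python sketch below rests on assumptions that are not stated explicitly in the text: the quoted bandwidths are taken as the sum of both directions, the 3.2 GHz and 1 GHz figures are taken as clock rates with data sent on both edges, a cache line is taken as 64 bytes, and 1 GB is taken as 10^9 bytes. The function names are hypothetical.

```python
def link_bandwidth_gbytes(clusters_each_way, bits_per_cluster, clock_ghz):
    """Aggregate bandwidth in GB/s, summed over both directions."""
    transfers_per_ns = 2 * clock_ghz                  # data on both clock edges
    bytes_per_transfer = clusters_each_way * bits_per_cluster / 8
    one_way = bytes_per_transfer * transfers_per_ns   # GB/s in one direction
    return 2 * one_way                                # both directions


def energy_per_bit_pj(power_watts, bandwidth_gbytes):
    """Energy per transferred bit in picojoules."""
    bits_per_second = bandwidth_gbytes * 1e9 * 8
    return power_watts / bits_per_second * 1e12


# eDRAM link: four 16-bit data clusters each way at 3.2 GHz, 1 W.
edram_bw = link_bandwidth_gbytes(4, 16, 3.2)     # about 102.4 GB/s
edram_pj = energy_per_bit_pj(1.0, edram_bw)      # about 1.22 pJ/b

# PCH link: one 8-bit data cluster each way at 1 GHz, 32 mW.
pch_bw = link_bandwidth_gbytes(1, 8, 1.0)        # about 4 GB/s
pch_pj = energy_per_bit_pj(0.032, pch_bw)        # about 1.0 pJ/b

# eDRAM array side: one 64-byte cache line per 1.6 GHz clock cycle.
array_bw = 64 * 1.6                              # about 102.4 GB/s

print(f"eDRAM OPIO: {edram_bw:.1f} GB/s, {edram_pj:.2f} pJ/b")
print(f"PCH OPIO:   {pch_bw:.1f} GB/s, {pch_pj:.2f} pJ/b")
print(f"eDRAM array: {array_bw:.1f} GB/s")
```

Under these assumptions the link configuration, the array access rate, and the quoted power all agree with the 102.4 GB/s, 4 GB/s, 1.22 pJ/b, and 1.0 pJ/b figures in the text.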
TABLE I
SUPPORTED FREQUENCIES FOR SANDYBRIDGE, IVYBRIDGE, AND HASWELL DISPLAY PORTS
output filter inductors to be implemented using only the bottom metal layers of a standard flip-chip package.

Each FIVR domain is controlled by a FIVR Control Module (FCM). The FCM contains the circuitry for generating the Pulse Width Modulation (PWM) signals using double-edge modulation, as indicated in Fig. 13 by the dashed box. Separate current sensor circuitry shown in the figure is used for phase current balancing, telemetry, and over-current protection. The resulting digital PWM signals are distributed from the FCM to individual bridges. The PWM frequency, PWM gain, phase activation, and the angle of each phase are all programmable in fine increments to enable optimal efficiency and minimum voltage ripple across a span of different operating points. Spread spectrum is used for Electromagnetic Interference (EMI) and Radio Frequency Interference (RFI) control. The FCM also contains the feedback control circuitry (compensator). A high-precision 9-bit DAC generates a reference voltage for a programmable, high bandwidth analog fully differential type-3 compensator. Sense lines feed the output voltage back to the compensator. The endpoint of these sense lines is strategically placed to achieve minimum DC error and optimal transient response at an important circuit location in the domain. The compensator is programmed individually for each voltage domain based on its output filter, and can be reprogrammed while the domain is active to maintain optimal transient response as phase shedding occurs.

The Triangular Wave Synthesizer is a sub-block of the PWM. It includes an oscillator and a switch matrix unit. It receives VH and VL as reference voltages from the VH-VL reference generator and generates a triangular waveform (Vx,i) between VH (high threshold) and VL (low threshold). Fig. 14 shows a behavioral description of the Triangular Wave Synthesizer. Delay cells of the oscillator drive switches that connect the switch matrix resistance to either VH or VL based on polarity and generate a triangular waveform (Vx,i). The oscillator frequency, or the delay in delay-line mode, can be controlled by current-starving delay cells (Udi) through coarse and fine frequency control bits.

IX. SUMMARY

The Haswell family of products scales from fan-less devices and Ultrabooks™ to high-end desktops and servers. Haswell improves battery life 50% or more by adding deeper sleep states with fast entry-exit times, reducing standby power by 95%. Battery life is also improved by active power-performance optimizations such as independent voltage-frequency domains with individually controlled voltage-frequency points, allowing the power control unit to dynamically allocate the power budget among the domains to maximize performance. The new family of products enables sleeker form factors, improves cost, and lowers power by integrating the voltage regulators using air core inductors and the PCH using low-power on-package I/O (OPIO). Performance is improved through the addition of eDRAM, providing over 100 GB/s of memory bandwidth, and the Fused Multiply-Add instruction, doubling floating point operations per clock. Memory IO added support for LPDDR and reduced active/standby power by 40%/100x respectively through circuit optimizations such as improved high voltage level shifters, weak lock DLL and on-die power gates.

REFERENCES

[1] N. Kurd et al., "Haswell: A family of IA 22 nm processors," in IEEE ISSCC Dig. Tech. Papers, 2014, pp. 112–113.
[2] C. Auth et al., "A 22 nm high performance and low-power CMOS technology featuring fully-depleted tri-gate transistors, self-aligned contacts and high density MIM capacitors," in IEEE Symp. VLSI Tech., 2012, pp. 131–132.
[3] S. Damaraju et al., "A 22 nm IA multi-CPU and GPU system-on-chip," in IEEE ISSCC Dig. Tech. Papers, 2012, pp. 56–57.
[4] T. Karnagel et al., "Improving in-memory database index performance with Intel® transactional synchronization extensions," in Proc. 20th IEEE Int. Symp. High-Performance Computer Architecture, Feb. 2014, pp. 476–487.
KURD et al.: HASWELL: A FAMILY OF IA 22 nm PROCESSORS 57
[5] F. Hamzaoglu et al., "1 Gb 2 GHz embedded DRAM in 22 nm tri-gate CMOS technology," in IEEE ISSCC Dig. Tech. Papers, 2014, pp. 230–231.
[6] N. Kurd et al., "Westmere: A family of 32 nm IA processors," in IEEE ISSCC Dig. Tech. Papers, 2010, pp. 96–97.
[7] P. Hammarlund et al., "Haswell: The fourth-generation Intel core processor," IEEE Micro, vol. 34, no. 2, pp. 6–20, 2014.

Nasser Kurd (S'93–M'95–SM'10) received the M.S.E.E. degree from the University of Texas, San Antonio, TX, USA, in 1995.
He joined Intel Corporation, Hillsboro, OR, USA, in 1996, where he is currently a Senior Principal Engineer in the circuit technology group leading next-generation clocking technologies. He has been involved in clocking, analog design, and I/O for several microprocessor generations. He was with AMD in Austin, TX, USA, from 1994 to 1996.
Mr. Kurd has served on several conference committees, authored several publications, holds 39 granted patents, and has received two Intel Achievement Awards.

Christopher Mozak received the B.S.E.E. degree from Cornell University, Ithaca, NY, USA, in 1994 and the M.S.E.E. from Stanford University, Stanford, CA, USA, in 2000.
He joined Intel Corporation, Hillsboro, OR, USA, in 1998, where he is currently a Senior Principal Engineer, initially designing internal cache and register files for Itanium and Xeon micro-processors. Since 2004, his focus has been on high-speed IO design, working on several generations of FSB, QPI, and DDR products. His interests span from architecture to circuit design to post-silicon validation, including high-speed IO and clocking designs, IO training and low-power analog design.

Brent Boswell received the B.S.E.E. and M.S.E.E. degrees from Brigham Young University, Provo, UT, USA, in 1989 and 1990, respectively.
He is currently a Principal Engineer in the Device Development Group and has 23 years experience in microprocessor design at Intel Corporation, Hillsboro, OR, USA. He is an expert in floating-point uarch, RTL coding (FMA, FMUL, FADD, FDIV).

Ashish Khanna received the M.S. degree in electrical engineering and computer engineering (EE/CE) from the Rochester Institute of Technology (RIT), Rochester, NY, USA.
He is an Analog IC Designer in the Fully Integrated Voltage Regulator (FIVR) team. He has previous experience in sensor based analog front end designs (switch cap based C/V convertor), ADCs (SAR/RSD/Sigma-Delta based), bandgaps, linear regulators, current reference generators, etc.

Timothy M. Wilson received the M.S.E.E. degree from the University of Illinois, Champaign, IL, USA, in 2002.
He now works for Intel in the Devices Development Group in Hillsboro, OR. His areas of expertise include analog/IO design and debug and analog process/design interactions.

Nasirul Chowdhury received the B.S.E.E. degree from Bangladesh University of Engineering and Technology, Bangladesh, in 1995, and the M.S.E.E. degree from the Ohio State University, Columbus, OH, USA, in 1998.
He joined Intel Corporation, Hillsboro, OR, USA, in 1998, where he is currently the technical lead of high-speed serial IO responsible for PCIe3, DMI, and display interfaces for client microprocessors in 22 nm and 14 nm processes. Previously, he was involved in analog/IO design for several microprocessor generations including Front Side Bus (FSB), Intel Quick Path Interconnect (QPI), and DDR interfaces. His interests include PHY architecture, IO top-level planning and optimization, circuit design, and post-Si validation.

Rajesh Kumar received the Master degree from the California Institute of Technology, Pasadena, CA, USA, in 1992 and the Bachelor degree from the Indian Institute of Technology in 1991, both in electrical engineering.
He joined Intel Corporation, Hillsboro, OR, USA, in 1992, where he is currently a Senior Fellow. He leads circuit and power technology development for IA-32 microprocessors in the Product Development Group (PDG) and, in that role, headed up the technology development of several microprocessor families. He is also PDG's interface to process technology for microprocessors and manages the Circuit Technology Group.