Kurd Et Al 2015 Haswell

IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 50, NO. 1, JANUARY 2015
Haswell: A Family of IA 22 nm Processors

Nasser Kurd, Senior Member, IEEE, Muntaquim Chowdhury, Edward Burton,
Thomas P. Thomas, Senior Member, IEEE, Christopher Mozak, Brent Boswell, Praveen Mosalikanti,
Mark Neidengard, Member, IEEE, Anant Deval, Member, IEEE, Ashish Khanna, Nasirul Chowdhury,
Ravi Rajwar, Member, IEEE, Timothy M. Wilson, and Rajesh Kumar
Abstract—We describe the 4th Generation Intel® Core™ processor family (codenamed “Haswell”) implemented on Intel® 22 nm technology and intended to support form factors from desktops to fan-less Ultrabooks™. Performance enhancements include a 102 GB/sec L4 eDRAM cache, hardware support for transactional synchronization, and new FMA instructions that double FP operations per clock. Power improvements include Fully-Integrated Voltage Regulators (50% battery life extension), new low-power states (95% standby power savings), optimized MCP I/O system (1.0–1.22 pJ/b), and improved DDR I/O circuits (40% active and 100x idle power savings). Other improvements include full-platform optimization via integrated display I/O interfaces.

Index Terms—IA, on-package I/O, DDR, eDRAM, voltage regulator, FIVR, power management, FMA, Intel® TSX.

Manuscript received May 18, 2014; revised September 09, 2014; accepted October 31, 2014. Date of publication November 25, 2014; date of current version December 24, 2014. This paper was approved by Guest Editor Stephen Kosonocky.
The authors are with Intel Corporation, Hillsboro, OR 97124 USA.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/JSSC.2014.2368126
0018-9200 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

USER experience drives not only platform velocity but also radical market-segment diversification. The Haswell family (Fig. 1) answers this imperative via a set of common building blocks (Platform Controller Hubs (PCHs), memory, CPU cores, media engines, graphics etc.) that span fanless Ultrabooks™ and 2-in-1 devices to traditional and All-In-One desktop systems [1]. Fig. 2 shows example die photos for the “ ” and “ ” (cores graphics level) versions, at 1.4B xtors/177 mm², and 1.3B xtors/181 mm² respectively. The 4-core halo version with Iris Pro graphics weighs in at 260 mm². All variants use Intel’s 22 nm Tri-gate process [2], with segment-specific speed/leakage targeting, two additional metal layers versus prior generation Ivy Bridge [3], and added high-density metal-insulator-metal (MIM) capacitors. Further per-die process optimization is made possible by Haswell’s Multi-Chip Packaging (MCP) strategy.

Depending on configuration, core frequencies range from 1.0 to 3.8 GHz. Architectural upgrades detailed below yield a baseline 13% specint* and 40% specviewperf* generational performance boost. The graphics halo variant (rivaling entry/mid-level discrete cards) gains further performance via an on-package 128 MB eDRAM cache. Meanwhile, the Ultrabook variants save 95% standby power via new low-latency deep sleep states (250 µs “C9” and 500 µs–5 ms “C10”) and PCH modifications (33% active and 94% standby power savings). With the scalable Fully-Integrated Voltage Regulator (FIVR) system, each variant requires a single platform VR (instead of as many as five) in addition to the DDR VR.

The rest of the paper is organized as follows. Sections II and III cover key core architectural additions, Sections IV–VI address I/O, and Sections VII and VIII cover low-power states and FIVR. A summary is given in Section IX.

II. INTEL® TRANSACTIONAL SYNCHRONIZATION EXTENSIONS

Intel® Transactional Synchronization Extensions (Intel® TSX) applies the paradigm of speculative execution to multi-threaded mutual exclusion. Intel TSX allows multiple threads to execute, in parallel, critical sections protected by the same lock using a technique known as “lock elision”. Here, the critical section is executed transactionally but the lock is only read and not acquired. This allows other threads to not serialize and exposes parallelism difficult to extract at compile time. The processor can thus determine dynamically if it needs to serialize critical section execution.

Intel TSX provides programmers with two software interfaces to specify transactional regions. The Hardware Lock Elision (HLE) interface is a pair of legacy compatible prefixes called XACQUIRE and XRELEASE. The Restricted Transactional Memory (RTM) interface is a pair of new instructions called XBEGIN and XEND. The Intel® 64 Architecture Software Developer Manual provides a detailed specification for these new instructions and the Intel 64 Architecture Optimization Reference Manual presents guidelines for program optimization.

The Haswell implementation uses the first level 32 KB data cache (L1) to track the memory addresses accessed (both read and written) during a transactional execution and to buffer any transactional updates performed to memory. The implementation makes these updates visible to other threads only on a successful commit. Further, the hardware implementation ensures the updates appear to occur instantaneously when viewed from other threads, and does so without requiring any explicit cross-thread coordination. The implementation also uses the existing cache coherence protocol to detect conflicting accesses from other threads. Since hardware is finite, transactional regions that access excessive state can exceed hardware buffering. Evicting a transactionally written line from the data cache will cause a transactional abort. However, evicting a transactionally read line does not immediately cause
an abort. The hardware moves the line to a secondary structure for subsequent tracking. This allows the implementation to tolerate any cache associativity limitations on read accesses. On a transactional abort, the hardware implementation is responsible for discarding both memory and register updates. The implementation minimizes micro-architecture-specific causes for aborts: branch mispredicts, cache misses, TLB misses, and similar events do not cause aborts [4].

Fig. 1. Several configurations of the Haswell family. (a) Processor and PCH (2-chip). (b) Ultrabook®. (c) Iris™ Pro graphics.
Fig. 2. Haswell quad and dual-die photographs.
Fig. 3. FMA block diagram.

III. FUSED MULTIPLY-ADD (FMA)

Haswell brings a significant floating-point improvement over the prior processor generation, code named Ivy Bridge, by doubling the maximum number of floating-point operations per clock cycle and reducing the latency of dependent multiply-add chains by 38%. This is made possible by replacing a 5-cycle FP multiply with a 5-cycle FP Multiply-Add (FMA), and a 3-cycle FP Add with an FMA, allowing two 256 bit FMAs to begin each clock cycle. (A dependent multiply followed by an add previously took 5 + 3 = 8 cycles and now takes 5, which is consistent with the 38% figure.) Memory bandwidth from the L0 data cache to the execution hardware has also doubled, allowing two 256 bit loads, and one 256 bit store per clock cycle. Additions of the FMA and 32 byte load/store give Haswell a % performance improvement in specFP, and a significant improvement in some FP kernels compared to Ivy Bridge.

The new FMA design made several changes to micro-architecture and implementation to reduce latency, and lower energy per floating-point operation. One change involved supporting single and double precision but not extended precision in the new FMA block. This change reduces the number of bits and the number of partial products in the Wallace tree array, resulting in lower latency and power. Extended precision is still supported for legacy FMUL and FADD in a separate block. Another change involves switching from a radix-8 to a radix-4 multiply to eliminate the delay of the 3x adder. The Wallace tree is then re-organized to make it balanced. Finally, the datapath
and control logic is shared for each 64 bit block (two single precision floating-point operations or one double precision operation). Precision-specific logic is clock gated to save power. With these modifications to the multiplier micro-architecture, the unrounded multiply is completed in two cycles and able to feed into the 3-cycle FADD/FRND portion of the FMA as shown in Fig. 3.

Fig. 4. Haswell CPU-eDRAM MCP package with OPIO.

IV. EMBEDDED-DRAM AND ON-PACKAGE I/O

Haswell’s graphics halo version bolsters performance with 128 MB of on-package L4 cache, fabricated with 0.029 µm² bitcells on Intel’s 22 nm Tri-Gate CMOS eDRAM process. Four of the eight 16 MB macros operate in parallel to deliver a cache line read or write each 1.6 GHz clock cycle. Retention time is 100 µs at 93 °C, 1.0 V, and the 77 mm² die (including charge pumps and associated supply regulators) achieves a density of 17.5 Mb/mm² [5].

A new scalable, high-performance On-Package I/O (OPIO) interconnect provides communications between CPU and eDRAM (102.4 GB/s@1 W, or 1.22 pJ/b), or CPU and PCH (4 GB/s@32 mW, or 1.0 pJ/b). The well-matched 1.5 mm on-package interconnect allows OPIO to forego receiver termination, equalization and per-pin de-skew; and the lack of package pin exposure reduces ESD protection (and hence pad capacitance) by 3X. For the required bandwidth, single-ended OPIO uses 1/7th the silicon area compared to DDR3. OPIO links consist of multiple 8 or 16 bit data-clusters with forwarded clock, valid, ECC and low speed sideband signals. Each lane’s programmable-strength push-pull CMOS driver transmits data on both clock edges, which is received by an adjustable P:N CMOS inverter driving a 1:4 demultiplexer with differentially-clocked latches (Fig. 4). A digital DLL provides the sampling clocks, and all logic after the demux is standard digital. Fig. 4 also illustrates the eDRAM OPIO configuration, consisting of four 16-bit data-clusters each way at 3.2 GHz along with a request cluster. The PCH OPIO configuration uses a single 8-bit data cluster each way at 1 GHz without a request cluster.

V. DISPLAY I/O INTEGRATION

The Haswell platform supports display via HDMI® 1.4 (3 GT/s; on-platform level shifters needed), DisplayPort® 1.2 (HBR2@5.4 GT/s), Digital Visual Interface (DVI), embedded Display Port (eDP, up to 5.4 GT/s), and Flexible Display Interface (FDI), all of which are driven by the CPU. Legacy VGA is also supported by bridging FDI through the PCH; the use of FDI restricts eDP to x2 mode as per Table I and Fig. 5. Due to high-voltage incompatibilities with the CPU’s process tuning, sideband signals for DP and HDMI specifically are also routed through the PCH using the “DMI” interface. All other sideband and data originate directly on the CPU’s five reconfigurable “DDI” physical-layer ports. Haswell’s Display
Engine (DE), which handles data routing, format conversion, and port configuration/initialization, can drive up to three displays in parallel.

TABLE I
SUPPORTED FREQUENCIES FOR SANDYBRIDGE, IVYBRIDGE, AND HASWELL DISPLAY PORTS

Fig. 5. Haswell display port interface.

VI. DDR

The Haswell memory controller supports two channels of DDR3/3L or LPDDR3, using a 1.2 to 1.5 V supply. Similar to prior generations, it is built in a process that only supports “thin-gate” transistors with a maximum junction voltage of 1 V. The transmitter is a CMOS voltage mode driver with a cascoded output to handle the high voltage levels as shown in Fig. 6. The most difficult part of the transmitter, from a performance and power point of view, is the PMOS predriver that must level shift from low to high voltage. Prior designs set the level shifter VOL level using bias circuits, diode or Vt drops, resulting in significant process dependency and tradeoffs between performance and power [6]. Haswell uses a new pulsed level shifter design with an explicit supply, VssHi, setting the VOL output level. This improved both area and power while eliminating the power vs. performance tradeoffs inherent in the prior designs. The pulsed nature of the level shifter allows fast switching speeds using almost full rail signals while maintaining the DC VOL level at VssHi to minimize gate oxide stress. As shown in Fig. 6, NB/PB pulse low to a Vt drop below VssHi and improve the writability of the level shifter. The VssHi voltage is generated using an on-die digital linear regulator with 97% peak-power efficiency. Since the pulsed level shifter tends to under or overshoot the target VssHi voltage across PVT, the VssHi regulator must be able to source or sink current. The linear regulator uses a Class AB output stage for high power efficiency, where the 2 Vt drop is implemented using a resistor DAC.

Fig. 6. DDR cascode output driver and pre-driver level shifter.

The DDR PHY can be configured to support DDR3 or LPDDR3, as well as different platform configurations such as DIMMs in traditional form factors or memory down for Ultrabooks™. Traditional DIMM configurations generally stack the memory channels front to back while memory down prefers side by side organizations. To support this with a single design and minimal board routing layers, the assignment of Data[Channel][Byte] to physical IO buffers is programmable with on die multiplexors in the write, read and control paths.

As shown in Fig. 7, the high voltage DDR power is gated off to reduce leakage power to 1 mW in low-power states. While some prior platforms implemented power gates on the motherboard, these were expensive and only saved power in very deep power states such as S3 (Standby). Integration on die reduced cost and enabled power gating in shallower package C-States with only a 100 ns wakeup time, reducing C-State power by over 100x. The power gate is stacked to handle the high voltage levels and achieve the desired leakage goal. To maintain good ESD performance, the supply clamps are placed on the gated supply (VddqG) near the IO buffers. To keep the clamp from falsely triggering during the relatively fast wake-up events, the clamp is modified to keep the RC timer alive using the ungated supply. Generally speaking, this RC timer must be on the order of 1 µs to ensure the clamp does not turn off during HBM ESD events. When the power is gated, the RC timer is parked a Vt drop above the VddG rail. When VddG is turned on, the ramp time must be slow enough such that the clamp’s keeper circuit can maintain the RC timer at a high level and not trigger the clamp.

The ability to clock gate or power gate is frequently limited by the latency required to wake the circuits back up. To avoid significant performance degradation, higher exit latency generally translates into both lower residency in a given power state and longer wait times before entering that state. Essentially, the granularity of power management is directly proportional to the exit latencies and fine grain gating is only possible with low exit latencies. Furthermore, it is common that protocol and/or performance requirements dictate a fixed exit latency limit for a given power state. If a circuit cannot meet that limit, it must be left
on; this results in lost opportunities and higher average power. In order to achieve these very fast relock times, the DLL enters a weakly locked state instead of being completely powered down. In this state, the analog bias is maintained on a capacitor and periodically refreshed. Since only the bias logic is operating, the majority of the DLL can be disabled and total power reductions of over 70% are demonstrated. The exit time from this weakly locked state is very fast since the analog bias is already very close to the correct value and only requires 10 cycles for the feedback loop to refine it.

The basic DLL block diagram, including the weak lock logic, is shown in Fig. 8. For this particular case, the DLL uses a current starved inverter delay cell topology that provides full swing signals and the delay line can be powered down by simply gating the input clock. When the part wants to enter a low-power mode, it asserts the WeakLockEn signal and this masks the clocks going into the delay line and phase detector. This will completely power down the delay line, phase distribution and PI with only leakage power remaining. The only circuits that continue to consume static power are the charge pump and Nbias generator. For this implementation, these circuits are left running to minimize exit latency. However, it is possible that for a different DLL circuit architecture or exit latency requirement, one or both of these could also be powered down.

Fig. 7. DDR high-voltage power gate and ESD solution.
Fig. 8. Haswell DDR DLL with weak-lock feature.

VII. LOW-POWER STATES

Addressing power-limited market segments required Haswell to tune for maximum frequency at lowered voltages, affecting
device-vs-wire delay balance, circuit sizing, and power partitioning. Dynamic power- and clock-gating now pervades all units, and the stand-alone Ring Bus permits opportunistically disabling whole die areas (e.g., cache snoop servicing may utilize DDR and L3 cache while gating cores, Display, etc.). Different supply partitions (core, System Agent, analog I/Os, etc.) receive different low-power-state voltages according to their respective supply sensitivities, mandating proper manufacturing test content for calibration.

Previous Intel CPUs allowed the OS (including device drivers) to request “C-states” from C0 (executing code and graphics) to C6 (core and graphics gated off). Haswell adds four “package C-states”: C7 (display on, cores/graphics/L3/DDR gated), C8 (System Agent gated, FIVR input supply dimmed), C9 (FIVR input off), and C10 (platform VR supplying FIVR input also off). Perforce, package C8 precludes display I/O, highlighting the value of displays equipped with Panel Refresh. Haswell maintains critical state, a few IOs, and a small amount of logic that detects wake-up events to exit the package states on an “always on” supply that is driven from the platform. C-state exit latency increases and idle power decreases monotonically with increasing C-state index. For C1, the exit latency is less than 1 µs, while for C10 it is 300 µs–3 ms depending on the VR controller actions and voltage ramp rates. Fig. 9 shows IREM emission photographs of activity in C7 and C9. Using these new states, Haswell Ultrabook™ achieves 20x idle power reduction versus Ivy Bridge.

Fig. 9. Ultrabook™ IREM images illustrating new deeper C states: (a) package C7; (b) package C9.

VIII. FULLY-INTEGRATED VOLTAGE REGULATORS (FIVR)

Haswell owes much of its 50% battery life improvement (vs. Ivy Bridge) to FIVRs: 32 A/mm², 140 MHz multi-phase (up to 16) buck converters whose proximity to time-varying on-die loads confers an inherent advantage over platform-level VRs. High switching frequency and phase count improve regulation, reducing the required L and C output filtering to levels serviceable with standard-package-trace inductors and on-package caps (on-die MIM in most cases). This in turn reduces the “inertia” of power-state transitions, promoting higher low-power residency and improved burst-mode behavior. Up to 13 independently-tuned FIVRs are used, depending on the Haswell variant.

All FIVRs share a single input supply rail, tolerating 500 mV of noise and therefore reducing input-side decap requirements by 10x. By redistributing input current as workloads require, FIVR can make available over 2–3X the total
power (even after internal loss) of a platform VR with split core and graphics rails and comparable filtering passive count. This more than doubled graphics resources in small platforms (e.g., 15 W graphics execution unit count increased from 16 to 40), while improved worst-case droop actually boosted same-process-node operating frequency by 30% over Ivy Bridge. As shown in Fig. 10, 7.5 milliohm VR impedance in prior generation is reduced to 0.5 milliohm impedance in the required frequency range. Another advantage of integrating the voltage regulator is that platform builders gain the freedom to parlay this into form-factor/BOM cost reductions, and the ability to retain their single-rail power delivery solution for future feature-enhanced CPUs [7].

Fig. 10. Measured graphics impedance profile for Haswell compared to prior generation.
Fig. 11. FIVR controller.
Fig. 12. FIVR cutaway view.
Fig. 13. FIVR bridge driver and current sensor.

FIVR Controller: Haswell FIVRs are multi-phase, and each FIVR is independently programmable to achieve optimal operation given the requirements of the domain it is powering. The settings are optimized by the Power Control Unit (PCU), which specifies the input voltage, output voltage, number of operating phases, and a variety of other settings to minimize the total power consumption of the die. A simplified block diagram representing the circuitry for a single FIVR domain is shown in Fig. 11. The buck regulator bridges are formed by replacing the power gates in previous products with NMOS and PMOS cascode power switches. The cascode configuration allows the power switches to be implemented with standard 22 nm logic devices while still handling an input voltage of up to 1.8 V. This avoids the cost of extra processing steps for high voltage devices, while achieving excellent switching characteristics. The bridge drivers are controlled through high-voltage level-shifters and support zero-voltage switching (ZVS) and zero-current-switching (ZCS) soft-switching operation. The gates of the cascode devices are connected to the “half-rail” Vin/2. This is also the negative supply of the PMOS bridge driver as well as the positive supply of the NMOS bridge driver. The area occupied by the power switches and drivers is small, so they are distributed across the die, immediately above the connection to their associated package inductor, which minimizes routing losses. This is illustrated in Fig. 12, which shows the location of the package inductors under the die for a four core LGA part. The driver circuitry is interleaved with the power switches in an array, which minimizes parasitics to allow for very high switching frequencies. This also allows the size of the bridge to be easily scaled based on the current requirement and optimization points for each power domain. In order to keep the buck output filter small enough to fit on the die and package, it is necessary for FIVR to switch at a high frequency, 140 MHz in most cases. This allows the buck
output filter inductors to be implemented using only the bottom metal layers of a standard flip-chip package.

Each FIVR domain is controlled by a FIVR Control Module (FCM). The FCM contains the circuitry for generating the Pulse Width Modulation (PWM) signals using double-edge modulation, as indicated in Fig. 13 by the dashed box. Separate current sensor circuitry shown in the figure is used for phase current balancing, telemetry, and over-current protection. The resulting digital PWM signals are distributed from the FCM to individual bridges. The PWM frequency, PWM gain, phase activation, and the angle of each phase are all programmable in fine increments to enable optimal efficiency and minimum voltage ripple across a span of different operating points. Spread spectrum is used for Electromagnetic Interference (EMI) and Radio Frequency Interference (RFI) control. The FCM module also contains the feedback control circuitry (compensator). A high-precision 9-bit DAC generates a reference voltage for a programmable, high bandwidth analog fully differential type-3 compensator. Sense lines feed the output voltage back to the compensator. The endpoint of these sense lines is strategically placed to achieve minimum DC error and optimal transient response at an important circuit location in the domain. The compensator is programmed individually for each voltage domain based on its output filter, and can be reprogrammed while the domain is active to maintain optimal transient response as phase shedding occurs.

The Triangular Wave Synthesizer is a sub-block of the PWM. It includes an oscillator and a switch matrix unit. It receives VH and VL as reference voltages from the VH-VL reference generator and generates a triangular waveform (Vx,i) between VH (high threshold) and VL (low threshold). Fig. 14 shows a behavioral description of the Triangular Wave Synthesizer. Delay cells of the oscillator drive switches that connect the switch matrix resistance to either VH or VL based on polarity and generate a triangular waveform (Vx,i). Oscillator frequency, or delay in delay line mode, can be controlled by current starving delay cells (Udi) through coarse and fine frequency control bits.

Fig. 14. FIVR triangular wave synthesizer.

IX. SUMMARY

The Haswell family of products scales from fan-less systems and Ultrabooks™ to high-end desktops and servers. Haswell improves battery life 50% or more by adding deeper sleep states with fast entry-exit times, reducing standby power by 95%. Battery life is also improved by active power-performance optimizations such as independent voltage-frequency domains with individually controlled voltage-frequency points, allowing the power control unit to dynamically allocate the power budget among the domains to maximize performance. The new family of products enables sleeker form factors, improves cost, and lowers power by integrating the voltage regulators using air core inductors and the PCH using low-power on-package I/O (OPIO). Performance is improved through the addition of eDRAM, providing over 100 GB/s of memory bandwidth, and adding the Fused Multiply-Add instruction, doubling floating-point operations per clock. Memory IO added support for LPDDR and reduced active/standby power by 40%/100x respectively through circuit optimizations such as improved high voltage level shifters, weak lock DLL and on-die power gates.

REFERENCES

[1] N. Kurd et al., “Haswell: A family of IA 22 nm processors,” in IEEE ISSCC Dig. Tech. Papers, 2014, pp. 112–113.
[2] C. Auth et al., “A 22 nm high performance and low-power CMOS technology featuring fully-depleted tri-gate transistors, self-aligned contacts and high density MIM capacitors,” in IEEE Symp. VLSI Tech., 2012, pp. 131–132.
[3] S. Damaraju et al., “A 22 nm IA multi-CPU and GPU system-on-chip,” in IEEE ISSCC Dig. Tech. Papers, 2012, pp. 56–57.
[4] T. Karnagel et al., “Improving in-memory database index performance with Intel® transactional synchronization extensions,” in Proc. 20th IEEE Int. Symp. High-Performance Computer Architecture, Feb. 2014, pp. 476–487.
[5] F. Hamzaoglu et al., “1 Gb 2 GHz embedded DRAM in 22 nm tri-gate CMOS technology,” in IEEE ISSCC Dig. Tech. Papers, 2014, pp. 230–231.
[6] N. Kurd et al., “Westmere: A family of 32 nm IA processors,” in IEEE ISSCC Dig. Tech. Papers, 2010, pp. 96–97.
[7] P. Hammarlund et al., “Haswell: The fourth-generation Intel core processor,” IEEE Micro, vol. 34, no. 2, pp. 6–20, 2014.

Nasser Kurd (S’93–M’95–SM’10) received the M.S.E.E. degree from the University of Texas, San Antonio, TX, USA, in 1995.
He joined Intel Corporation, Hillsboro, OR, USA, in 1996, where he is currently a Senior Principal Engineer in the circuit technology group leading next-generation clocking technologies. He has been involved in clocking, analog design, and I/O for several microprocessor generations. He was with AMD in Austin, TX, USA, from 1994 to 1996.
Mr. Kurd has served on several conference committees, authored several publications, holds 39 granted patents, and has received two Intel Achievement Awards.

Muntaquim Chowdhury received the Bachelors degree in electrical engineering from the Moscow Power Engineering Institute, Moscow, Russia, and the Ph.D. degree in computer engineering from Washington State University, Pullman, WA, USA.
He is a Senior Principal Engineer at Intel Corporation, Hillsboro, OR, USA, and currently Device Development Group Chief Technologist. He has been working at Intel since 1992 where he has led Architecture and Logic Development for multiple generations of Intel CPUs.

Edward Burton received the B.S. degree in physics from Brigham Young University, Provo, UT, USA.
He joined Intel’s super-computer router design group in 1992, and moved to Intel’s Oregon microprocessor design group a year later (assuming power delivery and package ownership for the P6 project that produced the Pentium Pro). He is presently a Senior Principal Engineer working on Intel’s power conversion, power conditioning and packaging technologies. Prior to his work at Intel, he designed DRAM controllers and small logic chips (ECL, Bipolar, BiCMOS and CMOS technologies) for Signetics (1983–1992).
Mr. Burton has received six Intel achievement awards. He has authored several publications and holds 49 granted patents covering a broad range of technology. He serves as Board Chairman for a 501c3 non-profit caring for homeless children in Ethiopia.

Thomas P. Thomas (M’96–SM’06) received the B.Tech. degree in electronics and communication engineering from the Indian Institute of Technology, Madras, India, in 1991, and the M.S. degree in electrical engineering from the Oregon Graduate Institute School of Science and Engineering, Hillsboro, OR, USA, in 1993.
He is a Principal Engineer in the Platform Engineering Group at Intel Corporation, Hillsboro. Prior to joining Intel in 1993, he designed GaAs circuits at TriQuint Semiconductor.

Christopher Mozak received the B.S.E.E. degree from Cornell University, Ithaca, NY, USA, in 1994 and the M.S.E.E. from Stanford University, Stanford, CA, USA, in 2000.
He joined Intel Corporation, Hillsboro, OR, USA, in 1998, where he is currently a Senior Principal Engineer, initially designing internal cache and register files for Itanium and Xeon micro-processors. Since 2004, his focus has been on high-speed IO design, working on several generations of FSB, QPI, and DDR products. His interests span from architecture to circuit design to post-silicon validation, including high-speed IO and clocking designs, IO training and low-power analog design.

Brent Boswell received the B.S.E.E. and M.S.E.E. degrees from Brigham Young University, Provo, UT, USA, in 1989 and 1990, respectively.
He is currently a Principal Engineer in the Device Development Group and has 23 years experience in microprocessor design at Intel Corporation, Hillsboro, OR, USA. He is an expert in floating-point uarch, RTL coding (FMA, FMUL, FADD, FDIV).

Praveen Mosalikanti received the B.E. (Honors) degree in electrical and electronics engineering from the Birla Institute of Technology and Science, Pilani, India, in 1997, and the M.S. degree in electrical and computer engineering from the University of Massachusetts, Amherst, MA, USA, in 1999.
He joined Intel Corporation, Hillsboro, OR, USA, in 1999 and has been working on various analog circuits since 2002. He currently manages the PLL design team. His research interests include PLLs, DLLs, phase interpolators, voltage regulators, and high-speed IOs. He holds six patents.

Mark Neidengard (S’93–M’02) received the B.S. and M.S. degrees in computer science from the California Institute of Technology, Pasadena, CA, USA, in 1997 and 1998, respectively, and the Ph.D. degree in electrical and computer engineering from Cornell University, Ithaca, NY, USA, in 2002.
He is a Technical Lead for Analog Circuit Design, with emphasis on clocking circuits and robust design methods. He has also worked in the areas of concurrent supercomputing, asynchronous circuits, logic optimization, and design support software. He holds four patents and two Trade Secrets, with two other patents pending.

Anant Deval (M’14) received the M.S. degree in electrical engineering from Arizona State University, Tempe, AZ, USA, in 1997 and the B.E. degree in electronics and communication engineering from the University of Delhi, India, in 1995.
He is a Principal Engineer in the Intel Architecture Group. He led power management design on Haswell and is currently involved in delivering the next generation of power delivery and power management solutions on SOCs. He drove the design of the first Power Control Unit (PCU) on Nehalem and is interested in energy-efficient design, clocking strategies, on-die power delivery circuits, and platform power optimizations. He holds several patents and has received two Intel Achievement Awards.

Ashish Khanna received the M.S. degree in electrical engineering and computer engineering (EE/CE) from the Rochester Institute of Technology (RIT), Rochester, NY, USA.
He is an Analog IC Designer in the Fully Integrated Voltage Regulator (FIVR) team. He has previous experience in sensor based analog front end designs (switch cap based C/V convertor), ADCs (SAR/RSD/Sigma-Delta based), bandgaps, linear regulators, current reference generators, etc.

Timothy M. Wilson received the M.S.E.E. degree from the University of Illinois, Champaign, IL, USA, in 2002.
He now works for Intel in the Devices Development Group in Hillsboro, OR. His areas of expertise include analog/IO design and debug and analog process/design interactions.

Nasirul Chowdhury received the B.S.E.E. degree from Bangladesh University of Engineering and Technology, Bangladesh, in 1995, and the M.S.E.E. degree from the Ohio State University, Columbus, OH, USA, in 1998.
He joined Intel Corporation, Hillsboro, OR, USA, in 1998, where he is currently the technical lead of high-speed serial IO responsible for PCIe3, DMI, and display interfaces for client microprocessors in 22 nm and 14 nm processes. Previously, he was involved in analog/IO design for several microprocessor generations including Front Side Bus (FSB), Intel Quick Path Interconnect (QPI), and DDR interfaces. His interests include PHY architecture, IO top-level planning and optimization, circuit design, and post-Si validation.

Rajesh Kumar received the Master degree from the California Institute of Technology, Pasadena, CA, USA, in 1992 and the Bachelor degree from the Indian Institute of Technology in 1991, both in electrical engineering.
He joined Intel Corporation, Hillsboro, OR, USA, in 1992, where he is currently a Senior Fellow. He leads circuit and power technology development for IA-32 microprocessors in the Product Development Group (PDG) and, in that role, headed up the technology development of several microprocessor families. He is also PDG’s interface to process technology for microprocessors and manages the Circuit Technology Group.

Ravi Rajwar (S’91–M’02) received the Ph.D. degree in computer science from the University of Wisconsin-Madison, Madison, WI, USA.
He joined Intel Corporation, Hillsboro, OR, USA, in 2002, where he is a Principal Engineer in the Product Development Architecture Group. He is currently working on various aspects of CPU and SoC architecture and development.
